整合营销服务商

电脑端+手机端+微信端=数据同步管理

免费咨询热线:

word转html小工具

啥要做这个软件

  • 部分编辑器排版不是很方便,但对html代码支持程度较好。

word不是直接能装html了吗?你这个有啥不一样

  • word自带转html代码不是很好,内嵌了很多字体和颜色格式,而这些一般第三方编辑器并不支持。
  • 我这个直接就是比较简单的html代码,比如h1、em、p之类的简单代码,理论上支持大部分第三方html格式。

软件特色

  • 基于PyQt5打包,所以软件比较大,30多M。
  • 基于mammoth模块开发,这个提供了doc,docx转html的核心

软件界面




软件获取方式:私信“word”即可获取

近有一个业务是前端要上传word格式的文稿,然后用户上传完之后,可以用浏览器直接查看该文稿,并且可以在富文本框直接引用该文稿,所以上传word文稿之后,后端保存到db的必须是html格式才行,所以涉及到word格式转html格式。

通过调查,这个word和html的处理,有两种方案,方案1是前端做这个转换。方案2是把word文档上传给后台,后台转换好之后再返回给前端。至于方案1,看到大家的反馈都说很多问题,所以就没采用前端转的方案,最终决定是后端转化为html格式并返回给前段预览,待客户预览的时候,确认格式没问题之后,再把html保存到后台(因为word涉及到的格式太多,比如图片,visio图,表格,图片等等之类的复杂元素,转html的时候,可能会很多格式问题,所以要有个预览的过程)。

对于word中普通的文字,问题倒不大,主要是文本之外的元素的处理,比如图片,视频,表格等。针对我本次的文章,只处理了图片,处理的方式是:后台从word中找出图片(当然引入的jar包已经带了获取word中图片的功能),上传到服务器,拿到绝对路径之后,放入到html里面,这样,返回给前端的html内容,就可以直接预览了。


maven引入相关依赖包如下:

 <poi-scratchpad.version>3.14</poi-scratchpad.version>
        <poi-ooxml.version>3.14</poi-ooxml.version>
        <xdocreport.version>1.0.6</xdocreport.version>
        <poi-ooxml-schemas.version>3.14</poi-ooxml-schemas.version>
        <ooxml-schemas.version>1.3</ooxml-schemas.version>
        <jsoup.version>1.11.3</jsoup.version>


<dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi-scratchpad.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi-ooxml.version}</version>
        </dependency>
        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>xdocreport</artifactId>
            <version>${xdocreport.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>${poi-ooxml-schemas.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>ooxml-schemas</artifactId>
            <version>${ooxml-schemas.version}</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>${jsoup.version}</version>
        </dependency>

word转html,对于word2003和word2007转换方式不一样,因为word2003和word2007的格式不一样,工具类如下:

使用方法如下:

public String uploadSourceNews(MultipartFile file)  {
        String fileName = file.getOriginalFilename();
        String suffixName = fileName.substring(fileName.lastIndexOf("."));
        if (!".doc".equals(suffixName) && !".docx".equals(suffixName)) {
            throw new UploadFileFormatException();
        }
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMM");
        String dateDir = formatter.format(LocalDate.now());
        String directory = imageDir + "/" + dateDir + "/";
        String content = null;
        try {
            InputStream inputStream = file.getInputStream();
            if ("doc".equals(suffixName)) {
                content = wordToHtmlUtil.Word2003ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
            } else {
                content = wordToHtmlUtil.Word2007ToHtml(inputStream, imageBucket, directory, Constants.HTTPS_PREFIX + imageVisitHost);
            }
        } catch (Exception ex) {
            logger.error("word to html exception, detail:", ex);
            return null;
        }
        return content;
    }

关于doc和docx的一些存储格式介绍:

docx 是微软开发的基于 xml 的文字处理文件。docx 文件与 doc 文件不同, 因为 docx 文件将数据存储在单独的压缩文件和文件夹中。早期版本的 microsoft office (早于 office 2007) 不支持 docx 文件, 因为 docx 是基于 xml 的, 早期版本将 doc 文件另存为单个二进制文件。

DOCX is an XML based word processing file developed by Microsoft. DOCX files are different than DOC files as DOCX files store data in separate compressed files and folders. Earlier versions of Microsoft Office (earlier than Office 2007) do not support DOCX files because DOCX is XML based where the earlier versions save DOC file as a single binary file.


可能你会问了,明明是docx结尾的文档,怎么成了xml格式了?

很简单:你随便选择一个docx文件,右键使用压缩工具打开,就能得到一个这样的目录结构:


所以你以为docx是一个完整的文档,其实它只是一个压缩文件。


参考:

https://www.cnblogs.com/ct-csu/p/8178932.html

、前言

实现文档在线预览的方式除了上篇文章 文档在线预览新版(一)通过将文件转成图片实现在线预览功能说的将文档转成图片的实现方式外,还有转成pdf,前端通过pdf.js、pdfobject.js等插件来实现在线预览,以及本文将要说到的将文档转成html的方式来实现在线预览。

以下代码分别提供基于aspose、pdfbox、spire来实现来实现txt、word、pdf、ppt、word等文件转图片的需求。

1、aspose

Aspose 是一家致力于.Net ,Java,SharePoint,JasperReports和SSRS组件的提供商,数十个国家的数千机构都有用过aspose组件,创建、编辑、转换或渲染 Office、OpenOffice、PDF、图像、ZIP、CAD、XPS、EPS、PSD 和更多文件格式。注意aspose是商用组件,未经授权导出文件里面都是是水印(尊重版权,远离破解版)。

需要在项目的pom文件里添加如下依赖

        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-words</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-pdf</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-cells</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-slides</artifactId>
            <version>23.1</version>
        </dependency>

2 、poi + pdfbox

因为aspose和spire虽然好用,但是都是是商用组件,所以这里也提供使用开源库操作的方式的方式。

POI是Apache软件基金会用Java编写的免费开源的跨平台的 Java API,Apache POI提供API给Java程序对Microsoft Office格式档案读和写的功能。

Apache PDFBox是一个开源Java库,支持PDF文档的开发和转换。 使用此库,您可以开发用于创建,转换和操作PDF文档的Java程序。

需要在项目的pom文件里添加如下依赖

		<dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.4</version>
        </dependency>
		<dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-excelant</artifactId>
            <version>5.2.0</version>
        </dependency>

3 spire

spire一款专业的Office编程组件,涵盖了对Word、Excel、PPT、PDF等文件的读写、编辑、查看功能。spire提供免费版本,但是存在只能导出前3页以及只能导出前500行的限制,只要达到其一就会触发限制。需要超出前3页以及只能导出前500行的限制的这需要购买付费版(尊重版权,远离破解版)。这里使用免费版进行演示。

spire在添加pom之前还得先添加maven仓库来源

		<repository>
            <id>com.e-iceblue</id>
            <name>e-iceblue</name>
            <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
        </repository>

接着在项目的pom文件里添加如下依赖

免费版:

		<dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office.free</artifactId>
            <version>5.3.1</version>
        </dependency>

付费版版:

		<dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office</artifactId>
            <version>5.3.1</version>
        </dependency>

二、将文件转换成html字符串

1、将word文件转成html字符串

1.1 使用aspose

public static String wordToHtmlStr(String wordPath) {
        try {
            Document doc = new Document(wordPath); // Address是将要被转化的word文档
            String htmlStr = doc.toString();
            return htmlStr;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

验证结果:

1.2 使用poi

public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        String htmlStr = null;
        String ext = wordPath.substring(wordPath.lastIndexOf("."));
        if (ext.equals(".docx")) {
            htmlStr = word2007ToHtmlStr(wordPath);
        } else if (ext.equals(".doc")){
            htmlStr = word2003ToHtmlStr(wordPath);
        } else {
            throw new RuntimeException("文件格式不正确");
        }
        return htmlStr;
    }

    public String word2007ToHtmlStr(String wordPath) throws IOException {
        // 使用内存输出流
        try(ByteArrayOutputStream out = new ByteArrayOutputStream()){
            word2007ToHtmlOutputStream(wordPath, out);
            return out.toString();
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        ZipSecureFile.setMinInflateRatio(-1.0d);
        InputStream in = Files.newInputStream(Paths.get(wordPath));
        XWPFDocument document = new XWPFDocument(in);
        XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
        // 使用内存输出流
        XHTMLConverter.getInstance().convert(document, out, options);
    }


    private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // Transform document to string
        StringWriter writer = new StringWriter();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
        return writer.toString();
    }

private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        InputStream input = Files.newInputStream(Paths.get(wordPath));
        HWPFDocument wordDocument = new HWPFDocument(input);
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .newDocument());
        wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            System.out.println(pictureType);
            if (PictureType.UNKNOWN.equals(pictureType)) {
                return null;
            }
            BufferedImage bufferedImage = ImgUtil.toImage(content);
            String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
            //  带图片的word,则将图片转为base64编码,保存在一个页面中
            StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img));
            return sb.toString();
        });
        // 解析word文档
        wordToHtmlConverter.processDocument(wordDocument);
        return wordToHtmlConverter.getDocument();
    }

1.3 使用spire

 public String wordToHtmlStr(String wordPath) throws IOException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Document document = new Document();
            document.loadFromFile(wordPath);
            document.saveToFile(outputStream, FileFormat.Html);
            return outputStream.toString();
        }
    }

2、将pdf文件转成html字符串

2.1 使用aspose

public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

验证结果:

2.2 使用 poi + pbfbox

public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

2.3 使用spire

public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            PdfDocument pdf = new PdfDocument();
            pdf.loadFromFile(pdfPath);
            return outputStream.toString();
        }
    }

3、将excel文件转成html字符串

3.1 使用aspose

public static String excelToHtmlStr(String excelPath) throws Exception {
        FileInputStream fileInputStream = new FileInputStream(excelPath);
        Workbook workbook = new XSSFWorkbook(fileInputStream);
        DataFormatter dataFormatter = new DataFormatter();
        FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
        Sheet sheet = workbook.getSheetAt(0);
        StringBuilder htmlStringBuilder = new StringBuilder();
        htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
        htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
        htmlStringBuilder.append("</head><body><table>");
        for (Row row : sheet) {
            htmlStringBuilder.append("<tr>");
            for (Cell cell : row) {
                CellType cellType = cell.getCellType();
                if (cellType == CellType.FORMULA) {
                    formulaEvaluator.evaluateFormulaCell(cell);
                    cellType = cell.getCachedFormulaResultType();
                }
                String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
            }
            htmlStringBuilder.append("</tr>");
        }
        htmlStringBuilder.append("</table></body></html>");
        return htmlStringBuilder.toString();
    }

返回的html字符串:

<html><head><title>Excel to HTML using Java and POI library</title><style>table, th, td { border: 1px solid black; }</style></head><body><table><tr><td>序号</td><td>姓名</td><td>性别</td><td>联系方式</td><td>地址</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr></table></body></html>

3.2 使用poi + pdfbox

public String excelToHtmlStr(String excelPath) throws Exception {
        FileInputStream fileInputStream = new FileInputStream(excelPath);
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

3.3 使用spire

public String excelToHtmlStr(String excelPath) throws Exception {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Workbook workbook = new Workbook();
            workbook.loadFromFile(excelPath);
            workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
            return outputStream.toString();
        }
    }

三、将文件转换成html,并生成html文件

有时我们是需要的不仅仅返回html字符串,而是需要生成一个html文件这时应该怎么做呢?一个改动量小的做法就是使用org.apache.commons.io包下的FileUtils工具类写入目标地址:

FileUtils类将html字符串生成html文件示例:

首先需要引入pom:

		<dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.8.0</version>
        </dependency>

相关代码:

String htmlStr = FileConvertUtil.pdfToHtmlStr("D:\\书籍\\电子书\\小说\\历史小说\\最后的可汗.doc");
FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");

除此之外,还可以对上面的代码进行一些调整,已实现生成html文件,代码调整如下:

1、将word文件转换成html文件

word原文件效果:

1.1 使用aspose

public static void wordToHtml(String wordPath, String htmlPath) {
        try {
            File sourceFile = new File(wordPath);
            String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
            File file = new File(path); // 新建一个空白pdf文档
            FileOutputStream os = new FileOutputStream(file);
            Document doc = new Document(wordPath); // Address是将要被转化的word文档
            HtmlSaveOptions options = new HtmlSaveOptions();
            options.setExportImagesAsBase64(true);
            options.setExportRelativeFontSize(true);
            doc.save(os, options);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

转换成html的效果:

1.2 使用poi + pdfbox

public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
        String ext = wordPath.substring(wordPath.lastIndexOf("."));
        if (ext.equals(".docx")) {
            word2007ToHtml(wordPath, htmlPath);
        } else if (ext.equals(".doc")){
            word2003ToHtml(wordPath, htmlPath);
        } else {
            throw new RuntimeException("文件格式不正确");
        }
    }

    public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        //try(OutputStream out = Files.newOutputStream(Paths.get(path))){
        try(FileOutputStream out = new FileOutputStream(htmlPath)){
            word2007ToHtmlOutputStream(wordPath, out);
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        ZipSecureFile.setMinInflateRatio(-1.0d);
        InputStream in = Files.newInputStream(Paths.get(wordPath));
        XWPFDocument document = new XWPFDocument(in);
        XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
        // 使用内存输出流
        XHTMLConverter.getInstance().convert(document, out, options);
    }

    public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // 生成html文件地址

        try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(outStream);
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer serializer = factory.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
        }
    }

    private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        InputStream input = Files.newInputStream(Paths.get(wordPath));
        HWPFDocument wordDocument = new HWPFDocument(input);
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .newDocument());
        wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            System.out.println(pictureType);
            if (PictureType.UNKNOWN.equals(pictureType)) {
                return null;
            }
            BufferedImage bufferedImage = ImgUtil.toImage(content);
            String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
            //  带图片的word,则将图片转为base64编码,保存在一个页面中
            StringBuilder sb = (new StringBuilder(base64Img.length() + "data:;base64,".length()).append("data:;base64,").append(base64Img));
            return sb.toString();
        });
        // 解析word文档
        wordToHtmlConverter.processDocument(wordDocument);
        return wordToHtmlConverter.getDocument();
    }

转换成html的效果:

1.3 使用spire

public void wordToHtml(String wordPath, String htmlPath) {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
        Document document = new Document();
        document.loadFromFile(wordPath);
        document.saveToFile(htmlPath, FileFormat.Html);
    }

转换成html的效果:

因为使用的是免费版,存在页数和字数限制,需要完整功能的的可以选择付费版本。PS:这回76页的文档居然转成功了前50页。

2、将pdf文件转换成html文件

图片版pdf原文件效果:

文字版pdf原文件效果:

2.1 使用aspose

public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        File file = new File(pdfPath);
        String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

图片版PDF文件验证结果:

文字版PDF文件验证结果:

2.2 使用poi + pdfbox

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

图片版PDF文件验证结果:

文字版PDF原文件效果:

2.3 使用spire

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(pdfPath);
        pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
    }

图片版PDF文件验证结果:
因为使用的是免费版,所以只有前三页是正常的。。。有超过三页需求的可以选择付费版本。

文字版PDF原文件效果:

报错了无法转换。。。

java.lang.NullPointerException
	at com.spire.pdf.PdfPageWidget.spr┢⅛(Unknown Source)
	at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.spr†™—(Unknown Source)
	at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr︻┎—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr┻┑—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
	at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘—(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr╻⅝(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.spr┦⅝(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr┞⅝(Unknown Source)
	at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)

3、将excel文件转换成html文件

excel原文件效果:

3.1 使用aspose

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook(excelPath);
        com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
        workbook.save(htmlPath, options);
    }

转换成html的效果:

3.2 使用poi

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        try(FileOutputStream fileOutputStream = new FileOutputStream(path)){
            String htmlStr = excelToHtmlStr(excelPath);
            byte[] bytes = htmlStr.getBytes();
            fileOutputStream.write(bytes);
        }
    }


    public String excelToHtmlStr(String excelPath) throws Exception {
        FileInputStream fileInputStream = new FileInputStream(excelPath);
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

转换成html的效果:

3.3 使用spire

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook();
        workbook.loadFromFile(excelPath);
        workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
    }

转换成html的效果:

四、总结

从上述的效果展示我们可以发现其实转成html效果不是太理想,很多细节样式没有还原,这其实是因为这类转换往往都是追求目标是通过使用文档中的语义信息并忽略其他细节来生成简单干净的 HTML,所以在转换过程中复杂样式被忽略,比如居中、首行缩进、字体,文本大小,颜色。举个例子在转换是 会将应用标题 1 样式的任何段落转换为 h1 元素,而不是尝试完全复制标题的样式。所以转成html的显示效果往往和原文档不太一样。这意味着对于较复杂的文档而言,这种转换不太可能是完美的。但如果都是只使用简单样式文档或者对文档样式不太关心的这种方式也不妨一试。

PS:如果想要展示效果好的话,其实可以将上篇文章《文档在线预览(一)通过将txt、word、pdf转成图片实现在线预览功能》说的内容和本文结合起来使用,即将文档里的内容都生成成图片(很可能是多张图片),然后将生成的图片全都放到一个html页面里 ,用html+css来保持样式并实现多张图片展示,再将html返回。开源组件kkfilevie就是用的就是这种做法。

kkfileview展示效果如下:

下图是kkfileview返回的html代码,从html代码我们可以看到kkfileview其实是将文件(txt文件除外)每页的内容都转成了图片,然后将这些图片都嵌入到一个html里,再返回给用户一个html页面。