How front-end developers can preview PDF, Word, XLS, PPT, and other files online

1. Previewing PDF files online from the front end

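Method 1: In principle a browser can open a PDF directly for preview, but it does so in a new page. If all you need is a plain PDF preview and the UI requirements are modest, an a tag with an href attribute is enough:

<a href="document URL" target="_blank">Preview PDF</a>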

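Method 2: use the jQuery plugin jquery.media.js. This plugin can preview PDFs (and various other media files) but cannot handle Word and similar file types. Implementation, starting with the scripts to include: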

<script type="text/javascript" src="jquery-1.7.1.min.js"></script> 
<script type="text/javascript" src="jquery.media.js"></script>

HTML structure:

 <body>
 <div id="handout_wrap_inner"></div>
 </body>

Usage:

<script type="text/javascript"> 
 $('#handout_wrap_inner').media({
		width: '100%',
		height: '100%',
		autoplay: true,
 src:'http://storage.xuetangx.com/public_assets/xuetangx/PDF/PlayerAPI_v1.0.6.pdf',
 }); 
</script>

Method 3: embed an iframe directly in the page

$("<iframe src='"+ this.previewUrl +"' width='100%' height='362px' frameborder='1'>").appendTo($(".video-handouts-preview"));

You can also put fallback text between the iframe tags, like this:

<iframe :src="previewUrl" width="100%" height="100%">
This browser does not support PDFs. Please download the PDF to view it: <a :href="previewUrl">Download PDF</a>
</iframe>

Method 4: embed the content with the <embed> tag

<embed :src="previewUrl" type="application/pdf" width="100%" height="100%">

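In HTML5 this tag takes four attributes: height, width, type, and src (the file to preview). Unlike <iframe></iframe>, <embed> has no closing tag and therefore cannot hold fallback content: if the browser does not support embedded PDFs, nothing at all is displayed.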

Method 5: the <object> tag, which is used in much the same way as an iframe

<object :data="previewUrl" type="application/pdf" width="100%" height="100%">
This browser does not support PDFs. Please download the PDF to view it: <a :href="previewUrl">Download PDF</a>
</object>

Apart from Method 2, all of the methods above pull the content into the page directly through a tag.

Method 6: PDFObject

PDFObject also works by inserting an embedding tag into the page. Here is the code:

<!DOCTYPE html>
<html>
<head>
 <title>Show PDF</title>
 <meta charset="utf-8" />
 <script type="text/javascript" src='pdfobject.min.js'></script>
 <style type="text/css">
 html,body,#pdf_viewer{
 width: 100%;
 height: 100%;
 margin: 0;
 padding: 0;
 }
 </style>
</head>
<body>
 <div id="pdf_viewer"></div>
 <script type="text/javascript">
 if(PDFObject.supportsPDFs){
 // Embed the PDF in the page
 PDFObject.embed("index.pdf", "#pdf_viewer" );
 } else {
 location.href = "/canvas";
 }
 </script>
</body>
</html>

You can also use the following code to check whether the browser supports inline PDF preview:

if(PDFObject.supportsPDFs){
 console.log("Yay, this browser supports inline PDFs.");
} else {
 console.log("Boo, inline PDFs are not supported by this browser");
}

Method 7: PDF.js

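PDF.js lets you view PDF documents directly in HTML. It is a powerful open-source library for reading and parsing PDF documents, and it renders PDF files to Canvas. PDF.js consists mainly of two library files, pdf.js and pdf.worker.js: the former handles the API layer and the latter does the core parsing.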

2. Previewing Word, XLS, and PPT files online

Previewing Word, PPT, and XLS files online is fairly simple: you can call Microsoft's online viewer directly. (Prerequisite: the file must be publicly accessible.)

<iframe src='https://view.officeapps.live.com/op/view.aspx?src=http://storage.xuetangx.com/public_assets/xuetangx/PDF/1.xls' width='100%' height='100%' frameborder='1'>
			</iframe>

The src query parameter is the URL of the file to preview. See Microsoft's documentation for this viewer endpoint for details.
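
Because the file address travels inside a query string, it is generally safest to percent-encode it before appending it to the viewer URL. The following is a minimal server-side sketch in Java, assuming the preview link is assembled on the back end. The class and method names (PreviewUrls, officeViewerUrl) are illustrative and are not part of any Microsoft API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: the names here are illustrative, not taken from the article.
final class PreviewUrls {
    private static final String OFFICE_VIEWER = "https://view.officeapps.live.com/op/view.aspx?src=";

    // Builds a viewer link for a publicly reachable file URL.
    static String officeViewerUrl(String fileUrl) {
        try {
            // Percent-encode the file URL so its own ':', '/', '?' and '&' cannot break the query string
            return OFFICE_VIEWER + URLEncoder.encode(fileUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every JVM, so this branch is not expected to run
            throw new IllegalStateException(e);
        }
    }
}
```

The returned string can then be used as the src of the iframe shown above.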

Addendum: Google's online document viewer works the same way as Microsoft's (the file must likewise be publicly accessible).

<iframe :src="'https://docs.google.com/viewer?url=' + fileurl"></iframe>

3. Word files

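XDOC can preview DOC documents represented as a Data URI. It can also preview plain text, parameterized text, HTML text, JSON text, official documents (公文), and more. See the official documentation for implementation details.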

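The following approach gives a quick Word preview, although it may impose some restrictions depending on the editor used to create the file.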

<a href="http://www.xdocin.com/xdoc?_func=to&_format=html&_cache=1&_xdoc=http://www.xdocin.com/demo/demo.docx" target="_blank" rel="nofollow">XDOC</a>

4. Excel files

Excel files now have a parsing library comparable to pdf.js: sheet.js.

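Summary:

1. For a free, pure front-end way to preview Word, Excel, and PPT files online, Microsoft's online viewer is the best choice (read-only).

2. Convert the files to images on the back end and show them as images in the front end (a workable option).

3. Buy an online preview service, for example Baidu DOC, 永中 (Yozo), or I DOC VIEW.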

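Note: the content of this article was compiled from material found online. It is published here only so that I and fellow programmers on Toutiao can find it easily later.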



Author: 子木, 政采云 front-end team

Source link: https://mp.weixin.qq.com/s/Wx_gJLrZftJ_dm2phoUf8g

I. Preface

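Besides converting documents to images, the approach covered in the previous article "文档在线预览新版(一)通过将文件转成图片实现在线预览功能" (Online Document Preview, New Edition (1): Previewing by Converting Files to Images), documents can also be converted to PDF and previewed in the front end with plugins such as pdf.js and pdfobject.js, or converted to HTML, which is the approach this article covers.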

The code below shows how to convert Word, PDF, and Excel files to HTML using aspose, poi + pdfbox, and spire respectively.

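1. aspose

Aspose is a vendor of components for .NET, Java, SharePoint, JasperReports, and SSRS. Thousands of organizations in dozens of countries have used Aspose components to create, edit, convert, or render Office, OpenOffice, PDF, image, ZIP, CAD, XPS, EPS, PSD, and other file formats. Note that Aspose is a commercial product: without a license, exported files are covered in watermarks (respect copyright and stay away from cracked versions).

Add the following dependencies to the project's pom file: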

        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-words</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-pdf</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-cells</artifactId>
            <version>23.1</version>
        </dependency>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-slides</artifactId>
            <version>23.1</version>
        </dependency>

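2. poi + pdfbox

aspose and spire are easy to use, but both are commercial components, so an approach based on open-source libraries is provided here as well.

POI is a free, open-source, cross-platform Java API from the Apache Software Foundation. Apache POI provides APIs that let Java programs read and write Microsoft Office file formats.

Apache PDFBox is an open-source Java library for working with PDF documents. With it you can write Java programs that create, convert, and manipulate PDF documents.

Add the following dependencies to the project's pom file. The PDFDomTree and XHTMLConverter classes used in the listings later in this article come from separate libraries (Pdf2Dom and XDocReport) that are not listed here: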

		<dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.4</version>
        </dependency>
		<dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>5.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-excelant</artifactId>
            <version>5.2.0</version>
        </dependency>

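3. spire

spire is a professional Office programming component that covers reading, writing, editing, and viewing Word, Excel, PPT, PDF, and other files. A free edition is available, but it exports only the first 3 pages and the first 500 lines, and the restriction applies as soon as either limit is reached. Going beyond those limits requires the paid edition (respect copyright and stay away from cracked versions). The free edition is used for the demonstrations here.

Before adding the spire dependency to the pom, you first have to add its Maven repository: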

		<repository>
            <id>com.e-iceblue</id>
            <name>e-iceblue</name>
            <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
        </repository>

Then add the following dependency to the project's pom file.

Free edition:

		<dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office.free</artifactId>
            <version>5.3.1</version>
        </dependency>

Paid edition:

		<dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.office</artifactId>
            <version>5.3.1</version>
        </dependency>

II. Converting files to an HTML string

1. Converting a Word file to an HTML string

1.1 Using aspose

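public static String wordToHtmlStr(String wordPath) {
        // Save the document as HTML into an in-memory stream, then read it back as a string
        try (ByteArrayOutputStream os = new ByteArrayOutputStream()) {
            Document doc = new Document(wordPath); // wordPath is the Word document to convert
            HtmlSaveOptions options = new HtmlSaveOptions();
            options.setExportImagesAsBase64(true); // inline images so the string is self-contained
            doc.save(os, options);
            return os.toString("UTF-8"); // assumes the default UTF-8 output encoding
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }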

Result:

1.2 Using poi

public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        String htmlStr = null;
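        // Derive the extension defensively: tolerate a missing dot and upper-case extensions
        int dot = wordPath.lastIndexOf('.');
        String ext = dot < 0 ? "" : wordPath.substring(dot).toLowerCase();
        if (ext.equals(".docx")) {
            htmlStr = word2007ToHtmlStr(wordPath);
        } else if (ext.equals(".doc")){
            htmlStr = word2003ToHtmlStr(wordPath);
        } else {
            throw new RuntimeException("Unsupported file format: " + ext);
        }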
        return htmlStr;
    }

    public String word2007ToHtmlStr(String wordPath) throws IOException {
        // Collect the HTML in an in-memory output stream
        try(ByteArrayOutputStream out = new ByteArrayOutputStream()){
            word2007ToHtmlOutputStream(wordPath, out);
            return out.toString();
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        // Relaxes POI's zip-bomb protection; only appropriate for trusted input
        ZipSecureFile.setMinInflateRatio(-1.0d);
        // try-with-resources closes the input stream and the document after conversion
        try (InputStream in = Files.newInputStream(Paths.get(wordPath));
             XWPFDocument document = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
            // Write the converted XHTML to the supplied output stream
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }


    private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // Transform document to string
        StringWriter writer = new StringWriter();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
        return writer.toString();
    }

    private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        // try-with-resources closes the input stream and the document after conversion
        try (InputStream input = Files.newInputStream(Paths.get(wordPath));
             HWPFDocument wordDocument = new HWPFDocument(input)) {
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());
            wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                if (PictureType.UNKNOWN.equals(pictureType)) {
                    return null;
                }
                BufferedImage bufferedImage = ImgUtil.toImage(content);
                String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
                // For Word files with pictures, embed each picture as a base64 data URI so everything stays in one page
                return "data:;base64," + base64Img;
            });
            // Parse the Word document
            wordToHtmlConverter.processDocument(wordDocument);
            return wordToHtmlConverter.getDocument();
        }
    }

1.3 Using spire

 public String wordToHtmlStr(String wordPath) throws IOException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Document document = new Document();
            document.loadFromFile(wordPath);
            document.saveToFile(outputStream, FileFormat.Html);
            return outputStream.toString();
        }
    }

2. Converting a PDF file to an HTML string

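2.1 Using aspose

(The listing below actually relies on PDFBox's PDDocument and Pdf2Dom's PDFDomTree, the same code as in 2.2, rather than on the Aspose API.)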

public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

Result:

2.2 Using poi + pdfbox

public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new StringWriter();
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
        return writer.toString();
    }

2.3 Using spire

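public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            PdfDocument pdf = new PdfDocument();
            pdf.loadFromFile(pdfPath);
            // Render the document as HTML into the in-memory stream before reading it back
            pdf.saveToStream(outputStream, com.spire.pdf.FileFormat.HTML);
            return outputStream.toString();
        }
    }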

3. Converting an Excel file to an HTML string

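3.1 Using aspose

(The listing below actually uses Apache POI classes such as XSSFWorkbook and DataFormatter rather than the Aspose API.)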

public static String excelToHtmlStr(String excelPath) throws Exception {
        // try-with-resources closes both the input stream and the workbook
        try (FileInputStream fileInputStream = new FileInputStream(excelPath);
             Workbook workbook = new XSSFWorkbook(fileInputStream)) {
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            // Only the first sheet is converted
            Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    // formatCellValue evaluates formula cells through the evaluator and
                    // returns the value as it would be displayed
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

Returned HTML string:

<html><head><title>Excel to HTML using Java and POI library</title><style>table, th, td { border: 1px solid black; }</style></head><body><table><tr><td>序号</td><td>姓名</td><td>性别</td><td>联系方式</td><td>地址</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>1</td><td>张晓玲</td><td>女</td><td>11111111111</td><td>上海市浦东新区xx路xx弄xx号</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦东新区xx路xx弄xx号</td></tr></table></body></html>
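
The excelToHtmlStr listing appends each cell value to the markup as-is, so a cell containing characters such as < or & could break the generated table. Below is a minimal sketch of an escaping helper. HtmlEscaper and its escape method are hypothetical names introduced here for illustration and are not part of POI or of the original code.

```java
// Hypothetical helper, not part of POI or of the original listing.
final class HtmlEscaper {
    // Replaces the characters that have special meaning in HTML text and attribute values.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With such a helper in place, the cell would be appended as append("<td>").append(HtmlEscaper.escape(cellValue)).append("</td>").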

3.2 Using poi

public String excelToHtmlStr(String excelPath) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

3.3 Using spire

public String excelToHtmlStr(String excelPath) throws Exception {
        try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            Workbook workbook = new Workbook();
            workbook.loadFromFile(excelPath);
            workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
            return outputStream.toString();
        }
    }

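III. Converting files to HTML and writing an HTML file

Sometimes returning an HTML string is not enough and you need to produce an HTML file. A low-effort way to do that is to write the string to the target path with the FileUtils utility class from the org.apache.commons.io package.

Example of writing an HTML string to an HTML file with FileUtils:

First add the dependency to the pom: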

		<dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.8.0</version>
        </dependency>

Code:

String htmlStr = FileConvertUtil.wordToHtmlStr("D:\\书籍\\电子书\\小说\\历史小说\\最后的可汗.doc");
FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");
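
If adding commons-io is not desirable, the JDK alone can do the same job. The sketch below assumes Java 7 or later. HtmlFileWriter and writeHtml are hypothetical names used for illustration only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper showing a JDK-only alternative to FileUtils.write.
final class HtmlFileWriter {
    // Writes the HTML string to targetPath as UTF-8, creating parent directories if needed.
    static Path writeHtml(String htmlStr, String targetPath) throws IOException {
        Path target = Paths.get(targetPath);
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Creates the file, or truncates it if it already exists
        return Files.write(target, htmlStr.getBytes(StandardCharsets.UTF_8));
    }
}
```

Encoding the string explicitly as UTF-8 keeps the output independent of the platform default charset.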

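Alternatively, the code above can be adjusted to write the HTML file directly. The adjusted code is as follows:

1. Converting a Word file to an HTML file

Original Word file:

1.1 Using aspose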

public static void wordToHtml(String wordPath, String htmlPath) {
        File sourceFile = new File(wordPath);
        String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
        // Create the target HTML file; try-with-resources closes the stream even if conversion fails
        try (FileOutputStream os = new FileOutputStream(new File(path))) {
            Document doc = new Document(wordPath); // wordPath is the Word document to convert
            HtmlSaveOptions options = new HtmlSaveOptions();
            options.setExportImagesAsBase64(true);
            options.setExportRelativeFontSize(true);
            doc.save(os, options);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Result after conversion to HTML:

1.2 Using poi

public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
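        // Derive the extension defensively: tolerate a missing dot and upper-case extensions
        int dot = wordPath.lastIndexOf('.');
        String ext = dot < 0 ? "" : wordPath.substring(dot).toLowerCase();
        if (ext.equals(".docx")) {
            word2007ToHtml(wordPath, htmlPath);
        } else if (ext.equals(".doc")){
            word2003ToHtml(wordPath, htmlPath);
        } else {
            throw new RuntimeException("Unsupported file format: " + ext);
        }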
    }

    public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        //try(OutputStream out = Files.newOutputStream(Paths.get(path))){
        try(FileOutputStream out = new FileOutputStream(htmlPath)){
            word2007ToHtmlOutputStream(wordPath, out);
        }
    }

    private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
        // Relaxes POI's zip-bomb protection; only appropriate for trusted input
        ZipSecureFile.setMinInflateRatio(-1.0d);
        // try-with-resources closes the input stream and the document after conversion
        try (InputStream in = Files.newInputStream(Paths.get(wordPath));
             XWPFDocument document = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
            // Write the converted XHTML to the supplied output stream
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }

    public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
        org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
        // Write the HTML to the target file

        try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(outStream);
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer serializer = factory.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
        }
    }

    private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
        // try-with-resources closes the input stream and the document after conversion
        try (InputStream input = Files.newInputStream(Paths.get(wordPath));
             HWPFDocument wordDocument = new HWPFDocument(input)) {
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder()
                            .newDocument());
            wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                if (PictureType.UNKNOWN.equals(pictureType)) {
                    return null;
                }
                BufferedImage bufferedImage = ImgUtil.toImage(content);
                String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
                // For Word files with pictures, embed each picture as a base64 data URI so everything stays in one page
                return "data:;base64," + base64Img;
            });
            // Parse the Word document
            wordToHtmlConverter.processDocument(wordDocument);
            return wordToHtmlConverter.getDocument();
        }
    }
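
Several listings in this article call FileUtil.getNewFileFullPath, whose source is not shown. Judging from the inline path-building in the aspose version of wordToHtml, it presumably derives the output file name from the source file name plus a new extension. The following is a sketch under that assumption, not the author's actual helper. FileUtilSketch is a hypothetical name.

```java
import java.io.File;

// Assumed behaviour only: the article does not show the real FileUtil class.
final class FileUtilSketch {
    // sourcePath: input file; targetDir: output directory; newExt: extension without the dot
    static String getNewFileFullPath(String sourcePath, String targetDir, String newExt) {
        String name = new File(sourcePath).getName();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension (or the name starts with a dot)
        String baseName = dot > 0 ? name.substring(0, dot) : name;
        return targetDir + File.separator + baseName + "." + newExt;
    }
}
```

Under this reading, a source file named report.docx with the target directory out and the extension html yields report.html inside out.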

Result after conversion to HTML:

1.3 Using spire

public void wordToHtml(String wordPath, String htmlPath) {
        htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
        Document document = new Document();
        document.loadFromFile(wordPath);
        document.saveToFile(htmlPath, FileFormat.Html);
    }

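Result after conversion to HTML:

Because the free edition is used, page and word-count limits apply; choose the paid edition if you need full functionality. PS: this time, surprisingly, the first 50 pages of a 76-page document were converted successfully.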

2. Converting a PDF file to an HTML file

Original image-based PDF file:

Original text-based PDF file:

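2.1 Using aspose

(The listing below actually relies on PDFBox's PDDocument and Pdf2Dom's PDFDomTree rather than on the Aspose API.)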

public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        File file = new File(pdfPath);
        String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

Result for the image-based PDF:

Result for the text-based PDF:

2.2 Using poi + pdfbox

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PDDocument document = PDDocument.load(new File(pdfPath));
        Writer writer = new PrintWriter(path, "UTF-8");
        new PDFDomTree().writeText(document, writer);
        writer.close();
        document.close();
    }

Result for the image-based PDF:

Result for the text-based PDF:

2.3 Using spire

public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
        htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile(pdfPath);
        pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
    }

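Result for the image-based PDF:
Because the free edition is used, only the first three pages come out correctly. If you need more than three pages, choose the paid edition.

Result for the text-based PDF:

The conversion failed with an error: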

java.lang.NullPointerException
	at com.spire.pdf.PdfPageWidget.spr┢⅛(Unknown Source)
	at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.spr†™—(Unknown Source)
	at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
	at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr︻┎—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.spr┻┑—(Unknown Source)
	at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
	at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘—(Unknown Source)
	at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr╻⅝(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.spr┦⅝(Unknown Source)
	at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
	at com.spire.pdf.PdfDocumentBase.spr┞⅝(Unknown Source)
	at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)

3. Converting an Excel file to an HTML file

Original Excel file:

3.1 Using aspose

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook(excelPath);
        com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
        workbook.save(htmlPath, options);
    }

Result after conversion to HTML:

3.2 Using poi

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        try(FileOutputStream fileOutputStream = new FileOutputStream(path)){
            String htmlStr = excelToHtmlStr(excelPath);
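            // Encode explicitly as UTF-8 instead of relying on the platform default charset
            byte[] bytes = htmlStr.getBytes(java.nio.charset.StandardCharsets.UTF_8);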
            fileOutputStream.write(bytes);
        }
    }


    public String excelToHtmlStr(String excelPath) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
            DataFormatter dataFormatter = new DataFormatter();
            FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
            org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
            StringBuilder htmlStringBuilder = new StringBuilder();
            htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
            htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
            htmlStringBuilder.append("</head><body><table>");
            for (Row row : sheet) {
                htmlStringBuilder.append("<tr>");
                for (Cell cell : row) {
                    CellType cellType = cell.getCellType();
                    if (cellType == CellType.FORMULA) {
                        formulaEvaluator.evaluateFormulaCell(cell);
                        cellType = cell.getCachedFormulaResultType();
                    }
                    String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                }
                htmlStringBuilder.append("</tr>");
            }
            htmlStringBuilder.append("</table></body></html>");
            return htmlStringBuilder.toString();
        }
    }

Result after conversion to HTML:

3.3 Using spire

public void excelToHtml(String excelPath, String htmlPath) throws Exception {
        htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
        Workbook workbook = new Workbook();
        workbook.loadFromFile(excelPath);
        workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
    }

Result after conversion to HTML:

IV. Summary

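As the results above show, the converted HTML is not ideal: many styling details are not reproduced. This is because such conversions typically aim to produce simple, clean HTML from the semantic information in the document and ignore everything else, so complex styling such as centering, first-line indents, fonts, text size, and color is dropped during conversion. For example, any paragraph with the Heading 1 style is converted to an h1 element, with no attempt to reproduce the heading's exact appearance. As a result, the HTML usually looks different from the original document, and for more complex documents the conversion is unlikely to be perfect. If your documents use only simple styles, or you do not care much about styling, this approach is still worth a try.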
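
To make the idea concrete, here is a toy sketch of semantic conversion: it looks only at a paragraph's style name and picks an HTML element, discarding everything else about the paragraph's appearance. It is not how aspose, poi, or spire work internally. The class, the method, and the style-name table are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a tiny style-name to HTML-tag mapping.
final class SemanticConverterSketch {
    private static final Map<String, String> STYLE_TO_TAG = new HashMap<>();
    static {
        STYLE_TO_TAG.put("Heading 1", "h1");
        STYLE_TO_TAG.put("Heading 2", "h2");
        STYLE_TO_TAG.put("Heading 3", "h3");
    }

    // text is assumed to be HTML-escaped already; fonts, colors and alignment are ignored on purpose
    static String paragraphToHtml(String styleName, String text) {
        String tag = STYLE_TO_TAG.getOrDefault(styleName, "p");
        return "<" + tag + ">" + text + "</" + tag + ">";
    }
}
```

Any paragraph whose style is not in the table falls back to a plain p element, which is why styling information disappears in this kind of conversion.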

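PS: for better visual fidelity, you can combine the approach from the previous article, "文档在线预览(一)通过将txt、word、pdf转成图片实现在线预览功能" (Online Document Preview (1): Previewing by Converting txt, Word, and PDF to Images), with this one: render the document's content to images (most likely several), place all of them in a single HTML page, use HTML + CSS to lay them out, and return that HTML. The open-source component kkfileview takes exactly this approach.

kkfileview's rendering looks like this:

The figure below shows the HTML returned by kkfileview. From it we can see that kkfileview converts each page of the file (txt files excepted) to an image, embeds all of the images in one HTML page, and returns that page to the user.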
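
The image-per-page idea can be sketched as follows. This illustrates the technique only and is not kkfileview's actual code. ImagePageBuilder, buildPage, and the inline CSS are assumptions, and the image URLs are expected to point at page images that an earlier step has already rendered.

```java
import java.util.List;

// Illustrative only: wraps pre-rendered page images in a single HTML page.
final class ImagePageBuilder {
    // title and URLs are assumed to be safe to place in HTML as-is
    static String buildPage(String title, List<String> pageImageUrls) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>")
            .append(title)
            .append("</title><style>img{display:block;width:100%;margin:0 auto 8px;}</style></head><body>");
        // One img element per page, in document order
        for (int i = 0; i < pageImageUrls.size(); i++) {
            html.append("<img src=\"").append(pageImageUrls.get(i))
                .append("\" alt=\"page ").append(i + 1).append("\">");
        }
        html.append("</body></html>");
        return html.toString();
    }
}
```

Because every page is an image, the original layout is preserved exactly, at the cost of text that can no longer be selected or searched.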