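Using Python to scrape WeChat official account articles and save them as HTML

Last time I showed how to use Python to scrape official account articles and save them locally as PDF files. PDFs downloaded that way contain only text and no images, so the approach only suits accounts whose articles have no images or whose images do not matter. What if I want to download both the images and the text? Today I will introduce another option: HTML.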

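Problems to solve

There are actually three problems to solve:

Images in the article are not saved into the PDF file.
Some code snippets in the article, especially those with long single lines, are cut off when saved as PDF.
PDF paginates automatically, which causes problems when a page break falls on code or an image.

Given these problems, I think an article looks best when downloaded as an HTML web page. Here is how to do it.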

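Implementation

Article links are obtained the same way as in the previous article on PDF downloads: through the hyperlink lookup in the official account platform's article editor. We can take last issue's code and modify it. First copy the original file gzh_download.py to gzh_download_html.py, then adapt the code as follows: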

# gzh_download_html.py
# Import modules
import requests
import json
import re
import time
from bs4 import BeautifulSoup
import os

# Load the saved login cookies from cookie.txt
with open("cookie.txt", "r") as file:
    cookie = file.read()
cookies = json.loads(cookie)
url = "https://mp.weixin.qq.com"
# Request the official account platform
response = requests.get(url, cookies=cookies)
# Extract the token from the URL we were redirected to
token = re.findall(r'token=(\d+)', str(response.url))[0]
# Set the request headers
headers = {
    "Referer": "https://mp.weixin.qq.com/cgi-bin/appmsg?t=media/appmsg_edit_v2&action=edit&isNew=1&type=10&token=" + token + "&lang=zh_CN",
    "Host": "mp.weixin.qq.com",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
}

# Note: in the final script, save() and its helper functions must be
# defined above this loop, because Python runs the file top to bottom.

# Loop over the first 10 pages of articles (5 articles per page)
for j in range(1, 11):
    begin = (j-1)*5
    # Request the current page to get the article list
    requestUrl = "https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin="+str(begin)+"&count=5&fakeid=MzU1NDk2MzQyNg==&type=9&query=&token=" + token + "&lang=zh_CN&f=json&ajax=1"
    search_response = requests.get(requestUrl, cookies=cookies, headers=headers)
    # Parse the returned JSON
    re_text = search_response.json()
    # Fall back to an empty list if the key is missing (for example, when rate limited)
    article_list = re_text.get("app_msg_list") or []
    # Loop over the articles on the current page
    for i in article_list:
        # The directory is named after the title; it holds the html file and the images
        dir_name = i["title"].replace(' ','')
        print("Downloading article: " + dir_name)
        # Request the article URL to get its content
        response = requests.get(i["link"], cookies=cookies, headers=headers)
        # Save the article locally
        save(response, dir_name, i["aid"])
        print(dir_name + " download complete!")
    # Requests that come too fast may get blocked by WeChat, so wait 10 seconds
    time.sleep(10)

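As the code above shows, the main change is replacing the original call pdfkit.from_url(i["link"], i["title"] + ".pdf") with a requests call that fetches the article URL, followed by a call to a method that saves the article page and its images locally. The save() method is implemented as follows.

The save method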

# Save the downloaded html page and its images
def save(search_response, html_dir, file_name):
    # Where the html file is saved
    htmlDir = os.path.join(os.path.dirname(os.path.abspath(__file__)), html_dir)
    # Where the images are saved
    targetDir = os.path.join(htmlDir, 'images')
    # Create the directories if they do not exist
    if not os.path.isdir(targetDir):
        os.makedirs(targetDir)
    domain = 'https://mp.weixin.qq.com/s'
    # Save the html page
    save_html(search_response, htmlDir, file_name)
    # Save the images and rewrite their links in the html file
    save_file_to_local(htmlDir, targetDir, search_response, domain, file_name)

# Save the images locally
def save_file_to_local(htmlDir, targetDir, search_response, domain, file_name):
    # Parse the returned page with lxml
    obj = BeautifulSoup(search_response.content, 'lxml')
    # Find all img tags
    imgs = obj.find_all('img')
    # Collect the image links on the page
    urls = []
    for img in imgs:
        if img.get('data-src'):
            urls.append(img['data-src'])
        elif img.get('src'):
            urls.append(img['src'])

    # Loop over the image links and save each image as 0, 1, 2, ...
    i = 0
    for each_url in urls:
        # Build the download URL according to the link format
        if each_url.startswith('//'):
            new_url = 'https:' + each_url
        elif each_url.startswith('/') and each_url.endswith('gif'):
            new_url = domain + each_url
        elif each_url.endswith(('png', 'jpg', 'gif', 'jpeg')):
            new_url = each_url
        else:
            # Unrecognized link format: skip it instead of reusing a stale response
            continue
        r_pic = requests.get(new_url)
        local_name = str(i) + '.jpeg'
        # Absolute path of the local image file
        t = os.path.join(targetDir, local_name)
        print('This article has ' + str(len(urls)) + ' images; processing image ' + str(i + 1) + '...')
        # Write the image to the target directory
        with open(t, 'wb') as fw:
            fw.write(r_pic.content)
        i += 1
        # Point the old or relative link at the local image; a path relative
        # to the html file keeps the article folder portable
        update_file(each_url, 'images/' + local_name, htmlDir, file_name)

# Save the HTML locally
def save_html(url_content, htmlDir, file_name):
    with open(os.path.join(htmlDir, file_name + '.html'), 'wb') as f:
        f.write(url_content.content)
    return url_content

# Edit the HTML file so that image paths point to the local copies
def update_file(old, new, htmlDir, file_name):
    html_path = os.path.join(htmlDir, file_name + '.html')
    bak_path = os.path.join(htmlDir, file_name + '_bak.html')
    # Read from the original file and write the modified content to a second file
    with open(html_path, encoding='utf-8') as f, open(bak_path, 'w', encoding='utf-8') as fw:
        for line in f:
            new_line = line.replace(old, new)
            # The raw HTML may encode '&' in the link as '&amp;'
            new_line = new_line.replace(old.replace('&', '&amp;'), new)
            new_line = new_line.replace("data-src", "src")
            fw.write(new_line)
    # Replace the original file with the modified one
    os.replace(bak_path, html_path)
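
The link handling inside save_file_to_local can also be isolated into a small pure function, which makes it easy to check without touching the network. The following is only a sketch, not the original author's code: resolve_image_url is a hypothetical helper name, and it uses urllib.parse.urljoin from the standard library in place of manual string concatenation. Note that urljoin resolves a root-relative path such as /a/b.gif against the host, which differs from the original's domain + path concatenation.

```python
from urllib.parse import urljoin

# Hypothetical helper (not part of the original script): turn the value of an
# <img> data-src/src attribute into an absolute URL that requests can fetch.
def resolve_image_url(raw_url, page_url="https://mp.weixin.qq.com/s"):
    raw_url = (raw_url or "").strip()
    # Empty attributes and inline data: URIs have nothing to download
    if not raw_url or raw_url.startswith("data:"):
        return None
    # urljoin handles scheme-relative (//host/..), root-relative (/path)
    # and already-absolute URLs in one call
    return urljoin(page_url, raw_url)
```

Calling resolve_image_url on each collected link and skipping the None results would give a list of absolute URLs ready to pass to requests.get.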

That is all the code needed to download the article pages and images locally. Next, run the command python gzh_download_html.py. The program starts and prints a log like this:

$ python gzh_download_html.py
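Downloading article: 学习Python看这一篇就够了！
This article has 3 images; processing image 1...
This article has 3 images; processing image 2...
This article has 3 images; processing image 3...
学习Python看这一篇就够了！ download complete!
Downloading article: PythonFlask数据可视化
This article has 2 images; processing image 1...
This article has 2 images; processing image 2...
PythonFlask数据可视化 download complete!
Downloading article: 教你用Python下载手机小视频
This article has 11 images; processing image 1...
This article has 11 images; processing image 2...
This article has 11 images; processing image 3...
This article has 11 images; processing image 4...
This article has 11 images; processing image 5...
This article has 11 images; processing image 6...
This article has 11 images; processing image 7...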

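Now go to the directory where the script lives, and you will see one folder per article, each named after the article title.

Inside an article's folder there is one html file and an image directory named images. Double-click the .html file to open it and you will see the article with its images and code boxes, just as it appears in the official account.

Summary

This article showed how to use Python to batch-download official account articles and save them locally as HTML plus images, which makes the articles available for offline reading. If you then want to convert the HTML to PDF, that is simple too: call pdfkit.from_file("xx.html", "target.pdf") to convert the page, and the resulting PDF will include the images.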
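
As a rough illustration of that last step, the sketch below wraps pdfkit.from_file in a small function. The names html_to_pdf and default_pdf_path are hypothetical, and the enable-local-file-access option is an assumption: recent wkhtmltopdf builds (0.12.6 and later) need it to load images from local disk, while older builds may reject it, in which case it should be dropped.

```python
import os

def default_pdf_path(html_path):
    # Put the PDF next to the HTML file, with the same base name
    root, _ = os.path.splitext(html_path)
    return root + ".pdf"

def html_to_pdf(html_path, pdf_path=None):
    # Imported lazily: needs `pip install pdfkit` and a wkhtmltopdf binary
    import pdfkit
    if pdf_path is None:
        pdf_path = default_pdf_path(html_path)
    # Assumption: wkhtmltopdf >= 0.12.6, which blocks local files by default
    options = {"enable-local-file-access": None}
    pdfkit.from_file(html_path, pdf_path, options=options)
    return pdf_path
```

Run it against one of the downloaded article folders, passing the path of the saved .html file.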

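1. Software preparation

1.1. Software packages

1. zabbix-2.4.8.tar.gz and zabbix-3.0.31.tar.gz

Download: https://www.zabbix.com/download

2. php-5.4.16.tar.gz

Download: https://www.php.net/downloads.php

1.2. Notes

During installation, avoid Chinese characters, special characters, and spaces in paths and passwords, and do not use passwords shorter than 8 characters.

Note that you cannot skip versions when upgrading.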

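2. Environment preparation

2.1. Configure /etc/hosts

IP            Hostname      Purpose
10.10.10.181  zabbixserver  Monitoring server

2.2. Application deployment paths

Item                        Path
Apache configuration file   /etc/httpd/conf/httpd.conf
Apache document root        /var/www/html
Zabbix installation path    /usr/local/zabbix
Zabbix configuration file   /usr/local/zabbix/etc/zabbix_server.conf
PHP configuration file      /etc/php.ini
MySQL installation path     /var/lib/mysql/

2.3. Open firewall ports

Open the required ports on each server according to the port plan: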

# firewall-cmd --permanent --zone=public --add-port=3306/tcp
# firewall-cmd --permanent --zone=public --add-port=80/tcp

Reload the firewall:

# firewall-cmd --reload
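
To confirm from another machine that the ports are reachable after the reload, a plain TCP connection attempt is enough. The sketch below is an illustration only: is_port_open and check_ports are hypothetical helper names, and it uses just the standard socket module.

```python
import socket

def is_port_open(host, port, timeout=3.0):
    # connect_ex returns 0 when the TCP handshake succeeds
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

def check_ports(host, ports, timeout=3.0):
    # Map each port to True (reachable) or False (closed or filtered)
    return {port: is_port_open(host, port, timeout) for port in ports}
```

For the server in this guide, check_ports("10.10.10.181", [80, 3306]) would report whether the HTTP and MySQL ports accept connections. A port shows as closed if nothing is listening on it yet, even when the firewall rule is correct.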

2.4. Disable SELinux

# sed -i "s@SELINUX=enforcing@SELINUX=disabled@g" /etc/selinux/config
# cat /etc/selinux/config | grep SELINUX=
# setenforce 0

3. Back up MySQL and Zabbix

1. MySQL backup

# /etc/init.d/zabbix_server stop
# /etc/init.d/zabbix_agentd stop
# mkdir -p /opt/bak/html && cd /opt/bak
# mysqldump -uroot -p zabbix > /opt/bak/zabbix.sql

2. Zabbix configuration backup

# cp /usr/local/zabbix/etc/zabbix_server.conf /opt/bak
# cp /etc/php.ini /opt/bak
# cp /etc/httpd/conf/httpd.conf /opt/bak
# cp -R /var/www/html/* /opt/bak/html/

4. Prepare the LAMP environment

4.1. Install dependencies

# yum install httpd php php-gd gcc php-mysql php-xml libcurl-devel curl-* net-snmp* libxml2-* bcmath mbstring php-devel lrzsz wget vim zip unzip net-tools ntpdate ntp php-bcmath php-mbstring -y

4.2. Create the user

# useradd zabbix -s /sbin/nologin -M

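4.3. Install MySQL

See my Toutiao article: "CentOS7.x生产环境MySQL社区版yum方式部署" (deploying MySQL Community Edition with yum on CentOS 7.x in production).

4.4. Create the database and import the data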

SQL> create database zabbix;
SQL> grant all on zabbix.* to zabbix@localhost identified by 'zabbixpwd123';
SQL> flush privileges;
# mysql -uroot -p zabbix < /opt/bak/zabbix.sql

4.5. Configure kernel parameters

# vi /etc/sysctl.conf
kernel.shmmax = 34359738368
kernel.shmmni = 4096
kernel.shmall = 8388608
kernel.sem = 1010 129280 1010 128
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 4194304
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576
fs.file-max = 6815744
# /sbin/sysctl -p

4.6. Adjust system resource limits

# vi /etc/security/limits.conf
* soft nproc 2047
* hard nproc 16384
* soft nofile 1024
* hard nofile 65536
* soft stack 10240

5. Deploy Zabbix 2.4.8

5.1. Install Zabbix

1. Download:

# cd /opt/ && wget -O zabbix-2.4.8.tar.gz "https://sourceforge.net/projects/zabbix/files/ZABBIX%20Latest%20Stable/2.4.8/zabbix-2.4.8.tar.gz/download?use_mirror=nchc&download="

2. Upload zabbix-2.4.8.tar.gz to the /opt directory on the server and extract it

# tar -zxvf zabbix-2.4.8.tar.gz

3. Compile and install zabbix_server

# find / -name mysql_config
# cd /opt/zabbix-2.4.8
# ./configure --prefix=/usr/local/zabbix --enable-server --enable-agent --with-mysql=/var/lib/mysql/bin/mysql_config --with-net-snmp --with-libcurl --with-libxml2
# make && make install
# cd /opt/zabbix-2.4.8/misc/init.d/fedora/core
# cp zabbix_server /etc/init.d/
# cp zabbix_agentd /etc/rc.d/init.d/
# chmod +x /etc/rc.d/init.d/zabbix_*
# vim /etc/rc.d/init.d/zabbix_server
BASEDIR=/usr/local/zabbix
# vim /etc/rc.d/init.d/zabbix_agentd
BASEDIR=/usr/local/zabbix
# chkconfig zabbix_server on
# chkconfig --add zabbix_server
# chkconfig zabbix_agentd on
# chkconfig --add zabbix_agentd
# cp /opt/bak/zabbix_server.conf /usr/local/zabbix/etc
# cd /opt/ && wget http://www.fping.org/dist/fping-4.2.tar.gz
# tar -zxvf fping-4.2.tar.gz && cd fping-4.2/
# ./configure && make && make install
# which fping
/usr/local/sbin/fping
# find / -name mysql.sock
# mkdir /usr/lib/zabbix/alertscripts -p
# chown -R zabbix:zabbix /usr/lib/zabbix
# egrep -v "^#|^$" /usr/local/zabbix/etc/zabbix_server.conf

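# Note: if the database runs on a different host from zabbix_server, set DBHost to the database server's IP and comment out DBSocket; if they run on the same host, verify the actual DBSocket path.

LogFile=/tmp/zabbix_server.log
DBHost=localhost
DBName=zabbix
DBUser=zabbix
DBPassword=<password>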
DBSocket=/var/lib/mysql/mysql.sock
StartPollers=20
AlertScriptsPath=/usr/lib/zabbix/alertscripts
FpingLocation=/usr/local/sbin/fping
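
The egrep -v "^#|^$" command above lists only the active settings. When the same check needs to be scripted, for example to compare the restored configuration against expected values, a small parser does the job. This is a sketch under assumptions: effective_settings is a hypothetical helper, it also ignores indented comment lines (egrep as written does not), and for a key that appears more than once it keeps only the last value.

```python
def effective_settings(conf_text):
    # Collect key=value pairs, skipping blank lines and comments
    settings = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that are not key=value
            settings[key.strip()] = value.strip()
    return settings

sample = """
# This is a comment
LogFile=/tmp/zabbix_server.log

DBHost=localhost
DBSocket=/var/lib/mysql/mysql.sock
"""
print(effective_settings(sample))
```

Reading /usr/local/zabbix/etc/zabbix_server.conf and passing its text to the function would return the same settings shown in the listing above as a dictionary.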

5.2. Web configuration

# cd /var/www/html/
# cp -R /opt/zabbix-2.4.8/frontends/php/* .
# chown -R apache:apache *
# cp /opt/bak/php.ini /etc/
# vim /etc/httpd/conf/httpd.conf
# Change the following line
#ServerName www.example.com:80
# to
ServerName localhost:80
# In the module section, make sure support for the matching PHP version is added
DirectoryIndex index.html index.php
AddType application/x-httpd-php .php .php3 .php4 .php5
# systemctl restart httpd

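5.3. Log in to the Zabbix web setup interface

1. Open this address in a browser: http://10.10.10.181/setup.php

2. Check the system environment; every item must show OK before you can continue

Error during the environment check at install time: bcmath and mbstring show as Fail

Fix: install the extensions for the matching PHP version

# rpm -qa | grep php-devel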
# yum -y install php-devel
# cd /opt && tar -zxf php-5.4.16.tar.gz
# cd php-5.4.16/ext/bcmath/
# which phpize
/usr/bin/phpize
# find / -name php-config
/usr/bin/php-config
# /usr/bin/phpize
# ./configure --with-php-config=/usr/bin/php-config
# make && make install
# ll /usr/lib64/php/modules/
# ll /opt/php-5.4.16/ext/bcmath/modules
# cd ../mbstring/
# /usr/bin/phpize
# ./configure --with-php-config=/usr/bin/php-config
# make && make install
# systemctl restart httpd

If Fail still appears, specify the library locations directly and restart the httpd service

# vim /etc/php.ini
extension=/usr/lib64/php/modules/bcmath.so
extension=/usr/lib64/php/modules/mbstring.so

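3. Configure the MySQL connection

Test connection # OK means the test passed

If the test fails because mysql.sock cannot be found, the cause is that zabbix_server looks for mysql.sock under /var/lib/mysql/ by default. The fix is to create that path and a symbolic link to the socket, as follows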

# mkdir /var/lib/mysql
# ln -s /tmp/mysql.sock /var/lib/mysql/mysql.sock
# chown -R mysql:mysql /var/lib/mysql
# vi /etc/php.ini
mysql.default_socket = /var/lib/mysql/mysql.sock
# systemctl restart httpd

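Also change Database host to 127.0.0.1

Steps 4 and 5: just click Next

Step 6: click Finish (if the wizard reports that it cannot create the configuration file, download the zabbix.conf.php it offers and upload it to /var/www/html/conf/ on the server)

Final login user/password: Admin/zabbix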

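5.4. Fix Chinese display and garbled characters

1. Enabling Chinese

Open the server's web interface and click Profile in the top right corner. If Chinese is listed under Language, select it and save, and the web interface will display in Chinese. If there is no Chinese option, make the following change.

# vim /var/www/html/include/locales.inc.php
'zh_CN' => array('name' => _('Chinese (zh_CN)'), 'display' => false),
# change to
'zh_CN' => array('name' => _('Chinese (zh_CN)'), 'display' => true),

Restart the zabbix_server service: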

# service zabbix_server restart
# service zabbix_agentd restart

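2. Garbled Chinese: some fonts render as garbled characters in graphs and similar screens

Upload C:\Windows\Fonts\simkai.ttf from your local machine to /var/www/html/fonts/ on the server

# vim /var/www/html/include/defines.inc.php
define('ZBX_GRAPH_FONT_NAME', 'DejaVuSans');
# change to
define('ZBX_GRAPH_FONT_NAME', 'simkai');
Restart the zabbix_server service: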
# service zabbix_server restart

6. Upgrade Zabbix 2.4.8 to 3.0.31

6.1. Back up the Zabbix 2.4.8 files

# mkdir /opt/bak24/html -p
# service zabbix_agentd stop
# service zabbix_server stop
# mysqldump -uroot -p zabbix > /opt/bak24/zabbix.sql
# cp -r /usr/local/zabbix /opt/bak24
# cp /etc/php.ini /opt/bak24
# cp /etc/httpd/conf/httpd.conf /opt/bak24
# mv /var/www/html/* /opt/bak24/html/
# mv /etc/init.d/zabbix_agentd /opt/bak24
# mv /etc/init.d/zabbix_server /opt/bak24

6.2. Install Zabbix

1. Download:

# cd /opt/
# wget https://cdn.zabbix.com/zabbix/sources/stable/3.0/zabbix-3.0.31.tar.gz

2. Upload zabbix-3.0.31.tar.gz to the /opt directory on the server and extract it

# tar -zxvf zabbix-3.0.31.tar.gz

3. Compile and install zabbix_server

# cd /opt/zabbix-3.0.31
# ./configure --prefix=/usr/local/zabbix --enable-server --enable-agent --with-mysql=/var/lib/mysql/bin/mysql_config --with-net-snmp --with-libcurl --with-libxml2
# make && make install
# cd /opt/zabbix-3.0.31/misc/init.d/fedora/core
# cp zabbix_* /etc/init.d/
# chmod +x /etc/rc.d/init.d/zabbix_*
# vim /etc/rc.d/init.d/zabbix_server
BASEDIR=/usr/local/zabbix
# vim /etc/rc.d/init.d/zabbix_agentd
BASEDIR=/usr/local/zabbix
# chkconfig zabbix_server on
# chkconfig --add zabbix_server
# chkconfig zabbix_agentd on
# chkconfig --add zabbix_agentd
# cp /usr/local/zabbix/etc/zabbix_server.conf /opt/bak24/zabbix_server3.conf
# cp /opt/bak24/zabbix/etc/zabbix_server.conf /usr/local/zabbix/etc
# egrep -v "^#|^$" /usr/local/zabbix/etc/zabbix_server.conf

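# Note: if the database runs on a different host from zabbix_server, set DBHost to the database server's IP and comment out DBSocket; if they run on the same host, verify the actual DBSocket path.

LogFile=/tmp/zabbix_server.log
DBHost=localhost
DBName=zabbix
DBUser=zabbix
DBPassword=<password>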
DBSocket=/var/lib/mysql/mysql.sock
StartPollers=20
AlertScriptsPath=/usr/lib/zabbix/alertscripts
FpingLocation=/usr/local/sbin/fping

6.3. Web configuration

# cd /var/www/html/
# cp -R /opt/zabbix-3.0.31/frontends/php/* .
# cp /opt/bak24/httpd.conf /etc/httpd/conf
# chown -R apache:apache *
# cp /opt/bak24/php.ini /etc/
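# vim /etc/httpd/conf/httpd.conf
# Verify the configuration. Change the following line
#ServerName www.example.com:80
# to
ServerName localhost:80
# In the module section, make sure support for the matching PHP version is added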
DirectoryIndex index.html index.php
AddType application/x-httpd-php .php .php3 .php4 .php5
# systemctl restart httpd

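6.4. Log in to the Zabbix web setup interface

1. Open this address in a browser: http://10.10.10.181/setup.php

2. Check the system environment; every item must show OK before you can continue

Error during the environment check at install or upgrade time: ldap shows Warning

Fix: install the extension for the matching PHP version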

# cd /opt && tar -zxf php-5.4.16.tar.gz
# cd php-5.4.16/ext/ldap
# /usr/bin/phpize
# ./configure --with-php-config=/usr/bin/php-config && make && make install

ldap build error 1: configure: error: Cannot find ldap.h

Fix:

# yum -y install openldap openldap-devel

ldap build error 2: configure: error: Cannot find ldap libraries in /usr/lib

Fix:

# cp -frp /usr/lib64/libldap* /usr/lib/
# /usr/bin/phpize
# make clean && ./configure --with-php-config=/usr/bin/php-config && make && make install
# systemctl restart httpd

If the Warning still appears, specify the library location directly and restart the httpd service

# vim /etc/php.ini
extension=/usr/lib64/php/modules/ldap.so

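3. Configure the MySQL connection

Test connection # OK means the test passed

Error when connecting to the database during the upgrade: Cannot connect to the database.

The frontend does not match Zabbix database. Current database version (mandatory/optional): 2040000/2040000.

Required mandatory version: 3000000. Contact your system administrator.

Cause: the database version required by the new Zabbix does not match the current database version. Changing the version number resolves it.

Fix: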

# mysql -uroot -p
SQL> use zabbix;
SQL> update dbversion set mandatory=3000000;
SQL> flush privileges;

Steps 4 and 5: just click Next

Step 6: click Finish

Final login user/password: Admin/zabbix

Troubleshooting zabbix_server startup after the upgrade:

# service zabbix_server start
# tail -100f /var/log/messages
zabbix_server: Starting zabbix_server:
/usr/local/zabbix/sbin/zabbix_server: error while loading shared libraries: libmysqlclient.so.20: cannot open shared object file: No such file or directory
# find / -name 'libmysqlclient*'
/usr/lib64/mysql/libmysqlclient.so.18
/usr/lib64/mysql/libmysqlclient.so.18.0.0
/mysql/mysql/lib/libmysqlclient.a
/mysql/mysql/lib/libmysqlclient.so
/mysql/mysql/lib/libmysqlclient.so.20
/mysql/mysql/lib/libmysqlclient.so.20.3.15
# ln -s /mysql/mysql/lib/libmysqlclient.so.20 /usr/lib64
# tail -100f /tmp/zabbix_server.log
# Watch the zabbix_server log to troubleshoot upgrade problems
…………………
17120:20200610:181128.506 completed 98% of database upgrade
17120:20200610:181128.507 completed 99% of database upgrade
17120:20200610:181128.507 completed 100% of database upgrade
17120:20200610:181128.507 database upgrade fully completed
17120:20200610:181128.566 server #0 started [main process]
17128:20200610:181128.566 server #1 started [configuration syncer #1]
17129:20200610:181128.567 server #2 started [db watchdog #1]
17130:20200610:181128.567 server #3 started [poller #1]
…………………
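
Rather than watching the log by eye, the progress lines can be checked with a short script. The sketch below assumes the log format shown in the excerpt above (pid:date:time.ms message); upgrade_status is a hypothetical helper and relies only on the two message texts that appear in that excerpt.

```python
import re

# pid:YYYYMMDD:HHMMSS.mmm message, as in the log excerpt above
LINE_RE = re.compile(r"^\s*\d+:\d{8}:\d{6}\.\d{3}\s+(.*)$")
PERCENT_RE = re.compile(r"completed (\d+)% of database upgrade")

def upgrade_status(log_lines):
    # Return (fully_completed, last_percent_seen)
    completed = False
    last_percent = None
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        message = match.group(1).strip()
        percent = PERCENT_RE.fullmatch(message)
        if percent:
            last_percent = int(percent.group(1))
        elif message == "database upgrade fully completed":
            completed = True
    return completed, last_percent

sample_log = [
    "17120:20200610:181128.506 completed 98% of database upgrade",
    "17120:20200610:181128.507 completed 99% of database upgrade",
    "17120:20200610:181128.507 completed 100% of database upgrade",
    "17120:20200610:181128.507 database upgrade fully completed",
    "17120:20200610:181128.566 server #0 started [main process]",
]
print(upgrade_status(sample_log))
```

Passing the lines of /tmp/zabbix_server.log to upgrade_status gives a quick yes or no on whether the database upgrade finished, plus the last percentage reached if it did not.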

7. End


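Project overview

This project was built after studying https://gitee.com/nbsl/idCardCv and integrates tess4j. It works out of the box without training, although you can also train it before use. It removes the original requirement to install OpenCV: everything has been refactored with JavaCPP, which pulls in the required C++ libraries, so no OpenCV installation is needed. New features include front-end control of the recognition region and back-end validation after recognition. The page styling mainly targets the iPad. The recognition process has been rewritten: OpenCV handles image optimization and region selection, and tess4j recognizes the digits and the letter X. The back end crops the relevant image region according to the region defined in the stylesheet /idCardCv/src/main/resources/static/js/plugins/cropper/cropper.css

Problems encountered

1. java.lang.UnsatisfiedLinkError: C:\Users\Administrator\.javacpp\cache\opencv-3.4.3-1.4.3-windows-x86_64.jar\org\bytedeco\javacpp\windows-x86_64\jniopencv_core.dll: Can't find dependent libraries. In my case the cause was a missing C++ runtime; I added the 64-bit runtime installer to the project as img/vc_redist.x64.exe.

ID card number recognition

Request URL: http://localhost:8080/idCard/index. The project is based on the open-source OpenCV library, which means you can obtain the full source code and port it to every platform OpenCV supports. It is written in Java. Its recognition rate is fairly high: with a clear image, number detection and recognition accuracy is above 90%.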
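
The project description mentions back-end validation after recognition but does not show it. One common check for OCR output of this kind is the check digit of the 18-digit ID number defined in GB 11643-1999. The sketch below illustrates that standard checksum only; it is not taken from the project's source (which is Java), and is_valid_id_number is a hypothetical name.

```python
import re

# Weights for the first 17 digits and the check characters indexed by
# (weighted sum mod 11), as defined in GB 11643-1999
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"

def is_valid_id_number(id_number):
    id_number = id_number.strip().upper()
    # 17 digits followed by a digit or X
    if not re.fullmatch(r"[0-9]{17}[0-9X]", id_number):
        return False
    total = sum(int(d) * w for d, w in zip(id_number[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == id_number[17]
```

A recognized string that fails this check almost certainly contains a misread character, so it can be rejected or sent back for another recognition pass. Passing the check does not prove that the number belongs to a real person.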

Required Software

This version was tested on the following platform:

Windows 7 64-bit
JDK 1.8.0_45
JUnit 4
OpenCV 4.3
JavaCPP 1.5.3
tess4j 4.5.1
Tesseract 4.0.0

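Project updates

1. Uploading images as base64 was slow, so uploads now use the webuploader plugin with chunked transfer, which speeds things up on slow connections, especially in the iPad browser. The original page was renamed idcard_bak.html.

2. The save paths for test images in the original project were moved into the configuration file.

3. OpenCV was upgraded from 3.4.3 to 4.3

Project URL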

https://gitee.com/endlesshh/idCardCv
