Tame HTML with Java: Write a Small Crawler in Minutes

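This article is for readers with basic Java knowledge.

Author: HelloGitHub-秦人

This installment of HelloGitHub's 《讲解开源项目》 ("Explaining Open Source Projects") series introduces jsoup, an open-source Java framework for parsing web page elements that lets a program collect web page data automatically.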

Project source: https://github.com/jhy/jsoup

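1. Project Overview

jsoup is an HTML parser for Java. It can parse HTML fetched directly from a URL, and it offers a convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like selector operations.

Main features of jsoup:

Parse HTML from a URL, a file, or a string.
Find and extract data with DOM traversal or CSS selectors.
Manipulate HTML elements, attributes, and text.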

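2. Using the Framework

2.1 Prerequisites

Working knowledge of HTML syntax
Familiarity with the Chrome browser's debugging tools
Basic proficiency with the IntelliJ IDEA development tool

2.2 Studying the source

Import the project into IDEA, which automatically downloads the dependencies the Maven project needs.

Reading source code quickly is an essential skill for every programmer. My approach comes down to these points:

Read the project's README to learn quickly what the project does.
Skim the project's pom.xml to see which dependencies it pulls in.
Look over the project structure, the source directory, and the test directory; a good project has a clear structure and well-defined layers.
Run the test cases for a quick hands-on feel of the project.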

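2.3 Downloading the project

git clone https://github.com/jhy/jsoup

2.4 Running the project's example code

Following the steps above, we quickly find that the example directory holds the sample code, so let's run it directly. Note: some of the samples need small changes before they will run.

For example, jsoup's Wikipedia sample: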

public class Wikipedia {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
        log(doc.title());

        Elements newsHeadlines = doc.select("#mp-itn b a");
        for (Element headline : newsHeadlines) {
            log("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
        }
    }

    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}

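Note: the code above fetches the page http://en.wikipedia.org/, selects every element matching the selector #mp-itn b a, and prints each element's title attribute and the absolute URL of its href attribute. Wikipedia cannot be reached from mainland China, so there the code fails with an error.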
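
To make the absUrl("href") call a little more concrete, here is a rough JDK-only sketch that resolves a possibly relative href against the page's base URL with java.net.URI. It illustrates the idea only and is not jsoup's actual implementation; the names AbsUrlSketch and resolve are made up for this example, and URI.create rejects some malformed links that a real crawler would have to handle.

```java
import java.net.URI;

class AbsUrlSketch {
    // Resolve a possibly relative href against the base URL of the page it came from.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://en.wikipedia.org/";
        // A site-relative link is combined with the base URL.
        System.out.println(resolve(base, "/wiki/Java"));
        // A link that is already absolute is returned unchanged.
        System.out.println(resolve(base, "https://example.org/page"));
    }
}
```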

The modified, runnable version is:

public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://www.baidu.com/").get();
    Elements newsHeadlines = doc.select("a[href]");
    for (Element headline : newsHeadlines) {
        System.out.println("href: " +headline.absUrl("href") );
    }
}

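3. How It Works

jsoup's workflow is simple: you specify a URL, the framework sends an HTTP request and receives the response page, and you then pull data out of the page with selectors. The whole flow runs in three steps: send the request, filter the data, process the data.

Taking the example above: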

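3.1 Sending the request

Document doc = Jsoup.connect("https://www.baidu.com/").get();

This line sends the HTTP request and retrieves the page's response data.

3.2 Filtering the data

Elements newsHeadlines = doc.select("a[href]");

This defines a selector and retrieves the elements that match it.

3.3 Processing the data

for (Element headline : newsHeadlines) {
    System.out.println("href: " + headline.absUrl("href"));
}

Here the data is simply printed, but it could just as well be written to a file or a database.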
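
As one possible way to write the results to a file instead of printing them, the sketch below saves a list of links as a one-column CSV file using only the JDK. The class name LinkFileWriterSketch, its helper methods, and the sample links are assumptions for illustration; in the real program the list would hold the values returned by headline.absUrl("href").

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkFileWriterSketch {
    // Quote a CSV field and double any quotes embedded in it.
    static String csvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Write a header row followed by one quoted link per line; returns the number of links written.
    static int writeLinks(Path target, List<String> links) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add("href");
        for (String link : links) {
            lines.add(csvField(link));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
        return links.size();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data for the links collected from the page.
        List<String> links = Arrays.asList("https://www.baidu.com/", "https://news.baidu.com/");
        Path out = Files.createTempFile("links", ".csv");
        int written = writeLinks(out, links);
        System.out.println("Wrote " + written + " links to " + out);
    }
}
```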

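4. Hands-On Project

The goal is to collect basic information on every new book listed under Douban Books -> New Releases (豆瓣读书 -> 新书速递): the title, cover image link, author, synopsis (from the detail page), author bio (detail page), and the book's price on Dangdang (detail page). The collected data is then saved to an Excel file.

Target link: https://book.douban.com/latest?icn=index-latestbook-all

4.1 The project's pom.xml

The project pulls in three libraries: jsoup, lombok, and easyexcel.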

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>JsoupTest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.compiler.source>1.8</maven.compiler.source>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.12</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>easyexcel</artifactId>
            <version>2.2.6</version>
        </dependency>
    </dependencies>
</project>

4.2 Parsing the page data

public class BookInfoUtils {

    public static List<BookEntity> getBookInfoList(String url) throws IOException {
        List<BookEntity>  bookEntities=new ArrayList<>();
        Document doc = Jsoup.connect(url).get();
        Elements liDiv = doc.select("#content > div > div.article > ul > li");
        for (Element li : liDiv) {
            Elements urls = li.select("a[href]");
            Elements imgUrl = li.select("a > img");
            Elements bookName = li.select(" div > h2 > a");
            Elements starsCount = li.select(" div > p.rating > span.font-small.color-lightgray");
            Elements author = li.select("div > p.color-gray");
            Elements description = li.select(" div > p.detail");

            String bookDetailUrl = urls.get(0).attr("href");
            BookDetailInfo detailInfo = getDetailInfo(bookDetailUrl);
            BookEntity bookEntity = BookEntity.builder()
                    .detailPageUrl(bookDetailUrl)
                    .bookImgUrl(imgUrl.attr("src"))
                    .bookName(bookName.html())
                    .starsCount(starsCount.html())
                    .author(author.text())
                    .bookDetailInfo(detailInfo)
                    .description(description.html())
                    .build();
//            System.out.println(bookEntity);
            bookEntities.add(bookEntity);
        }
        return bookEntities;
    }
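    /**
     * Fetches a book's detail page and extracts the author and price.
     *
     * @param detailUrl URL of the book's detail page
     * @return the author name, author link, and price found on the page
     * @throws IOException if the detail page cannot be fetched
     */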
    public static BookDetailInfo getDetailInfo(String detailUrl)throws IOException{

        Document doc = Jsoup.connect(detailUrl).get();
        Elements content = doc.select("body");

        Elements price = content.select("#buyinfo-printed > ul.bs.current-version-list > li:nth-child(2) > div.cell.price-btn-wrapper > div.cell.impression_track_mod_buyinfo > div.cell.price-wrapper > a > span");
        Elements author = content.select("#info > span:nth-child(1) > a");
        BookDetailInfo bookDetailInfo = BookDetailInfo.builder()
                .author(author.html())
                .authorUrl(author.attr("href"))
                .price(price.html())
                .build();
        return bookDetailInfo;
    }
}

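The key here is finding the selector for each page element you want.

For example, in li.select("div > p.color-gray"), how do we know to use div > p.color-gray?

Chrome users have probably guessed already. Open Chrome DevTools, press Ctrl + Shift + C and pick an element, then right-click the highlighted HTML node and choose Copy -> Copy selector to get that element's selector.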

4.3 Saving the data to Excel

To make the data easier to browse, I save what jsoup scraped to an Excel file, using easyexcel to generate the file quickly.

Excel header definition

@Data
@Builder
public class ColumnData {

    @ExcelProperty("书名称")
    private String bookName;

    @ExcelProperty("评分")
    private String starsCount;

    @ExcelProperty("作者")
    private String author;

    @ExcelProperty("封面图片")
    private String bookImgUrl;

    @ExcelProperty("简介")
    private String description;

    @ExcelProperty("单价")
    private String price;
}

Generating the Excel file

public class EasyExcelUtils {

    public static void simpleWrite(List<BookEntity> bookEntityList) {
        String fileName = "D:\\devEnv\\JsoupTest\\bookList" + System.currentTimeMillis() + ".xlsx";
        EasyExcel.write(fileName, ColumnData.class).sheet("书本详情").doWrite(data(bookEntityList));
        System.out.println("excel文件生成完毕...");
    }
    private static List<ColumnData> data(List<BookEntity> bookEntityList) {
        List<ColumnData> list = new ArrayList<>();
        bookEntityList.forEach(b -> {
            ColumnData data = ColumnData.builder()
                    .bookName(b.getBookName())
                    .starsCount(b.getStarsCount())
                    .author(b.getBookDetailInfo().getAuthor())
                    .bookImgUrl(b.getBookImgUrl())
                    .description(b.getDescription())
                    .price(b.getBookDetailInfo().getPrice())
                    .build();
            list.add(data);
        });
        return list;
    }
}

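4.4 Final result

Running the program produces an Excel file with one row per book, using the column headers defined above.

That takes us from idea to practice, using jsoup's basic operations along the way.

Complete code: https://github.com/hellowHuaairen/JsoupTest

5. Closing

jsoup is a Java HTML parser library, yet it works quite handily as a simple crawler, doesn't it?

Why talk about crawlers? In the era of big data and artificial intelligence, data is what everything runs on. Those of us with some technical skill should also have a way of collecting data from the web. Tools such as Fiddler and webscraper can also capture the data you want.

By now you should have a feel for jsoup. Isn't programming fun? With the hands-on example above as a reference, there are plenty of sites to practice on. Feel free to share your results in the comments.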

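jsoup

After a page is fetched it has to be parsed. You can parse it with string-handling utilities or with regular expressions, or use a dedicated HTML parser such as jsoup.

jsoup is an HTML parser for Java that can parse a URL address or HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.

jsoup's main features:

1. Parse HTML from a URL, a file, or a string;

2. Find and extract data with DOM traversal or CSS selectors;

3. Manipulate HTML elements, attributes, and text.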

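Create a practice class

Parsing a URL

The first argument is the URL to visit; the second is the timeout for the request.

Use a tag selector to get the contents of the title tag, then print the result.

Reading a file

Prepare a simple HTML file, read it into a string, and parse it.

Traversing the document with DOM methods

Parse the file to obtain a Document object, then look elements up in four ways:

By id: retrieve the content of the element with a given id.
By tag: retrieve the content of a given tag.
By class: retrieve the content of the elements with a given class.
By attribute: retrieve the content of the elements that carry a given attribute.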
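
The four lookups might look roughly like the sketch below. It parses an in-memory string rather than a file so that it runs without any setup; the sample markup, the ids, class names, and attribute names in it, and the class name DomLookupSketch are all invented for illustration, since the tutorial's own HTML file is not shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DomLookupSketch {
    // Hypothetical sample page standing in for the tutorial's HTML file.
    static final String HTML =
            "<html><head><title>Demo page</title></head><body>"
            + "<div id=\"city_bj\">Beijing</div>"
            + "<span class=\"s_name\">Shanghai</span>"
            + "<a href=\"/gz\" abc=\"123\">Guangzhou</a>"
            + "</body></html>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // By tag: the text of the first <title> element.
        System.out.println(doc.getElementsByTag("title").first().text());
        // By id: the element whose id is city_bj.
        System.out.println(doc.getElementById("city_bj").text());
        // By class: the first element with class s_name.
        System.out.println(doc.getElementsByClass("s_name").first().text());
        // By attribute name: the first element that has an abc attribute.
        System.out.println(doc.getElementsByAttribute("abc").first().text());
        // By attribute name and value: the first element with href="/gz".
        System.out.println(doc.getElementsByAttributeValue("href", "/gz").first().text());
    }
}
```

Each getElementsBy... call returns a collection, so first() is used here to pick a single element; on a real page it is worth checking for an empty result before calling text().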

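Next, extract data from an element:

Get the id from the element.
Get the className from the element; an element may carry two classes, in which case both can be retrieved.
Get a single attribute from the element.
Get all of the element's attributes.
Get the element's text content, as in the earlier lookups.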
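
A minimal sketch of these reads, again on made-up markup: the element, its id, class names, and data-role attribute are assumptions for illustration, as is the class name ElementDataSketch.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

class ElementDataSketch {
    // Hypothetical element with an id, two classes, and one extra attribute.
    static final String HTML =
            "<div id=\"test\" class=\"class_a class_b\" data-role=\"demo\">Hello jsoup</div>";

    static Element sample() {
        return Jsoup.parse(HTML).getElementById("test");
    }

    public static void main(String[] args) {
        Element el = sample();

        // The element's id.
        System.out.println(el.id());
        // The raw class attribute, both classes in one string.
        System.out.println(el.className());
        // The classes as a set of separate names.
        System.out.println(el.classNames());
        // A single attribute value by name.
        System.out.println(el.attr("data-role"));
        // Every attribute on the element as key/value pairs.
        for (Attribute attribute : el.attributes()) {
            System.out.println(attribute.getKey() + " = " + attribute.getValue());
        }
        // The element's text content.
        System.out.println(el.text());
    }
}
```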

Adding dependencies

        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <!-- MySQL connector -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.49</version>
        </dependency>
        <!-- HttpClient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
        </dependency>
        <!--Jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
             <version>1.15.2</version>
        </dependency>
         <!--lombok-->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>

Configuring application.properties

# MySQL configuration
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/demo?useUnicode=true&characterEncoding=utf8
spring.datasource.username=root
spring.datasource.password=123456


# JPA configuration
spring.jpa.database=MySQL
spring.jpa.show-sql=true
spring.jpa.generate-ddl=true
spring.jpa.hibernate.ddl-auto=update
spring.jpa.hibernate.naming_strategy=org.hibernate.cfg.ImprovedNamingStrategy


POJO

@Entity
@Table(name = "item")
@Data
public class Item {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
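    // SPU (standard product unit)
    private Long spu;
    // SKU (stock keeping unit)
    private Long sku;
    // Product title
    private String title;
    // Product price
    private Double price;
    // Product image
    private String pic;
    // Product detail page URL
    private String url;
    // Shop name
    private String shop;
    // Creation time
    private Date created;
    // Last update time
    private Date updated;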
    private Date updated;
}

Dao

public interface ItemDao extends JpaRepository<Item,Long> {
}

Service

public interface ItemService {

    /**
     * Saves an item.
     *
     * @param item the item to save
     */
    void save(Item item);

    /**
     * Deletes all items.
     */
    void deleteAll();
}


@Service
public class ItemServiceImpl implements ItemService {

    @Autowired
    private ItemDao itemDao;

    @Override
    @Transactional
    public void save(Item item) {
        this.itemDao.save(item);
    }

    @Override
    public void deleteAll() {
        this.itemDao.deleteAll();
    }
}

Wrapping HttpClient

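import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

@Component
public class HttpUtils {

    // Directory where downloaded images are saved
    private static final String FILEPATH = "D:\\demo\\";

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // Maximum number of connections in total
        this.cm.setMaxTotal(100);
        // Maximum number of connections per host
        this.cm.setDefaultMaxPerRoute(10);
    }

    /**
     * Downloads the page at the given address.
     *
     * @param url the address to request
     * @return the page content, or an empty string on failure
     */
    public String doGetHtml(String url) {
        // Get an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // Create the GET request for the URL
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36");
        // Apply the timeout settings
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Send the request and get the response
            response = httpClient.execute(httpGet);
            // Only a 200 response with a body is treated as success
            if (response.getStatusLine().getStatusCode() == 200) {
                if (response.getEntity() != null) {
                    return EntityUtils.toString(response.getEntity(), "UTF-8");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        // Return an empty string on failure
        return "";
    }


    /**
     * Downloads an image.
     *
     * @param url the image address
     * @return the saved image's file name, or an empty string on failure
     */
    public String doGetImage(String url) {
        // Get an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // Create the GET request for the URL
        HttpGet httpGet = new HttpGet(url);
        // Apply the timeout settings
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Send the request and get the response
            response = httpClient.execute(httpGet);
            // Only a 200 response with a body is treated as success
            if (response.getStatusLine().getStatusCode() == 200) {
                if (response.getEntity() != null) {
                    // Take the image's file extension from the URL
                    String extName = url.substring(url.lastIndexOf("."));
                    // Rename the image with a random UUID
                    String picName = UUID.randomUUID() + extName;
                    // Write the image to disk; try-with-resources closes the stream
                    try (OutputStream outputStream = new FileOutputStream(new File(FILEPATH + picName))) {
                        response.getEntity().writeTo(outputStream);
                    }
                    // Return the image's file name
                    return picName;
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        // Return an empty string if the download failed
        return "";
    }

    /**
     * Builds the request configuration.
     *
     * @return the timeout settings applied to every request
     */
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                // Maximum time to establish a connection (ms)
                .setConnectTimeout(1000)
                // Maximum time to obtain a connection from the pool (ms)
                .setConnectionRequestTimeout(500)
                // Maximum time to wait for data (ms)
                .setSocketTimeout(10000)
                .build();

        return config;
    }
}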
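
One detail in doGetImage worth a second look is the file extension: taking everything after the last "." in the raw URL picks up any query string, and misfires when the path has no extension at all. The JDK-only sketch below shows one possible alternative that reads the extension from the URL's path; the class and method names (ImageNameSketch, extensionOf, randomName), the ".jpg" fallback, and the example URLs are assumptions, not part of the original code.

```java
import java.net.URI;
import java.util.UUID;

class ImageNameSketch {
    // Return the extension (with its leading dot) of the URL's path,
    // or the fallback when the last path segment has no extension.
    static String extensionOf(String url, String fallback) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return fallback;
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        boolean dotInLastSegment = dot > slash && dot < path.length() - 1;
        return dotInLastSegment ? path.substring(dot) : fallback;
    }

    // Build a random file name that keeps the image's extension.
    static String randomName(String url) {
        return UUID.randomUUID() + extensionOf(url, ".jpg");
    }

    public static void main(String[] args) {
        // Hypothetical image URLs.
        System.out.println(extensionOf("https://img.example.com/n7/abc.png?size=1.5", ".jpg"));
        System.out.println(extensionOf("https://img.example.com/n7/abc", ".jpg"));
        System.out.println(randomName("https://img.example.com/n7/abc.png"));
    }
}
```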

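SPU and SKU

SPU

An SPU (Standard Product Unit) is the smallest unit for aggregating product information: a standardized, reusable, easily searchable set of information that describes a product's characteristics.

Products that share the same attribute values and characteristics can be called one SPU.

For example, a particular laptop model corresponds to one SPU, even though it comes in several configurations or colors.

SKU

An SKU (Stock Keeping Unit) is the unit for measuring stock in and out, and may be counted in pieces, boxes, pallets, and so on. An SKU is the smallest physically indivisible unit of inventory. How it is applied depends on the type of business and the management model.

For example, a laptop model comes in several configurations; the 8G+512G version of that laptop is one SKU.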
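
The one-to-many relationship between the two can be sketched in a few lines of plain Java: each scraped product carries an SPU id and an SKU id, and grouping by SPU collects the variants of one model. The ids and the names SpuSkuSketch and groupBySpu are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpuSkuSketch {
    // Group SKU ids under their SPU id; each input pair is {spu, sku}.
    static Map<Long, List<Long>> groupBySpu(long[][] pairs) {
        Map<Long, List<Long>> bySpu = new LinkedHashMap<>();
        for (long[] pair : pairs) {
            bySpu.computeIfAbsent(pair[0], key -> new ArrayList<>()).add(pair[1]);
        }
        return bySpu;
    }

    public static void main(String[] args) {
        // Hypothetical ids: one laptop model (SPU 1000) with two configurations,
        // and a second model (SPU 2000) with a single configuration.
        long[][] pairs = { {1000L, 1001L}, {1000L, 1002L}, {2000L, 2001L} };
        System.out.println(groupBySpu(pairs)); // prints {1000=[1001, 1002], 2000=[2001]}
    }
}
```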

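Crawl analysis

We crawl the laptop search results page. Paging through the results gives the paginated request URL, which ends in a page parameter to be filled in: https://search.jd.com/search?keyword=%E7%94%B5%E8%84%91&wq=%E7%94%B5%E8%84%91&pvid=56a110735c6c491c91416c194aed4c5b&cid3=672&cid2=671&s=56&click=0&page=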
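
Since the URL ends in page=, each request only needs a page number appended, and the crawl logic later in this article steps through odd page numbers only. A small sketch of that URL construction follows; the base URL is shortened for readability, and PageUrlSketch and pageUrls are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrlSketch {
    // Build request URLs for the first `count` result pages,
    // using the odd page numbers 1, 3, 5, ...
    static List<String> pageUrls(String baseUrl, int count) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            urls.add(baseUrl + (2 * i + 1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the full search URL; it must end with "page=".
        String base = "https://search.jd.com/search?keyword=%E7%94%B5%E8%84%91&page=";
        for (String url : pageUrls(base, 5)) {
            System.out.println(url);
        }
    }
}
```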

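All products sit inside a div identified by J_goodsList, matching the selector div#J_goodsList used in the code. Inside that div, a ul wraps a series of li tags, and each li holds one product's information.

Each li tag carries the product details we need: the data-spu and data-sku attributes, the image, the price, the title, and the shop name.

Crawl logic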

import java.util.Date;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private ItemService itemService;

    /**
     * Scheduled task that fetches the latest data.
     *
     * @throws Exception if the crawl fails
     */
    @Scheduled(fixedDelay = 50 * 1000)
    public void itemTask() throws Exception {
        // Clear the existing data before each run
        itemService.deleteAll();

        // Base address of the search results; the page number is appended
        String url = "https://search.jd.com/search?keyword=%E7%94%B5%E8%84%91&wq=%E7%94%B5%E8%84%91&pvid=56a110735c6c491c91416c194aed4c5b&cid3=672&cid2=671&s=56&click=0&page=";

        // Walk through the result pages; note that the page numbers are odd
        for (int i = 1; i < 10; i = i + 2) {
            String html = httpUtils.doGetHtml(url + i);
            // Parse the page, then extract and store the product data
            this.parse(html);
        }
        System.out.println("商品数据抓取完成！");
    }

    /**
     * Parses a results page, then extracts and stores the product data.
     *
     * @param html the page content
     */
    private void parse(String html) {
        // Parse the HTML into a Document
        Document doc = Jsoup.parse(html);
        // Select the product entries
        Elements spuEles = doc.select("div#J_goodsList > ul > li");

        // Loop over the entries in the list
        for (int i = 0; i < spuEles.size(); i++) {
            Element element = spuEles.get(i);
            // Read the SPU; skip entries that have none
            String strSpu = element.attr("data-spu");
            if (strSpu == null || strSpu.equals("")) {
                continue;
            }
            long spu = Long.parseLong(strSpu);
            // Read the SKU
            long sku = Long.parseLong(element.attr("data-sku"));

            Item item = new Item();
            // Set the product's SPU
            item.setSpu(spu);
            // Set the product's SKU
            item.setSku(sku);
            // Build the product detail page URL
            String itemUrl = "https://item.jd.com/" + sku + ".html";
            item.setUrl(itemUrl);

            // Download the product image
            String picUrl = "https:" + element.select("div.p-img").select("a").select("img").attr("data-lazy-img");
            String picName = this.httpUtils.doGetImage(picUrl);
            item.setPic(picName);

            // Product price; left unset when the page shows no price text,
            // since Double.parseDouble would throw on an empty string
            String strPrice = element.select("div.p-price").select("i").text();
            if (!strPrice.isEmpty()) {
                item.setPrice(Double.parseDouble(strPrice));
            }

            // Product title
            String title = element.select("div.p-name").select("a").attr("title");
            item.setTitle(title);

            // Shop name
            String shopName = element.select("div.p-shop a").text();
            item.setShop(shopName);

            item.setCreated(new Date());
            item.setUpdated(item.getCreated());

            // Save the product to the database
            this.itemService.save(item);
        }
    }
}

Configuring the application class

@SpringBootApplication
// Enable scheduled tasks
@EnableScheduling
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

Running the test

Start the project and let the scheduled task run, then check the database and the locally downloaded images.
