Scraping Non-Static Pages with Python: pyppeteer

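For work I recently needed to scrape some web data with Python, and the target pages are non-static. I weighed three options:

1. The first option can only scrape static pages.

2. Selenium requires downloading a browser driver that matches the browser version, and whenever the browser is upgraded the matching driver has to be downloaded again, so I passed on it.

3. pyppeteer, the library I finally chose, can launch a headless browser, open a page the way a real browser would, and load the page at a given URL.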

1. Install the libraries

# Headless browser automation
pip install pyppeteer
# HTML parsing
pip install beautifulsoup4

2. Demo: scraping a non-static page

# -*- coding: utf-8 -*-
# @Time: 2023/12/26
# @TODO: demo of scraping a non-static page
# @Author: wkq
import asyncio

from bs4 import BeautifulSoup
from pyppeteer import launch


async def fetch_page_content(url):
    """
    Load a page in a headless browser and return its rendered HTML.
    :param url: URL of the non-static page
    :return: the page's HTML content
    """
    # Launch a headless browser
    browser = await launch(headless=True)
    try:
        # Open a new page (tab)
        page = await browser.newPage()
        # Go to the target URL and wait until the network is nearly idle
        await page.goto(url, {'waitUntil': 'networkidle2'})
        # Get the rendered HTML of the page
        return await page.content()
    finally:
        # Close the browser even if loading fails
        await browser.close()


async def main():
    """
    Load the page asynchronously, then parse it.
    :return: None
    """
    # URL of the non-static page
    url = "https://www.sporttery.cn/jczx/jclq/ssjx/20231225/10039058.html"
    html_content = await fetch_page_content(url)
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Print the whole page's HTML
    print(soup.prettify())


if __name__ == "__main__":
    # Run the async entry point on an event loop
    asyncio.run(main())

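How to Open a File in JavaScript

The core ways to open a file in JavaScript are reading local files with the File API, fetching remote files with the fetch API, and handling server-side files with Node.js's fs module. We start with the first of these, reading local files with the File API, and then cover the others.

I. Reading local files with the File API

1. File picker

To read a local file with the File API, the user must first choose a file. An HTML <input type="file"> element creates a file picker; the code in the next step assumes it has the id fileInput.

2. Reading the file content

Next, listen for the file-selection (change) event and read the file content with FileReader.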

document.getElementById('fileInput').addEventListener('change', function(event) {
    const file = event.target.files[0];
    if (file) {
        const reader = new FileReader();
        reader.onload = function(e) {
            console.log(e.target.result); // content of the file
        };
        reader.readAsText(file); // read the file as text
    }
});

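The FileReader object provides several methods for reading file content, such as readAsText, readAsArrayBuffer, and readAsDataURL. Of these, readAsText is the one most commonly used for text files.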
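As an alternative to the callback style above, Blob objects also expose promise-returning text() and arrayBuffer() methods, and File inherits from Blob. The following is a minimal sketch rather than the article's own method: readBlob and the shape of its return value are hypothetical names chosen for illustration, and the in-memory Blob stands in for a user-selected file (a global Blob is assumed, as in current browsers and Node.js 18+).

```javascript
// Sketch: read a File/Blob with promise-based methods instead of FileReader callbacks.
// `readBlob` and its return shape are hypothetical, for illustration only.
async function readBlob(blob) {
    const text = await blob.text();            // content decoded as UTF-8 text
    const buffer = await blob.arrayBuffer();   // the same content as raw bytes
    return { text, byteLength: buffer.byteLength };
}

// Demo with an in-memory Blob; in a page you would pass event.target.files[0].
const sample = new Blob(['Hello, world!'], { type: 'text/plain' });
readBlob(sample).then(result => {
    console.log(result.text, result.byteLength);
});
```

FileReader remains the right tool when you need a data URL (readAsDataURL) or progress events.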

II. Fetching remote files with the fetch API

1. Basic usage

The fetch API can retrieve the content of a file on a remote server. Here is an example:

fetch('https://example.com/file.txt')
    .then(response => response.text())
    .then(data => {
        console.log(data); // content of the remote file
    })
    .catch(error => {
        console.error('Error fetching the file:', error);
    });

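2. Error handling

Error handling matters when using the fetch API. The promise returned by fetch rejects only on network-level failures, not on HTTP error statuses such as 404, so check response.ok yourself and throw; the catch method then handles both kinds of failure.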

fetch('https://example.com/file.txt')
    .then(response => {
        if (!response.ok) {
            throw new Error('Network response was not ok');
        }
        return response.text();
    })
    .then(data => {
        console.log(data);
    })
    .catch(error => {
        console.error('There was a problem with the fetch operation:', error);
    });
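The same check can be written with async/await and wrapped in a reusable function. This is a sketch under stated assumptions: fetchText is a hypothetical helper name, and its second parameter fetchImpl is an assumption added so that a stub can replace the global fetch in tests.

```javascript
// Sketch: async/await version of the response.ok check.
// `fetchText` is a hypothetical helper; `fetchImpl` defaults to the global fetch
// and can be swapped for a stub when testing.
async function fetchText(url, fetchImpl = fetch) {
    const response = await fetchImpl(url);
    if (!response.ok) {
        // HTTP errors such as 404 or 500 do not reject on their own, so throw here.
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.text();
}

// Usage (not executed here, since it needs network access):
// fetchText('https://example.com/file.txt')
//     .then(data => console.log(data))
//     .catch(error => console.error('There was a problem with the fetch operation:', error));
```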

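III. Handling server-side files with Node.js's fs module

1. Importing the fs module

In a Node.js environment you can read files with the built-in fs module. First import it:

const fs = require('fs');

2. Reading the file content

Use the fs.readFile method to read a file asynchronously: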

fs.readFile('path/to/file.txt', 'utf8', (err, data) => {
    if (err) {
        console.error('Error reading the file:', err);
        return;
    }
    console.log(data); // file content
});

3. Reading a file synchronously

If you need to read a file synchronously, use the fs.readFileSync method:

try {
    const data = fs.readFileSync('path/to/file.txt', 'utf8');
    console.log(data); // file content
} catch (err) {
    console.error('Error reading the file:', err);
}
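Besides the callback and synchronous forms, the fs module also has a promise-based API, fs.promises, which works with async/await and does not block the event loop. The sketch below is illustrative only: readTextFile, demo and the temporary file name are hypothetical, and the demo writes its own file so that it can run anywhere.

```javascript
// Sketch: promise-based file reading with fs.promises and async/await.
// `readTextFile`, `demo` and the temp file name are hypothetical.
const fs = require('fs');
const os = require('os');
const path = require('path');

async function readTextFile(filePath) {
    try {
        return await fs.promises.readFile(filePath, 'utf8');
    } catch (err) {
        console.error('Error reading the file:', err.message);
        throw err; // let the caller decide how to recover
    }
}

// Demo: write a small temp file, read it back, then clean up.
async function demo() {
    const tmpFile = path.join(os.tmpdir(), `fs-promises-demo-${process.pid}.txt`);
    await fs.promises.writeFile(tmpFile, 'Hello, world!', 'utf8');
    try {
        console.log(await readTextFile(tmpFile));
    } finally {
        await fs.promises.unlink(tmpFile);
    }
}

demo();
```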

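IV. Creating a file in the browser with a Blob object

1. Creating the Blob object

In the browser you can also create a file from a Blob object and generate a download link for it with URL.createObjectURL: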

const data = new Blob(['Hello, world!'], { type: 'text/plain' });
const url = URL.createObjectURL(data);
const a = document.createElement('a');
a.href = url;
a.download = 'hello.txt';
document.body.appendChild(a);
a.click();
document.body.removeChild(a);

2. Revoking the object URL

Once you are done with the object URL, call URL.revokeObjectURL to release the memory:

URL.revokeObjectURL(url);

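V. Handling files with third-party libraries

1. FileSaver.js

FileSaver.js is a commonly used library for saving files in the browser. Here is an example: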

import { saveAs } from 'file-saver';
const data = new Blob(['Hello, world!'], { type: 'text/plain' });
saveAs(data, 'hello.txt');

2. JSZip

JSZip is another commonly used library, for creating and extracting ZIP files in the browser. Here is an example:

import JSZip from 'jszip';
import { saveAs } from 'file-saver';
const zip = new JSZip();
zip.file('hello.txt', 'Hello, world!');
zip.generateAsync({ type: 'blob' }).then(function(content) {
    saveAs(content, 'example.zip');
});

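VI. Security and user experience of file operations

1. File permissions

For security reasons, code running in the browser can only access files the user has explicitly selected; it cannot access the user's file system directly.

2. User experience

To improve the user experience, drive file operations through a file picker and give clear guidance and error messages. For example: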

document.getElementById('fileInput').addEventListener('change', function(event) {
    const file = event.target.files[0];
    if (file) {
        const reader = new FileReader();
        reader.onload = function(e) {
            console.log(e.target.result);
        };
        reader.onerror = function(e) {
            console.error('Error reading the file:', e);
            alert('Failed to read the file. Please try again.');
        };
        reader.readAsText(file);
    } else {
        alert('No file selected. Please choose a file.');
    }
});
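Clear error messages can also come before any reading happens, by checking the selected file's size and type. The following is a minimal sketch, not part of the original handler: validateFile, the 1 MiB limit and the list of allowed MIME types are all assumptions chosen for illustration. It relies only on the name, size and type properties that File objects expose.

```javascript
// Sketch: validate a user-selected file before handing it to FileReader.
// `validateFile`, the 1 MiB limit and the allowed MIME types are assumptions.
const DEFAULT_RULES = {
    maxBytes: 1024 * 1024,
    allowedTypes: ['text/plain', 'application/json'],
};

// Returns an error message for the user, or null when the file is acceptable.
function validateFile(file, rules = DEFAULT_RULES) {
    if (!file) {
        return 'No file selected. Please choose a file.';
    }
    if (file.size > rules.maxBytes) {
        return `"${file.name}" is too large (limit: ${rules.maxBytes} bytes).`;
    }
    if (!rules.allowedTypes.includes(file.type)) {
        return `"${file.name}" has an unsupported type: ${file.type || 'unknown'}.`;
    }
    return null;
}

// Plain objects stand in for File here so the sketch also runs outside a browser.
console.log(validateFile({ name: 'notes.txt', size: 120, type: 'text/plain' }));
console.log(validateFile({ name: 'photo.png', size: 120, type: 'image/png' }));
```

In the change handler, one option is to call validateFile(event.target.files[0]) first and show the returned message instead of calling reader.readAsText when it is not null.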

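VII. Integration with project management systems

In R&D project management, file operations often involve managing documents, configuration files, and data files. Two kinds of system can make this more efficient:

1. R&D project management system

A dedicated R&D project management system provides comprehensive file management. Integrating it gives you centralized document management and version control, which improves team collaboration.

2. General-purpose project collaboration software

General-purpose project collaboration software supports file sharing and management, making it easy for team members to share files and work together.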

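Summary

Opening a file in JavaScript can be done in several ways, including reading local files with the File API, fetching remote files with the fetch API, and handling server-side files with Node.js's fs module. Each method has its own use cases and strengths. In practice, choose the method that fits your needs, and pair it with a project management system and collaboration software to make file operations more efficient and secure.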
