Beautiful Soup for Web Scraping: Searching the Document

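Introduction to Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML or XML files. It offers simple ways to handle tedious work such as navigating, searching, and modifying a document. Because it is so easy to use, Beautiful Soup can save you a good deal of working time.

The previous article covered how to traverse the nodes of a document with Beautiful Soup. This article continues by showing how to search a document with Beautiful Soup to find the content you want.

Searching the document with Beautiful Soup

To keep the story moving, we keep using the same HTML text as before. All the examples below are based on it.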

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>index</title></head>
<body>
<p class="title"><b>首页</b></p>
<p class="main">我常用的网站
<a href="https://www.google.com" class="website" id="google">Google</a>
<a href="https://www.baidu.com" class="website" id="baidu">Baidu</a>
<a href="https://cn.bing.com" class="website" id="bing">Bing</a>
</p>
<div><!--这是注释内容--></div>
<p class="content1">...</p>
<p class="content2">...</p>
</body>
"""
soup = BeautifulSoup(html_doc, "lxml")

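Filters

Before getting into searching proper, we need to understand Beautiful Soup's filters. They show up throughout the search API and can be applied to a tag's name, its attributes, its strings, or any combination of these. That may sound convoluted, but a few examples make it clear.

1. Search by tag name. The example below finds every b tag in the document. If you pass a byte string, Beautiful Soup assumes it is UTF-8 encoded; passing a Unicode string (a plain str in Python 3) avoids decoding mistakes.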

# demo 1
tags = soup.find_all('b')
print(tags)

# Output
[<b>首页</b>]

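2. If you pass a regular expression, Beautiful Soup filters against it using its search() method, so an unanchored pattern can match anywhere in a tag name. The example below anchors the pattern with ^ to find the tags whose names start with b.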

# demo 2
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Output
body
b
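To see why search() semantics matter, the following sketch compares an anchored and an unanchored pattern. The trimmed-down markup, the built-in html.parser, and the helper name names_matching are assumptions made for this illustration only; they are not part of the article's html_doc setup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup used only for this sketch.
html = "<html><head><title>index</title></head><body><p><b>home</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

def names_matching(pattern):
    """Return the names of all tags whose name matches the regex pattern."""
    return [tag.name for tag in soup.find_all(re.compile(pattern))]

anchored = names_matching("^b")   # names that start with "b"
unanchored = names_matching("t")  # names that contain "t" anywhere
print(anchored)
print(unanchored)
```

With an unanchored pattern, names such as html and title match even though they do not start with t, which is what search() (rather than match()) implies.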

3. If you pass a list, Beautiful Soup returns everything that matches any one item in the list.

# demo 3
for tag in soup.find_all(['a', 'b']):
    print(tag)

# Output
<b>首页</b>
<a class="website" href="https://www.google.com" id="google">Google</a>
<a class="website" href="https://www.baidu.com" id="baidu">Baidu</a>
<a class="website" href="https://cn.bing.com" id="bing">Bing</a>

4. True matches any value. The example below finds every tag but returns no strings.

# demo 4
for tag in soup.find_all(True):
    print(tag.name, end=', ')
 
# Output
html, head, title, body, p, b, p, a, a, a, div, p, p,

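5. Functions. You can define a function that takes a single argument. If it returns True, the current element matches and is included in the results; if it returns False, the element is skipped. The example below finds every node that has both a class attribute and an id attribute.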

# demo 5
def has_id_class(tag):
    return tag.has_attr('id') and tag.has_attr('class')

tags = soup.find_all(has_id_class)
for tag in tags:
    print(tag)

# Output
<a class="website" href="https://www.google.com" id="google">Google</a>
<a class="website" href="https://www.baidu.com" id="baidu">Baidu</a>
<a class="website" href="https://cn.bing.com" id="bing">Bing</a>

Most of the time a string filter is enough. Add the remarkably flexible function filter and you can handle all kinds of custom requirements.

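The find_all() function

This function searches all descendants of the current node. Its signature is find_all(name, attrs, recursive, text, limit, **kwargs); since Beautiful Soup 4.4.0 the text argument is also available under the name string. Passing a tag name to find nodes was already shown above, so it is not repeated here. Let's look at a few other uses.

1. If you pass find_all() a keyword argument that is not one of its built-in parameter names, the search treats that argument as a filter on the attribute of the same name. The example below finds the node whose id is google.

When searching on a named attribute, the accepted values are a string, a regular expression, a list, a function, or True, that is, the filters introduced above.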

# demo 6
tags = soup.find_all(id='google')
print(tags[0]['href'])

for tag in soup.find_all(id=True):  # find every tag that has an id attribute
    print(tag['href'])

# Output
https://www.google.com
https://www.google.com
https://www.baidu.com
https://cn.bing.com
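Some attribute names, such as data-* attributes, are not valid Python keyword argument names. A minimal sketch of the usual workaround, passing a dictionary through attrs, follows; it also shows a regular expression filter on an attribute. The markup and the data-role attribute are hypothetical and exist only for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch only.
html = (
    '<div data-role="nav">menu</div>'
    '<div data-role="main">content</div>'
    '<a href="https://cn.bing.com" id="bing">Bing</a>'
)
soup = BeautifulSoup(html, "html.parser")

# data-role cannot be written as a keyword argument, so it goes through attrs.
nav = soup.find_all(attrs={"data-role": "nav"})
print(nav)

# A regular expression filter applied to the href attribute.
bing = soup.find_all(href=re.compile(r"bing"))
print([tag["id"] for tag in bing])
```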

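2. Search by CSS class name. The attribute name class is a reserved word in Python, so it cannot be used as a keyword argument. Since Beautiful Soup 4.1.2 you can use the class_ argument to search for tags with a given CSS class.

The class_ argument accepts the same kinds of filters: a string, a regular expression, a function, or True.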

# demo 7
tags = soup.find_all("a", class_="website")
for tag in tags:
    print(tag['href'])

def has_seven_characters(css_class):
    return css_class is not None and len(css_class) == 7

for tag in soup.find_all(class_=has_seven_characters):
    print(tag['id'])

# Output
https://www.google.com
https://www.baidu.com
https://cn.bing.com
google
baidu
bing

A single class attribute can also hold multiple values, so you can search on each CSS class separately.

# demo 8
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
tags = css_soup.find_all("p", class_="strikeout")
print(tags)

# Output
[<p class="body strikeout"></p>]
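One subtlety worth checking for yourself: class_ can also match the exact string value of the class attribute, but then the order of the class names matters. The sketch below, which uses the built-in html.parser as an assumption, contrasts that with a CSS selector, where order does not matter.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact string value of the class attribute, in document order.
exact = css_soup.find_all("p", class_="body strikeout")
# Same two classes written in the other order.
reordered = css_soup.find_all("p", class_="strikeout body")
# CSS selector requiring both classes, regardless of order.
selected = css_soup.select("p.strikeout.body")

print(len(exact), len(reordered), len(selected))
```

When a tag must carry several classes at once, the selector form is usually the less fragile choice.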

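3. Besides searching the whole document by tag and CSS class, you can search by content with text. The text argument can also be combined with other arguments in one search.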

# demo 9
tags = soup.find_all(text="Google")
print("google : ", tags)

tags = soup.find_all(text=["Baidu", "Bing"])
print("baidu & bing : ", tags)

tags = soup.find_all('a', text="Google")
print("a[text=google] : ", tags)

# Output
google :  ['Google']
baidu & bing :  ['Baidu', 'Bing']
a[text=google] :  [<a class="website" href="https://www.google.com" id="google">Google</a>]
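Content searches accept the same filters as name searches. The following sketch uses a regular expression and the newer string= spelling of the argument (assuming Beautiful Soup 4.4.0 or later); the shortened markup is an assumption for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened markup for this sketch only.
html = '<p><a id="google">Google</a><a id="baidu">Baidu</a><a id="bing">Bing</a></p>'
soup = BeautifulSoup(html, "html.parser")

# On its own, string= returns the matching strings themselves.
strings = soup.find_all(string=re.compile("^B"))
print(strings)

# Combined with a tag name, it returns the tags whose string matches.
tags = soup.find_all("a", string=re.compile("^B"))
print([tag["id"] for tag in tags])
```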

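4. Limit the number of results

Sometimes the document tree is very large and we do not want to search all of it. To stop after a given number of nodes, pass limit; to search only direct children and skip deeper descendants, pass recursive=False.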

# demo 10
tag = soup.find_all("a", limit=1)
print(tag)

tags = soup.find_all("p", recursive=False)
print(tags)

# Output
[<a class="website" href="https://www.google.com" id="google">Google</a>]
[]

None of the soup object's direct children is a p tag, so the result is an empty list.
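recursive=False becomes more useful when it is called on a tag that does have the wanted children. A minimal sketch, using hypothetical markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two p tags directly under body, one nested in a div.
html = "<html><body><p>one</p><div><p>nested</p></div><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# All descendants of body.
all_p = [p.get_text() for p in soup.body.find_all("p")]
# Only the direct children of body.
direct_p = [p.get_text() for p in soup.body.find_all("p", recursive=False)]
# The soup object's direct child is html, so nothing matches here.
top_level = soup.find_all("p", recursive=False)

print(all_p, direct_p, top_level)
```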

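The find() function

This function returns a single result and is equivalent to find_all(some_args, limit=1). The only difference is that find() returns the result directly, while find_all() returns a list containing it. When nothing matches, find_all() returns an empty list and find() returns None. Apart from that, the two are used the same way.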
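Because find() can return None, chained calls on its result need a guard. The sketch below illustrates the difference in return types; the markup and variable names are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch only.
html = '<p><a id="google">Google</a><a id="baidu">Baidu</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                 # a single Tag, or None
as_list = soup.find_all("a", limit=1)  # a list holding at most one Tag
missing = soup.find("table")           # nothing matches in this markup

print(first["id"], as_list[0]["id"], missing)

# Check for None before calling methods on the result of find().
caption = missing.get_text() if missing is not None else ""
```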

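Other functions

Besides find_all() and find(), Beautiful Soup has 10 more search APIs. Five of them take the same arguments as find_all() and the other five take the same arguments as find(); they differ only in which part of the document they search.

find_parents() and find_parent() search the parents (ancestors) of the current node.

find_next_siblings() and find_next_sibling() iterate over the siblings that come after the current node.

find_previous_siblings() and find_previous_sibling() iterate over the siblings that come before the current node.

find_all_next() and find_next() iterate over the tags and strings that come after the current node in the document.

find_all_previous() and find_previous() iterate over the tags and strings that come before the current node in the document.

Within each of the five pairs above, the first function returns a list of every node that matches, and the second returns only the first match.

These 10 APIs work much like find_all() and find(), so they are not walked through one by one here; readers can explore them on their own.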
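For readers who want a starting point, here is a small sketch that starts from one node and searches in each direction. The shortened markup and the built-in html.parser are assumptions for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup for this sketch only.
html = (
    '<p class="main">sites'
    '<a id="google">Google</a><a id="baidu">Baidu</a><a id="bing">Bing</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")
baidu = soup.find(id="baidu")

parent_class = baidu.find_parent("p")["class"]           # upwards
next_id = baidu.find_next_sibling("a")["id"]             # sideways, after
prev_id = baidu.find_previous_sibling("a")["id"]         # sideways, before
after_ids = [t["id"] for t in baidu.find_all_next("a")]  # later in document
before_ids = [t["id"] for t in baidu.find_all_previous("a")]  # earlier

print(parent_class, next_id, prev_id, after_ids, before_ids)
```

Note that class is a multi-valued attribute, so parent_class is a list rather than a plain string.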

CSS selectors

Pass a string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

1. Search down through the levels from a given tag.

# demo 11
tags = soup.select("body a")
for tag in tags:
    print(tag['href'])

# Output
https://www.google.com
https://www.baidu.com
https://cn.bing.com

2. Find the direct child tags of a given tag

# demo 12
tags = soup.select("p > a")
print(tags)

tags = soup.select("p > #google")
print(tags)

# Output
[<a class="website" href="https://www.google.com" id="google">Google</a>, <a class="website" href="https://www.baidu.com" id="baidu">Baidu</a>, <a class="website" href="https://cn.bing.com" id="bing">Bing</a>]
[<a class="website" href="https://www.google.com" id="google">Google</a>]

3. Search directly by CSS class name

# demo 13
tags = soup.select(".website")
for tag in tags:
    print(tag.string)

# Output
Google
Baidu
Bing

4. Search by a tag's id attribute

# demo 14
tags = soup.select("#google")
print(tags)

# Output
[<a class="website" href="https://www.google.com" id="google">Google</a>]

5. Search by attribute value

# demo 15
tags = soup.select('a[href="https://cn.bing.com"]')
print(tags)

# Output
[<a class="website" href="https://cn.bing.com" id="bing">Bing</a>]
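Two further selector features are often handy: select_one(), which returns only the first match (or None), and attribute prefix matching with ^=. The sketch below assumes a Beautiful Soup version that ships with the soupsieve selector backend (4.7 or later) and uses a shortened copy of the links from the sample document.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample links, for this sketch only.
html = (
    '<p class="main">'
    '<a href="https://www.google.com" class="website" id="google">Google</a>'
    '<a href="https://www.baidu.com" class="website" id="baidu">Baidu</a>'
    '<a href="https://cn.bing.com" class="website" id="bing">Bing</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("a.website")            # first match only
www = soup.select('a[href^="https://www"]')     # href starts with the prefix
nothing = soup.select_one("table")              # None when nothing matches

print(first["id"], [tag["id"] for tag in www], nothing)
```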

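Beautiful Soup summary

This chapter covered Beautiful Soup's document search operations. Knowing these APIs well lets us locate the nodes we want faster and more reliably. Don't be put off by the number of functions: mastering find_all() and find() is really all it takes, since the other APIs work in much the same way and come quickly with a little practice.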

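Hands-On: How to Get a Page's Keywords with JS

In the internet era, the importance of keywords goes without saying. Both SEO and website content planning depend on getting a page's keywords accurately. So how do you get a page's keywords with JS? Here is a method I have tried myself.

I. How getting page keywords with JS works

Before looking at how to get page keywords with JS, let's understand the principle. Search engines usually determine keywords from a page's title, description, content, and similar information. Getting keywords with JS means parsing the page source, extracting text from it, and analyzing and processing that text to arrive at the page's keywords.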

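II. Extracting keywords with a regular expression

Regular expressions are a common and effective way to extract a page's keywords. A regular expression can match specific characters or character combinations, which we then save and process as keywords.

The steps are as follows:

1. Get the page source

`document.documentElement.outerHTML` returns the source of the current page.

2. Match the keywords

A regular expression such as `/<meta\sname="keywords"\scontent="(.*?)">/` matches the keywords declared in the page's meta tag.

3. Extract the keywords

The `match()` method extracts what was matched and returns it in an array.

4. Process the keywords

Loop over the extracted keywords to clean them up, for example by trimming whitespace and converting to lowercase.

5. Display the keywords

Finally, show the processed keywords on the page for users to refer to and use.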

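III. Things to watch when getting page keywords with JS

In practice, keep the following points in mind:

1. Both the number and the quality of keywords matter; there should be neither too many nor too few. Generally, 3 to 5 keywords is about right.

2. Keywords should be closely related to the page content; avoid irrelevant or duplicate keywords.

3. The page's title, description, and content are also important signals that search engines use to determine keywords, so set and optimize them sensibly.

4. Getting page keywords with JS is only an auxiliary technique and cannot replace other SEO measures and strategies.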

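IV. Summary

With the steps above, we can easily get a page's keywords with JS. In practice there are many more details to attend to, and they need adjusting and optimizing case by case. I hope this article helps. Thanks for reading!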

V. Reference code

// Get the page source
var html = document.documentElement.outerHTML;
// Match the keywords meta tag
var regex = /<meta\sname="keywords"\scontent="(.*?)">/;
var matches = html.match(regex);
// Extract the keywords
var keywords = [];
if (matches && matches.length > 1) {
    keywords = matches[1].split(",");
}
// Clean up the keywords
for (var i = 0; i < keywords.length; i++) {
    keywords[i] = keywords[i].trim().toLowerCase();
}
// Display the keywords
console.log(keywords);

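That is the method I have tried myself for getting page keywords with JS. I hope it gives you some ideas and help. If you have other questions, leave a comment at any time and I will reply as soon as I can. Thanks!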

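Recently I had a requirement: highlight search keywords in a React project. This is fairly easy in a plain HTML file, but in a React project I was a bit lost, because React code rarely touches the DOM directly and I did not know where to start. In the end I got the effect working.

First, let's see how to manipulate the DOM in React. People online commonly suggest two approaches: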

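Approach 1: use a selector

1. Import react-dom
    import ReactDom from 'react-dom'
2. Give the React node an id, class name, or other identifier
    <span id='tip'></span>
3. Store the DOM element in a variable
    var span = document.getElementById('tip')
4. Change the DOM's properties through ReactDom's findDOMNode() method
    ReactDom.findDOMNode(span).style.color = 'red'

Approach 2: use the ref attribute

1. Set a ref attribute on the target element
    <span ref='tip'></span>
2. Change the element's properties through this.refs.<ref value>
    this.refs.tip.style.color = "red"

I went with the second approach (note that string refs are a legacy React API; newer code uses React.createRef() or useRef instead):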

import React from 'react';
import {  Input } from 'antd';

const { Search } = Input;

// Highlight demo
class Highlight extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      text:<p>writing to a TLS enabled socket, node::StreamBase::Write calls node::TLSWrap::DoWrite with a freshly allocated WriteWrap object as first argument. If the DoWrite method does not return an error, this object is passed back to the caller as part of a StreamWriteResult structure. This may be exploited to corrupt memory leading to a Denial of Service or potentially other exploits.
        HTTP Request Smuggling in nodejs Affected versions of Node.js allow two copies of a header field in a http request. For example, two Transfer-Encoding header fields. In this case Node.js identifies the first header field and ignores the second. This can lead to HTTP Request Smuggling (https://cwe.mitre.org/data/definitions/444.html).
        OpenSSL - EDIPARTYNAME NULL pointer de-reference (High) This is a vulnerability in OpenSSL which may be exploited through Node.js. You can read more about it in https://www.openssl.org/news/secadv/20201208.txt</p>
    };
  }

  findHighlight = (keyWord)=>{
    const str=keyWord.replace(/^(\s|\xA0)+|(\s|\xA0)+$/g, '');
    // eslint-disable-next-line react/no-string-refs
    const val= this.refs.tip.innerHTML;
    const content = this.searchdo(val, str);
    // eslint-disable-next-line react/no-string-refs
    this.refs.tip.innerHTML=content;
  };

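  searchdo=(content,keyWord)=>{
    // Drop empty tokens so an empty search does not match between every character.
    const keyWordArr = keyWord.split(/\s+/).filter(Boolean);
    let re;
    for(let n = 0; n < keyWordArr.length; n +=1) {
      // Escape regex metacharacters so the keyword is matched literally.
      re = new RegExp(keyWordArr[n].replace(/[.*+?^${}()|[\]\\]/g, '\\$&'),"gi");
      // $& re-inserts the matched text, so its original casing is preserved.
      // eslint-disable-next-line no-param-reassign
      content = content.replace(re,'<span style="color:#0f0;background-color:#ff0">$&</span>');
    }
    return content;
  };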

  render() {
    const { text} = this.state;
    return (
      <div>
        <Search
          placeholder="Enter text to search for"
          onSearch={value => this.findHighlight(value)}
          style={{ width: 200 }}
        />
        <br />
        <br />
        <div
          style={{
            border:"1px solid #ccc",
            borderRadius:"4px",
            padding:"5px"
          }}
          ref="tip"
        >
          {text}
        </div>
      </div>
    );
  }
}

export default Highlight;

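That gives the highlight effect, but it is only a first pass; a more complete feature needs further work of your own.

This is only one approach. I will implement the same effect another way (in more detail and better optimized). If this was useful to you, feel free to follow!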
