Front-End Talk 53: Component Basics

Share interests, spread happiness, broaden your knowledge, and leave something good behind!

Dear reader, this is LearningYard Academy (LearningYard新学苑).

Today we bring you Front-End Talk 53: Component Basics. Welcome!

Components are one of the most powerful features of Vue.js. A component can extend HTML elements and encapsulate reusable code. The component system lets us build large applications out of small, independent, reusable components, and the interface of almost any type of application can be abstracted as a tree of components.

Writing pages the traditional way: In traditional front-end development, each HTML document is paired with one or more CSS and JS files. The same parts, such as the site navigation or the page footer, tend to appear on many pages, so the same HTML structure, CSS styles, and JS files inevitably get shared or copied between them. If any one of those shared pieces is changed, every page that uses it changes with it, which tangles the dependencies and makes the project hard to maintain. In addition, every page is its own HTML document, so each new page adds another HTML document plus one or more CSS and JS files, and the number of files keeps growing. Because most pages are written as self-contained units, the code reuse rate is low.

Writing pages with components: Put simply, the component approach splits a complete web page into a set of components. Suppose a page consists of a header navigation bar, the main content, and a footer. We can make the header navigation one component and turn the remaining parts into components in the same way. The navigation component contains the HTML structure, CSS styles, and JS code needed to implement the navigation bar. Each component is responsible only for its own structure, style, and interaction, so components do not interfere with one another, and together they make up the complete page. Once a page is split into components, we can code component by component. The most obvious benefit is reuse: a part shared by several pages is written once as a component and then imported wherever it is needed.

Components and modularity:

Component: A component is a reusable Vue instance with a name. Inside a root Vue instance created with new Vue, we can use the component as a custom element.

Module: Code that belongs to the same function or business area is isolated (packaged) into an independent module that can run on its own. Modules can be divided at different granularities, such as by page or by function, and they sit in the business framework layer. Modules call one another through interfaces. The goal is to reduce coupling: instead of the main application depending on the modules directly, the main application depends on the interfaces and the modules depend on the interfaces.

That is all for today. If you have your own thoughts on this article, please leave a comment. See you tomorrow, and have a happy day!

Translation: Google Translate

This article is an original work of LearningYard Academy (LearningYard新学苑). If it infringes any rights, please contact us to have it removed.

Text & layout | 李仕阳

Review | 李焕

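Chapter 13: Unstructured Data Extraction

When crawling data, we need to parse pages and extract data from them.

Generally, what we crawl is the content of some website or application, and the goal is to extract the valuable parts. That content usually falls into two categories: unstructured data and structured data.

Unstructured data: the data comes first, and the structure is imposed afterwards.

Structured data: the structure comes first, and the data is filled into it.

Different kinds of data call for different processing methods.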
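As a rough illustration of the difference, the sketch below pulls a number out of free text with a regular expression and reads the same value from a JSON record by key. The sample strings and field names are made up for this example.

```python
import json
import re

# Unstructured: free text, so the structure has to be imposed by a pattern.
free_text = "The book Harry Potter costs 29.99 dollars."
price_from_text = float(re.search(r"\d+\.\d+", free_text).group())

# Structured: the record already carries its structure, so we read by key.
record = json.loads('{"title": "Harry Potter", "price": 29.99}')
price_from_record = record["price"]

print(price_from_text, price_from_record)
```

Both lines arrive at the same value, but only the first one depends on guessing how the text is laid out.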

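13.1 Regular Expressions

13.1.1 Why learn regular expressions

A crawler really has only four main steps:

1. Define the target (know which scope or website you are going to search)

2. Crawl (download all of the site's content)

3. Extract (discard the data that is of no use to us)

4. Process the data (store and use it the way we want)

The earlier examples skipped step 3, the "extract" step. What we download is the entire web page, which is large and messy, and most of it is of no interest to us, so we need to filter and match out the parts we need.

For filtering text or matching rules, the most powerful tool is the regular expression, an indispensable weapon in the world of Python crawlers.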

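13.1.2 What is a regular expression

A regular expression (also called a rule expression) is typically used to search for and replace text that fits a certain pattern (rule).

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses filtering logic to apply to other strings.

Given a regular expression and another string, we can:

1. Check whether the string satisfies the filtering logic of the regular expression ("matching");

2. Use the regular expression to pull the specific parts we want out of the text ("filtering").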
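A minimal sketch of these two uses, assuming a made-up "three digits, dash, four digits" pattern and sample strings chosen only for illustration:

```python
import re

phone_pattern = re.compile(r"\d{3}-\d{4}")

# 1. "Matching": does the whole string follow the rule?
whole_string_ok = phone_pattern.fullmatch("555-1234") is not None
sentence_ok = phone_pattern.fullmatch("call 555-1234") is not None

# 2. "Filtering": pull the parts that follow the rule out of a longer text.
found = phone_pattern.findall("call 555-1234 or 555-9876 after 6pm")

print(whole_string_ok, sentence_ok, found)
```

fullmatch is used for the first case because it requires the entire string to fit the pattern; findall is used for the second because it collects every fitting piece.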

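13.1.3 Regular expression matching rules

1. Character matching rules.

2. Predefined character sets (these can also be written inside a character set [...]).

3. Quantifiers (used after a character or after a group (...)).

4. Boundary matching.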
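The four groups of rules can be tried out with a few one-liners. This is only a sketch covering a handful of common tokens, and the sample text is invented for the example.

```python
import re

text = "Order 66 shipped to Room 101 on day 7"

# 1. Character matching: '.' is any character except newline, [...] is a set.
any_char = re.findall(r"R..m", text)
char_set = re.findall(r"[OR]\w+", text)

# 2. Predefined character sets: \d is a digit, \w a word character, \s whitespace.
digits = re.findall(r"\d", "a1 b22")

# 3. Quantifiers: + means one or more, {2,3} means two to three repetitions.
one_or_more = re.findall(r"\d+", text)
two_to_three = re.findall(r"\d{2,3}", text)

# 4. Boundaries: ^ start of string, $ end of string, \b word boundary.
first_word = re.findall(r"^\w+", text)
last_word = re.findall(r"\w+$", text)
whole_word = re.findall(r"\bday\b", "day by day, today")

print(any_char, char_set, digits, one_or_more, two_to_three)
print(first_word, last_word, whole_word)
```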

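13.1.4 Loading the regular expression module in Python 3

In Python, regular expressions are available through the built-in re module.

import re

One point needs special attention: regular expressions use the backslash (\) to escape special characters, and so do ordinary Python string literals. To keep the backslashes exactly as written, use a raw string by adding an r prefix.

Example:

import re

# Example 1: ordinary string, \t is interpreted as a tab
str1 = 'nihao\tinghai'
print(str1)

# Example 2: raw string, \t stays as a backslash followed by t
str2 = r'nihao\tinghai'
print(str2)

Output (the first line contains a tab character):

nihao   inghai
nihao\tinghai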

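13.1.5 The compile function

The compile function compiles a regular expression into a Pattern object, whose methods such as match() and search() then do the matching.

Syntax:

re.compile(pattern[, flags])

Parameters:

pattern: a regular expression in string form

flags: optional, selects the matching mode, such as ignoring case or multi-line mode. The available flags are:

re.I ignores case

re.L makes \w, \W, \b, \B depend on the current locale (in Python 3 it can only be used with bytes patterns)

re.M multi-line mode: ^ and $ also match at the start and end of each line

re.S makes '.' match any character including the newline (by default '.' does not match the newline)

re.U makes \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character database (this is already the default for str patterns in Python 3)

re.X verbose mode: whitespace in the pattern and comments after '#' are ignored, which improves readability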
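The flags are easiest to compare side by side. The following is a small sketch with an invented two-line sample string; flags can be combined with the | operator.

```python
import re

text = "Hello\nworld"

ignore_case = re.findall(r"hello", text, re.I)       # case is ignored
line_starts = re.findall(r"^\w+", text, re.M)        # ^ matches at every line
dot_default = re.search(r"Hello.world", text)        # '.' stops at the newline
dot_all = re.search(r"Hello.world", text, re.S)      # '.' also matches the newline

# re.X lets the pattern be spread over lines and commented.
verbose = re.compile(r"""
    \d{4}   # year
    -
    \d{2}   # month
""", re.X)

both = re.findall(r"^WORLD$", text, re.I | re.M)     # two flags combined

print(ignore_case, line_starts, dot_default, dot_all.group() == text)
print(verbose.match("2024-05").group(), both)
```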

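Example:

import re

pattern = re.compile(r'\d+')  # matches one or more digits

m = pattern.match('one12twothree34four')  # checks the beginning of the string: no match
print(m)

m = pattern.match('one12twothree34four', 2, 10)  # starts at 'e' (index 2): no match
print(m)

m = pattern.match('one12twothree34four', 3, 10)  # starts at '1' (index 3): matches
print(m)

Output (Python 3.7 and later; older versions print _sre.SRE_Match instead of re.Match):

None
None
<re.Match object; span=(3, 5), match='12'>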

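13.1.6 Regular expression objects

re.compile() returns a compiled pattern object (re.Pattern in Python 3; older documentation calls it RegexObject).

The matching methods return a match object (re.Match in Python 3; older documentation calls it MatchObject), which provides:

group() returns the string matched by the RE.

start() returns the position where the match starts.

end() returns the position where the match ends.

span() returns a tuple containing the (start, end) positions of the match.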

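13.1.7 Two ways to use the re module in Python 3

The first way: use the compile function

1. Use compile() to compile the regular expression string into a Pattern object.

2. Use the methods of the Pattern object to search the text and obtain the result, a Match object.

3. Use the attributes and methods of the Match object to get the information you need and carry on from there.

The compile function compiles a regular expression into a Pattern object. It is typically used like this:

import re

# Compile the regular expression into a Pattern object.
pattern = re.compile(r'\d+')

With the regular expression compiled into a Pattern object, we can use the methods of pattern to search text.

The most commonly used Pattern methods are:

match: looks for a match at the start position, single match

search: looks for a match at any position, single match

findall: finds all matches, returns a list

finditer: finds all matches, returns an iterator

split: splits the string, returns a list

sub: replaces matches

The second way: call the module-level functions such as re.search() and re.findall() directly, without compiling first.

Example: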

import re

old_url='http://www.jikexueyuan.com/course/android/?pageNum=2'

total_page=20

html="""

<head>

</head>

<body>

<div class='topic'> <a href="http://jikexueyuan.com/welcone.html">欢迎参加《听海的Python3接口自动化测试》

<ul>

<li><a href="http://jikexueyuan.com/1.html">这是第一条</a></li>

<li><a href="http://jikexueyuan.com/2.html">这是第二条</a></li>

<li><a href="http://jikexueyuan.com/3.html">这是第三条</a></li>

</ul>

</div>

</body>

</html>

"""

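# Task 1: extract the page title (left commented out: the sample html above has no <title> element)
# title = re.search('<title>(.*?)</title>', html, re.S).group(1)
# print(title)

# Task 2: extract the links
# links = re.findall('href="(.*?)">', html)
# print(links)

# Task 3: extract part of the text content
# u_text = re.findall('<ul>(.*?)</ul>', html, re.S)[0]
# texts = re.findall('">(.*?)</a>', u_text, re.S)
# for every_text in texts:
#     print(every_text)

# Task 4: use sub to generate the URL of each page
for i in range(2, total_page + 1):
    # No flag is passed here: the fourth positional argument of re.sub is count, not flags
    new_link = re.sub(r'pageNum=\d+', 'pageNum=%d' % i, old_url)
    print(new_link)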

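13.1.8 The match method of the re module

The match method looks for a match at the beginning of the string (or at a start position you specify). It is a single match: it returns as soon as one result is found rather than searching for all results. Its general form is:

match(string[, pos[, endpos]])

Here string is the string to match, and pos and endpos are optional start and end positions. When they are given, match only looks at that range and the match must begin exactly at pos; when they are omitted, match starts at the beginning of the string.

On success a Match object is returned; if nothing matches, None is returned.

Combined example:

import re

# Example 1
str1 = 'ting123hai456'
pattern = re.compile(r'\d+')  # matches one or more digits
m1 = pattern.match(str1)  # checks the beginning of the string: no match
print(m1)

# Example 2
str2 = 'ting123hai456'
m2 = pattern.match(str2, 3, 8)  # starts at 'g' (index 3): no match
print(m2)

# Example 3
str3 = 'ting123hai456'
m3 = pattern.match(str3, 4, 8)  # starts at '1' (index 4): matches
print(m3)  # a Match object
print(m3.group(0))
print(m3.start(0))
print(m3.end(0))
print(m3.span(0))

Output:

None
None
<re.Match object; span=(4, 7), match='123'>
123
4
7
(4, 7)

When the match succeeds, a Match object is returned, where:

group([group1, ...]) returns the string matched by one or more groups; to get the whole matched substring, use group() or group(0);

start([group]) returns the start position of the substring matched by the group within the whole string (the index of its first character); the argument defaults to 0;

end([group]) returns the end position of the substring matched by the group within the whole string (the index of its last character + 1); the argument defaults to 0;

span([group]) returns (start(group), end(group)).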

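re.I and re.S

1. re.I ignores case.

2. re.S makes '.' match every character, including the newline, so a pattern can span several lines of text.

Example 1: re.I ignores case.

import re

pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I ignores case
m = pattern.match('Welcome To Reptiles')
print(m)  # the match succeeds and a Match object is returned
print(m.group(0))  # the whole matched substring
print(m.span(0))  # the indexes of the whole matched substring
print(m.group(1))  # the substring matched by the first group
print(m.span(1))  # the indexes of the substring matched by the first group
print(m.group(2))  # the substring matched by the second group
print(m.span(2))  # the indexes of the substring matched by the second group
print(m.groups())  # equivalent to (m.group(1), m.group(2), ...)
print(m.group(3))  # the pattern has only two groups, so a third group does not exist

Output:

<re.Match object; span=(0, 10), match='Welcome To'>
Welcome To
(0, 10)
Welcome
(0, 7)
To
(8, 10)
('Welcome', 'To')
IndexError: no such group

re.S matters whenever the text to capture contains line breaks; the findall calls on HTML in this chapter pass it for that reason.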
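The effect of re.S can be seen with a small sketch. The HTML fragment is invented for the example and deliberately contains a line break inside the tag.

```python
import re

html_fragment = "<p>first line\nsecond line</p>"

# Without re.S, '.' cannot cross the newline, so nothing is captured.
without_flag = re.findall(r"<p>(.*?)</p>", html_fragment)

# With re.S, '.' also matches the newline and the whole paragraph is captured.
with_flag = re.findall(r"<p>(.*?)</p>", html_fragment, re.S)

print(without_flag)
print(with_flag)
```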

13.1.9 The search method of the re module

The search method looks for a match at any position in the string. It is also a single match: it returns as soon as one result is found rather than searching for all results. Its general form is:

search(string[, pos[, endpos]])

Here string is the string to match, and pos and endpos are optional start and end positions. When they are given, search only looks inside that range; when they are omitted, search scans the whole string and returns the first match it finds.

On success a Match object is returned; if nothing matches, None is returned.

Combined example:

import re

pattern = re.compile(r'\d+')

# Example 1
str1 = 'ting123hai456'
m1 = pattern.search(str1)  # searches anywhere in the string; match would find nothing here
print(m1)
print(m1.group())
print(m1.span())

# Example 2
str2 = 'ting123hai456'
m2 = pattern.search(str2, 4, 8)  # restrict the search to a range
print(m2)
print(m2.group())
print(m2.span())

# Example 3
m3 = pattern.search(str2, 7, 13)  # a range that starts after '123'
print(m3)
print(m3.group())
print(m3.span())

Output:

<re.Match object; span=(4, 7), match='123'>
123
(4, 7)
<re.Match object; span=(4, 7), match='123'>
123
(4, 7)
<re.Match object; span=(10, 13), match='456'>
456
(10, 13)

13.1.10 The findall method of the re module

Both match and search are single matches: they return as soon as one result is found. Most of the time, however, we need to search the whole string and get every match.

The findall method is used like this:

findall(string[, pos[, endpos]])

Here string is the string to match, and pos and endpos are optional start and end positions. When they are given, findall searches only that range; when they are omitted, findall searches the whole string.

findall returns all matching substrings as a list; if there is no match, it returns an empty list.

Combined example:

import re

pattern = re.compile('hel')  # find 'hel'

# Example 1
str1 = 'hello123hell world456hel'
m1 = pattern.findall(str1)
print(m1)

# Example 2
str2 = 'hello123hell world456hel'
m2 = pattern.findall(str2, 7, 14)
print(m2)

# Example 3
str3 = 'hello123hell world456hel'
m3 = pattern.findall(str3, 7, 25)
print(m3)

Output:

['hel', 'hel', 'hel']
['hel']
['hel', 'hel']

13.1.11 The finditer method of the re module

finditer behaves much like findall: it also searches the whole string and finds every match. The difference is that it returns an iterator that yields each result (a Match object) in order.

Example:

import re

pattern = re.compile(r'\d+')
m1 = pattern.finditer('hello 123456 789')
m2 = pattern.finditer('one1two2three3four4', 0, 10)

print(type(m1))
print(type(m2))

print('----- m1 ------')
for a1 in m1:  # a1 is a Match object
    print('matching string: {}, position: {}'.format(a1.group(), a1.span()))

print('----- m2 ------')
for a2 in m2:
    print('matching string: {}, position: {}'.format(a2.group(), a2.span()))

Output:

<class 'callable_iterator'>
<class 'callable_iterator'>
----- m1 ------
matching string: 123456, position: (6, 12)
matching string: 789, position: (13, 16)
----- m2 ------
matching string: 1, position: (3, 4)
matching string: 2, position: (7, 8)

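13.1.12 The split method

The split method splits the string at every substring that matches the pattern and returns the pieces as a list. It is used like this:

split(string[, maxsplit])

maxsplit sets the maximum number of splits; if it is omitted, the string is split at every match.

Example:

import re

p = re.compile(r'[\s\,\;]+')
print(p.split('a,b;; c d'))

Output:

['a', 'b', 'c', 'd']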
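The maxsplit argument is not exercised by the example above, so here is a short sketch that reuses the same pattern and input and limits the number of splits to two.

```python
import re

p = re.compile(r"[\s,;]+")

all_parts = p.split("a,b;; c d")              # split at every match
limited = p.split("a,b;; c d", maxsplit=2)    # stop after two splits

print(all_parts)
print(limited)
```

After the second split the rest of the string, including its remaining separator, is returned as the last list item.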

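13.1.13 The sub method

The sub method performs replacement. It is used like this:

sub(repl, string[, count])

repl can be either a string or a function:

If repl is a string, every matched substring is replaced with repl and the resulting string is returned. repl can also refer to groups with \id, \g<id>, or \g<name>; \0 is not available (use \g<0> for the whole match).

If repl is a function, it should accept a single argument (a Match object) and return the replacement string (group references inside the returned string are not expanded).

count sets the maximum number of replacements; if it is omitted, every match is replaced.

Example 1:

import re

p = re.compile('123(.*?)123')
s = '123asdfxxIxxxxLovexxded123'
f = p.sub('123456789', s)
print(f)

Output:

123456789

Example 2:

import re

p = re.compile(r'(\w+) (\w+)')  # \w matches letters, digits, and the underscore
s = 'hello 123, hello 456'

print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))  # reference the groups to swap the two words

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(func, s))
print(p.sub(func, s, 1))  # replace at most once

Output:

hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456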

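13.1.14 Greedy and non-greedy modes

Regular expression matching has two modes:

Greedy mode: provided the whole expression still matches, match as much as possible (the default for quantifiers such as *);

Non-greedy mode: provided the whole expression still matches, match as little as possible (a ? placed after the quantifier, as in *?).

Quantifiers in Python are greedy by default.

Combined example 1:

import re

# Example 1: greedy mode
s = 'abbbc'
p = re.compile('ab*')
f1 = p.findall(s)
print(f1)

# Example 2: non-greedy mode
s = 'abbbc'
p = re.compile('ab*?')
f2 = p.findall(s)
print(f2)

Output:

['abbb']
['a']

Explanation:

With the greedy quantifier, ab* matches abbb. The * matches as many b characters as possible, so every b after the a is included.

With the non-greedy quantifier, ab*? matches a. Although there is a *, the ? makes it match as few b characters as possible, so no b is included.

Combined example 2:

import re

html = "aa<div>test1</div>bb<div>test2</div>cc"

# Example 1: greedy mode
p = re.compile('<div>.*</div>')
f1 = p.findall(html)
print(f1)

# Example 2: non-greedy mode
p = re.compile('<div>.*?</div>')
f2 = p.findall(html)
print(f2)

Output:

['<div>test1</div>bb<div>test2</div>']
['<div>test1</div>', '<div>test2</div>']

Explanation:

Greedy pattern: <div>.*</div>

Result: <div>test1</div>bb<div>test2</div>

The whole expression could already succeed at the first "</div>", but because the quantifier is greedy the engine looks for the longest substring that still matches. It ends at the second "</div>", after which nothing further can match, so the result is "<div>test1</div>bb<div>test2</div>".

Non-greedy pattern: <div>.*?</div>

First result: <div>test1</div>

The non-greedy quantifier lets the expression succeed at the first "</div>" and stops there without trying to extend to the right, giving "<div>test1</div>". findall then continues after that match and finds "<div>test2</div>" in the same way.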

13.1.15 Crawler examples that use regular expressions

Now that we know how to extract data with regular expressions, we can filter the full page source that a crawler downloads. Here is an example.

Example 1: crawling the courses of Jike Xueyuan (极客学院)

Code:

import re
import requests


class spider(object):
    def __init__(self):
        print("Start crawling")

    def getsource(self, source):
        # Download one page and return its HTML text
        html = requests.get(source)
        return html.text

    def changepage(self, url, total_page):
        # Build the URLs from the page named in url up to total_page
        now_page = int(re.search(r'pageNum=(\d+)', url).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            # No flag is passed: the fourth positional argument of re.sub is count, not flags
            link = re.sub(r'pageNum=\d+', 'pageNum=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryclass(self, html):
        # Each course sits in its own <li id="..."> block
        everyclass = re.findall('<li id="(.*?)</li>', html, re.S)
        return everyclass

    def getinfo(self, eachclass):
        info = {}  # an empty dict for one course
        info['title'] = re.search('title="(.*?)" alt="', eachclass, re.S).group(1)
        info['content'] = re.findall(r'display: none;">[\s]*([\s\S]*?)[\s]*</p>', eachclass)[0]
        classlevel = re.findall('<em>(.*?)</em>', eachclass, re.S)
        info['classtime'] = classlevel[0]
        info['classlevel'] = classlevel[1]
        info['learnnum'] = re.search('"learn-number">(.*?)</em>', eachclass, re.S).group(1)
        return info

    def saveinfo(self, classinfo):
        # Append every record to info.txt; the with block closes the file
        with open('info.txt', 'a', encoding='utf-8') as f:
            for each in classinfo:
                f.write('title:' + each['title'] + '\n')
                f.write('content:' + each['content'] + '\n')
                f.write('classtime:' + each['classtime'] + '\n')
                f.write('classlevel:' + each['classlevel'] + '\n')
                f.write('learnnum:' + each['learnnum'] + '\n')


if __name__ == '__main__':
    classinfo = []  # an empty list for all courses
    url = 'http://www.jikexueyuan.com/course/?pageNum=1'  # the initial url
    jikespider = spider()  # create the spider instance
    all_links = jikespider.changepage(url, 20)  # the urls of pages 1 to 20
    for link in all_links:
        print("Processing page: " + link)
        html = jikespider.getsource(link)  # the html text of the page
        everyclass = jikespider.geteveryclass(html)  # the html of every course block on the page
        # print(everyclass)
        for each in everyclass:
            # print(each)
            info = jikespider.getinfo(each)  # title, content, classtime, classlevel, learnnum of one course
            classinfo.append(info)
    print(classinfo)
    jikespider.saveinfo(classinfo)
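The page-URL generation and the extraction step can be tried without any network access. The sketch below is a simplified, standalone variant: the function names change_page and parse_items and the sample markup are hypothetical, invented for this illustration, and do not reflect the real page structure of the site.

```python
import re


def change_page(url, total_page):
    """Return the URLs from the page named in url up to total_page."""
    now_page = int(re.search(r"pageNum=(\d+)", url).group(1))
    return [
        re.sub(r"pageNum=\d+", "pageNum=%d" % i, url)
        for i in range(now_page, total_page + 1)
    ]


def parse_items(html):
    """Extract title and learner count from each <li> block of the sample markup."""
    items = []
    for block in re.findall(r"<li id=.*?</li>", html, re.S):
        title = re.search(r'title="(.*?)"', block).group(1)
        learners = re.search(r'"learn-number">(.*?)</em>', block, re.S).group(1)
        items.append({"title": title, "learnnum": learners})
    return items


# Hypothetical markup standing in for a downloaded page.
sample_html = """
<ul>
<li id="c1"><img title="Intro to Python" alt="">
<em class="learn-number">120</em></li>
<li id="c2"><img title="Regex Basics" alt="">
<em class="learn-number">85</em></li>
</ul>
"""

pages = change_page("http://www.jikexueyuan.com/course/?pageNum=1", 3)
items = parse_items(sample_html)
print(pages)
print(items)
```

Keeping the parsing in plain functions that take a string makes it possible to check the patterns against saved HTML before pointing the crawler at a live site.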

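13.2 Introduction to XPath

Regular expressions are powerful, but their syntax is fairly complex and hard to master. There is another approach, XPath: we can first convert the HTML file into an XML document and then use XPath to find HTML nodes or elements.

13.2.1 What is XML

XML stands for EXtensible Markup Language.

XML is a markup language, much like HTML.

XML was designed to transport data, not to display it.

XML tags are not predefined; we define them ourselves.

XML is designed to be self-descriptive.

XML is a W3C recommendation.

13.2.2 The difference between XML and HTML

XML document example:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
  </book>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
  </book>
  <book>
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
  </book>
  <book>
    <title lang="en">Learning XML</title>
  </book>
</bookstore>

HTML DOM model example:

The HTML DOM defines a standard way of accessing and manipulating HTML documents and represents an HTML document as a tree structure.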

13.2.3 XML node relationships

1. Parent

Every element and attribute has one parent.

In the following simple XML example, the book element is the parent of the title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
</book>

2. Children

An element node may have zero, one, or more children.

In the following example, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
</book>

3. Siblings

Nodes that have the same parent.

In the following example, the title, author, year, and price elements are all siblings:

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
</book>

4. Ancestors

A node's parent, the parent's parent, and so on.

In the following example, the ancestors of the title element are the book element and the bookstore element:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
  </book>
</bookstore>

5. Descendants

A node's children, the children's children, and so on.

In the following example, the descendants of bookstore are the book, title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
  </book>
</bookstore>

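13.2.4 What is XPath

XPath (XML Path Language) is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.

W3School documentation: http://www.w3school.com.cn/xpath/index.asp

13.2.5 XPath development tools

1. The open-source XPath expression editor XMLQuire (works with XML files)

2. The Chrome extension XPath Helper

3. The Firefox extension XPath Checker

13.2.6 Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, or steps.

The most useful path expressions are:

Expression | Description
nodename | Selects the nodes with the given name.
/ | Selects from the root node.
// | Selects matching nodes anywhere in the document, regardless of their position.
. | Selects the current node.
.. | Selects the parent of the current node.
@ | Selects attributes.

Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always written inside square brackets.

Examples

The following table lists some path expressions with predicates and their results: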

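Path expression | Result
/bookstore/book[1] | Selects the first book element that is a child of bookstore.
/bookstore/book[last()] | Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3] | Selects the first two book elements that are children of bookstore.
//title[@lang] | Selects all title elements that have an attribute named lang.
//title[@lang='eng'] | Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00] | Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title | Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.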
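Several of these predicates can be tried with the standard library alone. The sketch below uses xml.etree.ElementTree, which implements only a subset of XPath: paths are written relative to the root element, and as far as I know numeric comparisons such as price>35.00 are not part of that subset, so that filter is done in Python instead. The book titles and prices are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not the document's bookstore file.
xml_text = """
<bookstore>
  <book><title lang="en">Book A</title><price>30.00</price></book>
  <book><title lang="en">Book B</title><price>29.99</price></book>
  <book><title>Book C</title><price>49.99</price></book>
</bookstore>
"""
root = ET.fromstring(xml_text)  # root is the <bookstore> element

first = [t.text for t in root.findall("book[1]/title")]
last = [t.text for t in root.findall("book[last()]/title")]
second_to_last = [t.text for t in root.findall("book[last()-1]/title")]
with_lang = [t.text for t in root.findall(".//title[@lang]")]
lang_en = [t.text for t in root.findall(".//title[@lang='en']")]

# Stand-in for /bookstore/book[price>35.00]/title, filtered in Python.
expensive = [
    b.findtext("title")
    for b in root.findall("book")
    if float(b.findtext("price")) > 35.00
]

print(first, last, second_to_last)
print(with_lang, lang_en, expensive)
```

A full XPath engine, such as the one in the third-party lxml package, would accept the expressions from the table directly.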

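Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements. The wildcards are * (matches any element node), @* (matches any attribute node), and node() (matches any node of any kind). For example, /bookstore/* selects all child elements of bookstore, and //title[@*] selects all title elements that have at least one attribute.

Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once. For example, //book/title | //book/price selects all title and all price elements of all book elements.

13.2.7 XPath operators

The following operators can be used in XPath expressions: | (union of two node sets), +, -, *, div, mod, =, !=, <, <=, >, >=, or, and.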

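Introduction

JavaScript is the cornerstone of web front-end development. Its power and flexibility show not only in dynamic page interaction but also in its ability to process data. Array traversal is one of the most common operations in JavaScript, and it plays a central role in solving algorithm problems. This article looks at the various ways to traverse arrays in JavaScript and uses concrete algorithm examples to help readers solve problems efficiently.

Technical overview

Array traversal methods

JavaScript offers several ways to traverse an array, each with its own characteristics and use cases:

for loop: the most traditional way, suitable for every situation.
forEach(): an array method introduced in ES5 that simplifies the traversal syntax.
map(): creates a new array by applying a mapping to each element of the original array.
filter(): filters the array and returns a new array of the elements that satisfy a condition.
reduce(): accumulates the array elements into a single value, commonly used for summing and merging.
some() and every(): check whether at least one element, or every element, satisfies a condition.

Code example

const numbers = [1, 2, 3, 4, 5];

// Traverse with a for loop
for (let i = 0; i < numbers.length; i++) {
    console.log(numbers[i]);
}

// Traverse with forEach
numbers.forEach(number => console.log(number));

// Create a new array with map
const doubled = numbers.map(number => number * 2);
console.log(doubled); // Output: [2, 4, 6, 8, 10]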

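Technical details

How it works

Every traversal method iterates over the elements of the array and runs some logic on each one. The methods differ in what they are for: map transforms, filter selects, and reduce aggregates.

Difficulties

Performance: modern JavaScript engines are heavily optimized, but with large data sets the choice of traversal method still affects performance.
Side effects: avoid unintended changes to the original array while traversing. map and filter return new arrays, but a callback that mutates the elements it receives still changes the original data.

Practical application

Scenario

Suppose an algorithm problem asks us to find all even numbers in an array and return the sum of their squares.

Code example

function sumOfSquaresEvenNumbers(numbers) {
    return numbers
        .filter(number => number % 2 === 0)     // keep the even numbers
        .map(number => number * number)         // square them
        .reduce((acc, curr) => acc + curr, 0);  // sum them
}

const result = sumOfSquaresEvenNumbers([1, 2, 3, 4, 5, 6]);
console.log(result); // Output: 56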

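Optimization and improvement

Potential problems

Performance bottleneck: chaining filter, map, and reduce traverses the data three times and allocates intermediate arrays, which can slow things down on large data sets.
Readability: overusing higher-order functions can make code harder to understand.

Code example

function optimizedSumOfSquaresEvenNumbers(numbers) {
    let sum = 0;
    for (let number of numbers) {
        if (number % 2 === 0) {
            sum += number * number;
        }
    }
    return sum;
}

const optimizedResult = optimizedSumOfSquaresEvenNumbers([1, 2, 3, 4, 5, 6]);
console.log(optimizedResult); // Output: 56

Frequently asked questions

Q: How can I avoid modifying the original array while traversing it?
A: Use methods such as map or filter. They return a new array and leave the original array unchanged, as long as the callback itself does not mutate the elements.

Summary and outlook

Array traversal is both a foundation of JavaScript programming and a practical tool for solving algorithm problems. This article covered several traversal methods and how to choose among them for a given problem to improve both efficiency and readability. As the language evolves, new array methods and iterator patterns will keep extending the toolbox. Being fluent in array traversal gives you more options when solving algorithm problems and is a key step for front-end developers who want to advance.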
