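Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Bilingual edition by the Leiphone Subtitle Group (雷锋字幕组): NLP之文本分类：「Tf-Idf、Word2Vec和BERT」三种模型比较

English original: Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Translation: Leiphone Subtitle Group (关山, wiige)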

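Summary

In this article I use NLP and Python to explain three different strategies for multiclass text classification: the old-fashioned bag-of-words (with Tf-Idf), the famous word embedding (with Word2Vec), and the state-of-the-art language models (with BERT).

NLP (Natural Language Processing) is the field of artificial intelligence that studies the interaction between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied to classifying text data. Text classification is the problem of assigning categories to text data according to its content.

There are different techniques to extract information from raw text data and use it to train a classification model. This tutorial compares the traditional bag-of-words approach (used with a simple machine learning algorithm), the popular word embedding model (used with a deep learning neural network), and the state-of-the-art language model (used with transfer learning from attention-based transformers), which has reshaped the NLP landscape.

I will present some useful Python code that can be easily applied to similar cases (just copy, paste, run) and comment every line of it so that you can replicate this example (link to the full code below).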

mdipietro09/DataScience_ArtificialIntelligence_Utils

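I will use the News Category Dataset, which provides news headlines from 2012 to 2018 obtained from HuffPost. The task is to classify each headline into the right category, which makes this a multiclass classification problem (link to the dataset below).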

News Category Dataset

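In particular, I will go through:

Setup: import packages, read data, preprocess, partition.
Bag-of-words: feature engineering, feature selection, and machine learning with scikit-learn, testing and evaluation, explainability with lime.
Word embedding: fitting a Word2Vec with gensim, feature engineering and deep learning with tensorflow/keras, testing and evaluation, explainability with the Attention mechanism.
Language model: feature engineering with transformers, transfer learning from a pre-trained BERT with transformers and tensorflow/keras, testing and evaluation.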

Setup

First of all, we need to import the following libraries:

## for data
import json
import pandas as pd
import numpy as np

## for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## for text processing (used by the preprocessing function)
import re
import nltk

## for bag-of-words
from sklearn import feature_extraction, feature_selection, model_selection, naive_bayes, pipeline, manifold, preprocessing, metrics

## for explainer
from lime import lime_text

## for word embedding
import gensim
import gensim.downloader as gensim_api

## for deep learning
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K

## for bert language model
import transformers

The dataset is contained in a JSON file, so I first read it into a list of dictionaries with json and then turn it into a pandas DataFrame.

lst_dics = []
with open('data.json', mode='r', errors='ignore') as json_file:
    for dic in json_file:
        lst_dics.append( json.loads(dic) )

## print the first one
lst_dics[0]

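The original dataset contains over 30 categories, but for the purposes of this tutorial I will work with a subset of 3: Entertainment, Politics, and Tech.

## create dtf
dtf = pd.DataFrame(lst_dics)

## filter categories
dtf = dtf[ dtf["category"].isin(['ENTERTAINMENT','POLITICS','TECH']) ][["category","headline"]]

## rename columns
dtf = dtf.rename(columns={"category":"y", "headline":"text"})

## print 5 random rows
dtf.sample(5)

As the plot of the category counts shows, the dataset is imbalanced: the proportion of Tech news is very small compared to the other categories, which will make it hard for the models to recognize Tech news.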

Before explaining and building the models, I will give an example of preprocessing: cleaning the text, removing stop words, and applying lemmatization. I will write a function and apply it to the whole dataset.

def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    '''
    Preprocess a string.
    :parameter
        :param text: string - text to preprocess
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''
    ## clean (convert to lowercase, remove punctuation and special characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## Tokenize (convert from string to list)
    lst_text = text.split()

    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]

    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## back to string from list
    text = " ".join(lst_text)
    return text

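That function removes a set of words from the corpus if one is given. I can create a list of generic stop words for the English vocabulary with nltk (the list can be edited by adding or removing words).

lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords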

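Now I will apply the function to the whole dataset and store the result in a new column named "text_clean", so that you can choose to work with either the raw corpus or the preprocessed text.

dtf["text_clean"] = dtf["text"].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=lst_stopwords))
dtf.head()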

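If you are interested in a deeper text analysis and preprocessing, you can check this article. I split the dataset into a training set (70%) and a test set (30%) in order to evaluate the models' performance.

## split dataset
dtf_train, dtf_test = model_selection.train_test_split(dtf, test_size=0.3)

## get target
y_train = dtf_train["y"].values
y_test = dtf_test["y"].values

Let's get started!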

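Bag-of-Words

The bag-of-words model is simple: it builds a vocabulary from a corpus of documents and counts how many times each word appears in each document. In other words, each word in the vocabulary becomes a feature, and a document is represented by a vector with the same length as the vocabulary (a "bag of words"). For instance, take 3 sentences and represent them with this approach:

Shape of the feature matrix: number of documents x length of the vocabulary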
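As a minimal sketch of the counting step, assuming whitespace tokenization and three made-up sentences (the function name bag_of_words and the sample sentences are illustrative, not taken from the article):

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all words seen in the corpus
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical sentences, used only for illustration
docs = ["I like this article", "I like this movie", "this movie is bad"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, so the matrix has the shape described above: number of documents x length of the vocabulary.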

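As you can imagine, this approach causes a significant dimensionality problem: the more documents you have, the larger the vocabulary, so the feature matrix becomes a huge sparse matrix. Therefore, the bag-of-words model is usually preceded by substantial preprocessing (word cleaning, stop word removal, stemming/lemmatization) aimed at reducing the dimensionality.

Term frequency is not necessarily the best representation of a text. In fact, some common words occur very often in the corpus yet have little predictive power over the target variable. To address this problem there is an advanced variant of bag-of-words that, instead of simple counting, uses the term frequency-inverse document frequency (Tf-Idf). Basically, the value of a word increases proportionally to its count in a document, but is inversely proportional to the number of documents in the corpus that contain it.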
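The weighting can be sketched in a few lines of plain Python. This uses the textbook formula tf x log(N / df); scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row, so its numbers differ. The function name tf_idf and the sample sentences are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical sentences, used only for illustration
docs = ["I like this article", "I like this movie", "this movie is bad"]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document ("this") gets weight 0, while a word unique to one document ("article") gets the highest weight in it.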

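Let's start with feature engineering, the process of creating features by extracting information from the data. I will use the Tf-Idf vectorizer with a limit of 10,000 words (so the length of the vocabulary will be 10k), capturing unigrams (i.e. "new" and "york") and bigrams (i.e. "new york"). The code for the classic count vectorizer is given as well:

## Count (classic BoW)
vectorizer = feature_extraction.text.CountVectorizer(max_features=10000, ngram_range=(1,2))

## Tf-Idf (advanced variant of BoW)
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))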

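Now I will use the vectorizer on the preprocessed corpus of the training set to extract a vocabulary and create the feature matrix.

corpus = dtf_train["text_clean"]
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The feature matrix X_train has a shape of 34,265 (number of documents in training) x 10,000 (length of the vocabulary) and it is quite sparse:

sns.heatmap(X_train.todense()[:, np.random.randint(0, X_train.shape[1], 100)]==0, vmin=0, vmax=1, cbar=False).set_title('Sparse Matrix Sample')

Random sample from the feature matrix (non-zero values in black)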

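To find the position of a certain word, look it up in the vocabulary:

word = "new york"
dic_vocabulary[word]

If the word exists in the vocabulary, this command prints a number N, meaning that the Nth feature of the matrix is that word.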

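To drop some columns and reduce the matrix dimensionality, we can carry out some feature selection, the process of selecting a subset of relevant variables. I will proceed as follows:

treat each category as binary (for example, the "Tech" category is 1 for Tech news and 0 for the others);
perform a Chi-Square test to determine whether a feature and the (binary) target are independent;
keep only the features whose Chi-Square test p-value is below a chosen threshold (a score of 1 - p above 0.95, i.e. p < 0.05).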
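To illustrate the test itself, here is a sketch of the textbook Chi-Square test on a 2x2 table (word present or absent vs. headline in the category or not). It is a simplification: scikit-learn's chi2 works on the feature values rather than on a presence table, and the counts below are made up. The function name chi2_2x2 is hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-Square test of independence for a 2x2 table.

    a: in category, word present     b: in category, word absent
    c: other category, word present  d: other category, word absent
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the p-value is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts: the word appears in 30 of 100 Tech headlines and 10 of 100 others
chi2_dep, p_dep = chi2_2x2(30, 70, 10, 90)
# Made-up counts: the word is equally frequent in both groups
chi2_ind, p_ind = chi2_2x2(20, 80, 20, 80)

print(chi2_dep, p_dep, 1 - p_dep > 0.95)  # feature kept
print(chi2_ind, p_ind, 1 - p_ind > 0.95)  # feature dropped
```

A word distributed identically inside and outside the category gives a statistic of 0 and a p-value of 1, so it is discarded; a word concentrated in one category gives a small p-value and is kept.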

y = dtf_train["y"]
## get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
X_names = vectorizer.get_feature_names_out()
p_value_limit = 0.95

lst_features = []
for cat in np.unique(y):
    chi2, p = feature_selection.chi2(X_train, y==cat)
    lst_features.append(pd.DataFrame({"feature":X_names, "score":1-p, "y":cat}))

## DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead
dtf_features = pd.concat(lst_features)
dtf_features = dtf_features.sort_values(["y","score"], ascending=[True,False])
dtf_features = dtf_features[dtf_features["score"]>p_value_limit]
X_names = dtf_features["feature"].unique().tolist()

I reduced the number of features from 10,000 to 3,152 by keeping the most statistically relevant ones. Let's print some:

for cat in np.unique(y):
    print("# {}:".format(cat))
    print("  . selected features:", len(dtf_features[dtf_features["y"]==cat]))
    print("  . top features:", ",".join(dtf_features[dtf_features["y"]==cat]["feature"].values[:10]))
    print(" ")

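We can refit the vectorizer on the corpus by giving this new set of words as input. That produces a smaller feature matrix and a shorter vocabulary.

vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The new feature matrix X_train has a shape of 34,265 (number of documents in training) x 3,152 (length of the given vocabulary). Let's see if the matrix is less sparse:

Random sample from the new feature matrix (non-zero values in black)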

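It's time to train a machine learning model and test it. I recommend a Naive Bayes algorithm: a probabilistic classifier that applies Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to it. This algorithm suits such a large dataset well because it considers each feature independently, calculates the probability of each category, and then predicts the category with the highest probability.

classifier = naive_bayes.MultinomialNB()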
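To make the idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes with Laplace smoothing. It is not scikit-learn's implementation; the toy headlines, the function names train_nb and predict_nb, and the smoothing value alpha=1.0 are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods for each class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_docs[label] / len(docs))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Score each class independently and return the most probable one."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words never seen in training are ignored
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood)
    return max(scores, key=scores.get)

# Hypothetical, already preprocessed headlines
docs = ["senate vote bill", "election vote senate", "movie star award", "film star premiere"]
labels = ["POLITICS", "POLITICS", "ENTERTAINMENT", "ENTERTAINMENT"]
model = train_nb(docs, labels)
print(predict_nb(model, "senate election"))
print(predict_nb(model, "star movie"))
```

Each word contributes its own log-probability term, which is the "independent features" assumption in code form.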

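I'm going to train this classifier on the feature matrix and then test it on the transformed test set. To that end, I need a scikit-learn pipeline: a sequential application of a list of transformations followed by a final estimator. Putting the Tf-Idf vectorizer and the Naive Bayes classifier in a pipeline makes it easy to transform and predict the test data.

## pipeline
model = pipeline.Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])

## train classifier
model["classifier"].fit(X_train, y_train)

## test
X_test = dtf_test["text_clean"].values
predicted = model.predict(X_test)
predicted_prob = model.predict_proba(X_test)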

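We can now evaluate the bag-of-words model with the following metrics:

Accuracy: the fraction of predictions the model got right.
Confusion matrix: a summary table that breaks down the number of correct and incorrect predictions by class.
ROC: a plot of the true positive rate against the false positive rate at various threshold settings. The area under the curve (AUC) is the probability that the classifier ranks a randomly chosen positive observation higher than a randomly chosen negative one.
Precision: the fraction of retrieved instances that are relevant, TP / (TP + FP).
Recall: the fraction of relevant instances that are actually retrieved, TP / (TP + FN).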
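The definitions above can be checked by hand on small numbers. The sketch below computes precision and recall from raw counts and the AUC directly from its ranking interpretation; the counts, the scores, and the function names are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def pairwise_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Made-up counts for one class: 8 true positives, 2 false positives, 4 false negatives
precision, recall = precision_recall(tp=8, fp=2, fn=4)

# Made-up predicted probabilities for positive and negative observations
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

print(round(precision, 2), round(recall, 2), round(auc, 2))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.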

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
## multi_class="ovr" (one-vs-rest) is assumed; the argument was truncated in the source
auc = metrics.roc_auc_score(y_test, predicted_prob, multi_class="ovr")
print("Accuracy:", round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))

## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes, yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i], predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(fpr, tpr)))
## dashed diagonal; the linestyle value is assumed, it was truncated in the source
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], xlabel='False Positive Rate', ylabel="True Positive Rate (Recall)", title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(recall, precision)))
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

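The bag-of-words model got 85% of the test set right (accuracy is 0.85), but it struggles to recognize Tech news (only 252 predicted correctly).

Let's try to understand why the model assigns news to a certain category and assess the explainability of these predictions. The lime package can help us build an explainer. To give an illustration, I will take one observation from the test set and see what the model predicts and why.

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]

## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
explainer = lime_text.LimeTextExplainer(class_names=np.unique(y_train))
explained = explainer.explain_instance(txt_instance, model.predict_proba, num_features=3)
explained.show_in_notebook(text=txt_instance, predict_proba=False)

That makes sense: the words "Clinton" and "GOP" pointed the model in the right direction (Politics news), even though the word "stage" is more common among Entertainment news.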

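Word Embedding

Word embedding is the collective name for feature learning techniques in which words from the vocabulary are mapped to vectors of real numbers. These vectors are calculated from the probability distribution of each word appearing before or after another. To put it another way, words that share the same context usually appear together in the corpus, so they will be close in the vector space as well. For instance, take the 3 sentences from the previous example:

Word embedding in a 2D vector space

In this tutorial I will use the first model of this family: Google's Word2Vec (2013). Other popular word embedding models are Stanford's GloVe (2014) and Facebook's FastText (2016).

Word2Vec produces a vector space, typically of several hundred dimensions, with each unique word in the corpus positioned so that words sharing common contexts in the corpus are located close to one another in the space. That can be done with two different approaches: starting from a single word to predict its context (Skip-gram), or starting from the context to predict a word (Continuous Bag-of-Words).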
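The difference between the two approaches is easiest to see in the training examples they generate. The sketch below only builds those examples (it does not train anything); the function names and the sample sentence are assumptions for illustration.

```python
def context_windows(tokens, window):
    """Yield (target, context words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def skipgram_pairs(tokens, window):
    """Skip-gram: from one word, predict each word of its context."""
    return [(target, ctx) for target, context in context_windows(tokens, window)
            for ctx in context]

def cbow_pairs(tokens, window):
    """CBOW: from the whole context, predict the word in the middle."""
    return [(context, target) for target, context in context_windows(tokens, window)]

tokens = ["i", "like", "this", "article"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_pairs(tokens, window=1)
print(sg)
print(cbow)
```

Skip-gram produces one (input word, output word) pair per context word, while CBOW produces one (context, target) example per position.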

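In Python, you can load a pre-trained word embedding model from gensim-data like this:

nlp = gensim_api.load("word2vec-google-news-300")

Instead of using a pre-trained model, I am going to fit my own Word2Vec on the training data corpus with gensim. Before fitting the model, the corpus needs to be transformed into a list of lists of n-grams. In this particular case, I will try to capture unigrams ("york"), bigrams ("new york"), and trigrams ("new york city").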

corpus = dtf_train["text_clean"]

## create list of lists of unigrams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)

## detect bigrams and trigrams
## (gensim 3.x API: the delimiter is bytes; gensim >= 4 expects a str such as " ")
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus, delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus], delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

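When fitting the Word2Vec, you need to specify:

the target size of the word vectors, here 300;
the window, i.e. the maximum distance between the current and the predicted word within a sentence, here the mean length of the texts in the corpus;
the training algorithm, here skip-gram (sg=1), as in general it gives better results.

## fit w2v
## (gensim 3.x argument names; in gensim >= 4 they are vector_size and epochs)
nlp = gensim.models.word2vec.Word2Vec(lst_corpus, size=300, window=8, min_count=1, sg=1, iter=30)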

We have our embedding model, so we can pick any word from the corpus and transform it into a 300-dimensional vector.

word = "data"
nlp[word].shape

We can even use it to visualize a word and its context in a lower-dimensional space (2D or 3D) by applying a dimensionality reduction algorithm such as t-SNE.

word = "data"
fig = plt.figure()

## word embedding
tot_words = [word] + [tupla[0] for tupla in nlp.most_similar(word, topn=20)]
X = nlp[tot_words]

## t-SNE to reduce dimensionality from 300 to 3
pca = manifold.TSNE(perplexity=40, n_components=3, init='pca')
X = pca.fit_transform(X)

## create dtf
dtf_ = pd.DataFrame(X, index=tot_words, columns=["x","y","z"])
dtf_["input"] = 0
dtf_.iloc[0:1, dtf_.columns.get_loc("input")] = 1

## plot 3d
from mpl_toolkits.mplot3d import Axes3D
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dtf_[dtf_["input"]==0]['x'], dtf_[dtf_["input"]==0]['y'], dtf_[dtf_["input"]==0]['z'], c="black")
ax.scatter(dtf_[dtf_["input"]==1]['x'], dtf_[dtf_["input"]==1]['y'], dtf_[dtf_["input"]==1]['z'], c="red")
ax.set(xlabel=None, ylabel=None, zlabel=None, xticklabels=[], yticklabels=[], zticklabels=[])
for label, row in dtf_[["x","y","z"]].iterrows():
    x, y, z = row
    ax.text(x, y, z, s=label)

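That's pretty cool, but how can word embedding help predict the news category? The word vectors can be used in a neural network as weights. This is how:

First, transform the corpus into padded sequences of word ids to get a feature matrix.
Then, create an embedding matrix so that the vector of the word with id N is located at the Nth row.
Finally, build a neural network with an embedding layer that weights every word in the sequences with the corresponding vector.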
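The three steps can be sketched without any deep learning library. This is a simplification under stated assumptions: ids are assigned in first-seen order (Keras' Tokenizer orders them by frequency), the 2-dimensional vectors are made up, and the function names are hypothetical.

```python
def fit_vocabulary(texts, oov_token="NaN"):
    """Map words to ids; 0 is reserved for padding and 1 for unknown words."""
    vocabulary = {oov_token: 1}
    for text in texts:
        for word in text.split():
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary

def to_padded_sequence(text, vocabulary, maxlen, oov_token="NaN"):
    """Step 1: text -> list of ids, truncated and zero-padded at the end."""
    ids = [vocabulary.get(word, vocabulary[oov_token]) for word in text.split()][:maxlen]
    return ids + [0] * (maxlen - len(ids))

def build_embedding_matrix(vocabulary, vectors, dim):
    """Step 2: row N holds the vector of the word with id N (zeros if unknown)."""
    matrix = [[0.0] * dim for _ in range(len(vocabulary) + 1)]
    for word, idx in vocabulary.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix

# Made-up 2-dimensional word vectors, standing in for a trained Word2Vec
vectors = {"i": [0.1, 0.2], "like": [0.3, 0.4], "this": [0.5, 0.6], "article": [0.7, 0.8]}
vocabulary = fit_vocabulary(["i like this article"])
sequence = to_padded_sequence("i like this blog", vocabulary, maxlen=6)
matrix = build_embedding_matrix(vocabulary, vectors, dim=2)

# Step 3: what an embedding layer does, i.e. look up one row per id
embedded = [matrix[idx] for idx in sequence]
print(sequence)
print(embedded)
```

The unknown word "blog" maps to id 1 and the padding positions to id 0; both end up as zero vectors, and the result has one vector per position (sequence length x vector size).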

Let's start with feature engineering, using tensorflow/keras to transform the same preprocessed corpus given to the Word2Vec (list of lists of n-grams) into a list of sequences:

## tokenize text
tokenizer = kprocessing.text.Tokenizer(lower=True, split=' ', oov_token="NaN", filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(lst_corpus)
dic_vocabulary = tokenizer.word_index

## create sequence
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)

## padding sequence
X_train = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15, padding="post", truncating="post")

The feature matrix X_train has a shape of 34,265 x 15 (number of sequences x maximum sequence length). Let's visualize it:

sns.heatmap(X_train==0, vmin=0, vmax=1, cbar=False)
plt.show()

Feature matrix (34,265 x 15)

Every text in the corpus is now an id sequence of length 15. For instance, if a text has 10 tokens, its sequence consists of 10 ids followed by 5 zeros, the padding element (while the id for words not in the vocabulary is 1). Let's print how a text from the training set has been transformed into a padded sequence:

i = 0

## list of text: ["I like this", ...]
len_txt = len(dtf_train["text_clean"].iloc[i].split())
print("from: ", dtf_train["text_clean"].iloc[i], "| len:", len_txt)

## sequence of token ids: [[1, 2, 3], ...]
len_tokens = len(X_train[i])
print("to: ", X_train[i], "| len:", len(X_train[i]))

## vocabulary: {"I":1, "like":2, "this":3, ...}
print("check: ", dtf_train["text_clean"].iloc[i].split()[0], " -- idx in vocabulary -->", dic_vocabulary[dtf_train["text_clean"].iloc[i].split()[0]])
print("vocabulary: ", dict(list(dic_vocabulary.items())[0:5]), "... (padding element, 0)")

Before moving on, don't forget to apply the same feature engineering to the test set as well:

corpus = dtf_test["text_clean"]

## create list of n-grams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)

## detect common bigrams and trigrams using the fitted detectors
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])

## text to sequence with the fitted tokenizer
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)

## padding sequence
X_test = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15, padding="post", truncating="post")

X_test (14,697 x 15)

We've got our X_train and X_test; now we need to create the embedding matrix that will be used as the weight matrix in the neural network classifier.

## start the matrix (length of vocabulary x vector size) with all 0s
embeddings = np.zeros((len(dic_vocabulary)+1, 300))
for word, idx in dic_vocabulary.items():
    ## update the row with vector
    try:
        embeddings[idx] = nlp[word]
    ## if word not in model then skip and the row stays all 0s
    except KeyError:
        pass

That code generates a matrix of shape 22,338 x 300 (length of the vocabulary extracted from the corpus x vector size). It can be navigated by word id, which can be obtained from the vocabulary.

word = "data"
print("dic[word]:", dic_vocabulary[word], "|idx")
print("embeddings[idx]:", embeddings[dic_vocabulary[word]].shape, "|vector")

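It's finally time to build a deep learning model. I'm going to use the embedding matrix in the first Embedding layer of the neural network that I will build and train to classify the news. Each id in the input sequence is used as an index into the embedding matrix. The output of this Embedding layer is a 2D matrix with a word vector for each word id in the input sequence (sequence length x vector size). Take the sentence "I like this article" as an example:

My neural network is structured as follows:

an Embedding layer that, as described above, takes the sequences as input and the word vectors as weights;
a simple Attention layer that is there for explainability rather than for the predictions: it captures the weights of each instance and allows us to build a nice explainer, so you can skip it. The Attention mechanism was presented in this paper (2014) as a way for sequence models (i.e. LSTM) to work out which parts of a long text are actually relevant;
two layers of Bidirectional LSTM to model the order of words in a sequence in both directions;
two final dense layers that predict the probability of each news category.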

## code attention layer
def attention_layer(inputs, neurons):
    x = layers.Permute((2,1))(inputs)
    x = layers.Dense(neurons, activation="softmax")(x)
    x = layers.Permute((2,1), name="attention")(x)
    x = layers.multiply([inputs, x])
    return x

## input
x_in = layers.Input(shape=(15,))

## embedding
x = layers.Embedding(input_dim=embeddings.shape[0], output_dim=embeddings.shape[1], weights=[embeddings], input_length=15, trainable=False)(x_in)

## apply attention
x = attention_layer(x, neurons=15)

## 2 layers of bidirectional lstm
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2))(x)

## final dense layers
x = layers.Dense(64, activation='relu')(x)
y_out = layers.Dense(3, activation='softmax')(x)

## compile
model = models.Model(x_in, y_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Now we can train the model and check its performance on a subset of the training set used for validation before testing it on the actual test set.

## encode y
dic_y_mapping = {n:label for n,label in enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])

## train
training = model.fit(x=X_train, y=y_train, batch_size=256, epochs=10, shuffle=True, verbose=0, validation_split=0.3)

## plot loss and accuracy
## (named history_metrics so that sklearn's metrics module is not shadowed)
history_metrics = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)

ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in history_metrics:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()

ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in history_metrics:
    ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()

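Nice! In some epochs the accuracy reached 0.89. To complete the evaluation of the word embedding model, let's predict the test set and compare the same metrics used before (the code for the metrics is the same as before).

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in predicted_prob]

The model performs about as well as the previous one. In fact, it also struggles to classify Tech news.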

But is it explainable? Yes! I put an Attention layer in the neural network to extract the weights of each word and understand how much they contributed to classifying an instance. So I will try to use the Attention weights to build an explainer (similar to the one in the previous section):

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]

## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
### 1. preprocess input
lst_corpus = []
for string in [re.sub(r'[^\w\s]','', txt_instance.lower().strip())]:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
X_instance = kprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(lst_corpus), maxlen=15, padding="post", truncating="post")

### 2. get attention weights
layer = [layer for layer in model.layers if "attention" in layer.name][0]
func = K.function([model.input], [layer.output])
weights = func(X_instance)[0]
weights = np.mean(weights, axis=2).flatten()

### 3. rescale weights, remove null vector, map word-weight
weights = preprocessing.MinMaxScaler(feature_range=(0,1)).fit_transform(np.array(weights).reshape(-1,1)).reshape(-1)
weights = [weights[n] for n,idx in enumerate(X_instance[0]) if idx != 0]
dic_word_weigth = {word:weights[n] for n,word in enumerate(lst_corpus[0]) if word in tokenizer.word_index.keys()}

### 4. barplot
top = 10  ## number of words to plot; an assumed value, since "top" is never defined in the source
if len(dic_word_weigth) > 0:
    dtf_weights = pd.DataFrame.from_dict(dic_word_weigth, orient='index', columns=["score"])
    dtf_weights.sort_values(by="score", ascending=True).tail(top).plot(kind="barh", legend=False).grid(axis='x')
    plt.show()
else:
    print("--- No word recognized ---")

### 5. produce html visualization
text = []
for word in lst_corpus[0]:
    weight = dic_word_weigth.get(word)
    if weight is not None:
        ## the highlight markup is an assumption: the original HTML tags were stripped from the source
        text.append('<b><span style="background-color:rgba(100,149,237,' + str(weight) + ');">' + word + '</span></b>')
    else:
        text.append(word)
text = ' '.join(text)

### 6. visualize on notebook
print("\033[1m"+"Text with highlighted words")
from IPython.display import display, HTML
display(HTML(text))

Just like before, the words "clinton" and "gop" activated the neurons of the model, but this time "high" and "benghazi" were also found to be slightly relevant for the prediction.

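Language Models

Language models, or contextualized/dynamic word embeddings, overcome the biggest limitation of the classic word embedding approach: polysemy disambiguation. In the classic approach a word with different meanings (e.g. "bank" or "stick") is identified by just one vector. One of the first popular ones was ELMo (2018), which doesn't apply a fixed embedding but, using a bidirectional LSTM, looks at the entire sentence and then assigns an embedding to each word.

Enter Transformers: a new modeling technique presented in Google's paper Attention is All You Need (2017), which demonstrated that sequence models (like LSTM) can be totally replaced by Attention mechanisms, even obtaining better performance.

Google's BERT (Bidirectional Encoder Representations from Transformers, 2018) combines ELMo's context embedding with several Transformers, and it is bidirectional (a big novelty for Transformers). The vector BERT assigns to a word is a function of the entire sentence, so a word can have different vectors depending on the context. Let's try it with the input "bank river":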

txt = "bank river"

## bert tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

## bert model
nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased')

## return hidden layer with embeddings
input_ids = np.array(tokenizer.encode(txt))[None,:]
embedding = nlp(input_ids)
embedding[0][0]

If we change the input text to "bank money", we get this instead:

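In order to complete a text classification task, you can use BERT in 3 different ways:

train it all from scratch and use it as a classifier;
extract the word embeddings and use them in an embedding layer (like I did with Word2Vec);
fine-tune the pre-trained model (transfer learning).

I'm going with the latter and do transfer learning from a pre-trained lighter version of BERT, called DistilBERT (66 million parameters instead of 110 million).

## distil-bert tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)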

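As usual, some feature engineering is needed before fitting the model, but this time it's a little trickier. To illustrate what needs to be done, take again the sentence "I like this article", which has to be transformed into 3 vectors (ids, mask, segment):

Shape: 3 x sequence length

First of all, we need to select the maximum sequence length. This time I'm going to choose a much larger number (i.e. 50) because BERT splits unknown words into sub-tokens until it finds a known unigram. For example, given a made-up word like "zzdata", BERT would split it into ["z", "##z", "##data"]. Moreover, we have to insert special tokens into the input text and then generate the mask and segment vectors. Finally, we put everything together in a tensor to get the feature matrix, which has the shape of 3 (ids, masks, segments) x number of documents in the corpus x sequence length.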
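The sub-token splitting and the mask and segment vectors can be sketched in plain Python. The splitter below is a greedy longest-match-first routine in the style of WordPiece, run on a tiny made-up vocabulary, not BERT's real tokenizer; the segment logic mirrors the loop used in this article (the counter increases after "[SEP]"). The function names are hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into known sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

def build_inputs(tokens, maxlen):
    """Add special tokens, pad to maxlen, and build the mask and segment vectors."""
    tokens = ["[CLS]"] + tokens[:maxlen - 2] + ["[SEP]"]
    n_real = len(tokens)
    mask = [1] * n_real + [0] * (maxlen - n_real)
    padded = tokens + ["[PAD]"] * (maxlen - n_real)
    segments, seg = [], 0
    for token in padded:
        segments.append(seg)
        if token == "[SEP]":
            seg += 1
    return padded, mask, segments

toy_vocab = {"z", "##z", "##data"}  # made-up vocabulary
pieces = wordpiece("zzdata", toy_vocab)
padded, mask, segments = build_inputs(["i", "like", "this", "article"], maxlen=8)
print(pieces)
print(padded, mask, segments, sep="\n")
```

The mask marks real tokens with 1 and padding with 0, which is what lets the model ignore the padded positions.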

Note that I'm using the raw text as corpus (so far I've been using the text_clean column).

corpus = dtf_train["text"]
maxlen = 50

## add special tokens
maxqnans = int((maxlen-20)/2)
corpus_tokenized = ["[CLS] "+
    " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '', str(txt).lower().strip()))[:maxqnans])+
    " [SEP] " for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(txt.split(" "))) for txt in corpus_tokenized]

## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]

## generate idx
idx = [tokenizer.encode(seq.split(" ")) for seq in txt2seq]

## generate segments
segments = []
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
            i += 1
    segments.append(temp)

## feature matrix
X_train = [np.asarray(idx, dtype='int32'), np.asarray(masks, dtype='int32'), np.asarray(segments, dtype='int32')]

The feature matrix X_train has a shape of 3 x 34,265 x 50. We can check one sample from the feature matrix:

i = 0
print("txt: ", dtf_train["text"].iloc[0])
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train[0][i].tolist()])
print("idx: ", X_train[0][i])
print("mask: ", X_train[1][i])
print("segment: ", X_train[2][i])

You can apply the same code to dtf_test["text"] to get X_test.

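Now I'm going to build the deep learning model with transfer learning from the pre-trained BERT. Basically, I will summarize the output of BERT into one vector with average pooling and then add two final dense layers to predict the probability of each news category.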
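Average pooling over the sequence axis is a small operation, shown here on a made-up 3 x 2 matrix (3 tokens, hidden size 2) instead of a real BERT output. The function name is hypothetical; in Keras this is the job of the GlobalAveragePooling1D layer.

```python
def global_average_pooling(token_vectors):
    """Average a (sequence length x hidden size) matrix over the sequence axis."""
    seq_len = len(token_vectors)
    return [sum(column) / seq_len for column in zip(*token_vectors)]

# Made-up output for a sequence of 3 tokens with hidden size 2
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pooling(token_vectors)
print(pooled)
```

Whatever the sequence length, the result is a single vector of the hidden size, which is what the dense layers take as input.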

If you want to use the original version of BERT, here's the code (remember to redo the feature engineering with the right tokenizer):

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
segments = layers.Input((50,), dtype="int32", name="input_segments")

## pre-trained bert
nlp = transformers.TFBertModel.from_pretrained("bert-base-uncased")
bert_out, _ = nlp([idx, masks, segments])

## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)), activation='softmax')(x)

## compile
model = models.Model([idx, masks, segments], y_out)
for layer in model.layers[:4]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

As I said, I'm going to use the lighter DistilBERT instead:

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")

## pre-trained bert with config
config = transformers.DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
nlp = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)
bert_out = nlp(idx, attention_mask=masks)[0]

## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)), activation='softmax')(x)

## compile
model = models.Model([idx, masks], y_out)
for layer in model.layers[:3]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Let's train, test, and evaluate this model (the code for evaluation is the same as before):

## encode y
dic_y_mapping = {n:label for n,label in enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])

## train
training = model.fit(x=X_train, y=y_train, batch_size=64, epochs=1, shuffle=True, verbose=1, validation_split=0.3)

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in predicted_prob]

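BERT performs slightly better than the previous models; in fact, it recognizes more Tech news than the others.

Conclusion

This article has been a tutorial demonstrating how to apply different NLP models to a multiclass classification use case. I compared 3 popular approaches: bag-of-words with Tf-Idf, word embedding with Word2Vec, and language model with BERT. For each model I went through feature engineering and feature selection, model design and testing, and evaluation and explainability, comparing the 3 models at every step where possible.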


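Foreword

Text is one of the most basic elements of a web page. Making pages easier to read, so that readers get more information in less time, is every web author's goal. This article introduces how to use the various text formatting tags.

This tutorial is aimed at beginners; if you are already very familiar with HTML, feel free to skip it.

Heading text

While browsing the web you often see headings that mark the division into sections. They are displayed at fixed font sizes, and there are 6 heading levels in total, decreasing in size from h1 to h6, as shown below:

HTML code: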

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>标题</title>
</head>
<body>
<h1>这是标题 1</h1>
<h2>这是标题 2</h2>
<h3>这是标题 3</h3>
<h4>这是标题 4</h4>
<h5>这是标题 5</h5>
<h6>这是标题 6</h6>
</body>
</html>

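Heading alignment can be set with the align attribute, which takes three values (align is obsolete in HTML5; the CSS text-align property is the replacement):

left: left-aligned
center: centered
right: right-aligned

As shown below:

HTML code: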

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>标题</title>
</head>
<body>
<h1>这是标题 1</h1>
<h2 align="left">这是标题 2</h2>
<h3 align="center">这是标题 3</h3>
<h4 align="right">这是标题 4</h4>
<h5>这是标题 5</h5>
<h6>这是标题 6</h6>
</body>
</html>

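Text formatting tags

Besides headings, ordinary text is also indispensable in a web page, and various text effects can make the page more attractive.

Any text typed between <body> and </body> is displayed directly on the page. The formatting of that text is controlled with tags, and the sections below go through them one by one.

1. Font face, size, and color: the <font> tag

The <font> tag was used in HTML 4 to specify the font face, font size, and text color, but it is not supported in HTML5.

face attribute: font family
size attribute: font size
color attribute: font color

HTML code: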

<html>
<body>
<div><font face="宋体">字体</font></div>
<div><font size="5">5号字体</font></div>
<div><font color="red">颜色</font></div>
<div><font size="5" face="arial" color="blue">一起使用</font></div>
</body>
</html>

The <font> tag should not be used in HTML5; use CSS styles instead.

2. Bold, italic, underline, strikethrough: strong, em, u, del

The result looks like this:

HTML code:

<!DOCTYPE html>
<html>
<body>
<p>这是普通文本 - <strong>这是粗体文本</strong>。</p>
<p>这是普通文本 - <em>这是斜体</em>。</p>
<p>这是普通文本 - <u>这是下划线</u>。</p>
<p>这是普通文本 - <del>这是删除线</del>。</p>
</body>
</html>

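Note: strong and b, del and s, and em and i render the same by default, but they differ in meaning. strong, em, and del carry semantics (importance, emphasis, deleted content), whereas b, i, s, and u are still part of HTML5 with redefined, mostly presentational meanings. Prefer the semantic tags, and use CSS for purely visual styling.

3. Superscript and subscript: sup, sub

The result looks like this:

HTML code: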

<html>
<body>
<p>
普通文本 <sup>上标</sup>
</p>
<p>
普通文本 <sub>下标</sub>
</p>
<p>
数学公式 X<sup>3</sup> + 5X<sup>2</sup> - 5=0
</p>
<p>
数学公式 X<sub>1</sub> - 2X<sub>1</sub>=0
</p>
</body>
</html>

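4. Spaces: &nbsp;

When typing text for a web page, spaces that were clearly added in a paragraph often do not show up on the page. That is because in HTML the browser collapses any run of ordinary (half-width) whitespace between two pieces of text into a single space. The non-breaking space entity &nbsp; is used instead: each &nbsp; stands for one half-width space, and it can be repeated to get several spaces.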
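The collapsing behavior can be imitated in a few lines of Python. This is only a rough model of what a browser does by default (real whitespace handling also depends on CSS, for example the white-space property); the function name collapse_whitespace is hypothetical.

```python
import re

def collapse_whitespace(text):
    """Collapse runs of ordinary whitespace into one space, as a browser does by default.

    The non-breaking space (U+00A0, written &nbsp; in HTML) is deliberately not
    in the character class, so it survives.
    """
    return re.sub(r"[ \t\r\n\f]+", " ", text)

plain = collapse_whitespace("a    b\n\n   c")
nbsp = collapse_whitespace("a\u00a0\u00a0\u00a0b")
print(repr(plain))
print(repr(nbsp))
```

Ordinary spaces and line breaks shrink to a single space, while the three non-breaking spaces are all kept.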

HTML code:

Because Toutiao does not display the space entity, the code is shown as an image.

Result:

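5. Other special characters

Besides the space entity, some other special characters in a web page also need to be written as codes. In general, a special character consists of the prefix "&", the character name, and the suffix ";", just like the space entity. See the table below.

There are many special characters and only a few examples are listed here; look up the full list when you need it.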
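One quick way to experiment with these codes is Python's standard html module, which converts between special characters and their "&name;" form. The sample strings are made up.

```python
import html

# Characters with a special meaning in HTML are replaced by entities
escaped = html.escape("<p>Tom & Jerry</p>")

# Entities are turned back into the characters they stand for
decoded = html.unescape("&copy; 2020 &lt;b&gt;")
space = html.unescape("&nbsp;")

print(escaped)
print(decoded)
print(repr(space))
```

html.escape handles the characters that would otherwise be read as markup, and html.unescape understands the named entities, including the space entity from the previous section.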

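Paragraphs

To display text in an orderly way on a web page, you need paragraph tags. The following sections introduce the tags related to paragraphs.

Paragraph tag: p

In a web page, a paragraph is defined with <p> and </p>.

HTML code: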

<html>
<body>
<p>这是段落。</p>
<p>这是段落。</p>
<p>这是段落。</p>
<p>段落元素由 p 标签定义。</p> 
</body>
</html>

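Result:

Line break tag: br

Besides automatic wrapping, you can also force a line break with the <br> tag. It differs from the p paragraph tag: paragraphs are separated by a blank line, whereas br adds none, which keeps two lines of text closer together.

HTML code: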

<html>
<body>
<p>
第一个段落<br />换行1<br />换行2<br />换行3<br />最后一行.
</p>
<p>
第二个段落 <br />换行1<br />换行2<br />换行3<br />最后一行.
</p>
</body>
</html>

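The result looks like this:

If you do not want the browser to wrap text automatically, you can use the nobr tag (non-standard; the CSS rule white-space: nowrap is the standard way), as shown below:

That line of text is not wrapped automatically, and a horizontal scroll bar appears.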

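Preserving the original layout: pre

When building web pages you sometimes need to preserve a particular layout, which is cumbersome to control with other tags. The <pre> tag preserves the formatting of the text as typed, as shown below:

HTML code: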

<html>
<body>
<pre>
这是
预格式文本。
它保留了      空格
和换行。
</pre>
<p>pre 标签很适合显示计算机代码：</p>
<pre>
for i=1 to 10
     print i
next i
</pre>
<p>这是一个ok效果</p>
<pre>
  O O    k  K
 O   O   K K
  O O    K  K
</pre>
</body>
</html>

Other tags

Indentation: blockquote

The <blockquote> tag indents a block of text. Each use indents the paragraph by one more level, and it can be nested.

Example code:

<html>
<body>
Here comes a long quotation:
<blockquote>
This is a long quotation. This is a long quotation. This is a long quotation. This is a long quotation. This is a long quotation.
</blockquote>
请注意，浏览器在 blockquote 元素前后添加了换行，并增加了外边距。
</body>
</html>

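The result looks like this:

Note that the browser adds line breaks before and after the blockquote element and increases the margins.

Horizontal rule: hr

A horizontal line can be added between paragraphs to separate them. The result looks like this:

HTML code: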

<html>
<body>
<p>hr 标签定义水平线：</p>
<hr />
<p>这是段落。</p>
<hr />
<p>这是段落。</p>
<hr />
<p>这是段落。</p>
</body>
</html>

Text annotation: ruby

On a web page you can annotate a piece of text to explain it.

The result looks like this:

HTML code:

<!DOCTYPE HTML>
<html>
<body>
<p>ruby 使用语法：</p>
<ruby>
 被说明的文字 <rt> 标注 </rt>
</ruby>
</body>
</html>

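Other tags: var, code, kbd, and so on

<dfn>	Marks the defining instance of a term.
<code>	Marks computer code text.
<samp>	Marks sample output text.
<kbd>	Marks keyboard text, i.e. text typed on the keyboard. It is often used in computer-related documentation and manuals.
<var>	Marks a variable. It can be combined with the <pre> and <code> tags.
<cite>	Marks a citation. Use it for references to works, such as the title of a book or magazine.

Summary

This article covered most of the commonly used text formatting tags, which come up constantly when building web pages. Mastering them is simple: use a text editor or an online editor with live preview such as w3cschool, write the examples yourself, and get familiar with what each tag does. There is no need to memorize anything; understanding is what matters.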


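Foreword

I previously wrote about short-text classification with pytorch, where the data processing was rather crude. Natural language processing covers many tasks, and handling every dataset the way I did there gets tedious and time-consuming. In pytorch, the well-known data processing package is torchvision, which handles images; packages for text are mentioned far less often, but there is one for processing text data quickly: torchtext[1]. Building on the previous example, 【深度学习】textCNN论文与原理——短文本分类(基于pytorch)[2], this article uses torchtext to preprocess the text data and then trains the classification model on the result.

For the basics of torchtext, besides the official documentation you can also read this article: TorchText用法示例及完整代码[3].

Let's see how the processing is done.

1 Data processing

First, import the packages: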

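import random
import torch
from torchtext import data  # legacy API; in torchtext 0.9-0.11 it lives in torchtext.legacy
import config, utils, dataloader  # project-local modules used by the code in this article

The corpus involves two things: the text and the category each text belongs to. The two corresponding fields are built with torchtext as follows: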

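# Text content: custom tokenizer, lowercase conversion, fixed maximum length, batch-first tensors
TEXT = data.Field(tokenize=utils.en_seg, lower=True, fix_length=config.MAX_SENTENCE_SIZE, batch_first=True)
# Label that belongs to each text
LABEL = data.LabelField(dtype=torch.float)

Some of these parameters live in a config.py file: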

# Model parameters
RANDOM_SEED = 1000  # random seed
BATCH_SIZE = 128    # batch size
LEARNING_RATE = 1e-3   # learning rate
EMBEDDING_SIZE = 200   # word vector dimension
MAX_SENTENCE_SIZE = 50  # maximum sentence length
EPOCH = 20            # number of training epochs

# Corpus paths
NEG_CORPUS_PATH = './corpus/neg.txt'
POS_CORPUS_PATH = './corpus/pos.txt'

utils.en_seg is a custom tokenization function:

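def en_seg(sentence):
    """
    Simple English tokenizer.
    :param sentence: sentence to tokenize
    :return: list of tokens
    """
    return sentence.split()

You could of course write something more sophisticated, or use spacy. Next comes the function that reads the text data into torchtext objects so that torchtext's methods can be used on it: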

def get_dataset(corpus_path, text_field, label_field, datatype):
    """
    Build a torchtext dataset.
    :param corpus_path: path to the data
    :param text_field: torchtext field for the text
    :param label_field: torchtext field for the label
    :param datatype: category of the texts ('pos' or 'neg')
    :return: list of torchtext examples and the field definitions
    """
    fields = [('text', text_field), ('label', label_field)]
    examples = []
    with open(corpus_path, encoding='utf8') as reader:
        for line in reader:
            content = line.rstrip()
            if datatype == 'pos':
                label = 1
            else:
                label = 0
            # content[:-2] drops the trailing space and period of each raw line, then ties the data to the fields
            examples.append(data.Example.fromlist([content[:-2], label], fields))

    return examples, fields

Now the data can be loaded in torchtext format:

# Build the examples
pos_examples, pos_fields = dataloader.get_dataset(config.POS_CORPUS_PATH, TEXT, LABEL, 'pos')
neg_examples, neg_fields = dataloader.get_dataset(config.NEG_CORPUS_PATH, TEXT, LABEL, 'neg')
# Both calls return the same field definitions, so one copy is enough
all_examples, all_fields = pos_examples + neg_examples, pos_fields

# Build the torchtext dataset
total_data = data.Dataset(all_examples, all_fields)

With this data in place, everything the model needs can be prepared quickly: splitting, building batches, building the vocabulary, and so on:


# 数据集切分
train_data, test_data = total_data.split(random_state=random.seed(config.RANDOM_SEED), split_ratio=0.8)

# 切分后的数据查看
# # 数据维度查看
print('len of train data: %r' % len(train_data))  # len of train data: 8530
print('len of test data: %r' % len(test_data))  # len of test data: 2132

# # 抽一条数据查看
print(train_data.examples[100].text)
# ['never', 'engaging', ',', 'utterly', 'predictable', 'and', 'completely', 'void', 'of', 'anything', 'remotely',
# 'interesting', 'or', 'suspenseful']
print(train_data.examples[100].label)
# 0

# 为该样本数据构建字典，并将子每个单词映射到对应数字
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

# 查看字典长度
print(len(TEXT.vocab))  # 19206
# 查看字典中前10个词语
print(TEXT.vocab.itos[:10])  # ['<unk>', '<pad>', ',', 'the', 'a', 'and', 'of', 'to', '.', 'is']
# 查找'name'这个词对应的词典序号, 本质是一个dict
print(TEXT.vocab.stoi['name'])  # 2063

# 构建迭代(iterator)类型的数据
train_iterator, test_iterator = data.BucketIterator.splits((train_data, test_data),
                                                           batch_size=config.BATCH_SIZE,
                                                           sort=False)

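Written this way, torchtext saves a good deal of boilerplate. What remains is the familiar routine of training the model and checking its results.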
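
To show roughly what building a vocabulary and numericalizing a sentence involve, here is a minimal sketch in plain Python. It illustrates the idea only and is not torchtext's implementation; the function names `build_vocab` and `numericalize`, the tie-breaking rule, and the sample sentences are assumptions made for this example. Like the output printed earlier, it places `<unk>` at index 0 and `<pad>` at index 1, followed by tokens in descending frequency.

```python
from collections import Counter


def build_vocab(tokenized_texts, specials=('<unk>', '<pad>')):
    """Return (itos, stoi) built from an iterable of token lists (illustrative)."""
    counter = Counter(tok for tokens in tokenized_texts for tok in tokens)
    # Most frequent first; ties broken alphabetically for a deterministic order
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi, fix_length, unk='<unk>', pad='<pad>'):
    """Map tokens to indices, then truncate or pad to fix_length."""
    ids = [stoi.get(tok, stoi[unk]) for tok in tokens][:fix_length]
    return ids + [stoi[pad]] * (fix_length - len(ids))


texts = [['the', 'film', 'is', 'good'], ['the', 'plot', 'is', 'thin']]
itos, stoi = build_vocab(texts)
print(itos)
# ['<unk>', '<pad>', 'is', 'the', 'film', 'good', 'plot', 'thin']
print(numericalize(['the', 'ending', 'is', 'good'], stoi, fix_length=6))
# [3, 0, 2, 5, 1, 1]
```

The unseen word `ending` maps to the `<unk>` index 0, and the sentence is padded with the `<pad>` index 1 up to the fixed length.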

2 Building and training the model

The theory behind the model was covered in the earlier post; go back and review it if needed. The model itself is unchanged:

import torch
from torch import nn

import config


class TextCNN(nn.Module):
    # output_size is the number of classes (2 here: 0 and 1); three kernel sizes (3, 4, 5) with 100 filters each
    def __init__(self, vocab_size, embedding_dim, output_size, filter_num=100, kernel_list=(3, 4, 5), dropout=0.5):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # 1 is the number of input channels, filter_num the number of output channels; each kernel has size (kernel, embedding_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, filter_num, (kernel, embedding_dim)),
                          nn.LeakyReLU(),
                          nn.MaxPool2d((config.MAX_SENTENCE_SIZE - kernel + 1, 1)))
            for kernel in kernel_list
        ])
        self.fc = nn.Linear(filter_num * len(kernel_list), output_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.embedding(x)  # [128, 50, 200] (batch, seq_len, embedding_dim)
        x = x.unsqueeze(1)  # [128, 1, 50, 200], i.e. (batch, channel_num, seq_len, embedding_dim)
        out = [conv(x) for conv in self.convs]
        out = torch.cat(out, dim=1)  # [128, 300, 1, 1], feature maps from all kernel sizes concatenated
        out = out.view(x.size(0), -1)  # flatten to [128, 300]
        out = self.dropout(out)  # apply dropout
        logits = self.fc(out)  # output [128, 2]
        return logits
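
The shape comments in `forward` can be checked with simple arithmetic. The sketch below traces the shapes without running torch; the function name `textcnn_shapes` is made up for this example, and it assumes every input is padded to exactly `config.MAX_SENTENCE_SIZE` tokens (50 in the comments above), since the pooling window is sized from that constant.

```python
def textcnn_shapes(batch, seq_len, embedding_dim, filter_num=100, kernel_list=(3, 4, 5)):
    """Trace tensor shapes through the TextCNN forward pass (pure arithmetic, no torch)."""
    shapes = {'embedded': (batch, 1, seq_len, embedding_dim)}
    for kernel in kernel_list:
        # A (kernel, embedding_dim) filter slides only along the sequence axis
        conv_len = seq_len - kernel + 1
        shapes[f'conv_{kernel}'] = (batch, filter_num, conv_len, 1)
        # A max-pool window of (conv_len, 1) collapses the sequence axis to 1
        shapes[f'pool_{kernel}'] = (batch, filter_num, 1, 1)
    shapes['concat'] = (batch, filter_num * len(kernel_list), 1, 1)
    shapes['flat'] = (batch, filter_num * len(kernel_list))
    return shapes


for name, shape in textcnn_shapes(batch=128, seq_len=50, embedding_dim=200).items():
    print(name, shape)
```

With the defaults this gives 48, 47 and 46 positions after the three convolutions, and 300 features going into the final linear layer. Because the pooling window is fixed, a batch padded to any other length would not reduce to a single position per filter.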

Two helper functions handle training and evaluation, the same as in the earlier post:

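import numpy as np
import torch


def binary_acc(pred, y):
    """
    Compute classification accuracy
    :param pred: predicted class indices
    :param y: ground-truth labels
    :return: accuracy as a scalar tensor
    """
    correct = torch.eq(pred, y).float()
    acc = correct.sum() / len(correct)
    return acc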


def train(model, train_data, optimizer, criterion):
    """
    Train the model for one epoch
    :param model: the model to train
    :param train_data: iterator over training batches
    :param optimizer: optimizer
    :param criterion: loss function
    :return: mean of the per-batch accuracies for this epoch
    """
    avg_acc = []
    model.train()       # switch to training mode
    for i, batch in enumerate(train_data):
        pred = model(batch.text)
        loss = criterion(pred, batch.label.long())
        acc = binary_acc(torch.max(pred, dim=1)[1], batch.label)
        avg_acc.append(acc.item())  # store a Python float so NumPy can average it, also on GPU
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Average over all batches
    avg_acc = np.array(avg_acc).mean()
    return avg_acc


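def evaluate(model, test_data):
    """
    Evaluate the model on the test data
    :param model: the model
    :param test_data: iterator over test batches
    :return: mean of the per-batch accuracies on the test data after this epoch
    """
    avg_acc = []
    model.eval()  # switch to evaluation mode
    with torch.no_grad():
        for i, batch in enumerate(test_data):
            pred = model(batch.text)
            acc = binary_acc(torch.max(pred, dim=1)[1], batch.label)
            avg_acc.append(acc.item())  # store a Python float so NumPy can average it, also on GPU
    return np.array(avg_acc).mean()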
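
Both helpers take the arg-max of the logits as the predicted class and report the mean of the per-batch accuracies. When the last batch is smaller than the others, that mean differs slightly from the accuracy over all samples. The plain-Python sketch below shows the difference on two hypothetical batches; the numbers and helper names are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def batch_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Two hypothetical batches of different size: (logits, labels)
batches = [
    ([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]], [1, 0, 0, 0]),  # 3 of 4 correct
    ([[0.9, 0.1]], [1]),                                                # 0 of 1 correct
]

per_batch = [batch_accuracy(lg, lb) for lg, lb in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Weight each batch by its size to get the accuracy over all samples
correct = sum(batch_accuracy(lg, lb) * len(lb) for lg, lb in batches)
total = sum(len(lb) for _, lb in batches)
sample_weighted = correct / total

print(per_batch)          # [0.75, 0.0]
print(mean_of_batches)    # 0.375
print(sample_weighted)    # 0.6
```

With equal-sized batches the two numbers coincide, so for this dataset the gap is small; the sketch only makes clear which quantity the helpers return.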

Import whatever packages these snippets need. Next comes creating the model, then training and testing it. The nerve-racking part again.

# Create the model
text_cnn = model.TextCNN(len(TEXT.vocab), config.EMBEDDING_SIZE, len(LABEL.vocab))
# Choose the optimizer
optimizer = optim.Adam(text_cnn.parameters(), lr=config.LEARNING_RATE)
# Choose the loss function
criterion = nn.CrossEntropyLoss()

# Accuracy history, kept for plotting
model_train_acc, model_test_acc = [], []

# Train the model
for epoch in range(config.EPOCH):
    train_acc = utils.train(text_cnn, train_iterator, optimizer, criterion)
    print("epoch = {}, train accuracy={}".format(epoch + 1, train_acc))

    test_acc = utils.evaluate(text_cnn, test_iterator)
    print("epoch = {}, test accuracy={}".format(epoch + 1, test_acc))

    model_train_acc.append(train_acc)
    model_test_acc.append(test_acc)

# Plot the training history
plt.plot(model_train_acc)
plt.plot(model_test_acc)
plt.ylim(ymin=0.5, ymax=1.01)
plt.title("The accuracy of textCNN model")
plt.legend(['train', 'test'])
plt.show()
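
The training code passes raw logits to `nn.CrossEntropyLoss`, which is why the model's `forward` ends at the linear layer with no softmax: the loss applies log-softmax internally and, by default, averages over the batch. The sketch below reproduces that computation in plain Python for a single batch; the function names and sample values are illustrative only.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mean_cross_entropy(batch_logits, labels):
    """Average the per-sample losses over a batch ('mean' reduction)."""
    losses = [cross_entropy(lg, y) for lg, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Equal logits mean a 50/50 guess between two classes, so the loss is ln 2
print(cross_entropy([0.0, 0.0], 1))
# Confident and correct predictions give a loss close to zero
print(mean_cross_entropy([[2.0, 0.0], [0.0, 2.0]], [0, 1]))
```

A loss near ln 2 (about 0.693) during training therefore means a two-class model is doing no better than chance.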

The final results of the model are shown below:

Figure: model training process

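These results differ little from the earlier ones, but the data processing took far less effort and is more standardized. So torchtext is worth taking the time to learn.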

3 Summary

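torchtext supports a fairly wide range of NLP tasks and also ships with some datasets. I have recently been working on named entity recognition with a bi-lstm+crf model. That task is essentially sequence labeling, which torchtext can also handle; I will cover it in a later post when time allows. All code and corpora for this post are on GitHub: https://github.com/Htring/NLP_Applications[4]. Code for other related applications will be added there over time; stars and feedback are welcome.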

References

[1] torchtext: https://pytorch.org/text/stable/index.html

[2] [Deep Learning] textCNN paper and principles: short text classification (based on pytorch): https://piqiandong.blog.csdn.net/article/details/110149143

[3] TorchText usage examples and complete code: https://blog.csdn.net/nlpuser/article/details/88067167

[4] NLP_Applications (GitHub repository): https://github.com/Htring/NLP_Applications

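First published on the WeChat official account 【AIAS编程有道】, also posted on Toutiao.

Written by 科皮子菊, who loves programming and algorithms. Follows, shares, comments and criticism are all welcome. See you in the next post!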
