Google launches a new hands-on AI course: machine learning step by step, from concepts to code

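By 妮岳排槐, from Aofei Temple
Produced by QbitAI | WeChat official account: QbitAI

If there is only one thing on your mind,

let us guess: is it learning?

Google hopes so, and it is ready to help you into the saddle and walk you part of the way.

So early this morning, another gift package arrived.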

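Step-by-step instruction

This spring, Google released its Machine Learning Crash Course, MLCC for short. Almost the entire course is available in Chinese, and it is completely free.

That was not enough.

Google felt that theory alone would not do: you have to combine theory with practice.

As the saying goes, knowledge and action should be one.

So Google has released its newest set of courses: Machine Learning Practica. These courses demonstrate how Google uses machine learning in its products.

The course is here:

https://developers.google.com/machine-learning/practica/

(We tested the .cn domain version of this address and it works.)

Unlike the earlier course, these hands-on courses combine video, documentation, and interactive programming exercises. The first one available is image classification.

In the image classification practicum you can learn how Google developed its state-of-the-art image classification model, which is also the core technology behind Google Photos.

So far, more than 10,000 Google employees have used this practicum to train their own image classifiers that can recognize cats and dogs in photos.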

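Prerequisites

The course does have some prerequisites.

There are two main ones:

Completion of Google's Machine Learning Crash Course, or familiarity with basic machine learning concepts
Solid programming fundamentals and some experience coding in Python

The practicum uses the Keras API, and its programming exercises run in Colab. No prior Keras experience is required for the Colab exercises.

The code in the course comes with essentially step-by-step explanations.

For now only the image classification practicum has been released, but Google says more are on the way.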

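Course overview

In this course, Google first introduces the basics of image classification and explains how convolutional neural networks (CNNs) are built, along with concepts such as pooling and fully connected layers.

Google then guides you through building a CNN from scratch, and teaches you how to prevent overfitting and how to use a pretrained model for feature extraction and fine-tuning.

The practicum contains three exercises:

Exercise 1: Build a Convnet for Cat-vs-Dog Classification
Walks you through building a convolutional network that classifies cats and dogs.
Exercise 2: Preventing Overfitting
Shows you how to prevent overfitting and improve the CNN model.
Exercise 3: Feature Extraction and Fine-Tuning
Shows you how to apply feature extraction and fine-tuning with Google's Inception v3 model to get better accuracy from the classifier built in the first two exercises.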

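Course demo

QbitAI slipped inside the course and brought back the second exercise. In this lesson, Google teaches how to reduce overfitting in cat-vs-dog image classification. Have a look:

Exercise 2: Reducing Overfitting

Estimated completion time: 30 minutes

In this exercise we build on the cat-vs-dog classification model created in Exercise 1 and improve its accuracy with two strategies for reducing overfitting: data augmentation and dropout regularization.

Like putting an elephant in a refrigerator, this takes a few steps. Four, in this case:

Explore data augmentation by applying random transformations to training images
Add data augmentation to our data preprocessing
Add dropout to the convnet
Retrain the model and evaluate loss and accuracy

Let's get started!

Exploring data augmentation

Data augmentation is a basic way to reduce overfitting in computer vision models. Because we have relatively few training examples, we make the most of them by "augmenting" them with random transformations, so that to the model they look like different images.

This is done by configuring random transformations to be applied to the images read by an ImageDataGenerator instance, for example: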

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

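These are just a few of the available options:

rotation_range is a value in degrees (0 to 180), a range within which pictures are randomly rotated.
width_shift_range and height_shift_range are ranges (as a fraction of total width or height) within which pictures are randomly translated horizontally or vertically.
shear_range is for randomly applying shearing transformations.
zoom_range is for randomly zooming inside pictures.
horizontal_flip is for randomly flipping half of the images horizontally.
fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.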
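
To make two of these options concrete, here is a minimal pure-Python sketch of what a horizontal flip and a width shift with nearest-pixel filling do to a tiny image. The helper names `flip_horizontal` and `shift_right` are hypothetical, and this is not how Keras implements the transformations; it only illustrates the idea on a nested list of pixel values.

```python
def flip_horizontal(image):
    """Mirror each row of a 2-D image (a list of rows)."""
    return [list(reversed(row)) for row in image]


def shift_right(image, pixels):
    """Shift each row right by `pixels`, filling the gap on the left
    with the nearest original pixel (the idea behind fill_mode='nearest')."""
    shifted = []
    for row in image:
        kept = row[:len(row) - pixels]
        shifted.append([row[0]] * pixels + kept)
    return shifted


image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = flip_horizontal(image)
moved = shift_right(image, 1)
print(flipped)
print(moved)
```

The pixel values are the same as in the original image, only rearranged, which is why augmented images remain closely related to their source.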

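Note: the 2,000 images used in this exercise are taken from the "Dogs vs. Cats" dataset on Kaggle, which contains 25,000 images. To save training time, we use only a subset here.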

!wget --no-check-certificate \
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip -O \
    /tmp/cats_and_dogs_filtered.zip

import os
import zipfile

local_zip = '/tmp/cats_and_dogs_filtered.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp')
zip_ref.close()

base_dir = '/tmp/cats_and_dogs_filtered'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

# Directory with our training cat pictures
train_cats_dir = os.path.join(train_dir, 'cats')

# Directory with our training dog pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')

# Directory with our validation cat pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')

# Directory with our validation dog pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')

train_cat_fnames = os.listdir(train_cats_dir)
train_dog_fnames = os.listdir(train_dogs_dir)

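Next, we apply the datagen transformations to a cat image from the training set to produce five random variants. Rerun the cell a few times to see fresh batches of random variants.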

%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from keras.preprocessing.image import array_to_img, img_to_array, load_img

img_path = os.path.join(train_cats_dir, train_cat_fnames[2])
img = load_img(img_path, target_size=(150, 150))  # this is a PIL image
x = img_to_array(img)  # Numpy array with shape (150, 150, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 150, 150, 3)

# The .flow() command below generates batches of randomly transformed images
# It will loop indefinitely, so we need to `break` the loop at some point!
i = 0
for batch in datagen.flow(x, batch_size=1):
    plt.figure(i)
    imgplot = plt.imshow(array_to_img(batch[0]))
    i += 1
    if i % 5 == 0:
        break

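Adding data augmentation to the preprocessing step

Now let's add the data augmentation transformations above to our data preprocessing configuration: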

# Adding rescale, rotation_range, width_shift_range, height_shift_range,
# shear_range, zoom_range, and horizontal flip to our ImageDataGenerator
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(rescale=1./255)

# Flow training images in batches of 20 using train_datagen generator
train_generator = train_datagen.flow_from_directory(
    train_dir,  # This is the source directory for training images
    target_size=(150, 150),  # All images will be resized to 150x150
    batch_size=20,
    # Since we use binary_crossentropy loss, we need binary labels
    class_mode='binary')

# Flow validation images in batches of 20 using test_datagen generator
validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

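The nice part is that a model trained on augmented data never treats two inputs as the same example, even though they come from the same picture. In the model's eyes these inputs are still heavily correlated, though, so augmentation alone is not enough to eliminate overfitting completely.

Adding dropout

There is another popular strategy for reducing overfitting: dropout.

If you want to review the basic concepts, here are two shameless plugs for the relevant material in the earlier free course:

https://developers.google.com/machine-learning/crash-course/training-neural-networks/video-lecture

https://developers.google.com/machine-learning/crash-course/

Let's reconfigure the convnet architecture from Exercise 1 and add some dropout right before the final classification layer.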

from keras.models import Model
from keras import layers
from keras.optimizers import RMSprop
from keras import backend as K

import tensorflow as tf

# Configure the TF backend session (TensorFlow 1.x API)
tf_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(allow_growth=True))
K.set_session(tf.Session(config=tf_config))

# Our input feature map is 150x150x3: 150x150 for the image pixels, and 3 for
# the three color channels: R, G, and B
img_input = layers.Input(shape=(150, 150, 3))

# First convolution extracts 16 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(16, 3, activation='relu')(img_input)
x = layers.MaxPooling2D(2)(x)

# Second convolution extracts 32 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(32, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

# Third convolution extracts 64 filters that are 3x3
# Convolution is followed by max-pooling layer with a 2x2 window
x = layers.Conv2D(64, 3, activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

# Flatten feature map to a 1-dim tensor
x = layers.Flatten()(x)

# Create a fully connected layer with ReLU activation and 512 hidden units
x = layers.Dense(512, activation='relu')(x)

# Add a dropout rate of 0.5
x = layers.Dropout(0.5)(x)

# Create output layer with a single node and sigmoid activation
output = layers.Dense(1, activation='sigmoid')(x)

# Configure and compile the model
model = Model(img_input, output)
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['acc'])
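
To show what a dropout rate of 0.5 means for the values flowing through the network, here is a minimal pure-Python sketch of the common "inverted dropout" formulation. The `dropout` helper is hypothetical and written only for illustration; it is not the Keras layer's implementation.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and scale the
    survivors by 1 / (1 - rate) so the expected sum stays the same."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

Every entry comes out either as 0.0 (dropped) or 2.0 (kept and rescaled). Dropout is only applied during training; at prediction time the activations pass through unchanged.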

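Retraining the model

With data augmentation and dropout in place, we need to retrain the convnet.

This time we train on all 2,000 images for 30 epochs and validate on all 1,000 test images.

This may take a few minutes. See whether you can write the code yourself.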

# WRITE CODE TO TRAIN THE MODEL ON ALL 2000 IMAGES FOR 30 EPOCHS, AND VALIDATE
# ON ALL 1,000 TEST IMAGES
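
One way to approach the exercise is to work out how many batches make up an epoch. With the batch size of 20 used by the generators, 2,000 training images need 100 steps and 1,000 validation images need 50. The sketch below only computes those numbers; the `steps_for` helper and the commented-out `fit_generator` call are assumptions about how one might wire it up with the Keras 2.x-era API, not the course's official solution.

```python
import math


def steps_for(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


BATCH_SIZE = 20  # matches batch_size in the generators above

fit_kwargs = {
    "steps_per_epoch": steps_for(2000, BATCH_SIZE),   # training images
    "epochs": 30,
    "validation_steps": steps_for(1000, BATCH_SIZE),  # validation images
}
print(fit_kwargs)

# With the model and generators defined earlier, the call would look
# roughly like this (untested sketch):
# history = model.fit_generator(
#     train_generator,
#     validation_data=validation_generator,
#     verbose=2,
#     **fit_kwargs)
```

Whatever form the training call takes, its return value needs to be stored in `history`, because the evaluation code reads the per-epoch metrics from it.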

Evaluating the results

Next, let's evaluate the results of training the model with data augmentation and dropout.

# Retrieve a list of accuracy results on training and test data
# sets for each training epoch
acc = history.history['acc']
val_acc = history.history['val_acc']

# Retrieve a list of loss results on training and test data
# sets for each training epoch
loss = history.history['loss']
val_loss = history.history['val_loss']

# Get number of epochs
epochs = range(len(acc))

# Plot training and validation accuracy per epoch
plt.plot(epochs, acc)
plt.plot(epochs, val_acc)
plt.title('Training and validation accuracy')

plt.figure()

# Plot training and validation loss per epoch
plt.plot(epochs, loss)
plt.plot(epochs, val_loss)
plt.title('Training and validation loss')

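Much better! The model is no longer overfitting.

In fact, judging by the training profile, accuracy could reach about 80% with more epochs of training!

Clean up

Before running Exercise 3, run the following cell to terminate the kernel and free memory resources: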

import os, signal
os.kill(os.getpid(), signal.SIGKILL)

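One More Thing

Perhaps in the rush something slipped: as we went to press, this brand-new course had an awkward problem. The exercises could not be accessed.

Clicking an exercise should take you to a Colab page, but most users are stopped at a page like this instead:

Link: https://login.corp.google.com

What is this?

This is the famous moma, Google's internal search tool. If you are a Google employee, you can log in and enter Google's intranet.

The likely explanation is that this practicum, like MLCC, started out as a course for Google employees, hence the slightly awkward situation.

We expect the problem will be fixed soon.

In the meantime, you can look through the course demo that QbitAI carried over above.

No rush~

(End)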


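[CSDN editor's note] A month ago we published an article titled "In Three Years, Will Artificial Intelligence Completely Change Front-End Development?", which introduced Screenshot-to-code-in-Keras, then the number one trending project on GitHub. In this project, a neural network uses deep learning to turn design mockups into HTML and CSS automatically, and its author, Emil Wallner, stated that "within three years, artificial intelligence will completely change front-end development."

That bold claim immediately sparked heated discussion in China and abroad, with optimism and worry, praise and objection. Emil Wallner responded with a series of carefully worked articles. In "Turning Design Mockups Into Code With Deep Learning" in particular, he describes in detail how he built a powerful front-end code generation model based on pix2code and other papers, and explains how he used LSTMs and CNNs to turn design mockups into HTML and CSS websites.

The full article follows: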

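Within the next three years, deep learning will change front-end development. It will speed up prototyping and lower the barrier to building software.

The field took off last year, when Tony Beltramelli introduced the pix2code paper [1] and Airbnb launched sketch2code [2].

Currently, the biggest barrier to automating front-end development is computing power. However, we can already use current deep learning algorithms, together with synthesized training data, to start exploring artificial front-end automation.

In this article we show how to train a neural network to write basic HTML and CSS from a design mockup. Here is a brief overview of the process:

Give a design image to the trained neural network

The neural network converts the image into HTML markup

Full-size image: https://blog.floydhub.com/generate_html_markup-b6ceec69a7c9cfd447d188648049f2a4.gif

Rendered output

We will build the neural network in three iterations.

First, we build a bare-minimum version to get the hang of the moving parts. The second version, HTML, focuses on automating every step and explains the layers of the neural network. In the final version, Bootstrap, we create a model that can generalize and explore the LSTM layer.

All the code is available as Jupyter notebooks on GitHub [3] and FloydHub [4]. The FloydHub notebooks are in the "floydhub" directory and the local equivalents are in the "local" directory.

The models are based on Beltramelli's pix2code paper and Jason Brownlee's image captioning tutorials [5]. The code is written in Python and Keras, a framework on top of TensorFlow.

If you are new to deep learning, I recommend getting familiar with Python, backpropagation, and convolutional neural networks first. My three earlier posts can help:

My first week of learning deep learning [6]
Exploring the history of deep learning through code [7]
Colorizing black-and-white photos with neural networks [8]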

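Core logic

Our goal can be summed up as follows: build a neural network that generates HTML/CSS markup matching a design screenshot.

When you train the neural network, you give it several screenshots with matching HTML.

It learns by predicting the matching HTML markup tags one at a time. When it predicts the next tag, it receives the screenshot as well as all the correct tags up to that point.

The following Google Sheet shows a simple example of the training data:

https://docs.google.com/spreadsheets/d/1xXwarcQZAHluorveZsACtXRdmNFbwGtN3WMNhcTdEyQ/edit?usp=sharing

There are other approaches [9] to training the network, but building a model that predicts word by word is the most common one today, so that is the approach used in this tutorial.

Note that every prediction is based on the same screenshot, so if the network has to predict 20 words, it sees the same screenshot 20 times. For now, set aside how the neural network works and focus on its inputs and outputs.

Let's start with the "previous markup". Say we train the network to predict the sentence "I can code". When it receives "I", it predicts "can". Next it receives "I can" and predicts "code". Each time, it receives all the previous words but only has to predict the next one.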
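
The way one sentence is split into training examples can be sketched in a few lines of plain Python. The `next_word_pairs` helper is a hypothetical name used only for this illustration; the article's own preprocessing code appears later.

```python
def next_word_pairs(tokens):
    """Return (previous words, next word) pairs for one sentence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_word_pairs(["I", "can", "code"])
for context, target in pairs:
    print(context, "->", target)
```

A sentence of three words yields two training examples, and in the real model each of them is paired with the same screenshot.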

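The neural network creates features from the data. It has to use those features to link the input data to the output data, building representations of what is in each screenshot and of the HTML syntax it has predicted. The knowledge built up this way is what it uses to predict the next tag.

Using the trained model in practice is much like training it. The text is generated one token at a time, from the same screenshot each time. The difference is that instead of being fed the correct HTML tags, the model receives only the markup it has generated so far and predicts the next tag. Prediction starts from a "start" tag and stops when the model predicts an "end" tag or hits a maximum limit. The following Google Sheet shows another example:

https://docs.google.com/spreadsheets/d/1yneocsAb_w3-ZUdhwJ1odfsxR2kr-4e_c5FabQbNJrs/edit#gid=0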
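
The generation loop described above can be sketched without any neural network. In this illustration the "model" is a hypothetical lookup table (`CANNED`, `fake_model`) that stands in for a trained network and ignores the screenshot; only the loop structure, feeding back everything generated so far and stopping at "end" or at a length limit, reflects the text.

```python
def generate(predict_next, max_tokens=10):
    """Greedy decoding: feed back everything generated so far."""
    tokens = ["start"]
    while len(tokens) < max_tokens:
        word = predict_next(tokens)
        tokens.append(word)
        if word == "end":
            break
    return tokens


# Hypothetical stand-in for a trained model: a lookup keyed on the last token.
CANNED = {"start": "<h1>", "<h1>": "Hello", "Hello": "</h1>", "</h1>": "end"}


def fake_model(tokens):
    return CANNED[tokens[-1]]


markup = generate(fake_model)
print(" ".join(markup))
```

The `max_tokens` cap matters in practice: a model that never predicts "end" would otherwise loop forever.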

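The Hello World version

Let's build a "hello world" version. We feed the neural network a screenshot of a web page that displays "Hello World" and teach it to generate the markup.

Full-size image: https://blog.floydhub.com/hello_world_generation-039d78c27eb584fa639b89d564b94772.gif

First, the neural network maps the design mockup into a list of pixel values, from 0 to 255 in each of three channels: red, green, and blue.

To represent the markup in a way the network understands, I use one-hot encoding [10]. The sentence "I can code" would be mapped as shown in the figure below:

The figure above includes "start" and "end" tags. They tell the network where to start its predictions and where to stop.

For the input data we use sentences: the first contains only the first word, and each later one adds one more word. The output data is always a single word.

Sentences follow the same logic as words, but they also need to have the same input length. Where words are capped by the vocabulary size, sentences are capped by a maximum sentence length. A sentence shorter than the maximum length is filled up with empty words, that is, words made entirely of zeros.

As the figure shows, the words are laid out from right to left. This forces each word to change position in each training round, so the model learns the order of the words instead of memorizing the position of each one.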
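
Since the figures are not reproduced here, the following pure-Python sketch shows the encoding in numbers. The five-word vocabulary and the helper names `one_hot` and `encode_sentence` are assumptions made for the illustration; the padding goes on the left so that the newest word always sits in the last position.

```python
VOCAB = ["start", "I", "can", "code", "end"]


def one_hot(word):
    """Vector with a single 1 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def encode_sentence(words, max_len):
    """One-hot encode `words` and left-pad with all-zero 'empty words'."""
    empty = [0.0] * len(VOCAB)
    padding = [empty] * (max_len - len(words))
    return padding + [one_hot(w) for w in words]


encoded = encode_sentence(["start", "I"], max_len=4)
for row in encoded:
    print(row)
```

Each time a word is added to the sentence, the earlier words move one position to the left, which is the position change the text refers to.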

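The figure below shows four predictions, one per row. On the left are the image, represented by its red, green, and blue channels, and the previous words. Outside the brackets are the predictions, one at a time, with a final red square marking the end.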

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.layers import Dense, Input, LSTM, RepeatVector, concatenate
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img

# Length of longest sentence
max_caption_len = 3
# Size of vocabulary
vocab_size = 3

# Load one screenshot for each word and turn them into digits
images = []
for i in range(2):
    images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
images = np.array(images, dtype=float)
# Preprocess input for the VGG16 model
images = preprocess_input(images)

# Turn start tokens into one-hot encoding
html_input = np.array(
    [[[0., 0., 0.],   # start
      [0., 0., 0.],
      [1., 0., 0.]],
     [[0., 0., 0.],   # start <HTML>Hello World!</HTML>
      [1., 0., 0.],
      [0., 1., 0.]]])

# Turn next word into one-hot encoding
next_words = np.array(
    [[0., 1., 0.],    # <HTML>Hello World!</HTML>
     [0., 0., 1.]])   # end

# Load the VGG16 model trained on imagenet and output the classification feature
VGG = VGG16(weights='imagenet', include_top=True)
# Extract the features from the image
features = VGG.predict(images)

# Load the feature to the network, apply a dense layer, and repeat the vector
vgg_feature = Input(shape=(1000,))
vgg_feature_dense = Dense(5)(vgg_feature)
vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)

# Extract information from the input sequence
# (sentence length x vocabulary size; both are 3 in this example)
language_input = Input(shape=(max_caption_len, vocab_size))
language_model = LSTM(5, return_sequences=True)(language_input)

# Concatenate the information from the image and the input
decoder = concatenate([vgg_feature_repeat, language_model])
# Extract information from the concatenated output
decoder = LSTM(5, return_sequences=False)(decoder)
# Predict which word comes next
decoder_output = Dense(vocab_size, activation='softmax')(decoder)

# Compile and run the neural network
model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)

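In the hello world version we use three tokens: "start", "<HTML><center><H1>Hello World!</H1></center></HTML>", and "end". A token can be anything: a character, a word, or a sentence. Character tokens need a smaller vocabulary but constrain what the network can learn. Word-level tokens tend to perform best.

Now we make the prediction: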

# Create an empty sentence and insert the start token
sentence = np.zeros((1, 3, 3))  # [[0,0,0], [0,0,0], [0,0,0]]
start_token = [1., 0., 0.]  # start
sentence[0][2] = start_token  # place start in empty sentence

# Making the first prediction with the start token
second_word = model.predict([np.array([features[1]]), sentence])

# Put the second word in the sentence and make the final prediction
sentence[0][1] = start_token
sentence[0][2] = np.round(second_word)
third_word = model.predict([np.array([features[1]]), sentence])

# Place the start token and our two predictions in the sentence
sentence[0][0] = start_token
sentence[0][1] = np.round(second_word)
sentence[0][2] = np.round(third_word)

# Transform our one-hot predictions into the final tokens
vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
for i in sentence[0]:
    print(vocabulary[np.argmax(i)], end=' ')

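Output

10 epochs: start start start
100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end

Mistakes I made

Build a first working version before gathering data. Early in this project I managed to get hold of an old archive of the Geocities hosting site, with 38 million websites. Mesmerized by the potential, I overlooked the huge workload needed to reduce its 100K-sized vocabulary.
Dealing with a terabyte of data takes good hardware or a lot of patience. After my Mac ran into several problems, I ended up using a powerful remote server. For a decent workflow, expect to rent a rig with 8 CPU cores and 1 Gbps of bandwidth.
Nothing made sense until I understood the input and output data. The input, X, is one screenshot and the previous markup tags. The output, Y, is the next markup tag. Once I understood that, everything else became easier to follow, and trying different architectures got easier too.
Stay focused and resist the rabbit holes. Because this project touches so many areas of deep learning, I got stuck in plenty of them along the way. I spent a week programming RNNs from scratch, got too fascinated by embedding vector spaces, and was seduced by exotic implementations.
Picture-to-code networks are image captioning models in disguise. Even after I learned this, I ignored many image captioning papers simply because they seemed less cool. Knowing that literature helps you learn the problem space much faster.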

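Running the code on FloydHub

FloydHub is a training platform for deep learning. I came across it when I first started learning deep learning and have used it ever since to train and manage my experiments. You can install it and run your first model within 10 minutes, and in my view it is the best option for running models on cloud GPUs.

If you are new to FloydHub, see the official "2-minute installation" guide or my "5-minute walkthrough" [11].

Clone the repository:

git clone https://github.com/emilwallner/Screenshot-to-code-in-Keras.git

Log in and initialize the FloydHub command-line tool:

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c

Run a Jupyter notebook on a FloydHub cloud GPU machine:

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter

All the notebooks are in the "floydhub" directory, and the local equivalents are in the "local" directory. Once it is running, you can find the first notebook here:

floydhub/Helloworld/helloworld.ipynb

For details on the command-line flags, see this earlier post of mine:

https://blog.floydhub.com/colorizing-b&w-photos-with-neural-networks/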

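The HTML version

In this version we automate many of the steps from the Hello World model. This section focuses on building a scalable implementation that can handle any amount of input data, and on the key moving parts of the neural network.

This version cannot yet predict HTML for arbitrary websites, but it is a good setup for exploring the key technical problems and a big step toward the final result.

Overview

If we expand the components of the previous graphic, it looks like this:

There are two major sections. The first is the encoder, where we create image features and previous-markup features. Features are the building blocks the network creates to connect design mockups with markup. At the end of the encoder, we glue the image features to each word of the previous markup.

The second is the decoder. It takes the combined design and markup features and creates a next-tag feature. That feature is run through a fully connected neural network to predict the next tag.

Design mockup features

Because we need to insert one screenshot for every word, this becomes a bottleneck when training the network. So instead of using the images directly, we extract the information needed to generate the markup.

The extracted information is encoded into image features. This is done with a pretrained convolutional neural network (CNN), one trained on ImageNet.

The last layer of the CNN is the classification layer, so we extract the image features from the layer before it.

We end up with 1,536 feature maps of 8x8 pixels. They are hard for us to interpret, but a neural network can extract the objects and positions of the elements from them.

Markup features

In the hello world version we used a one-hot encoding to represent the markup. In this version we use a word embedding for the input and keep the one-hot encoding for the output.

Sentences are structured the same way as before, but the way each token is mapped changes. One-hot encoding treats each word as an isolated unit. Here we convert each word in the input data into a list of numbers that represent the relationships between the markup tags.

The word embedding in the example above has 8 dimensions; in practice it usually has between 50 and 500, depending on the size of the vocabulary.

The eight numbers for each word are weights, much like in a vanilla neural network. They are tuned to map how the words relate to each other (Mikolov et al., 2013 [12]).

This is how we start building the markup features. Features are what the network develops to link the input data with the output data. Don't worry about what they are for now; we dig deeper into this in the next section.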

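The encoder

We take the word embeddings and run them through an LSTM, which returns a sequence of markup features. These are run through a TimeDistributed dense layer, which you can think of as a dense layer with multiple inputs and outputs.

In parallel, the image features are first flattened: whatever structure the numbers had, they are turned into one large list of numbers. We then apply a dense layer to form higher-level features, and finally concatenate these image features with the markup features.

This may be hard to follow, so let's break it down.

Markup features

First we run the word embeddings through the LSTM layer. As the figure below shows, all sentences are padded to the maximum length of three tokens.

To mix signals and find higher-level patterns, we apply a TimeDistributed dense layer to the markup features produced by the LSTM. A TimeDistributed dense layer is a dense layer with multiple inputs and outputs.

Image features

In parallel, we prepare the images. We take all the mini image features and turn them into one long list. The information is unchanged, just reorganized.

Again, to mix signals and extract higher-level notions, we apply a dense layer. Because there is only one input, a normal dense layer will do. To connect the image features to the markup features, we copy the image features.

In this example we have three markup features, so we end up with three copies of the image features as well.

Concatenating the image and markup features

All the sentences are padded to create three markup features. With the image features prepared, we can now add one image feature to each markup feature.

After attaching an image feature to each markup feature, we have three image-markup features. This is the input we feed to the decoder.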
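
The copy-and-attach step can be shown with plain lists. This is only a sketch of the data movement: the helper names `repeat_vector` and `concatenate` mirror the Keras layers used in the listing but are re-implemented here in pure Python, and all feature values are made up.

```python
def repeat_vector(vector, times):
    """Copy one feature vector `times` times (one copy per token)."""
    return [list(vector) for _ in range(times)]


def concatenate(image_rows, markup_rows):
    """Join the i-th image copy with the i-th markup feature."""
    return [img + tok for img, tok in zip(image_rows, markup_rows)]


image_feature = [0.9, 0.1]                 # made-up values
markup_features = [[0.0, 0.0, 0.0],        # padding
                   [0.2, 0.4, 0.6],        # made-up values
                   [0.1, 0.3, 0.5]]

combined = concatenate(repeat_vector(image_feature, 3), markup_features)
for row in combined:
    print(row)
```

Each of the three rows now carries the same image information next to a different markup feature.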

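The decoder

Here we use the combined image-markup features to predict the next tag.

In the example below, we use three image-markup feature pairs and output one next-tag feature.

Note that the LSTM layer has return_sequences set to False. Instead of returning the full length of the input sequence, it predicts a single feature, in our case the feature for the next tag, which contains the information for the final prediction.

The final prediction

The dense layer works like a traditional feedforward neural network. It connects the 512 numbers in the next-tag feature with the 4 final predictions. Say our vocabulary has four words: start, hello, world, and end.

The softmax activation in the dense layer produces a probability distribution in which each value is between 0 and 1 and all values sum to 1. The vocabulary prediction could be [0.1, 0.1, 0.1, 0.7], in which case the 4th word is predicted as the next tag. You can then translate the one-hot encoding [0, 0, 0, 1] into the mapped value, "end".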
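
The last step, from scores to a word, can be sketched in plain Python. The raw scores passed to `softmax` are made up, and `pick_word` is a hypothetical helper; the four-word vocabulary and the probabilities [0.1, 0.1, 0.1, 0.7] are the ones from the text.

```python
import math

VOCAB = ["start", "hello", "world", "end"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def pick_word(probabilities):
    """Return the vocabulary word with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return VOCAB[best]


probs = softmax([0.0, 0.0, 0.0, 2.0])   # made-up raw scores
print(pick_word(probs))
print(pick_word([0.1, 0.1, 0.1, 0.7]))  # the example from the text
```

Taking the index of the largest probability is what `np.argmax` does in the listings.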

from os import listdir

import numpy as np
from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
from keras.layers import (Dense, Embedding, Flatten, Input, LSTM,
                          RepeatVector, TimeDistributed, concatenate)
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# Load the images and preprocess them for inception-resnet
images = []
all_filenames = listdir('images/')
all_filenames.sort()
for filename in all_filenames:
    images.append(img_to_array(load_img('images/' + filename, target_size=(299, 299))))
images = np.array(images, dtype=float)
images = preprocess_input(images)

# Run the images through inception-resnet and extract the features without the classification layer
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
features = IR2.predict(images)

# We will cap each input sequence to 100 tokens
max_caption_len = 100
# Initialize the function that will create our vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)

# Read a document and return a string
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()

# Load all the HTML files
X = []
all_filenames = listdir('html/')
all_filenames.sort()
for filename in all_filenames:
    X.append(load_doc('html/' + filename))

# Create the vocabulary from the html files
tokenizer.fit_on_texts(X)

# Add +1 to leave space for empty words
vocab_size = len(tokenizer.word_index) + 1
# Translate each word in text file to the matching vocabulary index
sequences = tokenizer.texts_to_sequences(X)
# The longest HTML file
max_length = max(len(s) for s in sequences)

# Initialize our final input to the model
X, y, image_data = list(), list(), list()
for img_no, seq in enumerate(sequences):
    for i in range(1, len(seq)):
        # Add the entire sequence to the input and only keep the next word for the output
        in_seq, out_seq = seq[:i], seq[i]
        # If the sentence is shorter than max_length, fill it up with empty words
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # Map the output to one-hot encoding
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # Add an image corresponding to the HTML file
        image_data.append(features[img_no])
        # Cut the input sentence to 100 tokens, and add it to the input data
        X.append(in_seq[-100:])
        y.append(out_seq)

X, y, image_data = np.array(X), np.array(y), np.array(image_data)

# Create the encoder
image_features = Input(shape=(8, 8, 1536,))
image_flat = Flatten()(image_features)
image_flat = Dense(128, activation='relu')(image_flat)
ir2_out = RepeatVector(max_caption_len)(image_flat)

language_input = Input(shape=(max_caption_len,))
language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)

# Create the decoder
decoder = concatenate([ir2_out, language_model])
decoder = LSTM(512, return_sequences=False)(decoder)
decoder_output = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[image_features, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)

# Map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# Generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # Seed the generation process
    in_text = 'START'
    # Iterate over the whole length of the sequence
    for i in range(900):
        # Integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
        # Pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # Predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # Convert probability to integer
        yhat = np.argmax(yhat)
        # Map integer to word
        word = word_for_id(yhat, tokenizer)
        # Stop if we cannot map the word
        if word is None:
            break
        # Append as input for generating the next word
        in_text += ' ' + word
        # Print the prediction
        print(' ' + word, end='')
        # Stop if we predict the end of the sequence
        if word == 'END':
            break
    return

# Load an image, preprocess it for IR2, extract features and generate the HTML
test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
test_image = np.array(test_image, dtype=float)
test_image = preprocess_input(test_image)
test_features = IR2.predict(np.array([test_image]))
generate_desc(model, tokenizer, np.array(test_features), 100)

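Output

Links to the generated websites:

250 epochs: https://emilwallner.github.io/html/250_epochs/
350 epochs: https://emilwallner.github.io/html/350_epochs/
450 epochs: https://emilwallner.github.io/html/450_epochs/
550 epochs: https://emilwallner.github.io/html/550_epochs/

If you don't see anything when you click these links, you can choose "View Page Source". Here is the original website for reference:

https://emilwallner.github.io/html/Original/

Mistakes I made

LSTMs turned out to be far more complicated than I expected compared with CNNs. What finally helped was unrolling all the LSTMs. For RNNs, see this video (http://course.fast.ai/lessons/lesson6.html). Also, get the input and output features straight before trying to understand how they work.
Building a vocabulary from the ground up is a lot easier than narrowing down a huge one. A vocabulary can include anything: fonts, div sizes, hex colors, variable names, and ordinary words.
Most libraries are built to parse text documents, not code. In documents every word is separated by a space, but code is different, so you have to work out your own way of parsing it.
Extracting features with a model pretrained on ImageNet may not be a good idea. ImageNet contains few web page images, so the loss was 30% higher than for the pix2code model, which is trained from scratch. It would be interesting to see what happens with an inception-resnet style model pretrained on web screenshots.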

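The Bootstrap version

In the final version, we use a dataset of generated Bootstrap websites from the pix2code paper. By using Twitter's Bootstrap (https://getbootstrap.com/), we can combine HTML and CSS and reduce the size of the vocabulary.

We enable the model to generate markup for a screenshot it has never seen before, and we also look into how it builds knowledge about the screenshot and the markup.

Instead of training on the Bootstrap markup directly, we use 17 simplified tokens that are then translated into HTML and CSS. The dataset [13] includes 1,500 test screenshots and 250 validation screenshots. With an average of 65 tokens per screenshot, that makes 96,925 training examples.

By tweaking the model from the pix2code paper, our model can predict the components of a web page with 97% accuracy (BLEU 4-ngram greedy search, explained in more detail later).

An end-to-end approach

Image captioning models can extract features from pretrained models, but after a few experiments I found that pix2code's end-to-end approach works better for this problem. The pretrained models have not been trained on web data, and they are customized for classification.

In this model, we replace the pretrained image features with a light convolutional neural network. Instead of using max-pooling to increase information density, we increase the strides, which preserves the position and color of the front-end elements.

Two core models make this possible: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The most common recurrent neural network is the long short-term memory (LSTM) network, so that is the one I refer to here.

There are plenty of CNN tutorials, and I have covered CNNs in other articles. Here I focus on LSTMs.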

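Understanding timesteps in LSTMs

One of the harder things to grasp about LSTMs is timesteps. A vanilla neural network can be thought of as having two timesteps: given "Hello" (the first timestep), it predicts "World" (the second), but it cannot predict more timesteps than that. In the example below, the input has four timesteps, one for each word.

LSTMs are made for input with timesteps; they are neural networks customized for ordered information. If you unroll the model, you will see that the weights stay the same at every step down the sequence. One set of weights is applied to the previous output and another to the new input.

Next, the weighted input and the weighted previous output are added together and passed through an activation function to give the output for that timestep. Because the weights are reused at every timestep, they can pick up information from several inputs and so build up knowledge of the word order.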
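
The weight sharing described above can be sketched with a deliberately tiny recurrent network. Note the simplifications: this is a plain RNN rather than an LSTM (no gates, no cell state), the inputs and state are single numbers, and the weights 0.5 and 0.8 are arbitrary values chosen for the illustration.

```python
import math


def run_rnn(inputs, w_input, w_state):
    """Apply the same two weights at every timestep of a scalar RNN."""
    state = 0.0
    outputs = []
    for x in inputs:
        # weighted new input + weighted previous output, then activation
        state = math.tanh(w_input * x + w_state * state)
        outputs.append(state)
    return outputs


# Same values in a different order: the final outputs differ because
# each step also depends on the previous output.
forward = run_rnn([1.0, 0.5, 0.0], w_input=0.5, w_state=0.8)
backward = run_rnn([0.0, 0.5, 1.0], w_input=0.5, w_state=0.8)
print(forward)
print(backward)
```

Because the final output depends on the order in which the inputs arrived, a network built this way can learn sequence order rather than just which words were present.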

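The figure below is a simplified illustration of what happens at each timestep in an LSTM.

To get a feel for this logic, I recommend building an RNN from scratch by following Andrew Trask's brilliant tutorial [14].

Understanding the units in LSTM layers

The number of units in an LSTM layer determines its ability to remember, and also the size of each output feature. Again, a feature is a long list of numbers used to pass information between layers.

Each unit in the LSTM layer learns to keep track of a different aspect of the syntax. The figure below visualizes a unit that keeps track of the information in the row div. This is the simplified markup we use to train the Bootstrap model.

Each LSTM unit maintains a cell state. Think of the cell state as the unit's memory. The weights and activations modify the state in different ways, which lets the LSTM layer fine-tune what information to keep and what to discard for each input.

In addition to passing on an output feature for each input, the layer also forwards the cell states, one value for each unit in the LSTM. To get a feel for how the components of an LSTM interact, I recommend:

Colah's tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Jayasiri's NumPy implementation: http://blog.varunajayasiri.com/numpy_lstm.html
Karpathy's lecture and write-up: https://www.youtube.com/watch?v=yCC09vCHzF8; https://karpathy.github.io/2015/05/21/rnn-effectiveness/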

from os import listdir

import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import (Conv2D, Dense, Dropout, Embedding, Flatten, Input,
                          LSTM, RepeatVector, concatenate)
from keras.models import Model, Sequential
from keras.optimizers import RMSprop
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

dir_name = 'resources/eval_light/'

# Read a file and return a string
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()

def load_data(data_dir):
    text = []
    images = []
    # Load all the files and order them
    all_filenames = listdir(data_dir)
    all_filenames.sort()
    for filename in all_filenames:
        if filename[-3:] == "npz":
            # Load the images already prepared in arrays
            image = np.load(data_dir + filename)
            images.append(image['features'])
        else:
            # Load the bootstrap tokens and wrap them in a start and end tag
            syntax = '<START> ' + load_doc(data_dir + filename) + ' <END>'
            # Separate all the words with a single space
            syntax = ' '.join(syntax.split())
            # Add a space before each comma
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

train_features, texts = load_data(dir_name)

# Initialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

# Add one spot for the empty word in the vocabulary
vocab_size = len(tokenizer.word_index) + 1
# Map the input sentences into the vocabulary indexes
train_sequences = tokenizer.texts_to_sequences(texts)
# The longest set of bootstrap tokens
max_sequence = max(len(s) for s in train_sequences)
# Specify how many tokens to have in each input sentence
max_length = 48

def preprocess_data(sequences, features):
    X, y, image_data = list(), list(), list()
    for img_no, seq in enumerate(sequences):
        for i in range(1, len(seq)):
            # Add the sentence until the current count (i) and add the current count to the output
            in_seq, out_seq = seq[:i], seq[i]
            # Pad all the input token sentences to max_sequence
            in_seq = pad_sequences([in_seq], maxlen=max_sequence)[0]
            # Turn the output into one-hot encoding
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Add the corresponding image to the bootstrap token file
            image_data.append(features[img_no])
            # Cap the input sentence to 48 tokens and add it
            X.append(in_seq[-48:])
            y.append(out_seq)
    return np.array(X), np.array(y), np.array(image_data)

X, y, image_data = preprocess_data(train_sequences, train_features)

# Create the encoder
image_model = Sequential()
image_model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(256, 256, 3,)))
image_model.add(Conv2D(16, (3, 3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
image_model.add(Conv2D(32, (3, 3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
image_model.add(Conv2D(64, (3, 3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))

image_model.add(Flatten())
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))

image_model.add(RepeatVector(max_length))

visual_input = Input(shape=(256, 256, 3,))
encoded_image = image_model(visual_input)

language_input = Input(shape=(max_length,))
language_model = Embedding(vocab_size, 50, input_length=max_length, mask_zero=True)(language_input)
language_model = LSTM(128, return_sequences=True)(language_model)
language_model = LSTM(128, return_sequences=True)(language_model)

# Create the decoder
decoder = concatenate([encoded_image, language_model])
decoder = LSTM(512, return_sequences=True)(decoder)
decoder = LSTM(512, return_sequences=False)(decoder)
decoder = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[visual_input, language_input], outputs=decoder)
optimizer = RMSprop(lr=0.0001, clipvalue=1.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

# Save the model for every 2nd epoch
filepath = "org-weights-epoch-{epoch:04d}--val_loss-{val_loss:.4f}--loss-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_weights_only=True, period=2)
callbacks_list = [checkpoint]

# Train the model
model.fit([image_data, X], y, batch_size=64, shuffle=False, validation_split=0.1, callbacks=callbacks_list, verbose=1, epochs=50)

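Test accuracy

It is tricky to find a fair way to measure accuracy. You could compare word by word, but if a single word in the prediction is out of sync, the accuracy might be 0%; and if you remove one word to bring the prediction back in sync, you might end up with 99/100.

I used the BLEU score, which is best practice for machine translation and image captioning models. It breaks the sentence into four kinds of n-grams, from sequences of one word up to sequences of four. In the example below, "cat" in the prediction is supposed to be "code".

To get the final score, multiply each n-gram score by 25% and add them up: (4/5) * 0.25 + (2/4) * 0.25 + (1/3) * 0.25 + (0/2) * 0.25 = 0.2 + 0.125 + 0.083 + 0 = 0.408. The sum is then multiplied by a sentence-length penalty. Because the predicted sentence in this example has the correct length, 0.408 is the final score.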
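
The arithmetic above can be checked with a short pure-Python sketch. Two assumptions are worth flagging: the sentences "start I can code end" and "start I can cat end" are reconstructed here so that they reproduce the fractions 4/5, 2/4, 1/3, and 0/2 (the original figure is not available), and the score follows the article's simplified weighted sum, whereas the standard BLEU definition combines the n-gram precisions with a geometric mean.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-word sequences in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(reference, prediction, n):
    """Share of predicted n-grams that also occur in the reference,
    counting each reference n-gram at most as often as it appears."""
    ref_counts = Counter(ngrams(reference, n))
    pred_counts = Counter(ngrams(prediction, n))
    matches = sum(min(count, ref_counts[gram])
                  for gram, count in pred_counts.items())
    return matches / sum(pred_counts.values())


reference = "start I can code end".split()    # assumed sentences
prediction = "start I can cat end".split()

precisions = [ngram_precision(reference, prediction, n) for n in (1, 2, 3, 4)]
score = sum(0.25 * p for p in precisions)
print(precisions)
print(round(score, 3))
```

A single wrong word costs one unigram but two bigrams, two trigrams, and both 4-grams, which is why longer n-grams make the metric stricter.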

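You can increase the number of n-grams to make it harder. A four n-gram model is the one that corresponds best to human translations. To learn more about BLEU, I recommend running a few examples with the code below and reading the wiki page [15].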

import os

import numpy as np
from numpy import argmax
from tqdm import tqdm
from keras.models import model_from_json
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from nltk.translate.bleu_score import corpus_bleu
# Compiler is a module from the project repository, not a published package;
# the import path below is an assumption about the repository layout
from compiler.classes.Compiler import Compiler

# Create a function to read a file and return its content
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()

def load_data(data_dir):
    text = []
    images = []
    files_in_folder = os.listdir(data_dir)
    files_in_folder.sort()
    for filename in tqdm(files_in_folder):
        # Add an image
        if filename[-3:] == "npz":
            image = np.load(data_dir + filename)
            images.append(image['features'])
        else:
            # Add text and wrap it in a start and end tag
            syntax = '<START> ' + load_doc(data_dir + filename) + ' <END>'
            # Separate each word with a space
            syntax = ' '.join(syntax.split())
            # Add a space before each comma
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

# Initialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary in a specific order
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

dir_name = '../../../../eval/'
train_features, texts = load_data(dir_name)

# Load model and weights
json_file = open('../../../../model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# Load weights into new model
loaded_model.load_weights("../../../../weights.hdf5")
print("Loaded model from disk")

# Map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

print(word_for_id(17, tokenizer))

# Generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    photo = np.array([photo])
    # Seed the generation process
    in_text = '<START> '
    # Iterate over the whole length of the sequence
    print('\nPrediction---->\n\n<START> ', end='')
    for i in range(150):
        # Integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # Pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # Predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # Convert probability to integer
        yhat = argmax(yhat)
        # Map integer to word
        word = word_for_id(yhat, tokenizer)
        # Stop if we cannot map the word
        if word is None:
            break
        # Append as input for generating the next word
        in_text += word + ' '
        # Stop if we predict the end of the sequence
        print(word + ' ', end='')
        if word == '<END>':
            break
    return in_text

max_length = 48

# Evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # Step over the whole set
    for i in range(len(descriptions)):
        yhat = generate_desc(model, tokenizer, photos[i], max_length)
        # Store actual and predicted
        print('\n\nReal---->\n\n' + descriptions[i])
        actual.append([descriptions[i].split()])
        predicted.append(yhat.split())
    # Calculate BLEU score
    bleu = corpus_bleu(actual, predicted)
    return bleu, actual, predicted

bleu, actual, predicted = evaluate_model(loaded_model, texts, train_features, tokenizer, max_length)

# Compile the tokens into HTML and CSS
dsl_path = "compiler/assets/web-dsl-mapping.json"
compiler = Compiler(dsl_path)
compiled_website = compiler.compile(predicted[0], 'index.html')

print(compiled_website)
print(bleu)

Output

Links to sample outputs

Website 1:

Generated website: https://emilwallner.github.io/bootstrap/pred_1/
Original website: https://emilwallner.github.io/bootstrap/real_1/

Website 2:

Generated website: https://emilwallner.github.io/bootstrap/pred_2/
Original website: https://emilwallner.github.io/bootstrap/real_2/

Website 3:

Generated website: https://emilwallner.github.io/bootstrap/pred_3/
Original website: https://emilwallner.github.io/bootstrap/real_3/

Website 4:

Generated website: https://emilwallner.github.io/bootstrap/pred_4/
Original website: https://emilwallner.github.io/bootstrap/real_4/

Website 5:

Generated website: https://emilwallner.github.io/bootstrap/pred_5/
Original website: https://emilwallner.github.io/bootstrap/real_5/

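Mistakes I made

Understand the model's weaknesses instead of testing models blindly. At first I tried things more or less at random, such as batch normalization and bidirectional networks, and I tried to implement attention. After looking at the test data and seeing that the model could not predict color and position accurately, I realized this was a weakness of the CNN. So I dropped maxpooling and increased the stride instead. The test loss fell from 0.12 to 0.02, and the BLEU score rose from 85% to 97%.
Only use pre-trained models when they are relevant. With such a small dataset, I assumed a pre-trained image model would improve performance. In my experiments, the end-to-end model was slower to train and needed more memory, but it was 30% more accurate.
Be prepared for differences when running the model on a remote server. On my Mac, files were read in alphabetical order, but on the remote server they were read in random order, which left the screenshots and the code mismatched. The model still converged, but once I fixed the problem, accuracy on the test data improved by 50%.
Make sure you understand the library functions. The vocabulary needs to include a space for the empty token. At first I left it out, and one token went missing as a result. I only noticed after looking at the final output several times and seeing that the model never predicted a certain token. A check showed that the token was not in the vocabulary. Also, make sure the vocabulary is in the same order for training and testing.
Use lightweight models when experimenting. Replacing the LSTM with a GRU cut the time per epoch by 30% without much effect on performance.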
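
The file-ordering mistake above is easy to guard against. The following is a minimal sketch, not the author's code: it sorts the directory listing explicitly and checks that every screenshot lines up with its markup file. The .gui extension for markup files and the paired_files name are assumptions made for this illustration.

```python
import os
import tempfile


def paired_files(data_dir):
    """Return (screenshot, markup) file name pairs in a fixed order.

    Assumes each example is stored as <name>.npz plus <name>.gui;
    the .gui extension is an assumption for this sketch.
    """
    # os.listdir returns names in arbitrary order, so sort explicitly
    names = sorted(os.listdir(data_dir))
    images = [n for n in names if n.endswith('.npz')]
    markup = [n for n in names if n.endswith('.gui')]
    if len(images) != len(markup):
        raise ValueError('screenshot and markup counts differ')
    pairs = list(zip(images, markup))
    # Fail loudly if a screenshot and its markup ever get out of step
    for image, text in pairs:
        if os.path.splitext(image)[0] != os.path.splitext(text)[0]:
            raise ValueError(f'mismatched pair: {image} / {text}')
    return pairs


# Try it on a throwaway directory with files created out of order
with tempfile.TemporaryDirectory() as tmp:
    for name in ['b.npz', 'a.gui', 'b.gui', 'a.npz']:
        open(os.path.join(tmp, name), 'w').close()
    pairs = paired_files(tmp)

print(pairs)  # [('a.npz', 'a.gui'), ('b.npz', 'b.gui')]
```

Raising an error on a mismatch is a deliberate choice here: a silent misalignment still lets the model converge, which is exactly what made the original problem hard to spot.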

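Next steps

Deep learning is well suited to front-end development, because data is easy to generate and today's deep learning algorithms can cover most of the logic.

One of the most interesting areas is using an attention mechanism[16] in the LSTM. It not only improves accuracy, it also lets us observe where the CNN focuses its attention while the HTML code is being generated.

Attention is also a key factor in communication between HTML code, stylesheets, scripts, and even the backend. Attention layers can keep track of variables, helping the neural network communicate between different programming languages.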
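
To make the idea of "where the attention is" concrete, here is a minimal sketch of soft attention over image regions in plain Python. It is an illustration under simplifying assumptions, not the model from this article: the dot-product scoring, the attend function, and the toy vectors are all chosen for this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, regions):
    """Dot-product soft attention over image regions.

    query:   decoder state, a list of floats
    regions: one feature vector per image region, same length as query
    Returns (weights, context). The weights sum to 1 and indicate how
    much each region contributes to the current prediction.
    """
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the region features
    context = [sum(w * region[i] for w, region in zip(weights, regions))
               for i in range(len(query))]
    return weights, context


# Toy example: three regions, the first one matches the query best
query = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, context = attend(query, regions)
print(weights)
```

Plotting the weights over the screenshot for each generated token is what would show which part of the image the model is looking at.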

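In the near term, though, the biggest challenge is finding a scalable way to generate data. That would make it possible to add fonts, colors, words, and animations step by step.

So far, many people have been working on turning hand-drawn sketches into app templates. Within two years, we should be able to draw an app on paper and get the corresponding front-end code in under a second. The Airbnb design team[17] and Uizard[18] have already built two prototypes.

Here are some experiments worth trying.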

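Experiments

Getting started:

Run all the models
Try different hyperparameters
Try a different CNN architecture
Add bidirectional LSTM models
Implement the model with a different dataset[19] (you can mount this dataset through FloydHub's "--data" flag: emilwallner/datasets/100k-html:data)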

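Advanced experiments

Build a robust generator that produces random apps or web pages with the corresponding syntax.
Generate data for a sketch-to-app model. Automatically convert app or web page screenshots into sketches, and use a GAN to add variety.
Use an attention layer to visualize where the model focuses on the image for each prediction, similar to this model: https://arxiv.org/abs/1502.03044
Create a framework for a modular approach. For example, one model encodes fonts, another encodes colors, and another encodes layout, with a decoder combining them. Static image features are a good place to start.
Feed the neural network simple HTML components and train it to generate animations with CSS. It would be even better to add an attention module and visualize the focus on the input sources.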

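Finally, many thanks to Tony Beltramelli and Jon Gold for their research and ideas, and for answering all kinds of questions. Thanks to Jason Brownlee for his stellar Keras tutorials (I included a few snippets from his tutorials in the core Keras implementation), and to Beltramelli for providing the data. Thanks also to Qingping Hou, Charlie Harrington, Sai Soundararaj, Jannes Klaas, Claudio Cabral, Alain Demenet, and Dylan Djian for reviewing this article.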

Related links

[1] pix2code paper: https://arxiv.org/abs/1705.07962

[2] sketch2code：https://airbnb.design/sketching-interfaces/

[3] https://github.com/emilwallner/Screenshot-to-code-in-Keras/blob/master/README.md

[4] https://www.floydhub.com/emilwallner/projects/picturetocode

[5] https://machinelearningmastery.com/blog/page/2/

[6] https://blog.floydhub.com/my-first-weekend-of-deep-learning/

[7] https://blog.floydhub.com/coding-the-history-of-deep-learning/

[8] https://blog.floydhub.com/colorizing-b&w-photos-with-neural-networks/

[9] https://machinelearningmastery.com/deep-learning-caption-generation-models/

[10] https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

[11] https://www.youtube.com/watch?v=byLQ9kgjTdQ&t=21s

[12] https://arxiv.org/abs/1301.3781

[13] https://github.com/tonybeltramelli/pix2code/tree/master/datasets

[14] https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/

[15] https://en.wikipedia.org/wiki/BLEU

[16] https://arxiv.org/pdf/1502.03044.pdf

[17] https://airbnb.design/sketching-interfaces/

[18] https://www.uizard.io/

[19] http://lstm.seas.harvard.edu/latex/

Compiled and edited by 灵
Produced by QbitAI (量子位) | WeChat official account: QbitAI

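The worst feeling: you have just committed to a new research project, only to discover that no suitable dataset exists. For an AI engineer, that can be all it takes to go from getting started to giving up.

Stop searching. You can now build a deep learning dataset yourself.

In this tutorial, Francisco Ingham of fast.ai walks you through using Google Image Search to build your own deep learning dataset without violating Google's terms of service.

The whole thing takes just six steps.

Let's go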

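Step 1: Search for images

This part is simple. Just as you normally would in Google Images, enter keywords and search for the images you are interested in.

Google Images shows at most 700 images, so scroll to the bottom of the page and keep clicking "Show more results" until everything has loaded.

Tip: the more precise your keywords, the higher the quality of the resulting dataset.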

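Step 2: Save the image URLs

Run the JavaScript snippet from the tutorial (linked at the end of this article) in your browser to build a list of the URLs of all the images in the dataset.

Then save these URLs to a file for later use.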

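Step 3: Create a directory and upload the URLs to the server

The result of the previous step now comes into play, but first you need a project directory. The author creates one with mkdir MyProject; "MyProject" can be replaced with any project name you like.

Click "Upload" to upload the URL file to this directory.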

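Step 4: Download the images

Once the URL file is in that directory, the images can be downloaded from their URLs, which gives you the first version of the dataset.

It is not much work. Run the following code once for each directory:

download_images(path/file, dest, max_pics=200)

Just specify the URL file name and the destination folder, and the images are downloaded and saved automatically so you can open them locally.

Tip: you can choose how many images to download.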
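
For readers curious about what such a call involves, here is a minimal standard-library sketch of the same idea. It is not the library function used in the tutorial: download_images_sketch, its fetch hook, and the choice to save every file with a .jpg name are assumptions made for this illustration.

```python
import os
import tempfile
import urllib.request


def download_images_sketch(url_file, dest, max_pics=200, fetch=None):
    """Download up to max_pics images listed one URL per line in url_file.

    fetch is a hypothetical hook (url -> bytes) so the sketch can be
    tried without network access; by default it uses urllib.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=4) as response:
                return response.read()

    os.makedirs(dest, exist_ok=True)
    with open(url_file) as handle:
        urls = [line.strip() for line in handle if line.strip()]

    saved = []
    for index, url in enumerate(urls[:max_pics]):
        try:
            data = fetch(url)
        except Exception:
            continue  # skip dead links instead of aborting the whole run
        # Assumption: every file is saved with a numbered .jpg name
        path = os.path.join(dest, f'{index:08d}.jpg')
        with open(path, 'wb') as out:
            out.write(data)
        saved.append(path)
    return saved


# Offline demo with a fake fetch function and placeholder URLs
with tempfile.TemporaryDirectory() as tmp:
    url_file = os.path.join(tmp, 'urls.txt')
    with open(url_file, 'w') as handle:
        handle.write('http://example.com/a.jpg\n'
                     'http://example.com/b.jpg\n'
                     'http://example.com/c.jpg\n')
    saved = download_images_sketch(url_file, os.path.join(tmp, 'cats'),
                                   max_pics=2,
                                   fetch=lambda url: b'fake image bytes')

print(len(saved))  # 2
```

The max_pics argument caps how many URLs are tried, which is how the number of downloaded images is controlled.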

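Step 5: Filter the images

Look through the freshly downloaded images. You will probably find some you do not want, and these need to be filtered out and deleted by hand.

If the keywords in the original Google search were not chosen well, this step may take a while.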

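Step 6: Prepare the training directories

As with any other dataset, it is best to split the images into training, validation, and test sets before you start using it.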
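
A minimal sketch of such a split, assuming the images are identified by file name. The split_dataset function, the 70/20/10 proportions, and the fixed seed are illustrative choices for this sketch and do not come from the tutorial.

```python
import random


def split_dataset(file_names, valid_frac=0.2, test_frac=0.1, seed=42):
    """Shuffle file names reproducibly and split them into three lists."""
    names = sorted(file_names)          # fixed starting order
    random.Random(seed).shuffle(names)  # same seed gives the same split
    n_test = int(len(names) * test_frac)
    n_valid = int(len(names) * valid_frac)
    test = names[:n_test]
    valid = names[n_test:n_test + n_valid]
    train = names[n_test + n_valid:]
    return train, valid, test


# Hypothetical file names for 200 downloaded images
images = [f'{i:08d}.jpg' for i in range(200)]
train, valid, test = split_dataset(images)
print(len(train), len(valid), len(test))  # 140 40 20
```

Each list can then be moved or copied into its own directory. Fixing the seed keeps the split stable between runs, so results stay comparable.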

Once you are past this step, you have a DIY deep learning dataset. Feels good, doesn't it?

Links

GitHub repository:

https://github.com/lesscomfortable/google-image-dataset

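Francisco Ingham has also moved the tutorial into fast.ai's course repository, written as a Jupyter Notebook, although the move was not yet complete when QbitAI last checked. If the original address can no longer be found, try here: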

https://github.com/fastai/course-v3/blob/master/nbs/dl1/download_images.ipynb

All roads lead to the tutorial. Happy learning!
