DeepFM: A Factorization-Machine based Neural Network for CTR Prediction 阅读笔记_综合

摘要

Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expertise feature engineering. （现有的方法似乎对低阶或高阶交互有强烈的偏见，或者需要专业的特征工程）In this paper, we show that it is possible to derive an end-to-end learning model that emphasizes both low- and highorder feature interactions. （文中展示了端到端的同时强调低阶和高阶特征交互的可行性） Compared to the latest Wide & Deep model from Google, DeepFM has a shared input to its “wide” and “deep” parts, with no need of feature engineering besides raw features. （DeepFM的“Wide”和“Deep”部分共享输入，除了原始特征之外，不需要特征工程。）

引言

CTR预测的重要性

一些推荐系统的目标是最大化点击次数，因此返回给用户的item应该按照预测的CTR进行排序
一些场景例如广告，其目标是提升收入，排序策略可以调整为按CTR*bid进行排序

关于同时考虑低阶和高阶特征交互的重要性，（In general, such interactions of features behind user click behaviors can be highly sophisticated, where both low- and high-order feature interactions should play important roles. ）用户点击行为背后的这些特征的交互作用可能非常复杂，其中低阶和高阶特性交互都应该扮演重要角色。

现象：
- 对主流应用程序市场的研究，发现人们经常在用餐时间下载送餐类应用程序，这表明应用程序类别和时间戳之间的（2阶）交互是一个可以用来做CTR预测的特征。
- 第二个观察发现，男性青少年喜欢射击游戏和RPG游戏，这意味着应用类别、用户性别和年龄的（3阶）交互是另一个可以用来做CTR预测的特征。
wide & deep
- 同时考虑低阶和高阶特征交互相对于单独考虑两者的情况有额外的提升。

其中如何建模特征的交互是一项挑战

一些特性交互很容易理解，因此可以由人工设计（如上面的实例）。
即使对于很容易理解的特征交互，在特征比较多的时候，想要靠人工来做到所有的特征交互也是一项不太可能的工作。
大多数特征交互都隐藏在数据中，很难进行先验识别（例如，经典的关联规则“尿布和啤酒”是从数据中挖掘出来的），只有通过机器学习才能自动捕获。

线性模型：如FTRL的广义线性模型在实践中表现良好；然广义模型没有学习特征交互的能力，一种常用做法就是手动的设计特征交互，这种方法很难用于高阶的特征交互，也很难应对训练集中没有出现过或很少出现的特征交互。

FM通过隐向量的内积来建模特征交互，原则上FM可以建模高阶特征交互，但是复杂度比较高，通常我们用FM来建模二阶的特征交互。

深度神经网络在学习复杂特征方面有很大潜力。基于CNN的模型偏向于相邻特征之间的交互，而基于RNN的模型更适合于具有顺序依赖性的点击数据。[Zhang et al.，2016]研究特征表示并提出FNN，该模型在应用DNN前对FM进行预训练，该模型会受到FM性能的限制。[Qu et al.，2016]研究了特征交互，在嵌入层和完全连接层之间引入了product层，提出了PNN。如[Cheng et al.，2016]所述，PNN和FNN与其他深层模型一样，很少捕捉到低阶特征交互，这也是CTR预测所必需的。（To model both lowand high-order feature interactions, [Cheng et al., 2016] proposes an interesting hybrid network structure (Wide & Deep) that combines a linear (“wide”) model and a deep model. In this model, two different inputs are required for the “wide part” and “deep part”, respectively, and the input of “wide part” still relies on expertise feature engineering.）wide & deep将线性模型（wide）和deep模型结合起来来同时对低阶特征交互和高阶特征交互建模，wide侧的deep侧需要两种不同的输入，wide侧的输入依旧依赖人工的特征工程。

现有模型，要不不能同时兼顾低阶和高阶的特征，要不就需要大量的特征工程。DeepFM可以端到端的学习feature interactions of all orders，且除raw features以外，无需其他特征工程。

DeepFM：通过FM部分对低阶特征交互建模，Deep部分对高阶特征进行建模，（Unlike the wide & deep model [Cheng et al., 2016], DeepFM can be trained endto-end without any feature engineering.）与Wide&Deep不同的是，DeepFM是端到端的且无需特征工程的。
（DeepFM can be trained efficiently because its wide part and deep part, unlike [Cheng et al., 2016], share the same input and also the embedding vector. In [Cheng et al., 2016], the input vector can be of huge size as it includes manually designed pairwise feature interactions in the input vector of its wide part, which also greatly increases its complexity.）与Wide&Deep不同的是，DeepFM的wide侧和deep侧共享输入包括embedding向量，这使得其训练高效，wide&deep的wide侧包含人工设计的特征交互，所以其wide侧的输入规模比较大，这会增加复杂度。
在benchmark data和commercial data上的实验结果表名，相对于其他的CTR模型，DeepFM有consistent improvement 。

设训练集由n个实例 $(\chi, y)$ 组成，其中 $\chi$ 是m-fields的数据集，通常会包含user-item的pair对。 $\chi$ 可能会包含诸如性别、位置等类别特征，也可能包含诸如年龄等连续特征。类别特征用one-hot向量，连续特征就是一个数值，或者离散化后的one-hot向量。这样即可将实例转化为 $(x, y)$ ，其中 $x =[x_{field_{1}}, x_{field_{2}}, \cdots, x_{field_{j}}, \cdots , x_{field_{m}}]$ ，是一个d维的向量，其中 $x_{field_{j}}$ 是 $\chi$ 的第jge字段的向量表示，通常情况下x是高维稀疏向量。

2.1 DeepFM

可以看成三个部分：

线性：sparse（1维embedding，相当于LabelEncoder）+ dense

FM：sparse的embedding，和的平方 - 平方的和

dnn：sparse的embedding + dense，展平

未完待续