Deep Knowledge Tracing (DKT): Implementation Details

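1.      Download a large dataset of solved problems that have already been classified by knowledge component.
2.      Preprocess the data and convert the input into the format the network expects.
3.      Build and train an LSTM network to predict the probability that a student will answer future questions correctly.
4.      Evaluate the model and improve it.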





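Supervised training of a neural network requires a dataset whose examples already carry the expected labels. This project therefore uses the examples in the latest version of the public dataset "ASSISTments Skillbuilder data 2009-2010" [4], [5]. Table 1 shows some statistics about this dataset.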









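The model is built with Keras. Because of this tool, I chose a Masking layer as the first layer of the model. This layer handles the mask value that is used both to pad sequences and to fill incomplete batches. As input, it receives a batch of 20 sequences of equal length (the length may differ from batch to batch), with 246 features per timestep. The next layer is an LSTM layer with 250 units, which is responsible for finding relationships between the questions in the time series.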
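To illustrate what the Masking layer contributes, here is a minimal pure-Python sketch of the rule documented for `keras.layers.Masking`: a timestep is skipped only when every one of its features equals the mask value. The helper name `compute_mask` and the toy batch are assumptions made for illustration; they are not part of the project code.

```python
MASK_VALUE = -1.0  # assumed padding constant, matching the -1 described in this article


def compute_mask(batch, mask_value=MASK_VALUE):
    """Return one boolean per timestep: True = real data, False = padding.

    A timestep counts as padding only if all of its features equal
    mask_value, mirroring the documented behaviour of keras.layers.Masking.
    """
    return [
        [any(value != mask_value for value in timestep) for timestep in sequence]
        for sequence in batch
    ]


# Toy batch: two sequences, two timesteps each, three features per timestep.
# The second sequence had only one real timestep and was padded with MASK_VALUE.
batch = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [MASK_VALUE, MASK_VALUE, MASK_VALUE]],
]
mask = compute_mask(batch)
```

Downstream layers that support masking, such as the LSTM, then ignore the timesteps marked `False`, so padding does not influence the hidden state.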







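1.      Remove questions without a skill id.
2.      Convert the data samples into sequences grouped by user id.
3.      Convert the skill id values into consecutive integers starting from zero.
4.      Split the data into three datasets (training, validation, and test).
5.      Encode the skill id together with the label (the answer to the question).
6.      Apply one-hot encoding.
7.      Fill incomplete batches.
8.      Pad the sequences to the same length.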

import pandas as pd
import tensorflow as tf

MASK_VALUE = -1.  # Padding constant (-1, as described in the text).


def load_dataset(fn, batch_size=32, shuffle=True):
    df = pd.read_csv(fn)

    # Validate that the required columns exist and that labels are binary.
    if "skill_id" not in df.columns:
        raise KeyError(f"The column 'skill_id' was not found on {fn}")
    if "correct" not in df.columns:
        raise KeyError(f"The column 'correct' was not found on {fn}")
    if "user_id" not in df.columns:
        raise KeyError(f"The column 'user_id' was not found on {fn}")

    if not (df['correct'].isin([0, 1])).all():
        raise ValueError("The values of the column 'correct' must be 0 or 1.")

    # Step 1.1 - Remove questions without skill
    df.dropna(subset=['skill_id'], inplace=True)

    # Step 1.2 - Remove users with a single answer
    df = df.groupby('user_id').filter(lambda q: len(q) > 1).copy()

    # Step 2 - Enumerate skill id
    df['skill'], _ = pd.factorize(df['skill_id'], sort=True)

    # Step 3 - Cross skill id with answer to form a synthetic feature
    df['skill_with_answer'] = df['skill'] * 2 + df['correct']

    # Step 4 - Convert to a sequence per user id and shift features 1 timestep
    seq = df.groupby('user_id').apply(
        lambda r: (
            r['skill_with_answer'].values[:-1],
            r['skill'].values[1:],
            r['correct'].values[1:],
        )
    )
    nb_users = len(seq)

    # Step 5 - Get Tensorflow Dataset
    dataset = tf.data.Dataset.from_generator(
        generator=lambda: seq,
        output_types=(tf.int32, tf.int32, tf.float32)
    )

    if shuffle:
        dataset = dataset.shuffle(buffer_size=nb_users)

    # Step 6 - Encode categorical features and merge skills with labels to compute target loss.
    # More info: https://github.com/tensorflow/tensorflow/issues/32142
    features_depth = df['skill_with_answer'].max() + 1
    skill_depth = df['skill'].max() + 1

    dataset = dataset.map(
        lambda feat, skill, label: (
            tf.one_hot(feat, depth=features_depth),
            tf.concat(
                values=[
                    tf.one_hot(skill, depth=skill_depth),
                    tf.expand_dims(label, -1)
                ],
                axis=-1
            )
        )
    )

    # Step 7 - Pad sequences per batch
    dataset = dataset.padded_batch(
        batch_size=batch_size,
        padding_values=(MASK_VALUE, MASK_VALUE),
        padded_shapes=([None, None], [None, None]),
        drop_remainder=True
    )

    length = nb_users // batch_size
    return dataset, length, features_depth, skill_depth
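The encoding steps above can be traced by hand on a single user. The sketch below uses only the standard library and a made-up three-answer history (the raw skill ids 17 and 4 are hypothetical). It mirrors the crossing `skill * 2 + correct` and the one-timestep shift from the listing, but it is an illustration rather than the project's implementation.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def encode_user(skills, answers):
    """Cross skill with answer, then shift so each input predicts the next answer."""
    crossed = [s * 2 + a for s, a in zip(skills, answers)]
    features = crossed[:-1]      # what the model sees at timestep t
    next_skills = skills[1:]     # which skill is asked at timestep t + 1
    next_labels = answers[1:]    # whether that answer was correct
    return features, next_skills, next_labels


# Hypothetical history of one user: raw skill ids and correctness flags.
raw_skill_ids = [17, 4, 17]
correct = [1, 0, 1]

# Map raw ids to consecutive zero-based ids in sorted order.
vocab = {sid: i for i, sid in enumerate(sorted(set(raw_skill_ids)))}
skills = [vocab[sid] for sid in raw_skill_ids]

features, next_skills, next_labels = encode_user(skills, correct)

features_depth = 2 * len(vocab)  # every (skill, answer) pair gets its own slot
x = [one_hot(f, features_depth) for f in features]
y = [one_hot(s, len(vocab)) + [float(label)]
     for s, label in zip(next_skills, next_labels)]
```

Each row of `y` is the one-hot vector of the next skill followed by its label, which is the layout the loss needs in order to pick out the prediction for the skill that was actually asked.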

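In step 1, 66,326 samples were identified for removal from the dataset. Step 2 produced 4,163 sequences, and step 3 remapped the skill ids to the contiguous interval [0, 123]. Table 5 summarizes the dataset after these steps.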


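Because of memory constraints, steps 5-8 are performed per batch. In step 7, incomplete batches are filled with the constant value -1, and in step 8, sequences are padded with the same constant up to the length of the longest sequence in their batch. One-hot encoding is not applied in steps 7-8, because a Masking layer in the model handles the padded values.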
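A minimal sketch of steps 7 and 8 in plain Python follows. The helper name `pad_batch` and the toy sequences are assumptions for illustration only. Note that the `load_dataset` listing earlier in this article passes `drop_remainder=True`, so it discards an incomplete final batch instead of filling it; the fill step is included here only to match the description above.

```python
MASK_VALUE = -1.0  # the padding constant described in the text


def pad_batch(sequences, batch_size, mask_value=MASK_VALUE):
    """Pad every sequence to the longest one, then fill the batch to batch_size."""
    n_features = len(sequences[0][0])
    max_len = max(len(seq) for seq in sequences)

    def padding(n_steps):
        # Fresh lists each time, so padded timesteps never share state.
        return [[mask_value] * n_features for _ in range(n_steps)]

    # Step 8: pad each sequence up to the longest sequence in this batch.
    padded = [list(seq) + padding(max_len - len(seq)) for seq in sequences]

    # Step 7: fill an incomplete batch with fully masked sequences.
    while len(padded) < batch_size:
        padded.append(padding(max_len))
    return padded


# Two sequences of different lengths, two features per timestep, batch size 3.
batch = pad_batch(
    [[[1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]],
    batch_size=3,
)
```

Because every padded timestep consists solely of the mask value, a Masking layer configured with the same constant can skip all of them.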

import numpy as np
import tensorflow as tf

from deepkt import data_util


class DKTModel(tf.keras.Model):
    """The Deep Knowledge Tracing model.

    Arguments in __init__:
        nb_features: The number of features in the input.
        nb_skills: The number of skills in the dataset.
        hidden_units: Positive integer. The number of units of the LSTM layer.
        dropout_rate: Float between 0 and 1. Fraction of the units to drop.

    Raises:
        ValueError: In case of mismatch between the provided input data
            and what the model expects.
    """

    def __init__(self, nb_features, nb_skills, hidden_units=100, dropout_rate=0.2):
        inputs = tf.keras.Input(shape=(None, nb_features), name='inputs')

        # Skip timesteps that consist only of the padding constant.
        x = tf.keras.layers.Masking(mask_value=data_util.MASK_VALUE)(inputs)

        x = tf.keras.layers.LSTM(hidden_units,
                                 return_sequences=True,
                                 dropout=dropout_rate)(x)

        # One sigmoid output per skill at every timestep.
        dense = tf.keras.layers.Dense(nb_skills, activation='sigmoid')
        outputs = tf.keras.layers.TimeDistributed(dense, name='outputs')(x)

        super(DKTModel, self).__init__(inputs=inputs,
                                       outputs=outputs,
                                       name="DKTModel")

    def compile(self, optimizer, metrics=None):
        """Configures the model for training.

        Arguments:
            optimizer: String (name of optimizer) or optimizer instance.
                See `tf.keras.optimizers`.
            metrics: List of metrics to be evaluated by the model during training
                and testing. Typically you will use `metrics=['accuracy']`.
                To specify different metrics for different outputs of a
                multi-output model, you could also pass a dictionary, such as
                `metrics={'output_a': 'accuracy', 'output_b': ['accuracy', 'mse']}`.
                You can also pass a list (len = len(outputs)) of lists of metrics
                such as `metrics=[['accuracy'], ['accuracy', 'mse']]` or
                `metrics=['accuracy', ['accuracy', 'mse']]`.

        Raises:
            ValueError: In case of invalid arguments for
                `optimizer` or `metrics`.
        """

        def custom_loss(y_true, y_pred):
            # Reduce the targets and predictions to the skill asked at each step.
            y_true, y_pred = data_util.get_target(y_true, y_pred)
            return tf.keras.losses.binary_crossentropy(y_true, y_pred)

        super(DKTModel, self).compile(
            loss=custom_loss,
            optimizer=optimizer,
            metrics=metrics,
            experimental_run_tf_function=False)

    def fit(self,
            dataset,
            epochs=1,
            verbose=1,
            callbacks=None,
            validation_data=None,
            shuffle=True,
            initial_epoch=0,
            steps_per_epoch=None,
            validation_steps=None,
            validation_freq=1):
        """Trains the model for a fixed number of epochs (iterations on a dataset).

        Arguments:
            dataset: A `tf.data` dataset. Should return a tuple
                of `(inputs, (skills, targets))`.
            epochs: Integer. Number of epochs to train the model.
                An epoch is an iteration over the entire data provided.
                Note that in conjunction with `initial_epoch`,
                `epochs` is to be understood as "final epoch".
                The model is not trained for a number of iterations
                given by `epochs`, but merely until the epoch
                of index `epochs` is reached.
            verbose: 0, 1, or 2. Verbosity mode.
                0 = silent, 1 = progress bar, 2 = one line per epoch.
                Note that the progress bar is not particularly useful when
                logged to a file, so verbose=2 is recommended when not running
                interactively (eg, in a production environment).
            callbacks: List of `keras.callbacks.Callback` instances.
                List of callbacks to apply during training.
                See `tf.keras.callbacks`.
            validation_data: Data on which to evaluate
                the loss and any model metrics at the end of each epoch.
                The model will not be trained on this data.
            shuffle: Boolean (whether to shuffle the training data
                before each epoch)
            initial_epoch: Integer.
                Epoch at which to start training
                (useful for resuming a previous training run).
            steps_per_epoch: Integer or `None`.
                Total number of steps (batches of samples)
                before declaring one epoch finished and starting the
                next epoch. The default `None` is equal to
                the number of samples in your dataset divided by
                the batch size, or 1 if that cannot be determined. If x is a
                `tf.data` dataset, and 'steps_per_epoch'
                is None, the epoch will run until the input dataset is exhausted.
            validation_steps: Only relevant if `validation_data` is provided.
                Total number of steps (batches of
                samples) to draw before stopping when performing validation
                at the end of every epoch. If
                'validation_steps' is None, validation
                will run until the `validation_data` dataset is exhausted.
            validation_freq: Only relevant if validation data is provided. Integer
                or `collections_abc.Container` instance (e.g. list, tuple, etc.).
                If an integer, specifies how many training epochs to run before a
                new validation run is performed, e.g. `validation_freq=2` runs
                validation every 2 epochs. If a Container, specifies the epochs on
                which to run validation, e.g. `validation_freq=[1, 2, 10]` runs
                validation at the end of the 1st, 2nd, and 10th epochs.

        Returns:
            A `History` object. Its `History.history` attribute is
            a record of training loss values and metrics values
            at successive epochs, as well as validation loss values
            and validation metrics values (if applicable).

        Raises:
            RuntimeError: If the model was never compiled.
            ValueError: In case of mismatch between the provided input data
                and what the model expects.
        """
        return super(DKTModel, self).fit(x=dataset,
                                         epochs=epochs,
                                         verbose=verbose,
                                         callbacks=callbacks,
                                         validation_data=validation_data,
                                         shuffle=shuffle,
                                         initial_epoch=initial_epoch,
                                         steps_per_epoch=steps_per_epoch,
                                         validation_steps=validation_steps,
                                         validation_freq=validation_freq)

    def evaluate(self,
                 dataset,
                 verbose=1,
                 steps=None,
                 callbacks=None):
        """Returns the loss value & metrics values for the model in test mode.

        Computation is done in batches.

        Arguments:
            dataset: `tf.data` dataset. Should return a
                tuple of `(inputs, (skills, targets))`.
            verbose: 0 or 1. Verbosity mode.
                0 = silent, 1 = progress bar.
            steps: Integer or `None`.
                Total number of steps (batches of samples)
                before declaring the evaluation round finished.
                Ignored with the default value of `None`.
                If x is a `tf.data` dataset and `steps` is
                None, 'evaluate' will run until the dataset is exhausted.
                This argument is not supported with array inputs.
            callbacks: List of `keras.callbacks.Callback` instances.
                List of callbacks to apply during evaluation.
                See [callbacks](/api_docs/python/tf/keras/callbacks).

        Returns:
            Scalar test loss (if the model has a single output and no metrics)
            or list of scalars (if the model has multiple outputs
            and/or metrics). The attribute `model.metrics_names` will give you
            the display labels for the scalar outputs.

        Raises:
            ValueError: in case of invalid arguments.
        """
        return super(DKTModel, self).evaluate(dataset,
                                              verbose=verbose,
                                              steps=steps,
                                              callbacks=callbacks)

    def evaluate_generator(self, *args, **kwargs):
        # Generator-based evaluation is intentionally unsupported.
        raise NotImplementedError("Not supported")

    def fit_generator(self, *args, **kwargs):
        # Generator-based training is intentionally unsupported.
        raise NotImplementedError("Not supported")
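The custom loss delegates to `data_util.get_target`, whose source is not shown in this article. Since each target row is the one-hot next skill followed by its label, one plausible reading is that it selects the predicted probability for the skill actually asked and compares it with the label. The pure-Python sketch below works under that assumption; the function name `masked_bce`, the explicit skipping of padded timesteps, and the toy numbers are all illustrative, not the project's code.

```python
import math

MASK_VALUE = -1.0  # the padding constant described in the text


def masked_bce(y_true, y_pred, mask_value=MASK_VALUE, eps=1e-7):
    """Mean binary cross-entropy over the real (non-padded) timesteps.

    y_true: per timestep, the one-hot next skill followed by the 0/1 label.
    y_pred: per timestep, one predicted probability per skill.
    """
    losses = []
    for target, probs in zip(y_true, y_pred):
        if all(value == mask_value for value in target):
            continue  # padded timestep: contributes nothing (an assumption)
        skill_one_hot, label = target[:-1], target[-1]
        # Select the prediction for the skill that was actually asked.
        p = sum(s * q for s, q in zip(skill_one_hot, probs))
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        losses.append(-(label * math.log(p) + (1.0 - label) * math.log(1.0 - p)))
    return sum(losses) / len(losses) if losses else 0.0


# Two real timesteps and one padded timestep, with two skills.
y_true = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [MASK_VALUE] * 3]
y_pred = [[0.8, 0.3], [0.6, 0.25], [0.5, 0.5]]
loss = masked_bce(y_true, y_pred)
```

In this toy case the loss averages `-log(0.8)` for the correct answer on the first skill and `-log(1 - 0.25)` for the incorrect answer on the second; the predictions for skills that were not asked do not affect it.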



1.      Optimizer.
2.      Batch size.
3.      Dropout.
4.      Number of LSTM units.









