
Knowledge Tracing 资源帖1

Popularity: 59   Published: 2024-01-19 11:11:48

This post collects common knowledge tracing datasets, code, blogs, and more; I am a diligent porter, so read carefully.

Datasets

Knowledge Tracing Benchmark Dataset

The following datasets are suitable for this task:

KDD Cup 2010  https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp

ASSISTments (google.com)

 OLI Engineering Statics 2011  https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507

JunyiAcademy Math Practicing Log [Annotation]  

slepemapy.cz  https://www.fi.muni.cz/adaptivelearning/?a=data

synthetic (github.com)

math2015  

EdNet

 pisa2015math

 workbankr

 critlangacq


The following datasets are provided by EduData ktbd:

Dataset Name | Description
synthetic | The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub
assistment_2009_2010 | The dataset used in Deep Knowledge Tracing; the original dataset can be found on GitHub
junyi | Part of the preprocessed junyi dataset, containing only the interaction sequences of the 1000 most active students

For details, see EduData/ktbd.md at master · bigdata-ustc/EduData (github.com)

Data Format

In the knowledge tracing task, there is a popular format (which we call the triple-line, or tl, format) to represent interaction sequence records:

5

419,419,419,665,665

1,1,1,0,0

This format can be found in Deep Knowledge Tracing. In this format, three lines make up one interaction sequence: the first line gives the length of the sequence, the second line lists the exercise IDs, and the third line records, element by element, whether each answer was correct (1) or wrong (0).

To handle the problem that certain special symbols are hard to store in the format above, we provide another format, called the json sequence, to represent interaction sequence records:

[[419,1],[419,1],[419,1],[665,0],[665,0]]

Each item in the sequence represents one interaction. The first element of an item is the exercise ID (in some works an exercise ID is not mapped one-to-one to a knowledge unit (KU)/concept, but in junyi one exercise contains exactly one KU); the second element indicates whether the learner answered the exercise correctly, 0 for wrong and 1 for right. Each line holds one json record, corresponding to one learner's interaction sequence.

We provide tools for converting between the two formats:

# convert tl sequence to json sequence; by default, the exercise tag and answer are converted to int
edudata tl2json $src $tar
# convert tl sequence to json sequence without type conversion
edudata tl2json $src $tar False
# convert json sequence to tl sequence
edudata json2tl $src $tar
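A minimal pure-Python sketch of this conversion (not the EduData implementation; `tl2json_lines` and `json2tl_lines` are hypothetical helper names) might look like:

```python
import json

def tl2json_lines(lines, to_int=True):
    """Convert one triple-line (tl) record into a json-sequence record.

    lines: the three raw strings [length, exercise ids, answers], e.g.
    ["5", "419,419,419,665,665", "1,1,1,0,0"].
    """
    length = int(lines[0])
    exercises = lines[1].split(",")
    answers = lines[2].split(",")
    if to_int:
        exercises = [int(e) for e in exercises]
        answers = [int(a) for a in answers]
    seq = [[e, a] for e, a in zip(exercises, answers)]
    assert len(seq) == length, "first tl line must match the sequence length"
    return json.dumps(seq)

def json2tl_lines(record):
    """Convert one json-sequence record back into the three tl lines."""
    seq = json.loads(record)
    return [
        str(len(seq)),
        ",".join(str(e) for e, _ in seq),
        ",".join(str(a) for _, a in seq),
    ]
```

This mirrors the tl example above: the first line is the length, and the two remaining lines are rebuilt column-wise from the [exercise, answer] pairs.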

Dataset Preprocess

https://github.com/ckyeungac/deepknowledgetracing/blob/master/notebooks/ProcessSkillBuilder0910.ipynb

EduData/ASSISTments2015.ipynb at master · bigdata-ustc/EduData (github.com)

ASSISTments2015 Data Analysis

Data Description

Column Description

Field | Annotation
user id | ID of the student
log id | Unique ID of the logged action
sequence id | ID of the problem set
correct | Correct on the first attempt, incorrect on the first attempt, or asked for help

import numpy as np
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

path = "2015_100_skill_builders_main_problems.csv"
data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)

Record Examples

pd.set_option('display.max_columns', 500)
data.head()
user_id log_id sequence_id correct
0 50121 167478035 7014 0.0
1 50121 167478043 7014 1.0
2 50121 167478053 7014 1.0
3 50121 167478069 7014 1.0
4 50964 167478041 7014 1.0

General features

data.describe()
user_id log_id sequence_id correct
count 708631.000000 7.086310e+05 708631.000000 708631.000000
mean 296232.978276 1.695323e+08 22683.474821 0.725502
std 48018.650247 3.608096e+06 41593.028018 0.437467
min 50121.000000 1.509145e+08 5898.000000 0.000000
25% 279113.000000 1.660355e+08 7020.000000 0.000000
50% 299168.000000 1.704579e+08 9424.000000 1.000000
75% 335647.000000 1.723789e+08 14442.000000 1.000000
max 362374.000000 1.754827e+08 236309.000000 1.000000
print("The number of records: "+ str(len(data['log_id'].unique())))
The number of records: 708631
print('Part of missing values for every column')
print(data.isnull().sum() / len(data))
Part of missing values for every column
user_id        0.0
log_id         0.0
sequence_id    0.0
correct        0.0
dtype: float64
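To turn this flat log into per-student interaction sequences for a knowledge tracing model, one might group by `user_id` in chronological order. A sketch assuming `log_id` increases with time (which the examples above suggest); `build_sequences` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

# Toy log with the same columns as the ASSISTments2015 csv above.
toy = pd.DataFrame({
    "user_id":     [50121, 50121, 50121, 50964],
    "log_id":      [167478035, 167478043, 167478053, 167478041],
    "sequence_id": [7014, 7014, 7014, 7014],
    "correct":     [0.0, 1.0, 1.0, 1.0],
})

def build_sequences(df):
    """Group the log by student, ordered by log_id, and return
    {user_id: [(sequence_id, correct), ...]} interaction sequences."""
    df = df.sort_values("log_id")
    return {
        uid: list(zip(g["sequence_id"], g["correct"].astype(int)))
        for uid, g in df.groupby("user_id")
    }

seqs = build_sequences(toy)
```

The resulting per-user lists are exactly the [exercise, answer] sequences described in the data-format section above.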

Implementation Code Collections

https://github.com/seewoo5/KT

DKT (Deep Knowledge Tracing)

  • Paper: https://web.stanford.edu/~cpiech/bio/papers/deepKnowledgeTracing.pdf
  • Model: RNN, LSTM (only LSTM is implemented)
  • GitHub: https://github.com/chrispiech/DeepKnowledgeTracing (Lua)
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 77.02 ± 0.07 | 81.81 ± 0.10 | input_dim=100, hidden_dim=100
ASSISTments2015 | 74.94 ± 0.04 | 72.94 ± 0.05 | input_dim=100, hidden_dim=100
ASSISTmentsChall | 68.67 ± 0.09 | 72.29 ± 0.06 | input_dim=100, hidden_dim=100
STATICS | 81.27 ± 0.06 | 82.87 ± 0.10 | input_dim=100, hidden_dim=100
Junyi Academy | 85.4 | 80.58 | input_dim=100, hidden_dim=100
EdNet-KT1 | 72.72 | 76.99 | input_dim=100, hidden_dim=100
  • All models are trained with batch size 2048 and sequence size 200.
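DKT feeds the RNN a one-hot encoding of each (skill, correctness) pair, a vector of length 2 * num_skills. A hedged sketch of one common convention (some implementations swap which half encodes correct answers):

```python
import numpy as np

def dkt_encode(seq, num_skills):
    """One-hot encode an interaction sequence for DKT.

    Each step (skill_id, correct) becomes a vector of length
    2 * num_skills: index skill_id if answered wrong, index
    num_skills + skill_id if answered right.
    """
    x = np.zeros((len(seq), 2 * num_skills), dtype=np.float32)
    for t, (skill, correct) in enumerate(seq):
        x[t, skill + correct * num_skills] = 1.0
    return x

# Three interactions over a 10-skill vocabulary.
x = dkt_encode([(3, 1), (3, 0), (7, 1)], num_skills=10)
```

The encoded matrix is then consumed step by step by the LSTM, whose output at each step predicts the probability of answering each skill correctly next.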

DKVMN (Dynamic Key-Value Memory Network)

  • Paper: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p765.pdf
  • Model: Extension of Memory-Augmented Neural Network (MANN)
  • Github: https://github.com/jennyzhang0215/DKVMN (MxNet)
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 75.61 ± 0.21 | 79.56 ± 0.29 | key_dim=50, value_dim=200, summary_dim=50, concept_num=20, batch_size=1024
ASSISTments2015 | 74.71 ± 0.02 | 71.57 ± 0.08 | key_dim=50, value_dim=100, summary_dim=50, concept_num=20, batch_size=2048
ASSISTmentsChall | 67.16 ± 0.05 | 67.38 ± 0.07 | key_dim=50, value_dim=100, summary_dim=50, concept_num=20, batch_size=2048
STATICS | 80.66 ± 0.09 | 81.16 ± 0.08 | key_dim=50, value_dim=100, summary_dim=50, concept_num=50, batch_size=1024
Junyi Academy | 85.04 | 79.68 | key_dim=50, value_dim=100, summary_dim=50, concept_num=50, batch_size=512
EdNet-KT1 | 72.32 | 76.48 | key_dim=100, value_dim=100, summary_dim=100, concept_num=100, batch_size=256
  • Due to memory issues, not all models are trained with batch size 2048.
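The core of DKVMN's read operation is the correlation weight: a softmax over the inner products of an exercise's key embedding with each key-memory slot. A minimal NumPy sketch, using the key_dim=50 / concept_num=20 shapes from the ASSISTments2009 row above (random inputs stand in for learned embeddings):

```python
import numpy as np

def correlation_weight(k_t, key_memory):
    """DKVMN read attention.

    k_t: (key_dim,) exercise key embedding.
    key_memory: (concept_num, key_dim) static key memory matrix.
    Returns a (concept_num,) softmax weight over memory slots.
    """
    scores = key_memory @ k_t          # inner product with each slot
    scores = scores - scores.max()     # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(0)
w = correlation_weight(rng.normal(size=50), rng.normal(size=(20, 50)))
```

The same weight vector is reused for both the read (weighted sum of value memory) and the write (weighted erase/add update), which is what ties the key and value memories together.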

NPA (Neural Pedagogical Agency)

  • Paper: https://arxiv.org/abs/1906.10910
  • Model: Bi-LSTM + Attention
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 77.11 ± 0.08 | 81.82 ± 0.13 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
ASSISTments2015 | 75.02 ± 0.05 | 72.94 ± 0.08 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
ASSISTmentsChall | 69.34 ± 0.03 | 73.26 ± 0.03 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
STATICS | 81.38 ± 0.14 | 83.1 ± 0.25 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
Junyi Academy | 85.57 | 81.10 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
EdNet-KT1 | 73.05 | 77.58 | input_dim=100, hidden_dim=100, attention_dim=100, fc_dim=200
  • All models are trained with batch size 2048 and sequence size 200.

SAKT (Self-Attentive Knowledge Tracing)

  • Paper: https://files.eric.ed.gov/fulltext/ED599186.pdf
  • Model: Transformer (1-layer, only encoder with subsequent mask)
  • Github: https://github.com/shalini1194/SAKT (Tensorflow)
  • Performances:
Dataset | ACC (%) | AUC (%) | Hyper Parameters
ASSISTments2009 | 76.36 ± 0.15 | 80.78 ± 0.10 | hidden_dim=100, seq_size=100, batch_size=512
ASSISTments2015 | 74.57 ± 0.07 | 71.49 ± 0.03 | hidden_dim=100, seq_size=50, batch_size=512
ASSISTmentsChall | 67.53 ± 0.06 | 69.70 ± 0.32 | hidden_dim=100, seq_size=200, batch_size=512
STATICS | 80.45 ± 0.13 | 80.30 ± 0.31 | hidden_dim=100, seq_size=500, batch_size=128
Junyi Academy | 85.27 | 80.36 | hidden_dim=100, seq_size=200, batch_size=512
EdNet-KT1 | 72.44 | 76.60 | hidden_dim=200, seq_size=200, batch_size=512
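The "subsequent mask" mentioned for SAKT is the usual causal attention mask: position i may attend only to positions up to i, so the model never peeks at future interactions. A small sketch:

```python
import numpy as np

def subsequent_mask(seq_size):
    """Lower-triangular boolean mask for causal self-attention:
    entry (i, j) is True iff position i may attend to position j."""
    return np.tril(np.ones((seq_size, seq_size), dtype=bool))

m = subsequent_mask(4)
```

In a Transformer implementation, the False entries are typically filled with -inf in the attention logits before the softmax.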

https://github.com/bigdata-ustc/TKT

Knowledge tracing models implemented in MXNet-Gluon. For convenient dataset downloading and preprocessing for the knowledge tracing task, visit EduData for a handy API.

Visit https://base.ustc.edu.cn for more of our works.

Performance on Well-Known Datasets

Using EduData, we evaluated the models' performance; the AUC results are listed as follows:

model name | synthetic | assistment_2009_2010 | junyi
DKT | 0.6438748958881487 | 0.7442573465541942 | 0.8305416859735839
DKT+ | 0.8062221383790489 | 0.7483424087919035 | 0.8497422607539136
EmbedDKT | 0.4858168704660636 | 0.7285572301977586 | 0.8194401881889697
EmbedDKT+ | 0.7340996181876187 | 0.7490900876356051 | 0.8405445812109871
DKVMN | TBA | TBA | TBA

The F1 scores are listed as follows:

model name | synthetic | assistment_2009_2010 | junyi
DKT | 0.5813237474584396 | 0.7134380508024369 | 0.7732850122818582
DKT+ | 0.7041804463370387 | 0.7137627713343819 | 0.7928075377114897
EmbedDKT | 0.4716821311199386 | 0.7095025134079656 | 0.7681817174082963
EmbedDKT+ | 0.6316953625658291 | 0.7101790604990228 | 0.7903592922756097
DKVMN | TBA | TBA | TBA

The information of the benchmark datasets can be found in EduData docs.

In addition, all models are trained for 20 epochs with batch_size=16, and the best result is reported. We use Adam with learning_rate=1e-3 and apply bucketing to accelerate training. Each sample length is limited to 200. The hyperparameters are listed as follows:

model name | synthetic - 50 | assistment_2009_2010 - 124 | junyi - 835
DKT | hidden_num=int(100);dropout=float(0.5) | hidden_num=int(200);dropout=float(0.5) | hidden_num=int(900);dropout=float(0.5)
DKT+ | lr=float(0.2);lw1=float(0.001);lw2=float(10.0) | lr=float(0.1);lw1=float(0.003);lw2=float(3.0) | lr=float(0.01);lw1=float(0.001);lw2=float(1.0)
EmbedDKT | hidden_num=int(100);latent_dim=int(35);dropout=float(0.5) | hidden_num=int(200);latent_dim=int(75);dropout=float(0.5) | hidden_num=int(900);latent_dim=int(600);dropout=float(0.5)
EmbedDKT+ | lr=float(0.2);lw1=float(0.001);lw2=float(10.0) | lr=float(0.1);lw1=float(0.003);lw2=float(3.0) | lr=float(0.01);lw1=float(0.001);lw2=float(1.0)
DKVMN | hidden_num=int(50);key_embedding_dim=int(10);value_embedding_dim=int(10);key_memory_size=int(5);key_memory_state_dim=int(10);value_memory_size=int(5);value_memory_state_dim=int(10);dropout=float(0.5) | hidden_num=int(50);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(50);key_memory_state_dim=int(50);value_memory_size=int(50);value_memory_state_dim=int(200);dropout=float(0.5) | hidden_num=int(600);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(20);key_memory_state_dim=int(50);value_memory_size=int(20);value_memory_state_dim=int(200);dropout=float(0.5)

The number after the "-" in the first row indicates the number of knowledge units in the dataset. The datasets we used can either be found in basedata-ktbd or be downloaded by:

pip install EduData
edudata download ktbd

Trick

  • DKT: hidden_num is usually set to the hundred nearest to ku_num
  • EmbedDKT: latent_dim is usually set to a value less than or equal to sqrt(hidden_num * ku_num)
  • DKVMN: key_embedding_dim = key_memory_state_dim and value_embedding_dim = value_memory_state_dim
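These heuristics can be written down directly. The helper functions below are illustrative only (not part of TKT), and note that the hyperparameter table above does not follow the DKT rule exactly for every dataset:

```python
import math

def suggest_hidden_num(ku_num):
    """DKT trick: round ku_num to the nearest hundred (at least 100)."""
    return max(100, round(ku_num / 100) * 100)

def suggest_latent_dim(hidden_num, ku_num):
    """EmbedDKT trick: latent_dim should be at most
    sqrt(hidden_num * ku_num); return that upper bound."""
    return int(math.sqrt(hidden_num * ku_num))
```

For example, assistment_2009_2010 with 124 knowledge units and hidden_num=200 gives an upper bound of 157 for latent_dim, and the table's choice of 75 indeed stays below it.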

Notice

Some PyTorch interfaces may change from version to version, for example:

import torch
torch.nn.functional.one_hot
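For instance, torch.nn.functional.one_hot is not available in very old PyTorch releases. A NumPy stand-in (an illustrative fallback, not the PyTorch API) is easy to write:

```python
import numpy as np

def one_hot(indices, num_classes):
    """NumPy equivalent of torch.nn.functional.one_hot for a
    1-D index list: returns an (n, num_classes) 0/1 matrix."""
    indices = np.asarray(indices)
    out = np.zeros((len(indices), num_classes), dtype=np.int64)
    out[np.arange(len(indices)), indices] = 1
    return out

oh = one_hot([0, 2, 1], num_classes=3)
```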