
Training your own GPT-2 model with Transformers: "You are attempting to pad samples but the tokenizer you are using…"


I hadn't used Transformers' GPT-2 much before. Today I tried training one myself and got: ValueError: You are attempting to pad samples but the tokenizer you are using (GPT2Tokenizer) does not have one. A quick search showed I'm not the only one hitting this, for example here: https://github.com/huggingface/transformers/issues/4122
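The root cause is just that the GPT-2 tokenizer ships without a pad token, so any code path that asks the data collator to pad variable-length samples fails. You can see this directly (just an illustration using the stock pretrained tokenizer, not the custom one trained below):

from transformers import GPT2Tokenizer

# The stock GPT-2 tokenizer has no pad token; this is exactly what the
# ValueError complains about when variable-length lines need padding.
tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.pad_token)  # None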

Based on the discussion there, the fix is very simple; someone pointed it out here: https://stackoverflow.com/questions/63377135/training-gpt2-and-reformer-from-scratch

You can't use the LineByLineTextDataset class with GPT2 as mentioned here. Use TextDataset instead.

So the fix is just to swap the dataset class: TextDataset tokenizes the whole corpus and cuts it into fixed-length blocks of block_size tokens, so the data collator never has to pad and the missing pad token no longer matters. I was following the RoBERTa tutorial I had used before: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

I swapped the model to GPT-2 and, as described above, replaced LineByLineTextDataset with TextDataset. Some examples online also cover this in good detail, e.g.: https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface

Finally, here is my code; you can pretrain your own GPT-2 model directly on top of it:

import torch
import os
import shutil

print(torch.cuda.is_available())  # make sure a GPU is visible
############################################################
# Train a byte-level BPE tokenizer on all .txt files in the corpus folder
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

repo_name = "GPT-Corpus"
model_dir = repo_name + '-GPTModel'
if os.path.exists(model_dir):
    shutil.rmtree(model_dir)
os.mkdir(model_dir)

paths = [str(x) for x in Path(repo_name).glob("**/*.txt")]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=15_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.save_model(model_dir)
############################################################
# A small GPT-2 configuration (the corpus is tiny, so the model is scaled down)
from transformers import GPT2Config
config = GPT2Config(vocab_size=15_000, n_positions=512, n_head=2, n_layer=2, n_embd=256)
############################################################
# Reload the trained tokenizer as a GPT2Tokenizer
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
############################################################
# Initialize a GPT-2 language model from scratch
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel(config=config)
print(model.num_parameters())
############################################################
# LineByLineTextDataset is what triggers the padding error with GPT2Tokenizer,
# so use TextDataset instead:
# from transformers import LineByLineTextDataset
# dataset = LineByLineTextDataset(
#     tokenizer=tokenizer,
#     file_path=repo_name + "/GPT_Corpus.txt",
#     block_size=128,
# )
from transformers import TextDataset
dataset = TextDataset(tokenizer=tokenizer, file_path="GPT_Corpus_whole.txt", block_size=128)
############################################################
# Data collator for causal language modeling (mlm=False)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
############################################################
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./" + model_dir,
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_gpu_train_batch_size=64,  # newer transformers versions call this per_device_train_batch_size
    save_steps=20_000,
    save_total_limit=2,
    logging_steps=100,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,  # in newer versions this is set on TrainingArguments instead
)
trainer.train()
trainer.save_model("./" + model_dir)

A few notes: in this example I used a very small corpus, so the model and training settings are scaled down accordingly. Also, as mentioned above, the corpus files in the folder used to train the tokenizer have one sentence per line, whereas for TextDataset I concatenated the whole corpus into a single file (GPT_Corpus_whole.txt); see the sketch below. I don't know whether this has any effect on the model, but anyway, training runs normally this way.
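In case it helps, here is a minimal sketch of how that single file could be produced from the per-line corpus files; the folder name GPT-Corpus and the file name GPT_Corpus_whole.txt follow the code above, and the merging itself is just one straightforward way to do it:

from pathlib import Path

# Minimal sketch: concatenate every per-line .txt file under GPT-Corpus/
# into the single file that TextDataset reads in the training script above.
with open("GPT_Corpus_whole.txt", "w", encoding="utf-8") as out:
    for txt_file in sorted(Path("GPT-Corpus").glob("**/*.txt")):
        with open(txt_file, encoding="utf-8") as f:
            out.write(f.read().strip() + "\n")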

Using the trained model is just as simple:

from transformers import pipeline

pred = pipeline("text-generation", model="./GPT-Corpus-GPTModel", tokenizer="./GPT-Corpus-GPTModel")
result = pred('%input string%')[0]['generated_text']
print(result)
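If you want more control over decoding than the pipeline gives you, the same checkpoint can also be loaded directly; the generation parameters below are only example values, not anything tuned:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch: load the saved checkpoint and call generate() yourself.
model_dir = "./GPT-Corpus-GPTModel"
tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
model = GPT2LMHeadModel.from_pretrained(model_dir)

input_ids = tokenizer.encode("your prompt here", return_tensors="pt")
output_ids = model.generate(input_ids, max_length=64, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))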

That's all for this quick note.

 
