
2019/11/18



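| Project | Language | Pretrained model | Description | Related resources |
| --- | --- | --- | --- | --- |
| GPT2-Chinese | Chinese | | Chinese GPT-2 training code, using either BERT's tokenizer or a Sentencepiece BPE model (thanks to kangzhonghua for the contribution; BPE mode requires minor changes to train.py). Can write poetry, news, and fiction, or train a general-purpose language model. Supports character-level, word-segmented, and BPE tokenization, as well as training on large corpora. | |
| gpt2-ml | Chinese | | GPT-2 models with multi-language support | |
| roberta_zh | Chinese | | RoBERTa is an improved version of BERT that reaches state-of-the-art results by improving the training tasks and data generation, training longer, and using larger batches and more data. It can be loaded directly as a BERT model. | |
| Chinese-PreTrained-XLNet | Chinese | | This project provides XLNet pretrained models for Chinese, aiming to enrich Chinese NLP resources and offer a wider choice of Chinese pretrained models. Experts and researchers are welcome to download and use them and to help build up Chinese resources together. | |
| fastText | All | | fastText is a library for efficient learning of word representations and sentence classification. | |
| transformers | All | | Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. | https://huggingface.co/transformers |
| bert-as-service | All | | Mapping a variable-length sentence to a fixed-length vector using BERT model | https://bert-as-service.readthedocs.io; article: "Google BERT sentence and word vectors in two lines of code" |
| gpt-2-keyword-generation | English | × | Method to encode text for GPT-2 to generate text based on provided keywords | |
| SC-LSTM | All | × | Implement SC-LSTM model for text generation in control of words, in Python/TensorFlow. | Article: "A semantically controlled text generation model"; github: char-rnn-tf |
| ctrl | All | | Conditional Transformer Language Model for Controllable Generation | Paper: "CTRL: A Conditional Transformer Language Model for Controllable Generation"; article: "These days, even AI knows how to make up stories" |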




Opinion 1, by soloice: No

For newbies, NO.

Sentence generation requires sampling from a language model, which gives the probability distribution of the next word given the preceding context. BERT can't do this directly because it is bidirectional: each of its predictions conditions on both the left and the right context.
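
To make "sampling from a language model" concrete, here is a minimal sketch using a toy bigram model. The corpus, the function names, and the bigram assumption (the next word depends only on the previous word) are all illustrative choices, not part of the original answer. GPT-2 conditions on the whole left context rather than one word, but the sampling loop has the same shape.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration); a real model sees far more text.
CORPUS = [
    "he was still very hungry",
    "he was still very tired",
    "she was still very hungry",
]


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Return P(next word | previous word) as a dict of probabilities."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


def sample_sentence(counts, rng, max_len=20):
    """Sample left to right, one word at a time, until the end marker."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        dist = next_word_distribution(counts, prev)
        choices, probs = zip(*dist.items())
        prev = rng.choices(choices, weights=probs, k=1)[0]
        if prev == "</s>":
            break
        words.append(prev)
    return " ".join(words)


counts = train_bigram(CORPUS)
print(next_word_distribution(counts, "very"))
print(sample_sentence(counts, random.Random(0)))
```

The key point is that every step only needs a distribution over the next word given what has already been generated, which is exactly what a left-to-right model provides and what BERT's masked-token objective does not.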

For advanced researchers, YES.

You can start with a sentence of all [MASK] tokens and generate words one by one in arbitrary order (instead of the common left-to-right chain decomposition), though the quality of the generated text is hard to control.

Here's the technical report "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model", its errata, and the source code.
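
The idea of filling an all-[MASK] sentence in arbitrary order can be sketched as follows. This is only an illustration of the decoding loop: `toy_masked_distribution` is a hypothetical stand-in for BERT's masked-token prediction, built from a three-sentence toy corpus, and the procedure in the technical report may differ from this single pass.

```python
import random

MASK = "[MASK]"

# Toy corpus of equal-length sentences (an assumption for illustration).
CORPUS = [
    ["he", "was", "still", "very", "hungry"],
    ["she", "was", "still", "very", "tired"],
    ["he", "was", "not", "very", "hungry"],
]


def toy_masked_distribution(tokens, position):
    """Hypothetical stand-in for a masked language model.

    Scores each candidate word for `position` by how well the corpus
    sentence it comes from agrees with the already filled tokens, on
    BOTH sides of the position (the bidirectional part).
    """
    weights = {}
    for sentence in CORPUS:
        agreement = sum(
            1 for t, s in zip(tokens, sentence) if t != MASK and t == s
        )
        word = sentence[position]
        weights[word] = weights.get(word, 0.0) + 2.0 ** agreement
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}


def generate(rng):
    """Start from all [MASK] tokens and fill positions in random order."""
    length = len(CORPUS[0])
    tokens = [MASK] * length
    order = list(range(length))
    rng.shuffle(order)  # arbitrary order instead of left to right
    for position in order:
        dist = toy_masked_distribution(tokens, position)
        words, probs = zip(*dist.items())
        tokens[position] = rng.choices(words, weights=probs, k=1)[0]
    return tokens


print(" ".join(generate(random.Random(0))))
```

With a real BERT model, the stand-in would be replaced by the model's predicted distribution at the masked position. Nothing in the loop forces the filled-in words to form a globally coherent sentence, which is one way to see why quality is hard to control.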

In summary:

If you would like to do research on decoding with BERT, there is a huge space to explore.
If you would like to generate high-quality text, I personally recommend checking out GPT-2.

Opinion 2, by stuart: No (quoted)

This experiment by Stephen Mayhew suggests that BERT is lousy at sequential text generation:


although he had already eaten a large meal, he was still very hungry
As before, I masked “hungry” to see what BERT would predict. If it could predict it correctly without any right context, we might be in good shape for generation.

This failed. BERT predicted “much” as the last word. Maybe this is because BERT thinks the absence of a period means the sentence should continue. Maybe it’s just so used to complete sentences it gets confused. I’m not sure.

One might argue that we should continue predicting after “much”. Maybe it’s going to produce something meaningful. To that I would say: first, this was meant to be a dead giveaway, and any human would predict “hungry”. Second, I tried it, and it keeps predicting dumb stuff. After “much”, the next token is “,”.

So, at least using these trivial methods, BERT can’t generate text.


Source: "GPT-2, the most powerful general-purpose NLP model ever: OpenAI has just released a 774-million-parameter version"


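Given historical trends in the cost of computing power, we can expect language models to be deployed more widely across a range of devices. For example, Hugging Face ported the 124-million-parameter GPT-2 to Swift CoreML for inference on iOS devices.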










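Today, training and deploying models requires knowledge of ML techniques, skill with the tools, and access to testbeds for evaluation. Steadily improving tools for interacting with language models, such as Talk to Transformer and Write with Transformer, will broaden the number of actors who can use language models in a range of different ways. These improvements in tool usability will complement improvements in model performance and sampling methods, and will enable a wider array of creative applications of language models than we have seen so far.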


