
Pre-Trained Language Models and Their Applications
Haifeng Wang, Jiwei Li, Hua Wu, Eduard Hovy, Yu Sun
Engineering, 2023, Vol. 25, Issue 6: 51-65.
Pre-trained language models have achieved striking success in natural language processing (NLP), leading to a paradigm shift from supervised learning to pre-training followed by fine-tuning. The NLP community has witnessed a surge of research interest in improving pre-trained models. This article presents a comprehensive review of representative work and recent progress in the field and introduces a taxonomy of pre-trained models. We first give a brief introduction to pre-trained models, followed by their characteristic methods and frameworks. We then introduce and analyze the impact and challenges of pre-trained models and their downstream applications. Finally, we briefly conclude and outline future research directions in this field.
Keywords: Pre-trained models / Natural language processing
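
To make the pre-train/fine-tune paradigm mentioned in the abstract concrete, the following minimal sketch fine-tunes a generic pre-trained encoder on a toy sentence-classification task using the Hugging Face transformers library. The checkpoint name, toy data, and hyperparameters are illustrative assumptions rather than choices made in the article.

```python
# Minimal sketch of the "pre-train, then fine-tune" paradigm.
# Checkpoint, data, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # any pre-trained encoder checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny toy dataset standing in for a downstream task (e.g., sentiment classification).
texts = ["the movie was great", "the plot made no sense"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps; real fine-tuning iterates over a full dataset
    outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    predictions = model(**batch).logits.argmax(dim=-1)
    print(predictions)  # predicted class per example
```

The same pattern (load a pre-trained checkpoint, attach a task-specific head, train briefly on labeled data) underlies most of the downstream applications surveyed in the article.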