
Engineering, 2023, Volume 25, Issue 6. doi: 10.1016/j.eng.2022.04.024

Pre-Trained Language Models and Their Applications

a Baidu Inc., Beijing 100193, China
b College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
c Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Received: 2021-11-10 Revised: 2022-03-08 Accepted: 2022-04-05 Available online: 2022-09-07


Abstract

Pre-trained language models have achieved striking success in natural language processing (NLP), leading to a paradigm shift from supervised learning to pre-training followed by fine-tuning. The NLP community has witnessed a surge of research interest in improving pre-trained models. This article presents a comprehensive review of representative work and recent progress in the NLP field and introduces a taxonomy of pre-trained models. We first give a brief introduction to pre-trained models, followed by their characteristic methods and frameworks. We then introduce and analyze the impact and challenges of pre-trained models and their downstream applications. Finally, we briefly conclude and discuss future research directions in this field.
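To make the pre-training-followed-by-fine-tuning paradigm mentioned above concrete, the following is a minimal sketch (not taken from the article) of fine-tuning a pre-trained encoder on a downstream classification task with the Hugging Face Transformers library; the checkpoint name, toy data, and hyperparameters are illustrative assumptions rather than the article's setup.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained encoder and attach a randomly initialized classification head.
# "bert-base-uncased" is an illustrative choice of checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A toy labelled batch standing in for a downstream task dataset.
texts = ["the movie was great", "the plot made no sense"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: all pre-trained weights plus the new head are updated
# against a supervised objective on the downstream labels.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # returns cross-entropy loss and logits
outputs.loss.backward()
optimizer.step()

In contrast with training a task-specific model from scratch, such fine-tuning typically needs only modest labelled data and a few epochs of updates, which is the practical appeal of the paradigm the article reviews.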


