
1. A brief history of pre-trained models

The concept of pre-training is related to transfer learning [1]. The idea of transfer learning is to reuse the knowledge learned from one or more tasks and apply it to new tasks. Traditional transfer learning employs annotated data for supervised training, which was the common practice for at least a decade. Within deep learning, however, pre-training with self-supervised learning on massive unannotated data has become the dominant transfer learning approach: Instead of relying on annotated data, pre-training methods learn from unannotated data in a self-supervised fashion, and the resulting models can be applied to various downstream tasks via fine-tuning or few-shot learning.

In natural language processing (NLP), model pre-training is based on the task of language modeling. The goal of language modeling is to predict the next token, given a history of unannotated texts [2–4]. The first milestone of neural language modeling appears in Ref. [5], which models n-gram probabilities through distributed representations of words and feed-forward neural networks. Since then, deep learning methods have begun to dominate the training paradigm of language modeling. In early methods for neural language modeling, recurrent neural networks (RNNs) were widely used [6,7]. Among the RNN family, long short-term memory (LSTM) [8] stands out due to its advantage of being less prone to the gradient vanishing problem via its well-designed gating mechanism. With the emergence of the model known as transformer [9], considerable efforts have been devoted to building stronger and more efficient language models based on the transformer architecture [10–14]. In neural language modeling, distributed word representations named ‘‘word embeddings” that are learned with models such as Word2Vec [15] and GloVe [16] have become common initializations for the word vectors of deep learning models, significantly improving the performance of downstream tasks such as named-entity recognition [16], part-of-speech tagging [17], and question answering [18].

Although methods that leverage static word embeddings for warm startup can improve the performance of downstream NLP tasks, they lack the ability to represent different meanings of words in context. To solve this problem, context-aware language models were proposed to incorporate the complete context information into the training procedure. Dai and Le [19] introduced context-aware language modeling, which uses unannotated data to improve sequence learning with recurrent networks. This approach achieved significant performance improvements in sentiment analysis, text classification, and object classification tasks. In 2017, contextualized word vectors were proposed, which are derived from an encoder that is pre-trained on machine translation and then transferred to a variety of downstream NLP tasks [20]. However, these studies use a small amount of data for pre-training and do not achieve consistent performance improvement across all NLP tasks. Nonetheless, these pioneering studies greatly motivated follow-up pre-training methods for context modeling.

In another pioneering study on pre-trained models (PTMs), embeddings from language models were proposed to leverage bidirectional LSTMs in order to learn contextual word representations, and the pre-trained contextual embeddings were then applied to downstream tasks [21]. This method demonstrated great improvements in a broad range of NLP tasks, including question answering, textual entailment, sentiment analysis, semantic role labeling, coreference resolution, and named-entity extraction.

Since then, numerous PTMs within the ‘‘pre-training then fine-tuning” paradigm have started to emerge. Generative pre-training (GPT) [22] was the first model to use unidirectional transformers as the backbone for generative pre-training of language models, thereby illustrating the dramatic potential of pre-training methods for diverse downstream tasks. Following GPT, Bidirectional Encoder Representations from Transformers (BERT) [23] was the first model to leverage bidirectional transformers; it learns bidirectional contexts by conditioning on both the left and the right contexts in deep stacked layers. BERT introduced a denoising autoencoding pre-training task, termed masked language modeling (MLM), to recover the corrupted tokens of input sentences according to their contexts, in what was akin to a cloze task. This approach greatly boosted the performance gain of downstream natural language understanding (NLU) tasks. In this type of pre-training, which is also known as self-supervised learning, the pre-training labels are derived from unannotated data. By resorting to web-scale unannotated data from the Internet, PTMs can automatically learn syntactic and semantic representations.

The great success of PTMs has attracted a wide range of interest in scaling them up and exploring the boundaries of pre-training techniques; examples include decoding-enhanced BERT with disentangled attention (DeBERTa) [24], text-to-text transfer transformers (T5) [25], GPT-3 [26], the large-scale generative Chinese pre-trained language model (CPM) [27], PanGu-α [28], and ERNIE 3.0 Titan [29]. Large-scale PTMs, such as GPT-3, have now demonstrated powerful capabilities of zero-shot and few-shot learning. With only dozens of examples, GPT-3 achieved performance on SuperGLUE [30] similar to that of a BERT model fine-tuned with tens of thousands of annotated examples. GPT-3 can also generate creative texts of such quality that humans cannot reliably determine whether or not they were written by a person. The success of GPT-3 makes it possible to use this model for general-purpose text generation, which was considered impossible in past decades.

Another line of pre-training methods has attempted to incorporate knowledge in order to enhance the representation capability of PTMs [31]. Some studies employ linguistic knowledge to design entity-related tasks with weak supervision. For example, they corrupt entity spans in texts and use knowledge-masking strategies such as entity-level or phrase-level masking [31] and entity replacement prediction [32] to better learn lexical, syntactic, and semantic information from texts. Another direction of research integrates structured knowledge together with plain texts into pre-training, such as knowledge-enabled BERT (K-BERT) [33], contextualized language and knowledge embedding (CoLAKE) [34], enhanced language representation with informative entities (ERNIE-THU) [35], knowledge-enhanced BERT (KnowBERT) [36], SenseBERT [37], knowledge embedding and pre-trained language representation (KEPLER) [38], and ERNIE 3.0 [39]. ERNIE 3.0, which powers PTMs with knowledge, has achieved new state-of-the-art (SOTA) performances across 54 Chinese NLP benchmarks, as well as some English benchmarks, including SuperGLUE [30]. Moreover, K-Adapter [40] uses multiple adapters for different tasks independently in order to better fuse various knowledge sources and mitigate catastrophic forgetting. Knowledge incorporation has dramatically improved knowledge sharing between unstructured text and structured knowledge, greatly promoting the capacity for knowledge memorization and reasoning in PTMs [39].

However, the aforementioned models only focus on rich-resource languages, such as English and Chinese, and thus may overlook numerous low-resource languages. Recent work on multilingual models aims to transfer knowledge from rich-resource languages to low-resource languages by modeling the semantic representations of disparate languages in a unified vector space. Inspired by BERT, multilingual BERT (mBERT) was developed and released; this model is trained via multilingual masked language modeling (MMLM) on multilingual corpora [41]. From an intuitive perspective, the use of parallel corpora is conducive to learning cross-lingual representations in different languages. Therefore, the cross-lingual language model (XLM) [42] leverages bilingual sentence pairs to perform translation language modeling (TLM), which encourages models to align the representations of two languages together. Researchers have also released more multilingual language models, such as XLM-RoBERTa (XLM-R) [43], InfoXLM [44], and ERNIE-M [45], by improving MMLM or TLM. These studies have demonstrated that pre-trained multilingual language models can significantly improve the performance of multilingual NLP tasks and low-resource language tasks.

Given the success of PTMs in NLP, these models have quickly been extended to other fields, such as computer vision [46–48] and speech processing [49]. Although self-supervised pre-training has been the most successful transfer learning method in NLP, the PTMs used for computer vision tasks are diversified, and the dominant method in computer vision is still supervised learning. Sun et al. [48] show that representation learning holds promise for advancing model performance based on large-scale (noisy) annotated datasets, such as ImageNet [50] or JFT-300M [48]. These methods learn visual representations and significantly improve the performance of various downstream vision tasks [48]. Self-supervised pre-training has also been explored in computer vision [51–56]. Doersch et al. [53] propose various prediction tasks as pretext tasks to learn visual representations. Dosovitskiy et al. [57] explore the masked patch prediction task using the transformer architecture for images and demonstrate that pre-trained transformers achieve excellent results compared with convolutional neural networks (CNNs).

Recently, contrastive learning has been successfully utilized for visual self-supervised pre-training. Contrastive predictive coding [58] has achieved strong results in various scenarios, including speech, image, and text. These methods [58–60] attempt to maximize the similarity of two augmentations of an image and minimize the similarity of different images with contrastive loss. More recently, pre-training methods have been advanced by utilizing language supervision for visual representation learning [61], achieving a strong performance in image classification tasks and other vision tasks.
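To make the contrastive objective concrete, the following is a minimal PyTorch sketch of a symmetric InfoNCE-style loss over two augmented views of the same batch of images; the function name, temperature, and batch layout are illustrative assumptions rather than the exact formulation used by any of the cited methods.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """Pull together the two views of the same image (z1[i], z2[i]) and push apart
    embeddings of different images, using a temperature-scaled cross-entropy."""
    z1 = F.normalize(z1, dim=1)                    # (N, D) embeddings of augmentation 1
    z2 = F.normalize(z2, dim=1)                    # (N, D) embeddings of augmentation 2
    logits = z1 @ z2.t() / temperature             # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```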

Pre-training methods have also been applied to multimodal applications, in which texts are combined with other modalities, such as images [62–65], videos [66,67], and speech [68], enabling a broad application scope of PTMs. Such methods [63] significantly improve the performance of various multimodal tasks by jointly learning task-agnostic representations of images and texts. Based on the transformer architecture, PTMs build cross-modal semantic alignments from large-scale image-text pairs. For image generation, DALL-E [69] and CLIP-guided generation [61] leverage multimodal language and vision input to render compelling visual scenes. Although the most commonly used pre-training tasks for multimodal context are MLM and masked region prediction, Yu et al. [70] propose knowledge-enhanced scene graph prediction to capture the alignments of more detailed semantics. Gan et al. [71] incorporate adversarial training into pre-training and achieve higher performance. Cho et al. [72] formulate multimodal pre-training as a unified language modeling task based on multimodal context. This demonstrates that PTMs are playing a critical role in the artificial intelligence (AI) community and will potentially promote the unification of the pre-training framework across research fields such as speech, computer vision, and NLP.

There are some existing reviews on PTMs. Some focus on particular types and applications of PTMs, such as transformer-based pre-trained language models [73], BERT-based training techniques [74], prompt-based learning [75], data augmentation [76], text generation [77], and conversational agent design [78]. Another line of work provides a panoramic perspective of the whole progress of PTMs. For example, Ramponi and Plank [79] provide an overview from early traditional non-neural methods to PTMs in NLP. Qiu et al. [80] systematically categorize existing PTMs from four different perspectives and outline some potential directions of PTMs for future research. Bommasani et al. [81] propose the concept of foundation models to unify PTMs in different subfields such as NLP, computer vision, and speech, and analyze their opportunities and challenges in various AI domains. Han et al. [82] take a deep look into the history of PTMs to reveal their crucial position in the AI development spectrum. In our review, we mainly focus on PTMs in NLP: We first provide a detailed analysis of different PTMs and trends in PTMs at scale, discussing their impact on the field of NLP and the main challenges of PTMs; we then focus on our observations of and practices in the industrial applications of PTMs.

In this paper, we will first summarize the methods and taxonomy of pre-trained language models in Section 2, followed by a discussion of the impact and challenges of pre-trained language models in Section 3. Next, we will introduce the industrial applications of pre-training techniques in Section 4. Finally, we will conclude and address potential future work in this area.


2. Methods of PTMs


2.1. Different frameworks and extensions of PTMs

When working with PTMs, it is essential to design efficient training methods that can fully use unannotated data and assist downstream fine-tuning. In this section, we briefly introduce some widely used pre-training frameworks to date. Fig. 1 summarizes the existing prevalent pre-training frameworks, which can be classified into three categories: transformer decoders only, transformer encoders only, and transformer encoder–decoders. A brief description of each category is given below, and more detail is provided in the subsections that follow.


Fig. 1. An illustration of the existing prevalent pre-training frameworks, where x is the original sentence, xt (t = 1, 2, ..., T) is the tth token, T is the sequence length, and M(x) is the set of masked tokens in x. S denotes the start token embedding of a sequence. p1, p2, p3, and p4 denote the position embeddings of the first to fourth tokens. P is the conditional probability. i and j indicate the start and the end indices of input tokens of the encoder, respectively.

• Transformer decoders only frameworks use a unidirectional (left-to-right) transformer decoder as the pre-training backbone and predict tokens in a unidirectional autoregressive fashion. Here, ‘‘auto-regression” refers to predicting the current token based on historical tokens—that is, the partial sequence on the left of the current token. More specifically, given the text sequence x (where x is the original sentence, xt (t = 1, 2, ..., T) is the tth token, and T is the sequence length), an autoregressive model factorizes the likelihood of the input text sequence as p(x) = p(x1)p(x2|x1)···p(xT|x1, x2, ..., xT−1), where p(x) is the likelihood of the input text sequence.

• Transformer encoder only frameworks leverage a bidirectional transformer encoder and aim to recover corrupted tokens, given the input sentences with randomly masked tokens.

• Transformer encoder–decoder frameworks aim at pre-training a sequence-to-sequence (seq2seq) generation model by masking tokens on the source side and recovering them on the target side. These frameworks consist of two classes: ① seq2seq encoder–decoders, which consist of a bidirectional transformer encoder and a unidirectional decoder with separate parameters; and ② unified encoder–decoders, in which a bidirectional transformer encoder and a left-to-right decoder are simultaneously pre-trained with shared model parameters. A minimal sketch of the self-attention visibility patterns that distinguish these three families is given after this list.
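The sketch below illustrates, under simplified assumptions, the token-visibility (attention mask) patterns of the three families: a causal mask for decoder-only models, a fully bidirectional mask for encoder-only models, and a prefix-style mask for unified encoder–decoders with shared parameters. The function and tensor layout are illustrative rather than taken from any specific implementation.

```python
import torch

def framework_masks(seq_len=5, src_len=3):
    """Illustrative self-attention masks (True means position i may attend to position j)."""
    ones = torch.ones(seq_len, seq_len)
    # Decoder only: causal (left-to-right) mask; token t sees tokens 1..t only
    causal = torch.tril(ones).bool()
    # Encoder only: fully bidirectional; even corrupted ([MASK]) positions see the whole input
    bidirectional = ones.bool()
    # Unified encoder-decoder (shared parameters): the first src_len "source" tokens attend
    # bidirectionally among themselves, while the remaining "target" tokens attend to the
    # source plus their own left context (a prefix-LM-style mask)
    unified = torch.zeros(seq_len, seq_len).bool()
    unified[:src_len, :src_len] = True
    unified[src_len:, :] = torch.tril(
        torch.ones(seq_len - src_len, seq_len), diagonal=src_len
    ).bool()
    return causal, bidirectional, unified
```

A conventional seq2seq encoder–decoder realizes the same idea with two separate modules: a bidirectional mask in the encoder and a causal mask (plus cross-attention to the encoder outputs) in the decoder.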

2.1.1. Transformer decoders only

The objective for language modeling is to predict the next token auto-regressively, given its history. The nature of auto-regression entails the future invisibility of input tokens at each position; that is, each token can only attend to the preceding words. GPT [22] was the first model to use the transformer decoder architecture as its backbone. Given a sequence of words as context, GPT computes the probability distribution of the next word with the masked multi-head self-attention of the transformer. In the fine-tuning phase, the pre-trained parameters are set as the initialization of the model for downstream tasks. GPT is pre-trained on the BooksCorpus dataset, which is nearly the same size as the 1B Word Benchmark. It has hundreds of millions of parameters and improves SOTA results on nine out of 12 NLP datasets, showing the potential of large-scale PTMs. GPT-2 [83] follows the unidirectional framework with a transformer decoder that was trained with a larger corpus, namely, WebText, and 1.5 billion model parameters. GPT-2 achieves SOTA results on seven out of eight tested language modeling datasets in a zero-shot setting. GPT-3 [26] further increases the parameters of the transformer to 175 billion and introduces in-context learning. Both GPT-2 and GPT-3 can be applied to downstream tasks without fine-tuning. They achieve a strong performance by scaling up the model size and dataset size.

Unidirectional language modeling lacks attention on its full contexts on both sides, which may degrade its performance on downstream tasks. To tackle this problem, Yang et al. [84] propose permuted language modeling (PLM), which performs autoregressive modeling over permuted input tokens. For example, a permutation of the sentence ‘‘I love the movie” can be ‘‘I the movie love.” Once the permutation is chosen, the last few tokens of the permuted sentence are the targets to predict. In the above example, the token ‘‘love” is the target, depending on the visible context ‘‘I the movie.” An advantage of PLM is that it can fully leverage the contextual information for different masked tokens, thus capturing dependencies on both preceding and succeeding words. To enable PLM, Yang et al. [84] propose a novel two-stream self-attention mechanism, with one query stream to compute the query vectors and another content stream to compute the key/context vectors. The two-stream self-attention approach prevents the content of a target token from leaking into the representation that is used to predict it.
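As a minimal illustration (not XLNet's actual implementation), the content-stream visibility induced by a chosen permutation can be built as follows; the separate query stream, which additionally hides each target token from itself, is omitted.

```python
import torch

def plm_content_mask(permutation):
    """Content-stream mask for permuted language modeling: the token placed at step t of
    the permutation may attend to itself and to tokens placed at earlier steps,
    regardless of their original positions in the sentence."""
    seq_len = len(permutation)
    step = {pos: t for t, pos in enumerate(permutation)}   # original position -> permutation step
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        for j in range(seq_len):
            mask[i, j] = step[j] <= step[i]                # earlier (or same) step is visible
    return mask

# For "I love the movie" and the permutation [0, 2, 3, 1] ("I the movie love"),
# the token "love" (position 1) is predicted last and can see all the other tokens.
print(plm_content_mask([0, 2, 3, 1]))
```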

2.1.2. Transformer encoders only

Pre-trained transformer encoders, such as BERT [23], have become the standard in NLP systems. BERT uses an MLM framework with a transformer as the backbone. In the pre-training stage, BERT randomly replaces tokens with a special token [MASK] and tries to recover the corrupted words based on their contextual representations. It also adopts an objective of next-sentence prediction (NSP) to capture the discourse relations between two sentences, which is helpful for sentence-level tasks such as question answering. Devlin et al. [23] refer to this procedure as a cloze task, according to Ref. [85]. BERT was pre-trained on a combination of the BooksCorpus (800 million words) and English Wikipedia (2500 million words), and achieved great improvements on 17 NLP tasks, attaining a level even better than human performance on some of the downstream tasks. However, BERT’s shortcomings are also obvious: Because the [MASK] token does not appear in real data during fine-tuning, it creates a mismatch between pre-training and fine-tuning. To reduce this discrepancy, BERT masks tokens as follows: Among the 15% of positions randomly selected for prediction, only 80% are replaced by the [MASK] token, while 10% are kept as the original tokens, and 10% are replaced by random tokens during training. This masking strategy causes the model to take more steps to converge, since only 15% of the tokens in each training batch are predicted. Another problem with BERT is that it predicts masked tokens independently, without considering the other masked tokens. The model proposed in Ref. [86], a unified encoder–decoder model, addresses this problem by blanking out text spans of input sentences and predicting each masked span auto-regressively, which mitigates the independence assumption among masked tokens within the same span in the pre-training of masked language models.
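The corruption recipe can be sketched as follows; this is an illustrative, token-level implementation of the 15%/80%/10%/10% strategy described above, not BERT's actual WordPiece-level code.

```python
import random

def bert_mask(tokens, vocab, mask_rate=0.15):
    """Select ~15% of positions as prediction targets; of those, replace 80% with [MASK],
    10% with a random token, and leave 10% unchanged."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = token                      # the model must recover the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"                # 80%: mask
            elif r < 0.9:
                inputs[i] = random.choice(vocab)    # 10%: random replacement
            # remaining 10%: keep the original token
    return inputs, targets

corrupted, labels = bert_mask("the movie was surprisingly good".split(),
                              vocab=["film", "bad", "book", "slowly"])
```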

Following the success of BERT, an enormous amount of research effort has gone into MLM. SpanBERT [87] is designed to predict spans of text. It masks random contiguous spans instead of random tokens, and a span boundary prediction objective is introduced to force the model to predict masked spans according to the structural information of the span boundaries. It also achieves better performance by replacing the NSP objective in BERT with single-sequence training. SpanBERT outperforms BERT on span-related tasks such as question answering and coreference resolution. Unlike SpanBERT, which masks randomly selected spans, enhanced representation through knowledge integration (ERNIE) [31] uses a Chinese tokenizer to obtain phrase information and then replaces the random token masking in BERT with entity or phrase masking. ERNIE also utilizes a named-entity recognition toolkit to identify entity boundaries and randomly masks tokens at the entity level, thus enabling the integration of external knowledge into model pre-training.

2.1.3. Transformer encoder–decoders

Transformer encoder–decoder architecture is dedicated to natural language generation (NLG) tasks. Unlike NLU, which focuses on comprehending texts, NLG aims to generate a coherent, meaningful, and human-like natural language expression according to specific inputs. For example, the goal of machine translation is to generate a sentence in the target language with the same meaning as the given source language input; for text summarization, the goal is to generate a short version of the input document that captures the core meanings and opinions. The critical point is to model two sequences simultaneously—one for the input and the other for the output.

Song et al. [88] propose Masked Sequence-to-Sequence Learning (MASS) for language generation, in order to pre-train a seq2seq model. The basic idea of MASS is to take a sentence with a masked fragment (i.e., several consecutive tokens) as input and predict the masked fragment conditioned on the encoder representations. In this way, MASS successfully transforms the transformer encoder framework into an autoregressive framework by masking on the source side and predicting on the target side. MASS uses monolingual data from the News Crawl datasets of the Workshop on Machine Translation (WMT) to pre-train the model, and shows substantial improvements in machine translation quality in comparison with models directly trained on annotated data.
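A simplified sketch of the MASS-style input construction described above is shown below; subword handling and the decoder-side masking of uncorrupted positions are omitted, and the function is illustrative rather than MASS's actual preprocessing.

```python
import random

def mass_example(tokens, span_len):
    """Mask a consecutive fragment on the encoder (source) side; the decoder (target) side
    predicts exactly that fragment, conditioned on the encoder representations."""
    start = random.randrange(0, len(tokens) - span_len + 1)
    fragment = tokens[start:start + span_len]
    source = tokens[:start] + ["[MASK]"] * span_len + tokens[start + span_len:]
    target = fragment                               # predicted auto-regressively by the decoder
    return source, target

source, target = mass_example("we went to the movies last night".split(), span_len=3)
```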

Pre-training both a transformer encoder and a transformer decoder results in a unified model that can simultaneously deal with both language understanding and language generation. One member of this class is the standard transformer encoder–decoder model, in which the encoder and decoder do not share parameters. Bidirectional and Auto-Regressive Transformers (BART) [89] proposes an objective similar to that of MASS, but differs in that MASS masks a consecutive series of tokens—that is, n-grams of the input—while BART corrupts text with an arbitrary noising function—that is, masking/deleting/replacing/exchanging random tokens in different positions. BART can be viewed as a combination of the above two architectures: The random masking strategy on the source side enables the model to deal with NLU tasks, and the overall seq2seq pre-training framework enables the model to be generalized to NLG tasks. Pre-trained on 160 GB of news, books, stories, and web text, BART achieves comparable results to RoBERTa [90] and new SOTA results on dialogue and abstractive text summarization. Another member of this category unifies the encoder and decoder as identical transformer blocks. Dong et al. [91] and Bao et al. [92] also propose a unified language model pre-training framework for NLU and generation. These studies partition the self-attention matrix into three components: a bidirectional component, a unidirectional component, and a seq2seq component, which correspond to bidirectional, unidirectional, and seq2seq language models, respectively. Their experiments show performance gains over using a single pre-training objective. Du et al. [86] propose a variant of the model reported in Ref. [91], putting the masked tokens on the right of the unmasked tokens and conducting autoregressive blank filling. Xiao et al. [93] mask multiple segments at different granularities to encourage the decoder to rely more on the encoder representations, thus enhancing the correlation between the encoder and the decoder. Zhang et al. [94] adopt a different approach: First, a sentence is removed from an input document according to pre-defined importance criteria, and then the removed sentence is generated based on the remaining context sentences. This strategy performs auto-regression at the sentence level and promotes whole-document understanding and summary-like generation. Experiments on 12 downstream summarization tasks demonstrate SOTA results, showing the effectiveness of the gap-sentence pre-training method.
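For contrast with the MASS sketch above, the following illustrates a BART-style noising function; the corruption probabilities and the simple local swap are assumptions standing in for BART's full set of operations (token masking, deletion, text infilling, sentence permutation, and document rotation).

```python
import random

def bart_style_noise(tokens, p=0.15):
    """Corrupt a token sequence with several kinds of noise; the seq2seq model is then
    trained to reconstruct the original sentence on the decoder side."""
    noisy = []
    for token in tokens:
        r = random.random()
        if r < p / 2:
            noisy.append("[MASK]")                  # token masking
        elif r < p:
            continue                                # token deletion
        else:
            noisy.append(token)
    if len(noisy) > 1 and random.random() < p:
        i = random.randrange(len(noisy) - 1)
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]   # local swap (a crude stand-in for permutation)
    return noisy
```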


2.2. Scaling up PTMs

Recent advances in NLP have demonstrated a promising trend toward scaling up PTMs with billions of parameters. OpenAI researchers trained a model called GPT-3, which has 175 billion parameters [26]. GPT-3 achieves strong performance on many NLP datasets, including question answering, machine translation, and three-digit arithmetic. GPT-3 demonstrates that scaling up language models significantly improves task-agnostic and few-shot performance, sometimes even achieving better results than prior SOTA fine-tuning approaches [26]. Although large PTMs are a promising direction, training large-scale PTMs is a challenging task that requires massive training data and graphics processing unit (GPU) resources. Thus, efficient model training algorithms play a crucial role in scaling up PTMs. The following section introduces the prevalent large-scale PTMs as well as the training methods used to achieve them.

2.2.1. PTMs at scale

Table 1 [24–28,39,95–102] summarizes the mainstream large-scale PTMs. The size of PTMs has grown rapidly in recent years, ranging from 2.6 billion to as many as 175 billion parameters. Large-scale pre-trained language models combine a variety of training recipes, including exponentially increased numbers of trainable parameters, different pre-training architectures, knowledge enhancement, language-specific corpora, and diverse pre-training tasks, to support billion-parameter-scale training of PTMs. Although the training methods differ among these models, all of the PTMs use transformers [9] as the standard backbone due to the latter’s efficient parallel computing performance. Since training large-scale models requires massive unsupervised data, research on scaling up PTMs focuses on high-resource languages such as English and Chinese.


Table 1 Summary of large-scale pre-trained language models.

ZeRO: zero redundancy optimizer; MoE: mixture of experts.

According to the different designs used in pre-training architectures, large-scale PTMs can generally be classified into three classes (as in Section 2.1): encoder only, decoder only, and encoder–decoder. The majority of large PTMs leverage the decoder-only or encoder–decoder architecture, whereas only a few large models adopt an encoder-only design. This is because encoder-only models cannot perform well on generation tasks, such as text summarization and dialogue generation, while decoder-only models that are designed for language generation can shed light not only on NLG but also on language understanding tasks via prevalent prompting techniques, such as those used by GPT-3 [26].

• Encoder-only models at scale employ a bidirectional transformer encoder to learn contextual representations; they demonstrate impressive performance on NLU tasks. For example, DeBERTa1.5B [24], which consists of 48 transformer layers with 1.5 billion parameters, applies a disentangled attention mechanism and an enhanced mask decoder to surpass human performance on the SuperGLUE [30] benchmark. Since its bidirectional nature prevents the model from being used directly for NLG tasks, a unified encoder–decoder version of DeBERTa was also trained to adapt it to NLG tasks.

• Decoder-only models use transformer decoders, applying autoregressive masks to prevent the current token from attending to future tokens. Examples include GPT-3 [26], CPM [27], and PanGu-α [28]. This line of PTMs aims at generating human-like texts. Turing-NLG [95] is a 17-billion-parameter language model that has achieved strong performance on language modeling benchmarks. GPT-3, with 175 billion parameters, can write samples that deceive human readers, demonstrating that large-scale language models can dramatically advance few-shot learning scenarios with in-context learning. In addition to English large-scale monolingual PTMs, there are also models for other languages, such as Chinese and Korean. CPM [27] (2.6 billion parameters) and PanGu-α [28] (200 billion parameters) are two Chinese variants of GPT-3, while HyperCLOVA [96] is a 204-billion-parameter Korean variant.

• Encoder–decoder models can be further categorized into two classes: ① conventional seq2seq encoder–decoders and ② unified encoder–decoders. Conventional seq2seq encoder–decoders adopt the classic transformer encoder–decoder architecture for pre-training. Recent work includes T5 [25], the multilingual T5 (mT5) [97], and the large-scale cost-effective pre-trained language model (CPM-2) [98]. T5 [25], which has up to 11 billion parameters, unifies NLP tasks in one framework by casting language understanding and generation tasks in a text-to-text manner. As the multilingual variant of T5, mT5 [97], which has up to 13 billion parameters, extends the monolingual data to 101 human languages and outperforms the previous SOTA results on a variety of multilingual benchmarks. CPM-2 [98], with 11 billion parameters, is a bilingual model trained on Chinese and English, whose mixture-of-experts (MoE) version, denoted as CPM-2-MoE, has 198 billion parameters. This model has demonstrated excellent general language intelligence via fine-tuning and prompting. The other kind of encoder–decoder model is the unified encoder–decoder framework, in which the encoder and decoder share the same module and apply different mask strategies for MLM and autoregressive language modeling. ERNIE 3.0 [39] jointly learns language understanding and generation by designing two separate heads for understanding and generation, which share a task-agnostic representation. As the third-generation PTM (with ten billion parameters) in the ERNIE series, ERNIE 3.0 combines the merits of both autoregressive causal language models and autoencoding models to train large-scale knowledge-enhanced PTMs. It has surpassed the SOTA performance on a variety of NLP benchmarks, including SuperGLUE [30]. These methods have demonstrated superior performance because they all tend to unify multiple NLP tasks in one model and use different kinds of corpora or knowledge to enhance performance.

Most of the above-mentioned large-scale models are trained on plain texts without integrating knowledge. Therefore, some researchers have attempted to incorporate knowledge, such as linguistic knowledge and world knowledge, into PTMs. ERNIE 3.0 pre-trained transformers on massive unstructured texts and knowledge graphs to learn lexical, syntactic, and semantic information. It enriched the PTMs through knowledge integration, phrase masking, and named-entity masking.

The dramatic progress in language PTMs has attracted research interest in multimodal pre-training [72,103–107]. Table 2 [69,103,104,107] lists the details of large-scale multimodal PTMs. DALL-E [69] is a 12-billion-parameter variant of GPT-3 that was trained on 250 million English text–image pairs to generate images according to language descriptions, thereby improving the zero-shot learning performance. ERNIE-ViLG [107] uses a unified GPT framework for bidirectional image–text generation, formulating both image and text generation as autoregressive generative tasks. As a result, it outperforms previous methods on generative tasks such as text-to-image generation and image captioning, with a ten-billion-parameter model pre-trained on 145 million high-quality Chinese text–image pairs. Moreover, the multi-modality-to-multi-modality multi-task mega-transformer (M6) [104] is a 100-billion-parameter transformer encoder, which is trained on over 1.9 TB of images and 292 GB of Chinese texts. M6 achieved strong performance in visual question answering, image captioning, and Chinese image–text matching. In addition to their improvements on multimodal tasks, these models can improve the performance of monomodal tasks, such as text classification, inference, summarization, and question generation [105]. These results show that multimodal pre-training can leverage multimodal information to enhance both image and text representations, which in turn improves the performance of both multimodal tasks and NLP tasks.


Table 2 Large-scale multimodal PTMs.

2.2.2. Efficient training of large-scale models

The exponential growth in the size of PTMs has posed a great challenge for efficient training, owing to limited GPU memory and unaffordable training time. It is therefore essential to leverage efficient training techniques to speed up large-scale model training.

2.2.2.1. Dense models. Data parallelism is a simple solution that allocates different data partitions to multiple workers and duplicates identical parameters at all workers. However, it usually suffers from a small per-GPU batch size. Another solution is model parallelism, in which the model parameters are partitioned over different workers. However, conventional optimization algorithms require extra memory per parameter to store intermediate states, which hinders the model size from being scaled up efficiently. Pipeline parallelism combines the merits of both model parallelism and data parallelism to reduce the time cost. GPipe [108] uses a novel batch-splitting pipelining algorithm that first partitions a mini-batch of training samples into smaller micro-batches and then aggregates the gradient updates synchronously at the end. Megatron-LM [109] is an intra-layer model parallel approach for transformer networks, which adds a few synchronization primitives to the self-attention and multi-layer perceptron blocks. PTD-P [110] combines pipeline, tensor, and data parallelism across multi-GPU servers with a novel interleaved pipeline scheduling strategy, increasing throughput by more than 10%. Recently, Colossal-AI [111] implemented a combination of data, pipeline, sequence, and multiple tensor parallelism strategies for large-scale model training, which can be a good option for training dense models.
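At the optimization level, the micro-batching idea behind GPipe can be sketched as follows; this toy version shows only the splitting of a mini-batch and the accumulation of gradients before a single update, and omits the partitioning of layers across devices and the pipelined scheduling that give the approach its speedup.

```python
import torch

def micro_batch_step(model, optimizer, loss_fn, inputs, labels, num_micro_batches=4):
    """Split one mini-batch into micro-batches and accumulate their gradients
    so that a single parameter update is applied per mini-batch."""
    optimizer.zero_grad()
    for x, y in zip(inputs.chunk(num_micro_batches), labels.chunk(num_micro_batches)):
        loss = loss_fn(model(x), y) / num_micro_batches   # scale so the gradients average correctly
        loss.backward()                                   # gradients accumulate in .grad buffers
    optimizer.step()                                      # one update per mini-batch
```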

2.2.2.2. Sparse models. The sparsely gated MoE model [112] achieved a more than 1000-fold increase in model capacity by using a sparsely gated combination of multiple expert sub-networks. Leveraging an ensemble mechanism, MoE employs a gating unit to determine which top-k expert sub-networks should be activated for each prediction.
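A minimal top-k gated MoE layer might look like the sketch below; the expert architecture and routing loop are illustrative and omit the load-balancing losses and expert-capacity constraints used in practice. Setting k = 1 corresponds to the switch-style routing discussed next.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparsely gated mixture of experts: a gating network scores the experts and only
    the top-k experts are evaluated per input, so model capacity grows with the number
    of experts while per-example compute stays roughly constant."""
    def __init__(self, d_model, d_hidden, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x):                                  # x: (batch, d_model)
        topk_scores, topk_idx = self.gate(x).topk(self.k, dim=-1)
        weights = torch.softmax(topk_scores, dim=-1)       # weights for combining the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                chosen = topk_idx[:, slot] == e            # inputs routed to expert e in this slot
                if chosen.any():
                    out[chosen] += weights[chosen, slot].unsqueeze(-1) * expert(x[chosen])
        return out
```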

Switch transformers [91] have advanced the scale of PTMs with up to trillions of parameters by simplifying the sparse routing and replacing the feed-forward fully connected layers with switch routing, in which each sample is routed to only a single expert.

2.2.2.3. Other efficient training strategies. Recent techniques for memory-efficient optimization include mixed-precision training [113] and memory-efficient adaptive optimization. Mixed-precision training utilizes half-precision floating-point numbers without losing model accuracy, which nearly halves the memory requirements. Other studies have aimed at memory-efficient adaptive optimization. For example, the zero redundancy optimizer (ZeRO) [114], which is the catalyst that powers Turing-NLG, consists of the ZeRO-data parallelism (DP) and ZeRO-residual (R) algorithms, which aim at reducing the memory footprint of the model states and the residual memory consumption, respectively. First, ZeRO-DP optimizes the optimizer states, gradients, and parameters by performing optimizer state partitioning, adding gradient partitioning, and adding parameter partitioning. Then, ZeRO-R optimizes the residual memory through the removal of activation replication, the pre-definition of appropriate temporary buffer sizes, and proactive memory management.
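As an illustration of the first technique, a typical mixed-precision training step using PyTorch's automatic mixed precision utilities is sketched below; this is one common recipe and is not claimed to be the exact setup used by any of the models in Table 1.

```python
import torch

scaler = torch.cuda.amp.GradScaler()           # scales the loss to avoid fp16 gradient underflow

def amp_train_step(model, optimizer, loss_fn, x, y):
    """One mixed-precision step: the forward pass and loss run in half precision where
    safe, while the master weights and the optimizer update stay in full precision."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()              # backward on the scaled loss
    scaler.step(optimizer)                     # unscales gradients, then updates the weights
    scaler.update()                            # adjusts the scale factor for the next step
```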


3. Impact and challenges of PTMs


3.1. Impact of PTMs in NLP

The emergence of PTMs has enabled a significant breakthrough in the field of NLP. Before PTMs, many studies focused on designing specialized models for specific NLP tasks, which usually could not be used for other tasks. For example, Kim [115] proposes the TextCNN model for text classification, and Hochreiter and Schmidhuber [8] propose the LSTM model for language generation. Since their emergence, PTMs have started to serve as foundation models in NLP due to their impressive capabilities in representation learning. This has opened up a new ‘‘pre-training then fine-tuning” paradigm for NLP. This paradigm can fully exploit unannotated data to train a foundation model and then fine-tune it with limited task-specific annotated data. Even with limited annotated data, the performance of the downstream NLP tasks is greatly improved. Fig. 2 [23,39,116,117] demonstrates the evolution of SOTA results on five NLP benchmarks, from supervised models without pre-training to PTMs such as BERT and ERNIE 3.0. It is evident that PTMs significantly outperform the previous non-PTMs, and the knowledge-enhanced ERNIE 3.0 has steadily exceeded BERT on many NLP tasks. Another important trend is to adopt PTMs to unify almost all NLP tasks. For example, T5 [25] casts both language understanding and generation tasks in a text-to-text manner and tackles all NLP tasks using a seq2seq PTM. Thus, the NLP community has witnessed an emerging trend of task unification.


Fig. 2. The evolution of representation techniques on various NLP benchmarks. Results are from Refs. [23,39,116,117]. SuperGLUE is an NLU leaderboard consisting of a set of difficult language understanding tasks; an original Chinese natural language inference dataset (OCNLI), a Chinese machine reading comprehension dataset (DRCD), a large-scale Chinese short-text summarization dataset (LCSTS), and a Chinese multi-domain dialogue dataset for multi-turn knowledge-driven conversation (KdConv) are evaluation corpora for natural language inference, machine reading comprehension, text summarization, and dialogue generation, respectively. w/o: without.

GPT-3 [26] has shown a promising performance in zero-shot and few-shot learning. Along with GPT-3, a new prompt-exploiting training approach [118] has been proposed to reformulate the task paradigm. Thus, pre-training then prompt tuning has initiated a new trend to better leverage PTMs. Instead of adapting PTMs to downstream tasks with fine-tuning, downstream tasks are pre-defined as ‘‘slot-filling” tasks: Given a human-designed template with slots, the PTMs learn to fill out these templates. This framework has been proven powerful, as it enables language models to adapt to few-shot or zero-shot scenarios; as a result, it has attracted wide attention in the NLP community. We generally describe the impact of PTMs in the following three aspects: NLU, NLG, and dialogue. For dialogue, PTMs focus on response generation. Here, we take dialogue as a separate category due to its large amount of related work.
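A toy example of the slot-filling formulation is given below, using a masked language model from the HuggingFace Transformers library to score hand-picked verbalizer words for sentiment classification; the template, verbalizer words, and checkpoint are illustrative choices rather than a recipe from the cited work.

```python
from transformers import pipeline

# Prompt-based ("slot-filling") classification: the task is rephrased as a cloze template,
# and the masked language model's preference between verbalizer words decides the label.
fill = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was predictable and the acting was flat."
template = f"{review} Overall, the movie was [MASK]."

scores = {pred["token_str"]: pred["score"]
          for pred in fill(template, targets=["great", "terrible"])}
label = "positive" if scores["great"] > scores["terrible"] else "negative"
```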

3.1.1. Natural language understanding

NLU is a broad topic in NLP that contains many tasks, such as named-entity recognition, sentiment analysis, document classification, reading comprehension, semantic matching, natural language inference, and information extraction. Table 3 [39,116,117,119,120] compares the performance of models with and without pre-training techniques on four different NLU tasks. It can be seen that models with pre-training outperform those without pre-training by a clear margin. Thus, PTMs have become the standard backbone in NLU tasks. Numerous researchers have employed PTMs to provide task-agnostic representations and then designed task-specific architectures or objectives to enhance NLU performance. For example, BertGCN [121] combines the representative capacity of BERT and transductive learning from graph convolutional networks to advance text classification, increasing accuracy by around 4%.


Table 3 SOTA performance with and without pre-training on NLU tasks.

Results are from Refs. [39,116,117,119,120]. w/: with; SST-2: Stanford Sentiment Treebank v2; OCNLI: Original Chinese Natural Language Inference; DRCD: Delta Reading Comprehension Dataset.

To compare the performance of PTMs on NLU tasks, researchers have uploaded their results to two benchmarks, GLUE and SuperGLUE; PTMs now outperform humans on both leaderboards. In addition, multilingual models such as mBERT [41], XLM [42], mT5 [97], and ERNIE-M [45] use a unified model to represent various languages such that the learned information can be shared among different languages. This technology alleviates the data sparseness problem in low-resource languages and reduces the need to train specialized language models for each specific language. This new paradigm is changing the focus of NLP research from designing specialized models for multilingual tasks to studying how PTMs can be used in these tasks.

3.1.2. Natural language generation

NLG tasks, such as text summarization, question generation, and data-to-text generation, are very challenging in NLP. Due to the huge search space, it is difficult for methods from before the PTM era, which suffer from insufficient annotated data and limited model parameters, to generate fluent, coherent, and informative text. As shown in Table 4 [94,122–125], PTMs have played a key role in improving the performance of NLG tasks. Large-scale PTMs automatically learn word combinations and sentence expressions from unannotated data, which significantly improves the models’ language generation ability in terms of fluency, coherence, and informativeness. ERNIE-GEN [93] uses an enhanced multi-flow seq2seq pre-training and fine-tuning framework and incorporates a span-by-span generation task to generate consecutive entities, which has achieved new SOTA results on five typical NLG tasks. Researchers and practitioners also pre-train task-specific transformer models on generation tasks, such as MASS [88] and PEGASUS [94]. More specifically, MASS adopts the encoder–decoder framework to reconstruct a sentence fragment, given the remaining part of the sentence, and achieves significant improvements over baselines without pre-training on machine translation. PEGASUS pre-trains a large-scale encoder–decoder model with a well-designed pre-training objective and achieves a SOTA performance on all 12 text-summarization tasks. With the growth of model size, PTMs gradually show a notable ability in creative writing. Models such as GPT-3, HyperCLOVA, and ERNIE 3.0 are capable of generating articles, questions and answers, novels, and program code via only zero-shot learning. The quality of the generated texts is sometimes comparable with that of human-written texts. For example, humans only achieve 52% accuracy in distinguishing real news from fake news generated by GPT-3.


Table 4 SOTA performance with and without pre-training on NLG tasks.

Results are from Refs. [94,122–125]. ESLC: English Skills Learning Center; BLEU: bilingual evaluation understudy; ROUGE-L: recall-oriented understudy for gisting evaluation–longest common subsequence.

3.1.3. Dialogue

In the past few years, several representative dialogue-generation models have been pre-trained with human-like conversations collected from social media, including Twitter, Reddit, Weibo, and Baidu Tieba. Based on the general language model GPT-2 [83], DialoGPT [126] has been trained for response generation using Reddit comments. Meena [127] scales up the network to 2.6 billion parameters and employs more social media conversations in the training process, resulting in a significant improvement in response quality. To mitigate undesirable toxic or biased traits in large corpora, Blender [128] further fine-tunes the PTM with human-annotated datasets and emphasizes the desirable conversational skills of engagingness, empathy, and personality. In addition, to alleviate the safe-response problem in open-domain chitchat, PLATO [129] encodes the discrete latent variable into transformers for diverse response generation. Moreover, PLATO-2 [130] further scales up PLATO via curriculum learning for both Chinese and English response generation. The Ninth Dialog System Technology Challenge (DSTC-9) [131] revealed that PLATO-2 delivers a superior performance in multiple conversational tasks, including open-domain chitchat, knowledge-grounded dialogue, and task-oriented conversation. Recently, PLATO-XL [132] was scaled up to 11 billion parameters, with multi-party-aware pre-training being carried out to better distinguish roles in social media conversations. Other Chinese dialogue PTMs that have been developed on a modest scale include Cdial-GPT [133], ProphetNet-X [134], and EVA [135].

With these large-scale dialogue PTMs, some of the problems that plague traditional end-to-end neural approaches [136,137] are alleviated significantly, including deficiencies in response fluency and context relevance. Moreover, in comparison with existing chatbots that rely on complex frameworks, such as Mitsuku [138] and XiaoIce [139], these dialogue PTMs demonstrate superior performance in multi-turn conversations, especially in terms of engagingness and humanness.


3.2. Key research challenges

Although PTMs have significantly improved the performance of NLP tasks, there are still some key challenges for PTM applications, such as interpretability, robustness, reasoning capability, and the deployment of large-scale PTMs. This section describes these challenges in the hope that additional future efforts can be devoted to these directions.

3.2.1. Deployability

One trend in PTMs is the substantial increase in capacity. Since the release of GPT [22] and BERT [23], PTMs have scaled exponentially with respect to both the number of parameters and the size of the pre-training data. For example, the largest version of GPT-3 [26] requires a total training computation of 3.64 × 10³ petaflop/s-days, corresponding to a total of around 3.14 × 10²³ floating-point operations, and costs millions of dollars to train. The rapid growth in model size raises concerns regarding the tradeoff between scale and deployability. Two types of strategy have been proposed to tackle this issue: ① Large-scale PTMs are only used as the foundation model via application programming interface (API) calls, similar to the way in which the GPT-3 model is used. This strategy enables the efficient use of PTMs and evades model deployment on each device, but significantly limits the model’s application scope. ② Large models are compressed to smaller ones [140] for potential deployment. Typical techniques include model compression and knowledge distillation. Unfortunately, existing compression techniques are unable to compress super-large PTMs (e.g., GPT-3) to a size suitable for deployment on a single GPU or a terminal device such as a laptop or cell phone. Advanced research on model compression is thus imperative in order to make large PTMs available to more users. Another promising direction is to use parameter-efficient techniques, such as prompt tuning [141–146], to reduce the memory budget of deployment; this remains a large area for further exploration.
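As an illustration of the second strategy, a generic knowledge-distillation objective is sketched below; the temperature and loss weighting are assumed values, and the sketch describes the general recipe rather than any particular compressed PTM.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Train a small student to match the large teacher's softened output distribution
    while still fitting the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                         # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```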

3.2.2. Model trustworthiness

Another challenge of PTMs is their trustworthiness, which mainly involves their interpretability [147] and robustness [148]. Although PTMs have achieved SOTA performances across various tasks, how they make decisions is sometimes obscure to humans, which makes PTMs difficult to apply in fields where model interpretability is essential, such as healthcare and law [149]. Consequently, there is a growing interest in interpreting deep neural models [150]. In particular, many studies aim to understand what PTMs have learned in their representations [151].

Some studies have been published on the trustworthiness of deep neural models. These include: linguistic structural analyses on PTMs [152], which aim to analyze the linguistic knowledge that is learned by pre-trained language models and to understand the reason for their success; model behavioral analyses [153], which evaluate model robustness and reliability with multiple test sets; and post-hoc explanation analyses [154], which aim to provide understandable explanations for the predictions of deep neural models.

Despite the research that has already been done in this field, the following challenges must be addressed in order to build trustworthy systems: ① general interpretation methods for NLP tasks (existing interpretation methods are designed for classification tasks); ② causal analysis between model prediction and learned knowledge or extracted explanations; and ③ a comprehensive evaluation platform for interpretability, including evaluation data and metrics.

3.2.3. Commonsense knowledge and reasoning

Large-scale PTMs have been found to encode some commonsense knowledge [155]. Nevertheless, appropriate probing tasks need to be designed in order to mine the commonsense knowledge learned by PTMs—such as formulating a relational knowledge-extraction task as the completion of fill-in-the-blank statements—so as to examine the knowledge-learning ability of PTMs [156]. Although PTMs learn some knowledge from texts, there is still a large amount of knowledge that cannot be obtained from texts alone. One possible direction is to have models learn this kind of knowledge from both visual inputs and text inputs.

In addition to commonsense knowledge, other studies are questioning whether PTMs are endowed with reasoning abilities. For example, Talmor et al. [157] design different tasks to evaluate the reasoning abilities of PTMs. The researchers disentangle pre-training from fine-tuning and find that the reasoning capabilities are poor for most PTMs, revealing that existing PTMs lack the ability to reason. To alleviate this problem, one possible direction could be to integrate prior knowledge into the PTMs in order to guide the models to learn reasoning rules implicitly.

3.2.4. Model security

One severe issue with PTMs is their vulnerability to adversarial examples, which can mislead the model into producing a specific wrong prediction when perturbations are injected into the input [158]. This susceptibility exposes PTMs to safety concerns: The models can be easily attacked with adversarial patterns by third parties, resulting in irreparable loss in real-world applications. In addition to adversarial attacks, another form of attack—namely, backdoor attacks—is a threat to PTMs. Unlike adversarial attacks, which usually act during the inference process of a neural model, backdoor attacks hack the model during training [159]. If a model is deliberately trained on backdoor data, it will be extremely dangerous for users to use this model in applications involving privacy and security concerns. Future work could aim to improve the robustness of PTMs toward adversarial attacks. To deal with backdoor attacks, a model should be able to detect in the input the triggers that can activate the backdoor attack and remove the triggers, thus enhancing model security.


4. Applications of PTMs


4.1. Platforms and toolkits for applications

Due to their universality, PTMs have become foundation models in NLP. Many researchers have developed a series of open-source toolkits and platforms to make better use of PTMs. These toolkits and platforms usually contain various PTMs, fine-tuning tools, and model-compression tools.

4.1.1. Toolkits

When researchers propose a new pre-trained language model, they often open-source a corresponding toolkit for developers. Such toolkits usually provide code for downstream task development based on the specific model, and therefore lack generality. Typical toolkits include google-research/bert [160], PaddlePaddle/ERNIE [161], and PCL-Platform.Intelligence/PanGu-α [162]. These toolkits provide a series of open-sourced PTMs, such as BERT, ERNIE, and PanGu-α, along with source code and training data. For example, the ERNIE toolkit provides not only the source code, training data, and PTM of ERNIE but also a couple of enhanced ERNIE series models, such as ERNIE-Doc [163] and ERNIE-ViL [70]. In order to deploy the ERNIE model to online services, the ERNIE toolkit also provides a model-compression tool.

With the rapid release of new PTMs, being able to use these models through a unified toolkit has become an urgent need. Against this background, toolkits for general NLP applications have been developed. Typical toolkits include HuggingFace/Transformers [164], Fairseq [165], and PaddleNLP [166]. PTMs are integrated into such general-purpose toolkits in a user-friendly way. Taking HuggingFace as an example, this toolkit integrates the code for different kinds of PTMs and code for downstream application development, including classification, generation, summarization, translation, question answering, and so forth.
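For illustration, loading PTMs for downstream tasks through such a general-purpose toolkit can be as simple as the short script below; the pipelines shown use publicly available default checkpoints and are examples rather than recommendations.

```python
from transformers import pipeline

# Text classification with a fine-tuned PTM behind a one-line API
classifier = pipeline("sentiment-analysis")
print(classifier("Pre-trained models make this task almost trivial."))

# Extractive question answering with another pre-trained pipeline
qa = pipeline("question-answering")
print(qa(question="What do general-purpose toolkits integrate?",
         context="General-purpose toolkits integrate many pre-trained models "
                 "behind a unified pipeline interface."))
```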

4.1.2. Platforms

Besides toolkits, platforms provide users with PTM services for customization. These platforms can provide facilities for developers to build models and deploy them to online services. For example, Baidu Wenxin [167] is a platform to facilitate the use of PTMs. This platform meets the needs of both experienced developers and junior developers. It enables developers to easily build their models with data and model configuration only. It also provides experienced developers with toolkits to train their models that are tailored for applications. Other platforms such as AliceMind [168] provide similar services with no significant differences. OpenAI API [169] is another kind of platform that is used to develop applications based only on PTMs. OpenAI API is based on GPT-3 [26]; it provides specific high-level functions, such as English-to-French translation, grammar correction, question answering, advertisement generation, and product-name generation.


4.2. Applications

PTMs have been widely deployed in real applications, including document intelligence, content creation, virtual assistants, and intelligent search engines. Below, we describe how PTMs are applied in each field.

4.2.1. Document intelligence

One widely studied application for PTMs is document intelligence, which includes sentiment analysis, news classification, anti-spam detection, and information extraction. Sentiment analysis is widely used to identify sentiment polarity, such as public opinion, for market research, brand reputation analysis, and social media influence. Garg and Chatterjee [170] propose analyzing the sentiment of Twitter feeds using a PTM and classifying them into three categories: positive, negative, and neutral. AlQahtani [171] proposes analyzing customer reviews on products by combining data-mining techniques with PTMs. Recently, Singh et al. [172] analyzed public sentiment on the impact of the coronavirus on social life using a PTM. Chen and Sokolova [173] propose analyzing the sentiments in the coronavirus disease 2019 (COVID-19)-related messages in a popular social media platform, where users share their stories to seek support from other users, especially during the COVID-19 pandemic. Experimental results show that PTMs can achieve significant performance gain in classifying sentiment polarities, demonstrating the effectiveness of PTMs.

News classification and anti-spam detection can also be modeled as classification tasks. Ding et al. [163] apply PTMs to classify news into extreme left-wing or right-wing standpoints. Liu et al. [174] propose classifying the papers published in Arxiv.org into 11 categories, including math, computer science, and so forth. Jwa et al. [175] use BERT to detect fake news by analyzing the relationship between the headline and the body text in news.

Document information extraction is widely used in industry. Many AI cloud services, such as Google AI Cloud, Baidu AI Cloud, and Alibaba AI Cloud, contain tools for information extraction [176]. Among these, Baidu has built a PTM-based platform, TextMind, for document information-extraction applications, including receipt analysis for expense reimbursements, information extraction from resumes, financial statement analysis, contract analysis, and legal judgment analysis. Wayfair, one of the world’s largest online home retailers, also applies BERT to extract information from customer messages.
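As a minimal sketch of PTM-based information extraction, the example below runs a named-entity-recognition pipeline over an invoice-like text. The default checkpoint and its entity labels are assumptions; production services such as those mentioned above use far richer, domain-specific schemas.

```python
# Minimal sketch of information extraction with a token-classification
# (NER) pipeline; the input text is invented for illustration.
from transformers import pipeline

extractor = pipeline("ner", aggregation_strategy="simple")
text = ("Invoice issued by Wayfair on 12 March 2022 for a total of 1250 USD, "
        "payable to John Smith.")
for entity in extractor(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```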

Document image understanding is another important research topic in document intelligence, aimed at automatically reading, understanding, and analyzing business documents. A series of multimodal document PTMs [177] has been proposed to jointly model the interactions between text, image, and layout information in business documents for many document image understanding tasks, such as receipt understanding, document image classification, and document information extraction. Applica proposes a solution that takes layout, graphics, and text into consideration in order to extract precise answers for complex business processes in financial services, insurance services, life sciences, and so on.

4.2.2. Content creation

Content creation tasks are often used to verify the performance of recently proposed large-scale models [22]. For example, Narrativa applies GPT-2 to generate high-quality advertisement content from just a few words provided by customers [178]. GPT-2 has demonstrated its ability to generate content for e-commerce, relieving humans from laborious tasks. Microsoft has also demonstrated that the pre-trained generation model Turing-NLG is beneficial for autosuggest recommendations [95]. Moreover, many researchers have built various demo applications based on GPT-3, including applications for ad generation, AI copywriting, book writing, code generation, customer service, and so forth. As for visual content creation, pre-trained multimodal generative models such as DALL-E [69], CogView [103], and ERNIE-ViLG [107] have greatly improved the quality and fidelity of generated images. The results from CogView demonstrate its capability to generate high-quality images in a single domain such as industrial fashion design, and the model has been deployed in online fashion production.
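A minimal sketch of this kind of prompt-driven content generation with GPT-2 is shown below; the prompt and sampling parameters are illustrative and are not those used by any of the systems cited above.

```python
# Sketch of prompt-driven content generation: GPT-2 continues a short
# advertisement-style prompt with nucleus sampling.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Introducing our new ergonomic office chair:"
result = generator(prompt, max_length=40, do_sample=True, top_p=0.9,
                   num_return_sequences=1)
print(result[0]["generated_text"])
```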

In addition to these industrial applications, researchers have shown the potential of PTMs for creative writing, including poem generation [179], lyrics generation [27], e-mail auto-completion [180], to-do generation [181], auto-completion for sentences and paragraphs, and even long novel generation [22]. Although PTMs exhibit strong generative capabilities, an increasing number of concerns have arisen regarding generative models, including privacy and copyright issues.

4.2.3. Virtual assistants

Virtual assistants are now adopted in many applications. Typical examples include smart speakers, such as Alexa [182] from Amazon and Xiaodu [129] from Baidu. Such applications use PTMs and have shown that PTMs can provide excellent spoken language understanding and voice recognition [183] in smart speakers. With the benefits brought by PTMs, these smart speakers can respond to weather forecast queries, sing songs on demand, and enable voice control of smart home devices. Moreover, smart speakers can chat with humans on a broad range of topics and thus establish a closer and more stable relationship between users and the system. In addition to their use in smart speakers, PTMs have been deployed in mobile-phone-based virtual assistants, such as Siri and Google Assistant. For example, NDTV [184] reports that PTMs can improve interaction quality, while Vincent [185] notes that PTMs can be used in intelligent customer service robots to recognize customer sentiment.

As PTMs are applied more and more widely in virtual assistants, the responses generated by chatbots are becoming more human-like. For example, Microsoft has proposed a PTM-based model called DialoGPT that learns from the comment history of Reddit and can fluently reply to users. Google has also used PTMs to develop a chatbot application that can ‘‘chat about anything” [127]. To make chatbots more human-like, Facebook applied PTMs to a series of dialogue chatbots named Blender and Blender 2.0 [128]. Shortly afterwards, Baidu proposed PLATO-XL [132], a PTM-based model, to further push chatbot performance, reaching the state of the art (SOTA) in terms of both human and automatic evaluation metrics. Thanks to the performance improvements brought by PTMs, these applications can be very robust in interactions with users [186].
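The sketch below shows a single chatbot turn with DialoGPT, following the usage pattern published with the microsoft/DialoGPT checkpoints on HuggingFace; the user utterance and generation settings are illustrative.

```python
# Sketch of one DialoGPT chatbot turn using greedy decoding.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

user_turn = "Does money buy happiness?"
input_ids = tokenizer.encode(user_turn + tokenizer.eos_token, return_tensors="pt")

# Generate a reply; in a multi-turn dialogue, previous turns would be
# concatenated to input_ids before calling generate().
reply_ids = model.generate(input_ids, max_length=100,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:],
                       skip_special_tokens=True))
```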

4.2.4. Intelligent search

Aside from the applications mentioned above, PTMs are widely used in search engines. Google has already applied PTMs in Google Search and achieved significant improvements [187]. Baidu has also applied PTMs, namely ERNIE 2.0 [188] and ERNIE 3.0 [39], as the backbone of Baidu Search [189] to support semantic matching by encoding text into dense representations for better retrieval performance. Facebook [190] has revealed a unified embedding framework for personalized systems and noted that its future work will incorporate PTMs.
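A simplified sketch of such semantic matching for retrieval is given below: a query and candidate passages are encoded into dense vectors with a PTM and ranked by cosine similarity. The generic BERT checkpoint and mean pooling are assumptions for illustration; production systems fine-tune dedicated retrieval models.

```python
# Sketch of dense semantic matching: encode texts with a PTM and rank
# candidate passages by cosine similarity to the query.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # [batch, seq_len, dim]
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padded tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

query_vec = embed(["how do pre-trained models improve search"])
passage_vecs = embed(["PTMs encode queries and documents into dense vectors.",
                      "The weather will be sunny and warm tomorrow."])
scores = torch.nn.functional.cosine_similarity(query_vec, passage_vecs)
print(scores)   # a higher score indicates a better semantic match
```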

To address the surging demand for multimedia content search, the performance of image and video search engines can be enhanced with multimodal PTMs. For example, WenLan [106] developed two real-world applications based on image–text matching, demonstrating the power of multimodal pre-training.

To further improve the performance of search engines, researchers have recently paid increasing attention to multilingual models for search. Multilingual models are pre-trained on a multilingual corpus to learn cross-language information [191]. Their most significant advantage is transferability across languages, which improves performance on low-resource languages.

《5. Conclusions and future work》

5. Conclusions and future work

PTMs can fully exploit unannotated data for self-supervised learning and have become foundation models in NLP, significantly improving the performance of downstream NLP tasks. The emergence of PTMs opens up a new ‘‘pre-training then fine-tuning” paradigm for NLP. As model parameters increase, PTMs show promising performance in zero-shot and few-shot learning. Their success in NLP is triggering more research devoted to PTMs in other fields, such as computer vision, speech processing, and multimodal understanding and generation, revealing their potential to act as foundation models in these fields as well.

Despite the dramatic success of PTMs in NLP, there is still a long way to go to achieve artificial general intelligence. First, PTMs are black boxes that are poorly understood. Their interpretability and robustness have yet to be fully explored, due to the nonlinearity of transformer models. Thus, it is difficult to use PTMs for reliable decision-making and reasoning before their principles are fully understood, and it is worth devoting a great deal of effort to researching the uncertainty of PTMs. Furthermore, current multimodal and multilingual pre-training [192] is still at an early stage. Unifying multimodal and multilingual pre-training will emerge as an exciting trend for further exploration, which may improve the performance of low-resource tasks. Another promising direction is to incorporate prior knowledge into PTMs to improve their reasoning abilities and efficiency. Existing work on knowledge pre-training, such as K-BERT [33] and ERNIE 3.0 [39], has injected knowledge triplets into pre-training or fine-tuning. However, PTMs still demonstrate limited capability for commonsense awareness and reasoning, which requires further improvement. Although large-scale PTMs have demonstrated strong generalization capabilities, efficiently deploying them remains an open question. For applications that require low latency, model compression of PTMs remains a promising direction; existing model-compression methods include distillation [193], pruning [194], quantization [195], and so forth. However, how to efficiently build large-scale PTMs with a deployable inference time remains an ongoing challenge. In addition, designing more efficient architectures to replace or improve transformers remains an open problem.

In summary, there is still a long way to go before PTMs can make reliable decisions and carry out reliable planning, which are essential elements of AI. More efficient and powerful neural networks need to be proposed and developed. Fortunately, the use of PTMs in real applications continues to provide increasing amounts of data and to raise new challenges, potentially promoting the rapid development of new pre-training methods.

《Compliance with ethics guidelines》

Compliance with ethics guidelines

Haifeng Wang, Jiwei Li, Hua Wu, Eduard Hovy, and Yu Sun declare that they have no conflict of interest or financial conflicts to disclose.