
Engineering, 2020, Volume 6, Issue 3. doi: 10.1016/j.eng.2019.12.014

Progress in Neural NLP: Modeling, Learning, and Reasoning

Microsoft Research Asia, Beijing 100080, China

Received: 2019-04-30; Revised: 2019-08-30; Accepted: 2019-10-13; Available online: 2020-01-07


Abstract

Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on enabling computers to understand and process human languages. In the last five years, we have witnessed rapid progress in NLP tasks such as machine translation, question answering, and machine reading comprehension, driven by deep learning and enormous volumes of annotated and unannotated data. In this paper, we review the latest progress in the neural network-based NLP framework (neural NLP) from three perspectives: modeling, learning, and reasoning. In the modeling section, we describe several fundamental neural network-based modeling paradigms, such as word embedding, sentence embedding, and sequence-to-sequence modeling, which are widely used in modern NLP engines. In the learning section, we introduce widely used learning methods for NLP models, including supervised, semi-supervised, and unsupervised learning; multitask learning; transfer learning; and active learning. We view reasoning as a new and exciting direction for neural NLP, but one that has yet to be well addressed. In the reasoning section, we review reasoning mechanisms, including the knowledge they draw on, existing non-neural inference methods, and new neural inference methods. We emphasize reasoning because it is essential for building interpretable, knowledge-driven neural NLP models that can handle complex tasks. At the end of the paper, we briefly outline our thoughts on future directions for neural NLP.
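To make two of the modeling paradigms named above concrete, here is a minimal sketch, in PyTorch, of a word-embedding layer feeding a sequence-to-sequence encoder-decoder. It is an illustrative toy under stated assumptions, not the paper's implementation: the ToySeq2Seq class, the GRU choice, and all sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM = 100, 32, 64  # toy sizes, chosen for illustration

class ToySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        # Word embedding: maps discrete token ids to dense vectors.
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        # Encoder summarizes the source sequence into a hidden state.
        self.encoder = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        # Decoder generates target-side states conditioned on that summary
        # (attention is omitted to keep the sketch short).
        self.decoder = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.project = nn.Linear(HID_DIM, VOCAB_SIZE)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.embed(src_ids))       # encode source tokens
        out, _ = self.decoder(self.embed(tgt_ids), h)  # teacher-forced decoding
        return self.project(out)                       # per-step vocabulary logits

model = ToySeq2Seq()
src = torch.randint(0, VOCAB_SIZE, (2, 7))  # batch of 2 source sequences, length 7
tgt = torch.randint(0, VOCAB_SIZE, (2, 5))  # teacher-forced target inputs, length 5
print(model(src, tgt).shape)                # torch.Size([2, 5, 100])
```

Training such a model would minimize cross-entropy between these logits and the shifted target ids; modern NLP engines typically replace the GRUs with Transformer layers and add attention between encoder and decoder.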

