
Engineering, 2020, Volume 6, Issue 3. doi: 10.1016/j.eng.2020.01.007

A Survey of Accelerator Architectures for Deep Neural Networks

Yiran Chen a, Yuan Xie b, Linghao Song a, Fan Chen a, Tianqi Wang a

a Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA
b Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106-9560, USA

Received: 2019-03-31  Revised: 2019-09-06  Accepted: 2020-01-09


Abstract

Recently, due to the availability of big data and the rapid growth of computing power, artificial intelligence (AI) has regained tremendous attention and investment. Machine learning (ML) approaches have been successfully applied to solve many problems in both academia and industry. Although the explosive growth of big-data applications has fueled the development of ML, it also imposes severe challenges on conventional computer systems in terms of data processing speed and scalability. Computing platforms dedicatedly designed for AI applications have evolved from a complement to von Neumann platforms into a must-have standalone technical solution. These platforms belong to a larger category known as "domain-specific computing," which focuses on customization specific to AI. In this article, we summarize the recent advances in accelerator designs for deep neural networks (DNNs), that is, DNN accelerators. We discuss the various architectures that support DNN execution in terms of computing units, dataflow optimization, targeted network models, architectures based on emerging technologies, and accelerators for emerging applications. We also provide our visions of the future trends of AI chip design.
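To make the abstract's notion of dataflow optimization concrete, the sketch below is a minimal, illustrative Python rendering (our own, not code from the paper; the function name, loop ordering, and toy dimensions are all assumptions chosen for readability) of the convolution loop nest at the heart of most DNN accelerators. It is written in an output-stationary style, in which each partial sum remains in a local accumulator until it is fully reduced:

```python
# Illustrative sketch of a convolution loop nest, the core workload that
# DNN accelerators map onto processing-element (PE) arrays. All names and
# sizes here are hypothetical; this is not code from the surveyed designs.

def conv2d(inputs, weights, H, W, C, K, R, S):
    """Naive output-stationary convolution: each output element keeps its
    partial sum in a local accumulator until the reduction completes."""
    OH, OW = H - R + 1, W - S + 1
    outputs = [[[0.0] * OW for _ in range(OH)] for _ in range(K)]
    for k in range(K):               # output channels: often unrolled across PEs
        for oh in range(OH):         # output rows
            for ow in range(OW):     # output columns: often unrolled across PEs
                acc = 0.0            # partial sum stays local ("output-stationary")
                for c in range(C):           # input channels
                    for r in range(R):       # filter height
                        for s in range(S):   # filter width
                            acc += inputs[c][oh + r][ow + s] * weights[k][c][r][s]
                outputs[k][oh][ow] = acc
    return outputs

# Toy dimensions: one 4x4 input channel, two 3x3 filters.
H = W = 4; C = 1; K = 2; R = S = 3
inputs = [[[1.0] * W for _ in range(H)] for _ in range(C)]
weights = [[[[0.5] * S for _ in range(R)] for _ in range(C)] for _ in range(K)]
print(conv2d(inputs, weights, H, W, C, K, R, S))  # every output element = 9 * 1.0 * 0.5 = 4.5
```

A spatial accelerator essentially chooses which of these loops to unroll across its PE array and which operand (weights, inputs, or partial sums) to pin in local storage; each choice yields a different dataflow with different data-movement costs, which is why dataflow optimization figures so prominently in the designs the paper surveys.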

Figures

Fig. 1–Fig. 14

