
A Survey of Accelerator Architectures for Deep Neural Networks
Yiran Chen, Yuan Xie, Linghao Song, Fan Chen, Tianqi Tang
Engineering 2020, Vol. 6, Issue 3: 264–274.
Owing to the availability of big data and the rapid growth of computing power, artificial intelligence (AI) has recently regained tremendous attention and investment. Machine learning (ML) approaches have been successfully applied to solve many problems in both academia and industry. Although the explosion of big data applications is driving the development of ML, it also imposes severe challenges on conventional computer systems in terms of data processing speed and scalability. Computing platforms dedicated to AI applications have therefore been widely considered, ranging from a complement to von Neumann platforms to “must-have,” standalone technical solutions. These platforms, which belong to the larger category of “domain-specific computing,” focus on customization for AI. In this article, we summarize recent advances in accelerator designs for deep neural networks (DNNs), that is, DNN accelerators. We discuss various architectures that support DNN execution in terms of computing units, dataflow optimization, targeted network topologies, architectures based on emerging technologies, and accelerators for emerging applications. We also present our vision of future trends in AI chip design.
Keywords: Deep neural network / Domain-specific architecture / Accelerator