
Engineering 2016, Volume 2, Issue 2. DOI: 10.1016/J.ENG.2016.02.008

Strategies and Principles of Distributed Machine Learning on Big Data

1 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Received: 2015-12-29 Revised: 2016-05-01 Accepted: 2016-05-23 Available online: 2016-06-30


Abstract

The rise of big data has led to new demands for machine learning (ML) systems to learn complex models, with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics (such as high-dimensional latent features, intermediate representations, and decision functions) thereupon. Running ML algorithms at such scales, on a distributed cluster with tens to thousands of machines, often requires significant engineering effort, and one might fairly ask whether such engineering truly falls within the domain of ML research. Taking the view that “big” ML systems can benefit greatly from ML-rooted statistical and algorithmic insights, and that ML researchers should therefore not shy away from such systems design, we discuss a series of principles and strategies distilled from our recent efforts on industrial-scale ML solutions. These principles and strategies span a continuum from application to engineering to theoretical research and development of big ML systems and architectures, with the goal of understanding how to make them efficient, generally applicable, and supported by convergence and scaling guarantees. They concern four key questions that traditionally receive little attention in ML research: How can an ML program be distributed over a cluster? How can ML computation be bridged with inter-machine communication? How can such communication be performed? What should be communicated between machines? By exposing statistical and algorithmic characteristics that are unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to reveal how we have harnessed these principles to design and develop both high-performance distributed ML software and general-purpose ML frameworks, we present opportunities for ML researchers and practitioners to further shape and enlarge the area that lies between ML and systems.
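To make the data-parallel flavor of these questions concrete, the sketch below simulates, in a single Python process, a parameter-server-style stochastic gradient descent loop with a bounded-staleness clock of the kind discussed in the distributed-ML literature. It is a minimal illustration under stated assumptions, not the system described in this article: the names (ParameterServer, worker_step, STALENESS), the least-squares objective, and the round-robin worker schedule are all chosen for this example.

```python
# Minimal sketch (illustrative only): data-parallel SGD with a
# bounded-staleness "parameter server", simulated in one process.
# All names and constants here are assumptions for this example.
import numpy as np

STALENESS = 2      # a worker may run at most 2 clocks ahead of the slowest
NUM_WORKERS = 4
DIM = 10
LR = 0.1

class ParameterServer:
    """Holds the shared model and a per-worker iteration clock."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.clock = [0] * NUM_WORKERS

    def pull(self, worker_id):
        # Stale-synchronous rule: a reader may be at most STALENESS clocks
        # ahead of the slowest worker.
        assert self.clock[worker_id] - min(self.clock) <= STALENESS
        return self.w.copy()

    def push(self, worker_id, grad):
        self.w -= LR * grad           # apply the (possibly stale) gradient
        self.clock[worker_id] += 1    # advance this worker's clock

def worker_step(ps, worker_id, X, y):
    """One data-parallel step: pull, compute a local least-squares gradient, push."""
    w = ps.pull(worker_id)
    grad = X.T @ (X @ w - y) / len(y)
    ps.push(worker_id, grad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=DIM)
    # Each worker owns one shard of the data (data parallelism).
    shards = []
    for _ in range(NUM_WORKERS):
        X = rng.normal(size=(100, DIM))
        shards.append((X, X @ w_true))

    ps = ParameterServer(DIM)
    for _ in range(50):
        for wid, (X, y) in enumerate(shards):  # round-robin stands in for async workers
            worker_step(ps, wid, X, y)
    print("parameter error:", np.linalg.norm(ps.w - w_true))
```

In a real deployment the workers would run asynchronously on separate machines, and the staleness bound would determine when a fast worker must wait for stragglers; the round-robin loop above merely stands in for that schedule.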


