Resource Type: Journal Article (3)

Year: 2023 (1); 2018 (1); 2016 (1)

Keywords (1 result each): Artificial intelligence big data; Big model; Computing model; Data-parallelism; Distributed systems; Domain-specific storage; Fault-tolerance; Flash translation layer; Garbage collection; Heterogeneous; Internal parallelism; Machine learning; Model-parallelism; Open-channel solid-state drives (OCSSDs); Parallelism; Post-exascale; Principles; Theory


Strategies and Principles of Distributed Machine Learning on Big Data (Review)

Eric P. Xing, Qirong Ho, Pengtao Xie, Dai Wei

Engineering 2016, Volume 2, Issue 2,   Pages 179-195 doi: 10.1016/J.ENG.2016.02.008

Abstract:

The rise of big data has led to new demands for machine learning (ML) systems to learn complex models, with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics (such as high-dimensional latent features, intermediate representations, and decision functions) thereupon. In order to run ML algorithms at such scales, on a distributed cluster with tens to thousands of machines, it is often the case that significant engineering efforts are required—and one might fairly ask whether such engineering truly falls within the domain of ML research. Taking the view that “big” ML systems can benefit greatly from ML-rooted statistical and algorithmic insights—and that ML researchers should therefore not shy away from such systems design—we discuss a series of principles and strategies distilled from our recent efforts on industrial-scale ML solutions. These principles and strategies span a continuum from application, to engineering, and to theoretical research and development of big ML systems and architectures, with the goal of understanding how to make them efficient, generally applicable, and supported with convergence and scaling guarantees. They concern four key questions that traditionally receive little attention in ML research: How can an ML program be distributed over a cluster? How can ML computation be bridged with inter-machine communication? How can such communication be performed? What should be communicated between machines? By exposing underlying statistical and algorithmic characteristics unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to reveal how we have harnessed these principles to design and develop both high-performance distributed ML software as well as general-purpose ML frameworks, we present opportunities for ML researchers and practitioners to further shape and enlarge the area that lies between ML and systems.

Keywords: Machine learning; Artificial intelligence big data; Big model; Distributed systems; Principles; Theory; Data-parallelism; Model-parallelism
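
To make the data-parallelism mentioned in this abstract concrete, the sketch below simulates data-parallel gradient descent in a single process: the dataset is partitioned across workers, each worker computes a gradient on its own shard, and a central step averages those gradients to update the shared parameters (model-parallelism would instead partition the parameters themselves). This is only a minimal illustration under assumed names and settings, not the authors' software or method.

```python
# A minimal data-parallel SGD sketch on synthetic linear-regression data.
# Workers and the "parameter server" are simulated in one process; all names
# (make_data, worker_gradient, data_parallel_sgd, n_workers, lr) are
# illustrative assumptions, not an API from the paper.
import numpy as np

def make_data(n=1000, d=10, seed=0):
    # Synthetic regression problem with a known ground-truth parameter vector.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y, w_true

def worker_gradient(w, X_shard, y_shard):
    # Each worker computes the gradient of its local squared-error loss
    # using only its own shard of the data (data-parallelism).
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

def data_parallel_sgd(X, y, n_workers=4, lr=0.1, iters=200):
    # Partition the rows of the dataset across workers.
    shards = np.array_split(np.arange(len(y)), n_workers)
    w = np.zeros(X.shape[1])                 # parameters held by the "server"
    for _ in range(iters):
        grads = [worker_gradient(w, X[idx], y[idx]) for idx in shards]
        w -= lr * np.mean(grads, axis=0)     # server averages and applies updates
    return w

if __name__ == "__main__":
    X, y, w_true = make_data()
    w = data_parallel_sgd(X, y)
    print("distance to true parameters:", np.linalg.norm(w - w_true))
```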

A vision of post-exascale programming

Ji-dong ZHAI, Wen-guang CHEN

Frontiers of Information Technology & Electronic Engineering 2018, Volume 19, Issue 10,   Pages 1261-1266 doi: 10.1631/FITEE.1800442

Abstract: We discuss three significant programming challenges for future post-exascale systems: heterogeneity, parallelism, and fault tolerance.

Keywords: Computing model; Fault-tolerance; Heterogeneous; Parallelism; Post-exascale

A survey on design and application of open-channel solid-state drives (Review Article)

Junchao CHEN, Guangyan ZHANG, Junyu WEI (gyzh@tsinghua.edu.cn)

Frontiers of Information Technology & Electronic Engineering 2023, Volume 24, Issue 5,   Pages 637-658 doi: 10.1631/FITEE.2200317

Abstract: Compared with traditional solid-state drives (SSDs), open-channel SSDs (OCSSDs) expose their internal physical layout and provide a host-based flash translation layer (FTL) that allows host-side software to control internal operations such as garbage collection (GC) and input/output (I/O) scheduling. In this paper, we comprehensively survey research works built on OCSSDs in recent years. We show how they leverage the features of OCSSDs to achieve high throughput, low latency, long lifetime, strong performance isolation, and high resource utilization. We categorize these efforts into five groups based on their optimization methods: adaptive interface customizing, rich FTL co-designing, internal parallelism exploiting, rational I/O scheduling, and efficient GC processing. We discuss the strengths and weaknesses of these efforts and find that almost all of them face a dilemma between performance effectiveness and management complexity. We hope that this survey can provide fundamental knowledge to researchers who want to enter this field and further inspire new ideas for the development of OCSSDs.

Keywords: Domain-specific storage; Flash translation layer; Garbage collection; Internal parallelism; Open-channel solid-state drives (OCSSDs)
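
As a rough illustration of the host-based FTL and greedy GC that this abstract refers to, the toy model below keeps the logical-to-physical page map on the host, writes out of place, and reclaims the block holding the most invalid pages when free space runs low. It is a sketch under assumed geometry and names (ToyHostFTL, write, garbage_collect), not a real OCSSD interface or any of the surveyed designs.

```python
# A toy host-side flash translation layer (FTL): the host keeps the
# logical-to-physical page map, writes out of place, and runs a greedy
# garbage collector that erases the block with the most invalid pages.
# Block geometry, thresholds, and all names are illustrative assumptions.

class ToyHostFTL:
    def __init__(self, n_blocks=8, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.free = [(b, p) for b in range(n_blocks) for p in range(pages_per_block)]
        self.l2p = {}    # logical page number -> physical (block, page)
        self.state = {}  # physical (block, page) -> logical page, or None if invalid

    def write(self, lpn):
        # Flash pages cannot be updated in place: every write goes to a fresh
        # physical page and the superseded page (if any) is marked invalid.
        if len(self.free) <= self.pages_per_block:
            self.garbage_collect()
        old = self.l2p.get(lpn)
        if old is not None:
            self.state[old] = None
        ppn = self.free.pop(0)
        self.l2p[lpn] = ppn
        self.state[ppn] = lpn

    def garbage_collect(self):
        # Greedy victim selection: pick the written block with the most
        # invalid pages, relocate its still-valid pages, then "erase" it.
        per_block = {}
        for (b, p), lpn in self.state.items():
            per_block.setdefault(b, []).append(((b, p), lpn))
        victim = max(per_block,
                     key=lambda b: sum(lpn is None for _, lpn in per_block[b]))
        if all(lpn is not None for _, lpn in per_block[victim]):
            raise RuntimeError("device full: no invalid pages to reclaim")
        for ppn, lpn in per_block[victim]:
            del self.state[ppn]
            if lpn is not None:                      # relocate valid data
                new_ppn = self.free.pop(0)
                self.l2p[lpn] = new_ppn
                self.state[new_ppn] = lpn
        # The erased victim block's pages become free again.
        self.free.extend((victim, p) for p in range(self.pages_per_block))

if __name__ == "__main__":
    ftl = ToyHostFTL()
    for i in range(100):
        ftl.write(i % 6)   # repeatedly overwrite a small hot set of logical pages
    print("mapped logical pages:", len(ftl.l2p), "free physical pages:", len(ftl.free))
```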
