On the Data Quality and Imbalance in Machine Learning-based Design and Manufacturing—A Systematic Review

Jiarui Xie , Lijun Sun , Yaoyao Fiona Zhao

Engineering, 2025, Vol. 45, Issue 2: 111-141. DOI: 10.1016/j.eng.2024.04.024
Research Intelligent Manufacturing—Review

Abstract

Machine learning (ML) has recently enabled many modeling tasks in design, manufacturing, and condition monitoring due to its unparalleled learning ability using existing data. Data have become the limiting factor when implementing ML in industry. However, there is no systematic investigation on how data quality can be assessed and improved for ML-based design and manufacturing. The aim of this survey is to uncover the data challenges in this domain and review the techniques used to resolve them. To establish the background for the subsequent analysis, crucial data terminologies in ML-based modeling are reviewed and categorized into data acquisition, management, analysis, and utilization. Thereafter, the concepts and frameworks established to evaluate data quality and imbalance, including data quality assessment, data readiness, information quality, data biases, fairness, and diversity, are further investigated. The root causes and types of data challenges, including human factors, complex systems, complicated relationships, lack of data quality, data heterogeneity, data imbalance, and data scarcity, are identified and summarized. Methods to improve data quality and mitigate data imbalance and their applications in this domain are reviewed. This literature review focuses on two promising methods: data augmentation and active learning. The strengths, limitations, and applicability of the surveyed techniques are illustrated. The trends of data augmentation and active learning are discussed with respect to their applications, data types, and approaches. Based on this discussion, future directions for data quality improvement and data imbalance mitigation in this domain are identified.

Keywords

Machine learning / Design and manufacturing / Data quality / Data augmentation / Active learning

Cite this article

Jiarui Xie, Lijun Sun, Yaoyao Fiona Zhao. On the Data Quality and Imbalance in Machine Learning-based Design and Manufacturing—A Systematic Review. Engineering, 2025, 45(2): 111-141 DOI:10.1016/j.eng.2024.04.024


1. Introduction

Design and manufacturing are two indispensable and interrelated elements of industrial production: designs must be realized through manufacturing, and advanced designs can amplify manufacturing efficiency. Therefore, design and manufacturing are frequently investigated jointly in research and development. Modeling of design and manufacturing processes enables automated decision-making and grants predictability throughout the lifecycle of a product [1], [2], [3]. There are four main modeling paradigms: physics-based, knowledge-based (rule-based), data-driven, and hybrid modeling [4]. Physics-based methods model the relationships of interest analytically, using mathematical equations derived from the underlying physical principles. This approach allows modeling at an early stage with minimal data. However, when facing complex systems, physics-based methods often place intractable demands on computational resources and domain expertise [5]. Knowledge-based modeling is built upon rule-based expert systems, in which domain experts select variables and their associated thresholds. Knowledge-based models are usually computationally inexpensive and work well for relatively simple systems [6]. However, this approach can handle only a small number of input and output variables.

To address the challenges of complex systems, data-driven methods, including statistical methods and computational intelligence, have been extensively implemented in production modeling [7]. Historical data are utilized to train a model and learn a set of parameters that approximate the underlying relationships. Compared with conventional statistical methods, computational intelligence methods, including machine learning (ML), possess high flexibility, capacity, and learning ability [5], [8]. Hybrid modeling has also been extensively used to combine the advantages of multiple modeling strategies. For example, partial differential equations can be embedded in ML models to guide the training process with physics-based knowledge. The training data needed to achieve satisfactory predictive performance can be reduced as the underlying knowledge is learned from both the data and physical rationales [9]. Recently, ML has become one of the driving forces behind advancements in data-driven design, manufacturing, and condition monitoring, as discussed in the following paragraphs [10], [11].

Data-driven design can be categorized into design representation, modeling, and synthesis [12], [13]. The goal of data-driven design representation is to learn a set of design descriptors that represent the designs in a tractable dimensionality while avoiding information loss [14]. The aim of modeling is usually to capture the relationship between the design space and the property space using the data acquired from simulation or experiments. For instance, performance prediction models predict the properties of a design given the design shapes [15]. These models frequently serve as surrogate models to replace computationally heavy simulations in design optimization. Design synthesis includes guidelines and methods to generate designs that fulfill the design requirements. One example of data-driven synthesis is generative models that are trained to generate designs with respect to specified properties or constraints [16].

Data-driven manufacturing is a complex domain that embraces various topics, such as design for manufacturing, process monitoring, process modeling, and process control [17], [18], [19], [20]. Practical considerations such as manufacturability and cost-effectiveness must be taken into account. Starting from the design phase, design candidates are evaluated regarding their manufacturability and predicted part quality [21]. During manufacturing, sensors such as acoustic emission and thermal imaging sensors are deployed to collect in-process data [22], [23]. The collected data can be labeled by domain experts to train ML-based defect detection models. With real-time sensor data, the process parameters can be adaptively adjusted to improve the manufacturing quality and achieve real-time process control [19]. This approach requires accurate process modeling between the process parameters and the part quality.

Data-driven condition monitoring is closely related to design and manufacturing. Evaluating the service lifespans of machines and products is essential to ensuring production cost-effectiveness. ML models are trained using historical operational data collected from cyber-physical systems (CPSs) or synthetic data from simulations to detect and classify machine faults [5]. With advanced time-series forecasting, predictive maintenance models can be constructed to predict the health state and remaining useful life (RUL) of a component at any point throughout its lifecycle. Limited by the scope of the survey, the analysis and discussion in this paper will focus more on design and manufacturing while briefly highlighting relevant insights from condition monitoring.

Although many ML approaches have been developed to improve modeling capability, surveys have revealed that data challenges are the main barriers to ML deployment in production environments. Chuo et al. [24] and Xu et al. [25] reported that data scarcity, data imbalance, poor data quality, and data security are the major challenges in smart manufacturing. Ito et al. [26] and Hagemann et al. [27] indicated that a lack of data accessibility, data quality, and digital literacy are the major challenges in automated production systems. A lack of data management and data quality induces subsequent challenges in Industry 4.0, such as poor value chain integration and a lack of standards [1]. A lack of data quality in enterprise resource planning systems leads to information gaps in Logistics 4.0 [2]. Apostolidis et al. [28] highlighted that data heterogeneity, such as various data sources and data characteristics, is a major challenge in aviation maintenance. Williams et al. [29] indicated that data scarcity, poor data quality, and complex systems are the major barriers to the adoption of artificial intelligence in design and manufacturing.

To support the development of ML-based design and manufacturing, data challenges must be investigated and addressed. Existing review papers have investigated data quality assessment (DQA) and improvement in industry from multiple perspectives, including data management and data analytics [30], [31]. Various techniques that improve data quality for ML or address data challenges using ML have been surveyed regarding several topics, including data heterogeneity [32], smart manufacturing [24], [25], and smart production systems [26]. However, systematic reviews that investigate emerging data quality concepts such as data biases in design and manufacturing remain rare. Few surveys have reviewed the advanced data quality improvement techniques in this domain, especially data augmentation and active learning. A recent review by Lee et al. [33] offers detailed discussions on the existing bias mitigation methods for metamaterial design. Their review is the first to cover data biases in this domain, but it focuses primarily on metamaterial design and active learning. The contributions of our survey include the following:

(1) The data handling techniques, terminologies, and challenges in ML-based design and manufacturing are investigated.

(2) The data quality concepts, such as data quality, data readiness, and information quality (InfoQ), as well as their applications in this domain, are reviewed.

(3) The data imbalance and biases in design and manufacturing data sets are analyzed.

(4) We present and discuss data quality improvement and bias mitigation techniques, focusing on data augmentation and active learning.

The remainder of this paper is organized as follows (Fig. 1). Section 2 describes the methodology of this survey, including the research questions (RQs) and search keywords. Section 3 lays out the roadmap of relevant data terminologies in ML-based modeling to establish a background for subsequent analysis and then elaborates on the data challenges in this domain with respect to their root causes and challenge types. Section 4 surveys data quality concepts (data quality, data readiness, and InfoQ) and their applications in design and manufacturing. Section 5 investigates the data imbalance and biases in design and manufacturing data sets. Section 6 reviews and analyzes data quality improvement and bias mitigation techniques, including data augmentation and active learning. Section 7 discusses the trends, applications, and methods of the surveyed techniques. Section 8 presents the concluding remarks of this research.

2. Methodology

This section explains the methodology used in this survey to investigate data quality and the existing publications in this domain. The main RQ of this review is as follows:

How can data quality be evaluated and improved in ML-based design and manufacturing?

The main question can be decomposed into the subquestions listed in Table 1. The aim of RQ1 is to discover the most frequently employed data terminologies to establish a comprehensive foundation for subsequent analysis. RQ2 reveals the dominant data challenges that impede the deployment of ML in industry. The survey is organized around the identified data quality challenges to refine the scope and select the most active topics from various data quality aspects. RQ3 elaborates on the concepts developed to investigate the identified data quality challenges. RQ4 discusses the techniques developed to improve data quality. RQ5 compares the surveyed techniques and investigates their suitability for ML-based design and manufacturing. RQ6 analyzes the surveyed papers from different perspectives.

This survey utilized two databases, Scopus and Web of Science (WOS), to search for relevant publications, focusing on journals, conference proceedings, and book chapters in English. The search was further limited to engineering domain publications since 2014. The criteria shown in Table 2 were applied to search the publication titles, abstracts, and keywords. The algorithm keywords restricted the survey to ML-based modeling. Various application keywords related to design and manufacturing were included to perform a comprehensive review. Starting with data quality, the data terminologies were gradually expanded to include additional keywords. Data quality, data readiness, bias, fairness and its variants, and diversity and its variants were chosen to search for the first portion of publications. An asterisk (*) following a keyword means that the search includes all the words that start with the keyword. For example, “divers*” can include keywords such as diverse, diversity, diversify, and diversification. Data augmentation, adaptive sampling, and active learning were selected as the data quality improvement techniques for further review.
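The wildcard convention described above can be illustrated with a minimal Python sketch. It assumes that a trailing asterisk means "any word that starts with this stem"; the helper names and the toy titles are hypothetical, and the actual matching rules of Scopus and WOS may differ in detail.

```python
import re

def wildcard_to_regex(keyword: str) -> re.Pattern:
    """Turn a keyword with an optional trailing '*' into a word-level regex."""
    if keyword.endswith("*"):
        # 'divers*' -> any word that begins with 'divers'
        stem = re.escape(keyword[:-1])
        pattern = rf"\b{stem}\w*"
    else:
        # No wildcard: match the keyword as a whole word only
        pattern = rf"\b{re.escape(keyword)}\b"
    return re.compile(pattern, re.IGNORECASE)

def matches(keyword: str, text: str) -> bool:
    """Return True if the (possibly wildcarded) keyword occurs in the text."""
    return wildcard_to_regex(keyword).search(text) is not None

# Toy titles invented for illustration only
titles = [
    "Diversity-aware sampling for additive manufacturing",
    "Diversified design generation with generative models",
    "Tool wear prediction using vibration signals",
]
hits = [t for t in titles if matches("divers*", t)]
print(hits)
```

Only the first two toy titles are returned, because the third contains no word starting with "divers".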

The publication selection process of this survey is shown in Fig. 2. Three searches were conducted in Scopus and WOS. The papers from the two databases were merged, and duplicates were removed. Thereafter, screening was conducted to filter out the irrelevant publications according to the following rules: ➀ ML methods are implemented; ➁ design papers are restricted to mechanical and material design; ➂ manufacturing, industry, maintenance, and production papers are related to mechanical products; and ➃ for data augmentation, adaptive sampling, and active learning, the papers must be research papers instead of review papers. The screening of the data quality publications considerably reduced the number of papers from 1482 to 112 because many papers were irrelevant to mechanical design and manufacturing; for example, they concerned construction or software. After papers with no access were filtered out, 94 publications related to data quality and relevant keywords remained. Due to the complexity and large number of these papers, they are reviewed and summarized in Sections 3, 4, and 5. Publications related to data augmentation, adaptive sampling, and active learning are discussed in detail in Section 6. Note that there was some overlap between the three searches.
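The merge, de-duplication, and screening steps described above could be sketched as follows. This is an illustrative sketch only: the record fields (`title`, `doi`, `uses_ml`, `is_review`), the matching by DOI or normalized title, and the toy records are all assumptions, not the procedure actually used in this survey.

```python
def normalize_title(title: str) -> str:
    """Lowercase a title and drop everything except letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

def merge_records(*sources):
    """Merge record lists, skipping any record whose DOI or title was already seen."""
    seen_keys = set()
    merged = []
    for records in sources:
        for rec in records:
            keys = {normalize_title(rec["title"])}
            doi = (rec.get("doi") or "").strip().lower()
            if doi:
                keys.add("doi:" + doi)
            if keys & seen_keys:
                continue  # duplicate of an earlier record
            seen_keys |= keys
            merged.append(rec)
    return merged

def screen(records):
    """Keep ML-based research papers (hypothetical flags for rules 1 and 4)."""
    return [r for r in records
            if r.get("uses_ml", False) and not r.get("is_review", False)]

# Toy records invented for illustration; the DOIs are placeholders
scopus = [
    {"title": "Toy Paper A: Active Learning", "doi": "10.0000/toy-a", "uses_ml": True},
    {"title": "Toy Paper B: Data Augmentation", "doi": "10.0000/toy-b", "uses_ml": True},
]
wos = [
    {"title": "Toy paper A - active learning", "doi": "10.0000/TOY-A", "uses_ml": True},
    {"title": "Toy Paper C: A Review", "doi": "", "uses_ml": True, "is_review": True},
    {"title": "Toy Paper D: Manual Inspection", "uses_ml": False},
]

merged = merge_records(scopus, wos)
screened = screen(merged)
print(len(merged), len(screened))
```

Matching on either key is a deliberate choice here, because the same paper may be exported with a DOI from one database and without one from the other.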

3. Background

The study of data quality spans multiple domains with different goals and perspectives. Researchers from different domains often have different understandings of many data terminologies. Some terminologies are used interchangeably, and the same terminology might be used differently across domains. Thus, it is crucial to establish a common understanding of relevant terminologies in ML-based design and manufacturing. This section organizes the frequently used data terminologies for ML-based modeling (Fig. 3). Thereafter, the challenges in ML-based design and manufacturing are discussed.

3.1. Data terminologies

Fig. 3 presents the data terminologies in the context of ML-based modeling, primarily focusing on design and manufacturing. The terminologies are arranged according to the four major phases in ML-based modeling: data acquisition, management, analysis, and utilization. These four phases usually occur in sequence, as indicated by the colored arrows. Data handling techniques can be implemented at different times to improve the data quality and modeling performance, as indicated by the branches along the colored arrows. Closely related techniques are listed together, indicating that they have significant overlap or are often used interchangeably. Some techniques are placed in the middle of two phases, indicating that they can be implemented at either phase. In addition, the blue blocks present various definitions that help describe data, concepts that help investigate the quality of data, and frameworks that facilitate the management of data. They occur at different phases but not at a particular time point or in sequence. Similarly, closely related terminologies are listed in the same block. Some blocks are placed in the middle of two phases, indicating that they can occur in either phase. To answer RQ1, the following paragraphs introduce the data terminologies presented in Fig. 3.

At the beginning of ML-based modeling, data acquisition [34] is conducted to acquire data from one or more physical or digital data sources. Typical data sources in this domain include but are not limited to design documents (e.g., design specifications, 2-dimensional sketches, and computer-aided design files), measuring and sensing equipment (e.g., cameras, thermometers, vibration sensors, and acoustic emission sensors), numerical simulation (e.g., finite element analysis), production and process planning, enterprise resource planning, manufacturing process parameters, and operational logs. Data collection and data generation often refer to data acquisition from physical sources and digital sources, respectively. Data heterogeneity arises from data acquisition because different data types, formats, structures, and sources coexist in design and manufacturing. Data synchronization is often performed in distributed manufacturing systems to temporally coordinate multiple machines as data sources [35]. Metadata that describe the essential characteristics of a data set should be determined after data acquisition and will guide subsequent data management [36]. Data provenance is a type of metadata that records the data set history, including its origin and transformations [37].

Data management involves storing, organizing, and maintaining data using scientific practices to assure data quality. Data quality [38], data readiness [39], and InfoQ [40], discussed in detail in Section 4, are different methodologies that help characterize the quality of a data set. Datasheets [41], data statements [42], fact sheets [43], and data set nutrition [44] are technologies that evaluate data sets according to data quality or data readiness. Data governance dictates the policies, standards, and guidelines to manage data as assets within an organization [45]. Data governance, especially in the manufacturing industry, is challenging yet crucial to deploy and maintain at the corporate level [46]. In a manufacturing company, data are collected from different groups, including suppliers, employers, and customers. Data privacy and security risks must be identified and addressed to protect stakeholders [47]. The volume of data can be massive because they are collected across different processes, locations, and time points. Thus, metadata management is another indispensable element in data governance to archive data sets and facilitate data mining [47]. Metadata management should account for both the data set origin and transformation history to facilitate troubleshooting. As a multidisciplinary activity, data governance also provides guidelines on other processes, such as data quality management and data integration. In manufacturing companies, special agents and guiding principles are deployed and adopted to fulfill the above data governance requirements. For example, data stewards oversee data management in an organization and ensure compliance with data governance rules [48]. The FAIR guiding principles are data stewardship guidelines for making data findable, accessible, interoperable, and reusable to promote data sharing and collaboration across multiple domains [36]. 
After the data are acquired, DQA and measurement can be conducted to evaluate the data quality [30]. Moreover, data cleaning and data wrangling are implemented to improve data quality and transform data into usable forms [49]. The cleaned and transformed data will be structured into a database for efficient data management. Both data fusion and data integration in ML-based modeling refer to combining multiple data sets to enrich the information base and fill the knowledge gap [50].

In this context, data analysis is a set of processes in which insights are extracted, models are built, and predictions are made from data sets via methods such as statistical analysis and ML. Data imbalance [51] characterizes the imbalance among different groups within a data set due to unequal distribution. Biases in data acquisition and preparation are the root causes of data imbalance and other data quality concerns [52]. Data imbalance and biases, if not resolved, compromise the performance of ML models. Bias detection and measurement techniques, such as data fairness, diversity, and coverage, have been proposed to evaluate data biases [52]. In the data analysis phase, data preparation and preprocessing techniques ranging from parametrization to feature learning can help facilitate subsequent ML tasks, as discussed in Section 6.1.1 [53]. Data registration, one of the methods used to address data heterogeneity, involves spatial and temporal alignment of different data sources within a data set [54]. Data augmentation can help mitigate data imbalance by partially altering existing samples or generating new synthetic samples to increase the sizes of minority groups [55]. Data augmentation could be implemented before modeling to improve model performance or after modeling to generate synthetic data using the model. Model cards help document trained models to promote reliability and transparency for data management [56].
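As a rough illustration of augmentation by partially altering existing samples, the sketch below enlarges each minority group with noisy copies of its own samples until all groups reach the size of the largest one. Gaussian jitter is only one simple choice swapped in for illustration; the function name, the noise level, and the toy data are assumptions rather than a method taken from the surveyed papers.

```python
import random
from collections import Counter

def augment_minority(samples, labels, noise_std=0.05, seed=0):
    """Balance groups by appending jittered copies of minority samples."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest group
    new_samples, new_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            base = rng.choice(pool)
            # Partially alter an existing sample with small Gaussian noise
            new_samples.append([x + rng.gauss(0.0, noise_std) for x in base])
            new_labels.append(label)
    return new_samples, new_labels

# Toy data: 8 normal examples and 2 defect examples with two features each
samples = [[1.0, 2.0] for _ in range(8)] + [[5.0, 6.0] for _ in range(2)]
labels = ["normal"] * 8 + ["defect"] * 2

aug_samples, aug_labels = augment_minority(samples, labels)
print(Counter(aug_labels))
```

The original samples are left unchanged and the synthetic ones are appended after them, which keeps it easy to trace which records are real.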

Data utilization describes the processes that generate insights, make decisions, and improve production using the data and models. Decision fusion combines the decisions of multiple models for more informed final decisions [57]. Techniques such as data synthesis, data compression, and data reconstruction can be implemented using trained models. Synthetic data generated by the trained models should be properly stored in the database and managed under data management guidelines. Based on the model performance, adaptive sampling and active learning can efficiently guide additional data acquisition to improve model performance [58].
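The idea of letting model performance guide additional data acquisition can be sketched with uncertainty sampling, one common active learning query strategy. The sketch assumes a binary classifier that outputs a defect probability for each unlabeled sample; the function name and the probabilities are hypothetical, and the surveyed papers use a variety of other strategies.

```python
def select_queries(probabilities, batch_size):
    """Return indices of the pool samples the model is least certain about.

    For a binary classifier, a predicted probability close to 0.5 means
    the model cannot decide between the two classes.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]

# Hypothetical predicted defect probabilities for five unlabeled parts
pool_probs = [0.97, 0.52, 0.10, 0.45, 0.80]

# Ask the domain expert to label only the two most ambiguous parts
queries = select_queries(pool_probs, batch_size=2)
print(queries)
```

In a full loop, the selected samples would be labeled, added to the training set, and the model retrained before the next round of selection.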

3.2. Data challenges

Combining the challenges summarized by existing surveys and our observations, the primary data challenges in this domain are listed in this subsection to answer RQ2 (Fig. 4). In production systems, human factors propagate into complex systems with complicated relationships, constituting the root causes of data challenges. The most prominent challenges for ML-based design and manufacturing include a lack of data quality, data heterogeneity, data scarcity, and data imbalance.

3.2.1. Root causes

(1) Human factors play a crucial role in production and digital transformation [59], [60]. Errors induced by human operations are frequently observed in industry [27]. The evolving digital environments that continue to embrace advanced information technologies impose a high demand on the digital literacies of industrial employees. However, there is currently a lack of digital literacy education in both higher education and personal development programs [61]. It is also difficult to coordinate among different groups with different levels of digital literacy, including managers, engineers, technicians, and operators.

(2) Complex systems arise because modern manufacturing industries usually involve multiple production activities, including design, manufacturing, logistics, and maintenance. These activities occur at different times and scales, involving various human groups. To capture the digital and physical characteristics of different activities, data are often acquired at different frequencies, with different precisions and formats, contributing to data heterogeneity and imbalance [28].

(3) Complicated relationships commonly exist in design and manufacturing. Some production activities feature high variability, such as changing operating conditions of gas turbines [7] and multiple candidate materials for additive manufacturing (AM) [19]. Others possess complex underlying relationships that require models with high capacity and powerful causal factor disentangling capability. For instance, multiple faults might occur concurrently, which can yield overlapping fault signatures that are difficult to isolate and classify [5]. In addition, different production activities might affect or interact with each other. For example, design for manufacturing is a special design process that optimizes the design to facilitate subsequent manufacturing processes.

3.2.2. Types of challenges

(1) A lack of data quality has become a major concern in industrial digitalization [27]. Despite the rapidly increasing amount of data generated from digitalization, DQA is usually not well incorporated into production systems [27]. For example, missing values and outliers are common in industrial data and have various root causes, such as human error and sensor faults.

(2) Data heterogeneity refers to the inconsistency in a data set or across multiple data sets [32]. Data heterogeneity in industry is a major challenge due to the variety of information from complex systems [28]. For example, data acquired from different data sources can be stored in different types, formats, and structures at different time points. In addition, it is challenging but crucial to align data across different systems, such as design, manufacturing, and operation data. For example, the prediction of AM part quality involves the design parameters, manufacturing process parameters, and in-process monitoring data, which often have different representations.

(3) Data scarcity is dictated by the nature of production: high variability and low throughput. The increasing customizability and complexity of modern products have led to numerous variables contributing to product quality, increasing the dimensionality of the modeling problem [62]. A data set becomes increasingly sparse with increasing dimensionality. In addition, the more complex the product is, the more time and resources it requires to perform experiments or to complete simulations with high fidelity. A special case of data scarcity is the lack of labeled data due to high labeling costs because labeling often requires domain expertise in design and manufacturing.

(4) Data imbalance describes the phenomenon in which some groups are underrepresented in the data set. Data imbalance is common in design and manufacturing data sets because human-defined design data sets naturally exhibit bias [63], and normal examples in manufacturing or operational data sets often significantly outnumber examples with defects or faults [64].
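Two of the challenge types listed above, a lack of data quality (missing values and outliers) and data imbalance, can be screened with a few lines of Python. The sketch below is illustrative only: the interquartile-range rule with a factor of 1.5, the majority-to-minority ratio, and the toy sensor readings are assumptions chosen for simplicity.

```python
import statistics
from collections import Counter

def missing_rate(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing entries."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in present if v < low or v > high]

def imbalance_ratio(labels):
    """Size of the largest group divided by the size of the smallest group."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy temperature readings with one missing value and one sensor spike
temps = [200.1, 199.8, None, 200.4, 200.0, 350.0, 199.9, 200.2]
# Toy inspection labels: 95 normal parts and 5 defective parts
labels = ["ok"] * 95 + ["defect"] * 5

print(missing_rate(temps), iqr_outliers(temps), imbalance_ratio(labels))
```

A ratio far above 1 signals that defect examples are heavily outnumbered, which is the situation the mitigation methods in Section 6 are meant to address.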

This survey investigates the challenges regarding data quality in ML-based design and manufacturing. The concepts established to investigate data quality and data imbalance are discussed in Sections 4 and 5, respectively. No established concept dedicated to data scarcity was identified, but the mitigation methods in Section 6 also help resolve data scarcity. The analysis and mitigation of challenges related to complex systems will be investigated in future studies, as they involve knowledge from multiple domains, including engineering, data science, and social science.

4. Investigation of data quality

This section investigates the concepts of data quality, data readiness, and information quality to answer RQ3. The frameworks and pipelines used to improve data quality are introduced and discussed.

4.1. Data quality

Data quality is an umbrella term that characterizes many aspects of data through their lifecycle activities. One of the earliest works that proposed a DQA methodology in production is Wang [65], in which data quality is ensured through four steps: defining, measuring, analyzing, and improving. In Wang [65], DQA metrics were classified into four categories, which were further split into 15 dimensions (Table 3). These metrics provide a comprehensive evaluation of data quality, considering data as an information product. However, the measurement of 15 metrics is time-consuming, and some metrics are not applicable in many cases. Thus, Askham et al. [38] condensed the 15 dimensions into the six primary dimensions listed in Table 3, which were found to be most relevant in the industrial context and have become widely recognized. Accuracy, timeliness, consistency, and completeness from Wang [65] were kept in Askham et al. [38]. Uniqueness and validity were added to evaluate two frequently observed problems: duplication and nonconformities. Many metrics related to specific use cases in Wang [65] were removed in Askham et al. [38], such as reputation, access, and security, which concern human interaction. Although widely adopted, these two evaluation methodologies do not account for the data characteristics needed to improve ML performance, including data imbalance and scarcity.
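To make three of the six primary dimensions concrete, the following sketch scores completeness, uniqueness, and validity for a small table of process records. The ratio-style formulas, the field names, and the validity rules are simplifying assumptions for illustration and are not the formal definitions given in [38].

```python
def completeness(records, fields):
    """Share of expected cells that actually hold a value."""
    total = len(records) * len(fields)
    filled = sum(rec.get(f) is not None for rec in records for f in fields)
    return filled / total

def uniqueness(records, key_fields):
    """Share of records that are distinct with respect to the key fields."""
    keys = [tuple(rec.get(f) for f in key_fields) for rec in records]
    return len(set(keys)) / len(keys)

def validity(records, rules):
    """Share of non-missing values that satisfy their field's rule."""
    checks = [rule(rec[f])
              for rec in records
              for f, rule in rules.items()
              if rec.get(f) is not None]
    return sum(checks) / len(checks)

# Toy additive manufacturing records (hypothetical fields and values)
records = [
    {"part_id": "P1", "laser_power_w": 200.0, "layer_mm": 0.03},
    {"part_id": "P2", "laser_power_w": None, "layer_mm": 0.03},
    {"part_id": "P2", "laser_power_w": 210.0, "layer_mm": -0.01},
    {"part_id": "P3", "laser_power_w": 195.0, "layer_mm": 0.04},
]
fields = ["part_id", "laser_power_w", "layer_mm"]
rules = {
    "laser_power_w": lambda v: 0 < v <= 1000,  # assumed plausible range
    "layer_mm": lambda v: v > 0,               # thickness must be positive
}

scores = {
    "completeness": completeness(records, fields),
    "uniqueness": uniqueness(records, ["part_id"]),
    "validity": validity(records, rules),
}
print(scores)
```

Here one missing power value lowers completeness, the repeated part identifier lowers uniqueness, and the negative layer thickness lowers validity.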

DQA paradigms with different concentrations have been proposed for industrial data management [30], [31]. A major paradigm builds generic and adaptable DQA tools for various industrial use cases, adopting the six primary dimensions [38]. For instance, Günther et al. [66] developed a DQA tool to manage manufacturing data quality for small to medium-sized enterprises. Guidance is provided to select suitable metrics from 20 available metrics based on the use case, context, domain knowledge, and data. Wiemer et al. [67] proposed a holistic DQA approach named the V-model, which ensures data quality from data acquisition to results presentation for a CPS. Guidelines for selecting suitable data quality metrics were provided based on the system, content, presentation, and use case of the data. Another paradigm develops DQA tools for a specific and emerging use case with specialized metrics [26], [73]. For instance, Schelter et al. [69], funded by Amazon Research, proposed a DQA methodology for large-scale data. The authors defined four large-scale data characteristics (declarativity, flexibility, scalability, and supporting growing data size), which are incorporated into their big data DQA tool. Byabazaire et al. [70] presented an end-to-end DQA framework for the Internet of Things using the concept of trust. This framework assesses data quality across different stages of the data lifecycle, merges those scores into one trust score, and allows the weights and metrics of the scores to be customized. The trust score is a dynamic metric that can be improved over time through metadata assessment, data preprocessing and analytics, and model monitoring. Inspired by DQA paradigms, many data management pipelines with DQA functionalities have been implemented in ML-based design and manufacturing research projects since 2019 (Table 4 [29], [35], [76], [77], [78], [79], [80], [81], [82], [66], [67], [68], [69], [70]). 
These data quality enhancement pipelines were found to target general data quality rather than data quality improvement for data-driven modeling.
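The trust score described above for the framework of Byabazaire et al. [70] merges per-stage quality scores with customizable weights. A minimal sketch of such a merge is given below, assuming a simple weighted average; the aggregation rule, the stage names, and the numbers are assumptions and may differ from the formulation used in [70].

```python
def trust_score(stage_scores, weights):
    """Merge per-stage quality scores into one score by weighted average."""
    if set(stage_scores) != set(weights):
        raise ValueError("stage_scores and weights must cover the same stages")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(stage_scores[s] * weights[s] for s in stage_scores)
    return weighted / total_weight

# Hypothetical quality scores in [0, 1] for three lifecycle stages
stage_scores = {"acquisition": 0.9, "preprocessing": 0.8, "analytics": 0.6}
# Hypothetical weights: the analytics stage counts twice as much
weights = {"acquisition": 1.0, "preprocessing": 1.0, "analytics": 2.0}

score = trust_score(stage_scores, weights)
print(round(score, 3))
```

Changing the weights lets a user emphasize the lifecycle stages that matter most for a given application, which mirrors the customizability described in the text.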

4.2. Data readiness and InfoQ

As defined by Lawrence [39], data readiness describes the state of a data set with respect to its accessibility, validity, and utility. Accessibility checks the prerequisites, including existence, access, licensing, and ethical issues. Validity evaluates the quality of data, such as missing values and outliers. The utility of a data set is investigated to determine whether the data set is suitable and sufficient for a target data-driven task. Data readiness is different from data quality because data readiness is not for general data management purposes. Instead, the readiness of a data set is evaluated with respect to a specific data-driven task. This special characteristic makes data readiness a suitable methodology for evaluating the quality and sufficiency of a data set for ML-based modeling tasks.

InfoQ is another concept used to evaluate the quality and value of a data set. Many InfoQ definitions have been established over the past 40 years and mostly overlap with data quality and data readiness [83]. Kenett and Shmueli [40] proposed a unique definition of InfoQ that measures the potential of a data set ($X$) for achieving a given analysis goal ($g$) by employing a data analysis method ($f$) and considering a given utility function ($U$). This InfoQ definition is formulated as $\mathrm{InfoQ}(g, X, f, U) = U(f(X \mid g))$. The authors argue that most data quality concepts only account for $U(X)$ or $U(X \mid g)$, which disregards the analysis method and the goal in a data analysis task. Eight dimensions are proposed to evaluate InfoQ, including data resolution, generalizability, and operationalizability [40]. Following the eight dimensions, frameworks have been proposed based on InfoQ to improve data quality and data management in academia and industry [40], [84], [85]. These frameworks embrace highly empirical and question-based evaluations instead of statistical and numerical metrics.

Data readiness and InfoQ also inspired the development of several data management tools. The data set nutrition label provides a comprehensive overview of the data set with multiple modules [44]. The data accessibility is described by the metadata and provenance modules; the validity is evaluated by the variables, statistics, and pair plot modules. Unlike DQA tools, data nutrition employs probabilistic models and ground truth correlation modules that offer more insights into data utility [44]. This approach informs the data users of the potential modeling tasks for which the data set is suitable. Yang et al. [86] proposed ranking fact, a tool that utilizes nutrition labels to evaluate data readiness for ranking tasks. This tool trains ranking models and compares their performances based on the imported data set, whose readiness is evaluated regarding nutrition labels [44], fairness, and diversity [55]. Stoyanovich and Howe [87] presented their initiative to develop a tool that semiautomatically generates nutrition labels for data sets and models. Chmielinski et al. [88] presented the 2nd Gen data set nutrition label, which is equipped with context-specific functionalities and training data bias mitigation capabilities. Sun et al. [89] developed MithraLabel, which automatically generates nutrition labels to understand the readiness of a data set for a certain type of ML task. MithraLabel focuses on four types of ML tasks (ranking, classification, prediction, and clustering) and three data characteristics (representativeness of minorities, bias, and correctness). The above-reviewed data readiness tools can improve data sets for a specific data-driven modeling task. Nonetheless, these methods are only compatible with tabular data.

5. Investigation of data imbalance

Data imbalance is an evolving challenge in ML-based modeling. This section analyzes data imbalance based on the concept of biases in design and manufacturing and then explores the methods used to measure representation bias to answer RQ3.

5.1. Biases in ML-based design and manufacturing

Data imbalance describes the phenomenon in which some groups are underrepresented in the data set due to biased data acquisition or a skewed underlying distribution [52]. If not mitigated, biases in the data set will propagate to the ML models and eventually yield unsatisfactory predictions [55]. There are many types of data biases:

(1) Measurement bias arises from the way features are measured and recorded. For example, while collecting data using a CPS, it is inevitable that measurements of different features have different levels of precision. This might result in models with different levels of sensitivity for different groups.

(2) Omitted variable bias occurs when important features are not included in the data set. For example, design parametrization might fail to include some important design parameters and thus induces bias during data set generation. Some possible design variants will be underrepresented in such a data set.

(3) Aggregation bias results from the different underlying distributions of individual groups. When aggregating them into one data set, the aggregated distribution can be considerably different from the individual distributions. For instance, if several data sets of different designs are merged into one data set, the modeling complexity is significantly increased.

(4) Representation bias is the most common root cause of data imbalance and can be categorized into selection bias and underlying distribution skew.

Selection bias stems from how sampling is conducted during the data acquisition processes. The data set will be biased if data acquisition is intentionally inclined to certain classes or regions in the feature space, leaving the remainder of the population less likely to be sampled.

Underlying distribution skew describes the phenomenon in which the population inherently follows a skewed distribution. Consequently, the data set will still be unbalanced even if selection bias is mitigated.

Although these biases are related to data, the key to mitigating some of them is not about the data set itself. The mitigation methods of measurement and omitted variable biases heavily rely on domain knowledge. Aggregation bias can be addressed with ML models that can segregate the underlying distributions of individual groups to facilitate the learning processes. This review concentrates on representation bias because its mitigation is highly related to the data set itself [52], [55], [90].

Representation bias commonly exists in design and manufacturing data sets. In design data sets, the most salient representation bias resides in the property space, where samples are passively populated [63]. Design of experiment (DOE) conducted on the design space ensures the generation of diverse design shapes. Nonetheless, it results in a skewed underlying distribution in the property space due to the nonlinear relationship between the design shape and the design properties. Consequently, design data sets are commonly unbalanced in property space with intensively searched regions, voids in the sampled regions, and unexplored regions [91]. Thus, the ML model may focus on intensively sampled property regions while overlooking underrepresented properties. Representation bias in manufacturing data sets is usually more prominent because the cost of manufacturing data collection is significantly greater than that of design. Unlike design data that can be generated via simulation, manufacturing data sets are mostly collected from real-life experiments or production. Such data sets are subject to sampling bias because they are generated according to production plans instead of DOE. Moreover, the underlying distributions of manufacturing data sets are naturally skewed toward normal examples because reliable manufacturing processes reduce the probability of defects and faults.

In the engineering domain, biases are not necessarily always detrimental or undesirable. Whether to mitigate or encourage an existing bias depends on the downstream modeling tasks [33]. For instance, the design performance is proportional to the porosity if the data-driven design task is to obtain a design with high cooling efficiency. Consequently, data acquisition should be encouraged to bias toward high-porosity designs instead of equally sampling low- and high-porosity designs. Another example from condition monitoring is that the sampling of near-failure examples is usually more critical than that of very healthy examples. The boundaries between healthy classes and fault classes are surrounded by failure and near-failure examples; thus, collecting very healthy examples offers less knowledge than collecting near-failure examples.

Different sampling methods can help control the bias in a data set. Conventionally, random sampling methods such as randomized sampling and Latin hypercube sampling are implemented to avoid embedding selection bias during data acquisition. Nonrandom sampling is subjective and thus usually induces bias in the data set. For example, grid sampling places samples only at grid intersections. Although random sampling helps mitigate bias, it is not the most efficient method for improving ML model performance. Deterministic sampling guided by space-filling experimental design can yield better feature space coverage and thus reveal more underlying knowledge of the system [33], [92]. Recently developed adaptive sampling and active learning methods are also nonrandom but highly efficient for enhancing ML performance. These methods might induce biases in the data set, but those biases can be desirable, as discussed above.
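As an illustration of the random sampling schemes mentioned above, the following is a minimal sketch of Latin hypercube sampling on the unit hypercube. The function name `latin_hypercube` and its parameters are illustrative assumptions and are not taken from any of the surveyed works.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=0):
    """Return n_samples points in [0, 1)^n_dims (illustrative helper).

    Each dimension is cut into n_samples equal strata, and every stratum
    is used exactly once per dimension.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # Draw one random point inside each stratum [k/n, (k+1)/n).
        column = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # Shuffle so that the strata are paired randomly across dimensions.
        rng.shuffle(column)
        columns.append(column)
    # Transpose the per-dimension columns into a list of points.
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


if __name__ == "__main__":
    for point in latin_hypercube(5, 2):
        print(point)
```

Compared with plain uniform sampling, this stratification guarantees that every marginal interval of each feature is visited once, which is why it is a common default for design of experiments on a bounded design space.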

5.2. Measurements of representation bias

The approaches to measuring representation bias can be categorized into representation rate and data coverage [55]. The representation rate, which is defined in Ref. [93], normally applies to categorical features, based on which the entire data set can be categorized into several subgroups. The representation rate measures the probability of a randomly selected sample belonging to each subgroup. It is also applicable to continuous-valued features once they are sliced [94]. Data coverage measures the space covered by the existing data set in the designated feature space. The notions of data coverage have been defined with respect to different contexts and data types [95], [96], [97].
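The two measurements above can be sketched as follows. This is a minimal example under stated assumptions: the helper names are hypothetical, continuous features are sliced with user-supplied bin edges, and data coverage is approximated as the share of occupied cells of a regular grid on the unit hypercube, which is only one of several possible coverage notions.

```python
import bisect
from collections import Counter


def representation_rate(groups):
    """Fraction of samples that belong to each subgroup."""
    counts = Counter(groups)
    return {group: count / len(groups) for group, count in counts.items()}


def slice_continuous(values, edges):
    """Map each continuous value to a slice index, given sorted inner edges."""
    return [bisect.bisect_right(edges, v) for v in values]


def grid_coverage(points, bins):
    """Share of grid cells in [0, 1]^d that contain at least one sample.

    Assumption: features are already scaled to [0, 1].
    """
    dims = len(points[0])
    occupied = {
        tuple(min(int(x * bins), bins - 1) for x in point) for point in points
    }
    return len(occupied) / bins ** dims


if __name__ == "__main__":
    labels = ["normal"] * 9 + ["defect"]
    print(representation_rate(labels))
    print(representation_rate(slice_continuous([0.1, 0.2, 0.9], [0.5])))
    print(grid_coverage([(0.1, 0.1), (0.15, 0.12), (0.9, 0.9)], bins=2))
```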

The concepts of fairness and diversity have also been proposed to evaluate representation bias [55]. Fairness metrics measure the lack of bias in a data set or model [90]. Although many fairness metrics have been established, most of them are designed for classification algorithms [98] and are well suited for social science and policy-making modeling [52], [99]. Fairness metrics have not been implemented to evaluate design and manufacturing data sets because fairness in social sciences cannot be directly adopted in this domain. There has not been a fairness metric developed for ML-based design or manufacturing. Diversity metrics, which describe the richness of varieties, are compatible with various use cases. As defined by Drosou et al. [100], diversity metrics can be classified into distance-based, coverage-based, and novelty-based measures. Distance-based diversity defines the similarity measurements of a data set using pairwise distance. Like data coverage, coverage-based diversity metrics measure how well a designated space is covered by the data set. Novelty-based diversity captures how different a new sample is relative to the existing samples to reduce redundancy. Diversity metrics are extensively utilized to sample from the existing data set or collect new examples to reduce representation bias in design and manufacturing data sets [63], [91], [101]. However, the existing methods for measuring representation bias are mostly limited to tabular data. The notions of data coverage, fairness, and diversity have not been defined for image, time-series, and spatiotemporal data.
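To make the distance-based and novelty-based notions concrete, the sketch below computes a mean pairwise Euclidean distance for a data set and a nearest-neighbor novelty score for a candidate sample. These particular formulas and the function names are assumptions chosen for illustration; the cited works define several alternative measures.

```python
import itertools
import math


def mean_pairwise_distance(points):
    """Distance-based diversity: average Euclidean distance over all pairs."""
    pairs = list(itertools.combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def novelty(candidate, existing):
    """Novelty-based diversity: distance to the closest existing sample."""
    return min(math.dist(candidate, p) for p in existing)


def pick_most_novel(candidates, existing):
    """Choose the candidate that is least redundant with the existing data."""
    return max(candidates, key=lambda c: novelty(c, existing))


if __name__ == "__main__":
    data = [(0.0, 0.0), (3.0, 4.0)]
    print(mean_pairwise_distance(data))
    print(novelty((0.0, 1.0), data))
    print(pick_most_novel([(0.1, 0.0), (10.0, 10.0)], data))
```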

6. Data quality improvement and bias mitigation methods

Through the above review, the significance of data quality is illustrated, and several evaluation paradigms are introduced. This section investigates data quality improvement techniques and representation bias mitigation methods to answer RQ4. There are two contexts for data improvement methods: existing data only and additional data acquisition. When additional data acquisition is not allowed, data handling methods are applied to the existing data, including data cleaning, feature extraction, data preprocessing, feature selection, feature learning, and data augmentation. If additional data acquisition is allowed, adaptive sampling or active learning methods are used to efficiently acquire new data. Data augmentation and active learning in ML-based design and manufacturing are reviewed in detail in this section.

6.1. Existing data only

With the same raw data set, the performance of an ML model could vary significantly depending on the data preparation and preprocessing techniques applied. According to Xie et al. [7], data preparation extracts useful features from the raw data, while data preprocessing transforms the data set to improve ML learning performance.

6.1.1. Data quality improvement

The data preparation and preprocessing techniques, ranging from data cleaning to data augmentation (data analysis phase in Fig. 3), are overviewed in this subsection. Data cleaning is a common operation for ML projects to improve data quality with respect to completeness, accuracy, uniqueness, and validity [102]. For design and manufacturing data sets, data cleaning is indispensable because data quality has become a major hindrance to Industry 4.0 [1], [2]. Table 5 provides an overview of the major data cleaning targets, including their descriptions and common methods [102], [103]. Note that there could be some overlap between the data cleaning targets (e.g., the removal of outliers and irrelevant samples). There could also be some overlap between data cleaning and other data quality improvement techniques (e.g., the removal of irrelevant features and feature selection).

Parametrization is the process of defining the parameters or variables to represent a type of design for subsequent analysis, such as ML and design optimization [104]. Different design variants are generated by varying the design parameters. Traditionally, design parametrization is manually defined using domain knowledge, which induces two common issues: overspecified and underspecified design parametrization [105]. Overspecified parametrization yields more variables than are needed to represent the design, introducing redundancy to the system. It is likely to increase the computational cost and compromise the performance of subsequent modeling tasks. Underspecified parametrization yields too few variables and thus is not capable of representing all design variations. Underspecification induces missing knowledge when generating a design data set. The ML models trained on this data set will be biased toward representable designs, which leads to compromised prediction accuracy and generative design diversity. Recent works have demonstrated that design parametrization can be learned using ML to avoid the above issues [105].

Feature extraction is conducted because data acquired from sensors are usually in forms that are difficult to analyze statistically or to use directly for training ML models. Therefore, statistical and geometrical features are usually extracted from raw data to facilitate subsequent analysis [106]. For example, computer vision techniques such as texture analysis and edge detection can extract AM melt pool shape features [107], [108], and time-frequency analysis techniques such as spectrograms and wavelet transforms can extract features that characterize fault signatures [109], [110].

After the original features are derived from the raw data, feature selection and feature learning can be conducted to improve the performance of the ML models. A feature set can be defined as $A = \{X_1, X_2, X_3, \ldots, X_n\}$, where $X_i$ represents the $i$th feature in the original feature space $\mathbb{R}^n$. If the dimensionality of a feature set is high, it is theoretically difficult to construct an optimum ML model owing to the large number of hypotheses under consideration [111]. Feature selection chooses a subset $S \subset A$ to reduce the feature space to $\mathbb{R}^s$, where $s$ is a new dimension smaller than $n$ [112]. The selected subset should ideally only contain features relevant to the modeling task. Generally, features are ranked according to the selected technique, and then a subset is chosen as the new feature space [113]. According to Xie et al. [7], feature selection techniques can be classified into three categories:

(1) Label-feature correlation-based methods are supervised selection methods that evaluate the correlation between each feature and the label. Features with high correlations are selected as the input features.

(2) Similarity/interaction-based methods rank features by investigating the dependencies and interactions between them. If two or more features are found to be statistically similar or highly linearly dependent, only the most representative feature will be selected. In addition, the interactions between two features are detected to approximate nonlinear relationships.

(3) Wrappers investigate the correlations between the label and subsets of features instead of individual features. The most suitable feature subset is determined by iteratively training ML models using different subsets. The subset that provides the best model performance is selected at each iteration. The selection process is terminated when the designated model performance is achieved.
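The first two categories above can be combined in a simple filter, sketched below. The example ranks features by the absolute Pearson correlation with the label and skips any feature that is nearly linearly dependent on one already selected. The use of Pearson correlation, the threshold value, and the function names are assumptions made for illustration; wrappers are not covered by this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant feature carries no linear information
    return cov / (norm_x * norm_y)


def select_features(features, label, k, redundancy_threshold=0.95):
    """Pick up to k feature names from a dict of name -> list of values."""
    # Category (1): rank by label-feature correlation.
    ranked = sorted(
        features, key=lambda name: abs(pearson(features[name], label)), reverse=True
    )
    selected = []
    for name in ranked:
        if len(selected) == k:
            break
        # Category (2): drop features that duplicate an already selected one.
        if all(
            abs(pearson(features[name], features[s])) < redundancy_threshold
            for s in selected
        ):
            selected.append(name)
    return selected


if __name__ == "__main__":
    feats = {
        "x1": [1, 2, 3, 4, 5],
        "x2": [2, 4, 6, 8, 10],  # linearly dependent on x1
        "x3": [5, 1, 4, 2, 3],
    }
    print(select_features(feats, [1.1, 2.0, 2.9, 4.2, 5.0], k=2))
```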

Feature learning transforms the original features to generate a set of new features $L \not\subset A$ in the feature space $\mathbb{R}^l$ [114]. New features are generated to better represent the original data set by reducing the noise and extracting the hidden patterns [115]. Feature learning techniques can be categorized into statistical and ML methods. One of the most commonly implemented statistical methods is principal component analysis (PCA). It projects the original data set to a low-dimensional space with the principal components as the new features. To transform a training set $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$ ($m \in \mathbb{N}^+$) from $\mathbb{R}^n$ to $\mathbb{R}^l$, where $l < n$, the covariance matrix of the mean-centered samples is:

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T}$$

The eigenvalues and eigenvectors of $\Sigma$ are then calculated. The eigenvectors corresponding to the $l$ largest eigenvalues are selected to compose the transformation matrix $P$. The new data set with new features can be obtained by $Z = P^{T} \times X$. New features can also be learned using ML methods. For example, autoencoders (AEs) and domain adversarial neural networks can be trained to learn representative features. Convolutional and recurrent neural networks can be combined with AEs to learn spatial and temporal features. The learned features have the potential to possess desirable properties, such as disentangled causal factors. Our previous work, Xie et al. [7], provides a more comprehensive review of feature selection and feature learning techniques for ML-based gas turbine modeling.
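The PCA steps above can be sketched with NumPy as follows. The sketch assumes that samples are stored as columns of an n-by-m array and that centering is performed inside the function; the function name `pca_transform` is illustrative.

```python
import numpy as np


def pca_transform(X, l):
    """Project X (n features x m samples) onto its first l principal components.

    Returns Z with shape (l, m) and the transformation matrix P with shape (n, l).
    """
    X = np.asarray(X, dtype=float)
    # Center each feature so that the matrix below is a covariance matrix.
    Xc = X - X.mean(axis=1, keepdims=True)
    m = Xc.shape[1]
    sigma = (Xc @ Xc.T) / m
    # eigh is suited to symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    order = np.argsort(eigvals)[::-1][:l]
    P = eigvecs[:, order]
    Z = P.T @ Xc
    return Z, P


if __name__ == "__main__":
    X = np.array([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]])
    Z, P = pca_transform(X, l=1)
    print(Z.shape, P.ravel())
```

In this toy input the second feature is exactly twice the first, so a single principal component retains all of the variance.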

Data preprocessing improves data quality or suitability for ML-based modeling [116]. In general, preprocessing can be classified into image preprocessing and numerical preprocessing. Images are often preprocessed with techniques such as gray-scaling and cropping to reduce computational complexity [117]. Numerical preprocessing techniques such as normalization can lead to better performance and faster convergence for ML training processes [118]. Our previous work, Zhang et al. [19], systematically reviewed the data preprocessing techniques for AM.
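As a small illustration of numerical preprocessing, the sketch below shows two common normalization variants: min-max scaling to [0, 1] and standardization to zero mean and unit variance. The function names are illustrative, and the population standard deviation is an assumed convention.

```python
import math


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    print(min_max_normalize([2, 4, 6]))
    print(standardize([1, 2, 3]))
```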

There are many overlaps among the above data quality improvement methods. For example, design parametrization can be regarded as feature extraction for a type of design. Feature learning can also be considered feature extraction using statistical or ML methods. In addition, there is no fixed sequence for implementing the above methods. For instance, ML-based feature learning usually occurs after data preprocessing. However, if standardization is implemented to preprocess the data and force the features to have uniform variances, PCA must be implemented before the data are standardized. Fig. 3 indicates the earliest point where each data improvement method is usually implemented. The sequence and occurrence may vary depending on the specific scenario.

6.1.2. Data augmentation

Data augmentation can mitigate data imbalance and scarcity by partially altering existing samples or synthesizing new samples to increase the sizes of minority groups [55]. Various data augmentation methods have been implemented in this domain since 2019 (Table 6 [105], [109], [110], [119], [120], [121], [122], [123], [124], [125], [126], [127], [128], [129], [130], [131], [132], [133], [134], [135], [136], [137], [138], [139], [140], [141], [142], [143], [144], [145], [146], [147], [148], [149], [150], [151], [152], [153], [154], [155], [156], [157], [158], [159], [160], [161], [162], [163], [164], [165], [166], [167], [168]). Data augmentation methods can be divided into three categories: domain knowledge, statistical, and ML methods.

(1) Domain knowledge-based data augmentation methods

Manually creating new examples based on domain knowledge to reduce representation bias is common in ML-based design and manufacturing. Due to monitoring system limitations (e.g., fixed camera position and angle), the data sets acquired from experiments naturally contain biases. The goal of knowledge-based data augmentation is to reduce biases by imputing examples that are not represented in the data set using domain knowledge. Numerous augmentation techniques are borrowed from the computer vision domain to generate new graphics data, including applying noise [119], [120], [121], [122], [123], [124], [125], [126], [127], [128], rotation [119], [120], [122], [123], [124], [125], [126], [127], [128], [129], [130], [131], [132], [133], brightness change [120], [122], [124], [125], [127], [133], [134], contrast change [120], [133], [134], [135], shadow [122], [124], [125], scaling [123], [126], [130], [132], [136], [137], translation [123], [126], [127], [131], [132], [133], flipping [123], [124], [125], [126], [131], [133], [137], and deformation [121], [136], [161].

Knowledge has also been borrowed from the signal processing domain to transform time-series data collected from sensors. Lee and Lee [109] proposed a series of ML models for vehicle noise level prediction based on revolutions per minute (RPM)-frequency spectrograms. The mathematical expression of engine order lines on spectrograms with respect to the number of cylinders in the engine was manually derived to synthesize more spectrograms. Sha et al. [138] proposed a data augmentation technique using a sliding window and fast Fourier transform to segregate a long signal into several short signals for cavitation detection. Ye et al. [139] proposed circshift to augment time-series data collected from railway wheels according to their periodic behavior. However, it is unclear whether the methods proposed by Sha et al. [138] and Ye et al. [139] are data augmentation methods because they are essentially based on a sliding window, which is a data preprocessing technique. Other signal processing techniques, such as applying noise [140], signal translation [140], amplitude shifting [140], [141], and time-stretching [140], [141], have also been utilized to conduct data augmentation.

Some domain knowledge utilized to perform data augmentation originates from mechanical engineering. Zhang et al. [142] randomly altered the dimension sets of engineering drawings to generate synthetic drawings that can represent more users with different drawing styles. Ruediger-Flore et al. [122] created high-resolution, computer-aided design models and stacked them with real-life workshop backgrounds at different camera angles to achieve photorealistic synthesis. Lyu et al. [143] proposed morphing-based data augmentation to oversample fatigue fracture images by interpolating between two fracture images based on topology.
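The signal-oriented transformations mentioned above can be illustrated with generic helpers. The sketch below shows a sliding window, a circular shift, and additive Gaussian noise on a one-dimensional signal. These are simplified stand-ins written for illustration; they do not reproduce the specific procedures of the cited works, and all names and parameters are assumptions.

```python
import random


def sliding_windows(signal, width, step):
    """Cut a long signal into (possibly overlapping) segments of equal width."""
    return [
        list(signal[start:start + width])
        for start in range(0, len(signal) - width + 1, step)
    ]


def circular_shift(signal, k):
    """Rotate a periodic signal to the right by k samples."""
    k %= len(signal)
    return list(signal[-k:]) + list(signal[:-k])


def add_gaussian_noise(signal, sigma, seed=0):
    """Return a copy of the signal with zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


if __name__ == "__main__":
    s = [1, 2, 3, 4, 5]
    print(sliding_windows(s, width=3, step=2))
    print(circular_shift(s, 1))
    print(add_gaussian_noise(s, sigma=0.1))
```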

(2) Statistical data augmentation methods

Statistical data augmentation methods statistically oversample or downsample data sets. The most naive statistical method is random oversampling (ROS), which randomly duplicates minority class samples. The most frequently implemented statistical data augmentation technique is the synthetic minority oversampling technique (SMOTE), which synthesizes minority samples using k-nearest neighbors (KNN). The pseudocode of SMOTE for synthesizing samples for an individual minority class is shown in Algorithm 1 [146].

Martins et al. [144] combined SMOTE and additive white Gaussian noise to oversample the scarce examples of fault classes. This technique increased the fault classification accuracy of the stacked sparse AE model by 3.5% in the roller bearing multi-fault detection case study. Fan et al. [145] utilized SMOTE to oversample defect examples for ML-based defect diagnosis in wafer production. Data augmentation improved the accuracy of the linear regression model from 81% to 100%.
Algorithm 1. SMOTE (Minority_class, N, k) [146].
Input: minority class samples (Minority_class), amount of oversampling as a percentage of the original minority class size (N), and number of nearest neighbors (k)
Output: synthetic minority class samples
(* If N is less than 100, randomly order the minority class samples because only a random percentage of them will be used to synthesize samples. *)
M = number of samples in Minority_class
if N < 100 then
    Randomly order the Minority_class
    M = (N/100) × M
    N = 100
end if
N = (int)(N/100) (* The amount of oversampling is assumed to be an integral multiple of 100. *)
k = number of nearest neighbors
numattrs = number of attributes of each Minority_class sample
Sample[ ][ ]: array that stores the Minority_class samples
newindex = 0 (* Counter of the synthetic samples generated so far. *)
Synthetic[ ][ ]: array that stores the synthetic samples
(* Compute k nearest neighbors for each Minority_class sample. *)
for i = 1 to M do
    Compute k nearest neighbors for i, and save the indices in nnarray
    Populate(N, i, nnarray)
end for

Populate(N, i, nnarray) (* This function generates N synthetic samples from sample i. *)
while N ≠ 0 do
    nn = a random integer from [1, k] (* This step chooses one of the k nearest neighbors of i. *)
    for attr = 1 to numattrs do
        dif = Sample[nnarray[nn]][attr] - Sample[i][attr]
        gap = random number between 0 and 1
        Synthetic[newindex][attr] = Sample[i][attr] + gap × dif
    end for
    newindex += 1
    N = N - 1
end while
return (* End of Populate. *)
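Algorithm 1 can be transcribed into runnable Python roughly as follows. This is a minimal sketch under stated assumptions: Euclidean distance is used for the nearest-neighbor search, neighbors are drawn from the full minority class, k is smaller than the number of minority samples, and the function name and `seed` parameter are illustrative.

```python
import math
import random


def smote(minority, n_percent, k, seed=0):
    """Generate synthetic minority samples following Algorithm 1.

    minority: list of numeric feature vectors of one minority class.
    n_percent: amount of oversampling in percent (e.g., 200 doubles twice).
    k: number of nearest neighbors (assumed smaller than len(minority)).
    """
    rng = random.Random(seed)
    pool = [[float(v) for v in sample] for sample in minority]
    seeds = list(range(len(pool)))
    if n_percent < 100:
        # Only a random fraction of the minority samples is used as seeds.
        rng.shuffle(seeds)
        seeds = seeds[: int(n_percent / 100 * len(pool))]
        n_percent = 100
    per_seed = int(n_percent / 100)
    synthetic = []
    for i in seeds:
        # Indices of the k nearest neighbors of sample i, excluding itself.
        others = [j for j in range(len(pool)) if j != i]
        nnarray = sorted(others, key=lambda j: math.dist(pool[i], pool[j]))[:k]
        for _ in range(per_seed):
            nn = rng.choice(nnarray)
            # A fresh random gap is drawn per attribute, as in Algorithm 1.
            synthetic.append(
                [a + rng.random() * (b - a) for a, b in zip(pool[i], pool[nn])]
            )
    return synthetic


if __name__ == "__main__":
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    for sample in smote(corners, n_percent=200, k=2):
        print(sample)
```

Some implementations draw a single gap per synthetic sample so that the new point lies on the line segment between the two samples; the per-attribute gap used here follows the pseudocode as printed.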

(3) ML-based data augmentation methods

Recently, ML-based data augmentation techniques such as generative adversarial networks (GANs) and AEs have been intensively researched [147]. A GAN consists of a generator ($G$) that generates synthetic examples from a noise vector and a discriminator ($D$) that distinguishes whether the input is real or synthetic [148]. The generator and discriminator are trained adversarially, through competition, which guides the GAN toward generating high-quality syntheses. Fig. 5(a) shows the original form of GAN, which is also known as the vanilla GAN. At an arbitrary training epoch $i$, a batch of random vectors ($Z_g$) is fed into the generator ($G_i$) to yield the synthetic samples ($S_g$). The synthetic samples are then fed into the discriminator ($D_i$) to yield the predictions ($\hat{Y}_G$), which are the outputs of the sigmoid function. Each entry of $\hat{Y}_G$ is within the range of [0, 1] and can be perceived as the probability of the input being positive, that is, real (Eq. (1)):

$$\hat{Y}_G = D_i(S_g) = D_i(G_i(Z_g))$$

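$G_{i+1}$ can be obtained by updating $G_i$ based on the binary cross entropy (BCE) loss between $\hat{Y}_G$ and a vector of ones. The original BCE loss ($L_{\mathrm{BCE}}$) and the loss to update the generator ($L_G$) are:

$$L_{\mathrm{BCE}} = \mathrm{BCE}(Y, \hat{Y}) = -\frac{1}{n_Y} \sum_{j=1}^{n_Y} \left[ Y_j \log \hat{Y}_j + (1 - Y_j) \log\left(1 - \hat{Y}_j\right) \right]$$

$$L_G = \mathrm{BCE}(\mathbf{1}, \hat{Y}_G) = -\frac{1}{n_Y} \sum_{j=1}^{n_Y} \log \hat{Y}_{G,j}$$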

where $n_Y$ is the number of predictions, $Y$ is the true label, and $\hat{Y}$ is the prediction. Following the update of the generator at epoch $i$, the discriminator is trained to improve its ability to classify real and synthetic samples. First, a new batch of random vectors ($Z_d$) is fed into the updated generator ($G_{i+1}$) to yield a new batch of synthetic samples ($S_d$). The synthetic samples are then fed into $D_i$ to yield the predictions $\hat{Y}_D^{-}$, which are considered negative samples that receive labels of $Y_D^{-} = 0$. The real samples ($S$) in the training set are also fed into $D_i$ to yield $\hat{Y}_D^{+}$, which are considered positive samples that receive labels of $Y_D^{+} = 1$. The predictions and labels are concatenated to construct $\hat{Y}_D$ and $Y_D$, respectively:

$$\hat{Y}_D^{-} = D_i(S_d) = D_i(G_{i+1}(Z_d))$$

$$\hat{Y}_D^{+} = D_i(S)$$

$$\hat{Y}_D = \begin{bmatrix} \hat{Y}_D^{+} \\ \hat{Y}_D^{-} \end{bmatrix}$$

$$Y_D = \begin{bmatrix} Y_D^{+} \\ Y_D^{-} \end{bmatrix}$$

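$D_{i+1}$ can be obtained by updating $D_i$ based on the BCE loss between $\hat{Y}_D$ and $Y_D$ ($L_D$):

$$L_D = \mathrm{BCE}(Y_D, \hat{Y}_D) = -\frac{1}{n_Y} \sum_{j=1}^{n_Y} \left[ Y_{D,j} \log \hat{Y}_{D,j} + (1 - Y_{D,j}) \log\left(1 - \hat{Y}_{D,j}\right) \right]$$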
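The generator and discriminator losses above can be evaluated numerically as in the following sketch. It only computes the loss values from given discriminator outputs and does not perform any network update. The function names and the clipping constant `eps`, added for numerical safety, are assumptions.

```python
import math


def bce(y_true, y_pred, eps=1e-12):
    """Binary cross entropy averaged over all predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def generator_loss(d_on_synthetic):
    """L_G: BCE between discriminator outputs on synthetic samples and ones."""
    return bce([1] * len(d_on_synthetic), d_on_synthetic)


def discriminator_loss(d_on_real, d_on_synthetic):
    """L_D: real samples are labeled 1 and synthetic samples are labeled 0."""
    y_hat = list(d_on_real) + list(d_on_synthetic)
    y = [1] * len(d_on_real) + [0] * len(d_on_synthetic)
    return bce(y, y_hat)


if __name__ == "__main__":
    # A discriminator that is fooled (outputs near 1 on synthetic samples)
    # gives a small generator loss and a large discriminator loss.
    print(generator_loss([0.9, 0.8]))
    print(discriminator_loss([0.7, 0.6], [0.9, 0.8]))
```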

The above procedures describe a typical method for training a vanilla GAN. There are other methods to achieve and enhance adversarial training. After a GAN is trained, synthetic samples ($S_g$) can be generated by inputting random vectors ($Z$) into the generator: $S_g = G(Z)$. Some works have reported the use of GANs to conduct data augmentation. Dong et al. [134] synthesized gear grinding burn images using a vanilla GAN. De Santo et al. [110] generated synthetic examples using a vanilla GAN for time-series predictive maintenance data mainly to benchmark some time-series encoding techniques, including recurrence plots and Gramian angular fields. The authors highlighted that vanilla GANs only provided slight performance enhancement while consuming considerably more computational resources. In addition, vanilla GANs cannot generate synthetic samples with respect to a specific demand.

Instead of generating synthesis solely from a noise vector, a conditional GAN (CGAN) allows the model to be conditioned with additional information such as class labels to synthesize designated examples (Fig. 5(b)). A classic method for training a CGAN is to concatenate the sample vector with its associated condition vector. During training, a batch of random vectors ($Z_g$) is first concatenated with condition vectors ($P_g$) to indicate what types of samples are to be generated. The concatenated vector is fed into the generator ($G_i$) to obtain the synthetic samples ($S_g$). $S_g$ is again concatenated with the associated condition vectors and then fed into the discriminator ($D_i$) to yield the predictions ($\hat{Y}_G$), where $\oplus$ denotes concatenation:

$$\hat{Y}_G = D_i(S_g \oplus P_g) = D_i(G_i(Z_g \oplus P_g) \oplus P_g)$$

After the generator is updated, $\hat{Y}_D^{+}$ and $\hat{Y}_D^{-}$ can be obtained using the same procedures as Eqs. (5), (6), (7), (8). The only difference is that the random and sample vectors are concatenated with the associated condition vectors:

$$\hat{Y}_D^{-} = D_i(S_d \oplus P_d) = D_i(G_{i+1}(Z_d \oplus P_d) \oplus P_d)$$

$$\hat{Y}_D^{+} = D_i(S \oplus P)$$

When generating synthetic samples using a CGAN, the designated conditions are concatenated with random vectors so that the generated samples satisfy those conditions. The above methods can generate 1-dimensional vectors but not images, which are 2-dimensional matrices (e.g., grayscale) or multiple channels of 2-dimensional matrices (e.g., RGB color model).
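The conditioning-by-concatenation scheme described above can be sketched as a data flow, independent of any specific network. In the example below, `generator` and `discriminator` are hypothetical callables standing in for trained networks, and the class label is encoded as a one-hot condition vector, which is an assumed but common choice.

```python
def one_hot(class_index, n_classes):
    """Encode a class label as a condition vector."""
    return [1.0 if c == class_index else 0.0 for c in range(n_classes)]


def concat(vector, condition):
    """Append the condition vector to a noise or sample vector."""
    return list(vector) + list(condition)


def cgan_forward(generator, discriminator, z, p):
    """Compute D(G(z + p) + p), where + stands for concatenation."""
    synthetic = generator(concat(z, p))
    return discriminator(concat(synthetic, p))


if __name__ == "__main__":
    # Toy stand-ins that only demonstrate how vector lengths propagate.
    toy_generator = lambda v: [sum(v)] * 3   # emits a 3-dimensional "sample"
    toy_discriminator = lambda v: len(v)     # reports its input length
    condition = one_hot(1, n_classes=3)
    print(cgan_forward(toy_generator, toy_discriminator, [0.1, 0.2], condition))
```

The generator therefore receives a vector whose length is the noise dimension plus the condition dimension, and the discriminator receives the sample dimension plus the condition dimension.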

To generate images, the generator and discriminator can employ convolution layers and pooling layers to construct a convolutional GAN. Fig. 5(c) describes the structure of a conditional convolutional GAN with the dimensions of the convolution layers indicated. A convolution layer has multiple channels, each of which embeds a 2-dimensional feature map. As the random vector passes through the generator, the number of channels decreases, and the size of the feature maps increases. In this way, the output of the generator will be an image with the designated size and number of channels. The discriminator reverses the process of the generator by increasing the number of channels and decreasing the size of the feature maps. Eventually, the feature maps are flattened into a fully connected layer before the discriminator makes a prediction regarding whether the image is real or fake.
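The growth and shrinkage of feature maps described above can be traced with the standard output-size formulas for strided convolutions. The sketch below assumes that the generator upsamples with transposed convolutions and the discriminator downsamples with strided convolutions, with no dilation or output padding. The layer settings and channel counts are hypothetical and are not the dimensions shown in Fig. 5(c).

```python
def transposed_conv_size(size, kernel, stride, padding):
    """Output width of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_size(size, kernel, stride, padding):
    """Output width of a regular strided convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def generator_shapes(start_size, channels, kernel=4, stride=2, padding=1):
    """List (channels, height, width) after each generator layer."""
    size = start_size
    shapes = [(channels[0], size, size)]
    for c in channels[1:]:
        size = transposed_conv_size(size, kernel, stride, padding)
        shapes.append((c, size, size))
    return shapes


if __name__ == "__main__":
    # Channels shrink while the feature maps grow, ending in a 1-channel image.
    for shape in generator_shapes(4, [256, 128, 64, 1]):
        print(shape)
    # The discriminator halves the map size again with the same settings.
    print(conv_size(32, kernel=4, stride=2, padding=1))
```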

Based on vanilla GAN and CGAN, researchers have proposed modified GANs to address the challenges in data-driven design and manufacturing by altering the loss functions and network structures. In data-driven design, Chen and Ahmed [149] presented a performance augmented diverse GAN (PaDGAN) that combines GAN loss with performance augmented determinantal point process (DPP) loss in the training process to achieve generative design. PaDGAN models learn to synthesize training design data while generating diverse shapes with desirable properties. Considering that there are usually multiple target properties in design tasks, Chen and Ahmed [105] proposed MO-PaDGAN, which integrates performance augmented diverse GAN with multiobjective Bayesian optimization (MOBO). As demonstrated in the case studies, this pipeline can generate diverse shapes and facilitate the exploration of the full Pareto fronts in the property space. Nobari et al. [150] proposed a performance conditional diverse GAN (PcDGAN) to enable the generation of designs with designated properties. Compared with PaDGAN, this model is more flexible because the users can appoint desirable properties instead of maximizing or minimizing properties. The above GAN models are devised and trained to conduct data representation or synthesis instead of data augmentation. These works are included in this survey because they can potentially be utilized as advanced data augmentation methods. In particular, the diversity of the synthetic data, which is often overlooked by other data augmentation methods in this survey, can be improved by using these methods. Yoo et al. [151] proposed a designable GAN that deploys an inverse generator to infer and visualize the factors that affect system-level performance. Wu et al. [152] proposed a data augmentation GAN (daGAN) pipeline with two generators: the first generator is trained to conduct data augmentation, and the second generator focuses on airfoil design generation conditioned on a Mach number.

GANs are frequently used to generate defect or fault examples in manufacturing and condition monitoring data sets. In data-driven manufacturing, Jain et al. [124] trained a convolutional GAN model that replaces fully connected layers with convolution layers to enable image generation for hot-rolled steel defect classification. Wang et al. [153] proposed AdaBalGAN, which adaptively generates different numbers of synthetic examples for each wafer defect class at each iteration according to its class accuracy. Niu et al. [156] introduced D2 adversarial loss [169] and cycle consistency loss [170] to generate surface defect examples with high fidelity and diversity. In data-driven condition monitoring, Behera and Misra [159] used a CGAN to enrich fault instances to build predictive maintenance models using gated recurrent units. Li et al. [157] generated synthetic fault signals using the Wasserstein GAN (WGAN), where the Wasserstein distance is used to measure the divergence between the distribution of real examples perceived by the discriminator and the distribution of synthesized examples generated from the generator. To address the lack of diversity of WGAN-synthesized examples, SMOTE was implemented to further oversample the minority fault classes. WGAN and SMOTE helped consistently improve the accuracy of deep neural network models by more than 5% across different data imbalance ratios in the electromechanical fault diagnosis case study. Li et al. [147] proposed an augmented time-regularized GAN (ATR-GAN) for online process anomaly detection, which is suitable for time-series signals from condition monitoring. ATR-GAN employs a time-regulated Hausdorff distance that measures the similarity between data points while considering the temporal effect. An augmented filter layer is embedded to calculate the similarity between the real samples and synthetic samples and to remove the synthetic samples that are disparate from the real samples. Zhou et al. [163] proposed a distribution bias-aware collaborative GAN that utilized collaborative training to generate synthetic samples that resemble the distribution of the original data.

An AE is composed of an encoder (En) that maps the input into a low-dimensional latent space at the code layer (C) and a decoder (De) that reconstructs the input from the code (Fig. 6(a)):

$$S_g = \mathrm{De}(C) = \mathrm{De}(\mathrm{En}(S))$$

During training, the S’s are both the input and the label. A reconstruction loss (LR) that computes the difference between the original (S’s) and synthetic samples (Sg’s) is used to update both the encoder and decoder:

$$\text{Reconstruction loss} = \sum_{i=1}^{n_S} L_R(S_i, S_{g,i}) = \sum_{i=1}^{n_S} L_R\left(S_i, \mathrm{De}(\mathrm{En}(S_i))\right)$$

where nS is the training set size. A commonly used reconstruction loss is the mean squared error. Synthetic samples can be generated by adding random vectors (Z’s) to the code layer (Fig. 6(a)). Once an input is mapped to a code, the decoder will try to reconstruct the input from the code with the added noise. In this way, the reconstructed samples are slightly different every time the same input is fed into the model. Similar to GANs, an AE can also be constructed with convolution layers to become a convolutional AE or be conditioned with class labels to become a conditional AE. The variational AE (VAE) is a special AE that reparametrizes the code layer into a multivariate normal distribution to construct a probabilistic latent space (Fig. 6(b)). The output of the encoder is:

$$[C_\mu, C_\sigma] = \mathrm{En}(S)$$

where Cμ and Cσ are the mean vector and standard deviation vector to be reparametrized into C:

$$C_i = C_{\mu,i} + \exp(C_{\sigma,i})\,\varepsilon_i \quad \text{for } i = 1, 2, 3, \ldots, n_C$$

where $\varepsilon_i \sim N(0, 1)$, N stands for a normal distribution, and nC is the dimensionality of the code layer. Kullback–Leibler (KL) divergence can be computed to compare the latent distribution and standard normal distribution:

$$\mathrm{KL}(C) = \sum_{j=1}^{n_S} \sum_{i=1}^{n_C} \left( C_{\mu,i,j}^{2} + \exp(2C_{\sigma,i,j}) - 2\ln C_{\sigma,i,j} - 0.5 \right)$$

KL divergence is added to the loss function of VAE (LVAE) to make the latent distribution approximate a standard normal distribution:

$$L_{VAE} = \sum_{i=1}^{n_S} L_R(S_i, S_{g,i}) + \mathrm{KL}(C)$$

While generating synthetic samples, random vectors are sampled from a multivariate standard normal distribution as C, which are then fed into the decoder to output the Sg’s. To condition a VAE, a classic approach is to concatenate the sample vectors with the associated condition vectors (Fig. 6(b)). After the latent vectors are obtained from the encoder and reparameterization, they are again concatenated with the associated condition vectors. In this way, the decoder learns to generate different types of samples according to the conditions. To generate synthetic samples, random vectors are sampled from a multivariate standard normal distribution as the latent vectors and then concatenated with the designated condition vectors. The concatenated vector is fed into the decoder to generate Sg’s. Yang et al. [164] synthesized air rudder defect image snippets using convolutional AE and transplanted them into images with no defects to increase the sizes of the defect classes. Li et al. [158] utilized an AE to synthesize gear fault data for fault diagnosis. Alawieh et al. [154] generated synthetic wafer defect images using a convolutional AE to address the data imbalance challenge in wafer defect pattern classification. Yun et al. [155] built a conditional convolutional VAE to synthesize defect images on metal surfaces. Che et al. [162] trained a hybrid gated recurrent unit and VAE to detect roller bearing faults.
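The reparameterization and conditional sampling steps described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the implementation of any surveyed work: `decode` is a hypothetical, untrained linear stand-in for a trained decoder, `c_log_sigma` is assumed to store log standard deviations (so that exp(Cσ) is the standard deviation), and the KL term is the textbook closed form for a diagonal Gaussian against a standard normal under that assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(c_mu, c_log_sigma, rng):
    """C = C_mu + exp(C_sigma) * eps, with eps ~ N(0, 1) drawn per element."""
    eps = rng.standard_normal(c_mu.shape)
    return c_mu + np.exp(c_log_sigma) * eps

def kl_to_standard_normal(c_mu, c_log_sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over samples and code dims.
    Assumption: c_log_sigma holds log standard deviations."""
    return 0.5 * np.sum(c_mu**2 + np.exp(2.0 * c_log_sigma) - 2.0 * c_log_sigma - 1.0)

def decode(code_and_condition, weights):
    """Hypothetical stand-in for a trained decoder: a single linear layer."""
    return code_and_condition @ weights

n_code, n_classes, n_features = 4, 3, 8
weights = rng.standard_normal((n_code + n_classes, n_features))  # untrained, illustrative only

# Generation: sample codes from N(0, I) and concatenate the designated one-hot condition.
n_new = 5
codes = rng.standard_normal((n_new, n_code))
condition = np.zeros((n_new, n_classes))
condition[:, 1] = 1.0  # request class 1 for every synthetic sample
synthetic = decode(np.concatenate([codes, condition], axis=1), weights)
```

In a real conditional VAE, the same concatenation would also be applied at the encoder input during training, as the paragraph above describes.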

Special techniques have been applied to augment data using ML methods. Niu et al. [168] blocked the high-confidence regions on the training images based on a pretrained convolutional neural network for metal surface defect detection. This method balances the ML attention given to different regions and avoids overfitting to high-confidence regions. Yang et al. [165] constructed a bearing RUL prediction tool where the time-series signal was transformed into spectrograms using Fourier transform and then modeled using graphs. A graph-based data augmentation method was proposed to oversample the training data. However, models trained with augmented data sometimes perform worse than those trained without augmentation throughout the bearing lifecycle. Peng et al. [166] synthesized rare fault examples by learning the embedding using AE and applying soft Brownian offset to the latent space to generate new samples.

(4) Synthetic data evaluation

The evaluation of synthetic data quality was only conducted in two papers. Farady et al. [167] proposed PreAugNet, which investigates the quality of the augmented data. A support vector machine (SVM) was trained to classify whether the augmented defect images were similar enough to the original data set. This method is not sufficiently rigorous to ensure that synthetic images are diverse and close to reality. Meister et al. [160] utilized a conditional convolutional GAN to generate synthetic images for fiber layup defect detection. GAN-train and GAN-test were implemented to evaluate synthetic image diversity and realism. Diversity describes the variance of the synthetic samples; realism describes how close the synthetic samples are to the real data.

Synthetic data evaluation investigates the relationships among the true distribution, real data, and synthetic data (Fig. 7). The real data are acquired to capture the true distribution of the system. However, the distribution of the real data set can potentially differ from the true distribution for various reasons, such as sampling bias and data scarcity. The synthetic data in Fig. 7 are the data points generated by data augmentation. Note that synthetic data might inherit and magnify biases from real data, especially for statistical and ML-based data augmentation trained using real data. For example, mode collapse might occur while training GANs and lead to low synthetic data diversity. It is crucial yet difficult to evaluate the synthetic data quality with respect to the true distribution. The existing evaluation methods usually verify whether the synthetic data follow the distribution of the real data set, which is assumed to be representative of the true distribution. The existing synthetic data validation methods include, but are not limited to, descriptive statistics, graphical representations, ML performance, GAN-train/GAN-test, and other quantitative metrics [171].

By comparing the basic descriptive statistics (e.g., means, variances, and medians), the similarity in the distributions between the real data set and the synthetic data set can be investigated. The more similar the descriptive statistics are, the closer the synthetic data are to the real data distribution. Nonetheless, Anscombe [172] showed that different distributions might yield identical basic descriptive statistics. Graphical representations of a distribution, such as histograms and quantile–quantile plots, can provide more statistical details than basic descriptive statistics in a visualized manner. Graphical packages that embrace various graphical representations to visualize data sets are readily available. However, the graphs can be overwhelming and misleading, especially when dimensionality increases and multicollinearity occurs.
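The limitation of basic descriptive statistics can be illustrated with a small sketch. The two samples below are constructed purely for illustration (they are not Anscombe's data): one is evenly spread and the other is concentrated in two tight clusters, yet both share the same mean, population variance, and median.

```python
import math
import statistics

d = math.sqrt(2.5)
evenly_spread = [1.0, 2.0, 3.0, 4.0, 5.0]
clustered = [3.0 - d, 3.0 - d, 3.0, 3.0 + d, 3.0 + d]  # two duplicated extremes plus a centre point

def summary(values):
    """Mean, population variance, and median of a sample."""
    return (
        statistics.mean(values),
        statistics.pvariance(values),
        statistics.median(values),
    )

print(summary(evenly_spread))
print(summary(clustered))
```

Both calls report a mean of about 3, a variance of about 2, and a median of 3, although a histogram of the two samples would look quite different.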

When using ML-based data augmentation techniques, the predictive performances of the ML models can indicate the similarities between the two data sets (Fig. 8(a) [171], [173]). The real data set is split into a training set (Dtrain) and a test set (Dtest). Dtrain is used to train the ML-based data augmentation model, which generates a synthetic data set (Dsyn). Thereafter, two ML models are trained using Dtrain and Dsyn. Dtest is utilized to test the models and determine their predictive performance. Dtrain and Dsyn are considered similar if the predictive performance of the model trained using Dsyn approaches that of the model trained using Dtrain. However, this method builds on the assumption that Dtest adequately represents the distribution of real data, which is undermined if Dtest is small. In addition, the choice of hyperparameters might significantly influence the predictive performance of the models.

GAN-train and GAN-test were developed to evaluate the performance of GANs (Fig. 8(b)) [173]. The real data set is split into Dtrain and Dtest. Dtrain is used to train a GAN model, which subsequently generates the synthetic data set Dsyn. Dsyn is used to train model #2, and Dtest is used to evaluate the model, which yields the GAN-train performance. A high GAN-train performance indicates that the synthetic data set is diverse and realistic. Dtrain is used to train model #1, and Dsyn is used to evaluate the model, which yields the GAN-test performance. A very high GAN-test performance indicates that the GAN model is likely to overfit Dtrain. A very low GAN-test performance indicates that the GAN model is likely to underfit Dtrain. Similar to the performance-based evaluation method, the choice of the model hyperparameters might significantly influence the predictive performances of the models, which might lead to unstable evaluation.
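The data flow of GAN-train and GAN-test can be sketched on a toy problem. Everything below is an assumption made for illustration: the real data are two Gaussian blobs, the classifier is a nearest-centroid model, and `sample_synthetic` is a simple per-class Gaussian fit that stands in for a trained GAN (it is not a GAN). Only the train/evaluate pairing follows the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_real(n_per_class, rng):
    """Toy two-class data set: Gaussian blobs centred at (-2, 0) and (+2, 0)."""
    x0 = rng.normal([-2.0, 0.0], 1.0, size=(n_per_class, 2))
    x1 = rng.normal([2.0, 0.0], 1.0, size=(n_per_class, 2))
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)

def fit_centroids(x, y):
    """Nearest-centroid classifier: one mean vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, x, y):
    """Share of samples whose nearest centroid matches their label."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(dists.argmin(axis=1) == y))

def sample_synthetic(x, y, n_per_class, rng):
    """Stand-in generator (NOT a GAN): per-class Gaussian fitted to D_train."""
    xs, ys = [], []
    for c in (0, 1):
        xc = x[y == c]
        xs.append(rng.normal(xc.mean(axis=0), xc.std(axis=0), size=(n_per_class, 2)))
        ys.append(np.full(n_per_class, c))
    return np.vstack(xs), np.concatenate(ys)

x_train, y_train = make_real(200, rng)   # D_train
x_test, y_test = make_real(200, rng)     # D_test
x_syn, y_syn = sample_synthetic(x_train, y_train, 200, rng)  # D_syn

# GAN-train: train model #2 on D_syn, evaluate on D_test.
gan_train = accuracy(fit_centroids(x_syn, y_syn), x_test, y_test)
# GAN-test: train model #1 on D_train, evaluate on D_syn.
gan_test = accuracy(fit_centroids(x_train, y_train), x_syn, y_syn)
# Reference point: train on D_train, evaluate on D_test.
baseline = accuracy(fit_centroids(x_train, y_train), x_test, y_test)
```

Comparing `gan_train` with `baseline` gives a sense of how much is lost by training on synthetic rather than real data, while `gan_test` is read against the overfitting and underfitting interpretation given above.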

There are many quantitative measures for evaluating synthetic data quality. For example, the inception score [174], Fréchet inception distance [175], and sliced Wasserstein distance [176] can be used to evaluate the quality of synthetic images generated by GANs. Alaa et al. [177] proposed the α-precision, β-recall, and authenticity metrics to evaluate synthetic data quality. α-precision measures the realism of the synthetic data, β-recall measures the diversity, and the authenticity checks whether the data augmentation model overfits the training data. This comprehensive method separately measures realism, diversity, and generalizability.

(5) Discussion on data augmentation techniques

Among all the data augmentation methods, ROS is chosen as the baseline because it is simple yet widely applicable in all scenarios. This approach forces the learning process to give more attention to the minority classes but might lead to overfitting of the duplicated samples. Domain knowledge-based methods are highly effective at mitigating bias but require domain expertise. This approach can help mitigate the biases caused by environmental variations and monitoring system limitations. Thus, most domain knowledge-based methods are highly applicable to manufacturing and condition monitoring tasks that deploy sensing systems. Compared with other techniques, knowledge-based methods are less likely to cause overfitting to real data because they are not trained using real data. These methods can rapidly generate many synthetic samples using simple transformations. However, there are three major concerns:

(1) Many methods are not label-preserving. Although it is often assumed that transformed samples have the same label as the original samples, this assumption is not always true. In addition, there is no way to validate this hypothesis unless experiments are conducted on synthetic samples.

(2) Biases might be injected into the data set. Although it is less likely to magnify the biases from the real data, other biases might be induced by the transformations.

(3) As the amount of synthetic data increases, the computational costs of training an ML model also increase. Nonetheless, it is possible that not all synthetic data are meaningful in the task. Transformations must be applied wisely to avoid synthesizing irrelevant samples that occupy the learning capacity of the model.

SMOTE is a flexible method that is compatible with various data types and applications. Although the original SMOTE is only applicable to tabular data, it can be modified to oversample other data types, such as image and time-series data. Typical modifications for different data types utilize an AE to map the data into a latent space, which can be represented by tabular data and oversampled using regular SMOTE. SMOTE can also be modified to address application-specific challenges. For instance, borderline-SMOTE only oversamples the examples near the borderlines of different classes in classification tasks; K-means SMOTE enhances the diversity of synthetic data by first clustering the minority class and then oversampling each cluster. Compared with ROS, SMOTE will not be biased toward individual real samples because it fuses multiple samples. However, synthetic data might inherit biases from real data sets and eventually overfit them. Another weakness of SMOTE is that it does not learn from the underlying distribution of real data, thus yielding less realistic synthesis than ML-based methods. For example, SMOTE does not utilize majority class information to synthesize minority samples because it cannot learn from different groups. Additionally, label-preserving and computational cost issues also exist when using SMOTE.
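The contrast between ROS and SMOTE can be made concrete with a short NumPy sketch. It is a simplified illustration rather than a reference implementation: `random_oversample` and `smote` are hypothetical helper names, Euclidean distance is assumed for the neighbour search, and only the minority class is handled.

```python
import numpy as np

def random_oversample(x_min, n_new, rng):
    """ROS: draw exact duplicates of existing minority samples."""
    return x_min[rng.integers(0, len(x_min), size=n_new)]

def smote(x_min, n_new, k, rng):
    """SMOTE-style synthesis: interpolate between a minority sample
    and one of its k nearest minority neighbours."""
    dists = np.linalg.norm(x_min[:, None, :] - x_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(x_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # interpolation factor in [0, 1)
    return x_min[base] + gap * (x_min[picked] - x_min[base])

rng = np.random.default_rng(0)
x_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ros_samples = random_oversample(x_minority, 6, rng)
smote_samples = smote(x_minority, 6, k=2, rng=rng)
```

Every row of `ros_samples` coincides with a real sample, which is where the overfitting risk of ROS comes from, whereas the rows of `smote_samples` lie on line segments between real samples and so never leave the region spanned by the minority class.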

ML-based methods that learn the underlying patterns of real data can potentially synthesize more realistic and diverse samples than SMOTE. They can be modified to generate syntheses of different data types (e.g., convolution layers for image augmentation and LSTM layers for time-series augmentation). They can be conditioned to generate designated synthetic samples. In addition, advanced ML-based models that can be used for data augmentation are emerging. The diffusion model, inspired by nonequilibrium thermodynamics, is a state-of-the-art image generation algorithm [178]. The diffusion process involves a Markov chain of diffusion steps that gradually adds Gaussian noise to the original data. Then, the model learns to reverse the diffusion process to synthesize samples from noise. Diffusion models have been utilized as a data augmentation technique in the computer vision [179] and medical imaging [180] domains. Compared with ROS, ML-based methods consume significantly more computational resources but can generate exceptionally realistic and diverse syntheses.
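The forward (noising) half of the diffusion process can be sketched in a few lines. The sketch assumes the commonly used variance-preserving formulation, in which the noised sample at step t can be drawn directly as sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ᾱ_t the cumulative product of (1 − β); the linear β schedule and the toy input are illustrative choices, and the learned reverse process is not shown.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t directly from x_0 after t noising steps
    (closed form of the Gaussian Markov chain)."""
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule
x0 = np.ones(1000)                       # toy "image" flattened to a vector
x_early = forward_diffuse(x0, 1, betas, rng)      # almost unchanged
x_late = forward_diffuse(x0, 1000, betas, rng)    # close to pure Gaussian noise
```

After the final step the signal is almost entirely replaced by noise, which is why generation requires many learned denoising steps and tends to be slow.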

Although powerful, ML-based methods still face various challenges. The limitation of all ML-based methods is that they are likely to inherit biases from their training data set, which is the real data. There is no single method that possesses all three desirable characteristics: high-quality samples, fast sampling, and mode coverage/diversity (Fig. 9 [181]). GANs are able to generate high-quality samples and achieve fast sampling but often suffer from mode coverage problems. This phenomenon is referred to as mode collapse, in which GANs, if not trained well, only generate synthetic samples from one mode (i.e., limited variety), while there are other modes in the training set. In addition, the training of GAN is unstable due to the adversarial nature of the generator and discriminator. VAEs that are trained using variational inference methods can better approximate the real data distribution given a noise vector input. Thus, VAEs are less prone to mode collapse. Nevertheless, VAEs often generate low-quality samples that are blurry and hazy [182]. Diffusion models can synthesize high-quality and diverse samples, but the diffusion process is usually time-consuming. Moreover, diffusion models usually require a large training data set, which contradicts data scarcity in this domain.

6.2. Additional data acquisition

When additional data collection is allowed, there is a chance to acquire additional samples from underrepresented groups. The most computationally efficient way to acquire or label additional data is to collect only enough data to satisfy the requirements of the ML task, thereby avoiding unnecessary costs. Active learning is a cost-efficient approach that iteratively guides the data acquisition or labeling process to improve model performance (Fig. 10) [183]. At each iteration, an ML model is trained using the labeled data set. Adaptive sampling is conducted if the model does not meet the prediction performance requirement and the designated computational resources are not exhausted. A set of samples is selected to maximally improve the data representativeness or the model performance (Fig. 11). The samples can be selected from an unlabeled pool or a feature space, depending on the task. Thereafter, the selected samples are labeled by the annotator and added to the labeled data set for the next iteration. Some hybrid methods jointly consider improvements in data representativeness and model performance [184], [185], [186], [187]. Query-by-committee is a special active learning method that selects the next batch of data based on the performances of multiple models [188].
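The iterative loop described above can be sketched with a deliberately simple one-dimensional task. All names and numbers are hypothetical: `oracle` plays the annotator, the "model" is a single decision threshold placed midway between the closest labeled negative and positive examples, and the stopping tolerance stands in for the prediction performance requirement.

```python
def oracle(x):
    """Hypothetical annotator: the true decision rule, unknown to the model."""
    return x > 0.37

def bracket(labeled):
    """Largest labeled negative and smallest labeled positive example."""
    neg_max = max(x for x, y in labeled if not y)
    pos_min = min(x for x, y in labeled if y)
    return neg_max, pos_min

def active_learning(pool, labeled, budget, tolerance):
    """Pool-based loop: train, check the stopping rule, query, label, repeat."""
    pool, labeled = list(pool), list(labeled)
    for _ in range(budget):
        neg_max, pos_min = bracket(labeled)
        threshold = (neg_max + pos_min) / 2.0          # "train" the toy model
        if pos_min - neg_max <= tolerance or not pool:  # requirement met or pool empty
            break
        # Adaptive sampling: query the pool point closest to the decision threshold.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))          # annotator labels the query
    neg_max, pos_min = bracket(labeled)
    return (neg_max + pos_min) / 2.0, labeled

pool = [i / 500 for i in range(1, 500)]                 # unlabeled candidates
initial = [(0.0, False), (1.0, True)]
threshold, labeled = active_learning(pool, initial, budget=20, tolerance=0.01)
```

In this toy setting the loop stops after a handful of queries, far fewer than labeling the whole pool, because each query is placed where the current model is least certain.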

Active learning can be categorized into query synthesis, stream-based selective sampling, and pool-based sampling [189], [190]. Query synthesis samples an arbitrary amount of new data from a feature space. Stream-based selective sampling usually involves sampling one new example at a time from a feature space. Pool-based sampling starts from a set of labeled data and iteratively labels new data from a large unlabeled data pool. Adaptive sampling can be perceived as the acquisition function of active learning that determines the next batch of data to acquire or label. Various adaptive sampling and active learning methods have been implemented in ML-based design, manufacturing, and condition monitoring since 2019 (Table 7 [12], [63], [91], [101], [183], [184], [185], [186], [187], [191], [192], [193], [194], [195], [196], [197], [198], [199], [200], [201], [202], [203], [204], [205], [206], [207], [208], [209], [210], [211], [212], [213], [214], [215], [216], [217], [218], [219]).

6.2.1. Data representativeness

There are various metrics for measuring data representativeness, including data diversity and coverage. Diversity metrics are usually established based on pairwise distance (Fig. 11(a)). At each iteration, an ML model is trained and evaluated to check whether the required performance has been achieved. If not achieved, adaptive sampling helps acquire new samples to fill the most underrepresented region in the feature space. For instance, Chan et al. [91] proposed METASET to select an unbiased subset from a large metamaterial shape database, which can be utilized as an adaptive sampling method for active learning. DPP was utilized to model the diversity in both shape and property spaces, which are jointly considered to evaluate the subsets. The selection of the most diverse subset in the shape space given the subset size (M) and entire shape data set (DS) of size N using DPP is illustrated as follows. A similarity matrix (L) is defined based on a pairwise distance function Δ(DS,i, DS,j), where DS,i and DS,j are two shapes in DS. Chan et al. [91] transformed the distance function using a radial basis function kernel because DPP requires the input similarity matrix to be positive semidefinite:

$$L_{i,j} = \exp\left(-0.5\,\Delta(D_{S,i}, D_{S,j})\right) \quad \text{for } i, j \in \mathbb{N}^{+}$$

Using L-ensembles [190], the probability of selecting any possible subset M is:

$$\mathrm{Prob}(M) = \frac{\det(L_M)}{\det(L + I)}$$

where det(∙) denotes the determinant of a matrix, I is an identity matrix of size N × N, and $L_M \equiv [L_{i,j}]_{i,j \in \mathbb{N}^{+}}$ is the submatrix of L restricted to the selected indices. The probability of a set containing two distinct samples is inversely related to the similarity between them. Thus, the subset with the least internal similarity (i.e., the most diverse set) can be obtained by finding the maximum Prob(M) using optimization techniques. This approach is equivalent to finding the maximum det(LM) because det(L + I) of a fixed-sized DS is constant. According to the case studies in Ref. [91], the selected subset is highly diverse with a small sample size, which offers better predictive performance and shorter training time. Lee et al. [63] proposed t-METASET, which iteratively generates diverse unit cell shapes and acquires diverse properties from the existing samples. Its task-aware functionality guides property sampling toward the designated region. Jang et al. [12] trained a reinforcement learning (RL) agent that iteratively generates diverse designs by rewarding the diversity of topology using pixel differences based on Euclidean distance and structural dissimilarity based on pixel distribution. Compared with a conventional greedy search, this method can generate 5% more design shapes on average in the tire design case study.
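The determinant-based subset selection can be sketched as follows. This is not the METASET implementation: Δ is assumed to be the Euclidean distance, the points are made up, and the search is a simple greedy heuristic that adds one index at a time, which does not guarantee the subset with the globally maximal determinant.

```python
import numpy as np

def rbf_similarity(points):
    """L[i, j] = exp(-0.5 * Delta(i, j)), with Delta assumed to be Euclidean distance."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return np.exp(-0.5 * dists)

def greedy_diverse_subset(L, m):
    """Greedily add the index that maximizes det(L_M); a heuristic, not an exact maximizer."""
    selected = []
    for _ in range(m):
        best, best_det = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
    return selected

def subset_probability(L, subset):
    """Prob(M) = det(L_M) / det(L + I) under an L-ensemble."""
    numerator = np.linalg.det(L[np.ix_(subset, subset)])
    return numerator / np.linalg.det(L + np.eye(L.shape[0]))

# Three tight clusters: indices 0-2, indices 3-4, and the isolated index 5.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.1], [10.0, 0.0]])
L = rbf_similarity(points)
selected = greedy_diverse_subset(L, 3)
```

With these points the greedy search picks one sample from each cluster, and `subset_probability` assigns that subset a much higher probability than a subset drawn from a single cluster such as `[0, 1, 2]`.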

Data coverage measures how well the feature space is covered by the existing data (Fig. 11(b)). Samavatian et al. [191] proposed to predict the solder joint lifespan from the dwelling temperature and time using an iterative correlation-driven network. The input feature space was segregated into uniform grids. The most insufficiently populated grids are the underrepresented regions for sampling new examples. Wang et al. [101] designed a shape perturbation algorithm that gradually samples new properties toward the unexplored regions in the property space. The algorithm builds on the assumption that a small perturbation in the shape of a design will yield a small change in its properties. The designs with the least number of surrounding designs in the property space are slightly altered to obtain designs with similar properties. The above data coverage measurements based on grid and relative data density have substantial limitations: ➀ In nature, the coverage of a subject is usually defined by a circle instead of a rectangular space; ➁ the overlapping effects of multiple samples are not considered; and ➂ a quantitative metric of data coverage cannot be established. Xie et al. [192] introduced a rigorous data coverage notion defined by Asudeh et al. [96] to data-driven auxetic design. This notion defines the coverage of a data set in a continuous-valued feature space (Fig. 11(b)). Given data set E, query point q, distance function Δ, vicinity value ρ, and coverage order k, the coverage of q by E is defined:

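$$\mathrm{Cov}_{\rho,k}(q, E) = \begin{cases} \text{true} & \text{if } \left|\{t \in E \mid \Delta(t, q) \le \rho\}\right| \ge k \\ \text{false} & \text{otherwise} \end{cases}$$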

This notion essentially checks whether the query point is in the vicinity defined by ρ and Δ of at least k data points (t’s) from the data set E. With user-defined Δ, ρ, and k, a region covered by the data set can be computed by:

$$\mathrm{Coverage}(E) = \{q \mid \mathrm{Cov}_{\rho,k}(q, E) = \text{true}\}$$

This notion can incorporate any appropriate distance functions, account for the overlaps among data points, and quantify the data coverage.
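The coverage check translates almost directly into code. In the sketch below the distance function defaults to the Euclidean distance, the feature space is assumed to be the unit square, and the covered region is approximated by testing a regular grid of query points; these choices and the helper names are illustrative assumptions rather than part of the cited definition.

```python
import math

def covered(q, data, rho, k, dist=math.dist):
    """Cov_{rho,k}(q, E): true if at least k points of E lie within distance rho of q."""
    return sum(1 for t in data if dist(t, q) <= rho) >= k

def coverage_fraction(data, rho, k, grid_steps=21):
    """Approximate the covered share of the unit square using a regular grid of queries."""
    queries = [(i / (grid_steps - 1), j / (grid_steps - 1))
               for i in range(grid_steps) for j in range(grid_steps)]
    return sum(covered(q, data, rho, k) for q in queries) / len(queries)

E = [(0.2, 0.2), (0.25, 0.2), (0.8, 0.8)]
near_pair = covered((0.22, 0.2), E, rho=0.1, k=2)    # two neighbours within rho
near_single = covered((0.8, 0.8), E, rho=0.1, k=2)   # only one neighbour within rho
share_k1 = coverage_fraction(E, rho=0.1, k=1)
share_k2 = coverage_fraction(E, rho=0.1, k=2)
```

Raising the coverage order k shrinks the covered region, since isolated samples such as (0.8, 0.8) no longer cover their own neighbourhood.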

Special data representativeness metrics that do not involve diversity or coverage have been devised. Information entropy was utilized by Zhang et al. [193] to select the most representative material data for active learning. Lin et al. [194] utilized K-medoids clustering based on K-means clustering to select a set of examples that represents the entire data set. Implemented in lithography modeling with active learning, K-medoids clustering helped reduce the amount of labeled data required to achieve satisfactory performance by 3 to 10 times. Shao et al. [195] proposed a graph-sampling-based active learning method for lithography novelty detection. A KNN graph was established based on the latent variables extracted from an AE model. Thereafter, a random walk algorithm was designed to randomly explore the graph. The sampling priority of a node was determined according to its total number of visits.

6.2.2. Model performance

Model performance is evaluated using predictive error or uncertainty. When using predictive error to guide data acquisition, the regions that exhibit the greatest predictive errors are the regions of interest (ROIs) (Fig. 11(c)). Kapusuzoglu et al. [183] proposed an adaptive surrogate modeling method for high-dimensional spatiotemporal problems in structural optimization. An adaptive sampling technique with exploration and exploitation capabilities was designed according to the mean absolute prediction errors. Zhang et al. [208] proposed an adaptive sampling technique using K-means clustering, KNN, and maximum curvature for surrogate models. The test set is divided into subgroups using K-means clustering and KNN. The subgroup that possesses the highest total prediction error is the ROI. Thereafter, maximum curvature is used to select a set of points from the ROI to generate new samples. Sun et al. [202] introduced bootstrap-guided adaptive optimization, which involves adaptive sampling based on subgroup predictive error, to determine the optimal hardware configuration design. Batch transductive experimental design was implemented at the initial data acquisition stage to select a subset of configurations with optimized diversity by maximizing the intra-set distance. Adaptive sampling based on predictive error has also been adopted to improve surrogate modeling in Wang et al. [200] for electromagnetic design, in Li et al. [215] for AM quality prediction, and in Kolesnikov et al. [212] for protective coating design.

Uncertainty metrics, such as the entropy of prediction probabilities, are also widely deployed in active learning. New examples are sampled around the regions where the uncertainties are high according to an uncertainty estimation function (Fig. 11(d)). Xiao et al. [187] developed a predictive uncertainty metric to adaptively sample new data and iteratively train a classification model that predicts whether a hotspot would occur in a circuit design. This uncertainty metric evaluates the proximities of test points to the hyperplane of an SVM. New samples are selected based on the examples close to the hyperplane because the model is uncertain about those examples. As an expansion of Ref. [187], Yang et al. [184] combined the above uncertainty metric with a diversity metric referred to as layout pattern sampling based on clustering. Farrokh and Fallah [217] utilized the same uncertainty characterization method based on an SVM as Xiao et al. [187] for active learning-aided flutter instability boundary prediction. Zhu et al. [213] proposed a machine fault prognosis model based on a Bayesian neural network (BNN), whose predictive uncertainty was captured by the variance of the predictive distribution to guide active learning. The expected information was utilized to measure the uncertainty for active learning in Hughes et al. [211] for structural health monitoring and in Wan et al. [214] for machining operation optimization. Gaussian process (GP) regression models are trained as surrogate models in various design optimization works and can guide adaptive sampling because of their inherent uncertainty measurement functionality [58]. GP adaptive sampling was also implemented in Xu et al. [209] for Hall effect sensor design, in Liu et al. [210] for functionally graded cellular structure design, in Sarkar et al. [197] for compressor rotor design, in Luo et al. [218] for rotor fan aerodynamic design, and in Yue et al. [201] for fuselage control. In Refs. [185], [186], uncertainty metrics were combined with randomized sampling to jointly consider data representativeness and model uncertainty. Cui and Ghosn [198], Shim et al. [199], and Verduzco et al. [204] compared multiple uncertainty metrics, such as least confidence, least margin, and maximum uncertainty, via structural reliability analysis, wafer pattern classification, and battery material design case studies, respectively. Botcha et al. [203] and Cheng and Jin [205] utilized a query-by-committee method that trains multiple ML models and calculates the variance of predictions to characterize the predictive uncertainty. The above works commonly focus on measuring predictive uncertainty in forward modeling (e.g., from design shapes to properties). Xie et al. [220] developed a query-by-committee method to measure the predictive uncertainty of mixture density networks (MDNs), an inverse modeling algorithm. This method trains multiple MDN models and compares their predictions to measure the predictive uncertainty of an input. Combined with data coverage, this active learning method can efficiently improve the data coverage and explore the feature space with fewer new samples. Consequently, the performance of the MDN model rapidly increases with the number of samples.
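Several of the uncertainty measures mentioned above can be written in a few lines each. The definitions below are the common textbook forms and are given as assumptions, since the surveyed works may define their metrics differently; the candidate names and probabilities are made up for illustration.

```python
import math
import statistics

def least_confidence(probs):
    """1 - max class probability: higher means the model is less sure."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes: smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def committee_variance(predictions):
    """Query-by-committee for regression: variance of the members' predictions."""
    return statistics.pvariance(predictions)

# Predicted class probabilities for three unlabeled candidates (hypothetical values).
candidates = {
    "a": [0.98, 0.01, 0.01],
    "b": [0.40, 0.35, 0.25],
    "c": [0.70, 0.20, 0.10],
}
next_query = max(candidates, key=lambda name: entropy(candidates[name]))
```

Candidate "b" is selected because its predicted distribution is the flattest; least confidence and margin would lead to the same choice here, although the three metrics can disagree in general.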

Active optimization is a special case that combines active learning and design optimization. Active optimization iteratively trains surrogate models until the model prediction at the optimum is close enough to the simulation result [206], [207]. At each iteration, surrogate models quickly locate the regions that are likely to include the global optimum as the ROIs. Adaptive sampling then samples more data within the ROIs to improve the prediction performance around them. Instead of reducing the data imbalance, active optimization actively induces bias in the data set. This creates the risk that the surrogate model might provide a poor estimation of the optimal design at the beginning of the optimization, thus misleading the following adaptive sampling. Note that active optimization in Refs. [206], [207] is different from other design optimization methods (e.g., Bayesian optimization) that balance exploration and exploitation. Combined with active learning, this method focuses more on exploitation to accelerate the optimization.

6.2.3. Discussion on active learning techniques

The baseline method to sample additional data is randomized sampling, which is not efficient because some random samples could already be covered by the existing data set. When using data representativeness to conduct adaptive sampling, the assumption is that the more well-covered the feature space is, the more knowledge the data set contains, and the better the ML model performs. Thus, it is an indirect measure for improving ML performance. Diversity metrics are widely used because various distance metrics, such as Euclidean and Manhattan distances, can be incorporated to deal with different scenarios (e.g., data types). Compared with the data coverage notion defined by Asudeh et al. [96], diversity metrics are usually faster since they do not compute the area/volume of the covered space.

Data coverage is a recently developed metric that is compatible only with tabular data. Other data types can be analyzed using data coverage once they are mapped into a latent space. Data coverage can be adapted to various scenarios using user-specified Δ, ρ, and k. In addition, data coverage can be quantified and visualized to help conduct adaptive sampling. The computational cost of calculating the covered space is high because the coverages of different samples might overlap. Thus, Voronoi diagram-based methods and coverage approximation methods have been developed to reduce the computational cost [96].

Although adaptive sampling based on predictive error can be utilized for all types of ML models, it poses a critical prerequisite for the test set. The test set must be representative enough to ensure that the sampling process can access all regions in the feature space. The regions not covered by the test set will not be sampled. This method faces significant limitations in ML-based design and manufacturing where data scarcity is a common challenge. Uncertainty-based adaptive sampling is a powerful and frequently employed method that directly identifies and samples the examples from the regions about which the model is most uncertain. It is compatible with most of the classification models. However, regression models must be capable of modeling the uncertainty to implement uncertainty-based adaptive sampling (e.g., GP and Bayesian neural networks). Query-by-committee can lift the requirements of the regression model but significantly increases the computational cost as it trains multiple models.
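A minimal sketch of query-by-committee for a regression model without built-in uncertainty is given below. The committee here consists of simple linear fits, each trained with one sample left out; this construction and the names `fit_line`, `build_committee`, and `query_by_committee` are assumptions for illustration, and other committees (e.g., bootstrap resamples or different model types) are equally possible.

```python
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def build_committee(xs, ys):
    """One member per training sample, each fitted without that sample."""
    return [
        fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


def query_by_committee(xs, ys, pool):
    """Return the pool point where the committee disagrees the most."""
    committee = build_committee(xs, ys)

    def disagreement(x):
        # Variance of member predictions serves as the uncertainty proxy.
        return pvariance([member(x) for member in committee])

    return max(pool, key=disagreement)
```

The sketch needs lists as inputs and at least three distinct training inputs so that every leave-one-out fit is defined. It also trains one model per training sample, which makes the extra computational cost mentioned above visible.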

Compared with randomized sampling, all surveyed techniques can help avoid sampling new examples that are already covered or that do not contribute to model performance improvement. Adaptive sampling based on data representativeness is widely applicable because it does not require certain types of ML models. The decision of the next batch to select solely depends on the data set itself, while the ML model only informs when to terminate active learning. Regarding model performance improvement, data representativeness might be less effective than model performance because the latter directly indicates the examples that the model demands. However, adaptive sampling based on model performance only focuses on the mitigation of data scarcity, while adaptive sampling based on data representativeness addresses both data imbalance and data scarcity. The applicability of all adaptive sampling methods must be further improved to account for different data types and mixed-variable scenarios.

7. Discussion

In this section, recent publications regarding DQA, data augmentation, and active learning in this domain are discussed in terms of the status, trends, and challenges to answer RQ6.

7.1. Research status and trend

The papers summarized in Tables 4, 6, and 7 are arranged according to their years of publication in Fig. 12. DQA, data augmentation, and active learning started to attract the attention of researchers in 2019. Overall, the number of publications rapidly increased from 2019 to May 2023, which indicates that there has been a growing awareness of measuring and improving data quality in this domain. The number of publications that reported DQA pipelines increased from one to five per year, which is seemingly the slowest growth. However, DQA has become a standard procedure in ML-based modeling such that most related studies have embraced DQA but not reported it. The DQA publications included in this survey discuss the development of DQA pipelines for a specific research field in design and manufacturing, which only accounts for a small portion of all research works that performed DQA. The number of publications that report data augmentation had the fastest growth, increasing from 2 to 19 per year. Data augmentation is widely applicable in ML-based design and manufacturing where data imbalance and scarcity are challenging. Data augmentation has enabled the deployment of ML and improved ML performance in many data-poor tasks (Table 6). Similarly, there are other studies that implemented data augmentation without reporting it. Among the three categories, adaptive sampling and active learning had the largest number of publications in 2019. The rate of increase was slow, and there were 10 publications in 2022. Active learning is not as widely applicable as data augmentation because additional data acquisition is a prerequisite. Thus, challenges such as data imbalances in fault and defect detection cannot be addressed using active learning. However, 10 publications leveraged active learning from January to May 2023, indicating rapid growth in 2023.

The compositions of the papers surveyed with respect to design, manufacturing, and condition monitoring are shown in Fig. 13. For DQA, most publications have focused on manufacturing and production data quality, where various systems and stakeholders are involved. DQA pipelines were proposed for multiple manufacturing phases, such as in-process monitoring [78] and production planning [79]. Only two and three papers investigated DQA in design and condition monitoring, respectively. Data augmentation techniques were frequently reported in all three categories due to their wide applicability. These techniques were observed more frequently in ML-based manufacturing and condition monitoring publications, as additional data acquisition is usually not an option for defect and fault examples, leaving data augmentation as the best choice. In contrast, ML-based design accounts for more than 60% of the publications in active learning, as simulation is usually the primary data source and can generate additional design examples. Only six and eight papers were related to active learning-aided manufacturing and condition monitoring, respectively. Most of these papers implemented pool-based sampling to selectively label existing input data instead of generating new data.

The surveyed publications are clustered according to the raw data types in Fig. 14, including tabular, image, time-series, and 3D data. Image data augmentation publications accounted for more than 60% of the data augmentation papers surveyed. It has become common practice to reduce biases during image data collection [221]. Data augmentation with tabular data was implemented in six papers that utilized GANs and SMOTE. Data augmentation methods applied to time-series data usually involve either signal processing techniques or a prior transformation to tabular data. Data augmentation for 3D data was only reported in three papers, where substantial domain knowledge was involved. For active learning, tabular and image data accounted for most of the surveyed papers, as most ML-based design tasks employ tabular or image data to represent designs. In addition, pool-based active learning in ML-based manufacturing usually involves process monitoring image data. Time-series data are incompatible with pool-based sampling and are mostly obtained from condition monitoring papers where synthesizing data is challenging. Three papers transformed time-series data to tabular data to conduct active learning. Only one paper implemented active learning with 3D data, as most ML-based design research still focuses on image data.

The compositions of the surveyed data augmentation and active learning techniques are shown in Fig. 15. Twenty-nine of the 51 papers that implemented data augmentation techniques were inspired by domain knowledge such as image analysis, signal processing, and engineering domain expertise. Nearly half of the data augmentation papers implemented ML-based data augmentation to generate synthetic data, including GANs and AEs. Statistical data augmentation, which was implemented in 12% of the data augmentation publications, attracted the least attention in ML-based design and manufacturing. Seven publications implemented more than one data augmentation technique. However, the different techniques were deployed sequentially to complement each other instead of being integrated as one technique. For active learning, model performance, including predictive error and uncertainty, was the most frequently utilized sampling criterion and accounted for more than 70% of the publications. The main reason is that model performance techniques are more straightforward than data representativeness because no diversity or coverage metric needs to be established. Diversity is the dominant technique used to evaluate data representativeness, accounting for more than 80% of the data representativeness metrics. This is because diversity metrics are more flexible and easily integrable into ML pipelines [91]. Only three papers implemented coverage-based metrics. Three papers integrated uncertainty and diversity metrics to guide active learning. Adaptive mechanisms were established to tune the impacts of the two metrics and to trade off between exploration and exploitation.

7.2. Challenges and future directions

The above review showed that there have been significant research efforts to address data challenges in ML-based design and manufacturing. However, challenges remain in DQA, data augmentation, and active learning for design and manufacturing data sets, which creates opportunities. The challenges and opportunities are discussed in the following passages.

7.2.1. Data augmentation for time-series and 3D data

Although most data augmentation methods are compatible with tabular and image data, techniques for augmenting time-series and 3D data are lacking. The existing methods for time-series and 3D data rely heavily on domain knowledge, where transformations are manually chosen and applied. Time-series data must be preprocessed into tabular data before ML-based data augmentation can be implemented. Nonetheless, most data augmentation techniques do not account for the temporal relationship when synthesizing time-series data. GANs and AEs can be used to synthesize 3D data but with high computational costs. A method is needed to augment 3D data with low computational cost while accounting for the 3D spatial relationship.

7.2.2. Advanced data augmentation methods

Although GANs and AEs have shown exceptional data synthesis capabilities, advanced ML models continue to emerge and have been utilized to conduct data augmentation in other domains. For example, a hybrid model named VAE-GAN combines the advantages of VAEs and GANs to generate realistic and diverse syntheses [222]. In ML-based design, diffusion models have been adopted to generate desirable designs. However, there has been no work that conducts data augmentation using diffusion models in this domain. More advanced data augmentation models should be adapted to ML-based design and manufacturing to improve synthetic data quality.

7.2.3. Validation of synthetic data

All the surveyed papers that implemented data augmentation demonstrated the effectiveness of the methods with improved model performance in case studies. However, synthetic data quality evaluations are often lacking in surveyed works. Synthetic data evaluation involves both realism and diversity. It is possible that the model test performance increases even though the synthetic data are not close enough to the real data (e.g., a small test set). In addition, a synthesis that lacks diversity induces bias in the data set and only represents part of the real data population. The two criteria can enhance the trustworthiness of the data synthesis and the proposed data augmentation techniques. Additionally, the tasks that require data augmentation usually suffer from data scarcity. Validation methods must be developed for data-poor tasks.
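The two criteria can be illustrated with a pair of simple distance-based checks. These are hypothetical proxies chosen for this sketch, not metrics taken from the surveyed papers: `realism_score` measures how close synthetic samples lie to real ones, and `diversity_score` measures how much of the real population has a synthetic counterpart within a user-chosen `radius`.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from `point` to its nearest neighbor in `others`."""
    return min(math.dist(point, o) for o in others)


def realism_score(synthetic, real):
    """Mean distance from each synthetic sample to its nearest real sample.

    Lower values mean the synthesis stays closer to the real data.
    """
    return sum(nearest_distance(s, real) for s in synthetic) / len(synthetic)


def diversity_score(synthetic, real, radius):
    """Fraction of real samples with a synthetic sample within `radius`.

    A low value signals that the synthesis represents only part of the
    real data population.
    """
    hits = sum(1 for r in real if nearest_distance(r, synthetic) <= radius)
    return hits / len(real)
```

Both scores depend on feature scaling and on having enough real samples to compare against, so they do not remove the difficulty of validating synthetic data in data-poor tasks.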

7.2.4. Active learning for generative models

Active learning implemented in the surveyed papers has achieved up to 80% data quantity reduction while achieving similar performance in this domain. However, all of these papers were aimed at building forward prediction models with one-to-one relationships using active learning. Active learning has not been developed or utilized to build generative models that characterize inverse and one-to-many relationships. For example, multiple shapes can be generated given a target property in design, and multiple control parameter sets can be selected to achieve the same result during manufacturing [223]. Such an inverse relationship is much more complicated than forward one-to-one relationships. Current active learning methods, including adaptive sampling based on data representativeness and model performance, cannot handle generative models. However, generative tasks such as generative design are attracting an increasing amount of attention [105], [149]. Thus, active learning methods for generative models must be developed.

7.2.5. Adaptive data acquisition from scratch

As data-driven design and manufacturing are becoming increasingly popular, the amount of data to be acquired and labeled has rapidly increased. These exploding data acquisition efforts are associated with the consumption of resources and, if not efficient, will lead to significant costs. Thus, advanced pipelines that monitor and guide the data acquisition from the beginning of a project are needed. Such a pipeline must conduct DQA to ensure that the acquired data are of high quality and thus can be utilized for modeling tasks. The DQA should be designed according to task-specific data characteristics, such as data types, formats, structure, and usage. Advanced sampling and active learning must also be implemented to reduce the data acquisition effort. In this way, models with satisfactory performance can be achieved with minimum consumption of resources.

8. Conclusions

The level of interest in utilizing ML in design and manufacturing has been rapidly increasing. Data availability is the major challenge and limiting factor in most industrial applications. This survey reviews the methods for evaluating and improving data quality in ML-based design and manufacturing. Six research subquestions were proposed to guide this review. This survey first establishes the background of industrial data quality by reviewing the terminologies related to data in ML-based modeling. The root causes and types of data challenges are identified, including human factors, complex systems, complicated relationships, lack of data quality, data heterogeneity, data imbalance, and data scarcity.

The focus of this survey is data quality and data imbalance. Data quality concepts and the root causes of data imbalance are investigated. The definitions, metrics, and frameworks of data quality, data readiness, and InfoQ are discussed. Thereafter, data imbalance is analyzed according to biases in design and manufacturing. Metrics for measuring and evaluating representation bias, such as fairness and diversity, are introduced. Then, a comprehensive review of the methods used to improve data quality and mitigate data imbalance is conducted, focusing on data augmentation and active learning. The applications of data augmentation and active learning in design and manufacturing are discussed. Different methods are compared regarding their advantages, limitations, and applicability. In the discussion section, the status and trend of the surveyed techniques are analyzed in terms of publications per year, applications, data types, and methods.

This review provides a comprehensive introduction to data quality in ML-based design and manufacturing with respect to its terminologies, challenges, concepts, and applications. It also demonstrates recent advancements in data quality improvement and bias mitigation techniques. The first limitation of this study is that it focuses on data quality with respect to ML-based modeling. Data quality concerns related to other topics, such as data governance and heterogeneity, are also significant barriers to industrial digitalization. However, data quality in industrial digitalization is such a massive domain that this survey can only cover one aspect. The second limitation is that this survey only investigates methods that improve the data set. There are many methods that address these challenges during the learning process in ML. For instance, re-weighting can be used to assign more weight to underrepresented classes while calculating the loss, which encourages the model to pay more attention to minority classes. Transfer learning involves knowledge transfer from similar data sets to mitigate data scarcity. This survey focuses on data quality and thus does not cover these learning methods.

Acknowledgments

This work was funded by the McGill University Graduate Excellence Fellowship Award (00157), the Mitacs Accelerate Program (IT13369), and the McGill Engineering Doctoral Award (MEDA).

Compliance with ethical guidelines

Jiarui Xie, Lijun Sun, and Yaoyao Fiona Zhao declare that they have no conflict of interest or financial conflicts to disclose.

References

[1]

Kumar P, Bhamu J, Sangwan KS.Analysis of barriers to Industry 4.0 adoption in manufacturing organizations: an ISM approach.Procedia CIRP 2021; 98:85-90.

[2]

Silva N, Barros J, Santos MY, Costa C, Cortez P, Carvalho MS, et al.Advancing logistics 4.0 with the implementation of a big data warehouse: a demonstration case for the automotive industry.Electronics 2021; 10(18):2221.

[3]

Carvalho TP, Soares FA, Vita R, Francisco RP, Basto JP, Alcalá SG.A systematic literature review of machine learning methods applied to predictive maintenance.Comput Ind Eng 2019; 137:106024.

[4]

Wilhelm Y, Reimann P, Gauchel W, Mitschang B.Overview on hybrid approaches to fault detection and diagnosis: combining data-driven, physics-based and knowledge-based models.Procedia CIRP 2021; 99:278-283.

[5]

Fentaye AD, Baheta AT, Gilani SI, Kyprianidis KG.A review on gas turbine gas-path diagnostics: state-of-the-art methods, challenges and opportunities.Aerospace 2019; 6(7):83.

[6]

Fan CM, Lu YP.A Bayesian framework to integrate knowledge-based and data-driven inference tools for reliable yield diagnoses. In: Proceedings of the 2008 Winter Simulation Conference; 2008 Dec 7–10; Miami, FL, USA. Piscataway: IEEE; 2008. p. 2323–9.

[7]

Xie J, Sage M, Zhao YF.Feature selection and feature learning in machine learning applications for gas turbines: a review.Eng Appl Artif Intell 2023; 117:105591.

[8]

Goodfellow I, Bengio Y, Courville A. Deep learning. Nature, 521 (2015), pp. 436-444

[9]

Liu D, Wang Y.Multi-fidelity physics-constrained neural network and its application in materials modeling.J Mech Des 2019; 141(12):121403.

[10]

Kotsiopoulos T, Sarigiannidis P, Ioannidis D, Tzovaras D.Machine learning and deep learning in smart manufacturing: the smart grid paradigm.Comput Sci Rev 2021; 40:100341.

[11]

Wu J, Qian X, Wang MY.Advances in generative design.Comput Aided Des 2019; 116:102733.

[12]

Jang S, Yoo S, Kang N.Generative design by reinforcement learning: enhancing the diversity of topology optimization designs.Comput Aided Des 2022; 146:103225.

[13]

Zhang C, Xie J, Shanian A, Kibsey M, Zhao YF.A hybrid deep learning approach for the design of 2D low porosity auxetic metamaterials.Eng Appl Artif Intell 2023; 123:106413.

[14]

Xu H, Liu R, Choudhary A, Chen W.A machine learning-based design representation method for designing heterogeneous microstructures.J Mech Des 2015; 137(5):051403.

[15]

Ling C, Kuo W, Xie M.An overview of adaptive-surrogate-model-assisted methods for reliability-based design optimization.IEEE Trans Reliab 2023; 72(3):1243-1264.

[16]

Zhang C, Ridard A, Kibsey M, Zhao YF.Variant design generation and machine learning aided deformation prediction for auxetic metamaterials.Mech Mater 2023; 181:104642.

[17]

Edwards K.Design for manufacturing: a structured approach.Mater Des 2003; 24:157-158.

[18]

Xie J, Saluja A, Rahimizadeh A, Fayazbakhsh K.Development of automated feature extraction and convolutional neural network optimization for real-time warping monitoring in 3D printing.Int J Comput Integr Manuf 2022; 5(8):813-830.

[19]

Zhang Y, Safdar M, Xie J, Li J, Sage M, Zhao YF.A systematic review on data of additive manufacturing for machine learning applications: the data quality, type, preprocessing, and management.J Intell Manuf 2022; 34:3305-3340.

[20]

Yang M, Liu J.In situ monitoring of corrosion under insulation using electrochemical and mass loss measurements.Int J Corrosion 2022; 2022:6681008.

[21]

Yang S, Page T, Zhang Y, Zhao YF.Towards an automated decision support system for the identification of additive manufacturing part candidates.J Intell Manuf 2020; 31(8):1917-1933.

[22]

Saluja A, Xie J, Fayazbakhsh K.A closed-loop in-process warping detection system for fused filament fabrication using convolutional neural networks.J Manuf Process 2020; 58:407-415.

[23]

Yang M, Keshavarz MK, Vlasea M, Molavi-Kakhki A, Laher M.Supersolidus liquid phase sintering of water-atomized low-alloy steel in binder jetting additive manufacturing.Heliyon 2023; 9(3):e13882.

[24]

Chuo YS, Lee JW, Mun CH, Noh IW, Rezvani S, Kim DC, et al.Artificial intelligence enabled smart machining and machine tools.J Mech Sci Technol 2022; 36(1):1-23.

[25]

Xu J, Kovatsch M, Mattern D, Mazza F, Harasic M, Paschke A, et al.A review on AI for smart manufacturing: deep learning challenges and solutions.Appl Sci 2022; 12(16):8239.

[26]

Ito A, Hagström M, Bokrantz J, Skoogh A, Nawcki M, Gandhi K, et al.Improved root cause analysis supporting resilient production systems.J Manuf Syst 2022; 64:468-478.

[27]

Hagemann S, Sünnetcioglu A, Stark R.Hybrid artificial intelligence system for the design of highly-automated production systems.Procedia Manuf 2019; 28:160-166.

[28]

Apostolidis A, Pelt M, Stamoulis KP.Aviation data analytics in MRO operations: prospects and pitfalls. In: Proceedings of the 2020 Annual Reliability and Maintainability Symposium (RAMS); 2020 Jan 27–30; Palm Springs, CA, USA. Piscataway: IEEE; 2020. p. 1–7.

[29]

Williams G, Meisel NA, Simpson TW, McComb C.Design for artificial intelligence: proposing a conceptual framework grounded in data wrangling.J Comput Inf Sci Eng 2022; 22(6):060903.

[30]

Ehrlinger L, Wöß W.A survey of data quality measurement and monitoring tools.Front Big Data 2022; 5:850611.

[31]

Chandran DR, Gupta V.A short review of the literature on automatic data quality.J Comput Commun 2022; 10(5):55-73.

[32]

Kamm S, Veekati SS, Müller T, Jazdi N, Weyrich M.A survey on machine learning based analysis of heterogeneous data in industrial automation.Comput Ind 2023; 149:103930.

[33]

Lee D, Chen W, Wang L, Chan Y, Chen W.Data-driven design for metamaterials and multiscale systems: a review.Adv Mater 2023; 36(8):2305254.

[34]

Kirianaki NV, Yurish SY, Shpak NO, Deynega VP.Data acquisition and signal processing for smart sensors. Hoboken: Wiley (2002)

[35]

Schmetz A, Lee TH, Zontar D, Brecher C.The time synchronization problem in data-intense manufacturing.Procedia CIRP 2022; 107:827-832.

[36]

Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton J, Axton M, Baak A, et al.The FAIR guiding principles for scientific data management and stewardship.Sci Data 2016; 3(1):160018.

[37]

Simmhan Y, Plale B, Gannon D.A survey of data provenance techniques [dissertation]. Indiana University, Bloomington (2005)

[38]

Askham N, Cook D, Doyle M, Fereday H, Gibson M, Landbeck U, et al.The six primary dimensions for data quality assessment. Report. Olympia: Washington State Board for Community and Technical Colleges. 2013.

[39]

Lawrence ND.Data readiness levels.2017. arXiv: 1705.02245.

[40]

Kenett RS, Shmueli G.Information quality: the potential of data and analytics to generate knowledge.Wiley, Hoboken (2017)

[41]

Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Daumé III H, et al.Datasheets for datasets.Commun ACM 2021; 64(12):86-92.

[42]

Bender EM, Friedman B.Data statements for natural language processing: toward mitigating system bias and enabling better science.Trans Assoc Comput Linguist 2018; 6:587-604.

[43]

Arnold M, Bellamy RKE, Hind M, Houde S, Mehta S, et al.FactSheets: increasing trust in AI services through supplier's declarations of conformity. IBM J Res Dev 2019;63:6:1–13.

[44]

Holland S, Hosny A, Newman S, Joseph J, Chmielinski K.The dataset nutrition label: a framework to drive higher data quality standards.2018. arXiv: 1805.03677.

[45]

Alhassan I, Sammon D, Daly M.Data governance activities: an analysis of the literature.J Decis Systems 2016; 25:64-75.

[46]

Lismont J, Vanthienen J, Baesens B, Lemahieu W.Defining analytics maturity indicators: a survey approach.Int J Inf Manage 2017; 37(3):114-124.

[47]

Gökalp MO, Gökalp E, Kayabay K, Koçyiğit A, Eren PE.Data-driven manufacturing: an assessment model for data science maturity.J Manuf Syst 2021; 60:527-546.

[48]

Rosenbaum S.Data governance and stewardship: designing data stewardship entities and advancing data access.Health Serv Res 2010; 45:1442-1455.

[49]

Endel F, Piringer H.Data wrangling: making data useful again.IFAC-PapersOnLine 2015; 48(1):111-112.

[50]

Meng T, Jing X, Yan Z, Pedrycz W.A survey on machine learning for data fusion.Inform Fusion 2020; 57:115-129.

[51]

Ali H, Salleh M, Saedudin R, Hussain K, Mushtaq M.Imbalance class problems in data mining: a review.Indonesian J Electr Eng Comput Sci 2019; 14(3):1552-1563.

[52]

Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A.A survey on bias and fairness in machine learning.ACM Comput Surv 2021; 54(6):1-35.

[53]

Safdar M, Lamouche G, Paul PP, Wood G, Zhao YF.Feature engineering in additive manufacturing. In: Safdar M, Lamouche G, Paul PP, Wood G, Zhao Y, editors. Engineering of additive manufacturing features for data-driven solutions: sources, techniques, pipelines, and applications. Cham: Springer; 2023. p. 17–43.

[54]

Kim J, Yang Z, Ko H, Cho H, Lu Y.Deep learning-based data registration of melt-pool-monitoring images for laser powder bed fusion additive manufacturing.J Manuf Syst 2023; 68:117-129.

[55]

Shahbazi N, Lin Y, Asudeh A, Jagadish H.A survey on techniques for identifying and resolving representation bias in data.2022. arXiv: 2203.11852.

[56]

Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, et al.Model cards for model reporting. In: Proceedings of the FAT* '19: Conference on Fairness, Accountability, and Transparency; 2019 Jan 29–31; Atlanta, GA, USA. New York City: Association for Computing Machinery; 2019. p. 220–9.

[57]

Zaccaria V, Rahman M, Aslanidou I, Kyprianidis K.A review of information fusion methods for gas turbine diagnostics.Sustainability 2019; 11(22):6202.

[58]

Tan YT, Kunapareddy A, Kobilarov M.Gaussian process adaptive sampling using the cross-entropy method for environmental sensing and monitoring. In: Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA); 2018 May 21–25; Brisbane, QLD, Australia. Piscataway: IEEE; 2018. p. 6220–7.

[59]

Ngoc N, Lasa G, Iriarte I.Human-centred design in Industry 4.0: case study review and opportunities for future research.J Intell Manuf 2022; 33(1):35-76.

[60]

Robert M, Giuliani P, Gurau C.Implementing Industry 4.0 real-time performance management systems: the case of Schneider Electric.Prod Plan Control 2022; 33(2–3):244-260.

[61]

Leon-Urrutia M, Taibi D, Pospelova V, Splendore S, Urbsiene L, Marjanovic U.Data literacy: an essential skill for the industry. In: Lalic B, Gracanin D, Tasic N, Simeunović N, editors. Proceedings on 18th International Conference on Industrial Systems–IS’20. Cham: Springer; 2022. p. 326–31.

[62]

Verleysen M, François D.The curse of dimensionality in data mining and time series prediction. In: Cabestany J, Prieto A, Sandoval F, editors. Computational intelligence and bioinspired systems. Berlin: Springer; 2005. p. 758–70.

[63]

Lee D, Chan Y, Chen W, Wang L, Chen W.T-METASET: task-aware generation of metamaterial datasets by diversity-based active learning.2022. arXiv: 2202.10565.

[64]

Volponi AJ.Gas turbine engine health management: past, present, and future trends.J Eng Gas Turbines Power 2014; 136(5):051201.

[65]

Wang RY.A product perspective on total data quality management.Commun ACM 1998; 41(2):58-65.

[66]

Günther LC, Colangelo E, Wiendahl HH, Bauer C.Data quality assessment for improved decision-making: a methodology for small and medium-sized enterprises.Procedia Manuf 2019; 29:583-591.

[67]

Wiemer H, Dementyev A, Ihlenfeldt S.A holistic quality assurance approach for machine learning applications in cyber-physical production systems.Appl Sci 2021; 11(20):9590.

[68]

Liewald M, Bergs T, Groche P, Behrens BA, Briesenick D, Müller M, et al.Perspectives on data-driven models and its potentials in metal forming and blanking technologies.Prod Eng 2022; 16(5):607-625.

[69]

Schelter S, Lange D, Schmidt P, Celikel M, Biessmann F, Grafberger A.Automating large-scale data quality verification.Proc VLDB Endow 2018; 11(12):1781-1794.

[70]

Byabazaire J, O'Hare GMP, Delaney DT.End-to-end data quality assessment using trust for data shared IoT deployments.IEEE Sens J 2022; 22(20):19995-20009.

[71]

Zacarias AGV, Reimann P, Mitschang B.A framework to guide the selection and configuration of machine-learning-based data analytics solutions in manufacturing.Procedia CIRP 2018; 72:153-158.

[72]

Frye M, Robert H.Structured data preparation pipeline for machine learning-applications in production. In: Proceedings of the 17th IMEKO TC 10 and EUROLAB Virtual Conference; 2020 Oct 20–22; Aachen, Germany. London: IMEKO; 2020. p. 241–6.

[73]

Malik S, Rouf R, Mazur K, Kontsos A.The Industry Internet of Things (IIoT) as a methodology for autonomous diagnostics in aerospace structural health monitoring.Aerospace 2020; 7(5):64.

[74]

Bekar ET, Nyqvist P, Skoogh A.An intelligent approach for data pre-processing and analysis in predictive maintenance with an industrial case study. Adv Mech Eng 2020;12(5):1–14.

[75]

Frye M, Gyulai D, Bergmann J, Schmitt RH.Production rescheduling through product quality prediction.Procedia Manuf 2021; 54:142-147.

[76]

Chen Q, Liu Y, Hou S, Duan F, Cai Z.Data-driven methodology for state detection of gearbox in PHM context. In: Proceedings of the 2021 Global Reliability and Prognostics and Health Management (PHM-Nanjing); 2021 Oct 15–17; Nanjing, China. Piscataway: IEEE; 2021. p. 1–6.

[77]

Xie Q, Suvarna M, Li J, Zhu X, Cai J, Wang X.Online prediction of mechanical properties of hot rolled steel plate using machine learning.Mater Des 2021; 197:109201.

[78]

Guo S, Wang D, Feng Z, Guo W.UIR–NET: object detection in infrared imaging of thermomechanical processes in automotive manufacturing.IEEE Trans Autom Sci Eng 2022; 19(4):3276-3287.

[79]

Iantovics LB, Enăchescu C.Method for data quality assessment of synthetic industrial data.Sensors 2022; 22(4):1608.

[80]

Segreto T, Teti R.Data quality evaluation for smart multi-sensor process monitoring using data fusion and machine learning algorithms.Prod Eng 2022; 19:197-210.

[81]

Klaproth T, Hornung M.Off-design mission performance prediction for unmanned aerial vehicles based on machine learning. In: Proceedings of the 2022 IEEE Aerospace Conference (AERO); 2022 Mar 5–12; Big Sky, MT, USA. Piscataway: IEEE; 2022. p. 1–13.

[82]

Sen S, Husom EJ, Goknil A, Politaki D, Tverdal S, Nguyen P, et al.Virtual sensors for erroneous data repair in manufacturing a machine learning pipeline.Comput Ind 2023; 149:103917.

[83]

Lee YW, Strong DM, Kahn BK, Wang RY.AIMQ: a methodology for information quality assessment.Inf Manag 2002; 40(2):133-146.

[84]

Kenett RS.Reviewing of applied research with an Industry 4.0 perspective. Report. Rochester: Social Science Research Network. 2020. SSRN scholarly paper ID 3591808.

[85]

Coleman SY, Kenett RS.The information quality framework for evaluating data science programs.Encycl Semant Comput Robot Intell 2018; 2(2):1730001.

[86]

Yang K, Stoyanovich J, Asudeh A, Howe B, Jagadish HV, Miklau G.A nutritional label for rankings. In: Proceedings of the 2018 International Conference on Management of Data; 2018 Jul 10–15; Houston, TX, USA. New York City: Association for Computing Machinery; 2018. p. 1773–6.

[87]

Stoyanovich J, Howe B.Nutritional labels for data and models.IEEE Tech Comm Data Eng 2019; 42(3):13-23.

[88]

Chmielinski KS, Newman S, Taylor M, Joseph J, Thomas K, Yurkofsky J, et al.The dataset nutrition label (2nd Gen): leveraging context to mitigate harms in artificial intelligence.2022. arXiv: 2201.03954.

[89]

Sun C, Asudeh A, Jagadish HV, Howe B, Stoyanovich J.Mithralabel: flexible dataset nutritional labels for responsible data science. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management; 2019 Nov 3–7; Beijing, China. New York City: Association for Computing Machinery; 2019. p. 2893–6.

[90]

Catania B, Guerrini G, Accinelli C.Fairness & friends in the data science era.AI Soc 2023; 38:721-731.

[91]

Chan YC, Ahmed F, Wang L, Chen W.METASET: exploring shape and property spaces for data-driven metamaterials design.J Mech Des 2021; 143(3):031707.

[92]

Simpson T, Lin D, Chen W. Sampling strategies for computer experiments: design and analysis. Int J Reliab Appl 2001; 2(3):209-240.

[93]

Celis L, Vishnoi N.Data preprocessing to mitigate bias: a maximum entropy based approach. In: Proceedings of the 37th International Conference on Machine Learning; 2020 Jul 13–18; online. Cambridge: JMLR; 2020. p. 1349–59.

[94]

Tae KH, Whang SE. Slice tuner: a selective data acquisition framework for accurate and fair machine learning models. In: Proceedings of the 2021 International Conference on Management of Data; 2021 Jun 20–25; Xi'an, China. New York City: Association for Computing Machinery; 2021. p. 1771–83.

[95]

Lin Y, Guan Y, Asudeh A, Jagadish HV.Identifying insufficient data coverage in databases with multiple relations.Proc VLDB Endow 2020; 13(12):2229-2242.

[96]

Asudeh A, Shahbazi N, Jin Z, Jagadish HV. Identifying insufficient data coverage for ordinal continuous-valued attributes. In: Proceedings of the 2021 International Conference on Management of Data; 2021 Jun 20–25; Xi'an, China. New York City: Association for Computing Machinery; 2021. p. 129–41.

[97]

Asudeh A, Jin Z, Jagadish HV.Assessing and remedying coverage for a given dataset. In: Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE); 2019 Apr 8–11; Macao, China. Piscataway: IEEE; 2019. p. 554–65.

[98]

Verma S, Rubin J.Fairness definitions explained. In: Proceedings of the International Workshop on Software Fairness; 2018 May 29; Gothenburg, Sweden. New York City: Association for Computing Machinery; 2018. p. 1–7.

[99]

Oneto L, Chiappa S.Fairness in machine learning. In: Oneto L, Navarin N, Sperduti A, Anguita D, editors. Recent trends in learning from data. Cham: Springer; 2020. p. 155–96.

[100]

Drosou M, Jagadish HV, Pitoura E, Stoyanovich J.Diversity in big data: a review.Big Data 2017; 5(2):73-84.

[101]

Wang L, Chan YC, Liu Z, Zhu P, Chen W. Data-driven metamaterial design with Laplace-Beltrami spectrum as “shape-DNA”. Struct Multidiscip Optim 2020; 61(6):2613-2628.

[102]

Brownlee J. Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. San Francisco: Machine Learning Mastery; 2020.

[103]

Slater K, Li Y, Wang Y, Shan Y, Liu C.A generative adversarial network (GAN)-assisted data quality monitoring approach for out-of-distribution detection of high dimensional data.Report. Norcross: Institute of Industrial and Systems Engineers; 2023.

[104]

Chang KH. E-design: computer-aided engineering design. New York City: Academic Press; 2015.

[105]

Chen W, Ahmed F.MO-PaDGAN: reparameterizing engineering designs for augmented multi-objective optimization.Appl Soft Comput 2021; 113:107909.

[106]

Guyon I, Gunn S, Nikravesh M, Zadeh L. Feature extraction: foundations and applications. Cham: Springer; 2008.

[107]

Yazdi RM, Imani F, Yang H.A hybrid deep learning model of process-build interactions in additive manufacturing.J Manuf Syst 2020; 57:460-468.

[108]

Roach DJ, Rohskopf A, Hamel CM, Reinholtz WD, Bernstein R, Qi HJ, et al.Utilizing computer vision and artificial intelligence algorithms to predict and design the mechanical compression response of direct ink write 3D printed foam replacement structures.Addit Manuf 2021; 41:101950.

[109]

Lee H, Lee J. Neural network prediction of sound quality via domain knowledge-based data augmentation and Bayesian approach with small data sets. Mech Syst Signal Process 2021; 157:107713.

[110]

De Santo A, Ferraro A, Galli A, Moscato V, Sperlì G. Evaluating time series encoding techniques for predictive maintenance. Expert Syst Appl 2022; 210:118435.

[111]

Blum AL, Langley P.Selection of relevant features and examples in machine learning.Artif Intell 1997; 97(1–2):245-271.

[112]

Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, et al.Feature selection: a data perspective.ACM Comput Surv 2017; 50(6):1-45.

[113]

Pfingsten T, Herrmann DJL, Schnitzler T, Feustel A, Schölkopf B. Feature selection for troubleshooting in complex assembly lines. IEEE Trans Autom Sci Eng 2007; 4(3):465-469.

[114]

Janssens O, Slavkovikj V, Vervisch B, Stockman K, Loccufier M, Verstockt S, et al.Convolutional neural network based fault detection for rotating machinery.J Sound Vib 2016; 377:331-345.

[115]

Bengio Y, Courville A, Vincent P.Representation learning: a review and new perspectives.IEEE Trans Pattern Anal Mach Intell 2013; 35(8):1798-1828.

[116]

Alasadi SA, Bhaya WS.Review of data preprocessing techniques in data mining.ARPN J Eng Appl Sci 2017; 12(16):4102-4417.

[117]

Chaki J, Dey N. A beginner’s guide to image preprocessing techniques. Boca Raton: CRC Press; 2018.

[118]

Singh D, Singh B.Investigating the impact of data normalization on classification performance.Appl Soft Comput 2020; 97:105524.

[119]

Yu L, Zhu J, Zhao Q, Wang Z.An efficient YOLO algorithm with an attention mechanism for vision-based defect inspection deployed on FPGA.Micromachines 2022; 13(7):1058.

[120]

You Z, Gao H, Li S, Guo L, Liu Y, Li J.Multiple activation functions and data augmentation-based lightweight network for in situ tool condition monitoring.IEEE Trans Ind Electron 2022; 69(12):13656-13664.

[121]

Wang Y, Joseph J, Unni TPA, Yamakawa S, Farimani A, Shimada K.Three-dimensional ship hull encoding and optimization via deep neural networks.J Mech Des 2022; 144(10):101701.

[122]

Ruediger-Flore P, Glatt M, Hussong M, Aurich JC.CAD-based data augmentation and transfer learning empowers part classification in manufacturing.Int J Adv Manuf Technol 2023; 125:5065-5118.

[123]

De la Rosa FL, Gómez-Sirvent JL, Sánchez-Reolid R, Morales R, Fernández-Caballero A.Geometric transformation-based data augmentation on defect classification of segmented images of semiconductor materials using a ResNet50 convolutional neural network.Expert Syst Appl 2022; 206:117731.

[124]

Jain S, Seth G, Paruthi A, Soni U, Kumar G.Synthetic data augmentation for surface defect detection and classification using deep learning.J Intell Manuf 2022; 33(4):1007-1020.

[125]

Davtalab O, Kazemian A, Yuan X, Khoshnevis B.Automated inspection in robotic additive manufacturing using deep learning for layer deformation detection.J Intell Manuf 2022; 33(3):771-784.

[126]

Xie Y, Li S, Wu CT, Lai Z, Su M.A novel hypergraph convolution network for wafer defect patterns identification based on an unbalanced dataset.J Intell Manuf 2024; 35:633-646.

[127]

Molitor DA, Kubik C, Becker M, Hetfleisch RH, Lyu F, Groche P.Towards high-performance deep learning models in tool wear classification with generative adversarial networks.J Mater Process Technol 2022; 302:117484.

[128]

Zhang Z, Wen G, Chen S.Weld image deep learning-based on-line defects detection using convolutional neural networks for Al alloy in robotic arc welding.J Manuf Process 2019; 45:208-216.

[129]

Donda K, Zhu Y, Merkel A, Wan S, Assouar B.Deep learning approach for designing acoustic absorbing metasurfaces with high degrees of freedom.Extreme Mech Lett 2022; 56:101879.

[130]

Shi P, Qi Q, Qin Y, Scott PJ, Jiang X.A novel learning-based feature recognition method using multiple sectional view representation.J Intell Manuf 2020; 31(5):1291-1309.

[131]

Dai W, Li D, Tang D, Jiang Q, Wang D, Wang H, et al.Deep learning assisted vision inspection of resistance spot welds.J Manuf Process 2021; 62:262-274.

[132]

Singh SA, Desai KA.Automated surface defect detection framework using machine vision and convolutional neural networks.J Intell Manuf 2023; 34(4):1995-2011.

[133]

Ma G, Yu L, Yuan H, Xiao W, He Y.A vision-based method for lap weld defects monitoring of galvanized steel sheets using convolutional neural network.J Manuf Process 2021; 64:130-139.

[134]

Dong L, Chen W, Yang S, Yu H.A new machine vision–based intelligent detection method for gear grinding burn.Int J Adv Manuf Technol 2023; 125(9–10):4663-4677.

[135]

Tang J, Zhou H, Wang T, Jin Z, Wang Y, Wang X.Cascaded foreign object detection in manufacturing processes using convolutional neural networks and synthetic data generation methodology.J Intell Manuf 2022; 34:2925-2941.

[136]

Wong V, Ferguson M, Law K, Lee Y, Witherell P. Segmentation of additive manufacturing defects using U-Net. J Comput Inf Sci Eng 2022; 22(3):031005.

[137]

Kumaresan S, Aultrin K, Kumar S, Anand M.Deep learning-based weld defect classification using VGG16 transfer learning adaptive fine-tuning.Int J Interact Des Manuf 2023; 17:2999-3010.

[138]

Sha Y, Faber J, Gou S, Liu B, Li W, Schramm S, et al.A multi-task learning for cavitation detection and cavitation intensity recognition of valve acoustic signals.Eng Appl Artif Intell 2022; 113:104904.

[139]

Ye Y, Huang C, Zeng J, Zhou Y, Li F. Shock detection of rotating machinery based on activated time-domain images and deep learning: an application to railway wheel flat detection. Mech Syst Signal Process 2023; 186:109856.

[140]

Li X, Zhang W, Ding Q, Sun JQ.Intelligent rotating machinery fault diagnosis based on deep learning using data augmentation.J Intell Manuf 2020; 31:433-452.

[141]

Becker P, Roth C, Roennau A, Dillmann R. Acoustic anomaly detection in additive manufacturing with long short-term memory neural networks. In: Proceedings of the 2020 IEEE 7th International Conference on Industrial Engineering and Applications (ICIEA); 2020 Apr 16–21; Bangkok, Thailand. Piscataway: IEEE; 2020. p. 921–6.

[142]

Zhang W, Joseph J, Chen Q, Koz C, Xie L, Regmi A, et al. A data augmentation method for data-driven component segmentation of engineering drawings. J Comput Inf Sci Eng 2024; 24(1):011001.

[143]

Lyu Y, Yang Z, Liang H, Zhang B, Ge M, Liu R, et al.Artificial intelligence-assisted fatigue fracture recognition based on morphing and fully convolutional networks.Fatigue Fract Eng Mater Struct 2022; 45(6):1690-1702.

[144]

Martins D, Lima A, Pinto M, Hemerly D, Prego T, Silva F, et al.Hybrid data augmentation method for combined failure recognition in rotating machines.J Intell Manuf 2022; 34:1795-1813.

[145]

Fan SKS, Cheng CW, Tsai DM.Fault diagnosis of wafer acceptance test and chip probing between front-end-of-line and back-end-of-line processes.IEEE Trans Autom Sci Eng 2022; 19(4):3068-3082.

[146]

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP.SMOTE: synthetic minority over-sampling technique.J Artif Intell Res 2002; 16:321-357.

[147]

Li Y, Shi Z, Liu C, Tian W, Kong Z, Williams CB.Augmented time regularized generative adversarial network (ATR–GAN) for data augmentation in online process anomaly detection.IEEE Trans Autom Sci Eng 2022; 19(4):3338-3355.

[148]

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al.Generative adversarial networks.Commun ACM 2020; 63(11):139-144.

[149]

Chen W, Ahmed F.PaDGAN: learning to generate high-quality novel designs.J Mech Des 2021; 143(3):031703.

[150]

Nobari AH, Chen W, Ahmed F. PcDGAN: a continuous conditional diverse generative adversarial network for inverse design. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; 2021 Aug 14–18; Singapore; online. New York City: Association for Computing Machinery; 2021. p. 606–16.

[151]

Yoo Y, Jung UJ, Han YH, Lee J.Data augmentation-based prediction of system level performance under model and parameter uncertainties: role of designable generative adversarial networks (DGAN).Reliab Eng Syst Saf 2021; 206:107316.

[152]

Wu H, Liu X, An W, Lyu H.A generative deep learning framework for airfoil flow field prediction with sparse data.Chinese J Aeronaut 2022; 35(1):470-484.

[153]

Wang J, Yang Z, Zhang J, Zhang Q, Chien WTK.AdaBalGAN: an improved generative adversarial network with imbalanced learning for wafer defective pattern recognition.IEEE Trans Semicond Manuf 2019; 32(3):310-319.

[154]

Alawieh MB, Boning D, Pan DZ.Wafer map defect patterns classification using deep selective learning. In: Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC); 2020 Jul 20–24; San Francisco, CA, USA. Piscataway: IEEE; 2020. p. 1–6.

[155]

Yun JP, Shin WC, Koo G, Kim MS, Lee C, Lee SJ.Automated defect inspection system for metal surfaces based on deep learning and data augmentation.J Manuf Syst 2020; 55:317-324.

[156]

Niu S, Li B, Wang X, Lin H.Defect image sample generation with GAN for improving defect recognition.IEEE Trans Autom Sci Eng 2020; 17(3):1611-1622.

[157]

Li H, Fan R, Shi Q. Oversampling and deep forest based minority-class sensitive fault diagnosis approach. In: Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC); 2020 Oct 11–14; Toronto, ON, Canada. Piscataway: IEEE; 2020. p. 3629–36.

[158]

Li XY, Li J, Qu Y, He D.Semi-supervised gear fault diagnosis using raw vibration signal based on deep learning.Chinese J Aeronaut 2020; 33(2):418-426.

[159]

Behera S, Misra R.Generative adversarial networks based remaining useful life estimation for IIoT.Comput Electr Eng 2021; 92:107195.

[160]

Meister S, Möller N, Stüve J, Groves RM.Synthetic image data augmentation for fibre layup inspection processes: techniques to enhance the data set.J Intell Manuf 2021; 32:1767-1789.

[161]

Wiederkehr P, Finkeldey F, Merhofe T.Augmented semantic segmentation for the digitization of grinding tools based on deep learning.CIRP Annals 2021; 70(1):297-300.

[162]

Che C, Wang H, Fu Q, Ni X.Intelligent fault prediction of rolling bearing based on gate recurrent unit and hybrid autoencoder.Proc Inst Mech Eng C 2021; 235(6):1106-1114.

[163]

Zhou X, Hu Y, Wu J, Liang W, Ma J, Jin Q. Distribution bias aware collaborative generative adversarial network for imbalanced deep learning in industrial IoT. IEEE Trans Ind Inf 2023; 19(1):570-580.

[164]

Yang Z, Zhang M, Chen Y, Hu N, Gao L, Liu L, et al.Surface defect detection method for air rudder based on positive samples.J Intell Manuf 2022; 35(1):99-113.

[165]

Yang C, Liu J, Zhou K, Li X.Dynamic spatial–temporal graph-driven machine remaining useful life prediction method using graph data augmentation.J Intell Manuf 2022; 35:355-366.

[166]

Peng P, Lu J, Xie T, Tao S, Wang H, Zhang H.Open-set fault diagnosis via supervised contrastive learning with negative out-of-distribution data augmentation.IEEE Trans Ind Inf 2023; 19(3):2463-2473.

[167]

Farady I, Lin CY, Chang MC.PreAugNet: improve data augmentation for industrial defect classification with small-scale training data.J Intell Manuf 2024; 35:1233-1246.

[168]

Niu S, Peng Y, Li B, Qiu Y, Niu T, Li W.A novel deep learning motivated data augmentation system based on defect segmentation requirements.J Intell Manuf 2024; 35:687-701.

[169]

Nguyen T, Le T, Vu H, Phung D.Dual discriminator generative adversarial nets.2017. arXiv: 1709.03831.

[170]

Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. Piscataway: IEEE; 2017. p. 2242–51.

[171]

Figueira A, Vaz B.Survey on synthetic data generation, evaluation methods and GANs.Mathematics 2022; 10(15):2733.

[172]

Anscombe FJ.Graphs in statistical analysis.Am Stat 1973; 27(1):17-21.

[173]

Shmelkov K, Schmid C, Alahari K. How good is my GAN? In: Proceedings of Computer Vision–ECCV 2018; 2018 Sep 8–14; Munich, Germany. Berlin: Springer; 2018. p. 218–34.

[174]

Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X. Improved techniques for training GANs. In: Proceedings of the 30th International Conference on Neural Information Processing Systems; 2016 Dec 5–10; Barcelona, Spain. New York City: Curran Associates Inc.; 2016. p. 2234–42.

[175]

Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA, USA. New York City: Curran Associates Inc.; 2017. p. 6629–40.

[176]

Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of GANs for improved quality, stability, and variation. 2017. arXiv: 1710.10196.

[177]

Alaa A, van Breugel B, Saveliev E, van der Schaar M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. 2022. arXiv: 2102.08921.

[178]

Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In: Proceedings of the 34th International Conference on Neural Information Processing Systems; 2020 Dec 6–12; Vancouver, BC, Canada. New York City: Curran Associates Inc.; 2020. p. 6840–50.

[179]

Trabucco B, Doherty K, Gurinas M, Salakhutdinov R.Effective data augmentation with diffusion models.2023. arXiv: 2302.07944.

[180]

Kebaili A, Lapuyade-Lahorgue J, Ruan S.Deep learning approaches for data augmentation in medical imaging: a review.J Imaging 2023; 9(4):81.

[181]

Xiao Z, Kreis K, Vahdat A.Tackling the generative learning trilemma with denoising diffusion GANs. 2021. arXiv:2112.07804.

[182]

Chlap P, Min H, Vandenberg N, Dowling J, Holloway L, Haworth A.A review of medical image data augmentation techniques for deep learning applications.J Med Imaging Radiat Oncol 2021; 65(5):545-563.

[183]

Kapusuzoglu B, Mahadevan S, Matsumoto S, Miyagi Y, Watanabe D.Adaptive surrogate modeling for high-dimensional spatio-temporal output.Struct Multidiscip Optim 2022; 65(10):300.

[184]

Yang H, Li S, Tabery C, Lin B, Yu B.Bridging the gap between layout pattern sampling and hotspot detection via batch active learning.IEEE Trans Comput-Aided Des Integr Circuits Syst 2020; 40(7):1464-1475.

[185]

Rožanec JM, Bizjak L, Trajkova E, Zajec P, Keizer J, Fortuna B, et al. Active learning and novel model calibration measurements for automated visual inspection in manufacturing. J Intell Manuf 2023; 35:1963-1984.

[186]

Van Houtum GJJ, Vlasea ML.Active learning via adaptive weighted uncertainty sampling applied to additive manufacturing.Addit Manuf 2021; 48:102411.

[187]

Xiao Y, Su M, Yang H, Chen J, Yu J, Yu B.Low-cost lithography hotspot detection with active entropy sampling and model calibration. In: Proceedings of the 2021 58th ACM/IEEE Design Automation Conference (DAC); 2021 Dec 5–9; San Francisco, CA, USA. Piscataway: IEEE; 2021. p. 907–21.

[188]

Seung H, Opper M, Sompolinsky H. Query by committee. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory; 1992 Jul 27–29; Pittsburgh, PA, USA. New York City: Association for Computing Machinery; 1992. p. 287–94.

[189]

Settles B.Active learning literature survey [dissertation]. Madison: University of Wisconsin–Madison; 2009.

[190]

Borodin A.Determinantal point processes.2009. arXiv: 0911.1153.

[191]

Samavatian V, Fotuhi-Firuzabad M, Samavatian M, Dehghanian P, Blaabjerg F.Iterative machine learning-aided framework bridges between fatigue and creep damages in solder interconnections.IEEE Trans Compon Packag Manuf Technol 2022; 12(2):349-358.

[192]

Xie J, Zhang C, Sun L, Zhao YF.Fairness-and uncertainty-aware data generation for data-driven design based on active learning.J Comput Inf Sci Eng 2024; 24(5):051004.

[193]

Zhang H, Chen W, Rondinelli JM, Wei C. ET-AL: entropy-targeted active learning for bias mitigation in materials data. Appl Phys Rev 2023; 10(2):021403.

[194]

Lin Y, Li M, Watanabe Y, Kimura T, Matsunawa T, Nojima S, et al.Data efficient lithography modeling with transfer learning and active data selection.IEEE Trans Comput-Aided Des Integr Circuits Syst 2019; 38(10):1900-1913.

[195]

Shao H, Ping H, Chen K, Su W, Lin C, Fang S, et al.Keeping deep lithography simulators updated: global-local shape-based novelty detection and active learning.IEEE Trans Comput-Aided Des Integr Circuits Syst 2023; 42(3):1000-1014.

[196]

Bull LA, Worden K, Rogers TJ, Wickramarachchi C, Cross EJ, McLeay T, et al.A probabilistic framework for online structural health monitoring: active learning from machining data streams.J Phys Conf Ser 2019; 1264(1):012028.

[197]

Sarkar S, Mondal S, Joly M, Lynch ME, Bopardikar SD, Acharya R, et al.Multifidelity and multiscale Bayesian framework for high-dimensional engineering design and calibration.J Mech Des 2019; 141(12):121001.

[198]

Cui F, Ghosn M.Implementation of machine learning techniques into the subset simulation method.Struct Saf 2019; 79:12-25.

[199]

Shim J, Kang S, Cho S.Active learning of convolutional neural network for cost-effective wafer map pattern classification.IEEE Trans Semicond Manuf 2020; 33(2):258-266.

[200]

Wang Y, Franzon PD, Smart D, Swahn B.Multi-fidelity surrogate-based optimization for electromagnetic simulation acceleration.ACM Trans Des Autom Electron Syst 2020; 25(5):45.

[201]

Yue X, Wen Y, Hunt JH, Shi J. Active learning for Gaussian process considering uncertainties with application to shape control of composite fuselage. IEEE Trans Autom Sci Eng 2020; 18(1):36-46.

[202]

Sun Q, Bai C, Geng H, Yu B.Deep neural network hardware deployment optimization via advanced active learning. In: Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE); 2021 Feb 1–5; Grenoble, France. Piscataway: IEEE; 2021. p. 1510–5.

[203]

Botcha B, Iquebal AS, Bukkapatnam STS.Efficient manufacturing processes and performance qualification via active learning: application to a cylindrical plunge grinding platform.Procedia Manuf 2021; 53:716-725.

[204]

Verduzco JC, Marinero EE, Strachan A.An active learning approach for the design of doped LLZO ceramic garnets for battery applications.Integr Mater Manuf Innov 2021; 10:299-310.

[205]

Cheng J, Jin H.An adaptive extreme learning machine based on an active learning method for structural reliability analysis.J Brazilian Soc Mech Sci Eng 2021; 43(12):546.

[206]

Owoyele O, Pal P.A novel active optimization approach for rapid and efficient design space exploration using ensemble machine learning.J Energy Resour Technol 2021; 143(3):032307.

[207]

Yang S, Lee S, Yee K.Inverse design optimization framework via a two-step deep learning approach: application to a wind turbine airfoil.Eng Comput 2022; 39:2239-2255.

[208]

Zhang Q, Wu Y, Lu L, Qiao P.An adaptive dendrite-HAMR metamodeling technique for high-dimensional problems.J Mech Des 2022; 144(8):081701.

[209]

Xu Y, Zheng Z, Arora K, Senesky D, Wang P. Hall effect sensor design optimization with multi-physics informed Gaussian process modeling. In: Proceedings of the International Design Engineering Technical Conferences and Computers and Information in Engineering Conference; 2022 Aug 14–17; St. Louis, MO, USA. New York City: ASME; 2022. p. V03BT03A028.

[210]

Liu Z, Renteria A, Zheng Z, Wang P, Li Y.Design of additively manufactured functionally graded cellular structures. In: Proceedings of the IISE Annual Conference and Expo 2022; 2022 May 21–24; Seattle, WA, USA. Montreal: IISE; 2022.

[211]

Hughes AJ, Bull LA, Gardner P, Barthorpe RJ, Dervilis N, Worden K.On risk-based active learning for structural health monitoring.Mech Syst Signal Process 2022; 167:108569.

[212]

Kolesnikov VI, Pashkov DM, Belyak OA, Guda AA, Danilchenko SA, Manturov DS, et al.Design of double layer protective coatings: finite element modeling and machine learning approximations.Acta Astronaut 2023; 204:869-877.

[213]

Zhu R, Peng W, Wang D, Huang CG.Bayesian transfer learning with active querying for intelligent cross-machine fault prognosis under limited data.Mech Syst Signal Process 2023; 183:109628.

[214]

Wan J, Che Y, Wang Z, Cheng C.Uncertainty quantification and optimal robust design for machining operations.J Comput Inf Sci Eng 2023; 23(1):011005.

[215]

Li Z, Segura LJ, Li Y, Zhou C, Sun H.Multiclass reinforced active learning for droplet pinch-off behaviors identification in inkjet printing.J Manuf Sci Eng 2023; 145(7):071002.

[216]

Hao P, Duan Y, Liu D, Yang H, Liu D, Wang B.Image-driven intelligent prediction of buckling behavior for geometrically imperfect cylindrical shells.AIAA J 2023; 61(5):2266-2280.

[217]

Farrokh M, Fallah MR.Flutter instability boundary determination of composite wings using adaptive support vector machines and optimization.J Brazilian Soc Mech Sci Eng 2023; 45(3):181.

[218]

Luo J, Fu Z, Zhang Y, Fu W, Chen J.Aerodynamic optimization of a transonic fan rotor by blade sweeping using adaptive Gaussian process.Aerosp Sci Technol 2023; 137:108255.

[219]

Pidaparthi B, Missoum S.A multi-fidelity approach for reliability assessment based on the probability of classification inconsistency.J Comput Inf Sci Eng 2023; 23(1):011008.

[220]

Xie J, Zhang C, Sun L, Zhao Y.Fairness-and uncertainty-aware data generation for data-driven design.2023. arXiv: 2309.05842.

[221]

Shorten C, Khoshgoftaar TM.A survey on image data augmentation for deep learning.J Big Data 2019; 6(1):60.

[222]

Niu Z, Yu K, Wu X.LSTM-based VAE–GAN for time-series anomaly detection.Sensors 2020; 20(13):3738.

[223]

Zhang C, Sedal A, Zhao YF.Differentiable surrogate models for design and trajectory optimization of auxetic soft robots. In: Proceedings of the 2023 IEEE International Conference on Soft Robotics (RoboSoft); 2023 Apr 3–7; Singapore. Piscataway: IEEE; 2023. p. 1–8.
