1. Introduction
Chronic obstructive pulmonary disease (COPD) is a progressive respiratory disorder ranked as the fourth leading cause of death globally in 2021 [1]. In China, the prevalence of COPD among adults aged 20 years and older was 8.6% in 2015, according to the China Pulmonary Health Study [2]. The condition is projected to impose an economic burden of 1.36 trillion USD between 2020 and 2050 [3]. COPD is characterized by persistent airflow limitation and is increasingly recognized as a heterogeneous condition [4]. It is further compounded by the high prevalence of comorbidities such as cardiovascular disease, asthma, bronchiectasis, and diabetes [5], all of which significantly affect health-related quality of life (HRQoL), disease burden [6], and survival [4,7].
Understanding the intricate interactions [8] between COPD and comorbidities is essential for improving patient outcomes [9] and guiding clinical and public health interventions [10]. A 2023 study emphasized the need to view COPD not as a single disease with comorbidities but as a component of multimorbidity, requiring a syndrome-based management approach [11]. Previous studies have employed machine learning to classify patients with COPD into clusters based on disease profiles, enabling more targeted interventions [10,12]. Such clustering approaches [10,12] are pivotal to addressing the multifaceted challenges of COPD and ensuring more effective global disease management. While clinical research has identified key clusters, including chronic bronchitis, emphysema, and asthma-COPD overlap [13,14,15,16,17], less attention has been given to the impact on daily functioning and well-being.
No standardized clustering approach exists, as disease features vary by sample and method. Moreover, most COPD clustering studies have been conducted in Western populations [13,14,15], with relatively few from Asia [16,17]. Given the high burden of COPD in China, there is a pressing need to delineate clinically meaningful patient clusters to improve risk stratification and inform targeted care. While some Chinese studies have investigated the prevalence of comorbidities such as cardiovascular disease and diabetes in COPD patients [18], few have conducted cluster-based analyses. Overall, the literature lacks clustering studies that capture multimorbidity patterns, validate clustering methodologies, or reflect data from Chinese populations.
We conducted a study involving 11 145 patients with COPD in China, incorporating 31 variables, 27 of which were comorbid conditions, to identify COPD patient clusters using unsupervised machine learning. We evaluated clustering consistency across algorithms, validated model generalizability via random forests, and used logistic regression to quantify inter-cluster differences. We aimed to identify COPD patient clusters in a Chinese cohort and examine their association with HRQoL. These findings contribute to a more nuanced understanding of COPD in China, facilitating the development of more effective public health and clinical strategies.
2. Materials and methods
2.1. Data source
This cross-sectional study used data from the Enjoying Breathing Program, the first nationwide prospective COPD cohort study in China, designed to establish a comprehensive management system and evaluate interventions [19]. Participants were recruited from healthcare institutions, including primary and tertiary centers, and enrolled at community-based hospitals. The Enjoying Breathing Program was registered at ClinicalTrials.gov (ID: NCT04318912) in March 2020. This study was approved by the China–Japan Friendship Hospital, China (Approval number: 2019-41-k29). All participants provided written informed consent. This study followed the ethical principles of the Declaration of Helsinki.
2.2. Study population
We included adults aged 18 and older enrolled in the Enjoying Breathing Program between May 2020 and April 2023. Those with a COPD screening questionnaire (COPD-SQ) score ≥ 16 underwent baseline surveys, including pulmonary function tests using portable machines. COPD was diagnosed in participants with a post-bronchodilator forced expiratory volume in 1 second to forced vital capacity ratio (FEV1/FVC) < 0.7. The FEV1 percent predicted normal value (FEV1%pred) was used to reflect the level of airflow limitation. Patients missing key variables such as age, sex, body mass index (BMI), or smoking status were excluded to minimize bias from non-random missing data (Fig. 1).
2.3. Definition of patient characteristics, morbidities, and outcomes
Morbidities were defined as binary variables (“yes” or “no”) based on a self-reported disease history questionnaire. To avoid data sparsity and enhance multiple correspondence analysis (MCA) reliability, we excluded morbidities with a prevalence below 0.10%, resulting in 27 morbidities retained (Table S1 in Appendix A).
Patient characteristics included socio-demographic, health-related, and clinical factors. Socio-demographic variables were age, sex, educational level, geographical region, and health insurance access. Health-related factors included BMI, smoking status, dust/biomass exposure, and family history of COPD. Clinical factors encompassed pulmonary function and comorbidity status. Pulmonary function was assessed using the modified Medical Research Council (mMRC) dyspnea scale and the COPD assessment test (CAT). Comorbidity status was measured by the number of chronic conditions and two indices [20]: the Charlson comorbidity index (CCI) and the COPD specific comorbidity test index (COTE). CCI estimates overall survival impact, whereas COTE predicts COPD-related mortality. All socio-demographic and health-related variables were coded as categorical data, and clinical factors were coded as continuous variables.
The primary outcome was HRQoL, measured using the five-level version of the EuroQol five-dimension (EQ-5D-5L) instrument [21], which covers mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension has five response levels: no problems, slight problems, moderate problems, severe problems, and unable to/extreme problems. Health state utility values were derived using time trade-off [22], ranging from < 0 (worse than death) to 1 (full health), with “0” equivalent to death. To facilitate analysis, each dimension was dichotomized into “no problems” and “having problems” (slight, moderate, severe, and unable to/extreme problems). We adopted the EQ-5D-5L utility index derived from Chinese urban health preferences [22]. Values for the EQ-5D-5L are reported as frequencies (percentages) of individuals with any severity level of problems (levels 2–5), excluding those reporting no problems (level 1).
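The dichotomization described above can be illustrated with a minimal Python sketch. The dimension keys, function names, and example patients are hypothetical and chosen only for illustration; the study's own analysis was performed in R and its code is not reproduced here.

```python
from typing import Dict, List

# The five EQ-5D-5L dimensions named in the text (key names are illustrative).
DIMENSIONS = ("mobility", "self_care", "usual_activities",
              "pain_discomfort", "anxiety_depression")


def dichotomize(responses: Dict[str, int]) -> Dict[str, bool]:
    """Map each dimension's level (1-5) to True when any problem is reported (levels 2-5)."""
    flags = {}
    for dim in DIMENSIONS:
        level = responses[dim]
        if level not in (1, 2, 3, 4, 5):
            raise ValueError(f"{dim}: level must be 1-5, got {level!r}")
        flags[dim] = level >= 2  # level 1 means "no problems"
    return flags


def percent_with_problems(patients: List[Dict[str, int]], dim: str) -> float:
    """Percentage of patients reporting any problem on one dimension."""
    flags = [dichotomize(p)[dim] for p in patients]
    return 100.0 * sum(flags) / len(flags)


# Two made-up respondents: one in full health, one with moderate pain.
healthy = {dim: 1 for dim in DIMENSIONS}
moderate_pain = dict(healthy, pain_discomfort=3)
print(percent_with_problems([healthy, moderate_pain], "pain_discomfort"))
```

Collapsing levels 2 to 5 into one category discards severity information but yields a binary outcome that suits logistic regression.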
2.4. Cluster identification
Cluster identification steps are outlined in Fig. 2. First, we reduced dimensionality from 31 variables (27 morbidities plus age, sex, BMI, and smoking status) to three uncorrelated components using MCA [23]. MCA, similar to principal component analysis (PCA), is tailored for categorical data. Using MCA components, which are continuous, independent, centered, and standardized, improved clustering stability and interpretability (Section S2.1 in Appendix A).
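To make the MCA step more concrete, the following sketch shows only the preparation that precedes the decomposition: one-hot encoding the categorical variables into an indicator matrix and computing the standardized residuals whose singular value decomposition would give the MCA components. It is an illustration under stated assumptions, not the study's implementation. The records, variable names, and function names are hypothetical, and the decomposition itself is omitted.

```python
from math import sqrt


def indicator_matrix(records, variables):
    """One-hot encode categorical records; returns (matrix, column_labels)."""
    labels = []
    for var in variables:
        for cat in sorted({rec[var] for rec in records}):
            labels.append((var, cat))
    matrix = [[1 if rec[var] == cat else 0 for (var, cat) in labels]
              for rec in records]
    return matrix, labels


def standardized_residuals(z):
    """Standardized residuals of the correspondence matrix built from indicator matrix z."""
    total = sum(map(sum, z))  # equals n_records * n_variables
    row_mass = [sum(row) / total for row in z]
    col_mass = [sum(col) / total for col in zip(*z)]
    return [[(z[i][j] / total - row_mass[i] * col_mass[j])
             / sqrt(row_mass[i] * col_mass[j])
             for j in range(len(col_mass))]
            for i in range(len(row_mass))]


def total_inertia(residuals):
    """Sum of squared residuals: the total variance that MCA distributes over its components."""
    return sum(v * v for row in residuals for v in row)


# Made-up records with three categorical variables (seven categories in total).
records = [
    {"sex": "male", "smoking": "current", "hypertension": "no"},
    {"sex": "female", "smoking": "never", "hypertension": "yes"},
    {"sex": "male", "smoking": "former", "hypertension": "yes"},
    {"sex": "female", "smoking": "never", "hypertension": "no"},
]
z, labels = indicator_matrix(records, ["sex", "smoking", "hypertension"])
print(total_inertia(standardized_residuals(z)))
```

For an indicator matrix, the total inertia equals (number of categories / number of variables) − 1, which provides a quick sanity check on the encoding before any components are extracted.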
We then randomly divided the sample into a training dataset (80%) and a validation dataset (20%). K-means++ clustering [24] and hierarchical clustering were applied to identify potential COPD patient clusters. The optimal number of clusters was determined by inspecting the hierarchical clustering dendrograms and validated using the elbow [25] and silhouette methods [26]. We then selected the clustering solution based on the silhouette magnitude and sign, with higher values indicating well-separated clusters and negative values suggesting poor fit [27] (Section S2.2 in Appendix A).
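As a rough illustration of the clustering and silhouette steps, the sketch below implements K-means with K-means++ seeding and the mean silhouette width in plain Python. It is a toy version under simplifying assumptions (Euclidean distance, at least two clusters, no duplicate points, more distinct points than clusters), not the study's R implementation. All function names and the example points are hypothetical.

```python
import random
from math import dist


def kmeanspp_init(points, k, rng):
    """Pick k starting centers: the first uniformly, the rest with probability proportional to squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm with K-means++ seeding; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width; values near 1 indicate well-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # mean distance within own cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two made-up, well-separated groups standing in for MCA component scores.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels = kmeans(points, k=2)
print(labels, round(mean_silhouette(points, labels), 3))
```

In practice, the mean silhouette would be computed for several candidate values of k and compared, alongside the elbow plot and the dendrogram.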
2.5. Cluster validation
To validate our clustering results, we trained a random forest (RF) model using cluster assignments from the training dataset and applied it to the validation dataset (Fig. 2). The RF model was selected for its robustness and ability to capture complex feature interactions. We used ten-fold cross-validation to improve RF model performance and compared predicted clusters with original assignments using a confusion matrix and the adjusted Rand index [25,28]. This validation was repeated 100 times to ensure robustness. Further methodological details are presented in Section S2.3 in Appendix A.
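The adjusted Rand index used to compare predicted and original cluster labels can be computed from pair counts, as in the minimal sketch below. The function name and example labels are hypothetical, and the sketch assumes at least two observations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    n = len(labels_a)
    # Pairs of items placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)


# Label names do not matter, only the grouping: these two partitions are identical.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A value of 1 indicates identical partitions, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.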
2.6. Statistical analysis
After identifying clusters, we used descriptive statistics, Pearson’s χ² test, and Wilcoxon rank-sum test to determine differences between patients with and without comorbidities. Socio-demographic characteristics were analyzed descriptively, and inter-cluster differences were examined using the Kruskal–Wallis rank sum test for continuous variables and Pearson’s χ² test for categorical data. Radar plots were used to visualize the distribution of major diseases and HRQoL dimensions across the identified clusters. One plot illustrates the prevalence of key comorbidities within each cluster, while the other shows the proportion of individuals reporting problems in the five dimensions. Associations between clusters and EQ-5D outcomes were explored using logistic regression, adjusting for education, residence, region, occupation, health insurance, family history, and dust exposure. All analyses were performed using R version 4.3.1.
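To show how an odds ratio and its 95% confidence interval are obtained, the sketch below computes an unadjusted odds ratio from a 2 × 2 table with a Wald interval on the log scale. This is a simplification: the study reports covariate-adjusted estimates from logistic regression, which this sketch does not reproduce. The function name and counts are hypothetical and are not study data.

```python
from math import exp, log, sqrt


def odds_ratio_ci(cluster_events, cluster_nonevents, ref_events, ref_nonevents, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval computed on the log scale."""
    cells = (cluster_events, cluster_nonevents, ref_events, ref_nonevents)
    if min(cells) <= 0:
        raise ValueError("all four cell counts must be positive")
    log_or = log((cluster_events * ref_nonevents) / (cluster_nonevents * ref_events))
    se = sqrt(sum(1 / c for c in cells))  # standard error of the log odds ratio
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)


# Made-up counts: 20 of 100 in a cluster report a problem versus 10 of 100 in the reference.
estimate, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR {estimate:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With a single binary predictor and no covariates, logistic regression yields the same odds ratio; adjustment for covariates, as in the study, generally changes both the estimate and its interval.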
3. Results
3.1. Patient characteristics
A total of 11 145 patients with COPD were included.
Table 1 summarizes their baseline socio-demographic and clinical characteristics. Among them, 6616 (59.36%) had at least one comorbidity. Overall, 56.42% were aged ≥ 70 years, 76.86% were male, and 70.50% lived in rural areas. The mean EQ-5D-5L utility score was 0.71 (standard deviation (SD) = 0.26). Compared to those without comorbidities, patients with comorbidities were more likely to be older, female, exposed to dust or biomass, and have higher BMI. They also reported lower EQ-5D-5L utility and more severe symptoms across all five health dimensions.
3.2. Prevalence of comorbidities
The average number of comorbidities was 1.09 (SD = 1.22). Chronic bronchitis was the most common (n = 3573, 32.06%), followed by hypertension and pulmonary emphysema (n = 1981, 17.77% each). Other common comorbidities included ischemic heart disease (n = 875, 7.85%) and pneumonia (n = 674, 6.05%) (Table S1).
3.3. Characterization of clusters
Four distinct COPD clusters emerged using K-means++ (Figs. S2 and S3 in Appendix A). Key characteristics are summarized in Table 2, with full details in Table S2 in Appendix A. Cluster 1 (young male smokers): the largest and youngest group, predominantly male (98.26%), had the highest proportion of current (54.41%) and former smokers (33.23%) and the lowest comorbidity rate. Cluster 2 (biomass-exposed females): primarily female (86.51%) with low smoking prevalence (4.87%) but high biomass fuel exposure (50.26%). This cluster also had the highest proportion with a family history of COPD (23.40%); most were rural residents (73.21%) and had low educational attainment (75.84% with only primary schooling). Cluster 3 (respiratory comorbidity): lowest FEV1%pred and FEV1/FVC with a predominance of chronic bronchitis and pulmonary emphysema. Cluster 4 (elderly multimorbid): predominantly aged ≥ 70, highest BMI, with high rates of hypertension, ischemic heart disease, and diabetes.
3.4. Comorbidity comparison between clusters
Table 2 shows the burden and severity of comorbidities by cluster. The respiratory comorbidity cluster had the highest average number of comorbidities (2.63), predominantly respiratory-related. Despite having fewer comorbidities, the elderly multimorbid cluster displayed the highest CCI (4.97) and COTE (1.08) scores, reflecting the highest overall disease severity and mortality risk. The biomass-exposed females cluster averaged 0.94 comorbidities, lower than the respiratory and elderly multimorbid clusters. The young male smokers cluster had the fewest comorbidities (0.38) but notably high smoking prevalence, a key risk factor for future health deterioration.
The prevalence of primary comorbidities varied across clusters (Fig. 3, Table S3 in Appendix A). Chronic bronchitis and pulmonary emphysema were concentrated in the respiratory comorbidity cluster, affecting 82.96% and 77.52% of patients, respectively. Hypertension was most frequent in the elderly multimorbid cluster (67.99%), compared to the respiratory comorbidity (13.85%), biomass-exposed females (12.61%), and young male smokers (8.64%) clusters. The elderly multimorbid cluster also had the highest prevalence of ischemic heart disease (47.55%) and diabetes (20.50%), highlighting a higher burden of systemic disease. In contrast, the young male smokers and biomass-exposed females clusters exhibited lower rates of chronic respiratory and cardiovascular conditions.
3.5. Associations between clusters and HRQoL
Table 2 and Fig. 4 present EQ-5D-5L utility and percentages of reported problems across different dimensions and clusters. HRQoL declined with increasing comorbidities. The young male smokers cluster reported the highest EQ-5D-5L utility score (0.74), followed by the biomass-exposed females (0.69), the respiratory comorbidity (0.66), and the elderly multimorbid cluster (0.65).
Across dimensions, the young male smokers cluster reported the lowest severity, while the respiratory comorbidity and elderly multimorbid clusters had more problems in all five dimensions. Specifically, 74.60% of patients in the elderly multimorbid cluster reported pain, and 80.59% in the respiratory comorbidity cluster experienced anxiety and depression.
Table 3 presents adjusted odds ratios (ORs) with 95% confidence intervals (CIs) for associations between clusters and HRQoL based on the EQ-5D-5L. Using the young male smokers cluster as the reference, the biomass-exposed females cluster showed significantly higher odds of impairment in self-care (OR: 1.65, 95%CI: 1.49–1.83) and mobility (OR: 1.62, 95%CI: 1.43–1.82). The respiratory comorbidity cluster had the worst outcomes overall, with significantly increased risks in mobility (OR: 1.80, 95%CI: 1.57–2.06), activity (OR: 1.83, 95%CI: 1.62–2.06), and anxiety/depression (OR: 1.97, 95%CI: 1.73–2.25). The elderly multimorbid cluster also showed worse HRQoL, particularly in mobility (OR: 1.71, 95%CI: 1.47–1.98) and pain (OR: 1.68, 95%CI: 1.46–1.92).
3.6. Cluster validation
Clustering validity was confirmed using the RF model. The confusion plot (Fig. S5 in Appendix A) and the adjusted Rand index from 100 replications (0.71, 95%CI: 0.48–0.94) showed strong agreement between predicted cluster labels and original groupings, indicating stable and reliable clustering.
4. Discussion
This study identified four COPD clusters (young male smokers, biomass-exposed females, respiratory comorbidity, and elderly multimorbid) with clear differences in HRQoL outcomes. To our knowledge, this is the most comprehensive cluster analysis of COPD comorbidities in a Chinese population using a machine learning approach. A key strength lies in using a large, updated nationwide multicenter cohort, enhancing the generalizability of findings. By including patients from both primary and tertiary care, we captured a more representative COPD population than typical clinical trials. Unlike previous studies focused on clinical characteristics, our analysis adopted a public health perspective, integrating socio-demographic, physiological, and behavioral variables routinely available in clinical settings [2,29]. Rather than focusing on traditional endpoints like acute exacerbation of chronic obstructive pulmonary disease (AECOPD) and mortality, we emphasized HRQoL using the EQ-5D-5L instrument to provide a deeper understanding of patient burden. The robustness of our findings was strengthened through internal validation, which demonstrated high clustering accuracy and reliability. Overall, this study contributes meaningfully to COPD research and offers valuable insights for personalized care and policymaking.
Despite variability in clusters across studies using unsupervised machine learning, three of our clusters (young male smokers, respiratory comorbidity, and elderly multimorbid) align with those reported globally, indicating generalizability beyond China. The young male smokers cluster matches the “low comorbidity” cluster identified in Western [13,30,31,32,33,34,35,36] and Asian cohorts [16,17,37], typically comprising younger men with few comorbidities. A 2025 Chinese study [38] reported a similar cluster, reinforcing its reproducibility. Our respiratory comorbidity cluster aligns with the “severe respiratory disease” or “severe airflow limitation” clusters reported in both Western [32,34,35,36] and Asian studies [37,39], including recent Chinese data [38]. These clusters consistently exhibit severe lung impairment and coexisting pulmonary conditions. Similarly, the elderly multimorbid cluster, characterized by older age and high cardiovascular and metabolic disease burden, mirrors the “cardiovascular” or “high prevalence of comorbidities” clusters seen in both Western [13,30,31,32,33,34,35,36,40] and Asian cohorts [16,17,37,38]. In contrast, the biomass-exposed females cluster appears to reflect a region-specific pattern common in rural China and other low- and middle-income regions.
By specifying pulmonary diseases to define the respiratory comorbidity cluster, our study advances prior work focusing solely on respiratory function indices. This approach also highlights a significant challenge in comorbidity research: the lack of standardized criteria for disease inclusion and delineation, which complicates the understanding of disease interactions and outcomes [41]. Through machine learning, we grouped conditions such as chronic bronchitis, asthma, and bronchiectasis under impaired respiratory function, providing a more holistic view of respiratory health in patients with COPD.
The elderly multimorbid cluster, characterized by prevalent cardiovascular diseases and diabetes, aligns with existing evidence [32,33,34,42]. Cardiovascular comorbidities significantly increase the economic burden and mortality risk in COPD [6]. These findings reflect shared risk factors in China’s aging population, where smoking and systemic inflammation often exacerbate COPD.
Compared to other clusters, the young male smokers cluster included younger individuals with fewer comorbidities and better health, serving as a comparative benchmark. This group has been described as a mild COPD cluster [32,34,43] despite a high proportion of smokers, reflecting China’s high smoking rates [44]. Yet, smoking cessation rates remain low, and many patients lack access to professional support, leading to high relapse rates due to addiction and social factors [45]. These findings highlight the urgent need for targeted cessation programs for early-stage patients to improve outcomes [46,47].
The biomass-exposed females cluster mainly included women, non-smokers, and those exposed to biomass fuels. This group highlights significant sex-based differences in COPD [48,49,50] and expands the scope beyond its traditional association with older male smokers. While smoking is a well-known cause of COPD, other risk factors, such as biomass exposure, are increasingly acknowledged. This cluster is common in non-smokers and has been underrepresented in cluster studies using data from developed countries, where biomass use is rare. However, approximately three billion people in low- and middle-income countries rely on biomass fuels [51,52], and COPD frequently occurs in non-smokers exposed to their combustion. One study found that biomass exposure increased COPD risk by 1.71 times in men and 2.88 times in women, with even higher risks in never-smokers (2.18 times) [53]. Another study from the Republic of Korea [54] showed biomass smoke exposure posed a COPD exacerbation risk comparable to tobacco smoke. The identification of this cluster aligns with the clinical expert opinion that biomass-related COPD presents a distinct cluster, often with chronic bronchitis and small airway involvement [51,55]. Studies from western China have similarly reported biomass exposure as a significant contributor to rural COPD [52]. Therefore, strategies targeting biomass reduction and clean energy promotion are crucial for mitigating disease severity in Chinese women [48,49].
Our results highlight the need to prioritize the biomass-exposed females cluster, a group often neglected compared to male smokers. This cluster reported more severe issues in the self-care, mobility, and pain dimensions despite having FEV1 similar to that of the young male smokers group. This supports previous findings that women with COPD experience worse dyspnea, stronger emotional responses, and greater HRQoL impairment, potentially due to prolonged household smoke exposure [56,57]. Public health responses should focus on reducing biomass exposure through cleaner cooking technologies, better ventilation, and community education. Policies may include fuel subsidies and air quality regulations. Clinicians should be trained to manage biomass-related COPD, and research should explore tailored therapies for this group.
Current smoking cessation efforts may fall short in reducing the COPD burden among early-stage younger patients with mild symptoms, such as those in the young male smokers cluster. Quitting remains critical for slowing lung function decline, with earlier interventions yielding greater benefits [58]. Tailored cessation programs that address younger patients’ unique challenges are needed to improve outcomes and reduce long-term disease burden.
The respiratory comorbidity cluster exhibited the highest percentage of problems with usual activities and anxiety/depression. Given the high prevalence of anxiety and depression in COPD [59], worsening dyspnea and mobility likely exacerbate psychological distress. A multidisciplinary approach integrating pulmonary rehabilitation, pharmacotherapy, and surgical options is essential. Public health efforts should promote specialized COPD clinics offering comprehensive care. Raising vaccination and infection control awareness is also key to preventing COPD exacerbations.
The elderly multimorbid cluster showed poor pain/discomfort scores, likely due to osteoarthritis, osteoporosis-related pain, and angina [60]. Chronic pain is highly prevalent in COPD, with reported rates between 44% and 88%. Greater pain severity correlates with worse HRQoL [61]. These findings underscore the need for a comprehensive management approach. Public health policies should prioritize routine screenings and coordinated care pathways. Community programs can reinforce education on lifestyle changes, medication adherence, and self-management. Improving healthcare access for older adults and training providers in age-appropriate COPD care remains essential.
This study has limitations. First, the data-driven nature of our unsupervised machine learning necessitates clinical validation to support cluster interpretation. A consensus on cluster definitions is vital for practical applications. Second, although internal validation was performed, the absence of external datasets restricted the generalizability of our findings. Nonetheless, our nationwide sample introduces heterogeneity that partially mitigates this issue. Third, reliance on self-reported comorbidity data may underestimate actual prevalence. Future research should incorporate comprehensive health records for improved accuracy. Finally, the cross-sectional design limits causal inference between comorbidity clusters and HRQoL outcomes. Longitudinal studies are needed to explore causal relationships.
5. Conclusions
We identified four distinct clusters of Chinese patients with COPD (young male smokers, biomass-exposed females, respiratory comorbidity, and elderly multimorbid), each with unique clinical and HRQoL profiles. This study highlights the need for integrating tailored strategies into public health policies. Future research should validate these clusters and investigate their utility in shaping COPD-related policy and practice.
CRediT authorship contribution statement
Chao Wang: Writing – original draft, Methodology, Formal analysis, Conceptualization. Fengyun Yu: Writing – original draft, Validation, Software, Formal analysis. Zhong Cao: Writing – original draft, Methodology, Formal analysis, Conceptualization. Ke Huang: Writing – review & editing, Data curation. Qiushi Chen: Writing – review & editing. Pascal Geldsetzer: Writing – review & editing. Jinghan Zhao: Data curation. Zhoude Zheng: Data curation. Till Bärnighausen: Writing – review & editing. Ting Yang: Writing – review & editing, Resources, Funding acquisition, Data curation. Simiao Chen: Writing – review & editing, Supervision, Funding acquisition, Conceptualization. Chen Wang: Writing – review & editing, Supervision, Funding acquisition, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The study was supported by the Ministry of Science and Technology of the People's Republic of China (2023ZD0506000), the CAMS Innovation Fund for Medical Sciences (CIFMS, 2023-I2M-2-001), and the Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2022-ZHCH330-01). The statements made and views expressed are solely the responsibility of the authors.
Appendix A. Supplementary material
Supplementary data to this article can be found online at https://doi.org/10.1016/j.eng.2025.05.005.