1. Introduction
The human gut is a complex and intricate mini-ecosystem that mediates interactions between the host and the environment. The human gut contains trillions of microorganisms, such as bacteria, fungi, viruses, and other life forms, most of which exist in the colon. These microorganisms and the intestinal environment together constitute the intestinal microecology, and its diversity is the result of the coevolution of the intestinal microbiota and host [1], [2], [3]. The composition of the intestinal microecology is easily affected by many factors, such as diet [4], [5], age [6], [7], sex [7], genetics [8], and medicine [1].
The composition of the intestinal microbiota plays a fundamental role in regulating human health and diseases [9]. In addition to participating in human digestive function, the intestinal microbiota can affect human development, growth, and physiology, including organ development and morphogenesis, and metabolism [10], [11]. The intestinal microbiota also plays an indispensable role in the development and induction of the human immune system; it regulates the differentiation of immune cells and the production of immune mediators to maintain interaction between the host and the intestinal microflora [12], [13], [14]. Disruption of the normal intestinal microbiota increases the risk of infection, the excessive proliferation of harmful pathogens, and the occurrence of inflammatory diseases [12]. Qin et al. [15] studied the changes in the gut microbiota of patients with liver cirrhosis (LC) and found that, at the genus level, Bacteroides was the dominant phylotype in both groups, but its abundance was significantly decreased in the LC group. Therefore, it is very important for a healthy host to maintain microecological homeostasis. Indeed, an unbalanced intestinal microecology leads to the occurrence of a variety of diseases, including liver diseases, gastrointestinal diseases, metabolic diseases, and cardiovascular diseases [16], [17], [18], [19].
Therefore, it is of great significance to routinely and promptly assess the status of the human gut microbiota for the auxiliary evaluation of overall health and disease prediction. In the past, the coccus/bacillus (C/B) ratio, based on traditional methods such as bacterial culture or microscopic examination, was commonly used to reflect gut bacterial homeostasis [20], [21]. More recently, with the maturity of polymerase chain reaction (PCR) technology and the microbial sequencing technologies developed from it, more bacterial species can be detected, and the amounts of some key bacterial species, bacterial ratios, or other indices have been selected to serve as balance indicators for the intestinal microecology. For example, it was found that the ratio of the concentration of Bifidobacterium to that of Enterobacteriaceae (B/E ratio) can be used to judge the extent of intestinal microecology dysbiosis in the progression of liver diseases [22], [23]. Low et al. [24] found that an increase in the ratio of the abundance of Klebsiella to that of Bifidobacterium (K/B ratio) in early infants is a potential indicator of an increased risk of allergic disease. In addition, Ley et al. [25] first proposed that a higher ratio of the abundance of Firmicutes to that of Bacteroidetes (F/B ratio) in the gut was likely to result in obesity. Microbial sequencing technologies have now become the mainstream method for studying the gut microbiota because of their high-throughput screening capability. Microbial studies have mostly been based on 16S ribosomal RNA (rRNA) gene sequencing [26], [27], [28] and metagenomic sequencing [29], [30]. However, these two kinds of approaches are expensive and inefficient and are generally very cumbersome. As a result, microbial sequencing is not appropriate for routine detection of the gut microbiota in a large population. Because of its speed, simplicity, and capacity for absolute quantification [31], [32], quantitative PCR (qPCR) is a good alternative approach for routine gut bacterial detection.
Based on previous studies [15], [32], this study attempted to establish an effective and routine gut bacterial detection procedure; ten predominant bacterial groups in the human intestinal tract were detected via qPCR, including probiotics (Lactobacillus and Bifidobacterium), opportunistic pathogens (Enterobacteriaceae, Enterococcus, Bacteroides, and Atopobium), and other health-promoting symbiotic bacteria (Faecalibacterium prausnitzii (F. prausnitzii), Clostridium butyricum (C. butyricum), Clostridium leptum (C. leptum), and Eubacterium rectale (E. rectale)). These ten representative bacteria, selected from a large sample cohort of healthy people, were considered potential indices for the evaluation of the whole human gut microbiome. Furthermore, we sought to obtain reference ranges for a healthy population and the changing patterns of the ten bacterial groups and their pairwise ratios with aging to pave the way for subsequent studies on large-sample disease-specific populations. Moreover, to investigate the probable difference in the qPCR detection results between healthy people and people with specific diseases, we evaluated patients with LC in comparison to healthy control (HC) subjects. To further test the capability of qPCR detection to distinguish people with diseases from the general population, we used machine learning algorithms to mine the detection results of the LC population and the HC population, built several classification models, and finally selected an optimal one.
2. Materials and methods
2.1. Volunteer recruitment and sample collection
A total of 510 healthy subjects and 248 patients with LC were recruited; the inclusion and exclusion criteria are listed in Section S1 in Appendix A. Feces of the first defecation in the morning were collected in a clean plastic bag tucked into a disposable plastic bowl, and after defecation, the plastic bag was fastened tightly to avoid urine contamination. To avoid interference factors, such as food residues, the softer part of the fresh feces was selected and loaded into cryopreservation tubes within half an hour. A DNA stabilizer (Invitek, Germany) was added, and the cryopreservation tubes were numbered before storage at -80 °C for preservation. On the morning of sample collection, a blood sample was collected from the patients and was subjected to routine blood tests, blood biochemistry tests, C-reactive protein (CRP) tests, and tests for other indicators. The basic information of all volunteers (Table 1), including age, sex, and body mass index (BMI), was registered. Volunteers with unqualified fecal samples or incomplete basic information were excluded; 500 healthy people and 244 LC patients were included. All of the work was performed according to guidelines approved by the Research Ethics Committee of the First Affiliated Hospital, College of Medicine, Zhejiang University (No. 2019-1026 and 2022-874).
2.2. qPCR assessment for major gut bacterial species
Microbial DNA was extracted from the feces of the 744 volunteers with a MegaBio soil/fecal genomic DNA purification kit (Bioer, Inc., China). The specific steps are described in Section S2 in Appendix A. The concentration of total DNA of fecal microorganisms (Ct) in each sample DNA eluate was detected using Nanodrop One (Thermo Fisher Scientific, USA). Ten predominant bacterial populations in the intestine were detected by real-time qPCR. Primers were synthesized by GenScript (China), and the primer information is listed in Table S1 in Appendix A. A ViiA™ PCR system (Applied Biosystems, Inc., USA) was used to conduct qPCR on 744 fecal bacterial DNA samples with a reaction volume of 20 μL, including 10 μL of SYBR Green PCR Master Mix (Zhongnuo Gene, Inc., China), 8 μL of primer pairs (0.2-0.6 μmol·L−1), and 2 μL of template DNA or 2 μL of distilled water (negative control). The reaction conditions are listed in Table S2 in Appendix A. Each reaction was performed in triplicate, and the difference in cycle threshold (ΔCT) between repeats was required to be less than 0.5. Plasmid DNA standards containing the corresponding amplification fragment of each primer group were serially diluted and amplified with the bacterial DNA templates in the same PCR plate. The copies of the target bacteria in the DNA template were determined by comparison with the standard curve obtained from amplification of the corresponding bacterial DNA standards. The final concentration of the target bacteria was obtained by dividing the concentration of the target bacteria in the DNA template (N) by both the total DNA concentration of fecal microorganisms in each sample DNA eluate (Ct) and the volume of the template (V). The unit of the concentration is copies per nanogram total DNA of fecal microorganisms, hereinafter referred to as copies·ng−1. The formula is listed in Section S4 in Appendix A.
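The quantification steps described above can be sketched as follows. This is a minimal illustration only, assuming a log-linear standard curve (cycle threshold versus log10 copies) fitted by ordinary least squares and a normalization of the form N / (Ct × V); the exact formula used in this study is the one in Section S4 in Appendix A. The function names, the dilution series, and the sample values are hypothetical, and interpreting the ΔCT criterion as the spread (maximum minus minimum) of the triplicate is also an assumption.

```python
def fit_standard_curve(log10_copies, ct_values):
    """Least-squares fit of CT = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to get copies in the template."""
    return 10 ** ((ct - intercept) / slope)


def copies_per_ng(copies_in_template, dna_conc_ng_per_ul, template_volume_ul):
    """Normalize copies (N) by total DNA concentration (Ct) and volume (V)."""
    return copies_in_template / (dna_conc_ng_per_ul * template_volume_ul)


def triplicate_ok(cts, max_spread=0.5):
    """Assumed reading of the replicate criterion: spread of CT below 0.5."""
    return max(cts) - min(cts) < max_spread


# Hypothetical plasmid dilution series (10^3 to 10^7 copies) and its CT values.
standards_log10 = [3, 4, 5, 6, 7]
standards_ct = [30.0, 26.7, 23.4, 20.1, 16.8]
slope, intercept = fit_standard_curve(standards_log10, standards_ct)

# Hypothetical sample: triplicate CT values, 20 ng/uL total DNA, 2 uL template.
sample_cts = [23.3, 23.4, 23.5]
if triplicate_ok(sample_cts):
    mean_ct = sum(sample_cts) / len(sample_cts)
    n_copies = copies_from_ct(mean_ct, slope, intercept)
    print(copies_per_ng(n_copies, 20.0, 2.0))
```

Running the standards on the same plate as the samples, as described above, means the fitted slope and intercept absorb plate-to-plate variation in amplification efficiency.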
2.3. Statistical analysis
All statistical analyses and plotting in this study were performed using R (version 4.1.2), RStudio (version 2022.07.1+554), and Origin 2021 (version 9.95). The ggplot2 package (version 3.3.5) was used for plotting. The heatmap for correlation analysis was drawn with the corrplot package (version 0.92). The heatmap for comparing the quantity of the ten bacterial species was drawn with the pheatmap package (version 1.0.12). Data preprocessing was completed with the dplyr package (version 1.0.8). Outliers were identified with the box-plot method: values less than Q1 − 1.5 × interquartile range (IQR) or greater than Q3 + 1.5 × IQR were determined to be outliers and were removed from the corresponding data groups. The reference range (bilateral) was calculated by taking a 95% confidence interval (95% CI) after removing outliers from the data of the healthy people. The normal distribution method was adopted for normally distributed data, and the lower and upper limits were mean ± (1.96 × SD) (SD: standard deviation). The percentile method was adopted for nonnormally distributed data, and the lower and upper limits were P2.5 and P97.5, respectively. The Shapiro-Wilk test and Kolmogorov-Smirnov test were used to test the normality of the distributions of the ten bacterial species. The Wilcoxon rank sum test was used for the analysis of differences between two groups. The Kruskal-Wallis test was used to analyze the differences in nonnormal data among multiple groups. Spearman rank correlation analysis was used to analyze the correlation between the liver function indices and the ten associated bacterial species.
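The outlier rule and the two reference-range methods described above can be illustrated with a short sketch. It is written in Python rather than R for self-containment and is not the analysis script of this study. The function names are hypothetical, the quartile convention (the "inclusive" method of the Python standard library) is an assumption, and the normality decision is passed in as a flag because the Shapiro-Wilk and Kolmogorov-Smirnov tests are outside the scope of the sketch.

```python
import statistics


def remove_outliers_iqr(values):
    """Drop values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]


def reference_range(values, normal):
    """Two-sided 95% reference range.

    normal=True  -> mean +/- 1.96 * SD (normal distribution method)
    normal=False -> P2.5 and P97.5 (percentile method)
    """
    if normal:
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        return mean - 1.96 * sd, mean + 1.96 * sd
    # n=40 yields cut points every 2.5%; the first is P2.5, the last P97.5.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Hypothetical log10 concentrations with one extreme value.
data = [5.1, 5.3, 5.6, 5.8, 6.0, 6.1, 6.3, 6.4, 6.6, 6.9, 11.5]
cleaned = remove_outliers_iqr(data)
print(len(data) - len(cleaned), "outlier(s) removed")
print(reference_range(cleaned, normal=False))
```

Quartile definitions differ slightly between statistical packages, so values close to a fence may be classified differently depending on the convention used.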
2.4. Machine learning
Machine learning algorithms were used to build models for distinguishing between cirrhotic and noncirrhotic samples. A total of 744 clinical samples (244 cirrhosis samples and 500 HC samples) from the same sample collection area were used as the dataset for constructing the model. Random sampling was used to divide the data into training data and test data at a ratio of 75:25. There were 408 training data points (183 cirrhosis data points and 225 noncirrhosis data points) and 136 test data points (61 cirrhosis data points and 75 noncirrhosis data points).
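A 75:25 random split of this kind can be sketched as below. The text states only that random sampling was used; performing the split within each class (stratification) is an assumption, suggested by the fact that the reported counts correspond to 75% of each class. The function name and seed are hypothetical, and the class sizes in the usage lines are taken from the split counts reported above (183 + 61 and 225 + 75).

```python
import random


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within the class
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Class sizes implied by the reported training and test counts.
labels = ["LC"] * 244 + ["HC"] * 300
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the proportion of cirrhosis samples the same in the training and test data, which makes the test AUC, sensitivity, and specificity easier to compare with the cross-validation results.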
Through pretest screening, six machine learning methods were finally used to classify and model the data: random forest (RF) [33], gradient boosting machine (GBM) [34], AdaBoost [35], XGBoost [36], and support vector machines with a polynomial kernel (SVM_poly) and a Gaussian kernel (SVM_Gauss) [37], with the first four methods belonging to ensemble learning [38].
The content of the ten predominant bacterial species in the clinical samples and patient sex were selected as features for model training [7]. An RF model with tenfold cross-validation was used to explore the importance of these features, and the mean decrease in the Gini index was employed as a metric to determine the importance of these features and their contribution to the model.
Repeated tenfold cross-validation (ten repeats) was applied to build and verify the models, and hyperparameter tuning was applied to all six models. Through hyperparameter tuning, the optimal set of hyperparameter values for each of the six algorithms was obtained, and the optimal model for each algorithm was established according to these values. The hyperparameter optimization results for SVM_poly and the other models are shown in Appendix A. Finally, the area under the curve (AUC) was used to evaluate the quality of the trained models, and the model with the highest average AUC value was selected as the optimal model. When overfitting occurred, the suboptimal model was chosen as the final result. The corresponding AUC value, sensitivity, and specificity of each model were obtained from the test data to validate the final models generated by the six machine learning algorithms.
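The three evaluation metrics named above can be illustrated with a small, self-contained sketch. It is not the caret workflow used in this study: the AUC is computed here from its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half), the 0.5 decision threshold is an assumption, and the function names and example scores are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities of cirrhosis (1 = LC, 0 = HC).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auc_score(y_true, y_score))
print(sensitivity_specificity(y_true, y_score))
```

The AUC does not depend on a decision threshold, whereas sensitivity and specificity do, which is why two models with similar AUC values can differ in their numbers of false-positives and false-negatives.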
The machine learning analysis was implemented in R under the caret machine learning framework (version 6.0-93). RF was implemented using the randomForest package (version 4.7-1.1). GBM was implemented using the gbm package (version 2.1.8.1). AdaBoost was implemented using the adabag package (version 4.2). XGBoost was implemented using the xgboost package (version 1.6.0.1). The two support vector machine methods were implemented using the kernlab package (version 0.9-31).
2.5. Ethics
This study was approved by the Research Ethics Committee of the First Affiliated Hospital, College of Medicine, Zhejiang University (No. 2019-1026 and 2022-874). The patients/participants provided written informed consent to participate in this study. The research protocol complies with the ethical guidelines of the 1975 Declaration of Helsinki.
3. Results
3.1. The predominant gut microbiota in healthy humans
A total of 500 healthy humans (275 males and 225 females) were included in this study. Healthy subjects were recruited based on strict exclusion criteria. After quality control, ten predominant gut bacterial species were detected by qPCR, and all data were statistically analyzed and processed to find the reference range for healthy people (Table 2). Furthermore, given that the structure and abundance of the gut microbiota are altered in people of different ages, we also analyzed and found the reference range for people of different ages (Table S3 in Appendix A). Healthy people of different ages were divided into five groups at intervals of 20 years: 0-20, 20-40, 40-60, 60-80, and 80-100 years. All species except E. rectale showed significant differences among the five age groups. Among the species with significant differences, the abundance of Atopobium had the most significant change (P = 5.7 × 10^−8), followed by Enterococcus (P = 1.9 × 10^−7); E. rectale did not show significant differences with aging (P = 0.081).
3.2. Pairwise ratios are potential indicators of gut microbial homeostasis
Studies have reported that the abundance of Firmicutes decreases and that of Bacteroidetes increases in almost all disease situations [39]. Therefore, we hypothesized that pairwise ratios are potential indicators for evaluating the balance of the gut microbiota. We compared the concentrations of the above ten bacterial species as logarithmic values (Fig. 1(a)). We then calculated the pairwise ratios of the ten species and plotted the trends for a total of 45 ratios. Interestingly, we found that the B/E ratio showed a typical trend of first decreasing and then increasing with age, displaying a U-shaped curve (P = 0.021). However, more studies are still needed to confirm this pattern. Moreover, the Enterococcus/Enterobacteriaceae (Ec/E) ratio, an indicator that significantly increases in critical patients [32], showed an increasing trend with age (P = 0.00025). The ratio of the two bacterial species with the largest difference between groups was C. leptum/Bacteroides (P = 2.3 × 10^−7), and its trend with age was also an obvious U-shaped curve that decreased first and then increased (Figs. 1(b) and (c)).
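The 45 ratios correspond to all unordered pairs of the ten bacterial groups (10 × 9 / 2 = 45). A minimal sketch of this enumeration is given below. The function name and the example concentrations are hypothetical, and the ordering of the list, which determines which group is the numerator, is an assumption chosen only so that the ratios named in the text (for example, B/E and Ec/E) appear in the same direction.

```python
from itertools import combinations

# Ten bacterial groups measured by qPCR; list order fixes the numerator.
GROUPS = [
    "Lactobacillus", "Bifidobacterium", "Enterococcus", "Enterobacteriaceae",
    "C. leptum", "Bacteroides", "C. butyricum", "E. rectale",
    "F. prausnitzii", "Atopobium",
]


def pairwise_ratios(conc):
    """Return {"A/B": conc[A] / conc[B]} for every unordered pair of groups.

    conc maps each group name to its concentration (copies per ng total DNA).
    Concentrations are assumed to be nonzero.
    """
    return {f"{a}/{b}": conc[a] / conc[b] for a, b in combinations(GROUPS, 2)}


# Hypothetical concentrations for one sample.
sample = {name: 10.0 ** (3 + i % 4) for i, name in enumerate(GROUPS)}
ratios = pairwise_ratios(sample)
print(len(ratios))
print(ratios["Bifidobacterium/Enterobacteriaceae"])
```

Because absolute concentrations in feces span several orders of magnitude, ratios of this kind are often easier to compare on a logarithmic scale.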
3.3. An imbalanced gut microbiota in LC patients
To further verify whether the above ten predominant bacterial species may be used as indicators for microbial homeostasis, we collected and measured 244 fecal samples from LC patients. Comparison of the results of healthy people with those of LC patients revealed significant differences in seven bacterial species (Fig. 2(a)); in contrast, F. prausnitzii, C. leptum, and Atopobium showed no differences between the two populations. Among the seven bacterial species, Enterococcus, E. rectale, and Bacteroides exhibited the largest differences (P = 2.50 × 10^−12, 3.47 × 10^−10, and 6.57 × 10^−10, respectively). These were followed by C. butyricum (P = 1.76 × 10^−5), Lactobacillus (P = 1.13 × 10^−3), Enterobacteriaceae (P = 5.59 × 10^−3), and Bifidobacterium (P = 9.71 × 10^−3), which is consistent with the results visualized in the heatmap (Fig. 2(b)). In addition, concentrations of Bifidobacterium, E. rectale, Enterobacteriaceae, and C. butyricum were relatively stable in both healthy individuals and cirrhosis patients, with few outliers. The numbers of outliers of the four bacterial species in cirrhosis patients and healthy people were (1/244, 1/500), (0/244, 2/500), (4/244, 1/500), and (2/244, 3/500), respectively.
Furthermore, compared to healthy individuals, the ratios of Ec/E. rectale (P = 3.08 × 10^−24), C. leptum/Bacteroides (P = 2.14 × 10^−18), and C. butyricum/E. rectale (P = 2.54 × 10^−18) were significantly different in cirrhosis patients, indicating that the gut microbiota was imbalanced in cirrhosis patients (Fig. 2(c)). These results are consistent with our previous work [15]. However, the B/E ratio remained almost unchanged in this work.
3.4. The gut microbiota is associated with the severity of LC
Correlation analysis of the gut microbiota and liver function indicators revealed that the serum levels of alanine aminotransferase (ALT), albumin (ALB), direct bilirubin (DBIL), triglyceride (TG), and total bile acid (TBA) positively correlated with the abundance of Bacteroides but negatively correlated with that of the other nine bacterial species. The serum level of alkaline phosphatase (ALP) negatively correlated with the abundance of Bacteroides but positively correlated with that of the other nine bacterial species. Negative correlations between the abundance of Enterobacteriaceae and the B/E ratio were also found (Fig. 3).
3.5. Multiple machine learning models to distinguish and predict healthy people and patients with LC
Furthermore, we constructed multiple machine learning models to analyze the above results of the gut microbiota and to distinguish HC subjects from LC patients (Fig. 4(a)). By tuning the hyperparameters of the six models, the optimal combination of hyperparameters was obtained. The relationship between the hyperparameters of the SVM_poly model based on the polynomial kernel and the model performance is shown in Fig. 4(b). From the figure, the optimal combination of the hyperparameters is polynomial degree = 2, scale = 0.100, and C = 0.75. Relationship diagrams between the estimates of performance and the tuning parameters of the other five models are shown in Figs. S1-S3 in Appendix A.
The training results of the six models showed the XGBoost model to have the best training results, with an average AUC value reaching 0.9376 (95% CI, 0.9158-0.9595). This was followed by SVM_Gauss and SVM_poly, with average AUCs reaching 0.9050 (95% CI, 0.8754-0.9346) and 0.9040 (95% CI, 0.8749-0.9331), respectively. The worst training result was observed for the RF model, but the average AUC also reached 0.8746 (95% CI, 0.8414-0.9078) (Fig. 4(c)).
Then, the machine learning model was also used to analyze 50 real clinical samples. The test results (Table 3) show that the six models achieved good prediction results, among which RF achieved the best AUC value, reaching 0.8776 (95% CI, 0.8159-0.9393). This was followed by XGBoost and AdaBoost, with AUC values reaching 0.8726 (95% CI, 0.8097-0.9355) and 0.8630 (95% CI, 0.7969-0.9290), respectively. The three models with the highest sensitivity were RF, XGBoost, and AdaBoost, with values of 0.8361, 0.7869, and 0.7869, respectively. The three models with the highest specificity were XGBoost, AdaBoost, and GBM, with values of 0.8667, 0.8533, and 0.8400, respectively (Fig. 4(c) and Table S4 in Appendix A).
4. Discussion
Gut microbiota dysbiosis is an abnormal change in the number, proportion, and species of the normal microbiota in the gut that affects human health, leading to a series of abnormal physiological and pathological phenomena [40], [41]. Homeostasis between the gut microbiota and the host immune system is compromised when the former is imbalanced [42]. The gut microbiota changes significantly during human aging [43]. Zhang et al. [7] found consistent changes in gut microbiota during aging in humans, as characterized by increased α diversity. The abundance of multiple members of the oral microbiota, Enterobacteriaceae and Clostridia, which are short-chain fatty acid (SCFA) producers, increased with age. They also found that several Bifidobacterium species (B. breve, B. bifidum, B. longum, and B. adolescentis) negatively correlated with age. Biagi et al. [44] revealed that the cumulative abundance of symbiotic bacterial taxa (mostly belonging to the dominant Ruminococcaceae, Lachnospiraceae, and Bacteroidaceae families) decreased with age. Nevertheless, health-related Akkermansia, Bifidobacterium, and Christensenellaceae were enriched in elderly individuals, especially semisupercentenarians (105-109 years old). Another study also found a decrease in the abundance of core genera in the gut microbiota of healthy aging individuals, particularly Bacteroides, which is associated with longer life expectancy [45]. While different studies have identified gut microbiota changes during human aging, there is no consensus on the changing patterns of the gut microbiome during aging. Therefore, in this study, we aimed to establish the reference range for all ages as well as for different age groups of healthy people based on the results of a large cohort and to determine underlying characteristics during healthy aging.
Based on our previous work, we detected ten predominant gut bacterial species in each healthy individual by qPCR. The ten predominant gut bacterial species selected for this study were found to play an important role in maintaining intestinal homeostasis. For instance, Lactobacillus and Bifidobacterium are important for regulating immunity and maintaining gut barrier function [46], [47], [48], and F. prausnitzii, C. butyricum, C. leptum, and E. rectale can produce SCFAs, which are important in inhibiting the overgrowth of opportunistic pathogens, maintaining the integrity of the intestinal epithelial barrier, and enhancing immunity [49], [50], [51]. Our results revealed that the concentrations of the ten predominant gut bacterial species changed differently with age. The healthy population was divided into five groups at intervals of 20 years (0-20, 20-40, 40-60, 60-80, and 80-100). Nine of the ten predominant bacterial species showed some variation in all five age groups. Atopobium showed the most significant difference (P = 5.7 × 10^−8); this was followed by Enterococcus (P = 1.9 × 10^−7). In addition, except for Bacteroides, the concentration of which increased and then decreased with age, the other bacteria showed a general trend of decreasing and then increasing. Most large-sample studies have been based on 16S rRNA or metagenomic sequencing, but one of the main limitations of sequencing is that taxa can only be assigned according to the sequence of a single region in the bacterial genome, and only relative abundance results are obtained, which makes it difficult to generate a stable and reliable reference value [31], [52]. In contrast, qPCR can be used to quantitatively detect bacterial concentration by calibration with known concentrations of standard substances, which has the characteristics of good repeatability, rapid results, and simple operation. Timely detection of intestinal microbial dysbiosis is helpful for clinicians to achieve early diagnosis and treatment and to improve prognosis [32], [53].
Our previous study analyzed the changes in the gut microbiota of patients with LC. We found that at the genus level, Bacteroides was the dominant phylotype in both groups, but its abundance was significantly decreased in the LC group [15]. Of the remaining genera, Veillonella, Streptococcus, Clostridium, and Prevotella were enriched in the LC group, while Eubacterium and Alistipes were dominant in the HC subjects. Of the species that decreased the most in abundance in the LC group, twelve were Bacteroidetes and seven were Firmicutes, specifically from the order Clostridiales. However, the cost of metagenomic sequencing is high, and the amount of data is large, which requires a large amount of computing resources to perform analysis. Therefore, in this study, we aimed to characterize the changes in the gut microbiota of cirrhosis patients using a smaller number of bacteria. Moreover, we also constructed multiple machine learning models to further analyze the microecological results. Interestingly, we found that the prediction results of the classification models built by four algorithms under the framework of ensemble learning were better than those of SVM and some other machine learning algorithms. Among them, the classification model of RF achieved the highest AUC value and sensitivity, and the classification model of XGBoost achieved the second highest AUC value and the highest specificity, which were relatively better than those of the other four models. Comparing the two classification models, the RF model had a higher AUC value and sensitivity and better identified patients with cirrhosis or suspected cirrhosis, but there were more false-positives. The XGBoost model showed higher specificity and was better able to identify healthy people. However, there were more false-negatives, which may result in unrecognized cirrhosis patients failing to visit the hospital in a timely manner, delaying diagnosis and treatment. The RF model with the highest sensitivity was the best choice. Although the RF model may cause some healthy people to be misdiagnosed, the benefits of finding more patients with true cirrhosis in a timely manner outweigh the risks. However, geography and location have strong influences on the human gut microbiome, and geographical differences limit the application of the reference range of the healthy gut microbiome and disease models.
To verify the prevalence of microbiota differences between healthy or disease states, standardized experimental protocols, regional study designs, and extensive sampling are needed, and geography as a feature needs to be added to the model. Different predictive models need to be trained for different geographical regions [54], [55]. Considering the influence of region on the human microbiota, samples from different regions will be collected in the future to re-establish the classification model. We will adopt two strategies: ① add region as an additional feature to the model and ② establish classification models according to different regions relatively independently and select a better classification model according to the results.
The combination of qPCR and machine learning is an easy, rapid, stable, and reliable method to test and analyze the gut microbiome of the human body. With the inclusion of new samples, our healthy population cohort will be subsequently expanded, the reference range for the healthy population results will be updated as the sample size expands, and the capability of relevant machine learning models to predict illnesses will be improved through continuous learning and prediction, which may contribute to clinical work.
5. Conclusions
Ten predominant gut bacterial species that characterize the whole microbiome in the human gut were identified from a large-sample Chinese cohort. We established the reference ranges of these ten predominant gut bacterial groups by detecting their concentrations by qPCR and discovered the changing patterns of the ten bacterial groups with aging and disease. In addition, we utilized machine learning algorithms to extract differential information from the detection results and built and selected a reliable classification model for predicting LC. This study highlights the need to describe and predict changes in the gut microbiota of a healthy Chinese population using a small amount of information. The healthy reference range established here could be widely used to predict and describe intestinal microecological dysbiosis in various diseases. However, more new theoretical models and clinical practice are still needed in future work.
the National Key Research and Development Program of China(2018YFC2000500)
the Fundamental Research Funds for the Central Universities(2022ZFJH003)
the Independent Task of State Key Laboratory for Diagnosis and Treatment of Infectious Diseases(2022zz22)
the National Natural Science Foundation of China(81703430)
the National Natural Science Foundation of China(32170058)
the National Natural Science Foundation of China(82200994)
the Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences(2019-I2M-5-045)
the Research Project of Jinan Microecological Biomedicine Shandong Laboratory(JNL-2022051B)