1. Introduction
Diverse biological risk factors (BRFs), including pathogenic microorganisms, antibiotic resistance genes (ARGs), and virulence factors (VFs), have been frequently detected in engineered water systems such as wastewater treatment plants (WWTPs) [1],
[2],
[3]. Even though WWTP effluents are usually disinfected, disinfection-resistant pathogens are discharged to natural water bodies or transmitted to atmosphere environments, causing health risks to human communities [4],
[5],
[6]. For example, the bacterial strains Pseudomonas aeruginosa PAO1 and
Escherichia coli (
E. coli) CGMCC1.3373 are chlorine-resistant
[7], and tetracycline-resistant
Enterobacter-1 shows high tolerance to ultraviolet disinfection [8]. The tracing of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in sewer networks and WWTP effluents during the coronavirus disease 2019 (COVID-19) pandemic further highlighted the nonnegligible biological threat in engineered water environments [9],
[7]. As there is an increasing demand for the use of reclaimed wastewater (i.e., treated effluent water) as a secondary water resource to address the water crisis [10], there is an urgent need to understand the occurrence and dispersal patterns of BRFs in global engineered water systems to alleviate biosafety concerns.
Accurate identification of BRFs from environmental samples is a prerequisite for revealing a real picture of BRF diversity in water systems. Conventional methods, including cultivation and quantitative polymerase chain reaction (qPCR) approaches, such SARS-CoV-2 in sewage, are limited by their ability to target only a specific pathogen or a small spectrum of BRFs in samples, missing the majority of BRFs [11],
[12]. Untargeted methods analyze either 16S ribosomal RNA (rRNA) gene-based or metagenome-based sequencing data. For example, the pipeline for Multiple Bacterial Pathogen Detection and the Mypathogen database contain pipelines for bacterial pathogen detection from 16S rRNA gene sequencing data [13],
[14]. However, 16S-based approaches are limited in taxonomic resolution and often cannot classify pathogens to the species level
[15],
[16]. Moreover, viral and eukaryotic pathogens will be missed by such approaches.
To date, online pipelines and platforms for identifying BRFs from metagenome data are still scarce. The metagenome analysis tools of the PathoSystems Resource Integration Center identify ARGs and VFs by read mapping using k-mer alignment (KMA)
[17],
[18], while the ARGs online analysis pipeline (ARGs-OAP) v3.0 annotates ARGs from metagenomic reads by mapping with Basic Local Alignment Search Tool X (BLASTX) [19]. However, platforms specifically designed for analyzing and illustrating the geographic distribution of BRFs, including pathogenic microbes, ARGs, and VFs, in global engineered water systems based on metagenomic data are still lacking. In this study, we constructed a multifunctional platform, the Global Wastewater Pathogen Database (GWPD), to analyze and illustrate the vast diversity of BRFs in global engineered water systems. Its metagenome-based searching approach employs optimized bioinformatic tools to simultaneously identify bacterial, viral, and eukaryotic pathogens, along with ARGs and VFs. The web implementation of this database provides an interactive visualization of the distribution of these BRFs, an information search tool, and an analytical pipeline for annotating user-provided metagenomic data. GWPD will support the surveillance of BRFs, especially those related to wastewater-based epidemiology, and ultimately facilitate risk control and biosafety related to water-environment-health in the context of “One Health” [20]. GWPD is freely available at http://gwpd.hitsz.edu.cn. It has already garnered more than 150 registered users globally, with more than 30 000 accesses by Jan 2024.
2. Methods
2.1. The construction of a knowledge dictionary of BRFs
We collected information on 1908 bacterials, virals, and eukaryotic pathogens related to 157 hosts and 889 diseases available from the Nation Center for the Biotechnology Information (NCBI), the Centers for Disease Control and Prevention (CDC), the Chinese Centers for Disease Control and Prevention, the International Committee on Taxonomy of Viruses (ICTV), and the Pathogen–Host Interactions Database (PHI-base) [21]. The potential hosts and infectious diseases, reflecting the possible risks of pathogens, are shown in Table S1 in Appendix A.
In addition, information on 2373 ARGs belonging to 47 ARG classes (
Fig. 1) was obtained from the Comprehensive Antibiotic Resistance Database (CARD) [22], and the class of each ARG is listed in Table S2
in Appendix A. Moreover, 541 VFs were collected from the Virulence Factor Database (VFDB) [23]. These information elements construct the knowledge dictionary of GWPD.
2.2. Metagenomic data collection
For data retrieval, relevant publications on WWTP-related metagenomic studies were retrieved from the Web of Science using the following search formula: topics (TS) = (pathogen OR pathogenic microorganism OR pathogenic bacteria OR pathogenic virus OR pathogenic fungi OR microbial OR microorganism) AND TS = (ARG OR resistance gene OR antibiotic resistance gene OR antimicrobial resistance OR resistome MGE OR mobile genetic element OR resistant) AND TS = (VF OR VF gene OR VFG OR virulence factor) AND TS = (metagenome OR metagenomics OR metagenomic sequencing OR high-throughput OR NGS OR next-generation sequencing OR shot-gun OR Illumina OR short reads OR metagenome assembly) AND TS = (wastewater treatment plant OR wastewater OR WWTPs OR full scale OR activated sludge OR influent OR effluent OR disinfect OR receiving water OR natural water OR river OR lake OR ocean) AND TS = (function OR functional gene OR functional microorganism). The data filtering criteria for the identified projects were as follows: ① Sequencing method: shotgun Illumina sequencing, with a sequencing depth of no less than 10 GB; ② sampling date: within the last 15 years (May 2009–Dec 2023); and ③ sample types: from engineered water systems, including wastewater sewer networks, influent, anoxic activated sludge, oxic activated sludge, effluent, or receiving water/natural water. Metadata of the relevant samples were obtained, including the sample types and geographic information (i.e., location, latitude, and longitude).
The sequencing datasets were downloaded from the NCBI Sequence Read Archive, which includes the following projects: PRJNA506137, PRJEB39738, PRJEB37889, PRJEB28815, PRJNA484416, PRJNA688020, PRJNA672704, PRJNA230567, PRJEB28796, PRJEB37137, PRJNA526679, PRJNA428383, PRJNA524094, PRJNA556302, PRJEB28033, PRJNA542960, PRJNA473136, PRJNA549409, PRJNA420682, PRJNA406858, PRJNA432264, PRJNA438174, SRP107015, PRJNA505617, PRJEB13831, PRJNA524456, SRP127721, PRJNA377521, SRP159184, PRJNA434744, PRJNA400857, PRJNA342017, PRJNA288131, SRP067931, PRJEB14051, PRJNA529711, PRJNA507800, PRJNA482680, PRJEB36775, PRJNA661613, PRJNA414521, PRJNA577390, and PRJNA419931.
2.3. The pipeline for analyzing metagenome data
The data processing procedure is shown in
Fig. 2. The raw reads were trimmed and quality-filtered by using flexbar (v3.5.0)
[24] and skewer (v0.2.2)
[25] with the parameters of average
Q-score higher than 30 and read length higher than 75 nucleotides (nt). The community composition was classified by using MetaPhlAn (v3.0)
[26]. The composition of the pathogenic microbes was analyzed by MegaPath (v2.0) [27] with the default parameters. Information on pathogenic microbes was generated from the MegaPath results according to our knowledge dictionary by species name matching.For the annotation of ARGs and VFs, clean reads were assembled by using MEGAHIT [28] with
k-mer parameters of 21, 29, 39, 59, 79, 99, 119, and 141, and assemblies longer than 2 kB were retained for further analysis. Putative coding sequences were predicted by using Prodigal (v2.6.3, meta)
[29] and then annotated for ARGs by using DeepARG
[30] with a probability threshold of 0.8 and for VFs by using Diamond BlastP [31] against the VFDB database
[23] with an e-value cutoff of 1 × 10
−5. For ARG and VF abundance estimation, clean reads were mapped to the ARG- and VF-containing contigs by using BWA-MEM (v2.3.2) [18]. Reads with an aligned region lower than 90% of their length and with an identity lower than 99% were removed by using Bamm. The mapped read numbers were counted by using BBMap
[32]. The number of copies per genome of the ARGs and VFs in each sample were calculated via normalization to the average coverage of conserved single-copy genes, namely, COG0048, COG0049, COG0087, COG0088, COG0091, COG0093, COG0094, COG0096, COG0097, COG0099, COG0100, COG0102, COG0184, COG0186, COG0256, and COG0522. 2.4. Database reconstruction and web implementation
The PostgreSQL database was adopted to store the data from the knowledge dictionary and annotation results of the pathogenic microorganisms, ARGs, and VFs in the samples. PostGIS (a plugin in PostgreSQL) was used to save the geographic information. Both exact and regular expression matching were used in the Searching function by utilizing Axios.
Python was utilized to implement the backend of the website. The web interface was implemented via code running in Vue.js, Cascading Style Sheets (CSS), Hyper Text Markup Language (HTML), and JavaScript by using Elements. Echarts was used for heatmap and stacked chart visualization. A fragmented upload approach was utilized in the annotation module to stage and record the number of slices to achieve breakpoint resumption.
Task submission triggers file transfer by the Open Geographic Modeling and Simulation (OpenGMS) model and initiates an asynchronous computation task in the Celery assignment queue [33], which reflects the task status. The results are saved in the PostgreSQL database after the computation. The web service has been tested in the web browsers Google Chrome (v117.0.5938.89), Microsoft Edge (v113.0.1774.35), and Mozilla Firefox (v18.05). 3. Results
3.1. Database statistics
To construct the knowledge dictionary (
Fig. 2), which serves as the reference database for annotation, we searched and sorted 1908 pathogenic microorganisms in total, including bacteria, fungi, viruses, and metazoa, from public data sources (e.g., PHI-base
[21], CDC, NCBI, and ICTV), of which 991 are bacteria, 514 are fungi, 399 are viruses, and four are metazoa. These pathogens can potentially infect 157 host species and result in 889 diseases. Additionally, GWPD harbors information on 2373 ARGs belonging to 47 ARG classes from the CARD [22] and 541 VFs from the VFDB
[23] (
Tables 1 and
Table 2). The information on all these BRFs has been integrated into the knowledge dictionary (
Fig. 2).
We then collected 1302 metagenomic samples with metadata from 186 cities in 68 countries on six continents, including 413 from Asia, 406 from Europe, and 320 from North America, followed by South America (96), Africa (56), and Oceania (11). Specifically, the datasets include 770 samples from 21 developed countries, 503 from 39 developing countries, and 29 from eight low-income countries. The largest quantity of samples was distributed in the US (307), followed by China (285), Germany (102), Argentina (86), and Sweden (73) (Fig. S1
in Appendix A). We noticed that samples from low-income countries are underrepresented in the current version of the GWPD. All of the samples were assigned to six types: sewer network, WWTP influent, anoxic activated sludge, oxic activated sludge, WWTP effluent, and receiving/natural water (Fig. S2 in Appendix A). Specifically, among the effluent samples, seven were collected post-disinfection by different technologies, including four by ultraviolet light, two by ozone, and one by chlorine. A total of 476 pathogens, 442 ARGs belonging to 20 classes, and 246 VFs were identified in the collected samples (
Table 1).
3.2. Interactive illustration of the global distribution of BRFs
Interactive interfaces are available on GWPD for users to visualize the global distribution of BRFs in engineered water systems (
Fig. 3 and Fig. S3 in Appendix A). Specifically, the interactive homepage presents the 30 most abundant BRFs on the world map and the map of China (Fig. S3). The font size of the words in the word cloud is proportional to the relative abundance of the BRFs. A heatmap will be generated when users click the names of the BRFs in the word cloud, showing the distribution and abundance of the specific BRFs across countries worldwide or provinces nationally in China (Fig. S3). Potential hosts and diseases associated with the top 30 pathogens are listed in Table S1.
Moreover, the numbers of cities, WWTPs, samples, and identified pathogenic microorganisms are displayed by clicking a country/region on the world map or a province on the map of China. The homepage can guide users to a specific webpage by selecting a specific country/region or a Chinese province on the map. The webpage shows the composition of the ten most abundant BRFs for each sample type collected in a province/state of a country in a city within a Chinese province (
Fig. 3).
Overall, the dominant pathogens vary across continents (Fig. S4 in Appendix A).
Streptococcus suis (
S. suis),
E. coli, and
Klebsiella pneumoniae are predominant in Africa and Europe, while
E. coli,
Pseudomonas mendocina (
P. aeruginosa), and
Pseudomonas alcaligenes (
P. alcaligenes) are abundant in Asia. In North America, the dominant pathogens are
S. suis,
E. coli, and
Bacteroides uniformis. In Oceania,
S. suis,
E. coli, and
Acinetobacter hemolyticus are predominant, while in South America,
E. coli has the highest abundance, followed by
P. aeruginosa and
Gordonia bronchialis. Moreover, ARGs commonly found and prevalent in wastewater treatment processes are prevalent in GWPD [34],
[35]. Six continents share two highly abundant ARGs,
ksgA (an aminoglycoside) and
bacA (a peptide), while other dominant ARGs differ among continents (Fig. S4 in Appendix A). For instance,
aadA is the most abundant ARG in Africa, followed by OXA and
sul1.
ksgA,
ompR, and
bacA are predominant in Asia, while
ksgA,
kdpE, and
bacA are highly abundant in Europe. The most abundant ARGs in North America, Oceania, and South America are
ksgA,
msrE, and OXA, respectively. Finally, the dominant VFs across continents exhibit a highly uniform pattern, including EF-Tu, Type IV pili, flagella, capsules, LPS, Hsp60, and HSI-I T6SS-secreted effectors, indicating the probable prevalence of these VFs in global engineered water systems (Fig. S4).
3.3. Information browsing and retrieval
Two methods are used by users to retrieve GWPD data: searching/browsing by location and searching/browsing by the names of pathogenic microorganisms (
Fig. 4). When searching/browsing by location, information on the province/state or city of interest is shown, including information on the identified pathogenic microorganisms, ARGs, and VFs, as well as the potential hosts and diseases of these BRFs according to the GWPD knowledge dictionary. Alternatively, users can search by pathogen name, where the specific pathogens are shown, including potential hosts, diseases, ARGs, and associated VFs.3.4. Online service for automatic metagenome annotation
The annotation function is designed to provide a one-step analysis service for BRF annotation for researchers who lack advanced bioinformatics skills (
Fig. 5(a)). After registration and login, Illumina paired-end metagenomic sequencing reads can be uploaded with one click, and the analytical pipeline will then be automatically triggered (
Figs. 5(b) and (c)). Detailed notes, such as the naming scheme, file formats, and sizes, are introduced in the example. Currently, we have implemented 256 Central Processing Units (CPUs) specifically for the online analysis function, and 32 CPUs are allocated for each task. Thus, eight tasks are supported simultaneously. The average duration for a task involving 10 GB of raw sequencing data is approximately 12 h. Task status can be tracked, and the annotated results in the form of Excel sheets will be sent to users as e-mail attachments when the task is finished. In addition, users can log into GWPD to visualize the preliminary analysis results, consisting of the compositions of microorganisms, pathogen species, and ARGs and VFs (Fig. 5(d)).
3.5. Other functionalities
The “Help” page frequently offers questions with answers and website navigation. A comment window is also provided to welcome any public queries and suggestions for further website improvement. The “About” page introduces our project mission and the management and technical support teams of the GWPD. The citations of the publicly available data used in the GWPD are also provided.
3.6. Extensibility of the GWPD platform
GWPD is designed to be extensible, which provides feasible interfaces for future updates from a technique perspective (
Fig. 6). Echart, Axios, OpenGMS, and Vue.js were used to visualize abundance, retrieve search results, conduct annotations, and add functional modules, respectively. Based on the existing tools and development framework, subdatabases of BRFs from other aquatic environments and more functional modules for data analysis (e.g., source tracking and quantitative microbiological risk assessment [36]) can be included in further updated versions of the GWPD. We also plan to connect remote public databases relevant to wastewater epidemiology to integrate new cloud computing modules for cross-database BRF analysis.4. Discussions
GWPD provides a comprehensive catalog of BRFs in global engineered water systems, including WWTP-related samples. The geographic distribution of BRFs among countries and continents implies their potential threat to human health. However, the variations in dominant pathogens and ARGs among continents may be attributed to uneven sampling (such as limited sample collection in Africa and Oceania, particularly in low-income countries) (Figs. S1 and S2
). In addition, we still have limited information on BRFs from specific samples related to WWTPs, including sewer networks, effluents from different disinfection technologies (e.g., chlorine, ultraviolet, ozone, chlorine dioxide, combined ozone/UV disinfection, etc.), and receiving waters. For example, during the COVID-19 pandemic, studies have demonstrated the transmission of SARS-CoV-2 in the feces of infected individuals to WWTPs via sewer networks [37],
[38]. Monitoring sewage from sewer networks for pathogen tracing enables real-time surveillance of communities
[7]. Furthermore, some strains of
E. coli and
P. aeruginosa have chlorine resistance and ultraviolet tolerance. These highly risky pathogens may further leak into receiving waters and threaten human communities
[4]. Thus, it is important to analyze the occurrence of BRFs after disinfection with different technologies to evaluate removal efficiency before discharge to receiving waters.
GWPD provides an illustration of the relative abundance of potential pathogenic microorganisms, ARGs, and VFs in global engineered water systems. These original data are valuable resources for further in-depth analyses. For instance, samples with geographic information can be analyzed together to trace the dissemination of specific pathogens by using tools such as Source Tracker [39]. In addition, the QMRA can be used to estimate the risk of each sample, specific habitat, or even a particular city via pathogen dissemination from source to sink
[36]. Epidemic data of clinically relevant pathogens can be included in the analysis of environmental data in the scheme of water-based epidemiology.
Currently, the annotation service only supports paired-end Illumina short reads but future versions of GWPD will be implemented with the capacity to process Pacific Biosciences (PacBio, USA) and Oxford Nanopore (UK) reads considering the rapid progress of long-read sequencing technologies [40]. Long-read data may provide more accurate (e.g., strain-level) identification of pathogens. Moreover, long-read data can potentially link ARGs and VFs to their host genomes, enabling tracing of the transmission of these BRFs between pathogens
[41]. Furthermore, alternative software tools in the annotation pipeline (e.g., Bracken [42] and mOTUs
[43] for microbiome taxonomic profiling and ARGs-OAP v3.0 [19] for ARG annotation, quantification, and risk ranking) can be used in the future to facilitate methodological comparisons and standardization.
In conclusion, GWPD provides high-throughput metagenome-based identification and visualization of BRFs in global engineered water systems. The potential risks from the BRFs revealed in this database warrant further research on source tracking, risk assessment, and epidemic analysis. This database will also provide primary data and encourage the optimization of BRF-control technologies in WWTPs and other engineered water systems.
CRediT authorship contribution statement
Aijie Wang: Conceptualization, Formal analysis, Project administration, Supervision, Visualization, Writing – review & editing. Fang Huang: Data curation, Formal analysis, Investigation, Software, Visualization, Writing – original draft. Wenxiu Wang: Writing – review & editing, Software. Yanmei Zhao: Software, Writing – review & editing. Yiyi Su: Writing – review & editing. Zelin Lei: Writing – review & editing. Rui Gao: Writing – review & editing. Yu Tao: Writing – review & editing. Jun Wei: Writing – review & editing. Haoyi Cheng: Writing – review & editing. Jinsong Liang: Writing – review & editing. Bin Liang: Writing – review & editing. Jianhua Guo: Writing – review & editing. Jiping Jiang: Writing – review & editing. Lu Fan: Conceptualization, Funding acquisition, Methodology, Project administration, Software, Supervision, Writing – review & editing. Shu-Hong Gao: Conceptualization, Funding acquisition, Investigation, Project administration, Supervision, Writing – review & editing.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (52321005, 52293441, 52293443, and 52230004), the Shenzhen Stability Support Key Program in Colleges and Universities of China (GXWD20231127195344001), the Natural Science Foundation of Guangdong Basic and Applied Basic Research Foundation (2024A1515010085), the Shenzhen Overseas High-level Talents Research Startup Program (20200518750C), the Shenzhen Science and Technology Program (KQTD20190929172630447), and the Open Project of the Key Laboratory of Environmental Biotechnology, CAS (KF2021006). The GWPD website was supported by Shenzhen Zhishu Environmental Technology Co., Ltd., China.
Appendix A. Supplementary data
Supplementary data to this article can be found online at
https://doi.org/10.1016/j.eng.2024.04.022.