1. Introduction
Glycosylation is the enzymatic process that transfers an oligosaccharide (glycan) to an aglycon; that is, protein, lipid, or RNA [
1]. Interestingly, a recent work also considers the possibility of DNA glycosylation [
2]. In
Homo sapiens (
H. sapiens), glycans are synthesized and attached to aglycons through one of the 16 glycosylation pathways consisting of about 700 genes encoding enzymes transporters, and other proteins found in cellular glycosylation machinery (GM) [[
3], [
4], [
5], [
6]]. Protein glycosylation is a complex post-translational modification occurring almost completely in the endoplasmic reticulum (ER) and the Golgi [
7], where glycans are attached to proteins in a series of reactions catalyzed by glycotransferases and glycoside hydrolases [
8]. It has long been known that glycans play major metabolic, structural, and physiological roles in biological systems, and that more than half of all human proteins are glycosylated [[
9], [
10], [
11]].
Protein glycosylation can be found in all three domains of life [
12,
13]. Compared to bacteria and archaea, eukaryotes use a narrower spectrum of monosaccharide building blocks in glycan synthesis [
14,
15]. However, eukaryotic glycans, despite the reduced panel of monosaccharide units, show highly complex linear and branching structural diversity [[
15], [
16], [
17]]. Protein glycosylation can be viewed as a polygenic trait, where glycosylation phenotypes result from the actions of many genes along the glycosylation pathways, partly modulated by intracellular and extracellular environments [
16,
18]. The studies of protein-bound glycan diversity in different evolutionary lineages showed remarkably discontinuous patterns that are shaped by various evolutionary forces, including coevolutionary interactions within a specific holobiont [[
16], [
17], [
18], [
19], [
20], [
21]]. However, despite the biological importance and ubiquitous presence of glycosylation on the tree of life [
12,
14,
19], our understanding of the evolutionary dynamics of GM genes is still cursory [[
22], [
23], [
24], [
25], [
26], [
27]].
For instance, N-glycosylation (NG) biosynthesis is the most frequent and the most studied type of protein glycosylation in eukaryotes [
9,
23]. It starts at the cytoplasmic side of the ER membrane where monosaccharides are sequentially added by GM proteins to the activated lipid carrier until a glycan with Man
5GlcNAc
2 structure is formed [
8]. This glycan structure is then translocated by a flippase to the luminal side of the ER membrane, where it is further elongated by the luminal GM to the Glc
3Man
9GlcNAc
2 structure, which is common to most eukaryotes [
19,
28]. Further glycan editing continues in the Golgi where, under the control of Golgi-specific GM proteins (glycosyltransferases and glycosidases), their final structure is formed [
8]. In eukaryotes, this final glycan structure varies within populations and between species [
12,
14,
17,
19,
27].
Eukaryotic, bacterial, and archaeal NG pathways share a common topology where the first part of stepwise oligosaccharide synthesis unfolds on the cytosolic side of a membrane. At some point, this process includes flipping of the nascent oligosaccharide across the membrane [
12,
23,
29,
30]. In addition, NG pathways show chemical resemblance that include the use of polyisoprenol lipid carriers and phosphate-linked substrates [
22,
29,
31]. These topological and chemical similarities suggest that the last universal common ancestor (LUCA) possessed an NG like pathway [
22]. The remarkable difference, however, is that in bacteria and archaea, the NG pathway is embedded in the plasma membrane, whereas in eukaryotes, it is part of the ER membrane [
12,
19,
23,
29,
30,
32].
On the other hand, the evolutionary origin of individual GM genes acting in the eukaryotic NG pathway is not yet fully resolved [
23]. Because of the symbiotic origin of eukaryotes with the contribution of both archaeal and bacterial ancestors, it is not fully clear which prokaryotic group is ancestral to the eukaryotic NG pathway [[
33], [
34], [
35]]. Interestingly, recent phylogenomic analysis suggests a mixed origin of eukaryotic NG with the contribution of both archaea and bacteria [
23]. Nevertheless, the evolutionary origin of many eukaryotic NG genes remains unresolved and some of them seem to be a eukaryotic innovation [
23].
Another facet of the glycosylation process involves proteins that are targeted by the GM. At least in humans, it is clear that many proteins are glycosylated, however evolutionary dynamics of these glycoproteins (GPs) have not been addressed so far. To obtain a coherent view of the evolutionary origin of GM and GP genes, and consequently to gain a better understanding of the role of glycosylation in the evolution of metazoans, particularly humans, we applied here a phylostratigraphic approach [[
36], [
37], [
38], [
39], [
40], [
41], [
42], [
43], [
44], [
45], [
46], [
47], [
48]]. Although the phylostratigraphic approach has proven to be a powerful tool for addressing various macroevolutionary questions—including those related to human diseases [
36,
42,
47,
48]—it has never been applied to problems related to glycobiology.
Our results showed that GM and GP genes have divergent macroevolutionary dynamics and that the intracellular localization of NG proteins on the ER membrane follows an evolutionarily polarized pattern. This finding led us to propose that the ER evolved through the invagination of the prokaryotic plasma membrane. Together, our study reveals that eukaryotic glycosylation is built upon a fundamental core that is largely homologous to prokaryotic glycosylation genes, along with a set of eukaryota-specific genes that are unique to the glycosylation process in eukaryotic cells.
2. Methods
2.1. GM genes and GPs
We compiled a list of GM genes from several sources: genes listed in the study from Moremen et al. [
4] (refer to Data S1 in Appendix A), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways related to glycosylation in
H. sapiens (hsa00510, hsa00511, hsa00512, hsa00514, hsa00515, hsa00531, hsa00532, hsa00533, hsa00534, hsa00563, hsa00601, hsa00603, and hsa00604) [
49],
H. sapiens entries in Carbohydrate Active enZYmes (CAZY) database [
3], genes listed in the study from Varki et al. [
6] and from Schjoldager et al. [
5] (Data S1). We also retrieved
H. sapiens orthologues of mouse glycosylation genes listed in Moremen et al. [
4]. The obtained dataset was manually reviewed for duplicates and pseudogenes which finally yielded a total of 673 GM genes with unique ensemble gene IDs (Data S1).
In addition, we collected a list of GPs from UniProtKB/Swiss-Prot database [
50] using the following query: “annotation:(type:carbohyd) AND reviewed:yes AND organism: “Homo sapiens (Human) [9606]”.” In this way, we obtained all the reviewed
H. sapiens proteins that show some evidence to possess a posttranslational modification of glycosylation type. The final GP dataset included a total of 4565 genes with unique ensemble gene IDs (Data S1).
2.2. Consensus phylogeny and reference genomes
We constructed a consensus tree of 503 organisms representing the diversity of major lineages that lead from the origin of cellular organisms to
H. sapiens, resulting in 29 phylogenetic levels (phylostrata (ps),
Fig. 1, Data S2 and S3 in Appendix A). For phylogenetic relationships between taxa, we followed the relevant literature [[
51], [
52], [
53], [
54], [
55], [
56], [
57], [
58]]. Reference genomes (that is, corresponding amino acid sequences (proteomes) of all organisms included in the phylogeny) were downloaded mostly from the Ensembl and National Center for Biotechnology Information (NCBI) databases. Proteomes of choanoflagellates were retrieved from Richter et al. [
59]. Prior to the phylostratigraphic analysis, the proteomes were processed to keep only the longest splicing variant of each gene. Other details on the construction of a phylostratigraphic database are described in Domazet-Lošo et al. [
37].
2.3. Phylostratigraphic analysis
The theoretical background and phylostratigraphic procedure have been described previously [
37,
39,
40,
47]. Briefly, the longest splicing variant of every
H. sapiens gene was compared to the reference database using the blastp algorithm with the
e-value cutoff equal to 10
-3, which has repeatedly been shown to be optimal in phylostratigraphic analysis [[
60], [
61], [
62], [
63]]. The obtained sequence similarity search results, along with the consensus phylogeny, were used to estimate the evolutionary origin (phylostratum; ps) of all human genes. We further tested the robustness of this cutoff by conducting a sliding
e-value test, which confirmed the stability of the recovered patterns [
37,
43]. It is important to note that phylostrata represent key evolutionary events along the lineage leading to the focal species, reflecting the tree topology and thereby providing a broad overview of gene dynamics over geological time—that is, a macroevolutionary perspective [
37,
48].
The frequency distributions of studied genes (GM and GP) were compared to the frequency distribution of all human genes across phylostrata (expected distribution). In all graphical representations the ratio of these frequencies is shown as odds ratios. Deviations from the expected frequencies at each phylostratum were tested using a two-tailed hypergeometric test. The obtained
p-values were corrected for multiple testing using the Benjamini-Hochberg method. To account for a potential sequence similarity search bias, we applied a sliding
e-value protocol [
43] where we repeated sequence similarity searches and all downstream calculations using a range of
e-value cut-offs (1, 10
-1, 10
-2, 10
-5, 10
-10, 10
-15, and 10
-20).
2.4. Statistical analyses
To test whether there is an association between the evolutionary origin of genes coding for enzymes of the NG pathway acting on the ER and membrane topology, we generated a contingency table with these categories and tested observed biases using Fisher’s exact test. We excluded requiring fifty three 1 homolog (RFT1) from the statistical analyses due to its ambiguous topology on the ER membrane (lumen vs cytosol). We used the same approach in the Golgi to test the association between the evolutionary origin and enzyme type.
To assess if there is an association between the order of enzymes of the NG pathway in the ER and their evolutionary origin, we devised the homogeneity index (HI). For an ordered list of N genes with assigned evolutionary origin (phylostratum), HI is calculated by iterating from the 2nd to the Nth element of the list and comparing every element with the one immediately preceding it. If the evolutionary origin of the two elements is not equal, the index is increased by 1. In essence, a high HI means that the sequence of enzymes is disordered with respect to their evolutionary origin, while a sequence in which the enzymes of the same evolutionary origin are clustered together will result in a low HI.
To test whether the observed biased grouping of enzymes of the NG pathway in the ER is statistically significant, we generated all possible permutations of evolutionary origins and calculated the HI for each permutation, analogous to the procedure utilized in earlier publications [
64,
65]. We arranged the calculated HI values into bins containing identical values. We divided the number of elements in each bin with the total number of permutations, resulting in the empirical mass density function of HI values. Finally, we calculated the
p-value by summing the probabilities of bins containing the HI values equal to or lower than the observed one. All statistical analyses were performed using Python and R [
66]. Figures were created using R packages “ggplot2” and “cowplot” [
67,
68].
3. Results
To reconstruct the evolutionary origin of
H. sapiens GM and GP genes, we first mapped all human protein sequences (23 237) on the consensus phylogeny consisting of 29 phylogenetic levels (ps,
Fig. 1, Data S2 and S3) [[
37], [
38], [
39],
41,
43].
To get an overview of macroevolutionary patterns related to glycobiology, we made a compilation of 673 human GM genes. The majority of these genes (56%) map to the origin of all cellular organisms (ps1,
Fig. 2), which is significantly more than expected by chance (odds ratio = 2.71,
p = 1.29 × 10
-37). The ps1 (cellular organisms) corresponds to the LUCA, thus all genes/proteins shared between bacteria and archaea are classified in this phylostratum. It is important to note that the focal linage (
H. sapiens) determines which phylostrata are on the evolutionary path (trajectory) to the root of the tree. In our case, given that
H. sapiens is a focal species, all bacteria are integral part of ps1 (cellular organisms).
The origin of cellular organisms (ps1,
Fig. 2) is the only evolutionary period in which we detected the enrichment of GM genes indicating that glycosylation is an ancient process common to all life. However, we detected also a substantial number of GM genes at the origin of eukaryota (24%; ps6,
Fig. 2) suggesting that in this evolutionary transition glycosylation pathways were further elaborated. Another 17% of GM genes we traced back to the period between the origin of Amorphea and Bilateria (ps7-ps13,
Fig. 2). These numbers reveal that the origin of animal multicellularity as well as the body plan assembly of eumetazoans and bilaterians required new additions to glycosylation pathways. Notably, after the origin of amniotes (ps19,
Fig. 2) no new GM genes were mapped on our phylostratigraphic map. Taken together, our phylostratigraphic profile of human GM genes suggests that glycosylation is an evolutionarily ancient process, fully established before the radiation of mammals (ps20,
Fig. 2).
The reported results are based on the blastp
e-value cutoff of 10
-3 which has repeatedly been shown to be optimal in phylostratigraphic analysis [[
60], [
61], [
62]]. However, to test the stability of the observed phylostratigraphic patterns we recently introduced a test, where the analysis is repeated for a broad range of
e-value cutoffs; for example, between 1 and 10
-20 [
43]. This sliding
e-value protocol intentionally inflates false positive (
e-values closer to 1) and false negative rates (
e-values closer to 10
-20) and thus tests the robustness of the initially observed phylostratigraphic pattern at 10
-3 e-value cutoff [
43]. Our sliding
e-value analysis confirmed the stability of the signal, which we initially found in ps1, in the full range of tested cutoff values (Fig. S1 in Appendix A). This result reassured us that the observed pattern at the 10
−3 e-value reflects a genuine evolutionary imprint, rather than noise from the error rates of the blastp algorithm.
To get a more detailed insight into the evolutionary patterns of GM genes, we divided them into seven subgroups, which reflect their specific roles in glycobiology [
50,
69]. We then analyzed these subgroups independently using phylostratigraphic procedure (
Fig. 3). The monosaccharide metabolism (MM) genes that contribute to glycosylation process are prevailingly mapped to the evolutionary oldest phylostrata (ps1) where they contribute to a strong enrichment signal (
Fig. 3). This pattern suggests that the basic sugar metabolism necessary for glycosylation (i.e., monosaccharide activation) was present at the origin of cellular organisms. Comparably, genes that contribute to the NG and O-glycosylation (OG) of proteins show a bimodal pattern with significant enrichments at the origin of cellular organisms (ps1,
Fig. 3) and eukaryota (ps6,
Fig. 3). This enrichment profile suggests that protein glycosylation was present at the origin of cellular organisms (ps1), but also points to further innovations linked to these pathways at the origin of eukaryota (ps6).
Genes involved in the synthesis of glycosaminoglycans (GAGs) and glycan-binding proteins (GBPs) produced very similar phylostratigraphic patterns (
Fig. 3). Both groups of GM genes showed enrichment signals at the origin of cellular organisms (ps1,
Fig. 3) and at the origin of eumetazoans (ps12,
Fig. 3). The first signal at the origin of cellular organisms (ps1) suggests that the synthesis of GAGs and the production of GBPs have deep roots in evolutionary history. However, the second signal at the origin of eumetazoans (ps12) points that novel GAGs and GBPs played an important role in the emergence of the first metazoans. This is not surprising because GAGs are the main component of the extracellular matrix and endothelial glycocalyx layer—the crucial elements of true tissues that determine their physical characteristics [
70,
71]. Similarly, GBPs play an important role in cell adhesion, cell-cell interactions, cell-matrix interactions, and immune processes via self-nonself recognition; all of which are important functions of animal lifestyle [
72].
The glycosylphosphatidylinositol (GPI) anchor is a glycan-based posttranslational modification that allows modified proteins to be attached to the outer surface of the plasma membrane [[
73], [
74], [
75]]. The GM involved in the formation of GPI anchored proteins showed a strong enrichment signal at the origin of eukaryota (ps6,
Fig. 3). This signal, in line with the observation that GPI modifications are found only in eukaryotes [
76,
77], suggests that GPI anchor is a eukaryogenesis-linked innovation that facilitates protein trafficking within highly compartmentalized eukaryotic cells. In contrast, GM genes involved in glycolipid (GL) formation did not show any statistically significant enrichment (
Fig. 3). Regardless of this, the vast majority of GL genes, similar to the other subgroups of GM genes, map to the origin of cellular organisms (ps1) or eukaryota (ps6,
Fig. 3). Finally, all enrichment profiles that we recovered from the subgroups of GM genes showed stability in sliding
e-value analyses (Fig. S2 in Appendix A).
To compare evolutionary patterns of GM genes with their target proteins we mapped 4565 human genes coding for GPs onto the consensus phylogeny (
Fig. 4). Around 38% of GPs could be traced back to the ancestor of cellular organisms (ps1), which is significantly more than expected by chance (odds ratio = 1.30,
p < 0.001). Apart from that, we detected significant enrichment signals in the evolutionary period that spans the origin of animals (Apoikozoa, ps10,
Fig. 4) and bony vertebrates (Euteleostomi, ps17,
Fig. 4). Approximately 37% of GPs have their origin in this period with statistically significant odds ratio in the range between 1.26 and 2.32. These results corroborate the notion that glycosylation is an ancient process common to all life, because we found the overlapping enrichment of both types of genes, that is, GM genes as well as GP genes, at the origin of cellular organisms (ps1,
Fig. 2,
Fig. 4). In addition, the strong enrichment signals of GPs in the span between the unicellular ancestors of animals (ps10,
Fig. 4) to bony vertebrates (ps17,
Fig. 4) suggests that the increased use of GPs played an important role in the origin of the first animals and in their further radiation along the Cambrian and Ordovician geological periods [
21], which is likely connected with the immense energetic shift at the origin of animals [
64]. The sliding
e-value analyses confirmed the robustness of these results (Fig. S3 in Appendix A).
The vast majority of human GPs (97%) in our dataset carry N-linked glycans. This sharply contrasts OGs which are much less frequent (7%) post-translational modification. By assuming that these large differences reflect the relative biological importance of these two post-translational modification types, and by considering that NG is the best-studied protein glycosylation process, we focused our further analysis on N-GM. We first depicted the part of the NG pathway that acts in the ER (
Fig. 5). This subset includes 40 proteins that act directly in the biosynthetic pathway on the cytoplasmatic and luminal side of the ER membrane as well as those that are indirectly involved by providing activated monosaccharide blocks or are acting as membrane transporters (
Fig. 5(a)). The NG pathway starts with dolichol-phosphate biosynthesis or recovery at the cytosolic side of ER (
Fig. 5(a)). We traced the evolutionary origin of three genes involved in these processes (dolichyldiphosphatase 1 (DOLPP1), steroid 5α-reductase (SRD5A3), and dolichol kinase (DOLK)) to eukaryota (ps6) which suggests that they appeared for the first time in the last eukaryotic common ancestor (LECA); in line with the results of a previous study [
22].
The next step in the NG process is the addition of the first glycan on the dolichol-phosphate carrier catalyzed by a glycosyltransferase dolichyl-phosphate
N-acetylglucosamine phosphotransferase 1 (DPAGT1) on the cytosolic side of the ER membrane (
Fig. 5(a)). After this modification, a series of glycosyltransferases (ALG13, ALG14, ALG1, ALG2, and ALG11) sequentially add monosaccharides to yield the Man
5GlcNAc
2 structure. All these genes are mapped to ps1, suggesting their evolutionary occurrence in the last common ancestor of all cellular organisms (
Fig. 5(a)). This newly synthesized glycan structure, which is attached to the dolichol carrier, is then flipped by RFT1 into the lumen of the ER (
Fig. 5(a)). The NG process proceeds further on the luminal side of the ER membrane where a series of glycosyltransferases (ALG3, ALG9, ALG12, ALG6, ALG8, and ALG10) add additional glycans to reach the final Glc
3Man
9GlcNAc
2 structure common to all eukaryotes. All these proteins that act on the luminal side of the ER membrane are mapped to the origin of eukaryota (ps6) (
Fig. 5(a)).
In the next step, the oligosaccharyltransferase (OST) protein complex transfers the glycan from the dolichol carrier to the polypeptide (
Fig. 5(a)). This protein complex is assembled from five constant and three variable subunits. We traced all these subunits to the origin of eukaryota (ps6), with the exception of staurosporine and temperature sensitive oligosaccharyltransferase complex catalytic subunit (SST3), which we mapped at the origin of cellular organisms (ps1). This phylostratigraphic pattern suggests that the OST complex evolved during eukaryogenesis by the addition of new regulative subunits to the ancient core catalytic protein (SST3). Once an oligosaccharide is attached to a protein, the processing of the glycan begins with the removal of the glucose residues with two glucosidases (mannosyl-oligosaccharide glucosidase (MOGS) and GANAB). We traced MOGS to the origin of TACK
Asgard archaea (ps4) and GANAB to the origin of cellular organisms (ps1). When the GP is finally properly folded, α-mannosidase I (MAN1B1), which we traced to cellular organisms (ps1,
Fig. 5(a)), removes one mannose from the central glycan arm which signals that the GP is ready for transport outside the ER, usually to the Golgi apparatus [
4].
By looking at the phylogenetic distribution of NG ER proteins (
Fig. 5(a)), we found that essentially all of them come from only two evolutionary periods: cellular organisms (ps1) or eukaryota (ps6). Besides this binary evolutionary origin of NG ER proteins, which follows the global evolutionary pattern of NG genes (
Fig. 3), we noticed that the proteins of the same evolutionary origin tend to have adjacent positions along the biosynthetic pathway (
Fig. 5(a)). For instance, there is a block of seven consecutive biochemical steps (DPAGT1 to ALG11) where all proteins have evolutionary roots at the origin of cellular organisms (ps1,
Fig. 5(a)). This block is followed by an evolutionary younger one (RFT1-ALG10) that contains eight biochemical steps where all proteins have phylogenetic roots at the origin of eukaryota (ps6,
Fig. 5(a)).
To assess this phenomenon quantitatively, we devised a positional HI which estimates the clustering strength of identical phylostrata along a biosynthetic pathway. Low HI values reflect an extensive positional grouping of the identical phylostrata, whereas high HI values point to a scattered arrangement of phylostrata along the biosynthetic pathway. By applying this measure to the sequence of all NG reactions in the ER, we found a very low HI value that signals the biased grouping of phylostrata (HI = 5,
Fig. 5(b)). To test if this biased grouping is statistically significant, we compared the observed HI value to the distribution of all possible HI values in the ER part of the NG pathway. We found that is extremely unlikely that such strong positional clustering of identical phylostrata occurred by chance (
p = 5.68 × 10
-3,
Fig. 5(b)). We obtained similar results if we focused only on the core segment of the ER NG pathway from DPAGT1 to OST, which starts with the first attachment of a monosaccharide to the lipid carrier and ends with the transfer of the glycan to the polypeptide chain (HI = 2,
p = 3.86 × 10
-3,
Fig. 5(b)).
Besides these longitudinal clustering biases along the ER NG pathway, we noticed that phylostrata were not randomly distributed with respect to the position of N-GM on the two sides of the ER membrane. The reactions that occur at the cytosol side of the ER membrane tend to use N-GM proteins that we traced to the origin of cellular organisms (ps1,
Figs. 5(a) and (c)). Conversely, the reactions that unfold at the luminal side of the ER membrane preferentially rely on the N-GM proteins that we traced to the origin of eukaryota (ps6,
Figs. 5(a) and
(c)). This biased arrangement is significant when all genes are analyzed (Fisher’s exact test,
p = 0.0104) as well as when only the core part of the NG pathway from DPAGT1 to OST is considered (Fisher’s exact test,
p = 2.77 × 10
-4). These results are quite stable in a broad range of
e-value cutoffs (Fig. S4 in Appendix A).
After reaching Golgi,
N-glycans go through further processing that involves coordinated action of glycosidases and glycosyltransferases that gives rise to three main classes of glycans: oligomannosidic, hybrid, and complex glycans. However, some glycans can completely skip these modifications [
78]. The Golgi apparatus is organized into discrete cisternae, each containing a distinct subset of N-GM proteins which include glycosidases, glycosyltransferases, and nucleotide sugar transporters that provide glycosyltransferases with nucleotide sugars as glycosyl donors [[
78], [
79], [
80]] (
Fig. 6(a)). Similar to the ER, we found that NG proteins in Golgi are also assigned only to two phylostrata: cellular organisms (ps1) and eukaryota (ps6,
Fig. 6(a)).
We traced seven nucleotide sugar membrane transporters acting in Golgi (solute carrier 35 (
SLC35) gene family) to the origin of eukaryota (ps6,
Fig. 6(a)). In contrast, we tracked all five glycosidases (mannose trimming enzymes), located in the
cis- (mannosidase alpha class 1A members (MAN1A1, MAN1A2), and mannosidase alpha class 1C member 1 (MAN1C1)) and
medial-Golgi (mannosidase alpha class 2A members (MAN2A1 and MAN2A2)), to the origin of cellular organisms (ps1,
Fig. 6). On the other hand, 22 glycosyltransferases have binary phylogenetic origin. The majority of them (16) could be traced to eukaryota (ps6) and the rest (6) to the origin of cellular organisms (ps1,
Fig. 6). In contrast to the ER, the Golgi NG pathway is not linear and lacks membrane polarity, as all glycosidase or glycosyltransferase reactions occur within the Golgi lumen. These properties prevented us from applying heterogeneity index or testing location biases similar to those applied in the ER (
Fig. 5). However, we noticed an obvious bias in the phylogenetic origin of proteins acting in Golgi depending on their functional roles, where all glycosidases could be traced back to the origin of cellular organisms (ps1) and the majority of glycosyltransferases to the origin of eukaryota (ps6). There is a low probability of observing such a distribution by chance, (Fisher’s exact test,
p = 5.72 × 10
-3,
Fig. 6(b)), and the pattern is stable for
e-values above 10
-3 (Fig. S5 in Appendix A).
4. Discussion
Our global phylostratigraphic analysis of GM genes in humans revealed that glycosylation is an evolutionary old process most likely common to all life. However, GM pathways relevant to humans were established along protracted macroevolutionary time via three important periods. The first one is the origin of cellular organisms (ps1,
Fig. 2,
Fig. 3) where we traced most of the human GM. The second is the origin of eukaryota (ps6,
Fig. 2,
Fig. 3) and the third one covers eukaryotic radiation and early metazoan diversification (ps7-ps13,
Fig. 2,
Fig. 3). Finally, the complete GM, on which extant humans depend, was most likely fully established before the origin of mammals (ps20,
Fig. 2,
Fig. 3). However, it is interesting that we traced the comparably small number of GM genes to the archaea lineage (ps2-ps5,
Fig. 2,
Fig. 3). This pattern suggests that glycosylation in humans predominantly relies on the GM that was already present at the origin of cellular organisms (ps1) and not on some archaea-specific innovations (ps2-ps5). However, this sharply contrasts the subsequent evolutionary period, the onset of eukaryotes, where we detected the burst of new GM genes (ps6,
Fig. 2,
Fig. 3). This pattern creates a polar distribution in the analysis of the NG pathway where GM genes are either common to all cellular life (ps1) or specific to eukaryota (ps6,
Fig. 5,
Fig. 6).
For instance, the structural and functional divergence between prokaryotic and eukaryotic OSTs reflects fundamental differences in the biological context of N-linked glycosylation. In prokaryotes, the OST is composed of a single catalytic subunit, such as undecaprenyl-diphosphooligosaccharide-protein glycotransferase (PglB) in
Campylobacter jejuni or dolichyl-phosphooligosaccharide-protein glycotransferase (AglB) in archaea, which catalyzes the
en bloc transfer of an oligosaccharide to asparagine residues in target proteins post-translationally, often after protein folding has occurred [
22,
23,
81]. In contrast, eukaryotic OST is a multi-subunit complex embedded in the ER membrane, composed of the STT3 oligosaccharyltransferase complex catalytic subunit A or B (STT3A or STT3B) and multiple non-catalytic accessory proteins, each with specialized roles [
82]. This complex organization supports the co-translational nature of glycosylation in eukaryotes, enabling synchronized interaction with the protein translocation machinery, as well as with quality control systems that monitor folding and post-translational modifications.
Subunits such as magnesium transporter 1 (MAGT1)/tumor suppressor candidate 3 (TUSC3) confer oxidoreductase activity necessary for glycosylation near disulfide-forming cysteines, while others like oligosaccharyltransferase complex subunit 4 (OST4), dolichyl-diphosphooligosaccharide-protein glycosyltransferase non-catalytic subunit (DDOST), ribophorin 1 (RPN1), and RPN2 contribute to complex stability, lipid-linked oligosaccharide recruitment, and substrate positioning within the catalytic site [
82]. The evolutionary transition from a single-subunit to a multi-subunit OST likely reflects the increased complexity of GP biosynthesis and trafficking in compartmentalized eukaryotic cells. Indeed, sequence and structural analyses have shown that the eukaryotic STT3 subunit shares homology with prokaryotic PglB/AglB, suggesting that the eukaryotic OST evolved via subunit accretion around an ancient catalytic core [
81,
82]. Thus, the presence of multiple non-catalytic subunits in eukaryotic OST is not merely a reflection of redundancy but an adaptation that enables precise spatial and temporal control of NG, essential for the biosynthesis of complex multi-domain GPs characteristic for eukaryotic organisms.
This binary evolutionary origin of N-GM genes (ps1 vs ps6) is a useful marker which allowed us to look for potential biases in the distribution of N-GM genes on the endoplasmic membrane. Surprisingly, we found a non-random distribution where NG genes specific for cellular organisms (ps1) are predominantly located on the cytosolic side and those specific for eukaryota (ps6) on the luminal side of the ER membrane (
Fig. 5). This biased distribution has important evolutionary implications because it suggests that the lumen of the ER is a eukaryotic innovation which was devised by the invagination of the prokaryotic plasma membrane (
Fig. 7). Bacterial and archaeal N-GM is always positioned at the innermost cell membrane where glycan synthesis occurs on the cytoplasmic side of the membrane. When completed, glycan is flipped to the extracellular or periplasmatic space (
Fig. 7(a)). If we assume that this cross-membrane directionality of NG is preserved in eukaryotes, then the most likely explanation for the fact that in eukaryotes the flipping of the Man
5GlcNAc
2 structure occurs towards the ER lumen is that ER emerged by the invagination of the prokaryotic innermost membrane that contained N-GM (
Fig. 7). Our finding that NG genes located on the cytoplasmic side of the ER membrane tend to be common to all cellular organisms (ps1), and that those at the luminal side are preferentially of eukaryotic origin (ps6), further corroborate this view (
Fig. 5).
However, the question arises regarding the capability of extant and ancestral prokaryotes to form intracellular membrane vesicles (IMVs). In contrast to extracellular membrane vesicles (EMVs) which have been recently extensively studied in prokaryotes [
83,
84], IMVs are a much less explored feature. Nevertheless, it is clear that bacterial cells could also form IMVs [[
85], [
86], [
87], [
88], [
89], [
90]]. For instance, it was shown that cell wall-deficient bacteria can encapsulate extracellular material by invagination of the cytoplasmic membrane by an endocytosis-like process [
88]. This is a particularly suggestive finding, as wall-deficient bacteria are considered models of early cellular life that existed before the evolution of the cell wall [
86,
88,
91].
Another important question relates to the evolutionary origin of the NG genes that in our analysis map to the cellular organisms (ps1). In principle, there are two contexts when these genes are placed at ps1. In the first one, these genes could have a significant match to bacteria only, while in the second one, these genes could have significant matches to both bacteria and archaea. A situation where most of the NG genes that are placed at ps1 have significant matches only to bacteria would indicate that the evolutionary oldest part of the eukaryotic NG pathway stems from the bacterial ancestor. In turn, this would imply that bacteria and archaea evolved NG independently. Conversely, a situation where most of the NG genes that are placed at ps1 have significant matches to both bacteria and archaea would suggest that already LUCA possessed basic N-GM that was then inherited in all subsequent lineages.
To test for these possibilities, we created a heatmap of the best matches per phylostratum for NG genes (Fig. S6 in Appendix A). This analysis revealed that most of NG genes mapped to ps1, which act in ER, have significant matches in both bacteria and archaea, suggesting that the basic N-GM was already present in LUCA and that all diverging lineages were built on this ancient core. A notable exception is ALG1, mannose phosphate isomerase (MPI), and phosphomannomutase 2 (PMM2) which seem to have significant matches only in bacteria. Similarly, the majority of genes that map to ps1 in Golgi have significant matches exclusively in bacteria (Fig. S6). Together, this gives some credence to the idea that eukaryotic NG genes that map to ps1 were actually inherited from bacterial ancestor, in line with the notion that eukaryotic membranes are of bacterial origin [
35]. One approach to better resolve this issue in the future would be to conduct a detailed phylogenetic analysis of individual NG ps1 genes.
Our heatmap analysis also reveals that N-GM is largely preserved in eukaryotic side-branches along human lineage (
x-axis, ps6-ps29, Fig. S6). However, there are also some exceptions. The most striking example is the loss of Golgi glycosyltransferase genes in fungi (
x-axis, ps8, Fig. S6). This finding agrees with the observed lack of galactose and
N-acetylglucosamine residues in glycan structures found in fungi [
12].
It is important to establish links between our evolutionary findings and current biomedical research in glyco-therapeutics, glycan-based diagnostics, and the functional glycomics in human disease [[
92], [
93], [
94]]. A good example is a recent finding that increased levels of α2,6-sialylation, catalyzed via α2,6-sialyltransferase-I (ST6Gal-I), have a significant role in development and progression of Alzheimer’s disease [
93]. In this analysis, we found that this sialyltransferase originated at the origin of eukaryota (ps6, Fig. S6). This information can be useful in tracing other glycosylation genes implicated in Alzheimer’s disease, as it was previously shown that genes with similar functional significance correlate in terms of their phylostratigraphic origin [
46]. Yet another possible application of here recovered evolutionary information is to add the phylostratigraphic origin of glycosylation genes, as a parameter, to platforms for glycosylation-omics analysis [
92].
Here presented evolutionary findings on protein glycosylation can significantly inform and inspire current biomedical research. Since aberrant glycosylation is involved in numerous human diseases, such as cancer, autoimmune disorders, and infections, in depth understanding of the glycan biosynthesis machinery is obligatory for elucidating disease mechanisms and identifying new therapeutic targets [
95]. Glycosylation plays a critical role in cancer immunotherapy as many tumor-associated glycans on the tumor glycocalyx help tumors evade immune surveillance and trigger immunosuppressive signaling via glycan-binding receptors [
96]. For example, the sialylated glycan epitopes interact with the lectin receptors (GBP), leading to immune suppression [
95]. The binary evolutionary origin of the GM in the ER and Golgi apparatus, with prokaryotic origins on the cytoplasmic side and eukaryotic origins in the lumen, offers a novel framework for designing glyco-therapeutics. These may include strategies such as microRNA-based targeting of either conserved, prokaryotic-like glycosylation processes or more recently evolved, eukaryotic-specific processes. This evolutionary perspective also supports the discovery and development of novel carbohydrate biomarkers for cancer stratification and improved clinical outcomes.
Taken together, our study showed that glycosylation in H. sapiens is an ancient feature, with a biphasic origin that comprises the origin of cellular organisms and the origin of eukaryotes. Despite this ancient history of GM, the usage of protein glycosylation intensified with the development of complex multicellular organisms, probably linked to selective pressures related to self-nonself recognition and improved coordination between the differentiated cells. Finally, using NG pathway as an evolutonary marker, we provide some support for the idea that ER evolved through invagination of the prokaryotic cell membrane that possessed the NG pathway.
CRediT authorship contribution statement
Domagoj Kifer: Writing - review & editing, Writing - original draft, Visualization, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Nina Čorak: Writing - review & editing, Visualization, Methodology, Formal analysis, Data curation. Mirjana Domazet-Lošo: Writing - review & editing, Software, Methodology, Investigation, Formal analysis, Data curation, Funding acquisition. Niko Kasalo: Writing - review & editing, Validation, Resources, Investigation, Formal analysis. Gordan Lauc: Writing - review & editing, Validation, Supervision, Project administration, Funding acquisition, Conceptualization. Göran Klobučar: Writing - review & editing, Writing - original draft, Visualization, Validation, Supervision, Methodology, Investigation, Formal analysis, Conceptualization. Tomislav Domazet-Lošo: Writing - review & editing, Writing - original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We thank M. Futo, A. Tušar, S. Koska, and D. Franjević for discussions. This work was supported by the Croatian Science Foundation (IP-2016-06-5924), the City of Zagreb, and the Adris Foundation to Tomislav Domazet-Lošo; and the European Regional Development Fund (KK.01.1.1.01.0009 DATACROSS) to Mirjana Domazet-Lošo and Tomislav Domazet-Lošo. We used the computational resources of the University Computing Center—SRCE (Padobran) and the Institute Ruđer Bošković.
Appendix A. Supplementary data
Supplementary data to this article can be found online at
https://doi.org/10.1016/j.eng.2025.06.039.