
1 Features of the ‘ISHN’

The ‘indiscernible sequence homology network’ (ISHN) is a quasi-random graph G(V, E) with 2,759 nodes and 11,388 edges. Its vertices v ∈ V (nodes) are structurally similar, sequence-dissimilar sub-domain-size motifs (Root Mean Square Deviation = 3.5 Å and sequence identity < 20%), namely βαβ motifs, β-hairpins, and α-helix hairpins, ranging in size from 20 to 35 residues and taken from protein domains collected in the Protein Building Blocks Database (PBBD, see Gullotto et al., 2013). These motifs are connected to each other, and to the nodes that represent the structural and functional annotations of their parent domains, through edges e ∈ E, with E ⊆ V × V (see Figure 1). Motifs of the ISHN retain a number of local attributes: (a) SSE ID; (b) SSE topology; (c) closed loop sequences (see Berezovsky et al., 2000), i.e. remnants of ancient molecular prototypes that led to the birth of modern proteins (see Berezovsky et al., 2002; Sobolevsky et al., 2007; Trifonov et al., 2001); (d) hydrophobic sequences, here referred to as ‘primary van der Waals (vdW) locks’, localized between the ends of the closed loops; (e) hydrophobic sequences, here referred to as ‘secondary vdW locks’, which establish contacts between different closed loops and/or spatially proximal regions; (f) sequence fingerprints detected by the Conserved Domain Database (CDD) server (URL: http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) (see Marchler-Bauer et al., 2017). Domain annotations of the ISHN have been taken from the Structural Classification of Proteins (SCOP), Class Architecture Topology/Fold Homologous superfamily (CATH), and Pfam databases, together with gene product annotations taken from the Protein Data Bank (URL: https://www.rcsb.org/). Overall, these annotations are represented by (a) organism; (b) class; (c) superfamily; (d) family; (e) architecture; (f) molecular functions; (g) biological processes.
The underlying protein space of the ISHN covers 14,964 non-singleton sequence fragments. Among these sequences, 14,109, 12,815, and 10,612 are detected as true positives at the Pfam Family, CATH Homology, and SCOP Superfamily classification levels, respectively, with the CATH and Pfam databases providing the highest level of curation in terms of structural and evolutionary annotations, respectively. In view of the above, the ISHN has been used as a benchmark for exploring the occurrence of evolutionary signals that could belong to the last universal common ancestor (LUCA, i.e. the most recent pool of ancestors from which contemporary organisms descend), in an attempt to unveil the emergence of frameworks that make sense from an evolutionary perspective.


Figure 1. Graphical representation of the ISHN. Colour code refers to modularity class.

1.2 Assessment of the ISHN

Annotations of the ISHN complement each other to form complete sub-graphs with pairs of connected structural motifs, suggesting evolutionary paths between members of the network at sub-domain-size resolution. Two topological descriptors have been used to calculate ‘small-world-ness’ S (see Humphries and Gurney, 2008): the average path length L, i.e. the average number of steps along the shortest paths between all possible pairs of nodes, and the clustering coefficient C, i.e. an average measure of how close the neighbours of a node are to being a complete sub-graph. Both descriptors have been computed for the ISHN and for a random network generated from it by shuffling the edges between its protein motifs and their parent domain annotations, in order to establish whether the ISHN retains nodes clustered around hubs having a significantly high measure of centrality. Indeed, the values of L and C for the ISHN (3.812 and 0.092, respectively) and for the corresponding random network (3.522 and 0.019, respectively) reveal that the ISHN exhibits small-world behaviour, with S = 4.48. Consistently, the distribution of the network degree k (i.e. the number of edges incident to a node) of the ISHN decays as a power law with degree exponent γ = 1.23 and R-squared = 0.96. By applying Chebyshev’s inequality, which bounds the fraction of values of a probability distribution that lie more than a given number of standard deviations away from the mean (see Rice, 1994), two Z-scores, here referred to as Zk and Zv, derived respectively from betweenness centrality g(v), i.e. the number of shortest paths that pass through a node, and network degree k of vertex v, have been used to detect the most shared nodes (hubs) within the ISHN, assuming hubs with both Z-scores ≥ 2.5 to be members of LUCA. From a structural perspective, the ISHN is dominated by four hubs represented by the α/β, all-β, α+β, and all-α structural classes, together with the 3-layer (αβα)-sandwich architecture.
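As a worked check of the small-world-ness index, the sketch below applies the Humphries and Gurney definition, S = (C / C_rand) / (L / L_rand), to the rounded values of L and C quoted above. The function and variable names are illustrative only and are not taken from the original pipeline.

```python
def small_world_ness(c_net, l_net, c_rand, l_rand):
    """Small-world-ness S = (C / C_rand) / (L / L_rand)."""
    gamma = c_net / c_rand   # clustering relative to the random network
    lam = l_net / l_rand     # path length relative to the random network
    return gamma / lam


# Rounded values reported in the text for the ISHN and its shuffled counterpart.
S = small_world_ness(c_net=0.092, l_net=3.812, c_rand=0.019, l_rand=3.522)
print(f"S = {S:.2f}")  # a value of S above 1 indicates small-world behaviour
```

With these rounded inputs the index comes out at about 4.47, marginally below the reported S = 4.48; the small gap may simply reflect rounding of L and C.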
As far as protein functions of the network are concerned, nine nodes, represented by catalytic activity, oxidoreductase activity, metabolic process, transferase activity, hydrolase activity, protein binding, metal ion binding, nucleotide binding, and ATP binding, retain significant Z-scores as well. In addition, eight annotations, represented by lyase activity, magnesium ion binding, isomerase activity, α/β hairpin, cellular amino acid biosynthetic process, carbohydrate metabolic process, 2-layer sandwich, and NAD(P) binding Rossmann domains, exhibit significant values of Zk.
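The hub criterion described above (both Z-scores ≥ 2.5) can be sketched as follows. This is a minimal illustration on toy data, not the original analysis: the node names and values are invented, the use of the population standard deviation is an assumption, and `chebyshev_bound` only reports the distribution-free upper bound 1/z² on the fraction of nodes that can lie that far from the mean.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Map each node to (x - mean) / std; population std is an assumption."""
    xs = list(values.values())
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return {node: 0.0 for node in values}
    return {node: (x - mu) / sigma for node, x in values.items()}


def detect_hubs(degree, betweenness, threshold=2.5):
    """Return nodes whose degree AND betweenness Z-scores reach the threshold."""
    z_deg, z_btw = z_scores(degree), z_scores(betweenness)
    return sorted(n for n in degree
                  if z_deg[n] >= threshold and z_btw[n] >= threshold)


def chebyshev_bound(z):
    """Chebyshev: at most 1/z^2 of values lie z or more std devs from the mean."""
    return 1.0 / z ** 2


# Toy network (hypothetical): one highly connected annotation, eleven motifs.
motif_degree = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3]
motif_btw = [0.0, 0.01, 0.0, 0.02, 0.01, 0.0, 0.01, 0.02, 0.0, 0.01, 0.01]
degree = {"hub": 60, **{f"m{i}": d for i, d in enumerate(motif_degree)}}
betweenness = {"hub": 0.9, **{f"m{i}": b for i, b in enumerate(motif_btw)}}

hubs = detect_hubs(degree, betweenness)
print(hubs, chebyshev_bound(2.5))
```

Requiring both scores to pass the threshold is the conservative reading of the criterion; annotations that pass on only one score, such as those listed with a significant Zk alone, would be reported separately.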

2 Construction and assessment of the ‘ESN’

Pairs {a0, b0} of connected motifs that retain glimpses of evolutionary signals have been taken from the ISHN and used to derive sequence profiles, here referred to as ‘evolutionary probes’ (ep), thereby extending the detection of structurally conserved targets that retain amino acid conservation at specific positions. The pipeline designed to detect these motifs has been described elsewhere (see Gullotto, 2021). In addition, an information-theoretic approach has allowed the detection of critical amino acid positions. The approach is based on the calculation of Shannon’s entropy H, i.e. a measure of uncertainty based on physicochemical and statistical attributes, and of the corrected mutual information Mip, i.e. the extent of co-evolutionary association between residues at positions X and Y of pairs {a0, b0}, which separates the signal caused by functional/structural constraints from random noise and phylogeny. Both quantities have been obtained by performing a PSI-BLAST (Altschul et al., 1997) search against the UniRef90 database through the RING 2.0 web server (see Piovesan et al., 2016). Positions that retain H(X) and H(Y) < 0.3 and Mip(X, Y) << 0.188 are here referred to as ‘co-independent evolutionary sites’ (CESs), i.e. amino acid sites that have evolved independently from one another during the pre-biotic world, but may still exhibit an inter-dependence in contemporary domains (see Gullotto, 2021). Moreover, CESs having relative solvent accessibility RSA ≤ 0.3 are here defined as indicative of ‘phylogenetic buriedness’, i.e. the degree of thermodynamic stability due to non-divergence from a common ancestor. By respecting and extending the nomenclature provided for key residues of closed loops (see Goncearenco and Berezovsky, 2010), CESs involved in structural interactions could be referred to as ‘elementary structures’ (ESs), therefore being regarded as members of both ‘elementary structural loops’ (ESLs), i.e. closed loops that retain only ESs, and ‘elementary functional structural loops’ (EFSLs), i.e. closed loops that retain a combination of both EFs and ESs.
By applying the same criteria used to construct the ISHN, an example of an ‘emergent sub-network’ (ESN) achieved by co-opting the aforementioned targets/signals is provided in Figure 2. The resulting framework is composed of 183 SSEs, 469 annotations, and 9,872 edges. From a topological perspective, the ESN cannot be regarded as a small-world network, because of the high number of connections between motifs forming complete sub-graphs. This high connectivity among motifs of the ESN mirrors the higher structural conservation of α-carbon trajectories and the closer evolutionary relationships among the corresponding parent domains (see Figure 3), also allowing the detection of folds that diverged beyond the point where homology is discernible (see Gullotto, 2021). Moreover, hubs of the ESN are represented only by twenty-six structural motifs, owing to significant values for k and g(v) of a subset of highly interconnected motifs throughout the network, and, therefore, are here regarded, together with their interconnected motifs, as members of a ‘sub-domain-size island’. In sum, when compared to the ISHN, the ESN retains the following peculiar attributes: (a) presence of discrete sub-domain-size islands; (b) retention of phylogenetically conserved positions in SSEs; (c) evolutionary relatedness between parent domains of SSEs; (d) higher density; (e) higher degree; (f) higher clustering coefficient; (g) lower path length; (h) lower modularity.
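The entropy and mutual-information quantities used to flag candidate CESs can be illustrated with a short sketch on alignment columns. Several points are assumptions made only for illustration: entropies are normalised with a base-20 logarithm so that they fall in [0, 1], the plain (uncorrected) mutual information is computed rather than the corrected Mip obtained through RING 2.0, and the strict "<<" condition on Mip is approximated by a simple "<". The toy columns are invented.

```python
from collections import Counter
from math import log


def shannon_entropy(column, base=20):
    """Entropy of one alignment column; base 20 keeps values in [0, 1]."""
    n = len(column)
    return -sum((c / n) * log(c / n, base) for c in Counter(column).values())


def mutual_information(col_x, col_y, base=20):
    """Plain mutual information between two columns (no phylogeny correction)."""
    n = len(col_x)
    px, py = Counter(col_x), Counter(col_y)
    pxy = Counter(zip(col_x, col_y))
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)), base)
               for (a, b), c in pxy.items())


def is_ces_candidate(col_x, col_y, h_max=0.3, mi_max=0.188):
    """Both positions conserved (low H) and weakly coupled (low MI)."""
    return (shannon_entropy(col_x) < h_max
            and shannon_entropy(col_y) < h_max
            and mutual_information(col_x, col_y) < mi_max)


# Toy columns: each string is one alignment position across ten sequences.
col_x = "AAAAAAAAAA"   # fully conserved position
col_y = "LLLLLLLLLV"   # nearly conserved position
print(shannon_entropy(col_y), is_ces_candidate(col_x, col_y))
```

A production analysis would additionally need the average-product style correction implied by Mip, which requires mutual information over all column pairs of the alignment and is therefore left out of this sketch.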


Figure 2. Graphical representation of the ESN. Colour code refers to degree rank.


Figure 3. Evolutionary relationships of taxa of the ESN. The evolutionary history of the ESN has been inferred using the Minimum Evolution method. To generate clusters, a bootstrap test (1000 replicates) has been performed. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances have been computed using the p-distance method and are in the units of the number of amino acid differences per site. The rate variation among sites has been modeled with a gamma distribution (shape parameter = 1). The Neighbor-joining algorithm has been used to generate the initial tree. The analysis has involved 183 amino acid sequences of domains, denoted by PDB ID, that contain SSEs of the ESN. All ambiguous positions have been removed for each sequence pair. There have been a total of 1100 positions in the final dataset.
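For reference, the p-distance named in the caption of Figure 3 is the proportion of compared sites at which two aligned sequences differ. The sketch below assumes pre-aligned sequences of equal length and treats gaps and the symbols X and ? as ambiguous positions that are dropped per sequence pair; the function name, the ambiguity set, and the toy sequences are illustrative assumptions, not part of the original analysis.

```python
def p_distance(seq_a, seq_b, ambiguous="-X?"):
    """Fraction of differing sites, with ambiguous sites removed per pair."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in ambiguous and b not in ambiguous]
    if not pairs:
        raise ValueError("no unambiguous positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


# Toy aligned fragments: position 4 is skipped because of the gap.
d = p_distance("MKV-LA", "MRVTLG")
print(d)  # 2 differences over 5 compared sites
```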

3 Conclusions

Different protein levels have been reconnected to explore the occurrence of evolutionary relationships within the protein space, leading to a protein network, namely the ISHN, which reveals high connectivity of structural and functional features of LUCA. Assessment of intra-chain interactions of closed loops of the ISHN reveals critical positions, namely CESs, which presumably have evolved in co-option with domains during the transition from the pre-biotic to the domain world. Evolutionary probes derived from motifs of the ISHN retain conserved positions that may probe the boundaries of sequence similarity between queries and targets. The co-option of the evolutionary signals outlined above may lead to the emergence of sub-networks, namely ESNs, which make sense from an evolutionary perspective. Intra-chain motifs that retain CESs spread over different (super)families suggest three not mutually exclusive scenarios that describe the transition from the pre-biotic to the domain world. In the first scenario, EFs/ESs have emerged at appropriate positions of polypeptide chains as a result of late convergent evolution. In the second scenario, pathways of combinatorial recruitment of acceptors/donors of closed loops with mono- and/or polyphyletic origin, already retaining EFs/ESs, have given rise to domains directly. In the third scenario, pathways of monophyletic recruitment of acceptors/donors of closed loops that already retained EFs/ESs, with the more rigid constraints needed for peculiar functions to be exerted, have given rise to tandem-repeat protein domains (Andrade et al., 2001). Statistics for both the ISHN and the ESN, together with lists of pairs {a0, b0}, eps, projections, and intra-chain interactions of CESs, can be downloaded from the Tools section.

References

· Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389−3402 (1997).

· Andrade M.A., Perez-Iratxeta C., Ponting C.P. Protein repeats: structures, functions, and evolution. J. Struct Biol 134: 117−131 (2001).

· Berezovsky I.N., Grosberg A.Y., Trifonov E.N. Closed loops of nearly standard size: common basic element of protein structure. FEBS Lett 466: 283−286 (2000).

· Berezovsky I.N., Kirzhner V.M., Kirzhner A., Rosenfeld V.R., Trifonov E.N. Closed loops: persistence of the protein chain returns. Protein Eng 15: 955−957 (2002).

· Goncearenco A. and Berezovsky I.N. Prototypes of elementary functional loops unravel evolutionary connections between protein functions. Bioinformatics 26: 497−503 (2010).

· Gullotto D. Fine tuned exploration of evolutionary relationships within the protein universe. Stat Appl Genet Mol Biol DOI: 10.1515/sagmb-2019-0039 (2021).

· Gullotto D., Nolassi S.M., Bernini A., Spiga O., Niccolai N. Probing the protein space for extending the detection of weak homology folds. J Theor Biol 320: 152−158 (2013).

· Humphries M.D. and Gurney K. Network ‘small-world-ness’: a quantitative method for determining canonical network equivalence. PLoS One 3: e0002051 (2008).

· Marchler-Bauer A., Bo Y., Han L., He J., Lanczycki C.J., et al. CDD/SPARCLE: functional classification of proteins via subfamily architectures. Nucleic Acids Res. 45: 200−203 (2017).

· Piovesan D., Minervini G., Tosatto S.C. The RING 2.0 web server for high quality residue interaction networks. Nucleic Acids Res. 44: 367−374 (2016).

· Rice J.A. Mathematical statistics and data analysis. Belmont CA, Wadsworth Pub. Co. (1994).

· Sobolevsky Y., Frenkel Z.M., Trifonov E.N. Combinations of ancestral modules in proteins. J. Mol. Evol. 65: 640−650 (2007).

· Trifonov E.N., Kirzhner A., Kirzhner V.M., Berezovsky I.N. Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol. 53: 394−401 (2001).