Abstract
Changes in the structure of biological macromolecules, such as RNA and protein, have an important impact on biological functions and can be important determinants of disease pathogenesis and treatment. Some genetic variations, including copy number variation and single nucleotide variation, can alter biological function and increase susceptibility to certain diseases by changing the structure of biological macromolecules. Here, we review research progress on the effects of genetic variation on the structure of macromolecules, including RNAs and proteins, together with several typical methods, common tools, and the effects on several diseases. An online resource (http://www.onethird-lab.com/gems/) has also been built to support convenient retrieval of common tools. Finally, the challenges and future development of effect prediction are discussed.
Keywords: Genetic variation; single nucleotide variation; macromolecular structure; RNA; protein.

Introduction

Genetic variation takes many forms, ranging from large structural fragment insertions/deletions and copy number variation (CNV) to short insertions/deletions and single nucleotide variation (SNV), and its effects are complex. Genetic variation can lead to changes in gene function or affect expression quantitative trait loci (eQTL) and cause abnormal gene expression. For biological macromolecules, genetic variation may directly change the sequence underlying the structure and cause abnormalities in biological processes. For example, Ennis et al. [1] found that a T/C (ss71651738) single nucleotide polymorphism (SNP) located at an origin of replication may lead to changes in replication forks and to repeat expansion. The resulting repeat instability is thought to be an important cause of Fragile X syndrome (FXS) [2], which is caused by FMR1 silencing due to expansion of the CGG repeat [3].
RNA molecules fold into complex structures due to intramolecular interactions between nucleotides, a folding process that involves not only simpler secondary structures but also forces in three dimensions. In this process, genetic variation can directly affect the structure of coding RNA molecules [4,5], or non-coding RNA (ncRNA) involved in biological processes [6], such as microRNA (miRNA) [7], mitochondrial tRNAs (mt-tRNAs) [8,9], and long non-coding RNA (lncRNA) [10,11].
Proteins are another common class of macromolecules, which play a crucial role in a series of biological processes such as cell proliferation, signal transduction, host-pathogen interaction, and protein transport. Minor changes in proteins can have dramatic effects on phenotype, such as the development of disease and drug resistance. Although the structure of proteins is much easier to predict than that of RNA [12], protein formation is more complex and diverse than that of RNA. Protein folding, stability, interaction, and activity may be affected by single-point and multi-point mutations. Analyzing and determining the influence of genetic variation on the amino acid sequence is therefore a key step in explaining its influence on protein structure.
Although little is known about how genetic variation ultimately leads to structural changes, regularities can be found and summarized from known molecular structure data and sequence data. In this process, prediction of the macromolecular structure plays a significant role, and an accurate prediction can serve as an evaluation index of the structural impact caused by variation. Here, we review the influence of genetic variation on macromolecules, especially that of SNPs and SNVs on RNA and protein. Some commonly used prediction methods and tools are summarized, as well as the impact of genetic variants on several clinical diseases.

Principle and process of RNA structure prediction

The function of RNA largely depends on its structure, which is determined by its sequence. However, structure formation is not a simple process, and RNA folds in multiple steps. In general, RNA sequences initially fold into the most thermodynamically favorable secondary structural elements, forming complementary base-paired double helices by self-folding [13]. Therefore, accurate prediction of secondary structure is a key step in predicting tertiary structure and function [14]. Several factors should be considered in prediction. First, about 40% of base pairs in nature are non-canonical, that is, base pairs other than A-U, G-C, and G-U [15]. Second, base triples, which are clusters of three interacting bases, exist widely in RNA structure [16] and can stabilize many RNA tertiary interactions [17]. Finally, pseudoknots form when bases in different loops pair with each other [18]. These factors contribute to the functional diversity of RNA, but also increase the difficulty of predicting its structure.
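To make two of these factors concrete, the following minimal Python sketch flags non-canonical base pairs and detects pseudoknots in a list of base pairs. The function names and the 0-based (i, j) pair representation are illustrative assumptions, not part of any tool discussed in this review; a pseudoknot is detected here simply as two pairs that cross each other.

```python
# Canonical pairs as defined in the text: A-U, G-C, and G-U (both orientations).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def non_canonical_pairs(seq, pairs):
    """Return the base pairs (0-based index tuples) that are not A-U, G-C, or G-U."""
    return [(i, j) for i, j in pairs if (seq[i], seq[j]) not in CANONICAL]


def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalize each pair so the smaller index comes first, then sort by it.
    norm = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a in range(len(norm)):
        i, j = norm[a]
        for b in range(a + 1, len(norm)):
            k, l = norm[b]
            if i < k < j < l:
                return True
    return False
```

Nested pairs such as (0, 5) and (1, 4) are not reported as a pseudoknot, whereas crossing pairs such as (0, 3) and (2, 5) are.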

Traditional RNA secondary structure prediction

In the era when direct measurement of RNA structure was not mature and RNA sequence data were scarce, computational prediction was the mainstream method for identifying RNA secondary structure. Most of these traditional methods simulated the RNA structure based on the minimum free energy (MFE), but their accuracy and calculation speed met a bottleneck [19,20]. The best traditional methods reached a performance ceiling of about 80% [21].
Traditional RNA secondary structure prediction methods can be divided into two categories: (i) Methods based on an MFE score. These are the most widely used methods and are usually combined with a dynamic programming algorithm to find the MFE structure in the thermodynamically stable state. However, their speed and accuracy on long RNA sequences and their performance in the presence of pseudoknots are not ideal. (ii) Methods based on comparative sequence analysis, which are more accurate than the former. They rely on the assumption that RNA secondary structure is more conserved than RNA sequence in evolution, so that a group of homologous sequences, which are generally similar in structure, can be used to predict secondary structure. Comparative sequence analysis can also predict structures with pseudoknots [22,23], but the accuracy is still limited. In addition, comparative sequence analysis can be combined with score-based methods [24-27]. However, a major limitation of this approach is that many homologous sequences are required.
The traditional RNA secondary structure prediction workflow is shown in Figure 1. First, RNA sequence data and its molecular thermodynamics and molecular dynamics parameters are used to obtain a series of possible free energy structures through a dynamic programming algorithm [28]. Second, the partition function and base pairing probabilities of the RNA molecule are obtained according to the specific constraints of the actual environment, such as temperature. Then, these are used to trace back through the candidate structures obtained in the first step to get a new MFE structure. Finally, the optimal free energy structure is taken as the prediction result, and a graphical output in the form of an energy dot plot is produced.
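The dynamic programming and traceback steps can be illustrated with a deliberately simplified sketch. The code below is not an MFE method: as a stated simplification, it replaces the thermodynamic energy model with Nussinov-style base pair maximization (each allowed pair scores 1), and the function name and the `min_loop` parameter are illustrative assumptions. It does show the same fill-then-trace-back pattern that MFE programs follow.

```python
def nussinov(seq, min_loop=3):
    """Maximize the number of nested base pairs; return (pair_count, dot_bracket).

    Simplification: every A-U, G-C, or G-U pair scores 1, whereas real MFE
    methods use nearest-neighbor free energy parameters instead.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    score = [[0] * n for _ in range(n)]

    # Fill step: solve short subsequences first, then longer ones.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = score[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in allowed:
                    left = score[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + score[k + 1][j - 1])
            score[i][j] = best

    # Traceback step: recover one optimal structure from the filled table.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in allowed:
                left = score[i][k - 1] if k > i else 0
                if score[i][j] == left + 1 + score[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (score[0][n - 1] if n else 0), "".join(structure)
```

Because only nested pairs are considered, this recursion cannot produce pseudoknots, which mirrors the limitation of score-based methods noted above.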

Machine learning-based RNA secondary structure prediction

In 2010, Kertesz et al. [20] described parallel analysis of RNA structure (PARS), which combined high-throughput technology with RNA structure-specific enzymes to provide secondary structure analysis of thousands of RNAs at single nucleotide resolution. Many detection methods have since been developed and applied to RNA structure analysis, so the quality and quantity of RNA sequence data have continuously improved. With the development and improvement of machine learning (ML) technology, especially deep learning, ML-based prediction methods have gradually developed and replaced traditional methods.
ML-based methods train their models in a supervised way [29]. These models learn to map features to structures by adjusting model parameters according to known structures and the corresponding sequences and other feature information (Supplementary File S2.1). Many of them use free energy parameters, encoded RNA sequences, sequence patterns, or evolutionary information as key features, and their outputs can be class labels (such as paired or unpaired) or continuous values (such as free energy). When the features of a new sequence are input into the trained model, the model classifies the corresponding label or predicts the corresponding parameter [29]. Because such a model learns directly from the information in the data, it does not rely on the assumptions made in traditional methods and is easy to combine with known biological rules [19]. In addition to improving prediction accuracy, a trained model is faster than traditional methods, so it has more advantages in processing long RNA sequences.
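As a minimal sketch of what "encoded RNA sequences" and "paired or unpaired" labels can look like in practice, the Python snippet below builds a one-hot feature matrix from a sequence and a per-nucleotide label vector from a dot-bracket structure. The function names and the A, C, G, U column order are illustrative assumptions; actual methods differ in their encodings.

```python
def one_hot(seq, alphabet="ACGU"):
    """Encode an RNA sequence as one row of 0/1 values per nucleotide."""
    return [[1 if base == a else 0 for a in alphabet] for base in seq.upper()]


def paired_labels(dot_bracket):
    """Label each position 1 if paired, 0 if unpaired ('.')."""
    return [0 if ch == "." else 1 for ch in dot_bracket]


# Toy training example: features X and labels y for one known structure.
X = one_hot("GGGAAAUCC")
y = paired_labels("(((...)))")
```

A supervised model would be fitted on many such (X, y) examples from known structures and then applied to the features of a new sequence.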

The development history and representative methods of RNA prediction

Since the concept of RNA secondary structure was put forward in 1960 [30], research on how RNA secondary structure forms from sequence and gives rise to function has never stopped. In 1971, Tinoco et al. [31] proposed the first simple method to estimate the secondary structure of RNA molecules, but Delisi et al. [32] did not get the expected results when trying to predict the secondary structure of RNA from the MFE. In 1981, Zuker et al. [33] used a dynamic programming algorithm to predict RNA secondary structure based on the MFE model. In 1989, Zuker et al. developed a computer program that calculates optimal and suboptimal folds and became the basis of the mfold package [34]. In 1994, Hofacker et al. [28] proposed the ViennaRNA package, including the RNAfold method to calculate the MFE structure or the partition function of RNA molecules [12]. In 2004, Ding et al. combined the MFE model with statistical methods and developed the software package Sfold. In 2010, Halvorsen et al. developed the SNPfold algorithm, which compares the sequence information before and after mutation based on the RNAfold partition function calculation [35]. In 2014, Mathews provided four programs in the RNAstructure package based on comparative sequence analysis [36]. With the development of RNA structure detection technology, learning-based methods were increasingly used to predict RNA structure. In 2020, Chen et al. proposed an end-to-end deep learning model, called E2Efold, which significantly improved the accuracy and speed of prediction [37]. In 2022, Fu et al. developed the deep learning-based UFold method, which further improved the accuracy of prediction [38]. The development history of common RNA structure prediction methods is shown in Figure 2.
Here we list several representative methods for RNA prediction and more details of the RNA traditional prediction software packages are shown in supplementary materials, including mfold package (Supplementary File S1.1), ViennaRNA package (Supplementary File S1.2), and RNAstructure package (Supplementary File S1.3).
Mfold was first proposed by Zuker et al. in the 1980s [34,39,40] and introduced improvements that made RNA structure prediction more accurate and efficient. The first is the incorporation of new energy rules, and the second is the ability to calculate optimal and suboptimal foldings. Constraints can be imposed based on prior assumptions, and suboptimal structures may be more consistent with experimental data than the optimal one. In terms of output, the lists of optimal and suboptimal foldings can be sorted by energy. An energy dot plot can also be used to describe all suboptimal folds in one image [41].
RNAfold is part of the ViennaRNA package, a program suite widely used to compute and compare RNA secondary structures. Using a dynamic programming algorithm, the code predicts the MFE structure and calculates the equilibrium partition function and base pairing probabilities, which may serve as constraints. The package also provides RNA secondary structure comparison based on tree editing and alignment, and it can calculate the MFE, backtrack the optimal secondary structure, and efficiently solve the RNA inverse folding problem. It is worth noting that RNAfold accepts constraints on the folding algorithm to force pairing at specific positions [13,28].
The SNPfold algorithm was designed based on the RNA partition function calculation in RNAfold [26,42]. The difference is that the SNPfold algorithm requires two RNA strands of the same length: a wild-type RNA sequence and an RNA sequence containing the genetic variation. To analyze the influence of SNPs on RNA structure, SNPfold calculates the Pearson correlation coefficient between the base-pairing probabilities of the two sequences, which are derived from the partition function over the set of possible RNA conformations of each sequence. In addition, it helps to identify disease-related mutations in regulatory RNA by analyzing genome-wide association study (GWAS) data and whole mRNA structures [35].
UFold is a recently developed deep learning-based method that is trained directly on labeled data and base pairing rules. UFold proposed a novel image-like representation of RNA sequences, which can be processed effectively by Fully Convolutional Networks (FCNs). It was found to outperform other methods on within-family datasets, with improved prediction speed [38].
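The comparison step can be sketched as follows. This is a minimal illustration in the spirit of SNPfold, not its actual implementation: it assumes that base-pairing probability matrices for the wild-type and mutant sequences have already been obtained from a partition function calculation, summarizes each matrix as a per-nucleotide pairing probability (an assumption made here for simplicity), and computes the Pearson correlation coefficient between the two profiles. The function names and the toy matrices are hypothetical.

```python
from math import sqrt


def pairing_profile(bpp):
    """Sum each row of a symmetric base-pairing probability matrix,
    giving the probability that each nucleotide is paired."""
    return [sum(row) for row in bpp]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


# Toy 4-nt matrices (hand-written values, not computed from real sequences).
wt_bpp = [[0, 0, 0, 0.9], [0, 0, 0.1, 0], [0, 0.1, 0, 0], [0.9, 0, 0, 0]]
mut_bpp = [[0, 0, 0, 0.2], [0, 0, 0.7, 0], [0, 0.7, 0, 0], [0.2, 0, 0, 0]]
r = pearson(pairing_profile(wt_bpp), pairing_profile(mut_bpp))
```

A coefficient close to 1 indicates that the variant leaves the pairing profile largely unchanged, while a low or negative value suggests a structural rearrangement.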
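One plausible way to turn a sequence into an image-like input for a convolutional network is sketched below: for every ordered pair of nucleotide types, an L x L binary channel marks the positions (i, j) where nucleotide i is the first type and nucleotide j is the second. The 16-channel layout, the channel naming, and the function name are assumptions made for illustration; the exact channels used by UFold are not described in this section.

```python
def pair_channels(seq, alphabet="ACGU"):
    """Build one L x L binary matrix per ordered nucleotide-type pair.

    channels["GC"][i][j] is 1 when seq[i] == "G" and seq[j] == "C".
    """
    seq = seq.upper()
    n = len(seq)
    channels = {}
    for a in alphabet:
        for b in alphabet:
            channels[a + b] = [
                [1 if (seq[i] == a and seq[j] == b) else 0 for j in range(n)]
                for i in range(n)
            ]
    return channels
```

Each position (i, j) is active in exactly one channel, so the stack of channels behaves like a multi-channel image in which candidate base pairs are pixels.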

Principle and process of protein structure prediction

Changes in protein structure are influenced by a variety of factors, including SNPs and mutations. Nucleotide variations that cause an amino acid change are called nonsynonymous mutations, and nonsynonymous SNPs (nsSNPs) are an important focus of this research. These single-base changes, or multiple nucleotide substitutions that alter the amino acid sequence of the encoded protein, are called single amino acid variations (SAVs) or missense variations. Amino acid changes can affect protein stability, interactions, and enzyme activity, and can even cause disease. Therefore, accurate prediction of the effect of genetic variation on protein structure is crucial for understanding the mechanisms by which genome variation is associated with certain diseases [43].
In reviewing the literature, we found that prediction of the effect of genetic variants on protein structure is closely tied to machine learning methods. Prediction methods often combine protein characterization data, such as protein dynamics, contact potential scores, interatomic interactions, and other features, and use machine learning algorithms including support vector machines, random forests, and deep learning to train models and make predictions. Therefore, several feature extraction methods and machine learning methods are reviewed in this section.

Protein structure feature extraction method

Normal Mode Analysis (NMA) provides a valuable method for studying system dynamics and accessible conformations as an alternative to time-consuming and computationally expensive molecular dynamics simulation. The dynamic properties can be extracted from the protein structure with the NMA module of the Bio3d tool [44].
Analysis of mutation effects. The change in the Gibbs free energy of folding may be caused by many related factors. To combine these characteristics, Arpeggio [45] can be used to calculate the number of hydrophobic contacts involving wild-type residues, together with the contact potential scores from the AAINDEX database [46].
The graph-based structural signatures approach to representing molecular structures has proven successful in a range of applications to the study of protein structure and of the changes caused by missense mutations, including phenotypic changes. These signatures comprise physicochemical and geometrical properties of the wild-type environment and are based on distance patterns mined from the 3D structure by representing atoms as nodes and their interactions as edges. The physicochemical properties and the distance patterns between atoms are then defined according to the properties of amino acids (i.e., pharmacophores) and transformed into a cumulative distribution function [47-50].
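The idea of a cumulative distance pattern can be sketched in a few lines of Python. This is a simplified illustration rather than the published signature definition: atoms are given as hypothetical (label, coordinates) tuples, where the label stands in for a pharmacophore class, and for each pair of labels the code counts how many atom pairs lie within each of a series of increasing distance cutoffs. The function name, the labels, and the cutoff values are all assumptions.

```python
from itertools import combinations
from math import dist


def distance_signature(atoms, cutoffs):
    """Count atom pairs within each cutoff, grouped by their label pair.

    atoms   : list of (label, (x, y, z)) tuples
    cutoffs : increasing distance thresholds
    returns : {(label_a, label_b): [cumulative count per cutoff]}
    """
    signature = {}
    for (label_a, pos_a), (label_b, pos_b) in combinations(atoms, 2):
        key = tuple(sorted((label_a, label_b)))
        d = dist(pos_a, pos_b)
        counts = signature.setdefault(key, [0] * len(cutoffs))
        for idx, cutoff in enumerate(cutoffs):
            if d <= cutoff:  # cumulative: a short distance counts at every larger cutoff
                counts[idx] += 1
    return signature


# Toy environment of three atoms around a residue (coordinates are made up).
atoms = [
    ("hydrophobic", (0.0, 0.0, 0.0)),
    ("hydrophobic", (3.0, 0.0, 0.0)),
    ("polar", (0.0, 4.0, 0.0)),
]
sig = distance_signature(atoms, cutoffs=[2.0, 4.0, 6.0])
```

Concatenating the count vectors over all label pairs gives a fixed-length feature vector that can be passed to a machine learning model.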

ML

ML was successfully applied to proteins earlier than to RNA (Supplementary File S2.2). As executors of functions, proteins are robust and have a large number of features that can be used for ML. In bioinformatics, algorithms such as deep learning, random forests, and support vector machines are widely used in protein structure prediction [51], protein functional site prediction [52,53], and subcellular location prediction [54,55]. There are many ML algorithms, and no single one is best for all tasks. The workflow of the classic sequence-based tool DynaMut2 [56] is shown as an example in Figure 3.

The development history and representative methods of protein prediction

Several methods have been developed to predict how missense mutations affect protein stability using sequence or structural information, which are often complementary. The development history of the relevant methods is shown in Figure 4. According to the characteristics of existing methods, we divide them into five categories and introduce the characteristics and representative methods of each.

The structure-based method only (red part in the figure)

mCSM, developed by Pires et al. in 2014, is a protein structure-based method that relies on graph-based signatures to study missense mutations [47]. It encodes interatomic distances to represent the environment of a protein residue and uses them to train the prediction model. Subsequent studies have also shown that the effect of a mutation is related to the atomic distance patterns surrounding the amino acid residue. An mCSM web server has been established and provides many extended tools and methods.
SDM is an algorithm based on a statistical potential energy function, proposed by Topham et al. in 1997 [57]. Worth et al. established the SDM web server in 2011 [58]. As a structure-based method, SDM uses the amino acid substitution frequencies of homologous protein families in different structural environments to calculate a stability score between wild-type and mutant proteins. The change in protein stability is an important component in estimating the effect of genetic variation on protein structure.

The sequence-based method only (green part in the figure)

MuStab is a web server developed by Teng et al. in 2010 [59], which implements a machine-learning method for predicting protein stability changes from sequence features. It uses experimental data on the free energy change of protein stability upon mutation. After analyzing 20 sequence features, the authors found that the classifier combining 6 sequence features was the most accurate: stability (S3), bulkiness (Bu), transmembrane tendency (Tt), beta-sheet (B), average area buried on transfer from standard state to folded protein (Aa), and the mobility of an amino acid on chromatography paper (Mc) (Supplementary File S2.2).

Methods based on structure and sequence (yellow part in the figure)

I-Mutant3.0 was proposed by Capriotti et al. in 2008 [60]. Using a support vector machine (SVM) based on sequence or structure, it classifies mutations according to the experimental protein stability free energy change value (ΔΔG) into stabilizing, destabilizing, and neutral mutations. It also improves the prediction of the free energy change caused by single-point protein mutations.
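The three-class labeling of ΔΔG values can be illustrated with a short sketch. The threshold of 0.5 kcal/mol and the sign convention (positive ΔΔG taken as stabilizing) are assumptions made here for illustration; both vary between tools and datasets and should be checked against the documentation of the specific predictor.

```python
def classify_ddg(ddg, threshold=0.5):
    """Map a ΔΔG value (kcal/mol) to a three-class stability label.

    Assumed convention: positive ΔΔG means the mutation stabilizes the protein.
    Values within +/- threshold are treated as neutral.
    """
    if ddg > threshold:
        return "stabilizing"
    if ddg < -threshold:
        return "destabilizing"
    return "neutral"


labels = [classify_ddg(v) for v in (-1.8, 0.2, 0.9)]
```

Labels produced this way from experimental ΔΔG values are what a three-class classifier such as an SVM would be trained to reproduce.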

Other methods (purple part in the figure)

iStable is an integrated predictor that uses an SVM to predict the change in protein stability when a single amino acid residue is mutated. It combines sequence information with the prediction results of different element predictors, using the SVM as an integrator, and is built on a grid computing architecture [61]. iStable2.0 systematically improves on the prediction performance of iStable [62].

The new method (blue part in the figure)

Many new methods have emerged in recent years. Among them, AlphaFold2 [63] based on deep learning, and RoseTTAFold [64] using a 3-track neural network are two representative methods, both of which have achieved amazing accuracy in predicting protein structure.
The latest version of AlphaFold is based on a new ML method that integrates physical and biological knowledge of protein structure into the design of the deep learning algorithm by using multiple sequence alignments [63]. AlphaFold is a fully redesigned neural network-based model whose prediction accuracy is similar to that of experimental structures and which significantly outperforms other methods in most cases.
RoseTTAFold was developed by Baek et al. [64]. Building on a network architecture based on related ideas, RoseTTAFold achieved its best performance with a three-track network that successively transforms and integrates information at the one-dimensional sequence level, the two-dimensional distance level, and the three-dimensional coordinate level. The structure prediction accuracy of the three-track network is close to that of DeepMind in the 14th round of the Critical Assessment of Structure Prediction (CASP14).

Common identification tools and software

Here, we list some tools that can predict the effect of mutation on macromolecular structure (Supplementary Table T1). These tools combine mutation-related information with macromolecular structure prediction methods. A tool usually takes the molecular sequence and variation information as input and outputs the predicted wild-type structure, the structure after variation, and the change in macromolecular thermodynamic indices. We have developed a website, GEMS (http://www.onethird-lab.com/gems/), as a brief introduction and index to these tools.

Tools to identify structural effects on RNA

RNAsnp (https://rth.dk/resources/rnasnp) is a web server that uses different modes to predict the effect of SNPs on RNA sequences of different lengths. The global folding method RNAfold [28] is used to calculate the base pair probabilities of wild-type and mutant sequences shorter than 1000 nt, and the local folding method RNAplfold [65] is used for longer RNA sequences. Both methods are part of the ViennaRNA package [13]; for more information see the annex. SNP effects are quantified using extensive pre-computed tables of the distributions of substitution effects as a function of gene length and GC content. The input of RNAsnp can be a single RNA sequence in FASTA format with one or more mutations whose structural effect needs to be predicted. It not only provides the structural prediction results but also features a graphical output representation [66,67] (Supplementary File S5.1).
MutaRNA (http://rna.informatik.uni-freiburg.de/MutaRNA) is a web server for studying SNV-induced RNA structure changes, and is also the first tool to provide different dot plots for comparative analysis of base pairing potentials. MutaRNA uses the local folding method RNAplfold [65], which is part of the ViennaRNA package [13], to retrieve candidate RNAs. MutaRNA also integrates the empirical p-values from RNAsnp [67] and the relative entropy between the wild type and the mutant from remuRNA [68] to quantify the structural aberration caused by an SNV. The input of MutaRNA is the wild-type (WT) RNA sequence in FASTA format and the location of the mutation. It provides a variety of visualizations, including heat map matrices, circular plots, and arc plots, which users can use directly in scientific reporting [69] (Supplementary File S5.2).
LncCASE (http://bio-bigdata.hrbmu.edu.cn/LncCASE) is a web database built around the prediction of lncRNAs in cancer. It integrates genomic and transcriptomic data from human cancers with multidimensional molecular profiles of tumor samples, biomolecular interaction networks, and pathway data resources. LncCASE uses a computational method to identify the sub-pathways driven by lncRNAs under the influence of CNV. The copy number level of each lncRNA was re-annotated, and the lncRNA CNV spectrum was constructed and visualized. The tool further analyzes the biological effects of lncRNAs affected by genetic variation in cancer, which facilitates the study of cancer biology [70].
PON-mt-tRNA (http://structure.bmc.lu.se/PON-mt-tRNA) is a prediction tool for pathogenic variants in tRNAs. Since all known pathogenic tRNA variants are located in mitochondria, the investigators collected mt-tRNA variants and developed a multivariate probabilistic prediction method based on a random forest machine-learning algorithm. The method requires the reference position in the mtDNA, the reference (original) nucleotide, and the altered nucleotide for each variation as inputs. In addition, users have the option to submit evidence from segregation, biochemical, and histochemical characterization. The investigators used PON-mt-tRNA to classify all possible single nucleotide substitutions in all human mt-tRNA genes, and the predictions for all of these substitutions are documented [71].
UFold (https://ufold.ics.uci.edu) is a web server developed to facilitate the use of the UFold method. Users can enter or upload RNA sequences in FASTA format. The server predicts the secondary structure of the RNA using pre-trained UFold models (trained on all datasets) and stores the predicted structure in a dot-bracket file or bpseq file for users to download. In the options panel, the user can also choose whether or not to predict non-canonical pairs. The server further provides an interface to the VARNA tool [72] to visualize the predicted structures. Most existing RNA prediction servers, such as RNAfold [13], MXfold2 [73], and SPOT-RNA [74], can only predict one RNA sequence at a time and limit the length of the input sequence, but UFold does not have that limitation. Unfortunately, UFold is currently unable to incorporate SNP data to predict the effect of structure-disrupting mutations. More tools for predicting the effects of mutations on RNA structural disruption are shown in Supplementary File S3.
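Preparing input for such servers typically means pairing a wild-type sequence with its mutated counterpart. The helper below is a minimal sketch of that step: it applies a single nucleotide substitution written in a hypothetical "G3A" style (reference base, 1-based position, alternative base) and formats sequences as FASTA. The notation, function names, and line width are assumptions; the exact mutation format each server accepts should be taken from its own documentation.

```python
def apply_snv(seq, mutation):
    """Apply a substitution such as 'G3A' (ref base, 1-based position, alt base)."""
    ref, pos, alt = mutation[0].upper(), int(mutation[1:-1]), mutation[-1].upper()
    seq = seq.upper()
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[: pos - 1] + alt + seq[pos:]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i : i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


wild_type = "AUGGC"
mutant = apply_snv(wild_type, "G3A")
record = to_fasta("wt", wild_type) + to_fasta("G3A", mutant)
```

Checking the reference base before substituting guards against off-by-one errors in the reported mutation position.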
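The two output formats mentioned above carry the same pairing information for nested structures, and converting between them is straightforward. The sketch below turns a dot-bracket string into bpseq text, in which each line holds the 1-based position, the base, and the position of its pairing partner (0 if unpaired). The function name is an assumption, and only plain parentheses are handled, so pseudoknotted structures written with additional bracket types are rejected.

```python
def dot_bracket_to_bpseq(seq, structure):
    """Convert a dot-bracket structure to bpseq text (index, base, partner)."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    partner = [0] * len(seq)  # 0 means unpaired in bpseq
    stack = []
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            j = stack.pop()
            partner[idx] = j + 1
            partner[j] = idx + 1
        elif ch != ".":
            raise ValueError(f"unsupported symbol {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return "\n".join(f"{i + 1} {seq[i]} {partner[i]}" for i in range(len(seq)))
```

For example, the hairpin `(...)` on the sequence GAAAC yields five lines in which positions 1 and 5 point at each other and the loop positions have partner 0.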

Tools to identify structural effects on proteins

INPS (http://inps.biocomp.unibo.it) is an approach that starts from protein sequence information and does not rely on structure to annotate the effect of non-synonymous mutations on protein stability. INPS is based on support vector machine regression and is trained to predict the change in thermodynamic free energy caused by a single point change in a protein sequence. It has the advantage of being suitable for calculating the effect of nonsynonymous polymorphisms on protein stability when the protein structure is unavailable. The INPS predictor consists of one support vector regression (SVR) model trained on single point variations in different proteins, and it can complement structure-based methods such as mCSM [47,75].
DynaMut2 (http://biosig.unimelb.edu.au/dynamut2) is a tool that integrates information on protein dynamics with structural environment attributes of wild-type residues. As a graph-based signature method, it can accurately predict the effects of single and multiple point mutations on protein stability and dynamics. DynaMut2 predicts the Gibbs free energy change of single point mutations, or of multiple point mutations involving no more than 3 mutations, based on Normal Mode Analysis (NMA) and protein dynamics analysis. Its input may be a single mutation or a mutation list, and its performance is better than that of other methods in predicting stability changes caused by single-point mutations [56].
mCSM-PPI2 (http://biosig.unimelb.edu.au/mcsm-ppi2/) is a machine-learning computational tool that can accurately predict the effect of missense mutations on the binding affinity of protein interactions. It leverages graph-based structural signatures to model inter-residue interaction networks, evolutionary information, complex network metrics, and the effects of energy terms to generate optimized predictors. mCSM-PPI2 can be used to assess the impact of a user-specified mutation or to predict the impact of protein interface mutations in an automated manner [76].
PhyreRisk (http://phyrerisk.bc.ic.ac.uk) is a web application for connecting genomic, proteomic, and structural data to facilitate the mapping of human variants to protein structures. It provides information on 20,214 human canonical protein sequences and 22,271 alternative protein sequences (isoforms), and accepts new variant data in genomic coordinate formats (VCF, reference SNP IDs, and HGVS notation) for the human genome builds GRCh37 and GRCh38. In addition, it supports the use of amino acid coordinates to map variations and search for genes or proteins of interest. PhyreRisk aims to enable researchers to translate genetic data into protein structural information that provides a more comprehensive assessment of the functional impact of variants [77].
AlphaFold (https://alphafold.ebi.ac.uk) is an open-access and extensive database that provides highly accurate protein structure predictions. AlphaFold V2.0 gives an unprecedented expansion of the structural coverage of known protein sequence space. The latest version of AlphaFold is based on a novel ML approach that combines physical and biological knowledge about protein structure. Using multiple sequence alignments, this knowledge is incorporated into the design of the deep learning algorithm. In addition, AlphaFold2 uses inductive biases from physics and geometry to build components that learn from PDB data. This enables the network to learn more efficiently from the limited data in the PDB and to deal with the complexity and diversity of structural data. AlphaFold and its computational methods have become important tools for solving biophysical problems in modern biology [63]. Unfortunately, to date no tool has been developed on this basis to predict the structural effects caused by mutation [78]. More tools for predicting the effects of mutations on protein structural disruption are shown in Supplementary File S4.