Abstract
Changes
in the structure of biological macromolecules, such as RNA and protein,
have an important impact on biological functions, and are even important
determinants of disease pathogenesis and treatment. Some genetic
variations, including copy number variation, single nucleotide
variation, and so on, can lead to changes in biological function and
increased susceptibility to certain diseases by changing the structure
of biological macromolecules. Here, we review the progress of research
on the effects of genetic variation on the structure of
macromolecules, including RNAs and proteins, several typical methods and
common tools, and the effects on several diseases. We have also built an
online resource (http://www.onethird-lab.com/gems/) to support
convenient retrieval of common tools. Finally, the challenges and
future development of effect prediction are discussed.
Keywords: Genetic variation; single nucleotide variation;
macromolecular structure; RNA; protein.
Introduction
Genetic variation takes many forms, ranging from large structural
insertions/deletions and copy number variation (CNV) to short
insertions/deletions and single nucleotide variation (SNV), and its
effects are complex. Genetic variation can lead to changes in gene
function, or affect expression quantitative trait loci (eQTL) and cause
abnormal gene expression. For biological macromolecules, genetic
variation may directly change the sequence underlying their structure
and cause abnormalities in biological processes.
For example, Ennis et al. [1] found that a T/C (ss71651738) single
nucleotide polymorphism (SNP) located at the origin of replication may
lead to changes in replication forks and repeat expansion. The
resulting repeat instability is thought to be an important cause of
Fragile X syndrome (FXS) [2], which is caused by FMR1 silencing due to
expansion of CGG repeats [3].
RNA molecules fold into complex
structures due to intramolecular interactions between nucleotides, a
folding process that involves not only simpler secondary structures but
also forces in three dimensions. In this process, genetic variation can
directly affect the structure of coding RNA molecules [4,5], or
non-coding RNA (ncRNA) involved in biological processes [6], such as
microRNA (miRNA) [7], mitochondrial tRNAs (mt-tRNAs) [8,9], and
long non-coding RNA (lncRNA) [10,11].
Proteins are another common class of macromolecules, which play a
crucial role in a series of biological processes such as cell
proliferation, signal transduction, host-pathogen interaction, and
protein transport. Minor changes in proteins can have dramatic effects
on phenotype, such as the development of disease and drug resistance.
Although protein structure is much easier to predict than RNA structure
[12], protein formation is more complex and diverse than that of RNA.
Protein folding, stability, interaction, and activity may be affected by
single-point and multi-point mutations. Analyzing and determining the
influence of genetic variation on the amino acid sequence is therefore a
key step in explaining its influence on protein structure.
Although little is known about how genetic variation ultimately leads to
structural changes, regularities can be found and summarized from known
molecular structure data and sequence data. In this process, the
prediction of macromolecular structure plays a significant role, and an
accurate prediction can serve as an index for evaluating the structural
impact of a variation. Here, we review the influence of genetic
variation, especially SNPs and SNVs, on RNA and protein. Some commonly
used prediction methods and tools are summarized, as well as the impact
of genetic variants on several clinical diseases.
Principle and process of RNA structure prediction
The function of RNA largely depends on its structure, which is
determined by its sequence. However, structure formation is not a
simple process, and RNA undergoes multiple folding steps. In general, an
RNA sequence initially folds into the most thermodynamically favorable
secondary structural elements, forming complementary base-paired double
helices by self-folding [13], and the tertiary structure builds on these
elements. Therefore, accurate prediction of secondary structure is a key
step toward predicting tertiary structure and function [14].
Several factors should be considered in prediction. First, about 40% of
base pairs in nature are non-canonical, that is, base pairs other than
A-U, G-C, and G-U [15]. Second, base triples, which are clusters of
three interacting bases, are widespread in RNA structures [16] and can
stabilize many RNA tertiary interactions [17]. Finally, pseudoknots form
when bases in different loops pair with each other [18]. These factors
contribute to the functional diversity of RNA, but also increase the
difficulty of predicting its structure.
Traditional RNA secondary structure prediction
In the era when direct measurement of RNA structure was immature and
RNA sequence data were scarce, computational prediction was the
mainstream method for identifying RNA secondary structure. Most of these
traditional methods simulated the RNA structure based on the minimum
free energy (MFE), but their accuracy and calculation speed hit a
bottleneck [19,20]. The best traditional methods reached a performance
ceiling of about 80% [21].
Traditional RNA secondary structure prediction methods can be divided
into two categories. (i) Methods based on an MFE score. These are the
most widely used and are usually combined with a dynamic programming
algorithm to find the MFE structure in the thermodynamically stable
state. However, their speed and accuracy on long RNA sequences, and
their performance in the presence of pseudoknots, are not ideal. (ii)
Methods based on comparative sequence analysis, which are more accurate
than the former. They rest on the assumption that RNA secondary
structure is more conserved than RNA sequence in evolution, so the
structural similarity within a group of homologous sequences can be used
to predict secondary structure. Comparative sequence analysis can also
predict structures with pseudoknots [22,23], but the accuracy is still
limited. In addition, comparative sequence analysis can be combined with
score-based methods [24-27]. However, a major limitation of this
approach is that many homologous sequences are required.
The workflow of traditional RNA secondary structure prediction is shown
in Figure 1. First, the RNA sequence and its molecular thermodynamics
and molecular dynamics parameters are used to obtain a series of
candidate structures and their free energies through a dynamic
programming algorithm [28]. Second, the partition function and base
pairing probabilities of the RNA molecule are obtained according to the
specific constraints of the actual environment, such as temperature.
These are then used to trace back through the candidate structures
obtained in the first step to get a new MFE structure. Finally, the
structure with the optimal free energy is taken as the prediction, and a
graphical output in the form of an energy dot plot is produced.
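The dynamic programming and traceback steps can be illustrated with a
deliberately simplified sketch. Instead of a full thermodynamic energy
model, it counts base pairs (a Nussinov-style stand-in for free energy),
fills the table, and traces back one optimal structure. The function
name `fold`, the equal weighting of all pairs, and the minimum loop
length of 3 are assumptions made for illustration; they are not the
parameters used by mfold or RNAfold, and the partition function step is
not covered.

```python
# Simplified Nussinov-style folding: maximize the number of base pairs
# as a crude stand-in for minimizing free energy. Assumption: every
# allowed pair contributes equally; real tools use nearest-neighbor
# thermodynamic parameters.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in a hairpin loop


def fold(seq):
    """Return (pair_count, dot_bracket) for one optimal structure."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + MIN_LOOP + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    right = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score

    # Traceback: recover one structure that achieves best[0][n - 1].
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue  # interval too short to contain a pair
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + MIN_LOOP + 1, j + 1):
            if (seq[i], seq[k]) in PAIRS:
                right = best[k + 1][j] if k + 1 <= j else 0
                if best[i][j] == 1 + best[i + 1][k - 1] + right:
                    structure[i], structure[k] = "(", ")"
                    stack.append((i + 1, k - 1))
                    stack.append((k + 1, j))
                    break
    return best[0][n - 1], "".join(structure)
```

For a short hairpin such as `GGGAAACCC`, the sketch pairs the three G-C
positions around the AAA loop. Replacing the pair count with real energy
terms changes the scores but not the overall shape of the recursion.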
Machine learning-based RNA secondary structure prediction
In 2010, Kertesz et al. [20] described parallel analysis of RNA
structure (PARS), which combined high-throughput technology with RNA
structure-specific enzymes to provide secondary structure analysis of
thousands of RNAs at single-nucleotide resolution. Many detection
methods have since been developed and applied to RNA structure analysis,
so the quality and quantity of RNA sequence data have continuously
improved. With the development and improvement of machine learning (ML)
technology, especially deep learning, ML-based prediction methods have
gradually developed and replaced traditional methods.
ML-based methods train their models in a supervised way [29]. The
models learn to map features to structures by adjusting their
parameters according to known structures and the corresponding sequences
and other feature information (Supplementary File S2.1). Many of them
use free energy parameters, encoded RNA sequences, sequence patterns, or
evolutionary information as key features, and their outputs can be
class labels (such as paired or unpaired) or continuous values (such as
free energy). When the features of a new sequence are fed into the
trained model, the model predicts the corresponding label or value
[29]. Because such a model learns directly from all the information in
the data, it does not rely on the assumptions of traditional methods and
is easy to combine with known biological rules [19]. Besides improving
prediction accuracy, a trained model runs faster than traditional
methods, which is an advantage when processing long RNA sequences.
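As a concrete illustration of the feature and label preparation
described above, the following minimal sketch one-hot encodes an RNA
sequence and derives paired/unpaired labels from a known structure in
dot-bracket notation. The helper names (`one_hot`, `paired_labels`,
`make_example`) and the choice of one-hot encoding are assumptions for
illustration; individual methods use richer features.

```python
# One-hot encode an RNA sequence and derive per-nucleotide labels
# (1 = paired, 0 = unpaired) from a dot-bracket structure.
ALPHABET = "ACGU"


def one_hot(seq):
    """Encode each nucleotide as a 4-element vector; unknown bases map to zeros."""
    return [[1 if base == ref else 0 for ref in ALPHABET] for base in seq.upper()]


def paired_labels(dot_bracket):
    """Return 1 for positions inside a base pair and 0 for unpaired positions."""
    return [0 if symbol == "." else 1 for symbol in dot_bracket]


def make_example(seq, dot_bracket):
    """Bundle features and labels for one supervised training example."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    return one_hot(seq), paired_labels(dot_bracket)
```

A supervised model would then be fitted on many such (features, labels)
pairs and applied to the encoded features of a new sequence.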
The development history and representative methods of RNA prediction
Since the concept of RNA secondary structure was put forward in 1960
[30], research on how RNA secondary structure links sequence to
function has never stopped. In 1971, Tinoco et al. [31] proposed the
first simple method to estimate the secondary structure of RNA
molecules, but Delisi et al. [32] did not obtain the expected results
when trying to predict RNA secondary structure from the MFE. In 1981,
Zuker et al. [33] used a dynamic programming algorithm to predict RNA
secondary structure based on the MFE model. In 1989, Zuker et al.
developed a computer program that calculates optimal and suboptimal
folds and became the basis of the mfold package [34]. In 1994, Hofacker
et al. [28] proposed the ViennaRNA package, including the RNAfold method
for calculating the MFE structure or partition function of RNA molecules
[12]. In 2004, Ding et al. combined the MFE model with statistical
methods and developed the software package Sfold. In 2010, Halvorsen et
al. developed the SNPfold algorithm, which compares the sequence before
and after mutation based on the RNAfold partition function calculation
[35]. In 2014, Mathews provided four programs in the RNAstructure
package based on comparative sequence analysis [36]. With the
development of RNA structure detection technology, learning-based
methods were increasingly used to predict RNA structure. In 2020, Chen
et al. proposed an end-to-end deep learning model, E2Efold, which
significantly improved the accuracy and speed of prediction [37]. In
2022, Fu et al. developed the deep learning-based UFold method, which
further improved prediction accuracy [38]. The development history of
common RNA structure prediction methods is shown in Figure 2.
Here we list several representative methods for RNA structure
prediction. More details of the traditional RNA prediction software
packages are given in the supplementary materials, including the mfold
package (Supplementary File S1.1), the ViennaRNA package (Supplementary
File S1.2), and the RNAstructure package (Supplementary File S1.3).
Mfold was first proposed by Zuker et al. in the 1980s [34,39,40]. It
introduced two improvements that made RNA structure prediction more
accurate and efficient: the incorporation of new energy rules, and the
ability to calculate both optimal and suboptimal foldings. Constraints
can be imposed based on prior assumptions, and suboptimal structures may
be more consistent with experimental data than the optimal one. In terms
of output, the optimal and suboptimal foldings can be listed sorted by
energy, and an energy dot plot can be used to depict all suboptimal
folds in one image [41].
RNAfold is part of the ViennaRNA package, a collection of computer
programs widely used to compute and compare RNA secondary structures.
Using a dynamic programming algorithm, the code predicts the MFE
structure after calculating the equilibrium partition function and base
pairing probabilities, which may serve as constraints. In addition, the
package provides secondary structure comparison based on tree editing
and alignment, and it can calculate the MFE, backtrack the optimal
secondary structure, and efficiently address the RNA inverse folding
problem. It is worth noting that RNAfold accepts constraints on the
folding algorithm that force specific positions to pair [13,28].
The SNPfold algorithm builds on the RNA partition function calculation
in RNAfold [26,42]. Unlike RNAfold, SNPfold requires two RNA sequences
of the same length: a wild-type sequence and a sequence containing the
genetic variation. For each sequence, SNPfold calculates the partition
function over the ensemble of possible RNA conformations, and then
computes the Pearson correlation coefficient between the base-pairing
probabilities of the two RNAs to analyze the influence of SNPs on RNA
structure. It also helps to identify disease-related mutations in
regulatory RNAs by analyzing genome-wide association study (GWAS) data
together with whole-mRNA structure [35].
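The comparison step can be sketched as follows. The sketch assumes that
per-nucleotide pairing probabilities for the wild-type and mutant
sequences have already been computed by an external partition function
program; the function names and the input layout are hypothetical and do
not reproduce the SNPfold code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def structural_similarity(wt_pairing, mut_pairing):
    """Compare pairing probabilities of wild type and mutant.

    Both inputs are assumed to hold one pairing probability per
    nucleotide, in the same order and with the same length.
    """
    return pearson(wt_pairing, mut_pairing)
```

A coefficient close to 1 suggests that the variant leaves the
base-pairing profile largely unchanged, whereas lower values point to a
larger structural rearrangement.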
UFold is a recently developed deep learning-based method that is trained
directly on labeled data and base-pairing rules. It introduces a novel
image-like representation of RNA sequences, which can be processed
efficiently by Fully Convolutional Networks (FCNs). UFold was reported
to outperform other methods on within-family datasets, with improved
prediction speed [38].
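To give an idea of what an image-like representation of a sequence can
look like, the sketch below builds an L x L matrix for every ordered
pair of nucleotides, giving 16 channels in total. This outer-product
layout is an assumption chosen for illustration; the exact input used by
UFold is described in [38] and may differ.

```python
# Build an L x L "image" with 16 channels from an RNA sequence:
# channel (a, b) is 1 at position (i, j) when base i is a and base j is b.
# Assumption: this layout only illustrates an image-like encoding and is
# not claimed to be the exact UFold input.
ALPHABET = "ACGU"


def sequence_image(seq):
    """Return a dict mapping (base_a, base_b) to an L x L 0/1 matrix."""
    seq = seq.upper()
    n = len(seq)
    image = {}
    for a in ALPHABET:
        for b in ALPHABET:
            image[(a, b)] = [
                [1 if seq[i] == a and seq[j] == b else 0 for j in range(n)]
                for i in range(n)
            ]
    return image
```

Each channel marks where one specific combination of bases could meet,
so a convolutional network can scan all candidate pairings of a sequence
as it would scan the pixels of an image.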
Principle and process of protein structure prediction
Changes in protein structure are influenced by a variety of factors,
including SNPs and mutations. Nucleotide variations that cause an amino
acid change are called nonsynonymous mutations, and nonsynonymous SNPs
(nsSNPs) are an important focus of this extensive research. Single-base
changes or multiple nucleotide substitutions that alter the amino acid
sequence of the encoded protein are called single amino acid variations
(SAVs) or missense variations. Amino acid changes can affect protein
stability, interactions, and enzyme activity, and even cause disease.
Therefore, accurate prediction of the effect of genetic variation on
protein structure is crucial for understanding the mechanisms by which
genome variation is associated with certain diseases [43].
Our review of the literature shows that predicting the effect of genetic
variants on protein structure relies heavily on machine learning
methods. Prediction methods often combine protein characterization data,
such as protein dynamics, contact potential scores, interatomic
interactions, and other features, with machine learning algorithms,
including support vector machines, random forests, and deep learning, to
train models and make predictions. Therefore, several feature extraction
methods and machine learning methods are reviewed in this section.
Protein structure feature extraction method
Normal Mode Analysis (NMA) provides a valuable way to study system
dynamics and accessible conformations, as an alternative to
time-consuming and computationally expensive molecular dynamics
simulation. The dynamic properties of a protein structure can be
extracted with the NMA module of the Bio3d tool and used as features
[44].
Analysis of Mutation Effects. The change in the Gibbs free energy of
folding may be influenced by many related factors. To capture these
characteristics, Arpeggio [45] can be used to calculate the number of
hydrophobic contacts involving wild-type residues, together with the
contact potential scores from the AAINDEX database [46].
The graph-based structural signatures approach to representing
molecular structures has proven successful in a range of applications
for studying protein structure and the changes caused by missense
mutations, including phenotypic changes. These signatures capture
physicochemical and geometrical properties of the wild-type residue
environment, based on distance patterns mined from the 3D structure by
representing atoms as nodes and their interactions as edges. The atoms
are labeled according to the physicochemical properties of the amino
acids (i.e., pharmacophores), and the distance patterns between atoms
are transformed into a cumulative distribution function [47-50].
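A minimal sketch of such a signature is given below. It assumes that
each atom has already been assigned a pharmacophore-like label and 3D
coordinates; the labels, cutoffs, and the function name
`distance_signature` are hypothetical and only illustrate the idea of
cumulative, distance-based counts.

```python
import math


def distance_signature(atoms, cutoffs):
    """Count atom pairs per label pair within each distance cutoff.

    atoms   -- list of (label, (x, y, z)) tuples; labels are hypothetical
               pharmacophore classes such as "hydrophobic" or "polar"
    cutoffs -- increasing distance thresholds
    Returns a dict mapping (label_a, label_b) to a list of cumulative
    counts, one per cutoff.
    """
    signature = {}
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (label_a, pos_a), (label_b, pos_b) = atoms[i], atoms[j]
            key = tuple(sorted((label_a, label_b)))  # order-independent key
            dist = math.dist(pos_a, pos_b)
            counts = signature.setdefault(key, [0] * len(cutoffs))
            for idx, cutoff in enumerate(cutoffs):
                if dist <= cutoff:
                    counts[idx] += 1
    return signature
```

Concatenating the count lists of all label pairs yields a fixed-length
feature vector for the residue environment, which can then be passed to
a machine learning model.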
ML
ML was successfully applied to proteins earlier than to RNA
(Supplementary File S2.2). As the executors of biological functions,
proteins are robust and have a large number of features that can be used
for ML. In bioinformatics, algorithms such as deep learning, random
forests, and support vector machines are widely used in protein
structure prediction [51], protein functional site prediction [52,53],
and subcellular location prediction [54,55]. There are many ML
algorithms, and none of them is best for all tasks. As an example, the
workflow of the classic tool DynaMut2 [56] is shown in Figure 3.
The development history and representative methods of protein prediction
Several methods have been developed to predict how missense mutations
affect protein stability using sequence or structural information, which
are often complementary. The development history of the relevant
methods is shown in Figure 4. According to their characteristics, we
divide the existing methods into five categories and introduce the
features and representative methods of each.
The structure-based method only (red part in the figure)
mCSM, developed by Pires et al. in 2014, is a protein structure-based
method that relies on graph-based signatures to study missense
mutations [47]. It encodes interatomic distances to represent the
environment of a protein residue and uses them to train the prediction
model. Subsequent studies have also shown that the effect of a mutation
is related to the atomic distances around the amino acid residue. An
mCSM web server has been established and provides many extended tools
and methods.
SDM is a program based on a statistical potential energy function
proposed by Topham et al. in 1997 [57]; Worth et al. established the SDM
web server in 2011 [58]. As a structure-based method, SDM uses the amino
acid substitution frequencies of homologous protein families in
different structural environments to calculate a stability score
difference between the wild-type and mutant proteins. The change in
protein stability is an important component in estimating the effect of
genetic variation on protein structure.
The sequence-based method only (green part in the figure)
MuStab is a web server developed by Teng et al. in 2010 [59]. It is a
machine learning method for predicting protein stability changes from
sequence features, and it uses experimental data on the free energy
change of protein stability upon mutation. After analyzing 20 sequence
features, the authors found that a classifier combining six of them was
the most accurate: stability (S3), bulkiness (Bu), transmembrane
tendency (Tt), beta-sheet (B), average area buried on transfer from
standard state to folded protein (Aa), and the mobility of an amino acid
on chromatography paper (Mc) (Supplementary File S2.2).
Methods based on structure and sequence (yellow part in the figure)
I-Mutant3.0 was proposed by Capriotti et al. in 2008 [60]. Using a
support vector machine (SVM) based on sequence or structure, it
classifies the experimental free energy change of protein stability
(ΔΔG) into stabilizing, destabilizing, and neutral mutations. It also
improves the prediction of the free energy change caused by single-point
protein mutations.
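The three-class labeling of ΔΔG values can be written as a simple
thresholding rule. The sketch below assumes the sign convention ΔΔG =
ΔG(mutant) - ΔG(wild type), with negative values meaning
destabilization, and an illustrative symmetric cutoff of 0.5 kcal/mol;
both are assumptions, since sign conventions and cutoffs vary between
tools and datasets.

```python
def classify_ddg(ddg, threshold=0.5):
    """Assign a stability class to a ΔΔG value given in kcal/mol.

    Assumptions: ΔΔG = ΔG(mutant) - ΔG(wild type), negative values mean
    destabilization, and the symmetric threshold is illustrative only.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    if ddg < -threshold:
        return "destabilizing"
    if ddg > threshold:
        return "stabilizing"
    return "neutral"
```

Labels produced this way are the kind of targets on which a three-class
classifier such as an SVM can be trained.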
Other methods (purple part in the figure)
iStable is an integrated predictor that uses an SVM to predict the
change in protein stability when a single amino acid residue is mutated.
It takes sequence information and the prediction results of different
element predictors as input, uses the SVM as an integrator, and is built
on a grid computing architecture [61]. iStable2.0 systematically
improves the prediction performance of iStable [62].
The new method (blue part in the figure)
Many new methods have emerged in recent years. Among them, AlphaFold2
[63], based on deep learning, and RoseTTAFold [64], which uses a
three-track neural network, are two representative methods, both of
which have achieved remarkable accuracy in predicting protein structure.
The latest version of AlphaFold is based on a new ML method that
integrates physical and biological knowledge of protein structure into
the design of the deep learning algorithm by using multiple sequence
alignments [63]. AlphaFold is a fully redesigned neural network-based
model whose prediction accuracy is comparable to that of experimental
structures and which significantly outperforms other methods in most
cases.
RoseTTAFold was developed by Baek et al. [64]. Drawing on related
network architecture ideas, RoseTTAFold achieved its best performance
with a three-track network that successively transforms and integrates
information at the one-dimensional sequence level, the two-dimensional
distance level, and the three-dimensional coordinate level. The
structure prediction accuracy of this three-track network is close to
that of DeepMind in the 14th round of the Critical Assessment of
Structure Prediction (CASP14).
Common identification tools and software
Here, we list some tools that can predict the effect of mutations on
macromolecular structure (Supplementary Table T1). These tools combine
mutation-related information with macromolecular structure prediction
methods. A tool usually takes the molecular sequence and variation
information as input, and outputs the predicted wild-type structure, the
structure after variation, and the change in macromolecular
thermodynamic indices. We have developed a website, GEMS
(http://www.onethird-lab.com/gems/), as a brief introduction and index
to these tools.
Tools to identify structural effects on RNA
RNAsnp (https://rth.dk/resources/rnasnp) is a web server that uses
different modes to predict the effect of SNPs on RNA sequences of
different lengths. For sequences shorter than 1000 nt, the global
folding method RNAfold [28] is used to calculate the base pair
probabilities of the wild-type and mutant sequences, whereas the local
folding method RNAplfold [65] is used for longer RNA sequences. Both
methods are part of the ViennaRNA package [13]; for more information see
the annex. SNP effects are quantified using extensive pre-computed
tables of the distributions of substitution effects as a function of
gene length and GC content. The input of RNAsnp can be a single RNA
sequence in FASTA format with one or more mutations whose structural
effect is to be predicted. It not only provides structure prediction
results but also features graphical output [66,67] (Supplementary File
S5.1).
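The idea of scoring an observed effect against a pre-computed
distribution can be sketched as an empirical p-value. The background
values and the function name below are hypothetical; RNAsnp's actual
tables, distance measures, and binning by length and GC content are not
reproduced here.

```python
def empirical_p_value(observed, background):
    """Fraction of background effects at least as large as the observed one.

    background -- hypothetical list of structural effect sizes measured for
                  random substitutions in comparable sequences
    A pseudocount of 1 keeps the estimate above zero for finite tables.
    """
    if not background:
        raise ValueError("background distribution must not be empty")
    at_least = sum(1 for value in background if value >= observed)
    return (at_least + 1) / (len(background) + 1)
```

A small p-value indicates that the observed structural change is larger
than most changes caused by comparable substitutions in the background.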
MutaRNA (http://rna.informatik.uni-freiburg.de/MutaRNA) is a web server
for studying SNV-induced RNA structure changes, and the first tool to
provide different dot plots for the comparative analysis of base pairing
potentials. MutaRNA uses the local folding method RNAplfold [65], which
is part of the ViennaRNA package [13], to retrieve candidate RNAs.
MutaRNA also integrates the empirical p-values from RNAsnp [67] and the
relative entropy between wild type and mutant from remuRNA [68] to
quantify the structural aberration caused by an SNV.
The input of MutaRNA is the wild-type (WT) RNA sequence in FASTA format
and the location of the mutation. It provides a variety of
visualizations, including heat map matrices, circular plots, and arc
plots, which users can conveniently use directly in scientific reports
[69] (Supplementary File S5.2).
LncCASE (http://bio-bigdata.hrbmu.edu.cn/LncCASE) is a web database
built around the prediction of lncRNAs in cancer. It brings together
multidimensional molecular profiles of tumor samples, biomolecular
interaction networks, and pathway data resources by integrating genomic
and transcriptomic data from human cancers. LncCASE uses a computational
method to identify the sub-pathways driven by lncRNAs under the
influence of CNV. The copy number level of each lncRNA was re-annotated,
and lncRNA CNV profiles were constructed and visualized. The tool
further analyzes the biological effects of lncRNAs affected by genetic
variation in cancer, which facilitates the study of cancer biology [70].
PON-mt-tRNA (http://structure.bmc.lu.se/PON-mt-tRNA) is a prediction
tool for pathogenic variants in tRNAs. Since all pathogenic tRNA
variants are located in mitochondrial tRNAs, the investigators collected
mt-tRNA variants and developed a multivariate probabilistic prediction
method based on a random forest machine learning algorithm. The method
requires as inputs the reference position in the mtDNA, the reference
(original) nucleotide, and the altered nucleotide for each variation. In
addition, users have the option to submit evidence from segregation,
biochemical, and histochemical characterization. The investigators used
PON-mt-tRNA to classify all possible single nucleotide substitutions in
all human mt-tRNA genes and documented the predictions [71].
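Enumerating all possible single nucleotide substitutions of a gene, in
the (position, reference nucleotide, altered nucleotide) form described
above, is straightforward. The sketch below is a hypothetical helper,
not part of PON-mt-tRNA, and the start coordinate in the usage note is
an arbitrary example.

```python
NUCLEOTIDES = "ACGT"


def all_substitutions(reference, start=1):
    """Yield (position, ref_base, alt_base) for every single-nucleotide change.

    reference -- DNA sequence of the region of interest
    start     -- coordinate of the first base (hypothetical example value)
    """
    for offset, ref in enumerate(reference.upper()):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield start + offset, ref, alt
```

For a region of length L this yields 3 x L variants; for instance,
`list(all_substitutions("AC", start=100))` gives the six substitutions
at positions 100 and 101.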
UFold (https://ufold.ics.uci.edu) is a web server developed to
facilitate the use of the UFold method. Users can enter or upload RNA
sequences in FASTA format. The server predicts the RNA secondary
structure using pre-trained UFold models (trained on all datasets) and
stores the predicted structure in a dot-bracket file or bpseq file for
download. In the options panel, the user can also choose whether or not
to predict non-canonical pairs. The server further provides an interface
to the VARNA tool [72] to visualize the predicted structures. Most
existing RNA prediction servers, such as RNAfold [13], MXfold2 [73], and
SPOT-RNA [74], can only predict one RNA sequence at a time and limit the
length of the input sequence, but UFold does not have these limitations.
Unfortunately, UFold cannot currently incorporate SNP data to predict
the effect of structure-disrupting mutations. More tools for predicting
the structure-disrupting effects of mutations on RNA are listed in
Supplementary File S3.
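The two output formats mentioned for the UFold server carry the same
information for pseudoknot-free structures. As a minimal sketch, the
hypothetical helper below converts a dot-bracket string into bpseq
lines, where each line lists the 1-based position, the base, and the
position of its pairing partner (0 if unpaired). It assumes only round
brackets and dots, so pseudoknot annotations are not handled.

```python
def dot_bracket_to_bpseq(seq, structure):
    """Convert a pseudoknot-free dot-bracket string to bpseq lines.

    Each line holds the 1-based position, the base, and the 1-based
    position of its partner (0 if the base is unpaired).
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    partner = [0] * len(seq)
    stack = []  # indices of opening brackets waiting for a partner
    for idx, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(idx)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            opener = stack.pop()
            partner[opener] = idx + 1
            partner[idx] = opener + 1
        elif symbol != ".":
            raise ValueError(f"unsupported symbol: {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return [f"{i + 1} {base} {partner[i]}" for i, base in enumerate(seq)]
```

Joining the returned lines with newline characters gives the body of a
bpseq file for the same structure.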
Tools to identify structural effects on proteins
INPS (http://inps.biocomp.unibo.it) is a new approach that starts from
protein sequence information and does not rely on structure to annotate
the effect of non-synonymous mutations on protein stability. INPS is
based on support vector regression (SVR) and is trained to predict the
change in thermodynamic free energy caused by a single point change in a
protein sequence. It has the advantage of being applicable to
nonsynonymous polymorphisms when the protein structure is unavailable.
The INPS predictor consists of one SVR trained on single point
variations in different proteins, and it is complementary [75] to
structure-based methods such as mCSM [47].
DynaMut2 (http://biosig.unimelb.edu.au/dynamut2) is a tool that
integrates information on protein dynamics with structural environment
attributes of wild-type residues. Using graph-based signatures, it
accurately predicts the effects of single and multiple point mutations
on protein stability and dynamics. DynaMut2 can predict the Gibbs free
energy change for single point mutations, or for multiple point
mutations involving no more than three positions, based on Normal Mode
Analysis (NMA) and protein dynamics analysis. Its input may be a single
mutation or a mutation list, and it was reported to outperform other
methods in predicting stability changes caused by single-point mutations
[56].
mCSM-PPI2 (http://biosig.unimelb.edu.au/mcsm-ppi2/) is a machine
learning tool that accurately predicts the effect of missense mutations
on the binding affinity of protein-protein interactions. It leverages
graph-based structural signatures to model the inter-residue interaction
network, together with evolutionary information, complex network
metrics, and the effects on energy terms, to generate an optimized
predictor. mCSM-PPI2 can be used to assess the impact of user-specified
mutations or to predict the impact of protein interface mutations in an
automated manner [76].
PhyreRisk (http://phyrerisk.bc.ic.ac.uk) is a web application that
connects genomic, proteomic, and structural data to facilitate the
mapping of human variants onto protein structures. It provides
information on 20,214 human canonical protein sequences and 22,271
alternative protein sequences (isoforms). As inputs, it supports new
variant data in genomic coordinate formats (VCF, reference SNP IDs, and
HGVS notation) for the human genome builds GRCh37 and GRCh38. In
addition, it supports the use of amino acid coordinates to map
variations and to search for genes or proteins of interest. PhyreRisk
aims to enable researchers to translate genetic data into protein
structural information that provides a more comprehensive assessment of
the functional impact of variants [77].
AlphaFold (https://alphafold.ebi.ac.uk) is an open-access, extensive
database that provides highly accurate protein structure predictions.
AlphaFold V2.0 gives an unprecedented expansion of the structural
coverage of known protein sequence space. The latest version of
AlphaFold is based on a novel ML approach that combines physical and
biological knowledge about protein structure; using multiple sequence
alignments, this knowledge is incorporated into the design of the deep
learning algorithm. In addition, AlphaFold2 uses physical and geometric
inductive biases to build components that learn from PDB data. This
enables the network to learn more efficiently from the limited data in
the PDB and to deal with the complexity and diversity of structural
data. AlphaFold and its computational methods have become important
tools for solving biophysical problems in modern biology [63].
Unfortunately, no AlphaFold-based tool has yet been developed to predict
the structural effects caused by mutation [78]. More tools for
predicting the structure-disrupting effects of mutations on proteins are
listed in Supplementary File S4.