Figure 8. Scatter plot of evaluation units in CASP14 (A,
left) and CASP15 (B, right) represented by sequence (HHscore,
Y-axis) and structure (LGA_S, X-axis) scores of the top template.
Evaluation units in the left panel are marked according to the
difficulty categories as manually assigned in CASP14: full squares –
TBM-easy; hollow squares – TBM-hard; hollow triangles – TBM/FM; full
triangles – FM. Targets of the same difficulty cluster together in the
suggested (X,Y) axes. An automatic delineation of EUs into four classes
(X+Y<70, red; 70-100, yellow; 100-130, green; >130, blue) based on the
results of sequence- and
structure-based searches of the PDB is suggested to mimic the CASP14
difficulty categories. The schema is applied to define target prediction
classes in CASP15 (right panel).
Two scores, HHscore and LGA_S, for sequence- and structure-based
relationships of the target with PDB entries, were defined in Methods.
They are plotted against each other for all EUs in CASP14 and CASP15
(Figure 8). The classification of the CASP14 data resulting from the
previous procedures (ref. 10) – based partly on predictor performance and
involving manual intervention – is indicated by symbols in panel A. This reveals that
TBM-easy and FM EUs cluster in these coordinates in the upper right and
lower left corners respectively, while TBM-hard and TBM/FM EUs
predominantly occupy areas immediately above and below the diagonal,
respectively. It can also be seen that all but two of the triangle
markers (FM and TBM/FM targets) lie below the diagonal, and all but one
of the squares (TBM-easy and TBM-hard) lie above it. Thus, if we take the
diagonal line (HHscore+LGA_S=100) as the boundary between the wider TBM
(TBM-easy and TBM-hard together) and FM (FM and TBM/FM together)
categories, there are only three targets for which the prior CASP14 and
current automated classification schemes disagree.
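The agreement check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the function names, the handling of points lying exactly on the diagonal, and the three example EUs are assumptions made for the sketch.

```python
# Broad classes implied by the diagonal HHscore + LGA_S = 100.
# Assumption: a point exactly on the diagonal is counted as TBM.
DIAGONAL = 100.0

# Manual CASP14 categories collapsed into the two broad groups.
MANUAL_TO_BROAD = {
    "TBM-easy": "TBM",
    "TBM-hard": "TBM",
    "TBM/FM": "FM",
    "FM": "FM",
}


def broad_class(hhscore: float, lga_s: float) -> str:
    """Return 'TBM' on or above the diagonal, 'FM' below it."""
    return "TBM" if hhscore + lga_s >= DIAGONAL else "FM"


def disagreements(eus):
    """List the names of EUs whose manual and automatic broad classes differ.

    `eus` is an iterable of (name, hhscore, lga_s, manual_category) tuples.
    """
    return [
        name
        for name, hhscore, lga_s, manual in eus
        if MANUAL_TO_BROAD[manual] != broad_class(hhscore, lga_s)
    ]


# Made-up scores for illustration only (not CASP data).
example_eus = [
    ("EU-a", 85.0, 90.0, "TBM-easy"),  # sum 175, above the diagonal
    ("EU-b", 20.0, 35.0, "FM"),        # sum 55, below the diagonal
    ("EU-c", 60.0, 50.0, "TBM/FM"),    # sum 110, manual label is on the FM side
]
print(disagreements(example_eus))
```

In this toy input only the third EU is flagged, because its manual label falls on the FM side while its score sum lies above the diagonal.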
To further delineate TBM-easy from TBM-hard, and FM from TBM/FM, we drew
two lines parallel to the diagonal. These lines were placed symmetrically
so that the areas between them and the diagonal include the majority of
the TBM-hard (upper) and TBM/FM (lower) EUs without encroaching deeply
into the TBM-easy and FM territory. Based on the CASP14 data, the split
lines were drawn at HHscore+LGA_S=70 and 130. As a side note, we
experimented with several other splitting schemas (such as rectangular
or spherical divisions) and found the linear split to be the simplest
and the best fit to the CASP14 and CASP13 target classifications. When
the suggested schema is applied to the classification of CASP15 EUs
(Figure 8B), the points in the graph are well separated, with
particularly clear clustering in the FM and TBM-easy zones.
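The resulting four-class rule is simple enough to express directly. The sketch below assumes the thresholds quoted in the text (70, 100 and 130 on HHscore+LGA_S); the function name and the handling of sums that fall exactly on 100 are assumptions, since the text specifies only "<70" and ">130".

```python
def classify_eu(hhscore: float, lga_s: float) -> str:
    """Assign an evaluation unit to a prediction class from its score sum.

    Thresholds follow the text. Tie handling is an assumption: a sum of
    exactly 70 goes to TBM/FM, and sums of exactly 100 or 130 go to TBM-hard.
    """
    total = hhscore + lga_s
    if total < 70:
        return "FM"        # red zone, lower left
    if total < 100:
        return "TBM/FM"    # yellow zone, just below the diagonal
    if total <= 130:
        return "TBM-hard"  # green zone, just above the diagonal
    return "TBM-easy"      # blue zone, upper right


# Made-up score pairs for illustration only (not CASP data).
for hh, lga in [(20, 30), (40, 45), (60, 55), (90, 80)]:
    print(hh, lga, classify_eu(hh, lga))
```

Because the rule depends only on the sum of the two scores, the class boundaries are the parallel lines described above, and no manual intervention or predictor performance data is needed to apply it.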
Using this classification approach, the CASP15 EUs were automatically
assigned to four largely homology-based prediction classes (see Figure
1B and Table 1). Forty-seven EUs were assigned to the TBM-easy class, 15
to TBM-hard, 8 to TBM/FM, and 39 (~35%) to FM – a class
with the weakest or no evolutionary relation to available folds. These
data show that the CASP15 target set was one of the most difficult
(homology-wise) in the whole history of CASP. For comparison, the FM
class constituted only 24% of all targets in CASP14, and 27% in
CASP13. Conceivably, this rise may already illustrate the impact of AF2
on target selection in structural biology: experimentalists may be
switching attention to more structurally novel targets, with which AF2
still struggles.
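As a quick arithmetic check, the class shares can be recomputed from the four counts quoted above. The sketch assumes the denominator is simply the sum of those four counts (109 EUs); the variable names are illustrative.

```python
# EU counts per class as quoted in the text for CASP15.
counts = {"TBM-easy": 47, "TBM-hard": 15, "TBM/FM": 8, "FM": 39}

# Assumption: the percentages are taken over the sum of the four classes.
total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} EUs, {share:.1f}%")
```

Under this assumption the FM share comes out at 35.8%, in line with the roughly 35% quoted in the text and well above the 24% and 27% reported for CASP14 and CASP13.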
As discussed in more detail elsewhere2129, it is clear that FM
targets comprise the majority of those with which even the top
predictive methods struggled, although some FM targets were predicted
well. Thus, even though it is well known that AF2 (on which
most predictive methods were based) generalizes beyond its training set,
the absence of similar structural folds in the PDB still leads to a
greater risk of predictive failure. Factors further predisposing a
target to less accurate prediction appear to include shallow multiple
sequence alignments (MSAs): evolutionary covariance information
extracted from MSAs is known to be required for accurate modelling of
natural proteins by AF2 (refs. 30, 31), potentially in order to obtain a
sufficiently accurate initial structure estimate. However, especially
given the relatively small number of problematic targets in each CASP, a
deeper study of this subject is needed, and deep learning methods could
help with this task.