Limitation and future work. As shown in Fig. 5A, the fragment-based method performs considerably worse than the atom-based method on specific property tasks. We attribute this to the fixed fragment decomposition mechanism [36\cite{bib27}] mentioned above, which can break up the key fragments for these molecular properties and thereby degrade performance. In addition, dihedral angles specify the molecular conformation, which significantly affects the chemical properties of a molecule [37\cite{bib28}]. For example, structurally similar molecules can be positive through completely different mechanisms because of minor differences in their conformations [37\cite{bib28}]. However, angles are not considered in this work, which directly affects model performance. Consequently, only 90% of the obtained attribution fragments were verified to show positive relevance to the property tasks in the above experiments. The remaining 10% error likely stems from this incomplete analysis of molecular structural information. Moreover, some molecules contain structural alerts yet never produce toxic effects, while other compounds are rejected as drugs because of manifested toxic effects even though they contain no structural alert [38\cite{bib29}]. Fortunately, such cases are rare. Therefore, although the attribution process makes some misjudgments, its results remain reliable overall. Future improvements will focus on representing molecules with their three-dimensional conformations.
Conclusion
We propose an explainable fragment-based molecular property attribution method for analyzing the relevance between biochemical properties and molecular fragments. Statistical results and mechanism verification demonstrate the reliability of the discovered relevance between molecular properties and fragments. Experiments on forty-two biochemical property tasks show that about 90% of the attribution fragments strongly relate to the corresponding property task, and randomly selected attribution results from six classical side-effect property tasks agree well with known biochemical mechanisms. The discovered relationships between molecular properties and fragments can be applied to various tasks, such as exploring the relations among different molecular properties and synthesizing molecules with a targeted property from specific fragments. Based on the attribution fragment sequences for the different property tasks, we build a property relation map of all forty-two properties. Transfer learning experiments verify that the property relation map supports rapid and accurate transfer learning. In summary, as a computer-assisted molecular discovery method, our fragment-based attribution method can provide pharmacologists with sufficiently precise guidance, accelerate the analysis of drug molecule properties, and improve the efficiency of clinical trials. In future work, we will focus on using more information to represent the characteristics of molecules, such as adding dihedral angles from the three-dimensional conformation, and on more natural molecular tree decomposition methods to localize relevant fragments more precisely.
Methods
Experiment dataset setting. The training and validation data are obtained from four physiology datasets (BBBP, Tox21, Sider, ClinTox) provided by Mufei Li et al. [39\cite{bib22}]. These datasets include experimental bioactivity data for 42 different property tasks, such as eye disorders and hepatobiliary disorders. The BBBP dataset contains binary labels of blood-brain barrier penetration (permeability) for 2,039 compounds. The Tox21 dataset contains qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways, for 7,831 compounds. The Sider dataset is a database of marketed drugs and adverse drug reactions (ADR), with 27 system organ classes and 1,427 compounds. The ClinTox dataset contains qualitative data for 1,478 drugs, covering drugs approved by the FDA and drugs that failed clinical trials for toxicity reasons. A property task may not use all the data in its corresponding dataset.
Before training, each molecule was decomposed into a fragment tree with the Junction Tree method [36\cite{bib27}]; the resulting fragments are mainly ring fragments and diatomic fragments. For each atom and bond, the CanonicalFeaturizer interface of DGL-LifeSci [39\cite{bib22}] was used to generate features. The feature of a fragment composed of several atoms and bonds was represented by the weighted concatenation of the features of these atoms and bonds. After fragment decomposition for every task, all the molecules were split into training, validation, and test subsets following an 8:1:1 ratio. The recommended splitting strategy depends on the contents of each dataset: in our setting, the BBBP dataset used scaffold splitting, and the other three datasets used random splitting.
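As an illustration of the 8:1:1 random splitting used for Tox21, Sider, and ClinTox, a minimal sketch is given below. The function name `random_split`, the fixed seed, and the placeholder molecule identifiers are assumptions made for this example and are not taken from the original implementation; scaffold splitting, as used for BBBP, is not covered here.

```python
import random


def random_split(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/validation/test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test subset
    return train, valid, test


# Placeholder identifiers standing in for decomposed molecules.
molecules = [f"mol_{i}" for i in range(100)]
train, valid, test = random_split(molecules)
print(len(train), len(valid), len(test))  # 80 10 10
```

Assigning the remainder to the test subset ensures that every molecule lands in exactly one subset even when the dataset size is not divisible by ten.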
Processing of fragment-based attribution method and validation of attribution fragments. Each property task was processed with the pipeline of the proposed fragment-based attribution method as follows:
Train Stage. For each property task, the fragment-based GCN model was trained with the same setting, and the prediction models were then saved.
Sample Selection Stage. To ensure accurate attribution, the positive molecules used for attribution were screened. The molecules were sorted by prediction loss, where a low prediction loss means high confidence in the positive label. For low-confidence molecules, the prediction is generally unreliable, so their attribution results can hardly represent the property task; we therefore kept only high-confidence molecules. Finally, the top-200 high-confidence samples were chosen for the next attribution stage. The number 200 was chosen mainly as a trade-off given the number of positive samples available for these tasks.
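The screening step described above can be sketched as follows. The text does not name the loss function, so binary cross-entropy is assumed here; the function names and the `(molecule_id, predicted_probability, label)` tuple layout are likewise hypothetical choices for this example.

```python
import math


def binary_cross_entropy(prob, label):
    """Per-sample binary cross-entropy (assumed loss; not stated in the text)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def select_high_confidence(samples, k=200):
    """Return the ids of the k positive molecules with the lowest prediction loss.

    samples: iterable of (molecule_id, predicted_probability, label) tuples.
    """
    positives = [
        (mol_id, binary_cross_entropy(prob, 1))
        for mol_id, prob, label in samples
        if label == 1  # only positive molecules are used for attribution
    ]
    positives.sort(key=lambda pair: pair[1])  # low loss = high confidence
    return [mol_id for mol_id, _ in positives[:k]]


# Toy predictions: "c" is a negative molecule and is ignored.
samples = [("a", 0.95, 1), ("b", 0.40, 1), ("c", 0.99, 0), ("d", 0.80, 1)]
selected = select_high_confidence(samples, k=2)
print(selected)  # ['a', 'd']
```

When a task has fewer than `k` positive molecules, the slice simply returns all of them, which matches the remark that the choice of 200 depends on the number of positive samples available.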
Attribution Stage. In this stage, we obtained the attribution fragments of the above high-confidence samples. Gradients were backpropagated from the prediction output layer of each molecule to the input fragment feature layer. After obtaining the gradient response of each fragment, we sorted the fragments by response value and discarded those with small values.
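To make the gradient-based ranking concrete, the sketch below replaces the fragment-based GCN with a toy differentiable model, p = sigmoid(sum over fragments of tanh(w · x_f)), whose input gradient can be written in closed form. The toy model, the use of the L2 norm to reduce each fragment's gradient to a single response value, the `keep` cutoff, and all names are assumptions for illustration only; they are not the model or the aggregation rule used in this work.

```python
import math


def fragment_gradient_responses(fragment_features, weights):
    """Gradient response of each fragment under a toy model (assumption).

    Model: p = sigmoid(sum_f tanh(w . x_f)).
    Gradient: dp/dx_f = p * (1 - p) * (1 - tanh(w . x_f)^2) * w.
    The response of fragment f is the L2 norm of dp/dx_f.
    """
    pre = [sum(w * x for w, x in zip(weights, feats)) for feats in fragment_features]
    hidden = [math.tanh(z) for z in pre]
    p = 1.0 / (1.0 + math.exp(-sum(hidden)))
    w_norm = math.sqrt(sum(w * w for w in weights))
    responses = [p * (1.0 - p) * (1.0 - h * h) * w_norm for h in hidden]
    return p, responses


def top_fragments(names, responses, keep=2):
    """Sort fragments by response (descending) and discard the small ones."""
    ranked = sorted(zip(names, responses), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:keep]]


# Hypothetical fragments of one molecule with two-dimensional features.
names = ["ring_A", "C-N", "C=O"]
features = [[0.1, 0.0], [2.0, 1.0], [0.5, -0.5]]
weights = [1.0, 0.5]

p, responses = fragment_gradient_responses(features, weights)
kept = top_fragments(names, responses, keep=2)
print(kept)  # ['ring_A', 'C=O']
```

With a trained network, the same ranking step would be applied to gradients obtained by automatic differentiation rather than a hand-derived formula.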
Fragment-validation Stage. The reliability of each obtained fragment was validated on a new dataset \(\mathbb{M}\), which consists of a positive subset \(\mathbb{M}_{pos}\) and a negative subset \(\mathbb{M}_{neg}\). For each molecule in \(\mathbb{M}_{pos}\) and \(\mathbb{M}_{neg}\), we determined whether attribution fragment \(g\) occurs in the molecule. To eliminate the effect of the imbalance in the number of test molecules, we used the difference between the probabilities of occurrence in the positive and negative subsets as the final validation metric. Fig. 3B shows one case of the calculation process, and Eq. (1) for the difference \(Diff_g\) of attribution fragment \(g\) is given as follows: