2023 - A Coruña - Spain

PAGE 2023: Methodology – AI/Machine Learning
Sungwoo Goo

Graph Diffusion Model to Understand Quantitative Systems Pharmacology

Sungwoo Goo1, Jung-woo Chae1,2*, Sangkeun Jung1,3*, Hwi-yeol Yun1,2*

1Department of Bio-AI Convergence, Chungnam National University, Daejeon, Republic of Korea
2College of Pharmacy, Chungnam National University, Daejeon, Republic of Korea
3Department of Computer Convergence, Chungnam National University, Daejeon, Republic of Korea

Objectives

Understanding quantitative systems pharmacology is essential for selecting drug target proteins. Understanding the cell signaling pathways triggered by those target proteins is in turn necessary for selecting biomarkers in pharmacodynamic (PD) modeling, accounting for drug metabolism in pharmacokinetic (PK) modeling, and predicting side effects or drug-drug interactions (DDI). Many protein-protein interactions (PPI) can be represented as graphs in the discrete-mathematics sense: proteins are nodes, and each interaction is an edge connecting two proteins. In deep learning, diffusion models (DM) are used for image generation [1]. An image is a special form of graph, and a graph can be viewed as a generalization of an image. Molecule generation models based on graph diffusion are also currently being developed [2]. We built a graph diffusion model (GDM) that learns the protein-protein reaction graph in order to predict whether the drug in a drug-disease pair can cure the disease.
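
The graph view described above can be sketched in a few lines of Python. This is only an illustration: the protein names are hypothetical placeholders, and reading an image as a 4-neighbour pixel grid is one common convention, not something the abstract specifies.

```python
from collections import defaultdict


def build_ppi_graph(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def image_as_grid_graph(height, width):
    """Return the edges of a 4-neighbour grid graph over pixel coordinates."""
    edges = []
    for r in range(height):
        for c in range(width):
            if c + 1 < width:       # edge to the pixel on the right
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < height:      # edge to the pixel below
                edges.append(((r, c), (r + 1, c)))
    return edges


# Hypothetical interactions, for illustration only.
ppi = build_ppi_graph([("P1", "P2"), ("P2", "P3")])
grid_edges = image_as_grid_graph(2, 2)
```

In this reading, an image is a graph whose connectivity is fixed by the pixel layout, whereas a PPI graph has arbitrary connectivity.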

Methods

The dataset is ogbl-biokg from the Open Graph Benchmark [3]. It contains five entity types (disease, protein, drug, side effect, and protein function) and 51 relation (edge) types. Four of the 51 edge types were used to build our dataset: protein-protein interaction, disease-drug, disease-protein, and drug-protein.
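
The edge-type selection could look roughly like the sketch below. The triple layout and the relation labels are hypothetical stand-ins chosen for readability; the identifiers actually used in ogbl-biokg may differ, and this is not the Open Graph Benchmark API.

```python
from collections import Counter

# Hypothetical relation labels; the real ogbl-biokg identifiers may differ.
KEPT_RELATIONS = {
    "protein-protein",
    "disease-drug",
    "disease-protein",
    "drug-protein",
}


def filter_triples(triples, kept=KEPT_RELATIONS):
    """Keep only (head, relation, tail) triples whose relation is in `kept`."""
    return [t for t in triples if t[1] in kept]


# Toy triples: each entity is an (entity_type, index) pair.
toy_triples = [
    (("protein", 0), "protein-protein", ("protein", 1)),
    (("drug", 3), "drug-side_effect", ("side_effect", 7)),
    (("drug", 3), "drug-protein", ("protein", 1)),
    (("disease", 2), "disease-drug", ("drug", 3)),
]

subset = filter_triples(toy_triples)
relation_counts = Counter(rel for _, rel, _ in subset)
```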

Proteins were mapped to unique gene IDs from the National Center for Biotechnology Information (NCBI) [4]. The protein sequences were retrieved, processed with Meta's esm2_t30_150M_UR50D model [5], and embedded into 640-dimensional vectors. Drugs were mapped to PubChem compound IDs (CIDs) [6], collected as SMILES strings, processed with DeepChem's ChemBERTa-10M-MTR model [7], and embedded into 384-dimensional vectors. Diseases were not embedded with a pretrained model; instead, they were included as trainable parameters of the GDM. The graph convolutional layers of the GDM used GATv2 [8]. The GDM learns by restoring the ground-truth graph from a noise graph sampled from a standard normal distribution. Organisms maintain homeostasis, so they tend to repair the protein-protein reaction graph toward its most stable state; we wanted to simulate this behavior with the GDM.
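
The abstract does not state how noise is mixed into the graph during training. A minimal sketch, assuming the standard DDPM-style forward process with a linear beta schedule (an assumption, not the authors' confirmed setup), is shown below. The function names, the schedule constants, and the toy feature vector are all illustrative.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, alpha_bar_t, rng):
    """Return (x_t, eps) with x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    xt = [scale_signal * x + scale_noise * e for x, e in zip(x0, eps)]
    return xt, eps


rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)
x0 = [1.0, 0.0, 1.0, 0.0]  # toy flattened edge features of a clean graph
xt, eps = add_noise(x0, alpha_bar[-1], rng)  # at the last step x_t is almost pure noise
```

Under this assumption, a denoising network is trained to predict `eps` from `xt`, and restoring the graph means reversing these steps starting from a standard normal sample.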

The GDM learns to repair the noise graph into the original PPI graph. Its inputs are the disease embedding vector and the drug embedding vector of a known disease-drug pair, the index of the protein node affected by the disease, and the index of the protein node affected by the drug. The smaller the difference between the original graph and the repaired graph, the better the model performs. ogbl-biokg provides separate training and validation sets, so we trained the model on the training set and validated it on the validation set.
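
The abstract does not name the measure used for the difference between the original and repaired graphs. As a sketch only, assuming a mean squared error over flattened edge values (a common choice for diffusion models, not a confirmed detail), the comparison could be written as:

```python
def mean_squared_error(original, repaired):
    """Mean squared difference between two equally long lists of edge values."""
    if len(original) != len(repaired):
        raise ValueError("graphs must have the same number of edge values")
    return sum((o - r) ** 2 for o, r in zip(original, repaired)) / len(original)


# Toy edge values, for illustration only.
original = [1.0, 0.0, 1.0, 1.0]
repaired_good = [0.9, 0.1, 0.8, 1.0]
repaired_poor = [0.2, 0.9, 0.1, 0.4]

mse_good = mean_squared_error(original, repaired_good)
mse_poor = mean_squared_error(original, repaired_poor)
```

A lower value means the repaired graph is closer to the original, which corresponds to the better-performing model in the sense used above.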

Results

Given the protein vectors of an interacting protein-protein pair, the GDM returns the noise in the corresponding graph. This noise is assumed to follow a multivariate standard normal distribution with independent components, so the probability of the noise occurring can be calculated. Consequently, when a new protein is entered together with a protein predicted to interact with it, the model outputs the predicted noise. Candidate partners for the new protein can then be entered, and their probabilities calculated and ranked, to find new cell signaling pathways. To validate the model, we used disease-drug pairs that were not used in training. True disease-drug pairs and false pairs generated by modifying them were evaluated as a classification task. The protein graph diffusion model achieved an accuracy of about 62%.
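
The ranking step can be illustrated with the log-density of an independent standard normal distribution, log p(e) = -0.5 * (d * log(2π) + Σ e_i²). The sketch below assumes that a higher density is read as a more plausible interaction; the candidate names and noise values are hypothetical and do not come from the model.

```python
import math


def standard_normal_log_prob(noise):
    """Log-density of a vector under an independent standard normal distribution."""
    d = len(noise)
    return -0.5 * (d * math.log(2.0 * math.pi) + sum(v * v for v in noise))


def rank_candidates(predicted_noise):
    """Sort candidates by descending log-density of their predicted noise."""
    scored = [
        (name, standard_normal_log_prob(noise))
        for name, noise in predicted_noise.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical predicted noise for three candidate partner proteins.
predicted = {
    "candidate_A": [0.1, -0.2, 0.05],
    "candidate_B": [1.5, -2.0, 0.7],
    "candidate_C": [0.4, 0.3, -0.6],
}
ranking = rank_candidates(predicted)
```

Working in log space avoids numerical underflow when the noise vector has many dimensions.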

Conclusions

We built a deep learning model for understanding protein-protein reactions using a recent graph diffusion model. Large language models (LLMs), which are currently in the spotlight, have the disadvantage of not being explainable, so their output must be verified by experts in each field. In comparison, a graph model can be understood by experts in various fields and is therefore more suitable as a decision-making tool in the pharmaceutical industry, where multiple experts collaborate. We expect that the graph model can be extended to the automation of PKPD modeling. In addition, our GDM can be extended to the entire ogbl-biokg dataset for use in more fields.

References:
[1] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695).
[2] Hoogeboom, E., Satorras, V. G., Vignac, C., & Welling, M. (2022, June). Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning (pp. 8867-8887). PMLR.
[3] Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., ... & Leskovec, J. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33, 22118-22133.
[4] Pruitt, K. D., Tatusova, T., & Maglott, D. R. (2005). NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 33(suppl_1), D501-D504.
[5] Lin, Z., et al. (2022). Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv, 2022-07.
[6] Kim, S., Thiessen, P. A., Bolton, E. E., Chen, J., Fu, G., Gindulyte, A., ... & Bryant, S. H. (2016). PubChem substance and compound databases. Nucleic Acids Research, 44(D1), D1202-D1213.
[7] Ramsundar, B. (2018). Molecular machine learning with DeepChem (Doctoral dissertation, Stanford University).
[8] Brody, S., Alon, U., & Yahav, E. (2021). How attentive are graph attention networks? arXiv preprint arXiv:2105.14491.

Reference: PAGE 31 (2023) Abstr 10356 [www.page-meeting.org/?abstract=10356]
Poster: Methodology – AI/Machine Learning