Exploring PrimeKG — A Knowledge Graph for Medicine & Healthcare
This article takes a quick look at the paper “Building a knowledge graph to enable precision medicine” (Chandak et al., 2023) [1] (a.k.a. PrimeKG), then loads the graph data into Neo4j Desktop, a graph database, and visualizes it.
Introduction
Everything is connected, so modeling the world with graphs is a natural choice. Examples include the MONDO Disease Ontology [2], a knowledge graph composed of millions of entities, from genes to diseases, and the Gene Ontology [4], which describes molecular functions, cellular components, and biological processes. There are also data banks, e.g. Disease Gene Network (DisGeNET) [3], which explores gene-disease associations, and Orphanet, an encyclopaedia of rare diseases. As these examples show, medical and healthcare knowledge is fragmented. PrimeKG [1] is an attempt to…
…integrates 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action, considerably expanding previous efforts in disease-rooted knowledge graphs.
Let’s get started! 😀
Preparing & Loading
Preparing
- Follow the link to the PrimeKG raw data [5]. Download “edges.csv” (368.6 MB) and “nodes.csv” (7.5 MB). Why? → There are other files, e.g. “kg.csv”, which contains the full graph (each line formatted as node1-relation-node2). But importing a graph from separate node and edge files is the quickest and most efficient way [8].
- Download and install an OpenJDK build. I often use Eclipse Temurin [7]. Choose version 17 (LTS) or 21 (LTS). After installation, check that the JAVA_HOME environment variable points to the correct JDK location. Why? → We need the JDK later to run “neo4j-admin.bat” in Neo4j.
- Use Notepad++ to change the header of “edges.csv” to
:TYPE,display_relation,:START_ID,:END_ID
and the header of “nodes.csv” to
node_index:ID,node_id,:LABEL,node_name,node_source
Why? → The Neo4j Import Tool [8] requires the CSV headers to follow certain guidelines, e.g. to mark which column is the ID.
- Use Notepad++ to search for “/” and replace it with “__” in “nodes.csv”. Why? → Neo4j node labels must follow certain naming restrictions (e.g. “gene/protein” becomes “gene__protein”).
- Download and install Anaconda, or at least Python with the Pandas library. Why? → Unfortunately, I found that “edges.csv” contains every edge twice: one CSV row says “Node A points to Node B with relation X”, and another row says “Node B points to Node A with relation X”. We intend to build an undirected graph in Neo4j, so we must remove the duplicate edges. I only know how to do that with Python/Pandas! 😅
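import pandas as pd

# Read every column as a string so that IDs are not converted to numbers.
df = pd.read_csv("edges.csv", sep=",", encoding="utf-8", dtype=str, keep_default_na=False)
# Build an order-independent key per row: A->B and B->A with the same relation give the same frozenset.
key = df[[":START_ID", ":END_ID", ":TYPE", "display_relation"]].agg(frozenset, axis=1)
# Keep the first row of each group, i.e. one direction per edge (no need to sort the groups).
df_result = df.groupby(key, sort=False).first()
df_result.to_csv("edges2.csv", sep=",", encoding="utf-8", index=False)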
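The header and “/” edits from the Notepad++ steps can also be scripted. The block below is a minimal sketch using only the Python standard library. The helper name `rewrite_csv` is hypothetical, the sample row is made up, and the original header shown in the sample is an assumption about the downloaded file. Like a global search-and-replace, it rewrites every “/” in the file, not only those in the label column.

```python
import io

# New headers, copied from the steps above.
EDGES_HEADER = ":TYPE,display_relation,:START_ID,:END_ID"
NODES_HEADER = "node_index:ID,node_id,:LABEL,node_name,node_source"


def rewrite_csv(src, dst, new_header, replace_slash=False):
    """Copy src to dst line by line, swapping in a new header line.

    src and dst are text file objects. Hypothetical helper, not part of Neo4j.
    """
    src.readline()                      # discard the original header
    dst.write(new_header + "\n")
    for line in src:
        # Mirror the Notepad++ search-and-replace of "/" with "__".
        dst.write(line.replace("/", "__") if replace_slash else line)


# Tiny in-memory sample standing in for "nodes.csv" (made-up row).
sample = io.StringIO(
    "node_index,node_id,node_type,node_name,node_source\n"
    "0,9796,gene/protein,PHYHIP,NCBI\n"
)
out = io.StringIO()
rewrite_csv(sample, out, NODES_HEADER, replace_slash=True)
print(out.getvalue())
```

For the real files, pass objects returned by `open(..., encoding="utf-8")` instead of the `io.StringIO` stand-ins, and run the edges rewrite before the Pandas de-duplication, since that code relies on the new column names.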
Loading
- Open Neo4j Desktop. Create a New Project. Add “Local DBMS”. Look for the triple dots: click on them, then “Open Folder”, then “DBMS”.
- Windows Explorer will open the folder where the DBMS is located. In my case, it is
C:\Users\[Username]\.Neo4jDesktop\relate-data\dbmss\dbms-886ac993-25d9-493a-922b-2d41e29b6446
- Open CMD right there and run the command below. Remember to change the [path] appropriately.
.\bin\neo4j-admin database import full primekg --nodes="D:\[path]\nodes.csv" --relationships="D:\[path]\edges2.csv" --trim-strings=true
- Within Neo4j Desktop, click the “Start” button of the DBMS you just created. Then click “Open” to open Neo4j Browser.
- Create the database “primekg”, then switch to it as the active database. (The first line is a Cypher command; the second is a Neo4j Browser command.)
CREATE DATABASE primekg
:use primekg
Within Neo4j Browser, type a command to verify if everything is good:
MATCH (n) RETURN COUNT(n)
The command should return 129375, the number of nodes in the graph. Then test further with the command:
MATCH ()-[r]-() RETURN COUNT(DISTINCT r)
It should return 4050249. On page 2 of the paper [1], we read:
Across 129,375 nodes and 4,050,249 relationships, PrimeKG captures information on ten major biological scales.
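The same two numbers can be cross-checked against the CSV files themselves, independently of Neo4j. The block below is a minimal sketch using the standard `csv` module. The helper name `count_data_rows` is hypothetical and the sample rows are made up.

```python
import csv
import io


def count_data_rows(handle):
    """Count CSV records in a text file object, excluding the header row."""
    reader = csv.reader(handle)
    next(reader, None)                  # skip the header
    return sum(1 for _ in reader)


# Made-up miniature stand-in for "nodes.csv".
sample_nodes = io.StringIO(
    "node_index:ID,node_id,:LABEL,node_name,node_source\n"
    "0,a,drug,first node,SRC\n"
    "1,b,disease,second node,SRC\n"
)
n_nodes = count_data_rows(sample_nodes)
print(n_nodes)
```

Run on the real files (opened with `open(..., encoding="utf-8", newline="")`), the count for “nodes.csv” should agree with the node count above, and the count for “edges2.csv” with the relationship count.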
Everything is set up!
Visualizing & Exploring
Visualizing
To get the Database Schema, we can use CALL db.schema.visualization() in Neo4j Browser. A visual graph of the schema is displayed.
Exploring
To get all the Node Types, use CALL db.labels(). Result:
╒════════════════════╕
│label │
╞════════════════════╡
│"anatomy" │
├────────────────────┤
│"gene__protein" │
├────────────────────┤
│"disease" │
├────────────────────┤
│"effect__phenotype" │
├────────────────────┤
│"drug" │
├────────────────────┤
│"biological_process"│
├────────────────────┤
│"molecular_function"│
├────────────────────┤
│"cellular_component"│
├────────────────────┤
│"exposure" │
├────────────────────┤
│"pathway" │
└────────────────────┘
On page 5 of the paper [1], we can see the list of Nodes, and they match!
To get all the Relationship Types, use CALL db.relationshipTypes(). Result:
╒════════════════════════════╕
│relationshipType │
╞════════════════════════════╡
│"protein_protein" │
├────────────────────────────┤
│"anatomy_protein_present" │
├────────────────────────────┤
│"molfunc_protein" │
├────────────────────────────┤
│"cellcomp_protein" │
├────────────────────────────┤
│"drug_effect" │
├────────────────────────────┤
│"bioprocess_bioprocess" │
├────────────────────────────┤
│"anatomy_anatomy" │
├────────────────────────────┤
│"bioprocess_protein" │
├────────────────────────────┤
│"exposure_disease" │
├────────────────────────────┤
│"exposure_protein" │
├────────────────────────────┤
│"exposure_exposure" │
├────────────────────────────┤
│"exposure_bioprocess" │
├────────────────────────────┤
│"pathway_protein" │
├────────────────────────────┤
│"pathway_pathway" │
├────────────────────────────┤
│"exposure_molfunc" │
├────────────────────────────┤
│"exposure_cellcomp" │
├────────────────────────────┤
│"molfunc_molfunc" │
├────────────────────────────┤
│"cellcomp_cellcomp" │
├────────────────────────────┤
│"anatomy_protein_absent" │
├────────────────────────────┤
│"drug_drug" │
├────────────────────────────┤
│"indication" │
├────────────────────────────┤
│"off-label use" │
├────────────────────────────┤
│"contraindication" │
├────────────────────────────┤
│"drug_protein" │
├────────────────────────────┤
│"disease_phenotype_positive"│
├────────────────────────────┤
│"disease_phenotype_negative"│
├────────────────────────────┤
│"phenotype_phenotype" │
├────────────────────────────┤
│"disease_disease" │
├────────────────────────────┤
│"disease_protein" │
├────────────────────────────┤
│"phenotype_protein" │
└────────────────────────────┘
On page 9 of the paper [1], we see the list of Relation Types, and they (likely) match! 😅
Although the graph is complex and full-fledged, I am interested in Associations “Disease-Exposure” and “Disease-Disease”. For example, I wonder what could be the environmental risk factors for Diabetes. The query:
MATCH (e:exposure)-[:exposure_disease]-(d:disease)
WHERE d.node_name CONTAINS 'diabetes'
RETURN d.node_name, e.node_name
ORDER BY d.node_name, e.node_name
Result:
╒═════════════════════════════╤══════════════════════════════════════╕
│d.node_name │e.node_name │
╞═════════════════════════════╪══════════════════════════════════════╡
│"diabetes mellitus (disease)"│"2,4,4',5-tetrachlorobiphenyl" │
...
├─────────────────────────────┼──────────────────────────────────────┤
│"diabetes mellitus (disease)"│"bisphenol A" │
├─────────────────────────────┼──────────────────────────────────────┤
│"diabetes mellitus (disease)"│"cyanazine" │
...
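For readers who prefer to stay in Python, the logic of the Cypher query above can be sketched directly on the two CSV files. This is only an illustration under assumptions: the helper name `exposures_for` is hypothetical, the miniature data is made up (only the column names follow the headers set earlier), and unlike the Cypher query it returns each pair once.

```python
import csv
import io


def exposures_for(nodes_file, edges_file, keyword):
    """Return sorted (disease, exposure) name pairs for diseases whose name contains keyword."""
    nodes = {row["node_index:ID"]: row for row in csv.DictReader(nodes_file)}
    pairs = set()                       # a set, so each pair is reported once
    for edge in csv.DictReader(edges_file):
        if edge[":TYPE"] != "exposure_disease":
            continue
        a, b = nodes[edge[":START_ID"]], nodes[edge[":END_ID"]]
        # The Cypher pattern ignores direction, so try both orientations.
        for disease, exposure in ((a, b), (b, a)):
            if (disease[":LABEL"] == "disease"
                    and exposure[":LABEL"] == "exposure"
                    and keyword in disease["node_name"]):
                pairs.add((disease["node_name"], exposure["node_name"]))
    return sorted(pairs)


# Made-up miniature files; only the column names follow the article.
toy_nodes = io.StringIO(
    "node_index:ID,node_id,:LABEL,node_name,node_source\n"
    "0,d0,disease,diabetes mellitus (disease),SRC\n"
    "1,e1,exposure,bisphenol A,SRC\n"
    "2,d2,disease,some other disease,SRC\n"
    "3,e3,exposure,some other exposure,SRC\n"
)
toy_edges = io.StringIO(
    ":TYPE,display_relation,:START_ID,:END_ID\n"
    "exposure_disease,rel,1,0\n"
    "exposure_disease,rel,2,3\n"
)
result = exposures_for(toy_nodes, toy_edges, "diabetes")
print(result)
```

The substring test is case-sensitive, like CONTAINS in Cypher. Loading all nodes into a dictionary is fine at this scale, but for anything beyond simple lookups the graph database remains the more convenient tool.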
As a sanity check, consider the association between “bisphenol A” (BPA, a chemical used in the production of polycarbonate plastics) and “diabetes”. The review paper [9] states:
Several epidemiological studies reveal a significant association between BPA and the development of insulin resistance and impaired glucose homeostasis…
Remarks
The original knowledge graphs and data banks from which PrimeKG is composed will certainly change over time. Instructions for recreating PrimeKG from scratch (i.e. updating it) are given in [6].
“The devil is in the details.” Synthesizing multiple data sources can be challenging because different sources may use (slightly) different names for the same disease. Even within a single source, “MONDO contains many repetitive disease entities with no apparent clinical correlation. For this reason, we were motivated to group diseases in MONDO into medically relevant entities”, as noted by Chandak et al. [1]. The authors then used ClinicalBERT embeddings with a cutoff threshold of 0.98 to determine similarity. This is just one of many steps in processing the raw data from the various sources.
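To make the idea of a similarity cutoff concrete, here is a toy sketch. It is not the authors' pipeline: cosine similarity is assumed as the measure, the comparison `>=` is assumed, and the three-dimensional vectors and names are invented stand-ins for real ClinicalBERT embeddings of disease names.

```python
import itertools
import math

THRESHOLD = 0.98  # cutoff value quoted from the paper


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy embeddings; real ones would come from ClinicalBERT.
embeddings = {
    "name A": [1.0, 0.0, 0.0],
    "name B": [0.99, 0.05, 0.0],
    "name C": [0.0, 1.0, 0.0],
}

# Keep the pairs of names whose embeddings are at least as similar as the cutoff.
similar_pairs = [
    (x, y)
    for x, y in itertools.combinations(sorted(embeddings), 2)
    if cosine_similarity(embeddings[x], embeddings[y]) >= THRESHOLD
]
print(similar_pairs)
```

In this toy data only “name A” and “name B” point in nearly the same direction, so only that pair would be considered for grouping.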
The original “edges.csv” contains directed edges. At first, I thought this might be deliberate, since directed edges can meaningfully indicate that “A affects B” (but not the other way around). However, when loading the full “edges.csv” into a Pandas DataFrame, the number of rows is exactly double the number of relationships reported in the paper (8,100,498 vs. 4,050,249). So I think the edges simply happen to be stored in both directions, rather than being directed by design.
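Whether every edge really has a mirrored twin can be tested rather than inferred from the row count alone. The sketch below shows the check on made-up edges. The helper name `has_all_reverse_edges` is hypothetical, and each edge is represented as a (relation, start, end) tuple.

```python
def has_all_reverse_edges(edges):
    """True if, for every (relation, start, end), (relation, end, start) is also present."""
    edge_set = set(edges)
    return all((rel, end, start) in edge_set for rel, start, end in edge_set)


# Made-up edges: each one appears in both directions.
toy_edges = [
    ("drug_drug", "0", "1"),
    ("drug_drug", "1", "0"),
    ("indication", "2", "3"),
    ("indication", "3", "2"),
]
symmetric = has_all_reverse_edges(toy_edges)

# Treating each edge as undirected halves the count when the data is symmetric.
unique_undirected = len({(rel, frozenset((s, e))) for rel, s, e in toy_edges})
print(symmetric, unique_undirected)
```

If the same check holds on the full file, halving the rows loses no information, which is what the de-duplication step in the preparation section relies on.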
In summary, PrimeKG benefits not only medical researchers but also machine learning developers, who can use such graphs for feature engineering or graph embedding. 🧐
References
[1] Building a knowledge graph to enable precision medicine. Chandak et al. (2023) (https://www.nature.com/articles/s41597-023-01960-3)
[2] MONDO Disease Ontology (https://monarchinitiative.org/)
[3] DisGeNET (https://disgenet.com/)
[4] Gene Ontology (https://geneontology.org/)
[5] PrimeKG Data (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM&version=2.1)
[6] PrimeKG SourceCode (https://github.com/mims-harvard/PrimeKG)
[7] Eclipse Temurin (https://adoptium.net/temurin/releases/)
[8] Neo4j Import (https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/neo4j-admin-import/#import-tool-syntax)
[9] Bisphenol A and Type 2 Diabetes Mellitus: A Review of Epidemiologic, Functional, and Early Life Factors (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7830729/)