Modeling Healthcare Data with Graph Databases

Introduction

Since the transition from paper records to electronic records, hospitals have been piling up data. Every touchpoint with the healthcare system, from prescriptions to operations to immunizations, is logged and stored in the hospital’s electronic health record (EHR). Hospitals now have more data than they know what to do with. Even worse, this oversaturation of complex data makes accessing and analyzing it extremely inefficient.

So, what’s the solution to this crisis?

Graph Databases!

Graphs are perfect for storing and visualizing healthcare data. They are designed to handle highly connected information, like patient records. If you are unfamiliar with graphs, check out this awesome article that introduces some of the basics of graph theory.

Generating the Data

In an ideal world, we could create this graph using real patient data; however, privacy rules and regulations make working with real patient data difficult. Instead, we can use the next best thing: synthetic data.

Using Synthea, an open-source synthetic patient generator, we can create an entire healthcare ecosystem full of patients, hospital visits, insurance providers, and everything else you could think of. If you’ve never encountered Synthea before, check out this short post I wrote explaining how it works.

The output data from Synthea is divided into several CSV files such as Allergies, Medications, Encounters, Providers, etc.
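To give a feel for what these files look like to a program, here is a minimal Python sketch that parses a tiny, made-up excerpt in the style of a medications file. The column names and rows are assumptions for illustration only, not actual Synthea output, so check the headers of your own files before relying on them.

```python
import csv
import io

# Made-up rows in the style of a Synthea CSV export.
# The column names are assumed for illustration; real headers may differ.
SAMPLE_MEDICATIONS_CSV = """PATIENT,ENCOUNTER,CODE,DESCRIPTION
p1,e1,1001,Drug A
p1,e2,1002,Drug B
p2,e3,1001,Drug A
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_MEDICATIONS_CSV)

# Every row points back to a patient, which is what later becomes an edge.
patients = sorted({row["PATIENT"] for row in rows})
print(len(rows), "rows for patients", patients)
```

Each row already carries the ids of the things it connects (here a patient and an encounter), which is what makes the files easy to map onto vertices and edges.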

Creating a Graph Schema

HEALTHCARE DATA IS COMPLEX!

Each patient has so many interactions with the healthcare system that a simple schema would almost certainly fail to capture all of the available information. Our schema must be as detailed as our data.

I could write an entire blog on this script alone. But, in the interest of sparing you what would surely be a very boring read, I’ll just note some key points.

  • Each CSV file topic becomes a vertex with the appropriate edges.
  • All of the edges are undirected because all relationships are bi-directional (e.g., a patient has a medication, and that medication in turn corresponds to that patient).
  • Attributes like gender, race, and address could be internal attributes, but I chose to break them out into separate vertices to optimize searches around those attributes.
  • The vertex SnomedCode stores every medical code used, which also helps to optimize searches.
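The key points above can be illustrated with a small in-memory sketch in plain Python. This is not GSQL and not the actual schema script; the Graph class, the vertex type names, and the sample ids are all hypothetical and exist only to show the two design choices: undirected edges, and an attribute such as gender stored as its own vertex.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory undirected graph with typed vertices (illustrative only)."""

    def __init__(self):
        self.attrs = {}              # (type, id) -> attribute dict
        self.adj = defaultdict(set)  # (type, id) -> set of neighbor keys

    def add_vertex(self, vtype, vid, **attrs):
        self.attrs.setdefault((vtype, vid), {}).update(attrs)

    def add_edge(self, a, b):
        # Undirected: store the edge in both directions.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def neighbors(self, key, vtype=None):
        """Return neighbors of key, optionally filtered by vertex type."""
        return sorted(n for n in self.adj[key] if vtype is None or n[0] == vtype)


g = Graph()
g.add_vertex("Patient", "p1", last_name="Doe")
g.add_vertex("Patient", "p2", last_name="Roe")
g.add_vertex("Medication", "m1", description="Drug A")
g.add_vertex("Gender", "F")  # broken out as a vertex, not a Patient attribute

g.add_edge(("Patient", "p1"), ("Medication", "m1"))
g.add_edge(("Patient", "p1"), ("Gender", "F"))
g.add_edge(("Patient", "p2"), ("Gender", "F"))

# The same edge can be walked from either end.
print(g.neighbors(("Patient", "p1"), "Medication"))
print(g.neighbors(("Medication", "m1"), "Patient"))

# Searching by gender is a one-hop lookup from the Gender vertex,
# rather than a scan over every patient's attributes.
print(g.neighbors(("Gender", "F"), "Patient"))
```

The trade-off is more vertices and edges in exchange for cheaper searches on the attributes that were broken out.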

Loading the Data

Again, let’s briefly take a look at the important parts of this code.

  • We first define the file that we are using to load our data.
  • We then specify which columns correspond to the vertex ids, vertex attributes, or edge attributes as defined in our schema.
  • Finally, we state that our file has a header and that the separator is a comma.
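The three steps above can be mimicked in plain Python to make the idea concrete. This is only a sketch of the concept, not a GSQL loading job: the MAPPING dictionary, the load function, the column names, and the sample rows are all hypothetical.

```python
import csv
import io

# Made-up sample data; real Synthea headers may differ.
SAMPLE = """PATIENT,CODE,DESCRIPTION,START
p1,1001,Drug A,2020-01-01
p2,1001,Drug A,2020-02-15
p1,1002,Drug B,2020-03-10
"""

# Hypothetical mapping from CSV columns to schema elements.
MAPPING = {
    "vertex_type": "Medication",
    "vertex_id": "CODE",                # column holding the vertex id
    "vertex_attrs": ["DESCRIPTION"],    # columns stored on the vertex
    "edge_to": ("Patient", "PATIENT"),  # other endpoint: (type, id column)
    "edge_attrs": ["START"],            # columns stored on the edge
}


def load(text, mapping, separator=",", header=True):
    """Build (vertices, edges) from CSV text according to the mapping."""
    if not header:
        raise ValueError("this sketch needs a header row to name the columns")
    vertices, edges = {}, []
    for row in csv.DictReader(io.StringIO(text), delimiter=separator):
        vkey = (mapping["vertex_type"], row[mapping["vertex_id"]])
        vertices[vkey] = {col: row[col] for col in mapping["vertex_attrs"]}
        other_type, other_col = mapping["edge_to"]
        okey = (other_type, row[other_col])
        vertices.setdefault(okey, {})  # make sure the other endpoint exists
        edges.append((okey, vkey, {col: row[col] for col in mapping["edge_attrs"]}))
    return vertices, edges


vertices, edges = load(SAMPLE, MAPPING)
print(len(vertices), "vertices,", len(edges), "edges")
```

Repeated ids collapse into a single vertex while every row still produces its own edge, which is one reason a loaded graph tends to have more edges than vertices.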

Using the same format, we can write similar loading jobs for the rest of our data files.

For our sample of 500 patients, loading all of our data results in about 800,000 vertices and almost 2 million edges.

Sample Query

I won’t go into too much detail about GSQL queries. I don’t want to focus on writing queries, but on the speed at which they run. After all, this blog is meant to showcase the efficiency of graph databases. I have another blog that goes through a number of query examples if you are curious. And, as always, the TigerGraph documentation site is a great place to find more information.

Let’s run a simple query that grabs all vertices and edges immediately connected to a given patient.

This query returns a lot of information. It basically collects every touchpoint that a given patient has with the healthcare system. Normally, that would be a tough job for a database. But with our graph database, the information was retrieved within A FEW MILLISECONDS! That’s incredibly fast!
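To see why a lookup like this can be cheap, here is a plain-Python sketch of a one-hop neighborhood query over an adjacency index. It is not the GSQL query and makes no claim about how TigerGraph works internally; the edge type names, the ids, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Made-up edges as (edge type, endpoint, endpoint); all names are hypothetical.
EDGES = [
    ("PATIENT_HAS_MEDICATION", ("Patient", "p1"), ("Medication", "m1")),
    ("PATIENT_HAS_ENCOUNTER", ("Patient", "p1"), ("Encounter", "e1")),
    ("PATIENT_HAS_ALLERGY", ("Patient", "p1"), ("Allergy", "a1")),
    ("PATIENT_HAS_MEDICATION", ("Patient", "p2"), ("Medication", "m1")),
]


def build_index(edges):
    """Index every undirected edge under both of its endpoints."""
    adjacency = defaultdict(list)
    for etype, a, b in edges:
        adjacency[a].append((etype, b))
        adjacency[b].append((etype, a))
    return adjacency


def one_hop(adjacency, start):
    """Return the vertices and edges directly connected to start."""
    incident = adjacency.get(start, [])
    neighbor_vertices = sorted({vertex for _, vertex in incident})
    return neighbor_vertices, sorted(incident)


index = build_index(EDGES)
neighbor_vertices, incident_edges = one_hop(index, ("Patient", "p1"))
print(neighbor_vertices)
print(incident_edges)
```

Once the index exists, the work done by one_hop depends on how many edges touch the chosen patient, not on how many patients are in the graph.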

This speed applies to datasets much larger than 500 patients as well. In a sample system of 100 million patients (wow that’s a lot of data), the time taken to gather this same information is only a couple of seconds.

Querying using graphs is extremely efficient and shows a huge improvement over standard techniques used for querying healthcare data.

Further Exploration

While graph databases do serve as effective and efficient ways to hold data, their benefits extend beyond storage. We saw already how quickly queries can be performed on large datasets. But, we can also take advantage of graphs for visualizations. For example, using the same database and a slightly different query, we can easily create this 3D network graph, which only takes a few seconds to render.

Besides looking really cool, this 3D visual is extremely useful. While a 2D representation of the same information would be cluttered and impossible to read, this 3D model provides an open and clean method of viewing our large amount of interconnected data. And, while the aesthetic parts are made with HTML and JavaScript, the data, the key to the entire visualization, lies in the graph database and the query.

Conclusion

Graphs are an excellent alternative to conventional relational databases, especially for representing highly connected data. They’re perfect for representing healthcare networks, where each patient is connected to huge amounts of data. If implemented on a large scale, this technology could greatly reduce the burden on EHR systems, and make storing, analyzing, and visualizing data much more efficient. Graphs are the future of medical data!

First published at Towards Data Science
