Analyst Commentary: Neo4j looks to break down barriers between graph analysis and data science


Graph database market standout Neo4j has introduced Neo4j for Graph Data Science, a set of tools designed to facilitate the creation of predictive algorithms based on the relationships and network structures that exist between and among data points. This move, together with recent work to integrate with business intelligence (BI) solutions, reveals a company intent on making graph analysis a mainstream enterprise endeavor on par with relational data analysis.

Extending graph data science beyond the database 

Graph analysis has proven its worth in helping data scientists solve some very thorny problems, such as fraud detection, churn prediction, and predictive maintenance. However, given its unique architecture, which uses nodes and edges rather than rows and columns, graph analysis has largely operated outside mainstream analytics and data science circles. And yet its idiosyncrasies make graph analysis and graph databases not just unique but crucial within the realm of data science. As an example, data scientists seeking to answer a very basic question about the relationship between two products that might sit across a hierarchy would need to set up a number of complex and resource-hungry table joins and table index lookups. A graph database, however, can traverse those hierarchies and surface insights with comparatively little code and less overhead.
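To make the traversal point concrete, here is a minimal sketch in plain Python rather than in Neo4j itself. The product hierarchy, the node names, and the helper functions are all hypothetical; the idea is only that once relationships are stored as edges, finding how two products connect is a short breadth-first walk rather than a chain of joins.

```python
from collections import deque

# Hypothetical product hierarchy, stored as undirected edges.
EDGES = [
    ("Electronics", "Laptops"),
    ("Electronics", "Accessories"),
    ("Laptops", "UltraBook 13"),
    ("Accessories", "USB-C Dock"),
]


def build_graph(edges):
    """Turn an edge list into an adjacency mapping: node -> set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorted only to keep the visiting order deterministic.
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(EDGES)
path = shortest_path(graph, "UltraBook 13", "USB-C Dock")
print(" -> ".join(path))
```

In a relational layout, the same question would typically require one self-join per level of the hierarchy, with the number of levels known in advance.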

Given such benefits, graph data science (GDS) capabilities are rapidly gaining enterprise-class functionality within graph database solutions themselves. Neo4j for Graph Data Science, for instance, goes a long way to operationalize GDS within the Neo4j database itself. Built to work on top of the company’s most recent graph database incarnation, Neo4j 4.0, this new toolkit combines a number of elements: 

  • Neo4j Graph Data Science Library – Leading up to this release, Neo4j has been optimizing and hardening an existing set of graph algorithms proven to be effective in answering specific questions. Data scientists working on a retail recommendation engine can quickly identify unique users, transaction volumes, customer segments, and purchasing similarities through reproducible workflows within the Neo4j database. More broadly, this library of algorithms spans a number of GDS concepts, including community detection, similarity, pathfinding, link prediction, and centrality. 
  • Neo4j Graph Database – Beyond leveraging Neo4j’s core native graph storage and processing engine, Neo4j for Graph Data Science equips users with numerous automated scripts specific to data science concerns, such as data load and transformation tasks. Analysts working within this toolkit can also import from a database (Neo4j or any other source) into an in-memory model and reshape (mutate) that instance without changing the source database. 
  • Neo4j Bloom – Built to enable developers and database admins to visualize nodes and edges (relationships between nodes), Neo4j Bloom directly supports and augments data science tasks because this search-first interface can quickly uncover non-intuitive hidden relationships for further exploration both within and outside of Neo4j itself.
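The import-and-mutate idea described above can be illustrated with a small sketch in plain Python. This is not the Neo4j API: the dictionary layout, the `project` and `mutate_degree` helpers, and the `degree` property are all assumptions made for illustration. It shows only the principle that an analyst works on an in-memory copy, adds computed properties to it, and leaves the source untouched.

```python
import copy

# Hypothetical stand-in for a source database: nodes with properties, plus edges.
source_graph = {
    "nodes": {"alice": {}, "bob": {}, "carol": {}},
    "edges": [("alice", "bob"), ("alice", "carol")],
}


def project(graph):
    """Create an independent in-memory copy of the source graph."""
    return copy.deepcopy(graph)


def mutate_degree(projection, prop="degree"):
    """Write each node's degree onto the in-memory copy only."""
    for props in projection["nodes"].values():
        props[prop] = 0
    for a, b in projection["edges"]:
        projection["nodes"][a][prop] += 1
        projection["nodes"][b][prop] += 1
    return projection


projection = mutate_degree(project(source_graph))
print(projection["nodes"]["alice"])   # computed property lives on the copy
print(source_graph["nodes"]["alice"]) # source remains unchanged
```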

Graph data science down the road

However, this is only half of the story. Neo4j for Graph Data Science doesn’t presume that data science can only happen within the confines of the Neo4j graph database. With this toolkit, data scientists can greatly improve and extend their more traditional machine learning (ML) models. The new toolkit, of course, allows users to easily import and transform data for graph analysis within Neo4j. But it goes well beyond that by allowing users to export the results of that graph analysis back to a form suitable for inclusion within an external ML project.

For example, after importing a table of customer purchases over time, a Neo4j user could run the Louvain method to identify and classify communities among those customers. This information can then be used to enhance an existing ML model through the inclusion of an entirely new, graph-generated predictive feature. Any algorithm carried out within Neo4j can serve to enhance feature engineering in this way. For example, users can run a PageRank algorithm to measure transaction volumes or run a Jaccard index algorithm to identify purchasing similarities. The results can then be labeled for ML classification and regression work outside Neo4j within traditional data science development environments.
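As a rough illustration of a graph-generated feature, the sketch below computes the Jaccard index (intersection size divided by union size) between customers' purchase sets and attaches each customer's highest similarity score to a tabular row. It is written in plain Python, not against Neo4j; the customer data, the column names, and the `max_jaccard` feature are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical purchase history: customer -> set of products bought.
purchases = {
    "alice": {"laptop", "dock", "mouse"},
    "bob": {"laptop", "dock"},
    "carol": {"novel"},
}

# Hypothetical tabular training rows that an existing ML model already uses.
rows = [
    {"customer": "alice", "total_spend": 1500.0},
    {"customer": "bob", "total_spend": 1300.0},
    {"customer": "carol", "total_spend": 20.0},
]


def add_similarity_feature(rows, purchases):
    """Return new rows with a 'max_jaccard' column; the input rows are not modified."""
    enriched = []
    for row in rows:
        me = row["customer"]
        scores = [
            jaccard(purchases[me], items)
            for name, items in purchases.items()
            if name != me
        ]
        enriched.append({**row, "max_jaccard": max(scores, default=0.0)})
    return enriched


enriched = add_similarity_feature(rows, purchases)
for row in enriched:
    print(row)
```

The enriched rows can then be fed to any conventional classifier or regressor alongside the original columns.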

Ultimately, this conjoined graph/tabular approach can make ML models more accurate; it can also help data scientists explore new predictive avenues. The underlying workflow, however, is at present fairly basic in that it ends with Neo4j either writing dataset mutations back to the original graph database or exporting them in the form of a CSV file that extends the originally imported file structure. It doesn’t do much to bring Neo4j any closer to orthodox data science, which is at present chasing the idea of operationalized, DevOps-style lifecycle management for ML projects, known as MLOps.
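The CSV hand-off described here amounts to appending a computed column to the file that was originally imported. The following sketch shows that shape using only the Python standard library; the `extend_csv` helper, the `community_id` column, and the sample data are hypothetical and do not represent Neo4j's export mechanism.

```python
import csv
import io


def extend_csv(source_text, key_column, new_column, values_by_key):
    """Return CSV text equal to the source plus one extra column.

    Rows whose key has no computed value get an empty cell.
    """
    reader = csv.DictReader(io.StringIO(source_text))
    fieldnames = list(reader.fieldnames) + [new_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[new_column] = values_by_key.get(row[key_column], "")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical imported file and hypothetical graph-derived community labels.
source = "customer,total_spend\nalice,1500\nbob,1300\ncarol,20\n"
communities = {"alice": 0, "bob": 0}

result = extend_csv(source, "customer", "community_id", communities)
print(result)
```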

There’s a lot more to come here for graph analysis in general, with Neo4j and its rivals actively building more GDS functionality into graph databases and creating further touchpoints between GDS and traditional data science. One interesting notion on the horizon is graph embeddings. With graph embeddings, data scientists can transform nodes, edges, and associated features into a lower-dimensional vector space. These vector spaces are much more closely aligned with ML techniques and can take advantage of a very rich set of established, supportive tools. Neo4j is actively working on this notion and expects to put something in front of users sometime this summer.
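To give a feel for what an embedding does, here is a deliberately simple sketch: each node is represented by the average of fixed random vectors assigned to its neighbors, which maps every node to a short vector of length `dim`. This random-projection toy is a swapped-in stand-in chosen for brevity, not the technique Neo4j is building, and the function name, parameters, and sample edges are all hypothetical.

```python
import random


def embed_nodes(edges, dim=2, seed=42):
    """Map each node to a dim-length vector by averaging its neighbors' random vectors."""
    nodes = sorted({n for edge in edges for n in edge})
    index = {n: i for i, n in enumerate(nodes)}
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # One fixed random vector per node; seeding keeps the result reproducible.
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in nodes]

    embeddings = {}
    for n in nodes:
        vec = [0.0] * dim
        for m in sorted(neighbors[n]):
            for d in range(dim):
                vec[d] += projection[index[m]][d]
        degree = len(neighbors[n])  # at least 1 for any node that appears in an edge
        embeddings[n] = [v / degree for v in vec]
    return embeddings


# Hypothetical graph: "a" and "b" share exactly the same neighborhood.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("c", "d")]
embeddings = embed_nodes(edges)
for node, vec in embeddings.items():
    print(node, vec)
```

Nodes with identical neighborhoods land on the same point, which is the property that lets downstream ML tools treat structural similarity as ordinary vector similarity.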


Further reading

“TigerGraph readies no-code graph analysis database,” INT002-000277 (March 2020)

“Tech companies pitch in to fight COVID-19,” INT002-000279 (March 2020)

“Analyst Commentary: Splice Machine aims to meld relational databases with machine learning,” INT002-000280 (April 2020)

“Oracle looks to take the ‘Oops!’ out of data science,” INT002-000273 (February 2020)


Bradley Shimmin, Distinguished Analyst, Data Management and Analytics
