A quick primer on graph databases

Graph databases have moved from a topic of academic study into the mainstream of information technology in the last few years. Now business analysts are confronted with the need to better understand:

What business problems do graph databases address well?
What advantages do graph databases offer over widely-implemented relational databases?
What issues emerge as graph databases are introduced into an existing application portfolio?

A graph database (GDB) uses graph structures to represent and store data: entities are stored as nodes, and the relationships among them are stored as edges. Graph databases emphasize these relationships among data entities.
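
To make the idea of a graph structure concrete, here is a minimal in-memory sketch of nodes and typed relationships. The class names (Node, Edge, Graph) and the sample customer and product data are illustrative assumptions, not the data model of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, e.g. a customer or a product."""
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed relationship between two nodes."""
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class Graph:
    """Hypothetical container: nodes plus the edges leaving each node."""

    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.outgoing = {}   # node_id -> list of Edge leaving that node

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.outgoing.setdefault(node.node_id, [])

    def add_edge(self, edge):
        self.outgoing[edge.source].append(edge)

    def neighbors(self, node_id, rel_type):
        """Return the nodes reached from node_id over edges of rel_type."""
        return [self.nodes[e.target] for e in self.outgoing[node_id]
                if e.rel_type == rel_type]


g = Graph()
g.add_node(Node("c1", "Customer", {"name": "Alice"}))
g.add_node(Node("p1", "Product", {"name": "Laptop"}))
g.add_edge(Edge("c1", "p1", "PURCHASED", {"year": 2020}))

bought = [n.properties["name"] for n in g.neighbors("c1", "PURCHASED")]
print(bought)
```

The relationship is a first-class record with its own type and properties, which is the emphasis described above.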

DBMS developments

First a bit of history: To improve data management and data processing as data volumes grew, database management systems (DBMS) emerged as a separate software layer between the operating system and the application program in the 1960s.

By the 1980s, the relational DBMS had become the principal DBMS, and it has remained so. The 2000s saw the emergence of XML databases, NoSQL databases, and the idea that databases didn’t need to be tightly structured in a purely tabular form.

During the 2010s, databases that support the JSON open standard file format gained traction. We also saw the rise and ultimate fall of Hadoop, a software framework for using highly distributed storage to process big data.

Data volume explosion

Fast forward to today: Data volumes continue to grow rapidly. They are generated by many sources, including:

Internet of Things (IoT). Industrial and consumer devices, or things, that monitor the performance of almost everything, along with IoT devices that are replacing analog data recording devices, all generate huge data volumes.
Digitalization of society. The most obvious example is the vast volume of digital data available on the web and its consumption by billions of people.
Graphic and video data types. Originally, data meant letters and numbers only. Introducing graphics and videos has added many orders of magnitude more data.
Digital transformation of businesses and government. Most organizations are actively working to enhance application functionality and eliminate the remaining bits of paper and Excel workbooks that exist between their systems.
Voracious demand for data analytics. The demands of data analytics triggered by the move toward more data-driven organizations have added significant data volumes.

Graph database opportunities

Today’s problem: Despite the many DBMS advances and huge improvements in computing infrastructure performance introduced over many decades, systems are straining or failing to handle these vast data volumes.

Today’s solution: Applications that access graph databases can solve various types of problems that are creating frustrations at the enterprise level. Examples of these applications, for which business analysts need to seriously consider graph databases, include:

Artificial intelligence.
Computing infrastructure monitoring.
Customer 360 interaction analysis.
Fraud detection.
Knowledge base.
Metadata management.
Master data management.
Natural language processing.
Recommendation engine.
Social media influencer analysis.

These applications benefit from using graph databases because they:

Deliver excellent performance for complex data analytics.
Simplify data ingestion and integration from diverse data sources.
Manage vast data volumes reliably.

Advantages of graph databases

At the recent Collision from Home virtual conference, Javier Ramirez, Senior Developer Advocate, Amazon Web Services (AWS), described how graph databases are superior for managing highly interconnected data, and for quickly producing concise results for complex queries. AWS offers the Neptune graph database service. He said that “Neptune addresses the graph database issues that many end-users encounter. These issues include lack of scaling, non-existent high availability, and uneven support for open standards.”

For each advantage in the section below, graph databases are compared to relational databases.

Query speed

All end-users are always impatient and expect quick response times. Nobody cares about the impact of query complexity or the vastness of data volumes that must be traversed to produce a result. DBMSs work hard to respond to this expectation.

For graph databases, query speed depends mainly on the number of relationships a query traverses rather than on the total data volume in the database. Reading only the data directly or closely related to the relationships being queried produces fast results.

For relational databases, query speed depends on the number of tables that must be joined and the amount of data in every table being queried. Queries therefore slow materially as the number of tables and the data volume involved increase. While good index design and superior query optimization can reduce speed losses, they are often not enough.
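
The contrast can be sketched by counting how much data each approach reads to answer "who are this person's friends?". This is a simplified illustration under stated assumptions: the function names are hypothetical, the table lookup is deliberately unindexed, and a real relational engine with a suitable index would narrow the gap considerably.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def build_data(num_people, friends_per_person=5):
    """Return the same friendships as an edge table and as adjacency lists."""
    edge_table = []                                  # rows of (person, friend)
    adjacency = {p: [] for p in range(num_people)}   # person -> list of friends
    for person in range(num_people):
        for friend in random.sample(range(num_people), friends_per_person):
            edge_table.append((person, friend))
            adjacency[person].append(friend)
    return edge_table, adjacency


def friends_by_table_scan(edge_table, person):
    """Unindexed join-table lookup: every row is examined."""
    rows_examined = 0
    friends = []
    for src, dst in edge_table:
        rows_examined += 1
        if src == person:
            friends.append(dst)
    return friends, rows_examined


def friends_by_traversal(adjacency, person):
    """Graph-style lookup: only the start node's own relationships are read."""
    friends = adjacency[person]
    return friends, len(friends)


for size in (1_000, 10_000):
    table, adj = build_data(size)
    scan_friends, scanned = friends_by_table_scan(table, 0)
    walk_friends, followed = friends_by_traversal(adj, 0)
    print(size, scanned, followed)
```

The scan count grows with the size of the table, while the traversal count stays at the number of relationships attached to the starting node.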

Representation of relationships

Whenever a DBMS can represent real-world relationships accurately and avoid kluges or workarounds such as cross-reference tables or composite keys, it’s easier for software developers to understand the organization of the data in the database. That ease-of-understanding leads to:

More accurate, reliable solutions with less development effort.
Reduced effort and elapsed time to implement future enhancements.

For graph databases, relationships are stored as data alongside the attribute data in the database. This relationship storage results in high-performance queries, even for complicated queries or large data volumes.

For relational databases, relationships are defined through the value of foreign keys or software logic. Foreign keys are incredibly useful up to the point where they trigger too many joins or even force a self-join. At that point, foreign keys cause a significant deterioration in query performance. Defining relationships through software logic makes it difficult to understand relationships just from the database schema and creates significant software maintenance effort.

Graph databases can easily represent and query hierarchies of data. Hierarchies are more difficult for relational databases to represent and result in multiple tables that create query performance problems.
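
As an illustration of the hierarchy point, the sketch below stores one small org chart twice: once as a self-referencing relational table queried with a recursive self-join, and once as stored relationships that are simply followed. The employee names and the table layout are made up for the example, and SQLite stands in for a relational DBMS only because it ships with Python.

```python
import sqlite3

# Hypothetical org chart: (employee, manager); None marks the top of the tree.
reports_to = [("Ann", None), ("Bob", "Ann"), ("Cat", "Ann"),
              ("Dan", "Bob"), ("Eve", "Dan")]

# Relational form: one table that references itself through the manager column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "name TEXT PRIMARY KEY, manager TEXT REFERENCES employee(name))"
)
conn.executemany("INSERT INTO employee VALUES (?, ?)", reports_to)

# A recursive query self-joins the table once per level of the hierarchy.
sql = """
WITH RECURSIVE team(name) AS (
    SELECT name FROM employee WHERE manager = ?
    UNION ALL
    SELECT e.name FROM employee e JOIN team t ON e.manager = t.name
)
SELECT name FROM team ORDER BY name
"""
relational_result = [row[0] for row in conn.execute(sql, ("Ann",))]

# Graph form: keep the relationships themselves and follow them from the root.
children = {}
for name, manager in reports_to:
    if manager is not None:
        children.setdefault(manager, []).append(name)


def descendants(root):
    """Collect everyone below root by walking the stored relationships."""
    found, stack = [], list(children.get(root, []))
    while stack:
        person = stack.pop()
        found.append(person)
        stack.extend(children.get(person, []))
    return sorted(found)


graph_result = descendants("Ann")
print(relational_result, graph_result)
```

Both forms return the same people; the difference is that the relational form has to express the hierarchy as repeated joins on a foreign key.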

Representation of data structures

Whenever a DBMS can represent real-world data structures accurately, more of the same benefits listed under Representation of relationships above can be realized.

For graph databases, data structures are more flexible. Data is held in nodes and relationships, each of which can carry its own properties, and these definitions can be altered dynamically.

These characteristics are particularly important when the data doesn’t have a specific format. A good example is Facebook comments or posts that can consist of any combination of text, images, videos, links, and geographic coordinates.
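
A rough sketch of that flexibility, assuming a simple dictionary-based store (the post identifiers, property names, and relationship types are invented for illustration):

```python
# Each node carries only the properties it actually has.
posts = {
    "post1": {"text": "Hello from Toronto"},
    "post2": {"image": "skyline.jpg", "geo": (43.65, -79.38)},
    "post3": {"text": "Watch this", "video": "demo.mp4",
              "link": "https://example.com"},
}

# Relationships are stored as (source, type, target) records.
relationships = [("post2", "REPLY_TO", "post1")]

# A new property and a new relationship type appear without a schema change.
posts["post1"]["language"] = "en"
relationships.append(("post3", "SHARES", "post2"))


def posts_with(prop):
    """Return the IDs of posts that have the given property."""
    return sorted(pid for pid, props in posts.items() if prop in props)


text_posts = posts_with("text")
rel_types = sorted({rel_type for _, rel_type, _ in relationships})
print(text_posts, rel_types)
```

Nothing had to be declared in advance for the video, link, or language properties, or for the second relationship type.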

For relational databases, data structures are more rigid, and:

Always consist of related tables that together define and contain the data available for entities.
Rely exclusively on values of foreign keys to represent the relationships between entities.
Must be defined in advance.

Making changes to data structures of relational databases always requires careful impact analysis and planning. Often an application outage is required to introduce the change.
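
The need to define structures in advance can be seen in a small sketch using SQLite, chosen only because it ships with Python. The table and column names are hypothetical, and other relational products differ in how disruptive such a change is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)")

insert_sql = "INSERT INTO post (id, body, video_url) VALUES (1, 'hi', 'demo.mp4')"

# Writing to a column that was not defined in advance is rejected.
try:
    conn.execute(insert_sql)
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The structure has to be changed first; only then does the insert succeed.
conn.execute("ALTER TABLE post ADD COLUMN video_url TEXT")
conn.execute(insert_sql)

columns = [row[1] for row in conn.execute("PRAGMA table_info(post)")]
print(rejected, columns)
```

On a large production table, the equivalent change is the step that calls for the impact analysis and planning described above.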

Disadvantages of graph databases

For each disadvantage in the section below, graph databases are compared to relational databases.

Rapidly evolving technology

Every graph database vendor is introducing major enhancements regularly. This high level of product development creates:

Difficulty comparing products because the landscape is changing so quickly.
Product stability issues because it’s difficult to thoroughly test all this new software.

Example graph database enhancements include support for:

High-speed data ingestion.
Integrated data visualization.
Integrated machine-learning algorithms and tools.
Graphics processing units (GPU).
Both property graphs and semantic graphs.
JSON open standard file format data storage.
XML format data storage.
NoSQL data structures.
Document entity enrichment – parsing unstructured data for entity values to store as structured data.

Relational database vendors are also introducing many of these enhancements in response to competitive pressure and customer requests.

Difficult to scale

Most graph databases were initially designed to run on a single server. Some vendors have begun to offer sharding, which distributes the database across multiple servers.

Many relational databases have supported sharding for many years.
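
One reason sharding is harder for graphs is that relationships can span servers. The sketch below counts such cross-shard relationships for a tiny ring of nodes under two hypothetical placement rules; it is an illustration of the problem, not a description of how any vendor shards data.

```python
NUM_NODES = 12

# A small ring: each node is connected to the next one.
edges = [(n, (n + 1) % NUM_NODES) for n in range(NUM_NODES)]


def modulo_placement(node_id, num_shards):
    """Hypothetical rule: spread nodes round-robin across shards."""
    return node_id % num_shards


def range_placement(node_id, num_shards):
    """Hypothetical rule: keep consecutive node IDs on the same shard."""
    return node_id * num_shards // NUM_NODES


def cross_shard_edges(placement, num_shards):
    """Count relationships whose two endpoints live on different shards."""
    return sum(
        1 for a, b in edges
        if placement(a, num_shards) != placement(b, num_shards)
    )


for shards in (1, 2, 4):
    print(
        shards,
        cross_shard_edges(modulo_placement, shards),
        cross_shard_edges(range_placement, shards),
    )
```

Every relationship that crosses a shard boundary turns a local traversal step into a network call, so how nodes are placed matters as much as how many servers are added.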

No standard language

Graph database vendors have not converged on a single language for updates and queries. Several have defined their own syntax, and each claims its language is superior. Most vendors support some version of Gremlin, SPARQL, or Cypher, but these languages are not interchangeable. This lack of standardization makes it difficult to migrate from one product to another and adds the cost of training staff in a particular language.
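
To give a feel for the differences, here is the same question ("who does Alice know?") written in each of the three languages and held as plain strings. The Person label, the KNOWS relationship, and the ex: vocabulary are assumptions made for the example, and the exact syntax accepted varies by product and version.

```python
# The same question in three graph query languages (illustrative only).
queries = {
    "Cypher": (
        "MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(f:Person) "
        "RETURN f.name"
    ),
    "Gremlin": (
        "g.V().has('Person', 'name', 'Alice')"
        ".out('KNOWS').values('name')"
    ),
    "SPARQL": (
        "PREFIX ex: <http://example.org/> "
        "SELECT ?name WHERE { "
        "?a ex:name \"Alice\" . ?a ex:knows ?f . ?f ex:name ?name . }"
    ),
}

for language, text in queries.items():
    print(f"{language}: {text}")
```

Cypher describes a pattern to match, Gremlin chains traversal steps, and SPARQL matches triples, so moving between them means rewriting queries rather than adjusting a dialect.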

All relational databases support the standard SQL language for updates and queries. Although many vendors have extended the SQL language, every vendor supports the core SQL language. This standardization makes it easy to find and onboard experienced staff.

Lack of parallelism

End-users of relational databases take parallelism for granted. This is the ability of the database engine to concurrently process both queries and updates submitted by multiple active tasks.

Some graph databases offer parallelism and others don’t.

Missing operational features

End-users of relational databases take operational features such as the following for granted:

Transactions and the associated rollback mechanism.
Various data recovery options.
Durability – guarantees that transactions that have committed will survive permanently.

Many graph databases either don’t offer all of these operational features or are still working on them.
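
For readers less familiar with the rollback mechanism that relational users take for granted, here is a minimal sketch using SQLite, chosen only because it ships with Python. The account table, names, and amounts are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(amount):
    """Move money from alice to bob as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30)        # both updates are committed together
failed = transfer(500)   # the second update violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the failed transfer, bob's balance had already been increased when the constraint was violated, and the rollback undid that partial change.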

First published at IT World Canada