Have you ever stopped to understand what graph databases are and what they can do for you? Graph databases and graph processing frameworks are all the hype in the NoSQL world at the moment. The ecosystem is constantly evolving, and new datastores or processing frameworks come out what seems like weekly. The truth is that graph databases are a great way to solve certain application problems in areas such as personalization and recommendation, logistics, master data management, social networks, fraud, or IoT, but many people are completely lost when they set foot in the ecosystem.
In this session we will help you make sense of the graph ecosystem with an examination of a variety of graph datastores (e.g. Neo4j, DSE Graph, Titan, OrientDB, etc.) and graph processing frameworks (e.g. Giraph, GraphX, Elasticsearch Graph, GraphQL, Pregel, etc.). We will then discuss how you might use these technologies to augment or replace complex portions of your applications. In the end you will walk away with a better appreciation for the practical aspects of the graph ecosystem, and you might even find out how to remove that complex recursive SQL CTE that gives you nightmares.
Well, good morning, everyone.
Welcome to my talk.
Today we’re here to cover what graphs are. Is anybody here actually using a graph database, or has anybody used one in the past? Some of this will probably be familiar to you, then. Next we’re going to talk a bit about some of the use cases that graph databases are good at. Then we’re going to talk about the ecosystem. The ecosystem for graph databases is changing literally every day; in fact, I believe it was yesterday that Microsoft announced they now have a new graph processing layer on top of SQL Server. So it’s a very fast-growing area. And then, last, why you should care about it: why not just use your relational database? We’ll walk through a couple of examples showing you how graphs can help simplify your life in some ways. So hopefully you’ll walk out of here knowing what a graph database is and isn’t and what you can do with one, with an understanding of what the graph database ecosystem looks like and what you can do to get started.
So first off, we’re going to walk through a little bit of what I like to call Graph 101. First, we’re not talking about bar charts, line charts, or sparklines. We’re talking about interconnected webs of data.
Have any of you heard of the Seven Bridges of Königsberg problem? I guess a lot of people did at some point. The Seven Bridges of Königsberg is a famous problem that was taken on by Leonhard Euler, a Swiss mathematician, in 1736, and it laid down the foundations of what became graph theory, on which graph databases are built. There were two islands in the middle of this river, so there are four pieces of land and seven bridges connecting them, and the problem was: is it possible to walk across every bridge once and only once? This became known as an Eulerian walk, if you can actually do it successfully. What Euler did was take this real-world problem and abstract it into what you see at the bottom here: the abstract concept of nodes and edges, nodes and the things connecting them together. Out of curiosity, does anybody know if you can actually solve this? No? Yeah, you’re right, you can’t.
It doesn’t work if you have more than two nodes that have an odd number of incident edges. Incident edges are just the edges coming into or going out of a node. So first off, what is a graph? Well, a graph is an ordered pair of a set of vertices and a set of edges. The vertices are a finite set of elements, and the edges are a set of two-element subsets of the vertices. So basically nodes, or vertices, are connected by edges. It’s pretty straightforward. The other way you can think about this is that it’s entities and the relationships between them. When you work with these, you tend to think of vertices as the nouns and edges as the verbs in your domain model. Both vertices and edges tend to have labels associated with them. In this case we have two vertices that have the label “person”, and the edge in between them is labeled “son of”. Edges can connect similar things: here we have Jason, who is the son of Alice, so both of the vertices are people. They can also connect different things.
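To make the definition and the Königsberg answer concrete, here is a minimal Python sketch. It is an illustration rather than anything shown in the talk: the land-mass names are placeholders, and the check assumes the graph is connected (which the bridge map is). It stores the graph as a vertex set plus an edge list and applies Euler's rule that a walk crossing every edge exactly once needs zero or two vertices with an odd number of incident edges.

```python
from collections import Counter

# Vertices: the four pieces of land (names are placeholders for illustration).
vertices = {"north_bank", "south_bank", "island_a", "island_b"}

# Edges: the seven bridges, each joining two pieces of land.
# A list rather than a set, because some pairs are joined by two bridges.
edges = [
    ("north_bank", "island_a"),
    ("north_bank", "island_a"),
    ("south_bank", "island_a"),
    ("south_bank", "island_a"),
    ("north_bank", "island_b"),
    ("south_bank", "island_b"),
    ("island_a", "island_b"),
]


def degrees(edge_list):
    """Count the edges incident to each vertex."""
    counts = Counter()
    for u, v in edge_list:
        counts[u] += 1
        counts[v] += 1
    return counts


def eulerian_walk_possible(edge_list):
    """Euler's condition for a connected graph: 0 or 2 odd-degree vertices."""
    odd = [v for v, d in degrees(edge_list).items() if d % 2 == 1]
    return len(odd) in (0, 2)


print(dict(degrees(edges)))           # one vertex has 5 bridges, the rest have 3
print(eulerian_walk_possible(edges))  # -> False
```

All four pieces of land have an odd number of bridges, so the walk is impossible.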
In this case we have “person lives in a city”, so here Jason lives in Boston. They can connect different types of things: here again we have a Spider-Man lunchbox, which is part of the Marvel Universe. A vertex can be connected to multiple different things at the same time: here we have a Spider-Man lunchbox that is part of a franchise and is recommended for the age group 3 to 8. Edges also tend to have directions. When we’re talking about edges with directions, we’re talking about directed graphs, which are a subset of graphs, but it’s the type that most graph databases are built upon, so it’s what we’re going to talk about in this talk. As you can see here, we have our previous graph, with the Spider-Man lunchbox that was part of the Marvel franchise and recommended for ages three to eight, but now it also has an inward-facing edge from the back-to-school promotion for 2015. And in what we’re going to talk about today, both vertices and edges may have properties associated with them.
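The pieces described so far (labels, direction, and properties on both vertices and edges) can be sketched with two small Python dataclasses. This is a hypothetical in-memory model, not the storage format of any particular graph database; the identifiers, the `promotes` edge label, and the `discount_pct` property are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: str  # id of the vertex the edge leaves
    in_v: str   # id of the vertex the edge points at
    properties: dict = field(default_factory=dict)


vertices = {v.id: v for v in [
    Vertex("lunchbox", "product", {"name": "Spider-Man lunchbox"}),
    Vertex("marvel", "franchise", {"name": "Marvel Universe"}),
    Vertex("ages_3_8", "age_range", {"min": 3, "max": 8}),
    Vertex("bts_2015", "promotion", {"name": "Back to School", "year": 2015}),
]}

edges = [
    Edge("part_of", "lunchbox", "marvel"),
    Edge("recommended_for", "lunchbox", "ages_3_8"),
    # The edge itself carries metadata (an invented discount value).
    Edge("promotes", "bts_2015", "lunchbox", {"discount_pct": 10}),
]


def out_edges(vertex_id):
    """Edges leaving the given vertex."""
    return [e for e in edges if e.out_v == vertex_id]


def in_edges(vertex_id):
    """Edges pointing at the given vertex (the 'inward-facing' ones)."""
    return [e for e in edges if e.in_v == vertex_id]


print([e.label for e in out_edges("lunchbox")])
print([(e.label, e.properties) for e in in_edges("lunchbox")])
```

The point of the sketch is the last field on `Edge`: the relationship holds its own properties, with no extra table in between.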
This is actually one of the biggest differences between a graph database and something like a relational database. In a relational database, the relations between your entities can’t really contain any metadata of their own, because those relations are basically foreign keys; that’s how one thing is related to another. You can’t put metadata on a foreign key. You end up having to build a table in between, something I think of as a bridge table, to store any of that metadata. In a graph database, relationships are first-class entities, and you can put that metadata on the relationship itself. It’s one of the more powerful things I find when working with graph databases or migrating data from relational databases to graph databases.
So what does a typical graph query look like? We’re going to look at a simple recommendation use case. We want to find all the products that are recommended for the same age as the Spider-Man lunchbox. So what would you do? You’d start here at the Spider-Man lunchbox, walk out the “recommended” edge to the age range of three to eight, and then look for any other inward-facing edges that are also recommended for that same age group. In this case there’s only one, which is the Iron Man toothbrush. When you’re working with graphs, what we just did is referred to as a traversal of your graph. That was a very simple one, but they get much more complex than that. One of the things that graphs do very well is go through an arbitrary number of hops in order to get to answers.
So in this case we want to find all the products that are for the same recommended age or part of the same or a similar franchise as that Spider-Man lunchbox. Somebody has added to the graph that the DC Comics universe is a similar franchise to the Marvel Universe. So what would we do here? First off, you would traverse the graph the same way as before, going out the “recommended” edge to the age range and then over to the Iron Man toothbrush, to get that as one of the responses back. But in parallel you would also traverse out the “part of” edge to the Marvel franchise, then traverse out the “similar” edge to the DC Comics franchise, and then traverse back along the “part of” edge to get the Superman pillowcase. So in this case you would return the Superman pillowcase and the Iron Man toothbrush. As you can see as we’re walking through this, each one took a different, arbitrary path to get there.
And now someone has gone and added sales data to the graph, so we want to build a more powerful recommendation engine based on items that people bought. Well, as you can see, if you wanted to solve this...
I’m not going to walk through this one because it’s far more complex, but you are able to walk through a very large, arbitrary number of hops to get to that same sort of information. If you wanted to do something like this in a relational database, it would probably require a lot of recursive common table expressions, unions, and joins, and it would be a query I would not want to write, maintain, or try to squeeze any performance increases out of, because it would be quite a headache to get it to work within a reasonable period of time.
So before we go on, are there any questions about what we’ve talked about so far? Yes? So the question here was: in this query, would you get the Iron Man toothbrush twice? It depends on how you actually write the query. By default you would probably get it twice, but a lot of the query languages allow you to remove duplicates.
Okay, so next we’re going to talk a bit about some of the use cases for graph databases. What are common graph-type problems? Common graph-type problems are things like dependencies: failure chains, order of operations, needs. Something like this is used a lot of the time for things like root cause analysis.
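Looking back at the lunchbox recommendation traversal and the question about duplicates, the walk can be sketched in plain Python over a list of edges. This is only an illustration of the idea, not the query language of any graph database. The vertex ids and edge labels are invented, and the sketch assumes the Iron Man toothbrush is also part of the Marvel franchise, which is what makes it reachable along two different paths.

```python
# Each edge is (source, label, destination). Ids and labels are illustrative.
EDGES = [
    ("spiderman_lunchbox", "recommended_for", "ages_3_8"),
    ("ironman_toothbrush", "recommended_for", "ages_3_8"),
    ("spiderman_lunchbox", "part_of", "marvel"),
    ("ironman_toothbrush", "part_of", "marvel"),  # assumption, see lead-in
    ("marvel", "similar_to", "dc_comics"),
    ("superman_pillowcase", "part_of", "dc_comics"),
]


def out(vertex, label):
    """Follow outgoing edges with this label."""
    return [dst for src, lbl, dst in EDGES if src == vertex and lbl == label]


def in_(vertex, label):
    """Follow incoming edges with this label, backwards."""
    return [src for src, lbl, dst in EDGES if dst == vertex and lbl == label]


def recommend(product):
    results = []
    # Path 1: product -> age range -> other products for that age range.
    for age in out(product, "recommended_for"):
        results += in_(age, "recommended_for")
    # Path 2: product -> franchise -> same or similar franchise -> products.
    for franchise in out(product, "part_of"):
        for f in [franchise] + out(franchise, "similar_to"):
            results += in_(f, "part_of")
    return [p for p in results if p != product]


raw = recommend("spiderman_lunchbox")
print(raw)                       # the toothbrush shows up once per path
print(list(dict.fromkeys(raw)))  # duplicates removed, order preserved
```

The first print shows the toothbrush twice because two different paths reach it; the second is the equivalent of asking the query language to de-duplicate.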
A great example is finding dependency graphs in order to build my infrastructure: I have to build the VPC before I can build a subnet that’s part of that VPC. That’s a graph-type problem.
Clustering: finding things that are closely related to each other. Friends of friends, where LinkedIn tells you “these are people you might know”, or Facebook, same sort of idea; that would be a clustering sort of problem. Or fraud. Fraud is actually another very common use case for graphs. It uses a variety of different algorithms, but clustering is one of them: is this group of transactions similar to, or clustered together with, another group of known fraudulent transactions?
Similarity: you want to find things that have similar paths or patterns associated with them. Let’s say you work on recruiting software and you want to say, from the position I’m at, I want to get to CEO: what is the most common path, or a very similar path, that people who started where I have took to become CEO of a company?
Matching and categorization. Flow-cost type problems: things like Google Maps are a flow-cost type problem, where I want to find the shortest path from A to B. Centrality and search problems: you want to find the most influential person in a social network.
Another example of a dependency-type problem is a pipeline. What is the root cause of a failure? How do I route flow from X to Y when I take Y offline? Example vertices in a pipeline would be something like a storage tank, a refinery, or wells. Edges would be pipelines, control lines, things of that nature.
Industrial assets would be a clustering sort of problem. If part A fails, what other parts tend to fail with it? Maybe I run a trucking fleet, and if truck A fails, what other trucks tend to fail with it? Then you can go back and do root cause analysis to find out why they may have failed together; maybe they were all in the same place that had a sandstorm at the same time, or something like that. Examples of vertices here are things like parts, assemblies, or pieces of equipment, and edges are things like “consists of”, “connected to”, and “is compatible with”.
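The dependency flavor of problem mentioned a moment ago (build the VPC before the subnet, and work out what a failure takes down with it) can be sketched with Python's standard-library `graphlib`. The resource names beyond VPC and subnet are invented for illustration, and the `affected_by` helper is a hypothetical name, not part of any tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the resources in its set (names are illustrative).
depends_on = {
    "vpc": set(),
    "subnet": {"vpc"},
    "security_group": {"vpc"},
    "instance": {"subnet", "security_group"},
}

# Order of operations: every resource appears after everything it depends on.
build_order = list(TopologicalSorter(depends_on).static_order())
print(build_order)


def affected_by(failed, graph):
    """Everything that directly or transitively depends on `failed`."""
    hit, frontier = set(), [failed]
    while frontier:
        current = frontier.pop()
        for node, deps in graph.items():
            if current in deps and node not in hit:
                hit.add(node)
                frontier.append(node)
    return hit


print(affected_by("vpc", depends_on))
```

The same dependency edges answer both questions: read forwards they give a build order, read backwards they give the failure chain.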
An example of similarity is social networks and fraud. Social networks are probably one of the most common use cases that people think of with graphs; LinkedIn and Facebook made using graphs for a social network a very common thing. Which of my friends is the most influential? Which of my users’ activities is similar to a known fraudulent pattern? Here you would have vertices that are things like people, businesses, or transactions. Edges are things like phone calls, emails, memberships, and purchases. Categorization.
A recommendation engine is a very good example of categorization. Maybe you have different categories of people that you want to do recommendations against. Based on my history of purchasing products, what am I likely to purchase? What type of customer am I when I use a system? Here vertices would be things like users, orders, or web pages if you’re doing something like clickstream tracking. Edges would be things like “purchased” or “clicked”. We’ve worked with customers in this space who have clickstreams coming in from applications and want to categorize you into different groups of users, to show you something like targeted advertising, things of that nature.
Flow-cost problems: transportation is the first one I tend to think of here. What is the shortest path between X and Y? I don’t know about you, but I’ve used Google Maps since I’ve been here in London, and that’s an example of how you actually get from X to Y. It has to do an unknown number of hops to get from one place to another. They’ll connect somehow, but you don’t know how many connections it will take to get there. While it’s something you can do in a relational database, it’s actually very difficult to do an unknown number of joins; you end up with recursive functions, and they tend to be not very performant. There are also things like: I’m going to take station X offline for maintenance, so what’s the effect going to be on my transportation network? How am I going to route traffic around it? What are the different options? Here you have vertices that are things like stations or cities, and edges that are things like railways and roadways. Intersections, I’m sorry, would be a vertex in this case.
Centrality and search: the internet. The internet is nothing but a giant graph of data. For those of you who are old enough to remember AltaVista, AltaVista was based on just searching the text inside your web documents. This is where Google was really differentiated from AltaVista when they first came out: they not only searched the text inside the documents, but they also searched and indexed how those documents were linked to other documents. They basically built a graph of the web. But centrality and search are also useful for things like: what are the critical parts of my network?
I’m storing network management data, or data about how my network is configured, and I want to find the most critical piece of my infrastructure: if this router goes down, half of my network is going to go down, because it’s the single point of failure inside my network. That’s a sort of graph problem to solve. Here your vertices would be things like routers or computers, and your edges would be things like fiber, Ethernet connections, or microwave connections. If you think about the Google use case, it would be things like links between web pages.
When it comes down to it, basically any interaction you have out there is a graph. Here are other sorts of problems that you can solve with graphs. Which professor publishes the most influential papers? Maybe you have a list of all papers published in the last five years, as well as all their citations, and you want to find which of those papers has been cited the most, either directly by another paper or by papers that cite papers that have cited it. One we run across quite often is something like: I have a user named B. Smith and I have a user named Brian Smith. Maybe I work for a large company and we have multiple different ways in which a customer can interact with us. I want to figure out if those two people are the same person in order to provide them a better, more unified experience across multiple platforms, maybe a web and a mobile platform. Health care and life sciences: how does a drug interact with other drugs? And then the one I mentioned earlier: what is the most common career path to get myself to being CEO of a company?
So what sort of industries use graphs? This is by no means a definitive list of either industries or how graphs are used inside those industries, but it’s an example of some of the ways in which people use them, some of the ways we’ve worked with our customers in these areas.
Software companies deal with network and data management, and that’s really nothing but a giant graph of how all your networks are connected together. Social networks are an obvious one: Facebook, LinkedIn, things like that. One that actually kind of surprised me when we first came across it is identity and access management. Something like what Active Directory does in a Windows world really comes down to being a graph problem, because you are in groups A, B, and C. In group A you’re an admin, in group B you only have read-only access, and in group C maybe you have read-write access. Group C has access to a folder that group A has access to and group B has access to, and to the files inside those. What actual access you have to a file is really a gigantic graph problem: finding all the different permutations by which you actually have access to that end piece of data.
Financial services: fraud prevention is probably the most common one you hear about in financial services, but they also do social marketing, impact analysis, and sentiment analysis.
There are financial services companies out there that are reading your Twitter feed and your Facebook posts and finding out what you think about different companies, because what you think about different companies sometimes affects whether their stock goes up or down that day. So there are companies out there doing stuff like that. Telecommunications: network management, much like the software companies, plus master data management and geospatial search. There are telecommunications companies out there that are obviously tracking where you’re using your cell phone, and they’re trying to cluster groups of people together, to see that maybe you and all your friends went to dinner at the same time at the same place for roughly the same amount of time. They’re trying to find this information out about you so they can better market to you. Web, social, and recruiting: social graphs, knowledge graphs, and sentiment analysis yet again. Probably one of the more up-and-coming places using graphs is healthcare and life sciences. They’re looking for drug interactions, looking at different things around gene sequencing, and now they’re looking at impact analysis on treatment and care. Are there any questions on the use cases before we move on?
Okay, the next thing we’re going to talk a bit about is the graph ecosystem that’s out there. So first off, what is a graph database? I know there are a couple of you who have used graph databases. Has anyone else used any other NoSQL sort of databases, or are most people coming from the relational world? Okay. Well, graph databases are a type of NoSQL datastore. They store data based on graph concepts: vertices, edges, and properties. There are several different types out there that we’ll talk about in a moment, specifically RDF triple stores, property graph models, and what I tend to call processing frameworks. What graph databases do is help you efficiently and effectively navigate connected data, data that’s highly connected with one another.
So, a little bit on database types. What you’ll see on the screen here is the five basic types of data stores that are out there and how each handles increasing data complexity. The simplest ones out there are key-value stores. Key-value stores are something like DynamoDB from AWS, or Redis, and they store very simple data: a key-value pair. If your data has a little bit more complexity to it and stores more than just one value per key, you get something like a column family store. It allows you to store single rows of data, but those rows don’t have explicit relationships to other rows, I should say. This is something like Apache Cassandra or Apache HBase. Then you get data that has a little more complexity to it.
When I’m talking about complexity here, what I’m really referring to is how the data is related to one another and how those relationships manifest themselves in the datastore. So you get something like a document data store. In a document data store, the documents themselves are pretty atomic units of information, but inside them they can have highly nested relations of data. MongoDB and CouchDB are pretty common examples of document data stores; I guess a lot of people here have probably used one or the other of those. And then there’s the one that we’re all familiar with, relational databases. Relational databases are good at storing relational data; that’s what they’re built on. They have tables that have foreign keys to other tables, and you can build out a hierarchy of data there. These are your Oracles, your SQL Servers, your Postgres, your MySQL, things of that nature. And then for even more complex data that doesn’t fit in your relational database, you have graph databases. Two of the most common ones out there are Neo4j and DataStax Enterprise Graph; those are the two that we work with the most. There are many more out there, as you’ll see in a moment. But as I mentioned earlier, the real big difference between a relational database and a graph database is that in a graph database the relationships are also first-class entities in your system, and those relationships can store properties against them.
One other thing to note is that graph databases are the fastest-growing type of data store out there at the moment. Since this started in January 2013, until this January, they’ve had almost 600 percent growth. This comes from DB-Engines (db-engines.com). I don’t know if any of you know it, but if you don’t, it’s actually a very good site that will give you a base comparison of the different types of data stores out there. This is part of the reason why I wanted to come and do this talk: this is a very fast-growing area, and it’s something that you will probably run into, or at least might want to think about using in some of the projects that are coming up for you.
So the first differentiation I’m going to make in the types of graph databases is the difference between a database and a framework. A graph database is basically built...
One of the differences between a database and a framework is that databases will run real-time queries, and some of them can handle both transactional and analytical workloads, and they persist the data themselves. They tend to have NoSQL features like scaling and high availability. Frameworks, on the other hand, are built to work on humongous data loads, and by humongous data loads I’m talking about data loads that don’t even necessarily fit in the memory of the servers they’re running on. They’re built for OLAP workloads, and they use another method underneath them to actually persist the data; it’s often something like a Hadoop cluster. The way I think of this is that a framework is essentially a library that sits on top of something else to run your graph data on. You have to load your data into that graph framework, run your graph queries, and then persist your data out to something else. A database is just like a SQL database: it handles not only the querying of the data but also persisting the data.
Some of the common graph processing frameworks out there, if you get into this space, are something like Apache Giraph (I’m not quite sure how that’s supposed to be pronounced, but that’s what it is), which is built on top of Hadoop; that’s what it uses as its persistence layer. You have something like GraphX, which is actually part of the Apache Spark project. Then you have Oracle’s, which is obviously built on top of Oracle. You have things like sync span, GraphBase, and InfoGrid, and this little guy here is Apache Hama. One interesting one to note here is the Apache TinkerPop project. Apache TinkerPop is made up of a couple of different parts. In and of itself it’s a graph processing framework, but that framework also specifies a query language called Gremlin, which we’ll talk about in a little bit. As well as the query language, it specifies an interface and a graph engine that are used by some of the databases we’ll talk about in a moment. So it has its hands in lots of different pieces of the graph database world, probably more than anything else, but you can also just use it as a graph processing framework.
To be honest, these differentiations are semi-arbitrary on my part. There’s some logic behind them, but other people might group these slightly differently. The next one I’m going to talk about is the two types of graph databases that are out there. One is the RDF triple store. These work with subject-predicate-object triples; the entities that you store are these triples. This comes from a background in the Semantic Web and has well-defined standards associated with it, and RDF databases are very efficient at finding relationships inside your data. The other type is the property model graph database. This works with nodes, edges, and properties, very similar to what I was showing in the first section. It has separate entities for nodes and edges, and both nodes and edges can have properties on them. What property model graph databases are good at is efficiently traversing relationships in your data.
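As a rough way to see the two data models side by side, here is a small Python sketch that stores the same facts both ways, reusing Jason, Alice, and Boston from earlier. It is a toy illustration, not how any RDF store or property graph database lays out its data; the `match` and `neighbors` helpers and the `since` value are hypothetical.

```python
# RDF style: every fact is one (subject, predicate, object) triple.
triples = {
    ("Jason", "type", "Person"),
    ("Alice", "type", "Person"),
    ("Jason", "son_of", "Alice"),
    ("Jason", "lives_in", "Boston"),
}


def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Property graph style: separate node and edge records, both with properties.
nodes = {
    "jason": {"label": "person", "name": "Jason"},
    "alice": {"label": "person", "name": "Alice"},
    "boston": {"label": "city", "name": "Boston"},
}
edges = [
    {"label": "son_of", "out": "jason", "in": "alice", "props": {}},
    # "since" is an invented value; it lives on the relationship itself.
    {"label": "lives_in", "out": "jason", "in": "boston", "props": {"since": 2010}},
]


def neighbors(node_id, label):
    """Traverse outgoing edges with the given label."""
    return [e["in"] for e in edges if e["out"] == node_id and e["label"] == label]


print(match(s="Jason"))                # pattern matching over triples
print(neighbors("jason", "lives_in"))  # traversal over edges
```

In the plain-tuple form there is no obvious slot for `since` on the "lives in" relationship, whereas the edge record carries it directly; that difference is the "properties on edges" part of the property model.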
I’ll show you an example of the difference between these two in a moment, but the biggest thing to take away from this slide is that RDF databases are very good at finding relationships in your data, and property model graph databases are very good at traversing relationships in your data.
An example of the sort of data you would store in an RDF database is rules like “people with a common parent are siblings” and “a father is a parent”. The way RDF databases work is through an inference engine, so you put these rules in as you see here. The next thing you’d put in is a couple of facts: that Mike is the father of John and that Mike is the father of Steve. What would happen then is that the inference engine in the RDF database would infer new facts about your data based on the rules you’ve already entered. In this case it would be able to infer that John and Steve are siblings, because their father is Mike and people with a common parent are siblings.
Property model graph databases, on the other hand, would look more like what you have over here. You would have to put in the same information. You’d put in nodes saying that there’s a person named Mike and a person named John, and that there’s a relationship between them of “father of”. You would then put in that there’s a person named Steve, and the relationship would again be “father of”. Here’s where property model graphs differ a little bit from RDF databases, because now you have an option. When you go to run your traversal, you can either run a traversal that says, for John, I want to find his father, and then find who else he’s the father of, and in my application logic know that those two people are siblings; or, at the time you put the data in, you could put that relationship in. Relationships in a property model graph are basically precomputed joins of your data. So it works either way: you either navigate this relationship inside your application, inside your traversal, or you pre-compute that join ahead of time.
Are there any questions on this? I know this is kind of a confusing concept. Yes? So the question was: in the RDF database you define the rules once, and the inference happens when the data comes in; in the property model graph, do you only define “sibling of” once? Well, you would need to define “sibling of” in each place you want to have that relationship. This is an example of how RDF databases are great at finding information that you don’t have, because they’re able to infer that this is a new relationship, whereas property model graphs are really better at traversing relationships that already exist. So if you want to be efficient and you want to find everyone that’s a sibling of somebody, you need to put this “sibling of” edge in for every person that’s a sibling. Maybe there’s John, Steve, Tom, Bill, and James (Mike might have a lot of kids, what can I say), and you would want to put the “sibling of” edge in for each of them. Does that answer your question?
Okay, so what are some common property model graph databases? Probably the most common one out there right now is Neo4j. It’s definitely the industry leader in this space. It’s what the two people in this room have used, and it’s probably the easiest to get started with. Then you move on to something like DataStax Enterprise Graph.
This is a very built to basically scale to larger data sizes it’s built on top of basically date stack Enterprise if any of you’re familiar with that well then you have one called orient DB orient DB is actually you’ll see it again later because it’s not only a property monograph database but a multi model graph database which I’ll talk about here in a moment but these three that you see across the top here are commercial commercial licensed products they each have like a community version essentially that you could you can try out for free the bottom two here are first one you’ll see is called Janus graph which is actually new one on the market it’s actually a fork of an old one of an old of a database that’s been around for a while called titan’ this is actually very new in as I think was last Thursday this was announced as part of the Linux Foundation so it’s an open source project run by the Linux Foundation actually the company I work for Spiro is one of the key members in there along with people like Google Hortonworks and IBM so that’s our a little shout-out to that one while I was here and then there’s another very interesting one as I believe then we see there an alpha or beta called DeGraff that’s built to do a highly distributed model of graph databases of property model graph databases the next this slide it basically shows you some of the common RDF data stores that are out there in this slide you have ones like star dog Allegra Graf blaze graph and on to text of these the only one that has any sort of open source option with it is blaze graph all of the rest of them are commercial license then you kind of get into the middle ground of these ones we call called multi model data stores you have ones so basically what these are or these are essentially I think of them is basically a framework graph crossing between framework on top of another no sequel data store note another type of data store but they both come together in one package so for example 
something like OrientDB: you can store not only unstructured data in the form of documents in it, but you can store graph data too. Under the covers it’s a document datastore with a graph layer on top of it, but they expose both of them to you, which can be very useful depending on what you’re trying to accomplish. If you have a use case where you want to use a multi-model sort of approach, maybe you have a bunch of unstructured data you want to store but you want to store the relationships between those data as a graph, something like OrientDB would be great for that. If any of you are familiar with Elasticsearch, it’s the same sort of thing: they recently released a graph framework on top of their product, but bundled as part of their product. So like I said, it’s a middle ground between the two. And then you have SAP HANA, ArangoDB and Virtuoso. For everything I’m showing in all these slides, there are more of these out there than I have on the slides; I just tried to show a representative sample of some of the more common ones. So before we go on, are there any ecosystem questions out there? Okay.
So your data is in a graph; what do you do with it now? Well, the first thing you’re probably going to want to do is query your data out. Each of the datastores we showed tends to have its own proprietary query language. A lot of times it’s some SQL-like query language; sometimes it’s almost exactly the same as SQL. What I’m showing here is only the standard or open source languages that are implemented by multiple different products. The easiest one is down here with RDF databases: there’s a W3C standard called SPARQL. Pretty much any of the RDF databases out there will implement at least some version of SPARQL. Some are more compatible than others, just like every
other standard. Up here at the top is where it definitely gets a little bit more complex.
The first one you see here is what’s called the openCypher project. Cypher is the native graph language that was developed by Neo4j. I think it was a little over a year ago that they opened the specification for Cypher in a project called openCypher, for other database vendors to implement a version of Cypher as their own language. Cypher is a very declarative language that’s reminiscent of SQL; I’ll show you some examples here in a second. I believe it currently has somewhere around ten implementations, of which I believe four go against databases. Don’t quote me on those numbers, because yet again the numbers are changing all the time, but it gives you a rough idea of how well adopted it is. The next one you’ll see here, with the cute little green guy, is Gremlin. This is the query language that’s specified as part of the Apache TinkerPop project I talked about before, and it takes a more traversal-based approach to your syntax. It currently has seventeen-plus implementations, somewhere along there. The last one you’ll see here is GraphQL. GraphQL was developed by Facebook, and it’s different from these other two inasmuch as it’s really a JSON-like query language for your API. You’ll understand a little more what I mean in a second when I show you an example, but it’s a bit of a different beast than these other two, which are more comparable to what SQL would be for relational databases.
So for these property model queries, this is the basic property model we’re going to work against. It’s pretty basic, and we’re going to use a small part of it. What we want is to find all the people who acted in a film that was directed by J.J. Abrams. So we’re going to find the person whose name is J.J. Abrams, then we’re going to follow this directed-in edge to the
movie, and then we’re going to follow this acted-in edge back to all the people that acted in movies that he directed. The first way we’re going to look at it is the openCypher query. As you can see, these sorts of queries are pretty easy to read. Things to note about Cypher: the round brackets are the different types of nodes you’re looking for, and the square brackets are the edges. You’ll see the square brackets have dashes and greater-than or less-than signs around them to indicate the direction of the edge that you’re traversing.
The next one is Gremlin. As you can see, Gremlin is a bit different from openCypher, inasmuch as, the way I think about it, I’m telling it exactly how I want it to walk through my graph in order to get the information I’m looking for. So the way this works: g.V(), for those of you not familiar, is the way you say “give me all the vertices in my graph.” I want to find a person node that has a name of J.J. Abrams, and then follow the outward-directed edge called directed-in, so that would take us to the movie vertex, and then I want to go back along an inward-directed edge called acted-in to get back to the people. This .next() is a very specific thing to Gremlin: it’s lazily evaluated, and until you put something on there to force it to be evaluated it won’t actually print anything out. That .next() will force it to evaluate and print out your results.
So as you can see, these are very similar yet different to one another. But when you look at GraphQL, as you can see, it’s very familiar if you’re used to doing something like REST calls, but it doesn’t really map directly or as easily to Gremlin or openCypher. What it’s doing
here is saying: I want to find a person who has a name of J.J. Abrams, then I want to go out the directed edge (that actually should say directed-in, that was my bad), find all the things he directed and then all the people that acted in those, and return the name. As I said, this is a standard produced by Facebook. I always think of this as very useful if you’re looking to build REST endpoints, because who wants to send this sort of query across a REST endpoint? Were there any questions on this before I go on?
So the next thing is: how do you visualize your data? For visualizing your data, there are a lot of off-the-shelf products. Something like Linkurious Enterprise is a very powerful tool for doing visualization of your data.
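Going back to the J.J. Abrams query walked through above, the three traversal steps (find the person, follow the outgoing directed edge, come back along the incoming acted-in edges) can be sketched in plain Python over an in-memory edge list. This is only an illustration of the traversal idea, not the API of any graph database. The vertex ids, the labels `person` and `movie`, the edge names `directed` and `acted_in`, and the sample data are all assumptions made for the example.

```python
# Hypothetical sample data; ids, labels, and edge names are made up for illustration.
VERTICES = {
    1: {"label": "person", "name": "J.J. Abrams"},
    2: {"label": "movie", "title": "Star Trek"},
    3: {"label": "person", "name": "Chris Pine"},
    4: {"label": "person", "name": "Zoe Saldana"},
    5: {"label": "movie", "title": "Unrelated Film"},
    6: {"label": "person", "name": "Unrelated Actor"},
}
# Each edge is (from_vertex, edge_label, to_vertex).
EDGES = [
    (1, "directed", 2),
    (3, "acted_in", 2),
    (4, "acted_in", 2),
    (6, "acted_in", 5),
]


def has(vertex_ids, label, key, value):
    """Keep only vertices with the given label and property value."""
    return [v for v in vertex_ids
            if VERTICES[v]["label"] == label and VERTICES[v].get(key) == value]


def out(vertex_ids, edge_label):
    """Follow outgoing edges with the given label."""
    return [dst for v in vertex_ids
            for src, lbl, dst in EDGES if src == v and lbl == edge_label]


def in_(vertex_ids, edge_label):
    """Follow incoming edges with the given label, back to their source vertices."""
    return [src for v in vertex_ids
            for src, lbl, dst in EDGES if dst == v and lbl == edge_label]


def actors_directed_by(director_name):
    # Loosely mirrors a Gremlin-style walk: start at all vertices, filter, step out, step in.
    start = has(VERTICES.keys(), "person", "name", director_name)
    movies = out(start, "directed")
    actors = in_(movies, "acted_in")
    return sorted({VERTICES[a]["name"] for a in actors})


print(actors_directed_by("J.J. Abrams"))  # ['Chris Pine', 'Zoe Saldana']
```

A declarative language like Cypher describes the same pattern as a shape to match. The traversal style spells out each hop in order, which is what the sketch imitates.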
KeyLines is probably one of the leaders in web toolkits to work against your data. If you want to do a lot of customization, something like D3 is very useful. Where this starts to get very interesting is if you have a geospatial problem. A lot of times we find that the problems we’re working with end up having a geospatial aspect to them as well: not only do I want to know how things were connected, I want to know where they were at the time a certain event happened, especially if you’re thinking of the IoT space. So these are just some of the tools out there, but what can you do with them? These are some screens of things that we have worked on, and my UX colleagues would harass me if I didn’t say that just because the underlying data is a graph doesn’t mean you need to show it as node charts. In the case here, this was an IoT sort of demo where you were following trucks and following sensors, so it makes sense to almost show that as a node chart of the flow of where it’s going. But over here, this is a healthcare app. The data underneath it may be stored in a graph, but that’s not how the end users are thinking about it, and the end users shouldn’t know or necessarily care that the data is being stored as a graph. So applying good UX principles to these sorts of things is really important in order to get the best information out of it. Good examples of this are something like LinkedIn or Facebook: they don’t show you the graph of how they got from your friends to the friend you may know; they just show you a list of people that you may know, but it’s being powered under the covers by a graph.
Amazon recommendations are the same sort of way: “people who bought this also bought that.” That’s being powered by a graph at some level, but they’re not telling you that it’s powered by a graph; they’re just showing you the output in a way that’s easily consumed by the user.
So the last section here is: why should you care about it? Well, we’re going to look at a basic product recommendation engine here. I want to find all the products that I have purchased, then find me all the products I haven’t purchased but that were purchased by customers who purchased other things I bought. That’s a pretty simple, pretty common need for any sort of e-commerce site, to power your product recommendations. Well, here’s roughly the SQL it takes to do that. The actual specifics don’t matter here; I’ll share the slides with you later if anybody wants. But here’s the corresponding Cypher and Gremlin to answer that same query. If you look at the Cypher, it takes about these three lines to answer what in the SQL takes pretty much all the way down to this GROUP BY to get to the same sort of answer, and the Gremlin takes a little bit more than that. But in general you’re talking three or four lines versus probably fifteen to get your answer. Performance is obviously going to be very different depending on what you’re doing, but I would expect that the SQL is not going to be super performant, whereas the Cypher and the Gremlin will be, because that’s what they’re optimized for.
So the next use case is to build an org chart. I want to find all the employees, who their supervisor is, and where they live inside the organization: basically build out a simple org chart. Well, to do that in SQL you end up having to build this recursive function to pull the information out. It takes all the way down to here, probably twelve lines of code, to get to just the rough data before you
actually can even pull it out. Well, here is the Cypher to do it. To do all of those twelve lines, pretty much this first line will return you all the data; after that it’s just formatting the data the way you want to pull it back out. In Gremlin it takes about a line and a half to do the same sort of thing. This shows you how something like this can be more performant, and certainly easier to maintain, for these sorts of queries.
So how do you know if you need a graph? This is another question we get all the time. Well, if you can answer yes to any of these sorts of questions, then you might have a problem where it would be worthwhile to take a look at using a graph database.
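Before the checklist, here is a rough Python sketch of the two queries just described, written as graph walks over in-memory data. It is not the SQL, Cypher, or Gremlin from the slides. The purchase data, the reporting data, and the function names are all hypothetical. The recommendation is a two-hop walk (my products, to the customers who also bought them, to their other products), and the org chart is a recursive walk down the reports-to edges.

```python
from collections import Counter

# Hypothetical purchase edges: (customer, product).
PURCHASED = [
    ("me", "tent"), ("me", "stove"),
    ("ann", "tent"), ("ann", "lantern"), ("ann", "stove"),
    ("bob", "stove"), ("bob", "lantern"), ("bob", "cooler"),
    ("cat", "kayak"),
]


def recommend(customer, purchased):
    """Products the customer has not bought, ranked by how many peer purchases include them."""
    mine = {p for c, p in purchased if c == customer}
    # Peers are other customers who bought at least one of the same products.
    peers = {c for c, p in purchased if p in mine and c != customer}
    counts = Counter(p for c, p in purchased if c in peers and p not in mine)
    return counts.most_common()


# Hypothetical reporting edges: employee -> supervisor (None marks the top of the chart).
REPORTS_TO = {"dana": None, "eli": "dana", "fay": "dana", "gus": "eli"}


def org_chart(reports_to):
    """Return (employee, supervisor, depth) rows by walking down from the top."""
    rows = []

    def walk(boss, depth):
        for emp in sorted(e for e, s in reports_to.items() if s == boss):
            rows.append((emp, boss, depth))
            walk(emp, depth + 1)

    walk(None, 0)
    return rows


print(recommend("me", PURCHASED))  # [('lantern', 2), ('cooler', 1)]
print(org_chart(REPORTS_TO))
```

In both cases the logic is just "follow this edge, then that one," which is why the graph query languages can say it in a few lines where SQL needs joins or a recursive common table expression.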
Do you have queries with multiple joins and unions of data? Do you have recursive common table expressions? Is the performance of your joins or common table expressions poor, and can’t it be improved by standard, traditional relational database techniques? Is the structure of your data continuously evolving? One of the things I haven’t touched on much here is the fact that graph databases tend to be very good at allowing for evolving, flexible schemas. If you want to add a new relationship type to your graph, you just add the relationship type to your graph. You don’t need to go in and change tables, add default values, all those things you end up having to do in a relational database to make it work. Is the domain that you’re working in a natural fit? Are you storing IT dependencies, are you doing network management, are you doing relationships between people, or any of those sorts of things we talked about in the use cases? And the one that probably sticks with me more than anything else: are you dealing with the connections between things more often than you’re dealing with the things themselves? Is what you really care about how the things are connected together, more than what is connected together? That’s a really strong indicator to me that you might want to look at using a graph database.
So if you’re interested in getting started, if you go to this website, there’s a “Getting Started with Graph” link, and it provides a long list of different resources that you can use to get started with graph, including links to pretty much all of the datastores we talked about up here, at least the versions that you can download for free, their tutorials, their getting-started pages, things like that. And if anyone’s interested and wants the slide deck, come see me after this and I can send it to you, so you can get some of these addresses or whatever. So I guess, are there any
questions?
So the question was: do these graph databases allow you to partition data across multiple nodes in a cluster, and allow you to do queries on the partitioned data? The answer to that question is: it depends on which graph datastore you’re looking at. I’m not as familiar with a lot of the RDF ones.
If you’re familiar with Apache Cassandra at all, it’s built on a clustering technology by default, and DataStax Enterprise Graph is built on top of that, so by default those are absolutely partitioned. If you’re looking at something like Neo4j, if you go to their Enterprise Edition, in their most recent release they added causal clustering to their graph database. If you look at something like OrientDB, they have the ability to partition that data out as well, using their own internal mechanisms. So yes, each of them has it, and each of them does it in a different way. Does that answer your question? Any other questions?
So the question was: are people using these as systems of record, as a primary datastore, or are they using them as a secondary datastore? The answer to that is both; it depends on what the specific use case is. I’ve worked with customers who have migrated, or are in the process of migrating, all their data to be stored inside a graph as the system of record. I’ve also worked with customers where this is used as a secondary system. A lot of times, especially if it’s a small part of their application or system where they have a very graph-based problem that they want to solve, they may keep their large Oracle installation or SQL Server installation as the system of record and then transfer the data into this, either in batches or continuously, to handle those sorts of dependency queries, or whatever it is they’re trying to pull out that is better solved in a graph.
Are there any that combine RDF with a property graph database? Actually, there are several of them that will allow you to traverse them as both. I believe Oracle does this, Stardog I believe does that as well, and Blazegraph I believe will also allow you to do either RDF or property graph traversals. I can’t say I’ve done anything with those in that specific use case, so I
don’t have any personal experience with it, unfortunately, but I believe those ones allow you to do it both ways. Okay, that’s it. Thank you guys very much.