WEBVTT 00:00.000 --> 00:05.000 Good morning, everyone. 00:05.000 --> 00:08.000 I will be doing my best to speak loudly. 00:08.000 --> 00:10.000 Is it OK in the back? 00:10.000 --> 00:12.000 OK, thumbs up. 00:12.000 --> 00:13.000 Hi, everyone. 00:13.000 --> 00:14.000 I'm Gabor. 00:14.000 --> 00:21.000 I'm here to talk about graph databases: specifically, where they come from, where they are, and where they are headed. 00:21.000 --> 00:28.000 My background for this is that, for the last 13 years, I did a lot of work on graph databases. 00:28.000 --> 00:32.000 I did my master's, my PhD, and a postdoc in graph databases. 00:32.000 --> 00:37.000 Two years ago, I pivoted to relational databases at DuckDB for my day job. 00:37.000 --> 00:41.000 But I still have a special place in my heart for graph databases. 00:41.000 --> 00:46.000 And I'm happy to come here and share my experiences and the lessons I learned about them. 00:46.000 --> 00:48.000 So what is this talk about? 00:48.000 --> 00:53.000 I will try to bring you up to speed in 50 minutes on what graph databases are. 00:53.000 --> 00:56.000 And then you can decide whether they are something that you want to use. 00:56.000 --> 01:01.000 There is a chance that you don't really have a graph workload, but I encourage you to stick around, 01:01.000 --> 01:07.000 because I will show you some new research results that are applicable for graphs and also outside of graphs. 01:07.000 --> 01:11.000 And then finally, I believe performance is really important for data analytics, 01:11.000 --> 01:16.000 so I will give you a bunch of pointers for benchmarks and how to compare graph database systems. 01:16.000 --> 01:18.000 So let's get started. 01:18.000 --> 01:23.000 For this talk, I will use a running example: a rather complicated graph. 01:23.000 --> 01:25.000 It has four node types: 01:25.000 --> 01:28.000 countries, cities, people, and messages. 01:28.000 --> 01:32.000 And it encapsulates this weird social network in the Low Countries, 01:32.000 --> 01:39.000 where every person and every city has a name with only three or four letters, for the sake of fitting on the slide. 01:39.000 --> 01:45.000 We have four cities, a bunch of people, and the sort of edges between them that you would imagine. 01:45.000 --> 01:51.000 People know each other, people write messages, people are located in a city, and so on. 01:51.000 --> 01:55.000 So what are graph databases in the first place? 01:55.000 --> 02:01.000 The definitions vary a lot, and it's really difficult to agree on what constitutes a graph database. 02:01.000 --> 02:06.000 But if you go to a graph database meetup, or if you open the website of any of the vendors, 02:06.000 --> 02:10.000 there seems to be one thing that almost everyone agrees on. 02:10.000 --> 02:12.000 And that common theme is that joins are bad. 02:12.000 --> 02:16.000 If you go to a graph database meetup, you get handed merch like this. 02:16.000 --> 02:22.000 "Graph construction zone: no joins beyond this point", or they say "no expensive runtime joins", 02:22.000 --> 02:27.000 or they tell you to leave left joins, and all joins, behind. 02:27.000 --> 02:32.000 So what are they on about? It's a bit murky; they don't really give any specific examples. 02:32.000 --> 02:40.000 I argue that sometimes they try to imply that relational databases cannot join, which is obviously an overstatement. 02:40.000 --> 02:43.000 Relational databases can actually join rather well.
02:43.000 --> 02:51.000 But then what's the problem? Well, let's take the middle part of our social network, where we have four cities and seven people, 02:51.000 --> 02:56.000 and we want to ask a question: where do the friends of the person Dan live? 02:56.000 --> 03:00.000 We can take this graph and represent it as two tables. 03:00.000 --> 03:08.000 We have a Knows table with person1 and person2 pairs, and we have a Location table that says where a specific person lives. 03:08.000 --> 03:15.000 Now it's easy to see from the graph that Dan has two friends, Carl and Dee, and they live in Mol and Oss. 03:15.000 --> 03:19.000 And you can also see that these are represented in the tables. 03:19.000 --> 03:23.000 But if you try to write the SQL query naively, you may end up writing something like this. 03:23.000 --> 03:32.000 You select the city from the Knows table joined with the Location table on the corresponding attributes, and you filter for Dan. 03:32.000 --> 03:36.000 But you will get only one city, you will get only Oss. What did we miss? 03:36.000 --> 03:42.000 We missed the fact that the Knows edge is bidirectional, and this is not captured in the data. 03:42.000 --> 03:47.000 So what do we do? Well, one thing we can do is rewrite the query to make up for this. 03:47.000 --> 03:54.000 Another thing we can do is insert those edges into the Knows table flipped; this introduces some redundancy, 03:54.000 --> 03:56.000 but then the query will work. 03:56.000 --> 04:03.000 Or we can define a view that captures the undirected Knows edge by flipping the edges in a UNION ALL. 04:03.000 --> 04:10.000 All of these work, but already at this very simple complexity level, it starts to become a bit clunky. 04:10.000 --> 04:13.000 So this is one thing graph databases got really right. 04:13.000 --> 04:20.000 Most graph databases have a syntax that is more elegant for capturing joins on something like this. 04:20.000 --> 04:28.000 They provide really concise and elegant syntax, and they also provide an elegant syntax for handling the bidirectionality. 04:28.000 --> 04:35.000 For example, the Cypher query language, which is shown here on the right, has this construct where you put the edge in brackets, 04:35.000 --> 04:38.000 and just a dash says this is an undirected edge. 04:38.000 --> 04:49.000 And the rest is also simple: you say MATCH for pattern matching, p1 for person1, the name of the person, the location as a bracketed edge, and then you return what you need. 04:49.000 --> 04:57.000 So most graph databases have some sort of syntactic sugar for expressing joins concisely.
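To make this concrete, here is a minimal SQL sketch of the naive query and the view-based fix just described; the table and column names Knows(person1, person2) and Location(person, city) are assumed for illustration.

-- Naive query: misses friendships stored in the other direction.
SELECT l.city
FROM Knows k
JOIN Location l ON l.person = k.person2
WHERE k.person1 = 'Dan';

-- Fix: a view that makes the Knows edge undirected by flipping it in a UNION ALL.
CREATE VIEW KnowsUndirected AS
SELECT person1, person2 FROM Knows
UNION ALL
SELECT person2, person1 FROM Knows;

-- The same query over the view now returns both Mol and Oss.
SELECT l.city
FROM KnowsUndirected k
JOIN Location l ON l.person = k.person2
WHERE k.person1 = 'Dan';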
04:57.000 --> 05:03.000 And there are actually quite a few graph databases, and in the next couple of minutes I will go through them. 05:03.000 --> 05:10.000 You can think about them as systems along the transactional-to-analytical spectrum. 05:10.000 --> 05:16.000 The more transactional a system is, the more write-oriented it is, and the more focused it is on high throughput; 05:16.000 --> 05:23.000 and the more analytical it is, the more focused it is on running queries that are long-running and read a lot of data. 05:23.000 --> 05:30.000 And I will argue that other than providing syntactic sugar for joins, they also address other join problems. 05:30.000 --> 05:36.000 Namely, the first category addresses something that is called the repeated join problem. 05:36.000 --> 05:37.000 What does this mean? 05:37.000 --> 05:43.000 Well, let's take another slice of the social network, with just the vicinity of Mol. 05:43.000 --> 05:48.000 And imagine we have a client application that needs to render this in the user interface in some form. 05:48.000 --> 05:55.000 What happens here is that you can do maybe five joins and then return the table. 05:55.000 --> 05:59.000 It's actually not just five joins: you have to do left joins, you have to do some filtering, 05:59.000 --> 06:04.000 and ultimately you will end up with this kind of universal schema of all your data. 06:04.000 --> 06:07.000 But it has a bunch of problems. A big problem is redundancy. 06:07.000 --> 06:14.000 You have to spell out the city name, the postcode, and the population for every person who lives there. 06:14.000 --> 06:18.000 Then you have to spell out all the person data for every message that they created. 06:18.000 --> 06:25.000 You have a bunch of NULLs, and this is just difficult. It's also quite difficult to reconstruct the graph on the client side from all this data. 06:25.000 --> 06:28.000 This is known as the overfetching problem. 06:28.000 --> 06:32.000 You run complex queries, you overfetch, and then the client has to do extra work. 06:32.000 --> 06:38.000 So in practice, almost no one does this. What they do instead is iterative round trips. 06:38.000 --> 06:43.000 They first get the cities, then they get the people in each city, and then they get their messages. 06:43.000 --> 06:48.000 But this is known as underfetching, because you don't get enough data to actually answer your problem. 06:48.000 --> 06:53.000 You end up doing N+1 queries; you always have to get more data and more data. 06:53.000 --> 06:57.000 This is expensive: round trips on the network. 06:57.000 --> 07:04.000 So this seems like a tough problem, and like most tough problems in computer science, it has a name. 07:04.000 --> 07:07.000 This is the impedance mismatch problem from object-relational mapping. 07:07.000 --> 07:16.000 The impedance mismatch problem states that if you have a graph-like object-oriented model and a relational model for your relational database, 07:16.000 --> 07:21.000 those are really difficult to map onto each other, and they don't really play nicely with each other. 07:22.000 --> 07:31.000 So one potential solution to this was invented by the API folks. This is GraphQL, which allows you to have a query on the left, 07:31.000 --> 07:39.000 where you can specify in this nested structure that you want the social network, then the cities, their people, and their messages. 07:39.000 --> 07:48.000 And then you get back a nested data structure, usually represented as JSON, which is a rather nice encapsulation of what you had. 07:48.000 --> 07:55.000 We don't overfetch, we don't underfetch, there is no redundancy, and it's easy to reconstruct the graph from what you received. 07:55.000 --> 07:58.000 So this was actually picked up by a bunch of graph databases. 07:58.000 --> 08:12.000 Another solution that some graph databases adopt, like OrientDB, is to extend SQL with some sort of dot operator for navigating along the edges of the graph. 08:12.000 --> 08:20.000 In this talk, because we are at FOSDEM, I will also focus on the licensing aspects of projects, and OrientDB is an interesting one to start with.
08:20.000 --> 08:32.000 It was originally an Apache-licensed project; then it was bought by SAP, that is, the company behind it was acquired. They spun out an enterprise version of it and sold licenses for it, and then they abandoned it. 08:32.000 --> 08:40.000 OrientDB lives on in three different versions. One is a conceptual fork, represented by the light bulb, called ArcadeDB. 08:40.000 --> 08:50.000 One is a fork called YouTrackDB, which actually powers the JetBrains YouTrack system for issue tracking. And there is still the original system. 08:50.000 --> 08:54.000 So it kind of forked off into three different projects. 08:54.000 --> 09:01.000 Another system is Microsoft Cosmos DB. This is an interesting system because it runs in the Microsoft cloud, so it's obviously a proprietary piece of software. 09:01.000 --> 09:09.000 And it has this really famous use case of powering ChatGPT, so that's kind of something a graph database can do. 09:09.000 --> 09:14.000 Microsoft offers another SQL-like query language, which is similar to OrientDB's. 09:14.000 --> 09:20.000 It offers the Gremlin language, and, very interestingly, it offers a MongoDB-compatible API. 09:20.000 --> 09:26.000 So this already illustrates that some graph databases don't really do complex path queries or pattern matching. 09:26.000 --> 09:31.000 They just do document-database-style operations, because that's what people want. 09:31.000 --> 09:37.000 They want to think about their data as a graph, but they want to query in terms of nested JSON-like data structures. 09:37.000 --> 09:48.000 And one very nice piece of evidence for this is that Microsoft actually spun out an open source fork of Cosmos DB, which is called DocumentDB. 09:48.000 --> 09:54.000 So this is really for document processing, it's MIT-licensed, and it runs on top of Postgres. 09:54.000 --> 10:06.000 A bunch of other graph database systems are here. What you can note is that GraphQL and various GraphQL dialects are quite popular, as well as Datalog and the Gremlin language. 10:06.000 --> 10:14.000 If you ever look for something like a graph serving system that answers these sorts of nested-data-returning queries, 10:14.000 --> 10:21.000 you should look for something that's backed by a key-value store or a document store, or that calls itself a real-time graph database; 10:21.000 --> 10:33.000 if you go to the DB-Engines ranking and open the list of graph databases, these will be called multi-model systems, which are also document stores as kind of a second data model. 10:33.000 --> 10:39.000 So if I want to put a label on this, I would call this the MongoDB of the graph database space. 10:39.000 --> 10:49.000 The second category is interesting too. I would call them classic graph databases, as a temporary name, and they basically solve the problem of recursive joins. 10:49.000 --> 11:00.000 Well, if you want to do something like shortest path finding, say, finding the shortest path between two people, then you would do something like this, 11:00.000 --> 11:07.000 between Dan and Finn; obviously, everyone can see that the shortest path goes through Ada and Carl. 11:07.000 --> 11:16.000 But if you want to get the same answer from a SQL system, you will end up writing something known as a recursive query, with the RECURSIVE clause in SQL, 11:16.000 --> 11:27.000 and you end up with a monstrosity like this.
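For reference, a rough sketch of such a recursive query in Postgres-flavored SQL, reusing the hypothetical KnowsUndirected view sketched earlier; it enumerates all simple paths from Dan and keeps the shortest one to Finn, which is exactly the kind of inefficiency being described:

-- Track visited people in an array to avoid cycles; pick the shortest path by hop count.
WITH RECURSIVE paths(person, path, hops) AS (
  SELECT 'Dan'::text, ARRAY['Dan'::text], 0
  UNION ALL
  SELECT k.person2, p.path || k.person2, p.hops + 1
  FROM paths p
  JOIN KnowsUndirected k ON k.person1 = p.person
  WHERE NOT k.person2 = ANY(p.path)  -- do not revisit people on this path
)
SELECT path, hops
FROM paths
WHERE person = 'Finn'
ORDER BY hops
LIMIT 1;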
I have spent a lot of time refining this query, but you do have to use lists to keep track of where you've been, you have to use window functions to look around the table, 11:27.000 --> 11:37.000 and it's not easy to write, it's not nice to read, and it's also not efficient to run. So this is obviously a big limitation of SQL:1999. 11:38.000 --> 11:45.000 Luckily, graph databases have another trick up their sleeves. In Cypher, you can just say that you want KNOWS with an asterisk, the Kleene star, 11:45.000 --> 11:54.000 and that you want a shortest path, and then you return it. This is basically as simple as it gets, a stark difference compared to the SQL query. 11:54.000 --> 12:02.000 And this language, Cypher, was introduced by the Neo4j database system. Neo4j is kind of the first modern graph database, 12:02.000 --> 12:09.000 and they did great work in introducing Cypher, opening it up with a specification called openCypher, 12:09.000 --> 12:14.000 and then pushing it through the ISO standardization process to yield GQL. 12:14.000 --> 12:21.000 Neo4j also did amazing advocacy work: they ran more than ten FOSDEM devrooms, published books, ran meetups, and so on. 12:21.000 --> 12:28.000 Neo4j is a transactional system by nature, but it leans into analytics: it has an analytical suite for graph algorithms, 12:28.000 --> 12:34.000 and if you get the enterprise version, it also has a parallel runtime for more complex queries. 12:34.000 --> 12:41.000 Another system that you may have come across in the last 10 years is called Titan, or JanusGraph, or HugeGraph. 12:41.000 --> 12:47.000 This has also been forked multiple times, and most of those forks are no longer maintained. 12:47.000 --> 12:52.000 The current version is maintained mostly by Baidu, and it's an Apache project; 12:52.000 --> 12:59.000 it focuses on highly transactional setups where you store the graph in a distributed way, mostly query with the Gremlin language, 12:59.000 --> 13:04.000 and focus on short reads and high-throughput transactions. 13:04.000 --> 13:11.000 Other systems in this category include AWS Neptune, Galaxybase, GraphScope, Memgraph, NebulaGraph, 13:11.000 --> 13:18.000 notably Google Spanner Graph, which is a new player released just a few months ago, and Ultipa. 13:19.000 --> 13:25.000 And you can see that there are three languages here: Gremlin systems, Cypher systems, and GQL systems. 13:25.000 --> 13:31.000 So it's not unified yet, there are three languages, but these are the key players. 13:31.000 --> 13:37.000 This is the category that is really popular in terms of mapping it back to relational database systems. 13:37.000 --> 13:47.000 Most of the legacy vendors, IBM, Oracle, Microsoft, have versions of their systems, at different maturity levels, that do graph queries. 13:47.000 --> 13:52.000 There are also Postgres extensions and work-in-progress implementations in Postgres. 13:52.000 --> 13:57.000 So I think it's fair to say that this is the Postgres of graph database systems. 13:57.000 --> 14:02.000 They do transactional workloads, a bunch of joins, and they also do recursive joins. 14:02.000 --> 14:06.000 And the final category is graph BI, or graph OLAP 14:06.000 --> 14:12.000 if you will. This focuses on many-to-many joins, complex recursive joins, and other graph patterns. 14:12.000 --> 14:15.000 So what are complex recursive joins?
14:15.000 --> 14:21.000 Well, instead of doing shortest paths, we can do weighted shortest paths, also known as cheapest paths. 14:21.000 --> 14:28.000 For example, if you take our random social graph, we can do some crazy queries like: let's pick two cities, 14:28.000 --> 14:33.000 let's get the residents of those cities, then let's do a Cartesian product between them, 14:33.000 --> 14:36.000 and then try to find all the cheapest paths between them. 14:36.000 --> 14:42.000 This is, of course, a very heavy-hitting operation, and it needs very special algorithms to deliver it. 14:42.000 --> 14:49.000 Namely, you need something like a multi-source, multi-destination, bidirectional cheapest path algorithm, 14:49.000 --> 14:53.000 which is something that isn't even there in terms of research, as far as I'm aware. 14:53.000 --> 14:59.000 So this is somewhere real innovation can happen if someone wants to run these queries quickly. 14:59.000 --> 15:09.000 Maybe a more mundane example is cyclic queries. A typical use case for databases in the graph space is recommendation problems. 15:09.000 --> 15:14.000 So you can do something like: try to find people who have a lot of shared interests. 15:14.000 --> 15:20.000 I added a bunch of topics that people are interested in, mostly based on music genres, 15:20.000 --> 15:23.000 and we can do a query like this. 15:23.000 --> 15:28.000 Let's take the Knows edges, join Interest1 and Interest2 on the two endpoints, and make sure the two topics are the same. 15:28.000 --> 15:33.000 Visually, you can see that Ada and Bob have two shared interests, rock and ska, 15:33.000 --> 15:37.000 and Ada and Carl have one shared interest, rap music. 15:37.000 --> 15:46.000 So this is a triangle, and triangle queries have this interesting problem: if you start to calculate them, 15:46.000 --> 15:50.000 they will have huge intermediates, even if the end result is actually not that big. 15:50.000 --> 15:57.000 The intermediates can blow up, but they only blow up if you do it with so-called binary joins, 15:57.000 --> 15:59.000 where you join two operands together. 15:59.000 --> 16:05.000 So if you do Interest1 join Knows join Interest2 with binary joins, no matter how you order them, 16:05.000 --> 16:10.000 it can blow up, and it can blow up quadratically, to the order of m squared, where m is the number of edges. 16:10.000 --> 16:16.000 Obviously, this is not going to finish. But interestingly enough, if you do multi-way joins, 16:16.000 --> 16:21.000 then you can get it down to m to the power of 1.5, which is still worse than m log m, 16:21.000 --> 16:26.000 but it's something that will finish eventually. 16:26.000 --> 16:29.000 And the third category is acyclic queries. 16:29.000 --> 16:39.000 This is also kind of new research. If you run a query saying "which persons have more topics of interest than friends?", 16:39.000 --> 16:47.000 kind of an evil query, then we can write this actually quite simple SQL query that says: take the Knows edge, 16:47.000 --> 16:53.000 join it on the Interest table, and then just do a GROUP BY on the person in the middle, 16:53.000 --> 16:58.000 and find the number of topics that they are interested in and the number of friends that they have; 16:58.000 --> 17:02.000 and if there are more topics than friends, then that's a match for this query. 17:02.000 --> 17:10.000 So we can see that Bob, for example, has two friends but five interests, so he will qualify for the result of this query.
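Minimal SQL sketches of the two queries just described, again with the assumed tables KnowsUndirected(person1, person2) and Interest(person, topic):

-- Cyclic (triangle) query: pairs of friends with a shared interest.
SELECT i1.person AS person1, i2.person AS person2, i1.topic
FROM Interest i1
JOIN KnowsUndirected k ON k.person1 = i1.person
JOIN Interest i2 ON i2.person = k.person2
                AND i2.topic = i1.topic;

-- Acyclic query: persons with more topics of interest than friends.
SELECT k.person1
FROM KnowsUndirected k
JOIN Interest i ON i.person = k.person1
GROUP BY k.person1
HAVING COUNT(DISTINCT i.topic) > COUNT(DISTINCT k.person2);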
17:10.000 --> 17:16.000 The problem is that if we do this join, we will get a huge blowup again, 17:16.000 --> 17:23.000 because you can see that we have a topic, person1, and person2, and every time Bob gets a new friend, 17:23.000 --> 17:28.000 we have to repeat all of his hobbies in the table, and every time he gets a new hobby, 17:28.000 --> 17:33.000 we have to list all the friends that he already has next to the new hobby. 17:33.000 --> 17:40.000 So this is another quadratic-style blowup, and it is caused by a multivalued dependency, 17:40.000 --> 17:48.000 which means that, given the person in the middle, the two sides are conditionally independent: the topics 17:48.000 --> 17:52.000 and the friends that they have are independent of each other. 17:52.000 --> 17:55.000 And this is a common pattern that's introduced by many-to-many joins, 17:55.000 --> 17:59.000 and it's not only common, it's very predictable, so it's easy to compress away, 17:59.000 --> 18:02.000 and this compression is captured by something known as factorization. 18:02.000 --> 18:06.000 Factorization is a lossless compression method for relational tables, 18:06.000 --> 18:13.000 and you can take a big flat table on the left and turn it into something known as a factorized table. 18:13.000 --> 18:18.000 And the factorized table is basically a union of little Cartesian products, 18:18.000 --> 18:24.000 so we have the topics for Ada, the topics for Bob, and so on for all the people, 18:24.000 --> 18:29.000 and this is such an elegant and concise representation that you can answer the SQL query 18:29.000 --> 18:31.000 by just looking at the table. 18:31.000 --> 18:35.000 You can say: well, Ada has an equal number of topics and friends, 18:35.000 --> 18:41.000 Bob has more topics, so he is returned, and then Carl has fewer topics. 18:41.000 --> 18:45.000 So you can just eyeball it and answer the query, which would be a lot more difficult 18:45.000 --> 18:48.000 and memory-intensive if you used the actual flat table. 18:48.000 --> 18:55.000 So it's kind of obvious at this point in research that workloads that have a lot of these 18:55.000 --> 18:59.000 many-to-many join operations, or n-to-m edge join operations, 18:59.000 --> 19:04.000 could benefit from factorization, but there are a lot of open questions around it, 19:04.000 --> 19:08.000 and there are also a lot of open questions on how to implement this efficiently, 19:08.000 --> 19:13.000 so on both the theoretical and the implementation side, this is still up for grabs. 19:13.000 --> 19:18.000 And another thing that is quite interesting is that advanced variants of factorization, 19:18.000 --> 19:23.000 called d-representations, could also work for returning compact graphs. 19:23.000 --> 19:29.000 So if you remember the first category of systems, where we tried to return a graph, 19:29.000 --> 19:32.000 but it was kind of difficult, factorization can also help with that. 19:32.000 --> 19:39.000 So this is, I think, a really promising direction that may reach across the different graph database systems.
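To make the factorized representation concrete, here is a toy example; Bob's topics beyond rock, ska, and rap are invented for illustration:

Flat table (topic, person1, person2) for Bob, 5 topics x 2 friends = 10 rows:
  (rock, Bob, Ada), (rock, Bob, Carl), (ska, Bob, Ada), (ska, Bob, Carl),
  (rap, Bob, Ada), (rap, Bob, Carl), (jazz, Bob, Ada), (jazz, Bob, Carl), ...

Factorized form, a little Cartesian product:
  {rock, ska, rap, jazz, punk} x {Bob} x {Ada, Carl}

This stores 5 + 1 + 2 = 8 values instead of 10 x 3 = 30, and the answer to the query above can be read off directly: 5 topics > 2 friends, so Bob qualifies.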
19:40.000 --> 19:45.000 The interesting thing about factorization and worst-case optimal joins is that they are 19:45.000 --> 19:52.000 results that follow this really traditional and kind of ideal path of first being published in theory papers, 19:52.000 --> 19:58.000 then being picked up by database researchers, then being turned into system designs by systems architects, 19:58.000 --> 20:01.000 and then ultimately ending up in production systems. 20:01.000 --> 20:04.000 But what's interesting is that this is a really long journey: 20:04.000 --> 20:10.000 even though the ideas themselves are quite simple, it takes 10 years to get the first good research prototypes, 20:10.000 --> 20:15.000 and maybe another 10 years to get the first mature industry systems. 20:15.000 --> 20:20.000 Speaking of which, there is one system called Kuzu, from the University of Waterloo, 20:20.000 --> 20:24.000 and it implements both worst-case optimal joins and factorization. 20:24.000 --> 20:29.000 It supports the Cypher query language, and it also has a strong focus on path queries. 20:29.000 --> 20:35.000 It also has two academic precursors: Graphflow, which was their system written in Java, 20:35.000 --> 20:38.000 and then they had GRainDB, which was a DuckDB fork. 20:38.000 --> 20:42.000 They abandoned both of those systems, but they reused all the knowledge that they accumulated 20:42.000 --> 20:49.000 and built Kuzu, which is this DuckDB-style single-node, in-process system for graph processing. 20:49.000 --> 20:53.000 Another system, just to show you the variety, is called TigerGraph. 20:53.000 --> 20:58.000 This has a completely different query language called GSQL, and it focuses on traversals. 20:58.000 --> 21:06.000 So this is less about the join-heavy queries, and more about recursive cheapest path queries and other queries 21:06.000 --> 21:09.000 where you do some sort of fixed-point computation. 21:09.000 --> 21:15.000 So if I want to put a final label on this, I would say this is the Teradata of graph database systems. 21:15.000 --> 21:21.000 Once you have this categorization, it will help you select a system: do you want something like the MongoDB, 21:21.000 --> 21:24.000 the Postgres, or the Teradata of graph databases for your problem? 21:25.000 --> 21:28.000 What about schema and distribution? There is no easy answer to this. 21:28.000 --> 21:31.000 Distribution can happen for all of these systems. 21:31.000 --> 21:36.000 The only rule of thumb that I can give is that if you have really heavy analytical queries, 21:36.000 --> 21:39.000 which are illustrated here in purple, then you need a schema. 21:39.000 --> 21:46.000 There is really not a huge chance of being very fast if you don't have the schema of your data. 21:46.000 --> 21:51.000 But other than that, there are single-node and distributed systems in all categories. 21:51.000 --> 21:54.000 So, I promised that I would talk about benchmarks. 21:54.000 --> 21:59.000 I have been a longstanding member of the LDBC, the Linked Data Benchmark Council, 21:59.000 --> 22:04.000 which is a non-profit with the mission to accelerate progress in graph data management. 22:04.000 --> 22:09.000 LDBC has a membership of about 100 individuals, mostly researchers and practitioners, 22:09.000 --> 22:13.000 and also 25 organizations. Who are those? 22:13.000 --> 22:16.000 Well, we have a bunch of different categories here.
22:16.000 --> 22:21.000 We have traditional vendors like Oracle, we have AWS, and graph database vendors 22:21.000 --> 22:29.000 like Neo4j, TigerGraph, the Ant Group, and Alibaba, and a bunch of other companies from hardware, 22:29.000 --> 22:33.000 from cloud, and from databases, and also academic institutes. 22:33.000 --> 22:37.000 And the idea of LDBC is that it takes all of these members, 22:37.000 --> 22:40.000 the database companies, the hardware vendors, and the cloud vendors, 22:40.000 --> 22:45.000 and it forces them to collaborate on standards, like benchmark standards 22:45.000 --> 22:49.000 and query language standards, and it forces them to compete on performance, 22:49.000 --> 22:54.000 purely because if they don't compete, they are left off the leaderboard. 22:54.000 --> 22:58.000 So to this end, LDBC defines a bunch of benchmarks and query languages. 22:58.000 --> 23:03.000 The first benchmark, one of the most popular workloads of ours, 23:03.000 --> 23:07.000 is called the Social Network Benchmark's Interactive workload. 23:07.000 --> 23:12.000 And it has these sorts of short read queries that fall into the classic graph database, 23:12.000 --> 23:15.000 or Postgres-of-graph-databases, category. 23:15.000 --> 23:20.000 So for example here, we take a person, then we do one hop, two hops, and fetch the messages 23:20.000 --> 23:23.000 that were created before a given date. 23:23.000 --> 23:25.000 And obviously this is a graph database benchmark, 23:25.000 --> 23:29.000 so if you formulate the queries in SQL, you can formulate them, 23:29.000 --> 23:31.000 but sometimes they will be a bit unwieldy. 23:31.000 --> 23:34.000 For example, this is already a rather long query. 23:34.000 --> 23:38.000 But this allows me to demonstrate two new query languages. 23:38.000 --> 23:41.000 One is called SQL property graph queries, or SQL/PGQ. 23:41.000 --> 23:44.000 This will be covered in the next talk by Daniel in more detail. 23:44.000 --> 23:47.000 But this is an extension in SQL:2023, 23:47.000 --> 23:52.000 where you can say FROM GRAPH_TABLE, and then you can do things like MATCH with an ACYCLIC path mode. 23:52.000 --> 23:55.000 And then you have something very similar to the Cypher language. 23:55.000 --> 23:57.000 You have the visual graph syntax. 23:57.000 --> 24:01.000 You can write a quantifier like {1,2} instead of spelling out the joins, 24:01.000 --> 24:05.000 and you can return your results as a table. 24:05.000 --> 24:09.000 Another language, which was standardized just last year, is called GQL. 24:09.000 --> 24:14.000 This basically shares its pattern matching core with SQL/PGQ, 24:14.000 --> 24:18.000 but it's more oriented towards processing just graphs.
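A rough sketch of what such a one-to-two-hop query can look like in SQL/PGQ; the graph name, labels, and property names are made up for illustration, and this is not the exact benchmark query:

SELECT friend_name, content
FROM GRAPH_TABLE (snb_graph
  MATCH (p IS Person WHERE p.id = 17)
        -[IS KNOWS]-{1,2} (friend IS Person)
        <-[IS HAS_CREATOR]- (m IS Message WHERE m.creation_date < DATE '2012-06-01')
  COLUMNS (friend.name AS friend_name, m.content AS content)
);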
24:18.000 --> 24:22.000 The Interactive workload has been out for a couple of years now, 24:22.000 --> 24:27.000 and we have received a lot of official results from vendors. 24:27.000 --> 24:33.000 And I think this is a really impressive diagram, because it shows that we have mostly 24:33.000 --> 24:36.000 a doubling of performance every year, 24:36.000 --> 24:41.000 and even better if you take the performance and the price of the systems into account. 24:41.000 --> 24:49.000 But what's really interesting in this diagram is that all the systems that competed here 24:49.000 --> 24:52.000 and achieved better results are based in China. 24:52.000 --> 24:56.000 So apparently China has a really strong venture capital ecosystem, 24:56.000 --> 25:00.000 and they also do a lot of open source; two of these systems are open source. 25:00.000 --> 25:07.000 So I think this is something to think about, two weeks after DeepSeek R1 came out. 25:07.000 --> 25:10.000 That obviously hit the mainstream media, 25:10.000 --> 25:14.000 but in graph databases there is also a huge advantage for Chinese vendors 25:14.000 --> 25:16.000 when it comes to performance. 25:16.000 --> 25:18.000 I cannot really speak about the usability of these systems, 25:18.000 --> 25:20.000 but performance-wise they're good. 25:20.000 --> 25:24.000 And one system that's really impressive is the little red dot, 25:24.000 --> 25:28.000 which is not the fastest, but it's the fastest that uses a declarative language. 25:28.000 --> 25:32.000 This is Huawei GES, the Graph Engine Service, 25:32.000 --> 25:36.000 and it uses the Cypher language, and it uses factorization internally 25:36.000 --> 25:38.000 to make the queries go fast. 25:38.000 --> 25:44.000 Another workload that I worked on for a few years is called the Business Intelligence workload. 25:44.000 --> 25:48.000 This is graph OLAP: complex patterns like finding triangles, 25:48.000 --> 25:51.000 complex path queries like finding cheapest paths, 25:51.000 --> 25:56.000 and running a power test 25:56.000 --> 26:00.000 with individual query run times, and a throughput test with overlapping query runs. 26:00.000 --> 26:02.000 We also audited a bunch of systems for this, 26:02.000 --> 26:06.000 but the vendors were very careful not to overlap on the scale factors, 26:06.000 --> 26:09.000 so I cannot do any system-to-system comparison here. 26:09.000 --> 26:14.000 We will do more audits this year, and hopefully we can compare their results. 26:14.000 --> 26:18.000 And the final benchmark that I would like to mention is the Financial Benchmark. 26:18.000 --> 26:21.000 This is a very interesting one, because it was driven by the Ant Group, 26:21.000 --> 26:24.000 CreateLink, and another company called Ultipa. 26:24.000 --> 26:28.000 They do very low-latency queries 26:28.000 --> 26:30.000 with relaxed consistency requirements. 26:30.000 --> 26:32.000 So think about the NoSQL-style, 26:32.000 --> 26:34.000 well, eventual consistency. 26:34.000 --> 26:37.000 And they have very interesting queries: they have path queries 26:37.000 --> 26:41.000 where you have not only regular-expression-type paths, 26:41.000 --> 26:43.000 but also paths where you keep some memory. 26:43.000 --> 26:45.000 You keep some state during the traversal, 26:45.000 --> 26:49.000 and you check that the amount of money transferred across the path 26:49.000 --> 26:52.000 is only decreasing by small amounts.
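A rough sketch, in Postgres-flavored SQL, of this kind of stateful path query; the Transfers(src, dst, amount) table, the account ID, and the "each hop forwards at least 90% of the previous amount" rule are all invented for illustration:

-- Follow transfer chains from one account, carrying the amount as state
-- and pruning hops that drop the amount by more than 10%.
WITH RECURSIVE chains(account, amount, path) AS (
  SELECT t.dst, t.amount, ARRAY[t.src, t.dst]
  FROM Transfers t
  WHERE t.src = 'acct-1'
  UNION ALL
  SELECT t.dst, t.amount, c.path || t.dst
  FROM chains c
  JOIN Transfers t ON t.src = c.account
  WHERE t.amount BETWEEN 0.9 * c.amount AND c.amount
    AND NOT t.dst = ANY(c.path)  -- avoid cycles
)
SELECT path, amount FROM chains;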
26:53.000 --> 26:56.000 So how do you use these LDBC benchmarks? 26:56.000 --> 26:59.000 Well, we have a complicated benchmark kit. 26:59.000 --> 27:01.000 I have spent some time in this talk 27:01.000 --> 27:04.000 making fun of the vendors who kept relicensing their software. 27:04.000 --> 27:07.000 We also did this, because we picked the GPL, 27:07.000 --> 27:10.000 and that put our members off, so we also relicensed. 27:10.000 --> 27:13.000 And we do detailed specifications, papers, 27:13.000 --> 27:17.000 generators, data sets, and all sorts of frameworks 27:17.000 --> 27:20.000 to make our benchmarks easy to implement. 27:20.000 --> 27:22.000 The benchmarks are easy to implement, 27:22.000 --> 27:24.000 but they're not easy to get audited. 27:24.000 --> 27:27.000 We have TPC-grade benchmark audits, 27:27.000 --> 27:29.000 which means that we have certified auditors. 27:29.000 --> 27:32.000 And if you're a vendor, you have to talk to an auditor. 27:32.000 --> 27:35.000 It usually costs something like 30 to 50,000 euros. 27:35.000 --> 27:37.000 It's a multi-week process. 27:37.000 --> 27:40.000 The auditor will talk to the vendor, 27:40.000 --> 27:43.000 they will compile the specification sheets, 27:43.000 --> 27:45.000 and then they will, of course, 27:45.000 --> 27:49.000 rerun the benchmark and write a full disclosure report out of it. 27:49.000 --> 27:52.000 This is kind of a painful process, 27:52.000 --> 27:53.000 but vendors go through it, 27:53.000 --> 27:55.000 because this is kind of the gold standard 27:55.000 --> 27:56.000 in the graph database space, 27:56.000 --> 27:58.000 and we have been doing audits for five years, 27:58.000 --> 28:00.000 published 50 results, 28:00.000 --> 28:03.000 and never had to retract any of the results. 28:03.000 --> 28:05.000 The worst issues I could find were a few typos 28:05.000 --> 28:08.000 in old full disclosure reports. 28:08.000 --> 28:10.000 And as you can see, we have grown quite well 28:10.000 --> 28:11.000 in the last five years, 28:11.000 --> 28:13.000 so we have more and more members, 28:13.000 --> 28:15.000 lots of user community meetings, 28:15.000 --> 28:18.000 benchmarks, results, and so on. 28:18.000 --> 28:21.000 So what are the challenges in the graph database space? 28:21.000 --> 28:22.000 There is a big decline; 28:22.000 --> 28:25.000 I don't think that needs to be denied. 28:25.000 --> 28:28.000 The hype cycle has moved over to AI technologies, 28:28.000 --> 28:30.000 and I think there's a huge amount of confusion. 28:30.000 --> 28:32.000 I'm trying to clear up the confusion here a bit 28:32.000 --> 28:34.000 by setting up these categories, 28:34.000 --> 28:36.000 but when vendors go out saying, 28:36.000 --> 28:38.000 "we are a graph database, 28:38.000 --> 28:39.000 you don't need joins, 28:39.000 --> 28:41.000 and we will replace all relational systems", 28:41.000 --> 28:43.000 I don't think that does anyone any good. 28:43.000 --> 28:45.000 I don't think anyone is seriously saying anymore 28:46.000 --> 28:49.000 that graph databases will replace relational systems. 28:49.000 --> 28:51.000 There are many niche use cases, 28:51.000 --> 28:54.000 and there are vendors who are basically tied to one or two of their customers. 28:54.000 --> 28:56.000 They over-optimized for that, 28:56.000 --> 28:59.000 and that ends up in this very fragmented space 28:59.000 --> 29:01.000 with dozens of query languages. 29:01.000 --> 29:03.000 You can see the effect of that: 29:03.000 --> 29:06.000 between 2013 and 2022, 29:06.000 --> 29:09.000 there was a 13-fold increase in the DB-Engines 29:09.000 --> 29:11.000 ranking for graph databases, 29:11.000 --> 29:13.000 but since the summer of 2022, 29:13.000 --> 29:14.000 it has been in decline, 29:14.000 --> 29:18.000 so we are now back to basically two thirds of the peak, 29:18.000 --> 29:21.000 and the decline doesn't look very good. 29:21.000 --> 29:23.000 If you have heard about GraphRAG: 29:23.000 --> 29:25.000 the vendors have heard about it too, 29:25.000 --> 29:27.000 and this is the lifeline that they are trying to cling onto. 29:27.000 --> 29:31.000 This is something that the vendors think is going to help them, 29:31.000 --> 29:33.000 and indeed it generates a lot of attention, 29:33.000 --> 29:35.000 and it's a nice graph use case.
29:35.000 --> 29:37.000 OK, to sum up: 29:37.000 --> 29:39.000 graph databases have syntactic sugar 29:39.000 --> 29:41.000 and optimizations for joins. 29:41.000 --> 29:43.000 If you have ten joins in your query, 29:43.000 --> 29:44.000 and for some reason 29:44.000 --> 29:46.000 it doesn't work with a relational system, 29:46.000 --> 29:47.000 you should try a graph database, 29:47.000 --> 29:49.000 but then, based on the problem, you may want 29:49.000 --> 29:50.000 the MongoDB, 29:50.000 --> 29:51.000 the Postgres, 29:51.000 --> 29:53.000 or the Teradata of graph databases. 29:53.000 --> 29:55.000 And licensing is difficult: 29:55.000 --> 29:58.000 check the licenses before using the software, 29:58.000 --> 30:00.000 and check the performance results. 30:00.000 --> 30:01.000 Thanks a lot.