Hello everyone, thanks for having me here. Today we're going to talk about retrieval-augmented generation, or RAG: where it works, where it starts breaking, and whether knowledge graphs can help us with that.

A couple of words about myself. My name is Miki Tekimarski, I'm the technical co-founder of a startup where we build AI agents for autonomous negotiations in procurement, so I've had a chance to explore different LLM-based architectures, including RAG and its extensions.

Let's imagine a common scenario. You have a large knowledge base of unstructured documents, for instance company policies, and you would like to ask questions and get answers based on these documents. As a solution, you configure a RAG pipeline on top of that. At first everything looks great: you ask simple questions and get valid responses. But then the questions become more complex: questions that require connecting multiple documents, questions about contradictions, or about how something evolved over time. And suddenly RAG cannot address that. This is what leads us to GraphRAG.

But before talking about limitations, let's first align on what RAG means at all. I'll call the default implementation "vanilla RAG" to distinguish it from its extensions. The idea is simple. You have a large unstructured knowledge base and you want a large language model to reason on top of it, but you cannot pass in the whole knowledge base, simply because the LLM's context window is limited. So you split the text into chunks, and for each question you retrieve the top-k most similar chunks and pass only those to the large language model, to serve as the context for answering your question (I'll show a minimal sketch of this loop right below). For many cases this approach works, and that's why it became so popular.

But the problems start when questions stop being local. For example: questions that require combining information from multiple documents; questions that depend on relationships between entities; global queries like "what are the main themes in this data?", which obviously require a high-level view of the whole dataset rather than separate chunks; questions that try to spot contradictions, for instance "which documents disagree or contradict each other?"; or questions that require temporal reasoning, because with plain chunks you do not capture at what time they are relevant, or when they become invalid.

This is where knowledge graphs come in. GraphRAG is an umbrella term for the solutions that try to augment vanilla RAG with knowledge graphs.
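Before digging into the graph-based approaches, here is roughly what the vanilla RAG loop described above boils down to. It is a minimal sketch, not a production pipeline: the chunking strategy, the corpus file, and the sample question are placeholder choices, and the model names are simply the ones mentioned later in the talk.

```python
# Minimal vanilla RAG sketch: chunk -> embed -> top-k retrieve -> answer.
# Assumes the openai v1 SDK and an OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Naive fixed-size character chunking; real pipelines usually split on structure.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> str:
    q = embed([question])[0]
    # Cosine similarity of the question against every chunk, keep the top-k as context.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(chunks[i] for i in np.argsort(sims)[-k:][::-1])
    resp = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

corpus = open("policies.txt").read()   # placeholder corpus file
pieces = chunk(corpus)
vectors = embed(pieces)
print(answer("What is the travel reimbursement limit?", pieces, vectors))
```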
So the idea behind GraphRAG is simple: maybe, if we add structure, entities and relationships, maybe cluster those entities and generate summaries of the clusters, we can recover the reasoning that vanilla RAG lacks.

GraphRAG actually follows a pipeline similar to vanilla RAG, but with much more going on under the hood. We can roughly split it into three stages: building a knowledge graph from your unstructured data, typically text; retrieving relevant subgraphs to be used as the context for answering your question; and the final answer generation using that context. From the outside it still looks like "ask a question, get an answer". Internally, however, it is much more complex, and obviously more expensive. By the way, there is an open catalogue of GraphRAG patterns; you may be interested to take a look at it.

Let's consider some of the main open-source GraphRAG solutions available today. It's not an exhaustive list; I picked some actively maintained solutions that focus on different query types. Microsoft GraphRAG and LlamaIndex both cover the whole RAG pipeline, from building the knowledge graph up to generating the answer, while Graphiti covers only knowledge graph building and context retrieval, leaving it to you to set up the LLM reasoning on top of that context and generate the answer.

I picked two text corpora for testing: A Christmas Carol as a static knowledge base, and a toy dataset of synthetic events to serve as an evolving knowledge base.

Let's start with Microsoft GraphRAG. The main promise here is the ability to answer global queries like "what are the main topics in this data?". It works by extracting arbitrary entities and relationships, clustering them into hierarchical communities, and generating a summary for each community. It creates multiple abstraction layers, even summaries of summaries, which allows the system to get a high-level view of the whole dataset or of separate clusters, and thus to answer these global queries (the sketch below illustrates the mechanism).

However, it comes with some significant constraints. First of all, you cannot enforce an ontology here, in other words the entity and relationship types you would like to use; extraction in GraphRAG is arbitrary. Incremental updates are not supported, so as new data arrives you would need to re-index, and that is very expensive in GraphRAG, as we'll see later.
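To make the community-summary mechanism more concrete, here is a rough, purely illustrative sketch of answering a global query by summarizing entity communities and then reasoning over those summaries. This is not Microsoft GraphRAG's actual API: the triples are a toy example, the prompts are invented, and networkx's Louvain method stands in for the Leiden clustering that GraphRAG uses.

```python
# Illustrative "global query via community summaries" sketch, not GraphRAG's API.
import networkx as nx
from networkx.algorithms.community import louvain_communities
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5-nano", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# (subject, relation, object) triples, assumed to be extracted from chunks beforehand.
triples = [
    ("Scrooge", "EMPLOYS", "Bob Cratchit"),
    ("Bob Cratchit", "FATHER_OF", "Tiny Tim"),
    ("Scrooge", "VISITED_BY", "Marley's Ghost"),
]

graph = nx.Graph()
for s, r, o in triples:
    graph.add_edge(s, o, relation=r)

# Community detection over the entity graph (Louvain here, Leiden in GraphRAG).
communities = louvain_communities(graph, seed=42)

# "Map" step: summarize the facts inside each community.
summaries = []
for members in communities:
    facts = [f"{u} -[{graph[u][v]['relation']}]-> {v}"
             for u, v in graph.subgraph(members).edges()]
    summaries.append(ask("Summarize these related facts:\n" + "\n".join(facts)))

# "Reduce" step: answer the global question from community summaries, not raw chunks.
question = "What is the central lesson of the story?"
print(ask("Using these summaries:\n\n" + "\n\n".join(summaries) + f"\n\nAnswer: {question}"))
```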
It's also worth mentioning that tracing the exact source documents that contributed to an answer is difficult here, because the answer is generated from summaries that span many different documents, or even from summaries of summaries, as I said. In addition, it lacks some operational features; for instance, you cannot configure the persistence layer, and indexes are stored locally as parquet files.

So I indexed the whole of A Christmas Carol with both Microsoft GraphRAG and vanilla RAG. Even on that relatively small dataset, GraphRAG consumed a lot of inference tokens for indexing, while vanilla RAG obviously requires no inference beyond embedding the chunks, and it took a significant amount of time. More importantly, queries also consume far more tokens and take much longer. The local query was "who participated in the family dinner?" and the global query was "what is the central lesson of the story?". And again, if new data arrives, you need to re-index and pay the indexing cost again, spending roughly the same tokens. They do cache some steps internally, but the community summaries have to be recomputed, and that is where most of the tokens and most of the time go.

Another important point: query types in GraphRAG are explicit, and you are supposed to choose them manually. Of course, you can try different query types for your question and compare the results. Local and global are the two original query types. Drift was added later; I haven't experimented much with it, it's an expanded version of the local query. And basic is just vanilla RAG, included for convenience so you can compare the results.

I mentioned provenance tracking, that is, tracing the specific documents that contributed to the answer. I just wanted to show how GraphRAG returns that: it comes back embedded in the text answer, as indices of entities in the parquet files. Obviously that requires parsing, but their Python API returns it in structured form.

Let's move on to the next solution, which is LlamaIndex. It's a scaffolding library for RAG pipelines. It includes interfaces and different implementations for data ingestion, indexing, retrieval, and the persistence layer, and among its index implementations there is the property graph index, which is another GraphRAG implementation. It's highly customizable, supports different query approaches, and allows you to enforce specific entity and relationship types. So it is not designed for one particular query type; it is meant to be used as a configurable solution (a short sketch follows below).
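As an example of that configurability, here is roughly what a property graph index with an enforced schema looks like in recent LlamaIndex versions. Treat it as a sketch: the entity and relation types, the data directory, and the model choice are assumptions, and the exact API may differ between releases, so check the current documentation.

```python
# Sketch of LlamaIndex's PropertyGraphIndex with a restricted schema.
# Assumes llama-index and llama-index-llms-openai are installed and
# OPENAI_API_KEY is set; API details may vary between versions.
from typing import Literal

from llama_index.core import PropertyGraphIndex, SimpleDirectoryReader
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")   # placeholder model choice

# Enforce a small ontology instead of extracting arbitrary entities and relations.
extractor = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=Literal["EMPLOYEE", "MANAGER", "POLICY"],
    possible_relations=Literal["REPORTS_TO", "GOVERNED_BY"],
    strict=True,
)

documents = SimpleDirectoryReader("./policies").load_data()   # placeholder folder
index = PropertyGraphIndex.from_documents(documents, kg_extractors=[extractor])

# Querying goes through the usual retrieve-then-generate flow.
response = index.as_query_engine(llm=llm).query("Who does the new hire report to?")
print(response)
```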
LlamaIndex is mainly a Python library. They have a TypeScript version as well, but it's limited; for instance, the property graph index is absent there.

I ran the same comparison with LlamaIndex: it consumed far fewer resources than Microsoft GraphRAG, but still much more than vanilla RAG. As I said, though, it's configurable, so you may end up with different results. The advantage here is more control: you configure what to extract, how to build relationships, and so on.

Another interesting solution is Graphiti. It focuses on dynamic and temporally aware knowledge graphs. What does temporally aware mean here? Graphiti deals with events, which it calls episodes, each having an occurrence time. That allows it to use the occurrence time to answer temporal questions, and as new data comes in, instead of simply overwriting old facts, it invalidates them with a timestamp to preserve historical accuracy. That potentially allows you to answer questions such as "what was true at some point in the past?" or "what changed between that point in time and now?" (a small sketch of this idea follows a bit further below).

Testing Graphiti on a static dataset like A Christmas Carol would be meaningless. To properly evaluate temporal reasoning, we would need a dataset of events that evolve over time, where facts change and get invalidated, and unfortunately I did not have enough time to prepare a fully representative dataset of that kind. Instead, just as a proof of concept, I created a small dataset of events describing a fictional employee's career path.

However, even with this simple dataset, the default Graphiti setup did not produce correct results. For example, when I asked who the employee's current manager is, Graphiti returned a relationship it considered current, even though, according to the event history, the employee had already been promoted and reports to a different manager. But this was the default setup: I did not use any ontology, did not enforce specific entity or relationship types, and it used a fairly small language model, so I assume the quality issues stem from that, and there is room for experimentation and improvement.

The main risk here is also performance. Every new event triggers an LLM-driven knowledge graph update. Even with the small events from my toy dataset, ingesting one event took about 15 seconds on average, which is a lot. Given the use case they describe as the main one, real-time events, it looks feasible only for slowly changing data. If you have high-volume event streams, it's probably not feasible: you would need too much inference capacity, and the latency becomes prohibitive.
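To make the fact-invalidation idea above concrete, here is a tiny, purely conceptual sketch of point-in-time queries over facts that carry validity intervals. This is not Graphiti's actual data model or API, just an illustration of the principle with hypothetical facts.

```python
# Conceptual sketch of temporal fact invalidation: each fact carries a validity
# interval, and point-in-time queries filter on it. Not Graphiti's real schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_at: datetime                    # when the fact became true
    invalid_at: datetime | None = None    # set when a newer event contradicts it

facts: list[Fact] = []

def ingest(new: Fact) -> None:
    # Instead of overwriting, invalidate any conflicting open fact with a timestamp.
    for old in facts:
        if (old.subject, old.relation) == (new.subject, new.relation) and old.invalid_at is None:
            old.invalid_at = new.valid_at
    facts.append(new)

def true_at(subject: str, relation: str, when: datetime) -> list[Fact]:
    return [f for f in facts
            if f.subject == subject and f.relation == relation
            and f.valid_at <= when and (f.invalid_at is None or when < f.invalid_at)]

ingest(Fact("Alice", "REPORTS_TO", "Bob", datetime(2023, 1, 1)))
ingest(Fact("Alice", "REPORTS_TO", "Carol", datetime(2024, 6, 1)))   # promotion event

print(true_at("Alice", "REPORTS_TO", datetime(2023, 12, 1)))   # manager was Bob
print(true_at("Alice", "REPORTS_TO", datetime.now()))          # manager is Carol
```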
At query time there is no LLM reasoning involved; it is just about retrieving the relevant nodes, entities, or subgraphs. You can configure that: Graphiti offers some search configuration, including reranking methods. The latency I saw on that toy dataset was similar to what they claim, but it may not be representative given the size of the dataset.

At this point a question arises: how do we actually compare these systems? Cost and latency are easy to compare, but what about answer quality, across different query types and different datasets? There is still a lack of widely adopted benchmarks. Some benchmarks have been released recently but have not gained adoption yet, like GraphRAG-Bench, which compares different methods with quantitative metrics across exactly the query types I listed, or Microsoft's BenchmarkQED, which does pairwise comparison, focusing mostly on local versus global queries. But the point is that, in practice, you would need to set up your own evaluations, on your own data and the queries you care about, to understand the actual performance. There are also other open-source GraphRAG solutions we haven't considered today that apply different techniques; you may be interested to look at them as well.

So what can we conclude? Vanilla RAG is simple, cheap, and effective in some cases, until the reasoning becomes structural or global. GraphRAG approaches genuinely extend what is possible, but they introduce significant cost, complexity, and operational risk. There is no universally best solution, and without standardized benchmarks, adopting GraphRAG blindly is dangerous. The only reliable path today is evaluation on your own data with your own questions.

Thank you for your attention. If you are experimenting with RAG and its extensions, please send me a message. I'll be happy to answer your questions, so please join the discussion.

Go ahead with your question.

The question is which RAG solutions I apply at my company, in my project. Currently we do not use RAG in our product, but as a startup we iterated over different ideas, and one of them was an anti-corruption copilot intended to spot corruption risks in governmental bills. There we applied LlamaIndex to build a graph of clauses and the connections between those clauses. For instance, a document contains references to other clauses: if you deal with chunks only, you see a reference like "as stated in clause 5.1" and you do not know what is in that clause. A knowledge graph allowed us to model those relationships between clauses and sections of the document (the small sketch below shows the idea).
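Here is a minimal sketch of that clause-reference idea, assuming made-up clause IDs and a hypothetical "REFERS_TO" relation. It is not the actual copilot code, just an illustration of why the graph helps where chunk-only retrieval would lose the referenced text.

```python
# Sketch: model clauses as nodes and cross-references as edges, so a retrieved
# clause can be expanded with the clauses it refers to.
import networkx as nx

clauses = {
    "3.2": "Payments above the threshold in clause 5.1 require a second approver.",
    "5.1": "The threshold is 10,000 EUR for all procurement contracts.",
}

g = nx.DiGraph()
for cid, text in clauses.items():
    g.add_node(cid, text=text)
g.add_edge("3.2", "5.1", relation="REFERS_TO")   # reference extracted from the text

def expand_context(clause_id: str) -> str:
    # Chunk-only retrieval would return just clause 3.2; following REFERS_TO edges
    # pulls the referenced clause 5.1 into the context as well.
    ids = [clause_id, *nx.descendants(g, clause_id)]
    return "\n".join(f"Clause {cid}: {g.nodes[cid]['text']}" for cid in ids)

print(expand_context("3.2"))
```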
Unfortunately, that project was decommissioned due to funding, and now we work on autonomous negotiations, where there is no RAG, at least for now.

The next question is which embedding model I used, and with which dimensionality. It is not open source: it's OpenAI's text-embedding-3-small, and the dimensionality is the default one, 1536 dimensions, if I'm not mistaken. And about the comparison: the numbers you saw were rather rough. I did not mean to make them representative down to the millisecond; I just wanted to give a sense of what to expect from these solutions at all. I tried to use similar large language models, GPT-5 nano, throughout. The point was not to compare models themselves, answer quality is a separate question, but to keep the comparison representative: the same model, the same embeddings, and approximately the same inference speed everywhere.

The next question is how to build the graph effectively. The questioner had experimented with this, for instance finding that more, smaller chunks work better than bigger ones, and asked which solution I used on a static knowledge base: Microsoft GraphRAG or something different. Unfortunately, I have not run many experiments on that; I wanted to see the rough performance, and I have not applied Microsoft GraphRAG in production, I simply have not had a use case for it. But what we lean towards, and I don't know if it is correct, is using the long context windows of large models to capture more, rather than splitting into small chunks. When you chunk, and especially when you end up with smaller chunks, you lose that surrounding context. So the bigger the chunks, the better the accuracy of the extraction; but again, it is more expensive, since you have to use a larger model, which costs quite a bit more.

The next question, as far as I understood it, was about how the clustering is done, whether it is based on embeddings or something similar. GraphRAG is, let's say, not configurable in that respect: you pass in the unstructured knowledge base, unstructured documents, text especially, and it does everything under the hood. Could you repeat the question? Sorry. The questioner clarified that they had implemented a custom graph without any framework and clustered the topics using semantic similarity, asking whether I used something similar. Not really: GraphRAG applies the Leiden clustering algorithm, so you can cluster not based on semantics but rather on distance in the graph, I mean connectivity between nodes. GraphRAG, if I'm not mistaken, uses exactly that approach, so maybe it would be more effective (the sketch below contrasts the two).
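To illustrate the contrast just discussed, here is a small sketch that clusters the same toy entities in two ways: by embedding similarity and by graph connectivity. The data is made up, KMeans stands in for any semantic clustering, and networkx's Louvain method stands in for the Leiden algorithm that Microsoft GraphRAG uses.

```python
# Contrast: clustering entities by embedding similarity vs. by graph connectivity.
import networkx as nx
import numpy as np
from networkx.algorithms.community import louvain_communities
from sklearn.cluster import KMeans

entities = ["Scrooge", "Bob Cratchit", "Tiny Tim", "Marley", "Fezziwig"]
embeddings = np.random.rand(len(entities), 8)    # placeholder for real embeddings

# (a) Semantic clustering: groups entities whose descriptions are similar,
# regardless of whether they are connected in the extracted graph.
semantic_labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)

# (b) Connectivity-based clustering: groups entities that are densely linked,
# which is what GraphRAG builds its community summaries from.
g = nx.Graph([("Scrooge", "Bob Cratchit"), ("Bob Cratchit", "Tiny Tim"),
              ("Scrooge", "Marley"), ("Scrooge", "Fezziwig")])
graph_communities = louvain_communities(g, seed=42)

print(dict(zip(entities, semantic_labels)))
print(graph_communities)
```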
Yeah, exactly. But you do get some decent top-level clusters that way. Thank you very much.

Thank you.