Thank you. All right, good afternoon everybody, and thank you for joining. Today we are going to talk a bit about how to do query processing for learned sparse models in a more efficient way than what you have seen so far.

My name is Antonio, and this is Ferdinand; we work together at SELTS. Those are our contacts, feel free to reach out. Maybe a few words about SELTS: it is a startup working on enabling AI workflows, LLMs, chatbots, and foundation models in general to access the web as a tool. So, pretty much, web knowledge through an API for any application that needs to run independently, autonomously and, most importantly, reliably. With that, I will pass the mic to Ferdinand.

Hey everyone, hi from me as well. I will give some quick context, just a short introduction to what search is. I am sure most of you are familiar with it, and the previous speaker already gave a nice introduction, but it helps to understand some of the intricacies of what Antonio will talk about later.

The context we are in is ad hoc search. We have some big document collection: for example, a bunch of documents talking about different open-source search frameworks, maybe one random document about Git, and a bunch of other documents. We pass the collection to some processor that gives us impact scores, feed those into an index, and then a query comes in. The index should then give us, in this case, the top three documents: for our query "open source search framework" it should detect that the three framework documents up here are the relevant ones, and that the one about Git probably is not.

What we are going to look at first is this component I call the processor. Originally, and this is what the previous talk had a quick example of, we would use lexical, term-based frequencies. So we have a processor that takes in the documents, and in this example the first document contains the term "Lucene" ten times, and that is simply what we count: term frequencies. Maybe "is" appears in this document 15 times, "a" a bunch of times as well, and we do this per document.
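As a rough sketch of what such a lexical processor boils down to, here is a toy term-frequency counter; the tokenizer and the example text are purely illustrative, and real engines use much smarter analyzers.

```python
from collections import Counter

def term_frequencies(doc: str) -> Counter:
    # Toy whitespace tokenizer; real analyzers also do stemming,
    # stop-word handling, punctuation stripping, and so on.
    return Counter(doc.lower().split())

doc = "lucene is a search library and lucene is fast"
print(term_frequencies(doc))
# Counter({'lucene': 2, 'is': 2, 'a': 1, 'search': 1, 'library': 1, 'and': 1, 'fast': 1})
```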
We could do some more complex things, normalize the counts and so on, but for the sake of this example we simply use these term statistics to evaluate how important a term is, what the impact of each term in our documents is. What we then do is convert this into an inverted index: we use the terms as keys, and for every term we keep a postings list that records which documents contain that term, together with the impact score. So in this case, for the term "Lucene", we had a term frequency of 10 in document 1, and some other document in our collection, say document 17, contains it two times. The list is sorted, and any document not in the list simply does not contain the term. We can use this to search through our corpus quickly: when a new query comes in, we only need to look at the postings lists of the query terms. That is, very roughly, how search works (there is a small sketch of this below).

The nice thing about these term statistics is that they are very fast. We can tokenize documents very quickly, we can index very quickly, and retrieval is very fast; there have been years of optimization and research to make this really quick. And it is actually fairly effective: BM25 is still being used today and is still fairly hard to beat. Even the big neural models are only somewhat better on a few tasks; in the end these term statistics are very effective.

There are downsides, though, which we will talk about in a second and which neural models may address. First, we only have lexical matches: there is no semantic matching, so we can only find documents that actually contain the terms we are searching for. We could do some complex document expansion and so on, but fundamentally we can only work with lexical matches. The second downside is that these scores are a heuristic. BM25 has two parameters with which we can change the scores a bit, but in general it is a hand-crafted formula for assessing the impact of a term; it is not learned, and what we have seen in the past few years is that using models to learn those scores can work a bit better.

So what happened? BM25 is now more than 20 years old. A few things were developed in between, but since BERT was created, a whole zoo of learned sparse models has appeared, with many different variations, and in the end they all do, or try to achieve, a very similar thing.
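Before moving on to those models, here is the minimal sketch of the classic inverted-index setup described above; document contents, scores, and the purely additive query scoring are simplified for illustration.

```python
from collections import defaultdict

# Toy inverted index: term -> postings list of (doc_id, impact_score).
# The scores stand in for term frequencies, BM25 weights, or learned impacts.
index: dict[str, list[tuple[int, float]]] = defaultdict(list)

def add_document(doc_id: int, impacts: dict[str, float]) -> None:
    for term, score in impacts.items():
        index[term].append((doc_id, score))   # doc ids are added in increasing order

def search(query_terms: list[str], k: int = 3) -> list[tuple[int, float]]:
    # Only the postings lists of the query terms are touched.
    scores: dict[int, float] = defaultdict(float)
    for term in query_terms:
        for doc_id, score in index.get(term, []):
            scores[doc_id] += score           # simple additive (dot-product style) scoring
    return sorted(scores.items(), key=lambda item: -item[1])[:k]

add_document(1, {"lucene": 10.0, "is": 15.0, "a": 12.0})
add_document(17, {"lucene": 2.0, "search": 4.0})
print(search(["lucene", "search"]))   # [(1, 10.0), (17, 6.0)]
```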
And I will try to summarize that on this one slide. For the sake of the example, let's look at the first document again, and our processor now does some neural magic in the background. Instead of counting term statistics, we train a model to compute impact scores for the terms in our document. This 13.5, for example, is just a score that the model learned: the relevance, or the impact, of that term in this document. The model is able to learn, for example, that "is" and "a", stop words, or other words that do not encode much semantics, should not get high impact scores, because they do not carry information that is likely to be relevant. And the last part is that these models may also add semantically similar terms that are not actually in the document, so they can help us find documents that do not literally contain the query terms, avoiding the lexical mismatch problem. These are the two advantages that neural sparse models give us: on the one hand, learned impact scores; on the other hand, semantic term expansion. The downside is that they are very slow and expensive; building a web-scale index or similar with these models costs a lot.

The other question is: how fast is this at query time? Can we reuse the index structures and the optimizations that people have built over the last 25 or 30 years? Can we reuse those index structures with these impact scores, which look similar to term statistics; does that work?

Let's look at that. There was a paper from 2021 by Joel Mackenzie and colleagues, and they looked at exactly this question, from the perspective of the distributions: what do the term distributions look like for the different types of models? If we look at BM25 on the MS MARCO passage corpus, there are around 2.6 to 2.7 million unique terms in the index, roughly 30 terms per document and about 5 terms per query, and search through this is very fast, for example using the PISA engine. They then encoded the same corpus with different types of models to see how these statistics change and how the latency changes. DeepImpact, one such model, has a much larger vocabulary: it has many more unique terms, a lot more terms per document, and a bit fewer terms per query. uniCOIL, in contrast, has a very small vocabulary compared to BM25, because it is constrained to the subword token vocabulary of the underlying language model, and it again has many more terms per document and a few more terms per query. Another model, SPLADE, is constrained to the vocabulary of the base model in the same way as uniCOIL, and it adds a lot of terms to both the document and the query. And this is reflected in the latency when you use a plain inverted index in the PISA engine: using these learned impact weights makes things much, much slower.
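To make concrete what these learned representations look like next to plain term statistics, here is a toy comparison; all numbers and expansion terms are invented for illustration, not taken from any particular model.

```python
# Plain lexical statistics for one document: raw term frequencies.
term_frequencies = {"lucene": 10, "is": 15, "a": 12, "search": 4}

# A learned sparse representation of the same document: weights are produced
# by a model rather than counted, stop words end up with near-zero impact,
# and semantically related expansion terms may appear even though they never
# occur in the text. All values below are made up for illustration.
learned_impacts = {
    "lucene": 13.5,
    "search": 9.2,
    "is": 0.1,          # stop word: learned to carry almost no impact
    "a": 0.0,
    "retrieval": 4.7,   # expansion term added by the model
    "index": 3.1,       # expansion term added by the model
}
```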
Even though we are a bit more effective, how can we get to a point where we are still as fast as with the classic term impact scores, while keeping the retrieval effectiveness of these learned impact scores? The conclusion of that paper is, essentially: no, we cannot simply reuse these engines, because the optimizations that were developed for term-frequency distributions just do not work very well for these neural sparse models.

So what strategies can we work with? This is the last part from me before Antonio takes over; I want to give a bit of context on the kinds of things we can do. There are two different strategies we could follow. The first is early exiting, where we simply do not score everything: we try to compute scores only for the documents that we actually need to score. One way to do that, here on the left, is to have, for every single document, a very cheap way to estimate an upper-bound score: the highest score that document could possibly achieve for a given query. Then, if we want, say, the top three documents, we score the documents one by one: the first one gets an actual score of 9.5, the second one an actual score that is a bit lower, and so on. Once we have three scored documents, we know that the fourth and fifth documents have a maximum potential score that is lower than the current minimum score in our top three, so we do not need to score them at all. We can skip them, save that computation, and be done (a small sketch of this pattern follows below).

The other strategy is to score everything, but approximately: we use some faster scoring mechanism that scores every document, just not as accurately. That has the advantage of being very fast; in this example, these are the exact scores, and we approximate them and compute them much more quickly. The problem is that we no longer have exact scoring. Here, for instance, one document should have an exact score of 4.8 and another 6.2, but the approximation swaps their order, so you are no longer sure you are getting the exact top k. That is the trade-off, but you will probably be a lot quicker.

So these are the two strategies we can follow, or, more generally, that is where the potential savings are, and this is where Antonio jumps in and talks about BMP.

Yes, so following the two approaches that Ferdinand just described, we thought about coming up with a completely different approach, one that is not actually based on inverted indexes, or at least not in the way we are used to. First of all, we left behind the mental model of every document being indexed in an inverted index, into postings lists, and we started thinking about vectors, which is a very common pattern in dense retrieval: dense vectors, essentially a sequence of floating-point values.
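Before getting into how BMP organizes those vectors, here is the small sketch of the early-exiting pattern described above, since the same skip-by-upper-bound logic reappears at the block level later on. All scores and upper bounds are invented, and in a real engine the bounds come from precomputed per-term maxima rather than a lookup table.

```python
import heapq

def top_k_with_early_exit(doc_ids, upper_bound, exact_score, k=3):
    # Visit documents in decreasing order of their cheap upper-bound estimate.
    ordered = sorted(doc_ids, key=upper_bound, reverse=True)
    heap = []                                   # min-heap of (score, doc_id), size <= k
    for doc_id in ordered:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if upper_bound(doc_id) <= threshold:
            break                               # nothing left can enter the top-k: early exit
        score = exact_score(doc_id)             # the expensive step we try to avoid
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

exact = {"d1": 9.5, "d2": 8.1, "d3": 7.7, "d4": 3.2, "d5": 2.9}   # made-up exact scores
bounds = {"d1": 9.9, "d2": 9.0, "d3": 8.0, "d4": 4.0, "d5": 3.0}  # made-up upper bounds
print(top_k_with_early_exit(list(exact), bounds.get, exact.get, k=3))
# [(9.5, 'd1'), (8.1, 'd2'), (7.7, 'd3')]  -- d4 and d5 are never scored
```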
Now, the vectors in our case are different from those dense vectors: they are sparse vectors. What does that mean? It means that not all positions in these vectors hold non-zero values. For a sparse, lexical model, the sparse vector basically indicates which tokens out of the original lexicon, the original vocabulary, so all the unique terms you see in the corpus, are actually present. As you can see in the example, "I love New York" contains only three terms, so only three positions of the corresponding sparse vector are set with a value, and those values are generated by a learned sparse model in this specific case.

So now, instead of postings lists, we have vectors, and we thought a bit about how to make this more similar to the standard vector retrieval process you see in vector databases. The idea is basically to keep the collection of documents as a concatenation of these sparse vectors, but split into blocks: in the figure there are three blocks, and each block contains three vectors.

When you receive a query, you also turn it into a sparse vector, by setting the positions of the terms contained in the original query text, and you could, in principle, compute the score of every document in the collection by doing a simple dot product between the query vector and each document vector. But if you do that you are not saving any time, because you are essentially scanning the entire collection, the entire representation.

So the idea was: let's create some upper bounds, scores that tell us whether we actually need to enter a block and score it exhaustively or not. What we did was add an additional data structure, a vector of maxes. For every position, we take the maximum value across all the vectors in the block, in this case three vectors, and we represent that as a single vector of max scores per block. Now, if your block is big enough, say it contains 8 vectors, or 16, 32, 64 and so on, you are reducing the number of vectors against which you have to compute this first dot product by a factor of the block size, which literally makes this stage of query processing that many times faster.

Running the dot product against the vectors of maxes is, first of all, a very fast operation, because there are far fewer vectors, and at the same time it gives you an upper bound, a good estimate of what is going to be inside each block. You end up with what is essentially a priority list, and you can use it to jump around between blocks and say: I am going to score this one, but I am not going to score that one, which brings us back to the early termination approach Ferdinand was describing.
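Here is a compact sketch of that block-max idea: a toy re-implementation of the concept rather than the actual BMP code, with dense NumPy matrices, made-up data, and an arbitrary block size. The dot-product upper bound relies on the query and impact weights being non-negative, which holds for these models.

```python
import heapq
import numpy as np

def build_block_maxes(doc_vectors: np.ndarray, block_size: int) -> np.ndarray:
    # doc_vectors: (num_docs, vocab_size); stored dense here purely for readability.
    blocks = [doc_vectors[i:i + block_size] for i in range(0, len(doc_vectors), block_size)]
    # One "vector of maxes" per block: element-wise max over the block's document vectors.
    return np.stack([block.max(axis=0) for block in blocks])

def block_max_search(query, doc_vectors, block_maxes, block_size, k):
    bounds = block_maxes @ query                  # upper bound on any score inside each block
    heap = []                                     # min-heap of (score, doc_id)
    for b in np.argsort(-bounds):                 # visit the most promising blocks first
        if len(heap) == k and bounds[b] <= heap[0][0]:
            break                                 # no remaining block can improve the top-k
        start = int(b) * block_size
        for i, score in enumerate(doc_vectors[start:start + block_size] @ query):
            if len(heap) < k:                     # exact scoring only inside entered blocks
                heapq.heappush(heap, (float(score), start + i))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (float(score), start + i))
    return sorted(heap, reverse=True)

rng = np.random.default_rng(0)
docs = rng.random((12, 6)) * (rng.random((12, 6)) < 0.3)   # 12 sparse-ish document vectors
query = np.array([0.0, 1.0, 0.0, 2.0, 0.0, 0.0])
maxes = build_block_maxes(docs, block_size=4)
print(block_max_search(query, docs, maxes, block_size=4, k=3))
```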
To do that, and to decide what you are not going to score, you maintain a top-k heap, a data structure that keeps the k most valuable elements seen so far, and you always compare against the least important element, the one with the lowest score.

To make things simpler and faster we also introduced quantization: instead of keeping a sparse vector of floating-point values, we use 8 bits for every element, standard linear quantization, which makes things much faster. And it was also useful for another reason, which I am going to introduce right away: sorting. As I mentioned, we have a priority list, a list of upper-bound scores telling us which block is the most promising one to access first. You want to access blocks in decreasing order of these scores, so you literally want to sort them. The problem is that when you operate at the scale of millions of blocks, even sorting millions of values can be expensive and slow; traditional comparison sorting, even done in place, is still too slow.

So we thought about using a bucket sort, and this is possible for one reason: you have a predefined set of buckets. With 8-bit quantization there are 256 possible values per element, and since you know you are summing up at most a bounded number of terms, you know the array of buckets grows to something on the order of 2^16 entries, which is manageable. The way it works is very simple: you iterate over the upper bounds and place each one in the bucket corresponding to its score. Because the scores are quantized, you know exactly which array index to access, and each bucket is simply a list you can append to. Returning the block IDs in order then just means iterating over the bucket array and materializing the final ordering. For this specific use case this is much faster than traditional sorting, and it has some real-world advantages: very few cache misses, no branch misprediction, and it is allocation-free in practice, because you can keep the bucket array around and only clear it between queries, instead of reallocating it every time.
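A minimal sketch of that bucket sort over quantized upper-bound scores; for brevity a fresh bucket array is built on every call, whereas, as described above, in practice it is kept around and only cleared between queries.

```python
def blocks_by_decreasing_bound(quantized_bounds: list[int], num_buckets: int) -> list[int]:
    # quantized_bounds[b] is the integer upper-bound score of block b,
    # guaranteed by the quantization scheme to lie in [0, num_buckets).
    buckets: list[list[int]] = [[] for _ in range(num_buckets)]
    for block_id, score in enumerate(quantized_bounds):
        buckets[score].append(block_id)            # O(1) placement, no comparisons
    order: list[int] = []
    for score in range(num_buckets - 1, -1, -1):   # walk the buckets from high to low
        order.extend(buckets[score])
    return order                                   # block ids, most promising first

print(blocks_by_decreasing_bound([3, 7, 0, 7, 2], num_buckets=8))   # -> [1, 3, 0, 4, 2]
```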
There is another important aspect: we are dividing our vectors into blocks, and the way you assign a vector to a block makes a difference for the speed of the algorithm. If you manage to pack similar documents into the same block, those documents are likely to be answers to the same queries, so clustering documents makes a difference in how many "live" blocks you end up having to touch in your collection. That basically means clustering so as to maximize the sparsity of the vectors of maxes, which, in simple words, means clustering together documents that share common terms. The plot shows the difference between a random assignment of the sparse vectors to blocks and a curated assignment, that is, a document reordering technique where you try to place similar documents next to each other; you could also do this with k-means or other similar clustering approaches. By doing that you squeeze together the similar sparse vectors, which are likely to end up in the same search results.

Comparing against the baselines, we saw that the early termination techniques that are popular these days and present in many search engines, such as MaxScore and Block-Max WAND, do not work particularly well with learned sparse models, while the reduction in latency we observed with BMP was quite substantial. With that, I will pass it back to Ferdinand for the rest of the talk.

Yeah, thank you. I am going to quickly show how you might use this yourself, and also get to why Rust is in the title of the talk. So, how do you use this? BMP is on GitHub; I think there is a QR code for it at the very end of the slides, you will definitely find it. There is a Rust and a Python interface. The example here shows how to use it from Python: you can either use batch inference, where you give it an index and some queries and get the top k back, or you can create a searcher, give it an index, and then search query by query. So there is a batch and a single-query interface (sketched further below).

The next question is: how do you get this index, how do you create the BMP index? The answer to that is CIFF, in case you have not heard of it before: CIFF is the Common Index File Format. It is meant to be an interface between different types of indexes, a unified format where, roughly speaking, we have some postings and postings lists and we store everything in one common layout. It was defined as a community initiative, and there are a couple of tools to convert between the different formats. For example, you can convert back and forth between the PISA research engine and CIFF, there is a converter from JSONL files, which are pretty much just dictionaries, to CIFF, and there is also a tool that goes from CIFF to BMP. So if you want to try it out yourself, probably the easiest path is: take a JSONL file, convert it to CIFF, and convert that to a BMP index.
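Putting the pieces together, once such an index exists, using it from Python follows the two patterns mentioned a moment ago. The sketch below is hypothetical: the module, class, and function names are assumptions made for illustration, and the actual API is documented in the BMP repository.

```python
import bmp   # assumption: the Python bindings are exposed under this module name

queries = ["open source search framework", "git basics"]

# Batch interface (hypothetical signature): hand over an index and many queries at once.
results = bmp.search_batch(index="msmarco.bmp", queries=queries, k=10)

# Single-query interface (hypothetical names): load the index once, then query repeatedly.
searcher = bmp.Searcher("msmarco.bmp")
hits = searcher.search("open source search framework", k=10)
```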
CIFF comes from a SIGIR 2020 paper by Jimmy Lin and colleagues; feel free to give it a read. And how does the conversion work? There is a CIFF-to-BMP tool: you just give it the CIFF index file and it outputs a BMP index. How might you use this without having to create your own CIFF index? There is a CIFF hub that already hosts a couple of different indexes, which is handy if you are doing research; the MS MARCO passage index is already there, for example. It is a repository of CIFF indexes and queries for several popular sparse models, so you can quickly download one and run it. And here is some further reading, if you are interested; I will leave this up for a second in case anyone wants to take a picture. Otherwise, we have about two minutes left, so maybe I will just show a short code demo to prove that this actually works.

I need to find the correct VS Code window, there it is; let me zoom in a bit and close all of these. Okay, so this is just a quick example script. There are some curl commands to download the CIFF index from the CIFF hub, which is hosted on Hugging Face; the next command calls the CIFF-to-BMP converter to turn that index into a BMP index; and the last one is an example script that uses the BMP Rust interface, via Cargo, to run the queries. At the end we use trec_eval; this runs the TREC 2019 Deep Learning queries. I ran this a couple of minutes ago, and as you can see it runs through, grabs the whole collection, converts it to BMP, and in the end it takes about 24 seconds on my small laptop, running on battery, to go through these 43 queries, and we actually get results at the end. So it works; you can use it at home if you are doing research or something similar, it is always a good baseline, feel free to use it. Thank you.

The QR code is gone now, but the slides will be on pretalx so you can read them yourself, and the video is being recorded and should be online by the end of next week. We will take questions now; here you go, and please repeat the question for the video.

Okay, so to repeat the question, and correct me if I misinterpret it: why does DeepImpact have such a different distribution of terms compared to the other sparse models? Antonio can probably hop in later as well, but most of the optimizations that engines have built rely on early exiting, and DeepImpact is built on a very big vocabulary, so you get very, very sparse vectors. That lets you estimate the max scores really well: because almost everything is zero, you get very good upper-bound estimates for each particular document.
In this case, especially for uniCOIL and SPLADE, you lose a lot of that sparsity: instead of roughly 3 million dimensions you only have about 30,000, but still something like 230 of them are non-zero per document. So the max scores you estimate per document are much higher, and it becomes very difficult to find good early exits; that is how I would intuitively put it. Does that make sense? You are probably not doing a full linear scan, but you do need to score a lot more documents.

[Inaudible audience question about terms with similar spellings, synonyms, and whether the lexicon is lost when you switch to this sparse vector representation.]

Let me repeat the question, and tell me if I understood it correctly: you are asking what happens with terms that have similar spellings or are synonyms, words that are typically used in the same context, and whether using this sparse vectorization means you lose that information. No, the lexicon is still present, so you still have the original terms. The only thing is that you are not repeating them in the representation; you create the array once, and you are basically converting each term into an ID. I think we will have to take this offline if you want to chat more about how the internals work, but yes, we keep the terms, we do not lose them; they are part of the representation.

Okay. Thank you very much.