WEBVTT 00:00.000 --> 00:08.000 All right folks, we are ready to get started again. 00:08.000 --> 00:12.000 Carol Chen is here to present "Get your docs in a row with Docling". 00:12.000 --> 00:14.000 Let's give a hand to Carol. 00:14.000 --> 00:15.000 Thank you, Daniel. 00:15.000 --> 00:17.000 Hi everyone. 00:17.000 --> 00:22.000 Thanks for being here today. 00:22.000 --> 00:26.000 I think this is my 10th FOSDEM, perhaps. 00:26.000 --> 00:30.000 I've been coming almost every year since 2013. 00:30.000 --> 00:37.000 I'm always learning something new from people, and about new developments in projects. 00:37.000 --> 00:40.000 And new faces, old faces. 00:40.000 --> 00:44.000 I'm based in Finland, and a lot of the people who live in Finland, 00:44.000 --> 00:47.000 I only see them once a year here in Brussels. 00:47.000 --> 00:49.000 Yeah, it's good to be back. 00:49.000 --> 00:53.000 Anyway, today I'm going to talk about this project called Docling. 00:53.000 --> 00:56.000 It's an open source project, obviously. 00:56.000 --> 01:03.000 And it's about processing, digesting and parsing documentation while keeping the meaning of it. 01:03.000 --> 01:10.000 And being able to do that consistently with different kinds of documentation, 01:10.000 --> 01:12.000 document formats. 01:12.000 --> 01:22.000 And doing this "in a row", in a really reliable, organized way, is kind of my take on it from when I was coming up with the proposal for this talk. 01:22.000 --> 01:24.000 So my name is Carol Chen. 01:24.000 --> 01:26.000 I'm from Red Hat. 01:26.000 --> 01:28.000 This is an open source project. 01:28.000 --> 01:31.000 It's actually part of the Linux Foundation. 01:31.000 --> 01:35.000 And I'm just, you know, part of the community. 01:35.000 --> 01:36.000 I really like this project. 01:36.000 --> 01:39.000 And I would like to share that with people. 01:39.000 --> 01:42.000 Before I start, I would say I'm not a technical writer. 01:42.000 --> 01:45.000 I'm not a documentation subject matter expert. 01:45.000 --> 01:51.000 You know, people like Daniel and others speaking in this room 01:51.000 --> 01:54.000 have a lot more experience in a lot of these things. 01:54.000 --> 02:00.000 So apologies in advance if I get some terminology or concepts wrong. 02:00.000 --> 02:05.000 And feel free to give me feedback on the Fediverse: 02:05.000 --> 02:08.000 I'm on Mastodon, or on Matrix. 02:08.000 --> 02:12.000 And, you know, LinkedIn, if you want to reach me there. 02:12.000 --> 02:20.000 You can also send me a PDF or write me feedback in a document format, 02:20.000 --> 02:23.000 a DOCX or HTML. 02:23.000 --> 02:24.000 Why? 02:24.000 --> 02:32.000 Because Docling can help me process, parse and make sense of information from different document formats. 02:32.000 --> 02:35.000 So here's the Docling project. 02:35.000 --> 02:38.000 The key thing is it can parse different formats, 02:38.000 --> 02:40.000 like I just mentioned. 02:40.000 --> 02:46.000 And one of the key things is the advanced parsing of PDFs. 02:46.000 --> 02:52.000 We talk a lot about, you know, the different ways to represent information and documentation. 02:52.000 --> 02:56.000 A lot of it is about how it looks visually. 02:56.000 --> 02:58.000 Like Markdown, markup, 02:58.000 --> 02:59.000 whatever. 02:59.000 --> 03:01.000 You're looking at these headers. 03:01.000 --> 03:07.000 There's like paragraphs, sections, tables, diagrams and stuff like that.
03:07.000 --> 03:15.000 PDF is one of those formats that can encapsulate all of that and reproduce it reliably across different platforms. 03:15.000 --> 03:22.000 You know, I made a copy of this presentation in PDF just in case I couldn't present it on my own laptop 03:22.000 --> 03:24.000 and we had to use some other laptop. 03:24.000 --> 03:28.000 So a lot of times there's a lot of information in there. 03:28.000 --> 03:33.000 It's great for humans to read, to digest, to understand. 03:33.000 --> 03:38.000 But for machines, you know, it's really quite challenging to parse PDFs. 03:39.000 --> 03:54.000 So Docling has this whole processing pipeline that makes use of certain AI models to do a very thorough and accurate parsing of PDFs. 03:54.000 --> 04:01.000 But besides that, there's also, added since the project started, not right away from the start, 04:01.000 --> 04:08.000 support for things like images and audio files, with OCR and ASR. 04:08.000 --> 04:11.000 So there are different pipelines for the different file formats. 04:11.000 --> 04:13.000 So it's multimodal. 04:13.000 --> 04:22.000 And one of the key concepts for Docling is this unified, expressive DoclingDocument representation format, 04:22.000 --> 04:30.000 where, no matter what kind of input you give it, it's able to capture and preserve the meaning behind 04:30.000 --> 04:43.000 the document. So that, especially when you want to translate it or convert it to some other format, the meaning of the document is not lost. 04:43.000 --> 04:52.000 A lot of times when you do conversion or parsing, it's one to one, maybe PDF to HTML or PDF to Markdown. 04:52.000 --> 04:58.000 But, you know, sometimes going through the process from one to the other, you lose some information or data. 04:58.000 --> 05:10.000 So by converting into an intermediate unified format, it's able to preserve that, and you can then use it for outputting to other different formats. 05:10.000 --> 05:16.000 I'll get into a little bit more of that in the coming slides. 05:16.000 --> 05:19.000 Another key thing is the local execution. 05:19.000 --> 05:24.000 A lot of, let's say, converters I've used myself, 05:24.000 --> 05:29.000 you know, you have to upload the file somewhere to some cloud. 05:29.000 --> 05:37.000 And I guess it's fine if you're working on documentation for an open source project; most of the things are in the open. 05:37.000 --> 05:45.000 But sometimes you might just want to work with something that's a little bit more sensitive, or, you know, before a project release, 05:45.000 --> 05:51.000 there's something that you don't want to get leaked or anything like that. 05:51.000 --> 05:57.000 And, you know, can you trust all the different cloud services or SaaS out there, right? 05:57.000 --> 06:07.000 So instead, you can download everything on your computer, on your laptop or on-prem, and execute it locally. 06:07.000 --> 06:12.000 So, yeah, I'll get into some of these later on. 06:12.000 --> 06:14.000 And it's a Python package. 06:14.000 --> 06:18.000 You just pip install docling. 06:18.000 --> 06:26.000 And there's also a CLI version, and of course you can also include it as a Python library in another app. 06:26.000 --> 06:28.000 So it's very versatile. 06:28.000 --> 06:33.000 Let's look a bit more at the project. 06:33.000 --> 06:36.000 I'll try to help promote it.
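For reference, here is a minimal sketch of the basic usage just described, based on the upstream docling README; the file name "report.pdf" is just a placeholder.

```python
# Minimal sketch: convert a document locally with docling and print Markdown.
# Assumes `pip install docling`; "report.pdf" is a placeholder path.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()           # downloads the models on first run, then works offline
result = converter.convert("report.pdf")  # also accepts DOCX, HTML, images, URLs, ...
print(result.document.export_to_markdown())
```

The CLI equivalent is along the lines of `docling report.pdf`, with options to pick the output format.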
06:36.000 --> 06:42.000 And it started, I think, near the end of 2024. 06:42.000 --> 06:48.000 So slightly more than a year ago, around right here, you know, 06:48.000 --> 06:55.000 November 2024, it started a big jump in, how to say, popularity. 06:55.000 --> 06:58.000 And it was, for a while, trending on GitHub. 06:58.000 --> 07:02.000 And I know the Docling team would love it 07:02.000 --> 07:06.000 if I say this: please, you know, like and subscribe. 07:06.000 --> 07:10.000 I mean, like and follow the GitHub repos. 07:10.000 --> 07:14.000 And, you know, lots of popularity and lots of contributors. 07:14.000 --> 07:17.000 It's not just about using it, right? 07:17.000 --> 07:20.000 People find it useful and want to add stuff to it. 07:20.000 --> 07:25.000 So we have a very active community as well behind this project. 07:25.000 --> 07:29.000 And, yeah, I was just looking at PyPI. 07:29.000 --> 07:32.000 The download numbers are also pretty amazing. 07:32.000 --> 07:37.000 So take a look at github.com/docling-project. 07:37.000 --> 07:43.000 And there are, of course, a lot of repos under that organization. 07:43.000 --> 07:49.000 Now, let's take a step back and imagine that you are reading a paper 07:49.000 --> 07:53.000 comparing the English language and the Finnish language. 07:53.000 --> 07:56.000 I don't know why you would want to do that, but let's just go with it. 07:56.000 --> 08:06.000 So, we know that the English language has, sorry, the English alphabet has 26 letters, right? 08:06.000 --> 08:08.000 The Finnish one has 29. 08:08.000 --> 08:14.000 So if you, you know, know the languages, that's probably not very surprising. 08:14.000 --> 08:21.000 There's this idea of a pangram, which is like the shortest sentence you can form 08:21.000 --> 08:25.000 that contains all the letters in the alphabet. 08:25.000 --> 08:31.000 And I think many of us also know that one. So the paper goes, right: 08:31.000 --> 08:38.000 an example of a short sentence that contains all 26 letters of the English alphabet 08:38.000 --> 08:46.000 is, quote, "the quick brown fox jumps over"... page 6, horizontal line, footnote: 08:46.000 --> 08:52.000 oh, here's a study about how climate change may impact the speed of brown foxes. 08:52.000 --> 08:57.000 Page 7: "...the lazy dog", end quote. 08:57.000 --> 08:59.000 Wait a minute. 08:59.000 --> 09:05.000 That wasn't quite a short sentence, and I'm pretty sure it contains all 26 letters, 09:05.000 --> 09:09.000 but don't quote me on that. 09:09.000 --> 09:12.000 You wouldn't quote the paper like that, right? 09:12.000 --> 09:16.000 Of course, it's also a fake scenario, because there's no such paper yet. 09:16.000 --> 09:21.000 And if I ever get to write it, if I get my Finnish, you know, improved enough to write that, 09:21.000 --> 09:26.000 I will make sure to have that page break right before "the lazy dog". 09:26.000 --> 09:34.000 But there are actually real cases of stuff like that happening, you know, right? 09:34.000 --> 09:38.000 Now, this is actually a paper from the 50s. 09:38.000 --> 09:42.000 A 1959 article, yeah, it's out there. 09:42.000 --> 09:47.000 And it talks about, well, I have no idea what it talks about, but there's this part. 09:48.000 --> 09:52.000 It's actually in two columns, and reading straight across it says "vegetative", "electron", 09:52.000 --> 09:55.000 "microscopy", okay, whatever, right?
09:55.000 --> 10:02.000 But those words are actually not, you know, in the same paragraph. Yet, you know, half a century later, 10:02.000 --> 10:08.000 a bunch of scientific research papers actually quote the study and say, 10:08.000 --> 10:14.000 oh, Fourier transform infrared spectroscopy, vegetative 10:14.000 --> 10:17.000 electron microscopy, blah, blah, blah, okay? 10:17.000 --> 10:19.000 Now, that's not the only one. 10:19.000 --> 10:24.000 If you search on the Google Scholar site, there are, like, 10:24.000 --> 10:27.000 something like 112 papers citing it. 10:27.000 --> 10:37.000 So we can see how parsing wrong information from documents can have an adverse effect. 10:37.000 --> 10:41.000 That may be a bit of an extreme case. 10:41.000 --> 10:46.000 And sure, you know, probably most parsers now can handle something like a two-column layout. 10:46.000 --> 10:49.000 Or maybe not, I don't know, because when you have images and graphs, 10:49.000 --> 10:54.000 it can also cause more confusion. 10:54.000 --> 11:02.000 When we use Docling on the same paper, we see that it accurately detects and 11:02.000 --> 11:05.000 understands the layout, the two columns. 11:05.000 --> 11:10.000 And, you know, it parsed it correctly: what happens to the vegetative cell wall 11:10.000 --> 11:16.000 when the spores are released, blah, blah, blah, and then, separately, effects by means of 11:16.000 --> 11:18.000 electron microscopy, blah, blah, blah. 11:18.000 --> 11:23.000 So, again, having the right understanding of the PDF document 11:23.000 --> 11:27.000 made a big difference. 11:27.000 --> 11:35.000 It's one thing to get the meaning wrong; it's another thing to just lose content 11:35.000 --> 11:36.000 altogether. 11:36.000 --> 11:43.000 A lot of basic parsers probably, you know, like in the quick brown fox case, 11:43.000 --> 11:49.000 put the header and the footer and all that kind of information together 11:49.000 --> 11:55.000 with the content, I mean the main body of the paper. 11:55.000 --> 12:02.000 So the meaning is kind of polluted or diluted by these undesired 12:02.000 --> 12:09.000 page headers. Tables become like a list of numbers that has no relation 12:09.000 --> 12:10.000 to each other. 12:10.000 --> 12:14.000 You don't understand what this list of numbers means. 12:14.000 --> 12:17.000 Whole images are missing, and line wraps: 12:17.000 --> 12:23.000 it's just chunks of lines, broken, and there's no proper wrapping. 12:23.000 --> 12:29.000 So, you know, when you then parse those lines, you lose that context 12:29.000 --> 12:36.000 and you may, you know, break them in the wrong places and stuff like that. 12:36.000 --> 12:39.000 So, Docling to the rescue. 12:39.000 --> 12:40.000 Markdown. 12:40.000 --> 12:41.000 I love Markdown. 12:41.000 --> 12:48.000 It's one of those formats that's easily understood, whether it's, yeah, 12:48.000 --> 12:52.000 the raw text with its formatting markers, as well as, of course, 12:52.000 --> 12:57.000 the very clear and structured output you can then render from it. 12:57.000 --> 13:04.000 So, in that sense, Markdown is usually not the problematic one, right? 13:04.000 --> 13:09.000 But, again, like I was saying, PDF tends to be one of those formats that, 13:09.000 --> 13:15.000 even though it's not that easy to create or whatever, 13:15.000 --> 13:23.000 does capture that visual representation with diagrams and tables and stuff like that.
13:23.000 --> 13:28.000 But then once you have that, going in reverse is really difficult. 13:28.000 --> 13:36.000 So when Docling first started, they created this PDF pipeline to address that specifically. 13:36.000 --> 13:40.000 Markdown and HTML, those are a little bit easier. 13:40.000 --> 13:47.000 They have their own kind of pipeline and a bit more straightforward parsing. 13:47.000 --> 13:48.000 But PDFs... 13:48.000 --> 13:53.000 First, you may have to use OCR to scan a whole page, 13:53.000 --> 13:58.000 find out where the text is, where the pictures and other information are, whatever. 13:58.000 --> 14:05.000 And there are actually two key parts of the process here. Layout analysis: 14:05.000 --> 14:08.000 so, again, like the column flow, right? 14:08.000 --> 14:10.000 Is it two columns, three columns? 14:10.000 --> 14:15.000 You know, when you have a line under a diagram, 14:15.000 --> 14:20.000 is that part of the paragraph below, or a caption of the image above, that kind of thing. 14:20.000 --> 14:25.000 And table structure: tables seem simple, yet not really, 14:25.000 --> 14:30.000 because you can have, like, cells that, you know, 14:30.000 --> 14:33.000 span across multiple columns and rows. 14:33.000 --> 14:36.000 You can have tables within a table, and so on. 14:36.000 --> 14:46.000 So there are actually two small models that they use to do these jobs specifically. 14:46.000 --> 14:53.000 Instead of throwing the document into a large LLM, you know, 14:53.000 --> 14:55.000 a 2B, 7B model, whatever, right? 14:55.000 --> 15:00.000 Sure, they can probably do a very decent kind of general 15:00.000 --> 15:05.000 parsing of that, but maybe it's like 70-80% accurate. 15:05.000 --> 15:09.000 For maybe some simple PDFs that's fine, 15:09.000 --> 15:12.000 and, you know, you're happy with the results, 15:12.000 --> 15:17.000 but for more specific use cases, you want models 15:17.000 --> 15:20.000 that are designed to do that specific task. 15:20.000 --> 15:24.000 So these are small, tiny models, 15:24.000 --> 15:29.000 if I remember correctly, like 40-something million parameters, rather than billions. 15:29.000 --> 15:34.000 So you download them on your laptop, you can process everything locally, 15:34.000 --> 15:38.000 like I said. And because they are trained to just do one thing, 15:38.000 --> 15:44.000 one just to do layout analysis and the other just to do table structure, 15:44.000 --> 15:47.000 they are very good at those tasks. 15:47.000 --> 15:50.000 They're good at nothing else, but those two specific tasks 15:50.000 --> 15:53.000 they are able to do. 15:54.000 --> 15:58.000 Later on, the team also developed this VLM pipeline, 15:58.000 --> 16:03.000 which uses a slightly bigger model, like 258 million parameters, 16:03.000 --> 16:08.000 but still small compared to most LLMs or foundation models out there. 16:08.000 --> 16:14.000 So this VLM pipeline takes it a step further; 16:14.000 --> 16:18.000 it is designed to do all these steps in one go, 16:18.000 --> 16:21.000 so you don't have to go through the whole multi-stage thing, 16:21.000 --> 16:29.000 and it can be more efficient in the understanding and the parsing of the PDF document.
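As a rough sketch of how you might opt into that VLM pipeline in code: the import paths and the pipeline_cls option follow the upstream docling examples as I recall them, so treat them as assumptions that may shift between versions.

```python
# Sketch, not authoritative: select the VLM pipeline for PDF conversion instead of
# the default multi-stage pipeline. Import paths and option names are assumed from
# the upstream docling examples and may differ in your docling version.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline)}
)
doc = converter.convert("paper.pdf").document  # "paper.pdf" is a placeholder
```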
16:29.000 --> 16:35.000 It does take more resources, but it also produces 16:35.000 --> 16:39.000 more accurate results. And you know, if you can get something back 16:39.000 --> 16:44.000 that's like 92, 95% accurate, sure, it's never 100%, 16:44.000 --> 16:49.000 but you spend a lot less time; we always want to double-check the work, 16:49.000 --> 16:53.000 and we don't 100% trust the AI model. 16:53.000 --> 16:57.000 But with a model like this, you can save a lot of time 16:57.000 --> 17:02.000 in that kind of follow-up checking work compared to general models. 17:02.000 --> 17:07.000 So, again, this is the intermediate document format 17:07.000 --> 17:12.000 that I talked about, which can then be exported to different formats. 17:12.000 --> 17:14.000 Oh, my goodness, I'm taking too much time. 17:14.000 --> 17:21.000 Okay, so when I first heard about this DoclingDocument format, 17:21.000 --> 17:24.000 I was like, you know, yeah, another one. 17:24.000 --> 17:27.000 But the key thing is, like I said, you want something 17:27.000 --> 17:32.000 that gives you an almost lossless representation of that information and data, 17:32.000 --> 17:34.000 so that you are able to export it out to 17:34.000 --> 17:37.000 Markdown, to HTML, to JSON, and different ones. 17:37.000 --> 17:42.000 And one of the interesting things that 17:42.000 --> 17:46.000 Docling wanted to solve was parsing for 17:46.000 --> 17:51.000 AI models, to train, to fine-tune, for RAG, and stuff like that. 17:51.000 --> 17:54.000 So I'll talk about chunking a little bit. 17:54.000 --> 17:56.000 So, these are the input formats. 17:56.000 --> 17:58.000 I think the list is growing; every time you look 17:58.000 --> 18:02.000 there's something new supported, as well as the output formats. 18:02.000 --> 18:03.000 And that's the thing, right? 18:03.000 --> 18:07.000 It's an open source project, so 18:07.000 --> 18:09.000 if there are some formats that you care about, 18:10.000 --> 18:15.000 because of the standard document format 18:15.000 --> 18:19.000 you can then, you know, write your own kind of input converter for 18:19.000 --> 18:21.000 that, and then also the output. 18:21.000 --> 18:26.000 So the updated list is probably at this link, 18:26.000 --> 18:30.000 but this was what I grabbed maybe a week or two ago. 18:30.000 --> 18:36.000 There are a lot of technical details about the document format. 18:36.000 --> 18:41.000 Again, you can see them from the link, but basically it expresses 18:41.000 --> 18:48.000 a lot of the things that, a lot of times, are not easy to capture 18:48.000 --> 18:54.000 with just simple, how to say, plain formats. 18:54.000 --> 18:57.000 Like, you preserve that: okay, we know this is text, 18:57.000 --> 18:59.000 we know this is a table, and how it relates to things. 18:59.000 --> 19:01.000 And also the hierarchy: 19:01.000 --> 19:05.000 we know H2 is under H1, 19:05.000 --> 19:10.000 and so on. A lot of parsers may not be able to capture that. 19:10.000 --> 19:15.000 So it does that, and then, again, like I said, 19:15.000 --> 19:20.000 there are APIs to build the document from scratch and work with different formats.
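To make the "one intermediate format, many outputs" idea concrete, here is a small sketch using the DoclingDocument export helpers; the method names are taken from the docling documentation, but verify them against the version you have installed.

```python
# Sketch: one DoclingDocument, several export targets.
# export_to_markdown / export_to_html / export_to_dict are documented DoclingDocument
# methods; double-check the exact names against your installed docling version.
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("paper.pdf").document  # "paper.pdf" is a placeholder

md = doc.export_to_markdown()   # Markdown, good for humans and LLM prompts
html = doc.export_to_html()     # HTML rendering of the same structure
data = doc.export_to_dict()     # lossless dict/JSON form of the DoclingDocument
```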
19:20.000 --> 19:24.000 Like I said, there is a CLI you can use, 19:24.000 --> 19:26.000 but I also wanted to introduce this Docling UI project, 19:26.000 --> 19:31.000 actually done by somebody in my team at Red Hat, 19:32.000 --> 19:34.000 because we were playing around with Docling, 19:34.000 --> 19:39.000 and I was like, it would be nice to have a simple UI, 19:39.000 --> 19:42.000 a GUI interface where we can, you know, try things out, 19:42.000 --> 19:46.000 instead of trying to check all the different CLI parameters 19:46.000 --> 19:47.000 and configurations. 19:47.000 --> 19:51.000 So if you go to the Docling UI site, 19:51.000 --> 19:55.000 you can find out how to use this. 19:55.000 --> 19:59.000 And I kept this nonsensical thing here, 19:59.000 --> 20:03.000 because I fell asleep at 2am updating the slides. 20:03.000 --> 20:07.000 I don't know if it was my hand or my face that was on the laptop, 20:07.000 --> 20:10.000 and I thought, well, this is an artifact of FOSDEM. 20:10.000 --> 20:14.000 I'll feed these to Docling and see what it says about that. 20:14.000 --> 20:16.000 But that's for next time. 20:16.000 --> 20:20.000 I was just chatting with my teammate yesterday. 20:20.000 --> 20:23.000 He said there's a latest release, which, of course, I had to install, 20:23.000 --> 20:27.000 and it supports internationalization. 20:27.000 --> 20:30.000 So you can see, now it has this, 20:30.000 --> 20:33.000 I think German and French, yeah. 20:33.000 --> 20:35.000 And I think a couple of others. 20:35.000 --> 20:38.000 So, let's see, we can just run the demo. 20:38.000 --> 20:42.000 Actually, I already ran the demo, 20:42.000 --> 20:44.000 because it takes a couple of minutes. 20:44.000 --> 20:47.000 So I wasn't going to let it run now, 20:47.000 --> 20:52.000 but it is running live on my laptop. 20:52.000 --> 20:55.000 And I also tested this offline. 20:56.000 --> 20:58.000 Right now it's connected to the FOSDEM Wi-Fi, 20:58.000 --> 21:01.000 but I tested it without, just to make sure it was working. 21:01.000 --> 21:04.000 And again, just to prove that it works locally, 21:04.000 --> 21:07.000 you don't need an internet connection after you download all the models. 21:07.000 --> 21:10.000 You can perform everything on your machine. 21:10.000 --> 21:13.000 So this converted, I had, like, 21:13.000 --> 21:18.000 some free, sensible ebook that I just downloaded 21:18.000 --> 21:22.000 in PDF format, extracted images from it, 21:23.000 --> 21:25.000 and understood all the tables 21:25.000 --> 21:29.000 that were in the PDF. And chunking. 21:29.000 --> 21:30.000 Right. 21:30.000 --> 21:35.000 Most AI models, when they take input to, you know, 21:35.000 --> 21:37.000 do RAG or do whatever, 21:37.000 --> 21:40.000 have a context window and, you know, 21:40.000 --> 21:42.000 a limited number of tokens. 21:42.000 --> 21:44.000 The most primitive chunking 21:44.000 --> 21:46.000 will just say, okay, 500 tokens. 21:46.000 --> 21:48.000 I'm just going to count 500 tokens and cut, 21:49.000 --> 21:51.000 but then you lose meaning, 21:51.000 --> 21:53.000 and you lose context when you do that. 21:53.000 --> 21:55.000 So here you can see 21:55.000 --> 21:57.000 it actually did a hundred-something chunks, 21:57.000 --> 21:59.000 and each one is, I want to show, 21:59.000 --> 22:01.000 let me show this one here, 22:01.000 --> 22:04.000 like, you know, a heading. 22:04.000 --> 22:07.000 It's clear: this heading is "Abstract".
22:07.000 --> 22:10.000 This one is "1. Introduction", getting started. 22:10.000 --> 22:12.000 It's able to do that 22:12.000 --> 22:13.000 because 22:13.000 --> 22:15.000 Docling understands the format; 22:16.000 --> 22:18.000 it creates blocks of information. 22:18.000 --> 22:20.000 So it's able to 22:20.000 --> 22:22.000 group that as a chunk. 22:22.000 --> 22:24.000 And if it's 22:24.000 --> 22:30.000 within the context window limit, 22:30.000 --> 22:31.000 it will just use that. 22:31.000 --> 22:34.000 If not, it will find an appropriate way to 22:34.000 --> 22:37.000 chunk that into smaller blocks, 22:37.000 --> 22:39.000 but still be able to keep that 22:39.000 --> 22:41.000 semantic information. 22:41.000 --> 22:44.000 And then I just, again, want to quickly show, 22:44.000 --> 22:46.000 like, 22:46.000 --> 22:47.000 Docling versus... 22:47.000 --> 22:48.000 this is Docling. 22:48.000 --> 22:49.000 It's able to, you know, 22:49.000 --> 22:52.000 parse the index with page numbers accurately. 22:52.000 --> 22:53.000 This was, 22:53.000 --> 22:54.000 I can't remember, I used 22:54.000 --> 22:56.000 MarkItDown or some other parser. 22:56.000 --> 22:57.000 It's like, you know, 22:57.000 --> 22:58.000 a block, 22:58.000 --> 23:00.000 numbers, block, numbers. 23:00.000 --> 23:01.000 What's chapter one? 23:01.000 --> 23:02.000 I have no idea. 23:02.000 --> 23:03.000 Here you can see, 23:03.000 --> 23:05.000 for chapter one, 23:05.000 --> 23:07.000 it keeps all the, you know, 23:07.000 --> 23:08.000 heading information, 23:08.000 --> 23:10.000 and is able to output that accordingly. 23:10.000 --> 23:12.000 So, 23:12.000 --> 23:13.000 again, this is, 23:13.000 --> 23:14.000 this was from, 23:14.000 --> 23:16.000 I just used 23:16.000 --> 23:18.000 the Docling UI to be able to do that. 23:18.000 --> 23:19.000 So it makes it easy for you. 23:19.000 --> 23:21.000 You don't have to remember 23:21.000 --> 23:23.000 all the, you know, 23:23.000 --> 23:25.000 CLI references. 23:25.000 --> 23:28.000 You can access that from here, 23:28.000 --> 23:29.000 right from the settings. 23:29.000 --> 23:30.000 You can enable 23:30.000 --> 23:31.000 OCR. 23:31.000 --> 23:32.000 You can select one. 23:32.000 --> 23:34.000 You can even switch the OCR engine, 23:34.000 --> 23:35.000 and if it's not installed, 23:35.000 --> 23:36.000 it will install it automatically, 23:36.000 --> 23:37.000 except for Tesseract, 23:37.000 --> 23:39.000 which needs a system-wide install, 23:39.000 --> 23:40.000 so you do that yourself. 23:41.000 --> 23:43.000 Then, advanced options, 23:43.000 --> 23:44.000 which, again, 23:44.000 --> 23:45.000 you don't have to remember all 23:45.000 --> 23:46.000 the different parameters. 23:46.000 --> 23:47.000 You can just 23:47.000 --> 23:49.000 turn them on and off here. 23:49.000 --> 23:51.000 And then there's even documentation 23:51.000 --> 23:52.000 built right in, 23:52.000 --> 23:53.000 MkDocs, 23:53.000 --> 23:54.000 perfect Markdown, 23:54.000 --> 23:55.000 very easy. 23:55.000 --> 23:57.000 And I have one minute left. 23:57.000 --> 23:58.000 So, 23:58.000 --> 24:01.000 let's see, where are my slides. 24:01.000 --> 24:03.000 So, feel free to 24:03.000 --> 24:05.000 check out the docs about chunking.
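Here is a minimal sketch of that structure-aware chunking, using the HybridChunker from docling's chunking module as shown in the upstream docs; only defaults are used here, real setups usually pass a tokenizer and a max token count, and "ebook.pdf" is a placeholder.

```python
# Sketch: chunk a converted document while keeping headings and sections together,
# instead of cutting blindly every N tokens. HybridChunker and its chunk(dl_doc=...)
# call follow the docling chunking docs; treat details as version-dependent.
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("ebook.pdf").document
chunker = HybridChunker()

for chunk in chunker.chunk(dl_doc=doc):
    # each chunk carries its text plus metadata such as the headings it sits under
    print(chunk.text[:80])
```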
24:05.000 --> 24:06.000 As I said, 24:06.000 --> 24:07.000 clearly, 24:07.000 --> 24:08.000 it's about, you know, 24:08.000 --> 24:09.000 splitting it in the right way 24:09.000 --> 24:11.000 so that it preserves meaning and context, 24:11.000 --> 24:12.000 and 24:12.000 --> 24:14.000 being able to understand the layout 24:14.000 --> 24:17.000 and reduce hallucination for the models. 24:17.000 --> 24:18.000 All right. 24:18.000 --> 24:19.000 So, 24:19.000 --> 24:21.000 I will upload all these slides; 24:21.000 --> 24:24.000 maybe I'll remove that chunk of nonsense letters, 24:24.000 --> 24:26.000 but I'll upload this to Pretalx, 24:26.000 --> 24:27.000 so you'll be able to see it and 24:27.000 --> 24:30.000 get all these links for both Docling and the Docling UI. 24:30.000 --> 24:32.000 And that's it. 24:32.000 --> 24:33.000 The Docling team, 24:33.000 --> 24:36.000 by the way, is from IBM Research in Zurich. 24:36.000 --> 24:38.000 You know, this is not the full list. 24:38.000 --> 24:40.000 And, you know, 24:40.000 --> 24:42.000 of course, a lot of these images 24:42.000 --> 24:46.000 and the Docling logo are from them, 24:46.000 --> 24:50.000 but the Docling UI author is David, from Red Hat. 24:50.000 --> 24:51.000 Hi, David. 24:51.000 --> 24:53.000 Thank you so much for all your help. 24:53.000 --> 24:54.000 If you're watching this, 24:54.000 --> 24:56.000 amazing stuff. 24:56.000 --> 24:57.000 So, thank you. 24:57.000 --> 24:58.000 If you have any questions, 24:58.000 --> 24:59.000 like I said, 24:59.000 --> 25:01.000 feel free to look for me afterwards. 25:01.000 --> 25:03.000 And that's my time. 25:03.000 --> 25:05.000 Thank you very much. 25:05.000 --> 25:08.000 Thank you. 25:08.000 --> 25:11.000 Thank you.