WEBVTT 00:00.000 --> 00:11.760 Yes, hello, hello, everybody. Welcome to Nikolai's talk. Yes, unfortunately, he wasn't 00:11.760 --> 00:18.600 able to make it. So, yeah, my imposter syndrome was never as bad as today because, yeah, 00:18.600 --> 00:26.880 that's not my talk. Okay, I will try hard to deliver what he wanted to deliver. 00:26.880 --> 00:32.400 Nikolai is a friend of mine; he's an amateur musician and a music lover. So, he put an epigraph on 00:32.400 --> 00:40.080 his talk, a saying from Franz Liszt. And he asked me to say that this is a free and open source 00:40.080 --> 00:49.040 conference, and Liszt was probably one of the most free-spirited people in history. And 00:49.040 --> 00:58.160 keep this epigraph in mind; we will get to it later. Well, why not just print it out? We saw examples of 00:58.160 --> 01:04.960 document-producing technologies, so we probably have this Print 01:04.960 --> 01:14.080 menu item somewhere, or Export to PDF, or Export to, like, ODT. Why doesn't it just work? Well, 01:14.160 --> 01:20.560 from practice, it doesn't. So, for example, what do you do when you have a code listing and it has 01:20.560 --> 01:27.600 a very wide line? You may scale it, just make it smaller, but 01:27.600 --> 01:35.600 if you are doing semantic markup, scaling is not part of the semantics. Or you may 01:35.600 --> 01:42.800 switch to landscape orientation for this page only. And again, that's not part of 01:42.800 --> 01:49.520 the semantics, or should it be? Should we insert some hints there? It's not known. Or probably we 01:49.520 --> 01:56.160 would like to do both. Or probably we would like to throw an exception: now your 01:56.160 --> 02:03.280 build has failed; author, please rewrite your code example so that your snippet contains 02:03.280 --> 02:10.080 short lines. Or probably you would like to wrap them, and again there are options. 
You can use 02:10.160 --> 02:16.720 line feeds and spaces, but if you care about the validity of your code, when somebody copies it from the PDF 02:16.720 --> 02:23.360 and pastes it, it may break. Or you may use indents, which will work with text processors 02:23.360 --> 02:30.880 but will not work with PDF. So, choices, choices, choices. Meanwhile on the web, in HTML, 02:30.880 --> 02:36.400 we can just add a horizontal scroll bar and that's it. And this "copy to clipboard" button. 02:36.480 --> 02:42.240 And that's just one example. So, there is a core mismatch: semantic markup meets the rigid 02:42.240 --> 02:49.520 world of print. And yeah, what's interesting is that this talk is going 02:49.520 --> 02:55.360 in the opposite direction from Docling, which we heard about before, because Docling goes back 02:55.360 --> 03:02.240 from printed media to semantics, and we are going from semantics to printed 03:02.320 --> 03:10.000 media. And this is also a very hard task. So, the main focus, what Nikolai wanted to present 03:10.000 --> 03:18.880 here, is this Unidoc Publisher repository, a GitHub repository. This is actually the result of an evolution, 03:18.880 --> 03:25.600 because over the years of trying to solve this problem he took three approaches. 03:26.560 --> 03:32.320 The first one was the XSL-FO approach from Asciidoctor. The second one is the Open 03:32.320 --> 03:40.480 Document converter. So, the idea was simple: you have an AsciiDoc document, just produce a nice-looking 03:40.480 --> 03:46.560 Open Document, the LibreOffice Writer format, from it. And finally, Unidoc Publisher, 03:46.560 --> 03:54.880 which claims to be an any-markup-to-any-printing rendering engine; quite a serious, quite a 03:54.880 --> 04:04.000 bold claim. Let's see what it does and how. First of all, what formats do we have for printing? 
04:04.000 --> 04:08.720 The first that comes to mind is, of course, PDF. If we have a PDF, we can print it. 04:10.480 --> 04:15.920 Second is text processor formats, right? If you are using Word or LibreOffice, 04:15.920 --> 04:23.120 you can print it because there are page breaks. HTML: who thinks that HTML is a 04:23.200 --> 04:32.000 format for printing? Wow, some people do. Well, actually we have CSS Paged Media, 04:32.000 --> 04:39.680 which is a CSS extension that defines styling specific to printing. But the harsh truth of life 04:39.680 --> 04:49.040 is that although we have had CSS Paged Media for 15 years now, browser support is still limited. 04:49.040 --> 04:55.200 And in practice, it is used via some open source tools 04:55.200 --> 05:04.400 which convert HTML plus paged CSS into PDF. So, it's kind of an intermediary format 05:04.400 --> 05:10.000 rather than a widely used format. But still, technically, yeah, this is possible. 05:10.720 --> 05:17.840 Okay, how do we render to these formats? Well, if you want to go low level, really low level, 05:17.920 --> 05:24.720 good luck: you can use native PDF generation. There is XSL-FO. There is TeX, of course; 05:24.720 --> 05:32.160 if we are speaking about PDF and printing, we cannot miss it. And, as mentioned before, 05:32.160 --> 05:40.400 HTML plus Paged Media CSS. Also, if we have text processor files, 05:40.400 --> 05:46.640 we can also render them to printed media, either by direct printing or by converting them to PDF. 05:48.720 --> 05:54.480 The truth is that these technologies are misaligned in a great number of ways. 05:55.440 --> 06:02.640 So, yeah, we can compare; every way has its own limitations and problems. Apache 06:02.640 --> 06:08.720 FOP has long-standing problems with dots in a table of contents. 
LibreOffice Writer doesn't support 06:08.720 --> 06:16.240 typography features like keep-with-next within tables. Microsoft doesn't recommend 06:16.240 --> 06:23.520 automating Word to produce PDF. So, there are lots of problems, and it looks 06:23.520 --> 06:30.800 like the old joke about two speleologists who meet in a narrow cave, and one says, 06:30.800 --> 06:35.440 "I'm coming from a dead end," and the other says, "I'm coming from a dead end, too." 06:35.440 --> 06:41.680 So, dead ends everywhere, and the world of printing is a world of constraints. 06:42.560 --> 06:51.120 And those constraints differ for each technology. So, often, you end up supporting several 06:51.120 --> 06:59.040 toolchains. For example, your main output is an exquisite-looking TeX book, and in order to 06:59.040 --> 07:07.840 get through reviews or approvals, you are sending LibreOffice files just for this 07:08.160 --> 07:17.280 workflow. So, there are no universal solutions. And Nikolai recommends using 07:17.280 --> 07:23.120 Unidoc Publisher if at least one of these holds. First, if you don't prepare the 07:23.120 --> 07:29.360 documentation specifically for printing purposes: you are doing something like 07:29.360 --> 07:35.280 Antora, a documentation or huge tutorials website, and only occasionally 07:35.280 --> 07:42.320 do you need to present a PDF to somebody. Second, if you are automating the documentation generation 07:42.320 --> 07:48.640 (I'm going to speak about it about an hour later), so much of your documentation is not 07:48.640 --> 07:53.360 produced manually but automatically, and you hope it will look good no matter what is generated. 07:54.240 --> 08:00.640 And if your output format is one of the text processor formats, this is also possible. 08:02.000 --> 08:11.360 So, where do we start? Where did Nikolai start? 
First, he tried to create an 08:11.360 --> 08:20.320 Asciidoctor Open Document converter. So, the idea was the following. Asciidoctor 08:20.320 --> 08:27.520 parses its own markup into an abstract syntax tree. You may extend it, you may transform this 08:27.520 --> 08:34.320 AST with an Asciidoctor tree processor and, you know, run it through a template. So, you are actually writing 08:34.320 --> 08:41.680 a custom Asciidoctor processor. And the writer can be done either with pure Ruby 08:41.680 --> 08:48.080 or with special templates. So, this is how it looked in practice. On the left side of this slide 08:48.160 --> 08:57.760 is a simplified AST of an Asciidoctor document after parsing; on the right, there is a Slim template. 08:57.760 --> 09:03.440 And you see, we have Ruby code mixed in: everything which starts with a dash is Ruby. 09:03.440 --> 09:08.240 Everything which doesn't start with a dash is part of the template; it's just XML output. 09:08.880 --> 09:19.520 So, yeah, you can use the Asciidoctor parser, this extension point that it provides, and build 09:19.520 --> 09:29.520 whatever you like, including Open Document format. It was great, but, yeah, this template 09:29.520 --> 09:36.640 cannot be universal. Everybody wanted their own particular features. And 09:37.600 --> 09:44.000 it was hard to change just part of the template. People had to copy this template, paste it, 09:44.000 --> 09:50.240 and do their own work on it. That's not maintainable. And then you have to invent 09:50.240 --> 09:56.880 something for styling. By styling, we mean styling for Open Document, for the word processor format: 09:56.880 --> 10:02.320 you know, when you select text and apply a style from the list of styles. And you say, 10:02.400 --> 10:07.760 okay, you have semantic styles, and word processors do support styling. 
10:08.480 --> 10:16.320 Well, actually, these are rather different things because here, both bold and green 10:16.320 --> 10:23.200 are applied to some part of the text, but in LibreOffice or in Microsoft 10:23.200 --> 10:29.280 Word you can only apply one style. So, you actually need three styles: one for bold, one for green, 10:29.280 --> 10:38.080 and one for bold green, and so on for all the combinations. So, what he had to do was 10:40.000 --> 10:47.840 invent a slightly extended intermediary Open Document format, which contains not only standard 10:47.840 --> 10:56.880 attributes but also extended semantic attributes, and then post-process it in order to add the missing styles, 10:56.960 --> 11:05.760 the missing combinations of styles. Well, it kind of worked. And one interesting 11:05.760 --> 11:12.160 finding was that users of this approach started to transform this extended Open Document 11:12.160 --> 11:19.680 format. This wasn't an intended feature or capability of this product, 11:19.760 --> 11:27.040 yet Nikolai noticed that people wanted to transform the AST before converting it into something 11:27.040 --> 11:36.480 printable. And this idea of post-processing and adding styles separately proved to be useful. 11:37.680 --> 11:44.320 Also, Nikolai very much liked the idea of using Gradle. Gradle is a build tool from the JVM world, 11:44.400 --> 11:52.400 quite powerful, with a statically typed DSL. So, you get static type checks before you run 11:52.400 --> 12:02.480 the Gradle script. It is magnificent at gluing all the parts together. So, before we go to the second step, 12:02.480 --> 12:12.480 Unidoc Publisher, some final thoughts: if creating a universal converter is impossible, 12:12.560 --> 12:18.160 let's create a meta-converter, a platform for building converters. 
So that everybody who 12:18.960 --> 12:25.200 faces the problem of transforming their existing body of documentation into PDF 12:25.840 --> 12:31.840 can simply do it with custom scripting, but without too much effort. 12:32.960 --> 12:40.880 And what are the requirements? Native converter as a reader: we will explain on the next slide 12:41.200 --> 12:48.400 what this means. Then, working with the AST: this should be more or less easy to do, right? 12:49.840 --> 12:57.360 And styling as a separate focus; well, this is just a lesson learned from the 12:59.520 --> 13:07.280 previous version. And, yeah, ideally, it should have good integration with CI/CD, with a focus on 13:07.360 --> 13:15.440 homogeneity. What is meant here is that everything must be from the same ecosystem, either JavaScript, 13:15.440 --> 13:26.480 or Java, or Python, but not a mix, because a mix is often problematic. What do we mean by native 13:26.480 --> 13:34.080 converter as a reader? The idea is that each converter outputs HTML, and HTML is quite semantic. 13:34.800 --> 13:43.040 So, if we take the HTML output as our input, we can actually build a universal thing. 13:43.920 --> 13:52.080 Markdown can produce HTML, Asciidoctor can produce HTML, Microsoft Word, whatever; 13:52.160 --> 14:07.520 everything can produce HTML. So, you can just read it, and there are good 14:07.520 --> 14:15.680 HTML parsing libraries in open source; in Java, it's jsoup, which is quite good. 14:16.320 --> 14:22.240 So, the experiment: let's convert this presentation, the presentation that you see, 14:23.200 --> 14:34.720 into printable form. This presentation, by the way, is written in Asciidoctor. So, this is the source 14:34.800 --> 14:45.200 code of Nikolai's presentation. What we want to get is a document like this. 
So, it's LibreOffice 14:45.200 --> 14:53.280 Writer, and as you can see, it's converted into this printable form. That it is also converted 14:53.280 --> 14:58.480 into something else you can see with your own eyes: it's this presentation, and it's the Asciidoctor 14:58.560 --> 15:07.680 reveal.js pipeline which converts it into this clickable HTML. So, how do we 15:07.680 --> 15:22.400 do this? Using Nikolai's tool. So, yeah, he's using Kotlin internally because, well, it's a JVM approach; 15:22.400 --> 15:30.320 he's using Gradle with a Gradle script. And the following code listings are actually snippets 15:30.320 --> 15:36.960 taken from the scripts. So, every code snippet that you see in this presentation 15:36.960 --> 15:45.600 is actually included from the presentation source code. We will share the GitHub repository, so 15:45.600 --> 15:50.800 you will see that this is actually not something copied and pasted; it's included. So, this is 15:50.800 --> 16:00.400 code which actually works. I will speak a bit more about this in my next presentation. 16:00.400 --> 16:11.920 So, the boilerplate: this is a snippet of the Gradle build script. And what does it do? 16:11.920 --> 16:18.160 First, we need the HTML. So, we are using AsciidoctorJ, a library which allows us 16:18.160 --> 16:25.040 to build HTML from Asciidoctor; it's a Java library. Then we are 16:25.040 --> 16:30.480 adding a template. The template here is the starting point. It's just an empty 16:30.480 --> 16:37.920 Word or LibreOffice file, which we populate with content. Then there are some other technical 16:38.000 --> 16:50.320 details that we need, but if we omit this part, which is commented out, and just 16:50.320 --> 16:59.840 run it without any further customizations, we will get something like this; let me show you. So, 16:59.840 --> 17:07.760 this is the default transformation. 
So, already not bad, probably. As you can see, 17:07.760 --> 17:14.960 there is some output. You can see page headers, but on the first page you have 17:14.960 --> 17:23.600 a page header as well. The first slide looks ugly because of the sizes of the images. 17:25.600 --> 17:36.080 The epigraph is lost. So, we need to fix it somehow, right? So, the idea is that, by default, 17:36.160 --> 17:41.600 you get a pretty good result, but if you want an ideal result, you should 17:41.600 --> 17:47.520 get your hands dirty and do some abstract syntax tree transformations. That's it. So, how do you do this? 17:47.520 --> 17:53.520 After parsing, you get access to the abstract syntax tree. For example, you need to 17:53.520 --> 18:01.280 extract the first section to build this beautiful header. So, you just filter it out by 18:01.280 --> 18:08.240 source tag name, take the first section, and make a title out of it. So, what does this first 18:08.240 --> 18:15.200 section look like in Asciidoctor? This is the source code, and actually, yeah, this is quite 18:15.200 --> 18:21.920 interesting because technically this is an Asciidoctor page which includes a part of itself 18:23.680 --> 18:29.760 within the Asciidoctor page. So, yeah, this is a listing of 18:29.840 --> 18:36.560 a part of this Asciidoctor source. So, we are going to extract some variables from 18:36.560 --> 18:43.200 there. You can see roles here: full name, title, photo, bio, and so on. So, this is how we're 18:43.200 --> 18:50.560 doing this: we have access to the AST, so we just filter by these roles and assign 18:50.560 --> 18:57.760 them to a bunch of Kotlin variables. That's it. And then, having these Kotlin variables, 18:58.720 --> 19:08.720 we can just rearrange, rebuild, reconstruct the header, the beginning of this document. 
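The extraction step described above (take the first section, pick elements by role, assign them to variables) can be sketched roughly as follows. This is only an illustration under stated assumptions: the talk says the real tool is written in Kotlin and works on its own AST, which is not shown in this transcript, so the sketch uses Python's standard `html.parser` instead. The `Node`, `TreeBuilder`, `parse`, and `by_role` names, the assumption that roles show up as HTML `class` values, and the sample HTML with its made-up speaker data are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class Node:
    """Minimal tree node: tag name, attributes, children, and direct text."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""

    def walk(self):
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Build a Node tree from reasonably well-formed HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    """Parse an HTML string and return the root of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def by_role(node, role):
    """Return the first descendant whose class list contains `role`, or None."""
    for candidate in node.walk():
        if role in (candidate.attrs.get("class") or "").split():
            return candidate
    return None


# Made-up input standing in for the HTML produced from the first slide.
HTML = """
<section>
  <p class="fullname">Jane Doe</p>
  <p class="title">Technical writer</p>
  <img class="photo" src="jane.png"/>
  <p class="bio">Writes documentation.</p>
</section>
<section><h2>Second slide</h2></section>
"""

tree = parse(HTML)
# Filter by tag name and take the first section only.
first_section = next(n for n in tree.walk() if n.tag == "section")
# Pick elements by role and assign them to plain variables.
full_name = by_role(first_section, "fullname").text
title = by_role(first_section, "title").text
photo = by_role(first_section, "photo").attrs["src"]
bio = by_role(first_section, "bio").text
# The variables can then be rearranged into a new header.
header = f"{full_name}, {title}"
```

Keeping the extraction separate from the rebuilding step mirrors the point made in the talk: the semantic pieces are pulled out of the tree first, and the printable layout is constructed from them afterwards.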
19:08.720 --> 19:16.000 So, what we have here, [unclear] table, table row group, is the builder for Open Document format. 19:16.000 --> 19:21.280 It could be some other format, but in this case, this is what we are building. And we are using these 19:21.280 --> 19:32.240 photo, bio, and contact variables that we extracted from our abstract syntax tree. So, what do we have 19:32.240 --> 19:40.800 in the end? Let me show it once again. So, we have a beautiful-looking 19:40.800 --> 19:50.480 document which contains all the slide contents. It contains all the speaker notes. And also, 19:50.560 --> 19:56.480 we have the epigraph. We didn't lose it. It used some unsupported 19:56.480 --> 20:02.320 role or whatever feature of Asciidoctor, which prevented it from appearing in the default 20:02.320 --> 20:09.360 export. Now it's being shown, and it's being shown in its correct place in the result. 20:13.680 --> 20:14.480 Sorry? 20:14.480 --> 20:18.720 Could you enable the unprintable characters? Unprintable characters? 20:23.520 --> 20:28.480 Yes, if it's possible via a builder like LibreOffice Builder. So, it's, like, 20:32.000 --> 20:34.240 I see, you mean this one, okay? 20:35.920 --> 20:43.120 Yeah, yeah, you wanted this, sure, of course. So, 20:43.440 --> 20:52.320 you wanted to see if it's not all spaces, right? Okay. No, it's building it 20:52.320 --> 20:57.200 structurally. Yes, it's building it structurally. I'm going to skip this one. 20:57.200 --> 21:03.680 It's highly technical and Nikolai knows it better. By the way, Nikolai is available in the 21:03.680 --> 21:11.680 Element chat for this group, so he will be happy to answer everything. A bit about testing. 21:11.760 --> 21:19.120 So, it's kind of a lot of code already, this transformation code. 
And we need to make sure 21:19.120 --> 21:23.600 that, if we are maintaining it, if we are keeping it for some documentation project, 21:23.600 --> 21:29.440 we put some regression testing on it. So, the idea is quite simple. It's 21:29.440 --> 21:38.800 snapshot testing. What we do is we prepare small snippets, like a small table, or 21:38.960 --> 21:48.720 some combination of styles. And we ask this pipeline to output a picture, 21:48.720 --> 21:57.840 as if printed. So, we have an approved picture right in Git. And if, due to some 21:57.840 --> 22:04.720 version changes or something, the picture changes, something is shifting, right? 22:05.360 --> 22:12.240 We see it immediately: our CI build fails. And if we are okay with this shift, 22:12.240 --> 22:18.000 we may just re-approve it, which is to say: okay, now this version is fine, 22:18.000 --> 22:23.760 this is how it's going to be. But if something breaks, for example this table disappears 22:23.760 --> 22:29.120 completely, right? In this case, it will be obvious that we need to fix the code. 22:30.000 --> 22:39.520 Okay. So, what did we get as the result? We got everything CI-friendly. What can 22:39.520 --> 22:46.000 be more CI-friendly than a build script, a Gradle build script? Because Gradle is a build tool. 22:46.000 --> 22:54.080 And everything is built with a DSL within Gradle. There are no declarations. 22:54.080 --> 23:05.040 What is meant here is that this product doesn't offer some default 23:06.240 --> 23:13.360 behavior that you can override with some knobs and switches. It was definitely, 23:13.360 --> 23:18.960 in principle, done the other way around. 
So, if you want to change something, 23:18.960 --> 23:28.000 you need to get your hands dirty and do some AST transformations. And it has a typed AST 23:28.000 --> 23:35.920 and clean, testable code. Well, let's get to the conclusion. So, treat printing as engineering: 23:35.920 --> 23:42.480 designed, coded, tested, automated. Printing is a lossy transformation. Not only 23:42.480 --> 23:48.240 from printed media to semantic markup, but also vice versa, it is a lossy transformation. Some 23:48.240 --> 23:53.840 semantics cannot survive it. Keep the rendering logic programmable, under your control. 23:53.840 --> 24:00.320 And have a look, check out Unidoc Publisher and this repository, which contains 24:00.320 --> 24:07.680 these exact slides and shows how printed documents can be built from them. Thank you very much. 24:07.680 --> 24:08.640 Questions? 24:08.640 --> 24:33.680 Yeah.
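As a supplement to the testing part of the talk, the snapshot-testing loop it describes (render a small snippet, compare with an approved result kept in Git, fail the build on any difference, re-approve when the change is acceptable) might look roughly like the sketch below. It rests on assumptions: the talk describes comparing rendered pictures, while this sketch compares SHA-256 digests of the rendered bytes to stay within the Python standard library, and `check_snapshot`, its parameters, and the sample data are hypothetical names, not part of Unidoc Publisher.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(data: bytes) -> str:
    """Return a stable fingerprint of the rendered output."""
    return hashlib.sha256(data).hexdigest()


def check_snapshot(name: str, rendered: bytes, approved_dir: Path,
                   approve: bool = False) -> bool:
    """Compare rendered output with the approved snapshot (hypothetical helper).

    Returns True when the output matches the approved snapshot or has just
    been approved; False means a CI build should fail.
    """
    approved_file = approved_dir / f"{name}.sha256"
    actual = digest(rendered)
    if approve:
        # Re-approval: record the new output as the accepted reference.
        approved_file.write_text(actual)
        return True
    return approved_file.exists() and approved_file.read_text() == actual


# Stand-ins for two renderings of the same small table snippet.
version_1 = b"rendered table, version 1"
version_2 = b"rendered table, version 2"

with tempfile.TemporaryDirectory() as tmp:
    snapshots = Path(tmp)  # in a real project this directory would live in Git
    first_run = check_snapshot("small-table", version_1, snapshots)
    check_snapshot("small-table", version_1, snapshots, approve=True)
    unchanged = check_snapshot("small-table", version_1, snapshots)
    changed = check_snapshot("small-table", version_2, snapshots)
    check_snapshot("small-table", version_2, snapshots, approve=True)
    reapproved = check_snapshot("small-table", version_2, snapshots)
```

A digest comparison flags any byte-level change, which is stricter than looking at pictures; a picture-based check, as described in the talk, additionally lets a human judge whether a small shift is acceptable before re-approving.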