WEBVTT 00:00.000 --> 00:05.000 over. 00:07.000 --> 00:12.000 Aye. 00:12.000 --> 00:16.000 Albuquerque. 00:16.000 --> 00:26.600 You could. 00:26.600 --> 00:37.120 There will be no drift, not one. Okay, we're gonna start now because I don't want to 00:37.120 --> 00:42.880 spend a single minute speaking about anything other than this topic. 00:42.880 --> 00:54.760 Hello everyone. I am from, well, I was born in France, but now I have the luxury to live 00:54.840 --> 01:01.280 in Valencia, Spain, and your city is really freezing. 01:01.280 --> 01:10.720 Just saying. I work as a freelancer. I have a flying phobia, so it cost me a lot to come here. 01:10.720 --> 01:20.760 I have been using NetBSD since 1998, yes, I am that old, and I have been a NetBSD 01:20.800 --> 01:28.520 committer since 2009. And for those who use it, I am the initial author of 01:28.520 --> 01:34.680 pkgin, the binary package manager for NetBSD, but not only, 01:34.680 --> 01:44.680 also for macOS, illumos and so on. And right now, Jonathan Perkin is the current 01:44.680 --> 01:51.680 maintainer and he's doing a magnificent job with it. 01:51.680 --> 02:02.680 So today I'm going to talk about a work that started back in 2006, but the real hard work 02:02.680 --> 02:13.680 started three years ago. I was always passionate about making NetBSD small. I don't 02:13.680 --> 02:20.680 know why, it's kind of an obsession. I wanted to, you know, reduce the size 02:20.680 --> 02:33.680 of NetBSD. So first I did this with a USB key, a 64 megabyte, megabyte, USB key. 02:33.680 --> 02:46.320 Then in 2016, as everyone was really happy about Docker and containers and so on, I created 02:46.320 --> 02:52.720 Sailor, which is not a container system, it is based on chroot, but it allows you 02:52.720 --> 03:01.560 to create a self-contained NetBSD system running only what is needed.
03:01.560 --> 03:12.360 At that time, there was noise about this Firecracker thing, something that AWS did back 03:12.360 --> 03:23.240 then, which is basically their serverless spine, you know about it. And basically what 03:23.240 --> 03:33.680 this is, those are virtual machines, and those virtual machines boot in something around 03:33.680 --> 03:44.360 one hundred milliseconds, which is enormous in my opinion. And I wanted to do something resembling 03:44.360 --> 03:52.720 it, and I knew that QEMU, for those who know, for those who don't know QEMU, I can't 03:52.720 --> 04:04.480 do anything for you. But then, yeah, QEMU has had, for ages, this -kernel flag. I have to admit 04:04.480 --> 04:14.680 I had no idea what that meant, but what I knew is that Linux, always Linux, knows how to 04:14.760 --> 04:26.120 be booted straight from QEMU with the -kernel flag. Four years, five years ago now, 04:26.120 --> 04:36.200 I worked on something resembling Firecracker. Basically, I trimmed down a NetBSD kernel, 04:36.280 --> 04:47.880 and NetBSD 32 bits. Why? Because at that time, the only version of NetBSD that was capable 04:47.880 --> 04:56.920 of booting with the -kernel flag was the 32-bit one, because the 32-bit kernel had 04:58.040 --> 05:04.840 what is called the multiboot feature, which allows the kernel to boot directly from QEMU like this. 05:06.440 --> 05:18.600 But, I mean, 32 bits. Recently, two years ago, I started this little, really badly 05:18.600 --> 05:29.080 named project called mksmolnb, which basically consisted in stripping every bit of driver that's 05:29.160 --> 05:39.240 not needed from a 32-bit kernel, so it can be, like, live-hacked without having to recompile it. 05:39.240 --> 05:51.960 But there was this guy, Colin, Colin Percival, who was booting a FreeBSD system, not using 05:51.960 --> 06:03.640 QEMU, actually; his work was based on Firecracker.
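As a rough illustration of the -kernel flag mentioned above, the sketch below builds a QEMU command line that asks QEMU to load the kernel file itself instead of going through a BIOS and a bootloader. It is a minimal sketch under stated assumptions: the file name `netbsd-kernel` and the helper `qemu_direct_boot_argv` are hypothetical, and nothing here is taken from the speaker's own scripts. Only options that QEMU documents are used (`-kernel`, `-append`, `-m`, `-display none`, `-serial stdio`).

```python
import shlex


def qemu_direct_boot_argv(kernel_path, memory_mb=256, append=None):
    """Build an argv list that asks QEMU to load a kernel directly.

    kernel_path is a hypothetical path to a kernel image that supports
    direct loading (for example through multiboot or PVH).
    """
    argv = [
        "qemu-system-x86_64",
        "-kernel", kernel_path,   # QEMU loads this file itself, no bootloader
        "-m", str(memory_mb),     # guest RAM in megabytes
        "-display", "none",       # no graphical window
        "-serial", "stdio",       # guest serial console on the terminal
    ]
    if append:
        # Optional kernel command line, passed through by QEMU.
        argv += ["-append", append]
    return argv


argv = qemu_direct_boot_argv("netbsd-kernel")
# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(argv))
```

The list form can be handed to `subprocess.run(argv)` on a machine where QEMU and a suitable kernel image are available; printing it first makes it easy to check what would be executed.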
And he was able to boot it in more or less 06:04.200 --> 06:21.000 25 to 30 milliseconds. Not bad, not bad. So, two years ago, one year and a half, I said, okay, let's 06:21.080 --> 06:28.840 try it. Let's see, let's see what all this is about. How does the kernel... you know what? I'm 06:28.840 --> 06:39.320 going to put this, like, so I can see that I'm not drifting. Five minutes, five minutes. So, yeah, 06:39.320 --> 06:48.360 let's see what all this is about. What does it take to boot a kernel from the QEMU -kernel 06:48.840 --> 06:59.800 parameter? And I mean, I did very little kernel hacking in the past. It's pretty frightening. 07:02.120 --> 07:11.320 And I mean, it's not easy. And I thought, you know what? I'm going to look at what Colin did, 07:12.040 --> 07:19.000 and I'll steal it, and patch, and that's it. And probably there will be nothing more and it will work 07:19.000 --> 07:29.960 just like that. Nope. Nope. I ended up patching probably the worst part of the kernel, 07:29.960 --> 07:37.800 which is locore.S, which is the assembly part that boots the kernel, you know. 07:38.440 --> 07:46.360 And with this, I was able to boot... if we have time we'll see the details of the theory 07:47.160 --> 07:55.800 after. With that, I was able to boot the kernel directly using a protocol called PVH. We'll 07:55.880 --> 08:08.600 talk about that. But it booted in 200 milliseconds. Please. Please. So, I ended up 08:10.040 --> 08:18.760 tweaking, hacking, modifying parts of the kernel, some of which are 30 years old, like 08:19.640 --> 08:34.360 when I broke the build tweaking the serial code. And I ended up booting a kernel on a normal machine, 08:34.360 --> 08:41.880 on a recent machine, in about 20 milliseconds. This laptop is a laptop. It's not plugged into the 08:41.880 --> 08:51.960 power, so it has less power than a real desktop machine or server. So, don't boo me when it takes 08:51.960 --> 09:02.760 more than 20 milliseconds.
And I really wanted to show what it is about. Because I did 09:04.120 --> 09:10.360 kind of this very presentation, not in the same format, at BSDCan. And at BSDCan, 09:10.360 --> 09:19.320 I did it very theoretically. I explained the code, what I did in locore and so on. But I had, like, 09:19.320 --> 09:30.760 very little time to show what it was about. So, today I prepared demos. Most of the 09:30.760 --> 09:40.120 presentation should be demos, if they work, obviously. So, what is the point of all this? Why? 09:40.360 --> 09:47.960 Why do we want to boot a kernel so fast? Well, let's see that. So, first, 09:50.680 --> 09:59.160 like, with no other fanciness than just seeing the kernel and seeing a shell, okay? 09:59.640 --> 10:11.800 I created a small project. I can, yeah, I'll do it. Because there's something that I want, 10:11.800 --> 10:20.440 I want to show. Yeah, okay, I'm going to zoom a bit. Where's that? 10:20.840 --> 10:37.000 There. Yep. Yep. Okay. So, the project I've been working on now for years, I renamed it from 10:37.000 --> 10:47.880 mk-that-I-don't-want-to-say to smolBSD. smolBSD is not an OS. It's an OS builder. We're going to see that. 10:48.760 --> 10:57.720 So, it's mainly composed of a script, of two scripts, which are very easy to use. 10:59.160 --> 11:04.600 You create images, okay? I already created them so as not to have any surprise while doing the demo. 11:06.200 --> 11:14.040 And the main, the main star is this guy, netbsd-SMOL, which is a kernel which has been built 11:14.920 --> 11:27.080 with all the stuff I wrote. And the rescue image is basically just an rc file. We're going to see 11:27.080 --> 11:35.160 what it is about. And a shell. The shell being from the rescue directory. If you know the rescue 11:35.160 --> 11:40.840 directory, I guess every BSD has it. I don't know, FreeBSD, I don't know. 11:41.800 --> 11:49.320 Yep. Okay. So, basically, it's a directory where you have one crunched, it's called a crunched binary.
11:50.440 --> 11:55.800 Pretty much like BusyBox. Okay. 11:57.320 --> 12:02.440 That, well, 53. Okay, let me try something. Okay, let me try something. 12:02.520 --> 12:10.120 Because 53, no, no. 12:11.400 --> 12:24.120 Let's see, if I do that. Okay, so, that's a very, very simple OS, 12:24.120 --> 12:37.320 basically, like I said, the kernel and this. But now, from there, what can we do? Well, 12:37.320 --> 12:45.160 actually, from there, we can do anything. We can create our own operating system. Let's 12:45.160 --> 12:58.160 say, everything revolves around a Makefile. Yeah, I don't have any YAML, you know, with CI/CD. 12:58.160 --> 13:08.160 No, it's a plain Makefile. And in this Makefile, well, I have targets which are images. 13:09.000 --> 13:20.000 One of the main images is base, so I can do make base. And in order to, I'm afraid, you know. 13:20.000 --> 13:28.200 In order to make it easy to kill the virtual machine, you can pass the Makefile 13:28.200 --> 13:36.500 MOUNTRO equals yes, which will make the fstab mount slash 13:36.500 --> 13:41.680 read-only, allowing me to just kill the machine and not care about 13:41.680 --> 13:51.720 shutdown, halt and so on. Okay, so for example, I can do base, I'm looking, 13:51.720 --> 14:00.360 I don't know. So that's base, and that's the base file system with every 14:00.360 --> 14:08.280 tool you should find in base. Okay, okay, that's cool, not that impressive, but 14:08.280 --> 14:17.560 nevertheless cool. But like I said, you could create your own operating system. 14:17.960 --> 14:29.800 You could, like, modify init or change init, change the way the system boots, and for 14:29.800 --> 14:42.600 example, why not, why not create a system image that's called systemBSD. 14:42.600 --> 14:54.240 And this is exactly what you think it is.
This is exactly what you think 14:54.240 --> 15:04.880 it is. Yeah, I also created a couple of configuration files and, yeah, this 15:04.880 --> 15:16.840 is systembsd.conf, and systembsd.conf uses dinit, which is basically an init system 15:16.840 --> 15:25.960 which is pretty fast, and handles the services with the dinitctl command, like 15:25.960 --> 15:38.600 dinitctl start, stop, and well, you can do fun things like this. Okay, and so 15:38.600 --> 15:46.920 welcome systemBSD, which is like a NetBSD version which is not exactly 15:46.920 --> 15:53.880 NetBSD: it has another init system, but after that it's only NetBSD, you have 15:54.120 --> 16:00.600 the same commands and all. I'm going to accelerate a bit. Okay, we're not going to talk 16:00.600 --> 16:09.960 about the internals. I know we won't have time for that. So okay, that's cool. 16:10.520 --> 16:23.920 This Firecracker thing and the whole container topic was really appealing to me. And so what 16:23.920 --> 16:41.600 about starting not just a virtual machine, but a container with NetBSD inside? So for 16:41.680 --> 16:58.160 example, if I do, I don't know, that's not the one I want to start. I want to show you 16:58.160 --> 17:10.160 something before: like, instead of just having a shell, we can start a service. Okay, like 17:10.160 --> 17:18.400 for example, an HTTP server. And that's the only service that will start when I boot 17:18.400 --> 17:33.120 the virtual machine, and this works just like a container. Okay, but obviously with a bit more 17:33.600 --> 17:41.120 security, because a container, let me remind you, is only a system call, 17:41.120 --> 17:49.760 unshare, on Linux. I mean, here we have an entire kernel, an operating system, with an isolated 17:49.760 --> 18:01.120 process. Okay, so yeah, what can we do from there? Well, as it is very light and fast, 18:03.360 --> 18:16.000 we can just run it as a Docker container. And here we go.
And it behaves just the same. Okay, 18:17.840 --> 18:29.120 and okay, this was a bit slow, 66 milliseconds. But from there, if we can start it as a container, 18:30.080 --> 18:43.520 what atrocity can we do? Exactly. As we can start it as a container, well, we can have 18:44.800 --> 18:56.320 this awful thing. We can have NetBSD as a Kubernetes pod. Okay, so I have a small cluster 18:56.320 --> 19:18.560 on this laptop where I will create a namespace. And I have a pod manifest that uses the 19:18.560 --> 19:37.760 container that I have just created. Okay, so there we go. Hey, it works. And yeah, the 19:37.760 --> 19:45.120 cluster that is running inside it is called kind, which is Kubernetes in Docker. And it 19:45.200 --> 19:54.960 exposes, you can query the container by querying the host, which is actually the container itself. 19:55.760 --> 20:08.800 So if I do that, oh, okay, I'm getting good. Okay, so the IP of this guy is this. 20:09.680 --> 20:27.120 So I can absolutely curl this, look. And it's actually the container running the virtual machine 20:27.680 --> 20:38.720 that's answering my curl. Okay, hey, I did all the demos. So from there I assume you 20:38.720 --> 20:48.560 have a pretty good idea of the vast possibilities that we have with that. What I showed here was 20:49.280 --> 20:57.680 with QEMU, you saw this. This also works with Firecracker, because basically the technique 20:58.240 --> 21:07.440 used to boot the kernel is exactly the same: it's using PVH, which is a system that has been included in 21:07.440 --> 21:17.120 Xen, like, forever. And, well, first, Colin, because let's be clear, the main work was done 21:17.120 --> 21:26.320 by Colin Percival, like, three years ago, in '22, three years ago. And his work inspired what I did 21:26.320 --> 21:32.720 after that. Okay, the implementations are obviously not the same.
He had some problems, I had others, 21:33.680 --> 21:42.960 but the boot method is the same, it's PVH, and actually it's not that complicated. Instead of 21:42.960 --> 21:52.720 booting at the start entry point of the kernel, there is a special entry point called start 21:52.800 --> 22:01.840 Xen, what was it called, start Xen. And this entry point uses the information that is passed by 22:03.200 --> 22:10.800 the virtual machine manager, QEMU, or Firecracker, or whatever. You don't need, when you think about it, 22:10.800 --> 22:16.400 you don't need any BIOS, any bootloader, anything, when you are starting a virtual machine from 22:17.280 --> 22:24.960 a VMM. You already know how much RAM you have, what the disks are, and so on and so forth. 22:24.960 --> 22:34.560 So, you will gain a lot of time by just grabbing that information, which is pushed by the VMM. 22:34.560 --> 22:42.960 Okay, and this is the main point of PVH: using a new entry point to just avoid all the bootloader 22:43.040 --> 22:57.680 stuff. Okay. Okay. Okay. I will not go through the implementation details, because, 22:57.680 --> 23:06.320 well, you can see that part in the BSDCan, BSDCan 23:06.480 --> 23:14.560 presentation, where I go deep into how it is implemented. I just want to mention that 23:14.560 --> 23:24.480 apart from PVH, which is the boot system only, after that, there was, yeah, there's a lot. 23:24.480 --> 23:34.160 Yeah, I like this tweet because this is the first time the kernel, with my modifications, 23:34.240 --> 23:43.120 hit the init part of the kernel. So, at that point, it worked, and I was very proud, as it shows. 23:45.120 --> 23:54.880 So, yeah, another technique to speed up the boot process, and to speed up the kernel in general, 23:54.880 --> 24:06.560 is to implement and use, instead of using, like, the PCI bus, because obviously in a virtual 24:06.560 --> 24:14.560 machine, you don't really need PCI. Okay.
So, you can use what's called MMIO, which is basically 24:14.560 --> 24:23.120 memory mapping, instead of a PCI bus, meaning that instead of using all the complicated PCI infrastructure, 24:23.200 --> 24:30.320 you just map an address, which will be used as the bus between the host and the guest. 24:30.320 --> 24:35.680 Okay. So, basically, it's copying data through a structure; there's nothing faster than that. 24:37.440 --> 24:46.320 And VirtIO has an implementation of MMIO. And so, one of the big pieces of work was to 24:47.280 --> 24:58.320 create a dummy bus, which is called pv, sorry, in NetBSD, which I stole from 24:58.400 --> 25:12.480 OpenBSD, basically. And this bus permits plugging VirtIO on MMIO, and not using any bus, which would 25:12.480 --> 25:23.520 be overkill. Okay. So, that's the second big chunk. After that, and I won't go into the details. 25:23.600 --> 25:31.200 So, that's a little joke from Krayo from the NetBSD crowd. Each time I was booting a 25:31.200 --> 25:40.400 kernel faster, he was saying: faster. So, that's what I did. And using various techniques, 25:41.520 --> 25:50.320 I killed some delays that were either useless or avoidable in the context of a virtual machine. 25:51.040 --> 26:03.280 And this is where I achieved those, depending on the machine, from 10 milliseconds to 20, 25. 26:04.400 --> 26:12.880 Someone with an Intel i9, I don't know what, achieved 8 milliseconds, which is very good. 26:14.560 --> 26:18.400 Okay. I'm going to stop there to get some questions, or 26:21.200 --> 26:33.760 Okay. Okay. Okay. So, okay. Cool. Okay. Cool. So, that's about the implementation. Now, 26:34.960 --> 26:44.800 Colin did tremendous work in the arena of optimization, speed calculation, and so on and so forth. 26:45.680 --> 27:00.000 And for example, he did something called TSLOG. TSLOG is very simple, but genius.
TSLOG, what it does: 27:00.000 --> 27:11.360 you know that on x86 infrastructure, we have a clock, or more like a counter, which is called 27:11.520 --> 27:23.520 the TSC, read through RDTSC. Basically, it ticks every CPU cycle. But with that, you can know what time it is in your boot process. 27:23.520 --> 27:32.480 Okay. And using this, along with function names, you could say, okay, this guy is taking me 27:33.200 --> 27:40.640 15 milliseconds, this guy 20, this guy 100. And you, well, I implemented it, stole it. 27:41.760 --> 27:53.200 Yeah, I basically stole code. That's more or less what I do. And using TSLOG, I realized that there were 27:53.200 --> 28:02.240 functions that probably were subject to optimization. And that was the boot without 28:02.320 --> 28:12.800 optimization, with only PVH and MMIO. And that's where I came down to. Okay. By, again, 28:12.800 --> 28:23.680 killing some useless functions, optimizing other things, putting, you know, sometimes 20 milliseconds, 28:24.640 --> 28:32.640 you can gain them just by putting a return somewhere instead of: if blah, blah, blah, blah, 28:32.640 --> 28:43.680 okay. Sometimes you break the build, also, doing that. Okay. Questions? 28:54.400 --> 29:00.480 Have you looked at adding support to NetBSD's hypervisor for the microvm architecture? 29:00.480 --> 29:09.840 Yeah, that's a very good question. So, NetBSD has a hypervisor called NVMM. It works. 29:11.120 --> 29:16.960 A bit slower than KVM. Yeah, I didn't say that. But the, oh, yeah. 29:17.120 --> 29:28.720 I was asked if I tried it and implemented it in NetBSD's hypervisor. NetBSD's 29:28.720 --> 29:37.280 hypervisor is called NVMM. It works on this machine. And on my development machines, 29:37.280 --> 29:46.560 I use KVM, Linux KVM. There's a very good reason for that. I mean, let's, let's face it. 29:47.200 --> 29:53.440 The world is a big Linux. And I mean, people are using Linux and Mac.
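Coming back to the TSLOG idea described before the questions (a cycle counter sampled together with function names), the arithmetic behind "this guy is taking me 15 milliseconds" can be sketched as follows. This is a minimal sketch under stated assumptions: the record format `(cycles, event, name)`, the function names, and the 2 GHz counter frequency are all hypothetical, and this is not the real TSLOG output format or the speaker's tooling.

```python
from collections import defaultdict


def time_per_function(records, tsc_hz):
    """Sum the time spent between ENTER and EXIT records per function.

    records: iterable of (tsc_cycles, event, name), with event being
             "ENTER" or "EXIT" (hypothetical format for illustration).
    tsc_hz:  assumed counter frequency in cycles per second.
    Returns a dict mapping function name to milliseconds.
    """
    open_calls = {}
    totals_ms = defaultdict(float)
    for cycles, event, name in records:
        if event == "ENTER":
            open_calls[name] = cycles
        elif event == "EXIT" and name in open_calls:
            elapsed_cycles = cycles - open_calls.pop(name)
            # cycles / (cycles per second) = seconds; times 1000 = ms
            totals_ms[name] += elapsed_cycles * 1000.0 / tsc_hz
    return dict(totals_ms)


# Made-up boot trace: two functions, counter running at 2 GHz.
records = [
    (0, "ENTER", "probe_bus"),
    (30_000_000, "EXIT", "probe_bus"),
    (30_000_000, "ENTER", "init_console"),
    (40_000_000, "EXIT", "init_console"),
]
per_fn = time_per_function(records, tsc_hz=2_000_000_000)
for name, ms in sorted(per_fn.items(), key=lambda item: -item[1]):
    print(f"{name}: {ms:.1f} ms")
```

Sorting by time spent puts the most expensive functions first, which is the view that makes candidates for an early return or a removed delay stand out.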
29:55.440 --> 30:03.200 So, I wanted this project to work primarily for Linux and Mac, and it does. 30:05.600 --> 30:11.360 But, I mean, I couldn't leave NetBSD behind. So, I tried all this,