Host: OK, our next talk is by Stéphane Graber, about Incus and how he managed to enable it to run OCI containers.

Stéphane: All right, hello everyone. So, as Christian just said, I'm Stéphane Graber. I'm the project leader for Linux Containers and one of the maintainers of Incus. I'm also the owner of my own company, doing consulting and that kind of stuff, and the CTO of FuturFusion, which is another company doing large-scale enterprise Incus work. Today we're going to be talking specifically about OCI, so application containers, within Incus: what we've done, and why, and how.

So, to kick things off, just very briefly, what's Incus? Because maybe some of you don't know. Incus is a system container and virtual machine manager, and nowadays also an application container manager. It's all image based, built on a REST API, and it's got a pretty simple CLI. It's got support for most of the stuff you would normally expect, so you can use snapshots, backups, a bunch of different networking and storage options. It also has a small web UI. You can use projects to segment things, so with external authentication and authorization you can actually turn it into a multi-tenant environment. It can be clustered up to 50 to 100-ish servers, so you can run it at reasonably large scale. It supports distributed storage with Ceph, and we're also adding LINSTOR now, and we support shared blocks with LVM. And then on the network side we use OVN for all of the software-defined networking bits, but that's just an option; physical networking works just as well. As I mentioned, if you have external systems to integrate with, it can use OpenID Connect for authentication and OpenFGA for fine-grained authorization. There are a number of web interfaces; ours is the one we usually go with when someone wants to see it. Personally, I pretty much always use the CLI, so the demo afterwards is going to be CLI. And yeah, it's a reasonably active project these days: we had 130-ish contributors last year, it's all written in Go, it's all Apache 2.0, and it's on GitHub.
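(A minimal sketch of the project-based segmentation mentioned above; the project name is a placeholder and the -c flags assume the current incus project create options:)

    # Create an isolated project with its own images and profiles
    incus project create tenant1 -c features.images=true -c features.profiles=true

    # Point the CLI at it; instances created from now on are scoped to the project
    incus project switch tenant1
    incus list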
So why support application containers anyway? From all the way back at the beginning of the project, back when we were still LXD under Canonical, we only focused on system containers, so running full Linux distros in containers. It didn't even do VMs; it was really just containers, just full Linux distros. And that worked pretty well, it was very useful to a bunch of folks, and our thought at the time was: well, there's libvirt, people are going to be using that for VMs, they're going to do both on the same system, and we were expecting libvirt to become more and more user-friendly over time, get an API, that kind of stuff, so that everyone would be happy. We noticed that this never really happened. libvirt kind of remained where it was, an implementation detail used by OpenStack and others, instead of turning into something that regular users really enjoy using. So after a few years of waiting to see what happened, we said: OK, fine, we'll just bite the bullet and we'll add VM support to what was LXD at the time. And that worked pretty well for us; it was actually reasonably easy. I think we got it working in just a few months, and we've kept at it over a couple of years to support a lot of features, but that was pretty easy, and we ended up in the same situation with application containers. Obviously, Docker has been a big thing for a while. People have been using it, they've been using it alongside Incus, and they've been using it inside of Incus containers. Both kind of work, but Docker alongside Incus has a bit of a tendency to break networking, because it assumes it owns everything, and so it injects a bunch of firewall rules that then block everything else on the system. You need to go and mess with that; we've got documentation on how to fix it, but it's kind of annoying. You also end up having your own networking and storage both in Incus and in Docker, and it gets a bit annoying integrating things together. If you instead go with Docker inside of Incus, that works well enough, but Incus generally runs unprivileged containers with higher security, and that actually interfered with some of the Docker images and the way they run things. The number of storage options was quite limited in that scenario, and networking was still a bit of a mess, because now you've got a network inside of a container, and if you want to integrate with something outside of it, it gets a bit messy.
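(On the firewall issue mentioned above: Docker's default iptables setup drops forwarded traffic, which can cut off bridged Incus instances. One commonly suggested workaround, sketched here assuming the default incusbr0 bridge name, is to allow that traffic through Docker's DOCKER-USER chain; check the Incus documentation for the current guidance:)

    # Let traffic from and to the Incus bridge bypass Docker's FORWARD drop policy
    iptables -I DOCKER-USER -i incusbr0 -j ACCEPT
    iptables -I DOCKER-USER -o incusbr0 -j ACCEPT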
So that was kind of the state of things, but we've seen that a lot of people have uses for application containers. Whether it's for IoT stuff, basically all of the IoT bridges, for Zigbee, Z-Wave, whatever, they all ship as Docker containers these days, like all of the Home Assistant components: they're shipped as OCI images via the Docker Hub, and people are consuming those directly. So it's just a bit of a weird fit to try and manually repackage those things to run them on top of Incus, and otherwise you were doing nested Docker, which was always a bit dodgy. And more and more applications are effectively officially shipped as a Docker OCI image these days. So there was a bit of a need for that. To be clear, we don't want to start competing with Kubernetes or something; that's not our intention. But a lot of people just need a few containers running, they don't need to scale them up and down constantly, so it made sense for us to add that. Also, it ended up being quite easy and fun, and that's always a good justification for doing something.

Now, how does it work? Well, what we do is actually reasonably simple, because there are good tools out there that simplify a lot of it. We obviously need to interact with a registry, so we use skopeo for that. Then we need to fetch the image from the registry; again, skopeo does that for us. Then we need to go and turn that into a viable root filesystem, so we're using umoci for that, which effectively looks at the layers and squashes everything together. And once we've got that, we turn it into a normal Incus image, we load the image into Incus, and we create a normal container from it. The one thing at that stage that's different from a regular system container is that we also process all of the OCI config and metadata: we look at the environment variables, we look at the extra mounts, we look at all that stuff, and the entry point, and we effectively put all of that in place in the container config, and then the container starts up. One common misconception is that we effectively have Incus driving Docker or something; that's not the case. We turn the OCI image into effectively an Incus image, and we run it through our normal container runtime, which is LXC. We don't use runc or any of those at all. We use the exact same runtime whether it's a system container or an application container. And then, yeah, we start the container, and it just works, basically.
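(The rough manual equivalent of that pipeline, using the same two tools; the image name and paths are placeholders, and umoci unpack generally needs root or its rootless mode:)

    # Fetch the image from the Docker Hub into a local OCI layout
    skopeo copy docker://docker.io/library/nginx:latest oci:/tmp/oci-nginx:latest

    # Unpack it, squashing the layers into a single root filesystem
    sudo umoci unpack --image /tmp/oci-nginx:latest /tmp/nginx-bundle
    ls /tmp/nginx-bundle/rootfs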
So, time for a quick demo on the FOSDEM Wi-Fi, which is always fun. Here I've got an empty Incus project, and the first thing we need to do: Incus comes pre-configured with our image server for our own images. We could, in theory, pre-configure the Docker Hub, because it's the most common one, but there are many other registries, so we just don't do it. So you need to actually add your registry. In this case, for the Docker Hub, you can do that and set the protocol to OCI. And once you've done that, you can do a launch of docker:nginx as my-nginx. I pulled the image already, so we don't need to rely on the Wi-Fi too much, and effectively it just launched it. So at that point, hey, I've got a container running. I can go and check that we do have NGINX actually running on this thing, which we do. And if we go look at the config: those of you used to normal Incus containers or VMs know that usually the config is really empty at the beginning, it just has some image fields and some volatile info. That's different for OCI containers, as you can see here. The environment variables that are defined in the OCI image get automatically added to our config. And once that's done, it works a bit differently than what you're used to with Docker, because with Docker, I don't know, maybe there's some magic stuff I don't know how to do, but it's not trivial to go and reconfigure things in place, whereas with Incus it is. You can add additional mounts and stuff while the thing is running, you can change the amount of CPU and memory while it's running, you can add GPUs while it's running, and if you want to change the environment, you don't need to delete it: you can just change the environment, restart it, and you're done. So that makes it quite a bit easier. For my personal use case at home, which is mostly running a whole bunch of IoT and home automation type stuff, if I ever need to reconfigure where the MQTT endpoint is or something, I can just go change the environment, restart the thing, and I'm done. It also uses normal Incus networks, storage, all of those features. So obviously, if you're running a mix of application containers and system containers and VMs, you can run those alongside each other and it all fits nicely: they're on the same network, you can put the same firewall policies and stuff between them, they go on the same storage as the rest of Incus. If you're running a production cluster with redundant storage, then now you've got redundant storage for those too. So it just fits really nicely, and the actual amount of code and effort to do this was pretty minimal.
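(A sketch of the CLI flow from the demo; the remote name, instance name, and environment key are placeholders:)

    # Add the Docker Hub as an OCI registry
    incus remote add docker https://docker.io --protocol=oci

    # Launch an application container from it
    incus launch docker:nginx my-nginx

    # Inspect the generated config: OCI environment variables, entry point, etc.
    incus config show my-nginx

    # Reconfigure in place: resources and devices while running,
    # environment changes applied on restart
    incus config set my-nginx limits.cpu=2 limits.memory=2GiB
    incus config device add my-nginx data disk source=/srv/data path=/data
    incus config set my-nginx environment.MQTT_HOST=10.0.0.5
    incus restart my-nginx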
We did have to do a bit of extra work afterwards. For example, Incus had zero need for an auto-restart policy, because we were running either VMs or system containers, and those usually don't die: if QEMU crashes, you probably have bigger problems, and if PID 1, like systemd in a container, crashes, you probably also have bigger problems. So we'd never needed a restart policy. But obviously with application containers it's pretty common that if a service wants to restart itself to reload, it just exits, the container dies and starts back up. So we've had to add auto-restart. The other thing we'd never needed in Incus: because we're running full distros, full operating systems, those usually have a network management tool of some kind that does the DHCP for network config. That didn't exist here, so we actually needed to write a tiny DHCP client, which runs when the container starts, gets a lease, stays in the background and does the renewals. But it also means that you can literally bridge those OCI containers directly onto your physical network, and they will just grab an IP from DHCP, nice and easy; you don't need to mess with static IPs or anything like that.
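(A sketch of that bridging setup, assuming an existing host bridge named br0 and the instance from earlier:)

    # Attach the OCI container directly to the physical network
    incus config device add my-nginx eth0 nic nictype=bridged parent=br0
    incus restart my-nginx

    # The built-in DHCP client grabs a lease from the LAN's DHCP server
    incus list my-nginx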
Now we get to what's coming up next. I mean, for my personal use we're done, it works, but there are always things we can do better. Currently I don't love the fact that we shell out to umoci and skopeo, because both of them are Go code bases and we are a Go code base: we should be able to just use the right logic directly and not need distributions to ship a separate tool. So that's something that we'd like to do. For umoci it's pretty easy to do; umoci's creator and maintainer is also an Incus maintainer, so if we need changes there, nice and easy. skopeo is a bit worse: from what I've seen, it's not particularly well split into parts, it's not really designed to be included in other code bases, so we might need to look at what we do there. There's a bunch of discussions around handling of private registries, which we currently don't do, around how we handle the authentication and all of that. Obviously, we expose all of that through the API with a client/server type of design, so we depend on the authentication scheme: if it's just a username and password, we can pass that through the request very easily; if it's something more complex where you need to go get a temporary-use token and stuff, it gets a bit more complicated. So we're looking at the best options to handle that in a way that feels mostly natural and easy for those already dealing with private registries in Docker and other tools.

Something that's going to be a bigger piece of work, but that for us would kind of complete the set, is allowing running those as VMs as well. For our normal images, if you do an incus launch of images:ubuntu/24.04, you get a container; if you add --vm, you get the same thing as a VM. We want the same experience for OCI images, so that if you launch them as they are, you get a container, and if you do --vm, you get a very thin VM layer with the container image running inside it. So that's kind of a Kata Containers-like design; we'd probably have a very similar design with QEMU and virtiofs. We've got all of those pieces ready for our normal VMs; it's just a matter of putting the right bits in the right places. And the last thing is potentially handling layers. But the reality so far is that 99% of the images we've looked at are small enough that squashing everything together into a single layer is fine; we can manage, and we've not really seen a good use case for keeping the layers. The one big use case would be people doing the AI/ML type stuff with a massive NVIDIA-type base layer: there it would be a bit annoying to run, say, three containers and have, in theory, 99% of the image shared, but actually duplicated. But then for us to start supporting layers throughout all of our image management logic and volumes, across the different storage backends and across a cluster and all of that, it's not trivial. So it's going to be a matter of balancing the need for it. So far it's basically: if you have that need, you might as well keep using Docker.
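(The container/VM symmetry described above, as it exists today for distro images; the OCI --vm variant is still on the roadmap, not implemented:)

    # Same image, two shapes
    incus launch images:ubuntu/24.04 c1        # system container
    incus launch images:ubuntu/24.04 v1 --vm   # virtual machine

    # Planned, not available yet: a thin-VM wrapper for OCI images
    # incus launch docker:nginx n1 --vm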
And that's basically it for me. If you want to play with it online, we've got the online demo that lets you play with Incus containers and VMs, and, unless the IP address has changed and I need to fix the firewall, normally also OCI images from the Docker Hub. So yeah, that's a good way to effectively get a VM on an Incus cluster that has nested VM support and that has Incus installed, so you can play with it. And we've got a few minutes for questions.

Host: I'm going to steal one question. The layering problem: couldn't you do this similar to what systemd is doing with system extensions and configuration extensions, where you essentially have images that you compose using overlays, for example?

Stéphane: That's probably how we would do it, yeah: you would download the layers and then do an overlayfs on top, and that's not necessarily that difficult for us. The part that's more difficult for me is that right now, in our image store and all of our internal tracking, we've got an image as a single object. With layers, we're going to have to track potentially 20 different objects for an image and keep track of who's using what. And when we do replication of images across the cluster, we can't just have one layer replicated to one host and the next layer on other hosts, because then it's on the wrong machine. So it's all of that tracking logic that's kind of tricky for us, more than actually assembling the thing at the end, because, yeah, assembling the thing is just setting up an overlayfs. We've not done it inside of Incus, but we've done it before in LXC; we're pretty familiar with that process, that's not too difficult. It's mostly all of the keeping track of usage, when you can expire something, all of that stuff, which is actually more complex.

Audience: Thanks, Stéphane, very good talk. I was just wondering about something that you mentioned: that you can add, for example, mounts in the container while the container is running, and you also mentioned that this is also applicable to GPUs. I think Christian presented how the GPUs worked a few years ago, but I was wondering: is it possible to also hot-unplug, to remove GPUs or external devices?

Stéphane: Yeah, exactly. So we can do hot plug: for mounts, we use some weird tricks to propagate the mounts into the container, so we can do that kind of stuff. For GPUs, we can do it because we do kind of the same thing, effectively bind-mounting the character devices that are needed in, and we can remove them too. Technically, it's a bit dodgy, because if you remove a GPU that's currently in use, we don't really keep track of that, and we don't want to go and start killing processes on the thing. So removing a GPU that's currently in use will likely let the process keep using it until the next application tries and then it's gone. But that works just fine, and it's been something we've worked quite a bit on in Incus, so that hot plug works for just about everything, both on containers and on VMs. On VMs, you can also do GPU hot plug: it will do PCI hot plug in, PCI hot plug out. Now, you might get a kernel panic if you've not correctly cleared the usage inside the VM when we yank it out, but we do support that. And our VMs also support CPU hot plug and hot remove, and that works surprisingly well with the right ACPI events. Thanks.
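(A sketch of the device hot plug and unplug just discussed, using a gpu device on the instance from earlier; the device name is a placeholder:)

    # Hot plug a GPU into the running container
    incus config device add my-nginx gpu0 gpu

    # And hot unplug it again (dodgy if something is actively using it)
    incus config device remove my-nginx gpu0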
Audience: Is this already integrated into the patched LXD UI?

Stéphane: Yeah, well, kind of. We do have detection for those, so they will show up as containers in there. Launching them is a bit trickier, because it doesn't know about all of the potential remotes, all of the potential hubs. I think it's possible using the YAML option; otherwise, at least for anything that you launched previously, you should be able to select the cached image and create more from that. I know we did some work on the Terraform side, so the Terraform provider now handles OCI just fine, but I think the UI could do with a bit of an improvement for, like, saying "I want an OCI image" and having it ask you what registry and what name, because with skopeo we can't easily go and list all images on a registry the way we can for distros. So there's just a bit of a gap there. And I think that's it; we're swapping over to someone else shortly. Thank you.