Hello everyone. Welcome to our presentation on building cloud infrastructure for AI. I think everyone's quite familiar with what cloud infrastructure is; today we'll be looking at it in the context of AI. By AI, I simply mean accelerated compute, and for all intents and purposes today that means GPUs.

Who am I? I'm Dave Hughes: software engineer, sysadmin, network person, done it all at some stage. Passionate technologist and open source champion. I love open source. I've been a Debian maintainer at one point in my life, contributed back to various userspace projects, normally bug fixes found while packaging the userspace software and whatever else. Primarily, though, I'm a problem solver. I like to solve problems. My background is originally in high-performance computing, parallel computing, that sort of thing. I got into GPU compute when it was really new and everybody was excited by CUDA and OpenCL. I got into rendering through that and have been contributing to Blender for 11 years now. Through pure coincidence, working on distributed rendering for it, I got into the whole cloud infrastructure area, ended up really enjoying it, stuck around, and here I am. I find it really cool. I'm still very passionate about making things perform fast, the whole performance aspect of things, but I've found I really enjoy thinking about and building the big picture, not just hacking away on one component of the stack for years at a time. We're going to see a lot of this big picture later on.

First of all, what is a cloud? A simple way of looking at it, and I want to skip through this quickly: abstraction of resources and infrastructure; self-service for developer teams; on-demand, elastic nature; API-driven; multi-tenant. I think everyone's quite aware of that. Kind of obvious, but not obvious.

How does it factor into GPUs? GPUs are typically what's used for accelerated compute; other accelerators will be common in the future, but today it's GPUs. Typically RDMA interconnects if you're doing a big cluster, though not always: it's quite common to see cases where people throw consumer-grade GPUs into servers with no RDMA interconnect and do things like that. Fast storage, with a caveat: not necessarily the fastest. Everyone seems to get hung up on this topic and thinks they need the fastest; you just need an acceptable level. Not necessarily RDMA for storage either. There's been a bit of discussion on this: a company by the name of XTX released a new distributed file system recently named TernFS, and they chose not to use RDMA in it either, which generated quite a bit of discussion.

Dense, very big servers. Historically we'd think of a big, beefy storage server, but these days it's very big, dense compute servers.
Pulling as much as, like, between 12 and 20 kilowatts per node.

Three rough scales of consumption in this model. A single GPU: just passing one GPU through to a virtual machine. An entire node: that could be 4 GPUs in the box, it could be 10, it could be 8, it could even use NVLink if you're using NVIDIA. Or a cluster, which is what I referred to earlier: you're dealing with cross-node interconnect and basically tying thousands of these things together.

GPU clouds: the good, the bad, and the ugly. There are a few different ways these GPU clouds actually come together. First of all, you'll see many people sort of masquerading as a cloud, but what they really do is purchase assets, image the assets, and simply hand over the keys to the BMC: a you-break-it-you-buy-it type of deal. Other people will throw OpenStack on there. It's quite an obvious path to a quick MVP, but it's riddled with problems, I would say. Quite tightly coupled: there were many cases where we couldn't have high availability on control planes, and restarting the DHCP server would take out part of the network. It's nice in a lot of places, but the tightly coupled nature of many of the components makes it really difficult to use in production. Arguably it's designed for private cloud operation on your own premises, and people insist on using it as a public cloud for multi-tenancy, which is not ideal. This is coming from the experience of running it in production for half a decade or more and trying to build a GPU cloud with it, not hacking around for six weeks and deciding it wasn't fit for purpose.

Kubernetes is another one that gets used today. People seem to think they can throw a workload manager onto a cluster and all of a sudden it's a cloud; arguably not. Out of the box there's no strong isolation of tenants, due to the nature of using containers rather than virtual machines on Linux: shared kernel and drivers, not exactly ideal when you might have people who want different driver versions for their compute instances and so on. Various gaps on the networking side too: most of the common networking options for Kubernetes, like Calico and everything else, are not quite what you want for cloud infrastructure out of the box.

There are a few other options that normally come up for cloud infrastructure. Nomad by HashiCorp was used quite a lot, again as an additional sort of cluster workload manager, and some people have successfully built things out of that. Triton was the cloud stack by Joyent, very popular back in its day, still being used today by MNX.

How to design a cloud 101? Redundancy everywhere: again, kind of obvious, but not obvious. Scheduling maintenance windows doesn't really scale when you've got a thousand customers.
It's a lot easier to just be able to take the control plane out of production and still have the infrastructure available. Avoid as many manual steps as possible during bootstrapping, network allocations, everything in there. Isolation: again quite obvious, but most people think it's purely for security and privacy; it's actually for QoS as well as security. Standardization is quite key: again obvious, but not obvious; being able to move a customer's compute instance from machine to machine will save you in production. Minimal state: we found this out ourselves. The sort of MVP that was originally built for the GPU cloud we had was stateful everywhere, and it quickly became a nightmare. We came to the conclusion that you keep only the core infrastructure stateful and have everything else stateless. Part of that was down to the, I won't say data centers, but facilities we were operating in, former mining sites, where power could flap constantly; and if you've got power flapping on stateful infrastructure, it's a nightmare to bring it back up.

Hardware selection: there are a few common paths here. The standard path normally is OEM kit: Dell, HPE, whatever. It does what it says on the box. About three or four years ago it was quite difficult to come by most of this kit; Supermicro were one of the few OEMs actually selling HGX servers in any real numbers, whereas Dell and everyone else are doing it these days. They're all mostly identical; they all follow the reference designs from AMD and NVIDIA, and they're quite overkill on specs. Most people don't actually need an 8-GPU HGX server, and I think 4-GPU systems are quite common these days as well, with the focus shifting towards inference and data centers with lower power density. Another important point on the overkill specs is that often there's a single spec where they say: if you want this box, you also need to buy 60 terabytes of SSDs and 800-gigabit NICs, because that's just how we sell it. Most people don't need that.

The next one is the DIY approach: mainboards, PLX switches and GPUs. This sort of approach originated in the mining and VFX rendering world. It's a standard server with some GPUs, not great from a density perspective, but if you're operating in legacy data centers where you've got 8-kilowatt racks, and you've got an AMD EPYC server or something with two or three spare PCIe slots, it's easy to just drop in a card or two and actually try something out.

ODM kit has obviously been proven to be king at scale, but you really need the scale to justify non-OEM kit: you're going to have to either sit down with ODMs in Taiwan, or base your designs on the Open Compute Project's hardware.
Firmware. In an ideal world, you want full control of the firmware, both from a security perspective and for scalability. By open firmware we typically mean coreboot and OpenBMC. It's really, really rare to find this in the wild, unfortunately; there's a lot of room for improvement here. I have been really happy to see that there's active work going on from some of the vendors. Supermicro contributed quite heavily to getting Intel Sapphire Rapids and Emerald Rapids CPU, or chipset, support into coreboot. I think the Supermicro nodes we had in production previously had a sort of work-in-progress coreboot port, but it was never finished. We're seeing a lot more work being done there, though. From the open source community we've obviously seen great work from 3mdeb and 9elements over the past five, six, seven, eight years, and from plenty of others in the community as well. We've got Redfish now alongside IPMI as a management solution. Also, fun fact: you'll find cases where NVIDIA GPU trays have their own BMC, which is good fun when you're trying to control your infrastructure.

Right, with that we have hardware and we have firmware, but we need something to run on it. We have a lot of workloads: the customer workloads, of course, the ones people are actually paying for; internal workloads, all of the services you need in order to run the customer workloads; and supporting workloads, things like monitoring and logging, the bread-and-butter stuff that everybody does. We decided to go with Kubernetes here, not out of any deep love for the technology, but because it does the job, it's widely known, it's widely supported. Pick your battles; this is not somewhere to spend a lot of time when there's something that works. I need to stress that we use this purely internally. It's not like the case mentioned earlier where people just let customers run pods on their Kubernetes; this is purely internal, and all customer workloads sit behind the virtualization barrier, for security reasons.

The nice thing about Kubernetes is that you can also use it as a platform for managing your own resources, so-called CRDs. This is quite common for projects that build on top of Kubernetes, things like KubeVirt or Rook, but you can also use it for your own applications. It's quite convenient: all of our internal state, things like overlay networks, instances, IP allocations, all of it is modelled as custom resources in Kubernetes, and that gives you an automatic API implementation, things like patching endpoints, resource validation, watch streams, client libraries, text UIs, all of that for free. And then the rest of the stack, all of the internal services I mentioned, is basically built as controllers operating on these resources.
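To make that concrete, here is a minimal sketch of what one of those custom resources and its controller could look like in Rust using the kube crate (kube-rs). The resource kind `IpAllocation`, its fields, and the API group `cloud.example.com` are invented for illustration; this is not the actual schema or code used in the stack described here.

```rust
// Assumed dependencies: kube (features "derive", "runtime"), k8s-openapi,
// schemars, serde, thiserror, futures, tokio, anyhow.
use futures::StreamExt;
use kube::runtime::controller::{Action, Controller};
use kube::{Api, Client, CustomResource};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use std::{sync::Arc, time::Duration};

// Spec of a hypothetical internal resource; kube's derive macro generates the
// `IpAllocation` type with metadata, spec and a CRD schema for us.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "cloud.example.com", version = "v1alpha1", kind = "IpAllocation", namespaced)]
pub struct IpAllocationSpec {
    pub tenant: String,
    pub prefix: String, // e.g. "2001:db8:42::/64"
}

#[derive(thiserror::Error, Debug)]
#[error("reconcile failed")]
struct Error;

// Called for every change to an IpAllocation; a real controller would program
// whatever backs the allocation (routes, leases, ...) and update the status.
async fn reconcile(obj: Arc<IpAllocation>, _ctx: Arc<()>) -> Result<Action, Error> {
    println!("reconciling {:?}", obj.metadata.name);
    Ok(Action::requeue(Duration::from_secs(300)))
}

fn error_policy(_obj: Arc<IpAllocation>, _err: &Error, _ctx: Arc<()>) -> Action {
    Action::requeue(Duration::from_secs(30))
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let allocations: Api<IpAllocation> = Api::all(client);
    Controller::new(allocations, Default::default())
        .run(reconcile, error_policy, Arc::new(()))
        .for_each(|_| async {})
        .await;
    Ok(())
}
```

Each internal service then boils down to a reconcile loop like this, watching its own resource type, which is what makes the API, validation and watch machinery come for free.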
These controllers are, in our case, nowadays mostly written in Rust, for similar reasons as Kubernetes: not out of any deep particular love, but because a lot of people on the team knew it and the others were open to it, so it's what we use now.

Orchestration is nice, but you need an OS and a host to run it on. The traditional model for that is probably: put the server in a data center, install an OS on it, maybe have a network-boot-based solution that images the servers. The problem with that, and we did this in earlier iterations, is that over time you end up with a lot of state drift. You're like, oh, we can't reboot this machine, there's an important customer on there; at some point the image is two years out of date; then you hack something together with Ansible, and all of the machines are different, and it's not great. The solution here is, again as mentioned earlier, to minimize state wherever possible. In our case this means the servers are completely ephemeral. They boot over the network, they load the OS into RAM and just run from that, and when you reboot them they simply pull whatever the latest image is, so there's no stale data on them. Some state is required, of course: in practice the Kubernetes control plane needs to persist things across reboots, and Ceph storage also needs to be persistent, so that just gets mounted in after the image boots, but everything else is stateless.

For the images, we build them with mkosi, or however you pronounce it; it's part of the systemd project, a fairly new tool. We use it to build Debian-based images, nothing particularly fancy there. The only interesting part is that we don't have a classic rootfs for network booting; we just pack the entire distribution into the initrd because, well, it's running from RAM anyway, you might as well, and you skip the entire part where you also need to configure your initrd to bring up the network to pull the rootfs. Just a nice simplification.

Speaking of network boot, this is roughly the flow we have there. Traditionally you would have a PXE ROM, which chainloads into iPXE, which fetches a config, which fetches a kernel and an initrd, which boots, which brings up the network again, which fetches the rootfs, which then boots. A lot of steps. As already mentioned, we're cutting out the rootfs; another thing we're cutting out is the whole PXE part. We use UEFI HTTP booting, which tends to be quite well supported by firmware nowadays. So basically the firmware on the box just brings up DHCP, does an HTTP request, fetches a UEFI binary, and boots it; you can pack Linux and an initrd into a single UEFI binary nowadays.
One more thing we do here is configuration for the host. When a system boots, it sends a request to the netboot service I mentioned; that looks at NetBox, fetches the configuration for that server, packs it into an initrd, appends it to the image, packs the whole thing into a UEFI image and streams that out. The server boots it and goes into Linux, and its PID 1 is a small utility that renders out templates in the image with the config values that came from NetBox, and then it hands over to systemd, which brings up the system normally. It sounds quite hacky, but it works very well in practice, and we don't need to worry about persisting it across reboots because everything is ephemeral anyway.
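As a rough illustration of that hand-over, here is a minimal sketch in Rust of what such a PID 1 utility might look like: read the config that was baked into the appended initrd, substitute the values into templates shipped in the image, then exec systemd. The file paths, the config format and the naive `{{key}}` substitution are assumptions for the example, not the real tool.

```rust
use std::{collections::HashMap, fs, os::unix::process::CommandExt, process::Command};

fn main() -> std::io::Result<()> {
    // Hypothetical key=value config baked into the appended initrd by the netboot
    // service, e.g. "hostname=gpu-node-17".
    let mut values = HashMap::new();
    for line in fs::read_to_string("/etc/netbox-config.env")?.lines() {
        if let Some((key, val)) = line.split_once('=') {
            values.insert(key.trim().to_string(), val.trim().to_string());
        }
    }

    // Render every "*.tmpl" file in the image by replacing "{{key}}" placeholders.
    for entry in fs::read_dir("/etc/templates")? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("tmpl") {
            continue;
        }
        let mut rendered = fs::read_to_string(&path)?;
        for (key, val) in &values {
            rendered = rendered.replace(&format!("{{{{{key}}}}}"), val);
        }
        fs::write(path.with_extension(""), rendered)?;
    }

    // Hand over to systemd; exec() replaces this process, so systemd becomes PID 1.
    Err(Command::new("/usr/lib/systemd/systemd").exec())
}
```

The real utility obviously does more, but the shape is the same: nothing is persisted on disk, everything is derived from NetBox at boot time.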
So then we come to networking, which is arguably the core of a cloud operation, because a lot of the features you want to build depend on the network, so there's a lot of thought you want to put into getting it right; scalability and flexibility are really important here. What we do in practice is keep a strict separation between the underlay network, whose only job is to connect the servers, and the overlays, which then connect the workloads on top. The underlay, as I said, has only one job: making sure the servers can talk to each other. You can do it very basically, just connect everything to a switch, DHCP, done. If you have a big operation, you'll want something more powerful. Something we found works really well is routing to the host: every host has a BGP daemon on it, has a loopback IP, peers with the switches, and announces that. That way you can build whatever network topology you want; you can do things like uplinking two network cards to separate switches without having to mess around with MLAG or whatever, and everything just works. On the switches we like SONiC, which is open source switch firmware based on Linux. It has some problems in practice, but you can make it work, and when you do it's really nice, because you can do whatever you want; it's fully programmable, which is very powerful. For the overlays we use VXLAN EVPN, again simply because it's the standard: everybody supports it, hardware supports it, no reason not to.

The underlay logic is isolated into its own network namespace, so this is roughly what the host looks like: the actual network cards live in that namespace, along with the BGP speaker, and then we have the VXLAN interfaces and bridges. There we have one Kubernetes overlay, which is hooked up into the main host network namespace, where Kubernetes and so on live, so our cluster network and everything else just goes through that overlay. Then we have a public overlay, more on that on the next slide, which is used to connect workloads to the internet. And of course we have tenant overlays, which are just L2 plus DHCP for our customers, to connect customer instances together, kind of like a private network.

The public overlay is arguably an SDN, but a very basic one, built in-house to do exactly what we need, again to have the flexibility to offer various features. The logic here is that there are endpoints in this public overlay: things like instances, load balancers, services such as object storage, and of course the edge routers, which act as the exit to the internet, also sit in this overlay. They announce reachability information about themselves and about the routes or prefixes they advertise. Again, this goes into Kubernetes resources, one resource per endpoint, and it looks something like this: the endpoint says, this is my IP in this overlay, this is my MAC address, and these are my routes, please send them to me. If you're curious about the port thing, next slide.

This is a public IPv6 space, again just hooked up to the internet, but you can also route IPv4 traffic via an IPv6 next hop; that simplifies things because you don't need to do any IPv4 addressing there, and IPv4 addresses are quite scarce nowadays. Then we just have a network daemon on each host which watches all of these resources and sets up the networking state for the endpoints accordingly (sketched below): we pre-populate the neighbor table based on the MAC information that's there, to avoid neighbor discovery flooding in the VXLAN, and we populate the route table, potentially with multiple hops, based on what each node advertises. That ends up working quite well. Of course, we want to hook it up to the internet eventually, so we also have one component that takes all of this and translates it into BGP announcements, which then go out to a hardware router that's connected to the internet. The reason we don't just use BGP for everything here is, again, flexibility: things like announcing specific ports you can't really do there.
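To sketch what that per-host network daemon does with one of these endpoint resources, here is a small, illustrative Rust snippet that pre-populates a neighbor entry and installs the advertised routes by shelling out to iproute2. The endpoint fields, the `br-public` bridge name and the addresses are invented for the example; a production daemon would talk netlink directly and react to watch events rather than doing a one-shot apply.

```rust
use std::process::Command;

/// Simplified view of the per-endpoint resource described above (illustrative only).
struct Endpoint {
    ip: String,          // endpoint address inside the public overlay
    mac: String,         // used to pre-populate the neighbor table
    routes: Vec<String>, // IPv4 prefixes that should be routed to this endpoint
}

fn ip_cmd(args: &[&str]) {
    // `ip` is the standard iproute2 binary; this sketch only reports failures.
    let status = Command::new("ip").args(args).status().expect("failed to run ip");
    if !status.success() {
        eprintln!("ip {:?} exited with {}", args, status);
    }
}

fn apply(ep: &Endpoint, dev: &str) {
    // Static neighbor entry: avoids neighbor discovery flooding across the VXLAN fabric.
    ip_cmd(&["neigh", "replace", ep.ip.as_str(), "lladdr", ep.mac.as_str(),
             "dev", dev, "nud", "permanent"]);
    // Install each advertised prefix with the endpoint as next hop; "via inet6"
    // is the IPv4-prefix-with-IPv6-next-hop trick mentioned above.
    for prefix in &ep.routes {
        ip_cmd(&["route", "replace", prefix.as_str(), "via", "inet6", ep.ip.as_str(),
                 "dev", dev]);
    }
}

fn main() {
    let ep = Endpoint {
        ip: "2001:db8:100::42".into(),
        mac: "52:54:00:12:34:56".into(),
        routes: vec!["192.0.2.0/24".into()],
    };
    apply(&ep, "br-public");
}
```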
We have a stateless layer-4 load balancer there with a custom control plane; the data plane is basic Linux IPVS right now. We're looking at implementing something with the second-chance routing that GitHub proposed for their GLB load balancer and that Cloudflare also uses; there are very good blog posts about that.

Then traffic offloading, to get more throughput. All of these servers have 400-gigabit-class NICs now, and doing that with Linux kernel networking is kind of challenging. One thing we did there in the past, and are now looking at again, is using VPP: you bind the physical network cards to VPP, modify our network daemon to push all of this state in there, and then you use tap interfaces to connect traffic back to container applications like Kubernetes, while VMs are hooked up directly via vhost-user. Of course, nowadays the very modern thing is to use a DPU and offload everything onto that, but we're not quite there yet.

Networking for instances: each instance gets a public IPv6 subnet delegated to it. Instance traffic to the internet is just a regular firewall; netfilter conntrack works well. We run this on the instance's host to avoid having a central point of failure, because it needs to keep track of connection state. A public IPv4 address is optional, because they're scarce; alternatively you just use NAT. There are some trade-offs there: it needs to be decentralized to be scalable, but if you get abuse reports you need to be able to map traffic back to a customer. So what we do is slice up public IPs by port number: we assign one port range to each instance, and then the instance NATs only within that range.
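As an illustration of that port-range NAT idea, here is a small Rust helper that builds the corresponding nftables rule and applies it with the `nft` CLI. The addresses, the interface name, the 1024-port slice and the assumption that an `ip nat` table with a `postrouting` chain already exists are all made-up example values, not the production rules.

```rust
use std::process::Command;

/// Build an SNAT rule that maps one instance's private address onto a shared
/// public IP, but only within the source-port range reserved for that instance.
/// Assumes `table ip nat` with a nat-type `postrouting` chain already exists.
fn port_range_snat(private_ip: &str, public_ip: &str, first_port: u16, ports: u16) -> String {
    let last_port = first_port + ports - 1;
    // nftables syntax: "snat to <addr>:<first>-<last>" restricts the source ports used.
    format!(
        "add rule ip nat postrouting ip saddr {private_ip} oifname uplink0 \
         snat to {public_ip}:{first_port}-{last_port}"
    )
}

fn main() {
    // Example: this instance gets ports 13072..=14095 of the shared public IP.
    let rule = port_range_snat("10.80.3.7", "198.51.100.9", 13072, 1024);
    println!("applying: nft {rule}");

    let status = Command::new("nft")
        .args(rule.split_whitespace())
        .status()
        .expect("failed to run nft");
    assert!(status.success(), "nft rejected the rule");
}
```

Any return traffic, or an abuse report quoting a source port, can then be mapped back to exactly one instance.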
The RDMA network, again, is important for GPU clusters; the classic number there is around 3.6 terabits per node. Obviously it needs to be hardware accelerated. Luckily you usually have one network card per GPU, so you don't need to slice those up. For InfiniBand, the whole fabric is one RDMA network and you partition it with something called pkeys, which are kind of like VLANs; sorry to skip over some things here for time reasons. RoCE goes over Ethernet and is kind of similar: you can do basic VLAN isolation on the network card, but it's better to do it on the switch, and again SONiC is nice for that.

For storage you want something that's fast and reliable; unfortunately that's not always quite feasible, so in practice we split it. Fast storage is just local NVMe on the server. Again, it's ephemeral, because we don't want to guarantee anything about its state, so it's used for caching. Then we have network storage, which needs to be robust. With the local storage it's easy to hit a bottleneck: if you have four Gen 4 NVMe drives and you want to slice them up for customers and encrypt them, you see the bottleneck quickly. So we use SPDK in practice there. It's, again, like VPP, a userspace offload, but for storage, and we get about 98% of the raw throughput even though we do crypto and LVM and RAID and pass it through to the VM, so that's quite nice. For the network storage we use Ceph. It does everything: file systems exposed via virtio-fs, block storage via SPDK, object storage via RGW, and the Kubernetes cluster uses it via Rook.

And then virtualization at the end. The obvious option at first glance is KubeVirt; we looked at that. But the thing is, last time we checked at least, it's designed to integrate legacy applications into your Kubernetes cluster, and we want the exact opposite: we want it isolated, and with all of the special things we do, hacking that into KubeVirt... at some point it's easier to just say, you know what, we'll run our own QEMU in a container. That does the job.

This is what a typical 8-GPU server looks like. You need to mind the NUMA topology there (there's a rough sketch of this below): you want to select GPUs that are close together, then select CPU cores on the same NUMA node, then pin the vCPUs, both for performance and for security; side-channel attacks, everybody's aware of those nowadays. And you want to pass this topology info to the guest, so it can schedule accordingly.

Then, the last GPU-specific thing about virtualization: you need to be mindful of the PCIe topology here. There's a lot of GPU-to-GPU and GPU-to-network-card traffic, which is PCIe peer-to-peer. If that goes through your CPU root complex, you're going to bottleneck hard, because there aren't enough lanes there. Workloads also need to know the topology so they don't accidentally send traffic across the entire node because they think it's local. You can configure this in the instance manually, but it's easier to just model it in QEMU by having virtualized PCIe switches in the topology. A classic pitfall there: if you just pass these devices through, you have the IOMMU in the data path, and that forces all the traffic through your CPU, so instead of your 2.6 terabytes you're suddenly getting 180 gigabytes, which is not very nice. The way to avoid that is a feature called Address Translation Services, which you enable on the network card, and then you enable direct peer-to-peer traffic via Access Control Services on the PCIe switch. You can just Google that, find the commands for it, and then you get your full throughput. In that case you need to trust the network card firmware not to do malicious DMA attacks on your host; well, that's just something you have to accept in this case, to be honest.
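A minimal sketch of that NUMA-aware placement, assuming a Linux host: read each passed-through GPU's NUMA node from sysfs, then pick host CPUs from the same node to pin the vCPUs to. The PCI addresses are placeholders, and a real launcher would feed the result into the QEMU or libvirt configuration rather than just printing it.

```rust
use std::fs;

/// NUMA node a PCI device is attached to, according to sysfs (-1 means unknown).
fn pci_numa_node(bdf: &str) -> i32 {
    fs::read_to_string(format!("/sys/bus/pci/devices/{bdf}/numa_node"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(-1)
}

/// CPUs belonging to a NUMA node, e.g. the contents of
/// /sys/devices/system/node/node0/cpulist ("0-15,64-79").
fn node_cpulist(node: i32) -> String {
    fs::read_to_string(format!("/sys/devices/system/node/node{node}/cpulist"))
        .map(|s| s.trim().to_string())
        .unwrap_or_default()
}

fn main() {
    // Placeholder PCI addresses for the two GPUs we want to hand to one guest.
    let gpus = ["0000:17:00.0", "0000:31:00.0"];

    for bdf in gpus {
        let node = pci_numa_node(bdf);
        println!(
            "GPU {bdf}: NUMA node {node}, pin vCPUs to host CPUs {}",
            node_cpulist(node)
        );
    }
}
```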
Finally, the last point: you may want a GPU interconnect, something like NVLink, or a more vendor-neutral equivalent. If you hand out GPUs one at a time, it's really easy: there's no GPU interconnect to deal with, you just pass through the GPU and you're done. If you pass all of the GPUs through, it's also easy: you pass everything through and let the guest manage it. If you're mixing, it's difficult, because now you need to configure which GPUs are allowed to talk to which, and that's very vendor-specific right now, unfortunately. I hope this is an area that will get more support and standardization in the future.

And with that, that's the quick tour. Unfortunately we couldn't go into a lot of detail on many things, just because, again, it's a big picture in a small time slot. But if you have any questions or follow-ups, come talk to any of us at the event or send us an email. If you want to learn more about any of this, let us know and we can do another talk next year. Thank you very much, everyone.