WEBVTT 00:00.000 --> 00:08.000 Thank you very much. 00:08.000 --> 00:11.000 Welcome to my talk. 00:11.000 --> 00:16.000 The agenda for the talk today is: for those of you who don't know NOVA, 00:16.000 --> 00:23.000 I'll give a very brief architecture overview and explain what NOVA is and what runs on top of it. 00:23.000 --> 00:28.000 Then I'll talk a little bit about the formal verification and give an update on the current status, 00:28.000 --> 00:33.000 because NOVA is one of those microkernels that is undergoing formal verification. 00:33.000 --> 00:38.000 And then we get to the main topic of the talk, which is scalability. 00:38.000 --> 00:42.000 You just heard from Michael that scalability is an important topic. 00:42.000 --> 00:47.000 And in NOVA there's a lot of work going on improving scalability. 00:47.000 --> 00:53.000 And today we will focus primarily on supporting thousands of devices, 00:53.000 --> 01:01.000 that is, thousands of interrupts on big-iron server machines, and the new microkernel APIs that this requires. 01:01.000 --> 01:05.000 And hopefully we'll have some time for questions at the end. 01:05.000 --> 01:11.000 So the architecture overview of NOVA is shown on this slide. 01:11.000 --> 01:16.000 You have the microkernel running at the highest privilege level. 01:16.000 --> 01:20.000 In our case it's called the microhypervisor, because it's not just a microkernel; 01:20.000 --> 01:22.000 it is also a hypervisor. 01:22.000 --> 01:27.000 And it's roughly 15 to 20 thousand lines of code, depending on how you count. 01:27.000 --> 01:35.000 And then on top of it you have a componentized multi-server operating system, all these colorful boxes, 01:35.000 --> 01:42.000 and every box represents a component that runs isolated in its own protection domain, or its own address space.
01:42.000 --> 01:46.000 And NOVA is a microhypervisor because some of these components, 01:46.000 --> 01:51.000 shown as these yellow boxes at the top, are actually user-mode virtual machine monitors. 01:51.000 --> 01:55.000 And there's an instance of the VMM for every virtual machine. 01:55.000 --> 02:01.000 And you can run unmodified guest operating systems, unikernels, 02:01.000 --> 02:06.000 container runtimes, whatever you want inside those virtual machines. 02:06.000 --> 02:11.000 And the whole stack is undergoing formal verification. 02:11.000 --> 02:19.000 And for those of you who are very interested in the topic of writing specifications for each of the functions 02:19.000 --> 02:24.000 and mathematically proving them, there was a talk at the microkernel devroom last year, 02:24.000 --> 02:31.000 given by my colleague Hai Dang, who talked in great depth about the formal verification that we do in NOVA. 02:31.000 --> 02:35.000 So for those of you who are interested, I suggest you check that out. 02:35.000 --> 02:46.000 NOVA is written in C++20, and the formal verification attaches a specification to every C++ function. 02:46.000 --> 02:56.000 And then there are proofs for those functions that mathematically verify that each function conforms to its specification. 02:56.000 --> 03:04.000 And in NOVA's case, this is actually a very complex endeavor, because NOVA is a highly concurrent microkernel. 03:04.000 --> 03:12.000 And when you have functions accessing the same state concurrently, the state space can explode. 03:12.000 --> 03:16.000 But we are using separation logic, which makes this problem tractable. 03:16.000 --> 03:23.000 And the approach that we are taking is actually cutting at the system-call, or hypercall, layer. 03:23.000 --> 03:25.000 And we verify from there, top-down. 03:25.000 --> 03:33.000 So you can read this table, which is meant to provide an update on the progress that we're making.
03:33.000 --> 03:39.000 On the left side, you have a list of all the hypercalls in NOVA, 03:39.000 --> 03:45.000 and which of those already have a full specification. 03:45.000 --> 03:57.000 And then we drill down into the stack until, at the very right side, we arrive at capabilities, memory allocators, timers, kernel memory, and so forth. 03:57.000 --> 04:06.000 It's an ongoing effort. And there are a lot of links to the NOVA specifications and the proofs. 04:06.000 --> 04:13.000 And one of the things that happened in the last year is that we actually made those specs and proofs open source under GPLv2. 04:13.000 --> 04:19.000 So you can go to the top link there, and you find all that and you can see it. 04:19.000 --> 04:25.000 The next step that we're taking in formal verification is that we look at weak-memory reasoning: 04:25.000 --> 04:32.000 reasoning about, when one core does a store, what is the point in time when all other cores have seen that store. 04:32.000 --> 04:40.000 And also interaction with the hardware, that is the page-table walkers in the MMU and the TLBs, which requires a lot of complex modeling. 04:40.000 --> 04:46.000 But I don't want to go into the details here, because the subject of the talk is not formal verification. 04:46.000 --> 04:50.000 The subject of the talk is scalability. 04:50.000 --> 04:57.000 NOVA currently runs on the two major architectures. It runs on ARMv8, also called AArch64, 04:57.000 --> 05:07.000 and it runs on the x86 64-bit architecture. And it scales all the way up from the smallest embedded devices. 05:07.000 --> 05:13.000 So in the case of ARM, I've just listed the Raspberry Pi as an example of a smaller embedded device, 05:13.000 --> 05:17.000 all the way up to the largest servers that you can find at cloud providers. 05:17.000 --> 05:30.000 So the largest we've currently tested is an AWS Graviton 4, which has 192 cores, lots of memory, lots of SMMUs, lots of interrupt controllers.
05:30.000 --> 05:35.000 And it actually has 104 PCI devices. 05:35.000 --> 05:44.000 The same on x86: we have one binary that you can build that runs all the way from the smallest x86 machines. 05:44.000 --> 05:49.000 So we like to use Intel NUCs in these small form factors, 05:49.000 --> 05:56.000 all the way up to the latest Granite Rapids Xeon 6 machines, also with 192 threads, lots of RAM, 05:56.000 --> 06:07.000 450-something PCI devices, 10 IOMMUs, 192 CPUs, five-level page tables, so really the big-iron boxes. 06:08.000 --> 06:12.000 And compiling NOVA is very simple. 06:12.000 --> 06:21.000 Assuming you have a cross-compiler for the architecture that you care about, or you compile it natively, you type make ARCH=<whatever your architecture is>. 06:21.000 --> 06:26.000 And in the ARM case, you also have to specify the board, for example the Raspberry Pi. 06:26.000 --> 06:32.000 And then you get a binary which is roughly 100 kilobytes in size. 06:32.000 --> 06:36.000 So really a microkernel, not some big bloated piece of software. 06:36.000 --> 06:42.000 And the same binary works across the entire architecture stack, from small to large. 06:42.000 --> 06:52.000 What I actually want you to take away from this slide is that the machines that we run on are now so large that we really have hundreds of devices. 06:52.000 --> 07:01.000 And even if you assume that every device just needs one interrupt, which is not true (many devices nowadays need tens of interrupts, tens of message-signaled interrupts), 07:01.000 --> 07:12.000 the number of interrupts that traditional microkernels have provided in the past is no longer sufficient. You really need to have thousands of them. 07:12.000 --> 07:15.000 So what do we need to do? 07:15.000 --> 07:23.000 You just heard from Michael that we gain a lot of isolation and a lot of robustness from running device drivers in user mode.
07:23.000 --> 07:28.000 And I want to take a quick look at what it takes to run a device driver in user mode, 07:28.000 --> 07:32.000 and what infrastructure a microkernel has to provide. 07:32.000 --> 07:41.000 So first of all, the code and the data of the device driver: you can put that in a user-mode protection domain, just like an application. 07:41.000 --> 07:46.000 But that device-driver application needs access to the device registers. 07:46.000 --> 07:53.000 On modern platforms this is memory-mapped I/O; on some older Intel machines you also sometimes have port-based I/O. 07:53.000 --> 08:01.000 And this is simple to achieve. You take the MMU, you map the memory-mapped registers of the device into the address space of the application, 08:01.000 --> 08:04.000 and then you can drive them with normal load and store instructions. 08:04.000 --> 08:07.000 So that functionality is trivial to implement. 08:07.000 --> 08:09.000 But then it gets a little bit harder. 08:10.000 --> 08:15.000 You need to make sure that this device driver can actually receive interrupts. 08:15.000 --> 08:19.000 So you need to convert the hardware interrupts into some sort of notification mechanism. 08:19.000 --> 08:24.000 And when the device driver is done doing the interrupt-handler work, the interrupt needs to be acknowledged. 08:24.000 --> 08:32.000 So you need to have some handshake protocol with the kernel, because you cannot give a device driver access to the interrupt controller, for security reasons. 08:32.000 --> 08:36.000 That needs to be under the control of the microkernel. 08:36.000 --> 08:43.000 So the way NOVA solves this is it transforms an occurrence of an interrupt into an up operation on a semaphore. 08:43.000 --> 08:49.000 And when the device driver is done, it downs the semaphore, which automatically acknowledges the interrupt. 08:49.000 --> 08:55.000 And that's a very nice, very flexible, cross-core and cheap notification mechanism.
08:55.000 --> 08:57.000 So that's the interrupt problem. 08:57.000 --> 09:04.000 What we also need to do is control the access of device drivers with regard to DMA. 09:04.000 --> 09:08.000 You cannot have a device driver program an arbitrary memory address, 09:08.000 --> 09:11.000 DMA to that address, and scribble all over the kernel memory. 09:11.000 --> 09:16.000 So you need to have some access-control mechanism, some hardware-enforced access control, 09:16.000 --> 09:22.000 where the use of DMA, or the addressable range for DMA, is restricted, 09:22.000 --> 09:28.000 such that device drivers can only DMA to memory, or from memory, that belongs to them. 09:28.000 --> 09:35.000 And this problem has been solved in many architectures with the concept of an IOMMU, or system MMU, 09:35.000 --> 09:40.000 where you have a translation from DMA addresses to host physical addresses. 09:40.000 --> 09:46.000 So this is DMA remapping, but you need to design interfaces for that as well, and we'll talk about that. 09:46.000 --> 09:53.000 And in NOVA's case, there's a hypercall called assign_dev, where you can assign a device to a DMA space. 09:53.000 --> 10:02.000 And the DMA space is effectively a page table that represents an address space that you can assign to a device, to restrict 10:02.000 --> 10:06.000 what region of memory it can DMA to or from. 10:06.000 --> 10:14.000 And you also need to provide access to interrupts, because not every device or every device driver should be able to trigger any arbitrary interrupt. 10:14.000 --> 10:21.000 You want to say: this device can trigger this particular interrupt, or this particular set of interrupts, 10:21.000 --> 10:29.000 but no more. If you are the NIC driver, you shouldn't be able to assert an interrupt that belongs to a storage driver or the GPU.
10:29.000 --> 10:38.000 So you need an API that takes a global system interrupt and routes it not just to a semaphore, an interrupt semaphore, 10:38.000 --> 10:45.000 but also to a particular CPU, because not all of these thousands of interrupts should hit the same CPU core; 10:45.000 --> 10:49.000 you actually want to spread them out across many cores. 10:49.000 --> 11:01.000 And the hypercall that does this is assign_int, where you can say: assign this particular hardware interrupt of this device, route it to this particular CPU, and configure it. 11:01.000 --> 11:10.000 So I want to explain a little bit how the DMA translation part works, because not many people know all the details, 11:10.000 --> 11:17.000 and I'll use the ARM SMMUv3 as an example. So in the ARM SMMUv3, starting at the top-left corner, 11:17.000 --> 11:23.000 the DMA stream that comes from a device is identified by a stream ID. 11:23.000 --> 11:32.000 And inside the SMMU, there's a multi-level stream table, which is set up by the NOVA microkernel, or microhypervisor, 11:32.000 --> 11:44.000 and every device has a stream table entry that points to a DMA page table, which controls what regions of memory that device can target. 11:44.000 --> 11:52.000 And then the SMMU also has a command queue, where NOVA is the producer, so it puts commands into the queue, 11:52.000 --> 11:59.000 and the SMMU is the consumer; it reads commands off the queue. And there's an event queue, which works in the opposite direction, 11:59.000 --> 12:06.000 where the SMMU puts events in and NOVA reads events out. For example, the SMMU could say: 12:06.000 --> 12:13.000 there was a DMA remapping error, where some device tried to access some region of memory that it couldn't access. 12:13.000 --> 12:23.000 The problem with the SMMUv3 is that it uses stream table entries, which are 512 bits in size. 12:23.000 --> 12:28.000 So if you do the math, that is 8 64-bit words.
12:28.000 --> 12:34.000 You do not have a processor instruction that sets up a stream table entry atomically. 12:34.000 --> 12:42.000 You have to write these 64 bytes in a particular order, and you're effectively racing with the SMMU against those writes. 12:42.000 --> 12:49.000 The spec specifically says that the SMMU may read all those 8 words in any arbitrary order. 12:49.000 --> 13:00.000 The question is: how do you actually set up a stream table entry in such a way that the SMMU always sees a consistent state, and never some partial or corrupted state? 13:00.000 --> 13:05.000 You need to design a protocol for that, and the protocol works as follows. 13:05.000 --> 13:17.000 So we start in the invalid state, where the valid bit of the stream table entry is zero, which means whatever configuration the SMMU has fetched will not be considered valid, 13:17.000 --> 13:19.000 and it won't act upon it. 13:19.000 --> 13:26.000 And the moment that the valid bit becomes one, whatever configuration has been set will be used. 13:26.000 --> 13:36.000 So it's clear that we actually need to set the valid bit last, because that's what ultimately commits the entry. 13:36.000 --> 13:44.000 Now, we are not just racing against the SMMU; we are also racing against the same call being made from other CPUs, 13:44.000 --> 13:50.000 which may also try to configure the stream table entry. So we need to arbitrate to win this race. 13:50.000 --> 13:57.000 So the first thing we do is we atomically set the configuration field in the stream table entry to something other than zero, with an atomic store, 13:57.000 --> 14:03.000 and whoever wins that race can proceed, and everybody else will get an ABORTED error code. 14:03.000 --> 14:09.000 And then, once we've set up the configuration, the valid bit is still clear, so the SMMU won't use the entry. 14:09.000 --> 14:13.000 We have to make that configuration observable.
14:13.000 --> 14:22.000 And that means, in the ARM case, that we need to make sure that the stream table walker in the SMMU can see it; that, if the SMMU is not coherent, 14:22.000 --> 14:29.000 we push those stores all the way out to memory, out of the cache hierarchy, so that the SMMU can actually see them. 14:29.000 --> 14:35.000 And then, because the SMMU may have already read some of those other words, these configuration words, 14:35.000 --> 14:38.000 we have to tell the SMMU to forget this entry. 14:38.000 --> 14:44.000 So this means: we've written the valid configuration, we tell it to forget whatever the invalid configuration was, 14:44.000 --> 14:48.000 and then we set the valid bit in the stream table entry to one. 14:48.000 --> 14:56.000 We make that also observable, but the SMMU may have still cached valid equals zero, so we have to invalidate the STE again. 14:56.000 --> 15:03.000 So you see, this is a very complicated dance you have to do to make the SMMU actually see all the changes in the right order. 15:03.000 --> 15:10.000 And likewise, if you want to remove an entry, you have to atomically clear the valid bit, 15:10.000 --> 15:16.000 and whoever clears the valid bit first wins the race to unconfigure the entry. 15:16.000 --> 15:25.000 And then you can undo all the remaining configuration, but you also have to invalidate the entry at the SMMU, 15:25.000 --> 15:32.000 and invalidate the TLB, which means: forget any translation entries from those DMA page tables that were set up using this entry. 15:32.000 --> 15:39.000 So this is just an example of the complicated synchronization you have to do in a highly concurrent microkernel, 15:39.000 --> 15:45.000 where you're not just racing among the 192 CPUs, but you're also racing with the hardware. 15:45.000 --> 15:51.000 The bad thing is that, whatever locking scheme you may have in mind, 15:51.000 --> 15:54.000 the hardware doesn't know anything about the locks in your software.
15:54.000 --> 15:59.000 So you have to be atomic between the software and the hardware. 15:59.000 --> 16:06.000 But the moment you get that right, you get the synchronization between the cores for free. 16:06.000 --> 16:13.000 So let's look at the interrupt system on ARM, because I said we want to scale to thousands of interrupts, 16:13.000 --> 16:18.000 and we need to understand what types of interrupts exist and how to configure them. 16:18.000 --> 16:26.000 In the old ARM world, let's say before GIC version 3 was introduced, 16:27.000 --> 16:31.000 the interrupt subsystem was split into several components. 16:31.000 --> 16:36.000 Starting at the top-left corner, we have the GIC distributor, which has many incoming lines. 16:36.000 --> 16:40.000 So these are wired interrupts, which ARM calls shared peripheral interrupts. 16:40.000 --> 16:44.000 There's an extension called extended shared peripheral interrupts, 16:44.000 --> 16:49.000 but you can think of all of these as being wires that go into this global interrupt controller. 16:49.000 --> 16:55.000 And the interrupt controller is set up to convert the assertion of any of those interrupt lines into an INTID 16:55.000 --> 16:59.000 and route it to one of the redistributors. 16:59.000 --> 17:03.000 So for every core (in our case we have just two cores, because I couldn't fit more on the slide) 17:03.000 --> 17:08.000 you have core A and core B, and you decide whether you want to route it this way or that way. 17:08.000 --> 17:13.000 And that gets sent as a message, so a dashed line here denotes a message being sent. 17:13.000 --> 17:16.000 That's no longer a signal; that's a memory write. 17:16.000 --> 17:21.000 And then the redistributor forwards that to the CPU core's 17:21.000 --> 17:26.000 CPU interface, where the interrupt will be delivered to software. 17:26.000 --> 17:33.000 The next question is: how do we assert that interrupt to a user-mode device driver?
17:33.000 --> 17:39.000 In the ARM case, all the interrupts have a global interrupt ID, which is a number 17:39.000 --> 17:44.000 ranging from zero all the way up to 2 to the power of 24. 17:44.000 --> 17:53.000 And we have a semaphore table, which has up to 65,536 entries, 17:53.000 --> 17:59.000 where you can use the hypercall API to attach interrupt semaphores. 17:59.000 --> 18:04.000 So if you are a device driver, and you have access to, say, interrupt 87, 18:04.000 --> 18:10.000 you can plug the interrupt semaphore for that interrupt into the semaphore table. 18:10.000 --> 18:15.000 And whenever that interrupt is asserted, NOVA will take a look at the semaphore table 18:15.000 --> 18:21.000 to see which interrupt semaphore is attached to it, and will up that interrupt semaphore. 18:21.000 --> 18:28.000 Starting with GIC version 3, there was a new component added, which is shown in the bottom-left corner. 18:28.000 --> 18:31.000 And that component is called the interrupt translation service. 18:31.000 --> 18:38.000 And the purpose of that is to scale the number of interrupts much further than wires ever could. 18:38.000 --> 18:47.000 So as I said, we can have up to 2 to the power of 16 interrupts here, but you couldn't possibly put that many wires on a small SoC. 18:47.000 --> 18:58.000 So the new type of interrupt that allows much greater scalability is called LPI, locality-specific peripheral interrupt, and these are message-signaled. 18:58.000 --> 19:07.000 So rather than asserting an interrupt line, a device sends a message to a specific register 19:07.000 --> 19:10.000 in the GIC ITS called the translate register. 19:10.000 --> 19:15.000 And then the GIC ITS sees that memory write and decodes it through several tables. 19:15.000 --> 19:19.000 So there's a device table, an interrupt translation table, and a collection table. 19:19.000 --> 19:21.000 We'll look into those in a moment.
19:21.000 --> 19:25.000 It figures out where this message-signaled interrupt needs to go, 19:25.000 --> 19:34.000 and then again converts it to an INTID, and it ripples through the hierarchy the same way a pin-based interrupt would. 19:35.000 --> 19:38.000 On x86, it is a bit different. 19:38.000 --> 19:48.000 And the goal here is for NOVA to provide the same API, or as uniform an API, to device drivers as we can, 19:48.000 --> 19:56.000 and to abstract the differences between ARM and x86 as much as possible. But we have to look at how interrupts work on x86. 19:56.000 --> 20:08.000 So we also have a wired interrupt controller, which in this case is called an IOAPIC, where pin-based interrupts are asserted via real electrical lines. 20:08.000 --> 20:13.000 And then an interrupt message is sent to the CPU cores. 20:13.000 --> 20:19.000 But in order to make that secure, there is interrupt remapping in the IOMMU that sits in the middle, 20:19.000 --> 20:28.000 and it vets the MSIs that flow from the IOAPIC to the CPU's local APIC or x2APIC. 20:28.000 --> 20:36.000 And the format of message-signaled interrupts on the left side of the IOMMU, the inbound side, is the remappable format. 20:36.000 --> 20:45.000 So every interrupt carries a source identifier, which in the x86 case is not a stream ID or a device ID; it is a bus/device/function. 20:45.000 --> 20:52.000 So it says: it's this device, on this PCI bus, with this device number and this function number. 20:52.000 --> 20:57.000 And it carries an interrupt number, which is the slot number in the interrupt remapping table. 20:57.000 --> 21:01.000 And that again has 2 to the power of 16 entries. 21:01.000 --> 21:06.000 This interrupt remapping table then translates the interrupt to a vector, 21:06.000 --> 21:13.000 because x86, for backward-compatibility reasons, does not use global INTIDs like ARMv8 does. 21:13.000 --> 21:21.000 It actually has an IDT, an interrupt descriptor table, with only 256 entries, on every core.
21:21.000 --> 21:26.000 And out of those 256 entries, 32 are already consumed by exceptions, 21:26.000 --> 21:32.000 and some are consumed by NOVA, which leaves approximately 200 entries that you can use for device interrupts. 21:32.000 --> 21:42.000 So the job of the interrupt remapping unit is to translate the incoming MSI in remappable format into an outgoing MSI in compatibility format. 21:42.000 --> 21:57.000 And the compatibility format has a destination ID, which is the APIC ID of the CPU where the interrupt is supposed to go, and a vector, which will then be used to index into the interrupt vector table on that particular CPU. 21:57.000 --> 22:08.000 So the difference between x86 and ARM is that on ARM you have a global table, whereas on x86 you have a purely core-local table, and we somehow need to deal with this. 22:08.000 --> 22:18.000 Otherwise it's the same: as soon as an interrupt arrives, it will be translated through the semaphore table of that core and will be signaled on the corresponding interrupt semaphore. 22:19.000 --> 22:25.000 So, to sum up the key differences: in both architectures you have global pin-based interrupts. 22:25.000 --> 22:30.000 You have some per-CPU local interrupts that are not exposed to device drivers. 22:30.000 --> 22:35.000 They are used by NOVA internally, like timer interrupts, performance counter interrupts, and so forth. 22:35.000 --> 22:44.000 And you have message-signaled interrupts, which in the case of ARM are translated through tables in the GIC ITS, the interrupt translation service, 22:44.000 --> 23:00.000 and they assert a global interrupt ID. In the x86 case you have message-signaled interrupts that are translated through one interrupt remapping table in the IOMMU, and they assert a per-CPU vector. 23:00.000 --> 23:10.000 So the semaphore table on ARM is indexed by the INTID, and NOVA checks that the incoming INTID actually is within the range of the table.
23:10.000 --> 23:16.000 And it checks that you can only plug an interrupt semaphore into the slot belonging to that interrupt. 23:16.000 --> 23:19.000 So the interrupt number must match the slot number. 23:19.000 --> 23:27.000 And the scalability limit is 2 to the power of 16 GSIs, because that's the capacity that you can have in the remapping table. 23:27.000 --> 23:33.000 The practical limit is actually how many bits of LPI INTID the system implements. 23:33.000 --> 23:39.000 In the x86 case, you only have these CPU-local tables with about 200 entries. 23:39.000 --> 23:46.000 So NOVA checks that the vector fits within the range of this IDT vector table. 23:46.000 --> 23:54.000 But you have a free choice to say: I want to map this global interrupt, on this particular CPU, to this vector or that vector. 23:54.000 --> 24:00.000 So the scalability limit is 2 to the power of 16 GSIs, but per segment group. 24:00.000 --> 24:07.000 If you have multiple PCI segment groups, you get twice or three times or four times 2 to the power of 16 GSIs. 24:07.000 --> 24:11.000 The practical limit, though, is 200 interrupts per CPU. 24:11.000 --> 24:16.000 So if you have just four CPUs, you're practically limited to 800 interrupts. 24:16.000 --> 24:20.000 But if you have 192 CPUs, the limit is much higher. 24:20.000 --> 24:25.000 So that's just an artifact of x86. 24:25.000 --> 24:34.000 I want to talk a little bit about the GIC interrupt translation service, and what pain an operating-system person has to go through, 24:34.000 --> 24:40.000 because hardware manufacturers try to be very flexible, and then decide not to be. 24:40.000 --> 24:47.000 So in the ARM case, a device sends an interrupt with two identifiers. 24:47.000 --> 24:53.000 It has a device ID, which is usually communicated through some sideband signals 24:53.000 --> 25:02.000 that tell the ITS it's this particular device or that particular device; that allows the ITS to tell different devices apart.
25:02.000 --> 25:11.000 And it has an event ID, which is a device-local interrupt number that gets translated to this global INTID. 25:11.000 --> 25:20.000 And the device table and the collection table are both pointed to by what ARM calls base registers. 25:20.000 --> 25:26.000 And inside an ITS, you have base registers from zero to seven. 25:26.000 --> 25:31.000 But it's not architecturally specified which base register maps what type of table. 25:31.000 --> 25:34.000 So software has to figure that out and remember it. 25:34.000 --> 25:38.000 And they only defined this in GIC version 4. 25:38.000 --> 25:43.000 But since NOVA is backward compatible with GICv3 as well, 25:43.000 --> 25:45.000 we have to discover this and remember this. 25:45.000 --> 25:49.000 So this is just unnecessary flexibility. 25:50.000 --> 25:56.000 So an interrupt arrives, and the ITS uses the device ID to look up an entry in the device table. 25:56.000 --> 26:00.000 And the device table entry points to an interrupt translation table. 26:00.000 --> 26:06.000 The event ID is used as an index into the interrupt translation table. 26:06.000 --> 26:11.000 And in there, the ITS finds the final INTID. 26:11.000 --> 26:18.000 And the interrupt translation table entry also carries a collection ID, which says which interrupt collection 26:18.000 --> 26:21.000 (you can group interrupts into collections) 26:21.000 --> 26:25.000 this interrupt belongs to. And we don't use this in NOVA too much. 26:25.000 --> 26:29.000 We effectively say every CPU has a collection. 26:29.000 --> 26:33.000 So the number of CPUs defines how many collections you have. 26:33.000 --> 26:39.000 And this allows us, for example, to set up the collection table only once at boot and to never touch it again. 26:39.000 --> 26:41.000 So now comes the really interesting part: 26:41.000 --> 26:44.000 when you want to program these base registers.
26:45.000 --> 26:52.000 Depending on how many bits of device ID you want to support, this device table can be one-level or two-level. 26:52.000 --> 27:00.000 And there's an indirect bit which you can set in the base register to say: I want a one-level table, or a two-level table. 27:00.000 --> 27:01.000 Sounds very flexible. 27:01.000 --> 27:10.000 Until you read the fine print in the manual that says: this bit is "read as zero, write ignored" for GIC implementations that only support flat tables. 27:10.000 --> 27:16.000 So in the worst case, you decide in software that you want the two-level table, and the hardware says: nope, 27:16.000 --> 27:18.000 I only support flat tables. 27:18.000 --> 27:27.000 And then there are other bits which say what the shareability of this table is: can it actually participate in coherency or not? 27:27.000 --> 27:33.000 And you may say: I want the table to be shared with all the processors in the inner-shareable domain. 27:33.000 --> 27:38.000 And then the fine print in the manual says: it is implementation defined whether this field has a fixed value 27:39.000 --> 27:50.000 or can be programmed. So again, you have to deal with all the possible cases of "oh, it's actually not shareable", and you have to do manual coherency management. 27:50.000 --> 28:02.000 And the worst part of it all is, there's a page-size field, actually two page-size bits, where you can say a table in this hierarchy is 4K, 16K, or 64K in size. 28:03.000 --> 28:11.000 And again there's some fine print that says: the GIC implementation may only support one fixed page size, and the field might be read-only. 28:11.000 --> 28:25.000 So what software actually has to do is make a guess and say: I want a two-level table with 4K pages; try to program that; and then you read back the configuration and the hardware says: nope. 28:25.000 --> 28:34.000 Okay, then I'll use 16K tables. And then the hardware says: nope.
28:34.000 --> 28:45.000 And in the worst case, you end up with: oh, you can only have a 64K table, and you can only have one level, so a flat table. 28:45.000 --> 28:59.000 And then you may end up having to concatenate 256 64K pages into one flat table, which means you have to allocate a giant 16-megabyte table, and your memory allocator needs to support that. 28:59.000 --> 29:05.000 So we have this algorithm in NOVA, and it works, but this amount of pain is interesting. 29:05.000 --> 29:30.000 So let's talk about the NOVA API. Because when you want to configure an interrupt or a device, you have to name it. And we make interrupts and devices available as capabilities, because capabilities provide for a very nice authorization model, where you can say: if I have a capability, I can name some resource, and it also conveys my access to it. 29:30.000 --> 29:45.000 So multiple of these hypercalls need to name devices or interrupts. But as you've seen, the naming between x86 and ARM is not the same: on ARM an interrupt has an INTID, on x86 an interrupt has a global system interrupt number. 29:45.000 --> 30:04.000 So the idea is that when we create a device, or when we create an interrupt semaphore, we provide the name once, and this is the part of the API which is architecture-specific. 30:04.000 --> 30:25.000 So when you create an interrupt semaphore, you either provide the name in an ARM-specific manner or in an x86-specific manner, and then you get a capability to an interrupt semaphore, which is uniform across all architectures. And likewise, when you create a device context, you provide the name for the device either in the x86 or in the ARM format, depending on which architecture you're on. 30:25.000 --> 30:35.000 And then you have a device capability, which is again generic across architectures. So it keeps the rest of the hypercall API clean, and you no longer have to worry about those differences.
30:35.000 --> 30:52.000 So the device context is what we introduced for devices, to encapsulate all this architecture-specific state in a new object. You don't really create a device, because the device is already there, but you create a management context for the device, which we call a device context. 30:52.000 --> 31:11.000 And it encodes the topology, so the device ID or stream ID in the ARM case and the PCI topology in the x86 case, but it also encodes information about which SMMU or GIC ITS is managing the DMA and interrupt translation for the device. 31:11.000 --> 31:25.000 And then you get a capability which you can use with the assign_dev or assign_int call to associate that device with a DMA space or to route an interrupt from that device. 31:25.000 --> 31:40.000 So in the interest of time I'm going to skip this slide, which talks about the NOVA releases and which features appeared in which NOVA release. Suffice it to say that we do a new release every two months, so that's six releases per year. 31:40.000 --> 31:54.000 And all the things that are listed in bold on the slide we've briefly covered in this talk. There's one thing that I want to mention at the end, and that is: when we had implemented all this stuff, we could get NOVA running quite nicely on 31:54.000 --> 31:59.000 Graviton server machines, which require all these features for scaling. 31:59.000 --> 32:07.000 But then we tried to put NOVA on a new machine that we've got, which was an NVIDIA DGX Spark, one of those NVIDIA AI machines. 32:07.000 --> 32:16.000 And unfortunately this machine doesn't have any UARTs, or no UART is connected, so you can't really capture the output. 32:16.000 --> 32:22.000 So what we had to do is we actually had to implement a framebuffer console in NOVA. 32:22.000 --> 32:30.000 This is something that I had tried to avoid for a long time, because I didn't want to do any font rendering and scrolling and blitting.
32:30.000 --> 32:34.000 But I had to do it to actually get some output on this box. 32:34.000 --> 32:44.000 And I had to implement a lot of performance tricks, but none of this code uses any architecture-specific assembler code; it's all portable C++. 32:45.000 --> 32:52.000 But it's very heavily optimized, and we can actually get very high frame rates for scrolling. 32:52.000 --> 32:57.000 So on the Intel platform we can actually get 4,500 frames per second. 32:57.000 --> 33:02.000 And even on some wimpy Raspberry Pi we still get 245 frames per second. 33:02.000 --> 33:09.000 And here you can actually see the boot output from just NOVA, with nothing running on top of it, on the DGX Spark. 33:10.000 --> 33:18.000 So if you're interested in what we needed to do to make this happen: it's just 350 lines of code, of which 250 lines are the font. 33:18.000 --> 33:24.000 So, what's next? NOVA has been open source for 15 years. 33:24.000 --> 33:29.000 And as I mentioned, we open-sourced the specs and proofs last year. 33:29.000 --> 33:37.000 Later this year we are going to make significant portions of the NOVA userland stack publicly available under an open-source license. 33:37.000 --> 33:45.000 And we are currently figuring out how we can reduce the barrier to entry such that it becomes easy for all of you to use this. 33:45.000 --> 33:49.000 So here's what we've decided we will do, but it's open for discussion. 33:49.000 --> 33:51.000 So if you have ideas, talk to us. 33:51.000 --> 33:58.000 We want to make everything configurable with Kconfig, so that you have a global config for the entire software stack. 33:58.000 --> 34:04.000 When you do this and you type make, you get one giant image which contains all the components. 34:04.000 --> 34:12.000 And you can just boot this image from your favorite bootloader, and there are three different ways of booting supported.
34:12.000 --> 34:24.000 You can either boot this as a Multiboot image, or you can boot this like a Linux kernel, the way you would boot a bzImage, or you can boot this as a UEFI executable from your UEFI shell. 34:24.000 --> 34:27.000 So the same image works in all these three different ways. 34:27.000 --> 34:30.000 So with that, my time is up. 34:30.000 --> 34:33.000 I don't know, maybe we have room for one or two questions. 34:33.000 --> 34:38.000 We have one question. 34:40.000 --> 34:41.000 I have a question. 34:41.000 --> 34:42.000 Thank you. 34:42.000 --> 34:45.000 First question, regarding interrupts. 34:45.000 --> 34:52.000 So, did you have a chance to look into the user interrupts that exist on the latest Intel CPUs? 34:52.000 --> 34:58.000 I read about it in the manual, but it's becoming available only on the latest processor generations, 34:58.000 --> 35:04.000 and since we go all the way back to the first 64-bit capable processors, 35:04.000 --> 35:08.000 it's not clear if we want to support that. 35:08.000 --> 35:10.000 Yeah. 35:16.000 --> 35:20.000 The question was: what are the firmware requirements? 35:20.000 --> 35:23.000 On x86, it's all UEFI. 35:23.000 --> 35:25.000 On ARM, we actually prefer UEFI 35:25.000 --> 35:30.000 if we can, but we can also launch from U-Boot. On ARM, 35:30.000 --> 35:34.000 we ask the firmware not to close boot services. 35:34.000 --> 35:38.000 So, you boot into NOVA with UEFI boot services enabled, 35:38.000 --> 35:41.000 and then NOVA actually closes the boot services 35:41.000 --> 35:45.000 after obtaining a memory map and taking some information from UEFI. 35:45.000 --> 35:49.000 So, for example, we use the UEFI Graphics Output Protocol 35:49.000 --> 35:53.000 to determine whether we have a framebuffer and what the resolution is. 35:54.000 --> 36:00.000 Which means we actually switch to the highest possible resolution in UEFI. 36:00.000 --> 36:05.000 So, UEFI sets up the framebuffer and the resolution for us.
36:05.000 --> 36:09.000 And we do not have to worry about EDID, display resolution, 36:09.000 --> 36:10.000 or mode switching. 36:10.000 --> 36:14.000 We effectively get a linear framebuffer and we can just write to it. 36:14.000 --> 36:19.000 We still have to do all the manual scrolling and glyph rendering and whatnot. 36:19.000 --> 36:21.000 Thank you. 36:21.000 --> 36:23.000 Thank you.
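The "switch to the highest possible resolution" step mentioned in the answer can be sketched as follows. This is a hedged illustration: GfxMode and pick_highest are hypothetical stand-ins for the mode information that the UEFI Graphics Output Protocol's QueryMode reports, not real UEFI or NOVA identifiers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a GOP mode description: just the fields a
// framebuffer console cares about.
struct GfxMode {
    uint32_t index;                // mode number to pass to SetMode
    uint32_t width, height;        // resolution in pixels
    uint32_t pixels_per_scanline;  // stride of the linear framebuffer
};

// Pick the mode with the most pixels, mirroring the "highest possible
// resolution" choice made in UEFI before boot services are closed. The
// kernel then inherits a linear framebuffer and never touches EDID or
// mode switching itself.
const GfxMode *pick_highest(const std::vector<GfxMode> &modes) {
    const GfxMode *best = nullptr;
    for (const auto &m : modes)
        if (!best || uint64_t(m.width) * m.height > uint64_t(best->width) * best->height)
            best = &m;
    return best;  // a real loader would now call GOP->SetMode(best->index)
}
```

After the firmware switches modes, the console only needs the framebuffer base, resolution, and scanline stride to do its own scrolling and glyph rendering.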