WEBVTT 00:00.000 --> 00:10.140 Hello and welcome. My name is Andrin. I'm a PhD student in the Secure and Trustworthy 00:10.140 --> 00:17.500 Systems Group at ETH Zurich. And today I'd like to introduce you to OpenCCA, a framework 00:17.500 --> 00:23.380 or a tool that we've been building to help you do research on Arm CCA on existing hardware 00:23.380 --> 00:33.100 before CCA is actually available. So quick spoiler, I brought the box, this thing here. 00:33.100 --> 00:37.940 And later in this talk we'll attempt a live demo and boot a confidential virtual machine 00:37.940 --> 00:46.260 on OpenCCA and run some GPU workloads on it. But more on this later. So the main challenge 00:46.260 --> 00:51.960 today with Arm CCA is really that for most of us there is no CCA hardware available yet that 00:51.960 --> 00:58.880 you can just buy and tinker with. The first rollouts, if you check online, are likely 00:58.880 --> 01:05.080 data center first. So if you check, you'll see that a bunch of companies, including Fujitsu, 01:05.080 --> 01:12.240 Microsoft and also likely NVIDIA, have all announced that they will provide CCA-capable CPUs 01:12.240 --> 01:20.120 for the cloud. But even then it's unclear how open, affordable and hackable those platforms 01:20.120 --> 01:28.160 will be. And so if you want to do research on CCA, you typically have two choices today. 01:28.160 --> 01:35.040 One choice is software simulation. So here, think of things like the Arm FVP or QEMU. These systems 01:35.040 --> 01:40.080 simulate the entire CCA hardware stack in software. And this is great 01:40.080 --> 01:45.600 to validate the correctness of your new research design and also great to validate the compatibility 01:45.600 --> 01:52.240 of how your code will run on the next generation of hardware. But what software simulation 01:52.240 --> 01:58.080 is not good at is that it does not tell you anything about how fast your code actually 01:58.080 --> 02:04.880 runs. 
All you know is that you run your instructions correctly, but you lack any further insights 02:04.880 --> 02:12.520 into micro-architectural effects like cycles. And so since people also care about cycles, performance 02:12.520 --> 02:18.480 and overheads, what they typically do, and this is also in the context of research, is that 02:18.480 --> 02:23.160 they take their design, which they have now run in software simulation, and transplant it 02:23.160 --> 02:32.080 to Armv8 boards, so the current iteration of Arm hardware. And the insight here is that, 02:32.080 --> 02:39.560 in many ways, CCA hardware will run similarly to how existing hardware works. So this 02:39.600 --> 02:49.600 can be used to estimate the performance overheads. But this comes with its own set of challenges, 02:49.600 --> 02:54.120 because typically these performance prototypes that people are building are not open 02:54.120 --> 03:02.560 sourced, making it difficult for others to reuse and reproduce their work, and in general 03:02.560 --> 03:08.160 this wastes a lot of engineering effort, since everyone sort of builds their own thing and 03:08.160 --> 03:15.440 then does not open source it. And so with OpenCCA, we tried to solve some of these 03:15.440 --> 03:20.640 pain points of these performance prototypes that people are building, by providing some sort 03:20.640 --> 03:28.520 of open baseline to measure and run research designs with OpenCCA on existing hardware. 03:28.520 --> 03:39.000 And so at a high level we are trying to accomplish the following goals. So first, we want 03:39.000 --> 03:45.400 to keep the changes to the Arm CCA reference stack as minimal as possible, while also ensuring 03:45.400 --> 03:50.600 correct functionality on Armv8 hardware. So this means we want to run confidential 03:50.600 --> 03:58.440 virtual machines on existing hardware. Second, this is very important. 
We cannot 03:58.440 --> 04:05.240 give the security claims of Arm CCA. This runs on non-CCA hardware, and so 04:05.240 --> 04:14.840 this is only for benchmarking and tinkering with real devices. Third, we tried to target affordable 04:14.840 --> 04:22.920 and open Armv8 boards, so the barrier to entry is low and everyone can try to replicate 04:23.240 --> 04:29.240 our setup. And fourth, we tried to focus on the reusability aspect of the framework, 04:30.120 --> 04:35.800 in the sense that we tried not to be board specific as far as this is possible, so that our 04:35.800 --> 04:44.360 work can also be ported to different boards. So now that we've seen some high-level goals, let's 04:44.360 --> 04:50.280 take a look at how we are actually building this. And for this, let's recap a bit of CCA background first. 04:51.080 --> 04:56.920 So as you probably know, before the introduction of Armv9 we had TrustZone, which divided 04:56.920 --> 05:04.120 compute into the normal world and the secure world. And now with the introduction of Arm CCA, 05:04.120 --> 05:09.960 or in particular a hardware feature called the Realm Management Extension, the architecture 05:09.960 --> 05:17.000 introduces two more worlds. So we have the realm world for CCA's version of confidential virtual 05:17.960 --> 05:23.080 machines, and we also have the root world for the most privileged firmware code. 05:24.360 --> 05:28.200 Now what's great about the Arm architecture is really that it's very explicit. 05:29.000 --> 05:35.400 So we can write firmware in code, and most things are not hidden in closed microcode. 05:37.960 --> 05:44.120 So in OpenCCA we tried to reuse the reference stack as much as we can while also keeping the changes 05:44.200 --> 05:52.280 small. But how do we actually do this? 
Since we only have the normal world and the secure world 05:52.280 --> 05:58.520 on Armv8 hardware, we emulate the realm world within the architectural normal world 05:59.480 --> 06:04.520 and the root world within the architectural secure world. 06:05.800 --> 06:11.560 And so at first glance this might seem straightforward, but things can get messy quite quickly 06:11.560 --> 06:17.000 because CCA firmware expects certain hardware features to be available, and these are clearly 06:17.000 --> 06:25.560 not present on Armv8 hardware. And so this essentially boils down to how we emulate enough 06:25.560 --> 06:31.000 of these missing hardware features in software to make the firmware believe it's actually running 06:31.000 --> 06:38.760 in a real CCA environment, while also keeping the changes small. And this essentially means 06:38.760 --> 06:46.680 that we guard missing hardware features in code and either re-implement them 06:46.680 --> 06:51.560 in software if they're strictly needed to boot confidential virtual machines, or we force 06:51.560 --> 06:58.840 disable them if they're not strictly needed. And so in our paper and also in the code we go 06:58.840 --> 07:04.360 into much more detail on how we actually do this, but let's take a look at one of these missing hardware 07:04.440 --> 07:11.560 features that are not available on this board. And so this is a hardware feature called TTST, which 07:11.560 --> 07:19.480 stands for small translation tables. It's an optimization for the MMU. If your address space is 07:19.480 --> 07:25.880 small, with this hardware feature the MMU does not have to walk as many page table levels, 07:27.160 --> 07:33.720 and so the page walk is faster. And for this let's take a quick look at the memory layout 07:33.720 --> 07:41.320 of the trusted hypervisor, so the RMM. So on Arm we have two translation table base 07:41.320 --> 07:52.920 registers. We have TTBR0 and TTBR1. 
This is similar to CR3 on x86, and the memory layout of the RMM 07:52.920 --> 08:01.560 uses TTBR0 for things that are mostly identity mapped and shared across cores, 08:04.280 --> 08:13.640 and TTBR1 for things that are not identity mapped and per CPU. Thank you. 08:14.440 --> 08:23.320 And so the insight here is that the RMM only touches a few megabytes through TTBR1, 08:23.880 --> 08:31.800 so they use this hardware feature, TTST. And so the challenge here was that this feature 08:31.800 --> 08:40.680 is not available on 8.2, so the version that we use. And so the challenge was how to find 08:41.640 --> 08:48.040 this, since it crashed and this only showed in the way the page table structure was filled. 08:50.040 --> 08:56.600 And so the workaround here is that we exploit what Armv8.2 has, which means we increase 08:56.600 --> 09:04.680 the TTBR1 region to span a larger virtual memory size, and then we can exploit what the hardware has to 09:04.680 --> 09:15.240 actually make the memory work. So in 2025 we looked into around 40 different boards for this project 09:15.240 --> 09:25.320 and we picked the RK3588 by Rockchip, in particular the Rock 5B model. It's a great 09:25.320 --> 09:31.880 SoC. It has an open EL3, so we can flash firmware code. It has good documentation and it's also 09:31.880 --> 09:39.320 somewhat affordable, with the 16 gigabyte version starting at around 250 US dollars, but I think the 4 09:39.320 --> 09:45.640 gigabyte RAM version starts at around 100. And so as I said, it's based on the 8.2 architecture, has 09:46.280 --> 09:54.040 Cortex-A76 and A55 cores, and yeah, this is also what I brought with me today. 09:55.000 --> 10:06.360 So currently we are able to boot confidential virtual machines on a stack that's maybe a year 10:06.360 --> 10:14.680 old, so it's based on TF-A version 2.11 and RMM version 0.5. 
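The TTBR1 workaround described in the talk can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the OpenCCA or RMM implementation: the function and constant names are hypothetical, and the "wanted" T1SZ value is made up for illustration. It assumes the architectural rule that the region translated via TTBRn covers 2^(64 - TnSZ) bytes, and that without FEAT_TTST the largest permitted TnSZ is 39, whereas with FEAT_TTST it is 48 for the 4 KiB and 16 KiB granules.

```python
# Illustrative sketch only; names are hypothetical and not taken from the
# RMM or OpenCCA sources.

MAX_TXSZ_WITHOUT_TTST = 39  # assumed architectural limit without FEAT_TTST
MAX_TXSZ_WITH_TTST = 48     # assumed limit with FEAT_TTST (4 KiB / 16 KiB granule)


def region_size(txsz: int) -> int:
    """Bytes of virtual address space covered by a TTBRn region with this TnSZ."""
    return 1 << (64 - txsz)


def clamp_t1sz(wanted_t1sz: int, has_ttst: bool) -> int:
    """Return the wanted T1SZ, lowered to the hardware limit if necessary.

    A smaller T1SZ means a larger region, so clamping downwards widens the
    TTBR1 span on hardware that lacks FEAT_TTST.
    """
    limit = MAX_TXSZ_WITH_TTST if has_ttst else MAX_TXSZ_WITHOUT_TTST
    return min(wanted_t1sz, limit)


def ttbr1_region_base(t1sz: int) -> int:
    """Lowest virtual address translated via TTBR1 (the region ends at 2**64 - 1)."""
    return (1 << 64) - region_size(t1sz)


if __name__ == "__main__":
    wanted = 42  # hypothetical: a 4 MiB TTBR1 region
    for has_ttst in (True, False):
        t1sz = clamp_t1sz(wanted, has_ttst)
        print(
            f"FEAT_TTST={has_ttst}: T1SZ={t1sz}, "
            f"size={region_size(t1sz) // (1024 * 1024)} MiB, "
            f"base={ttbr1_region_base(t1sz):#018x}"
        )
```

Under these assumptions, a board without FEAT_TTST ends up with a TTBR1 region of at least 32 MiB even when only a few megabytes are used, which matches the idea of spanning a larger virtual memory size so that the existing hardware can handle the translation.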
We have someone that looks into 10:14.680 --> 10:22.120 pulling in the latest changes; I think currently it's 0.8 for the RMM. In total we touched around 10:22.200 --> 10:29.400 2,500 lines of code, and we only enlighten TF-A and the RMM, so we don't need to change the 10:29.400 --> 10:37.240 guest or the host or the VMM. And of these 2,500 lines of code, a 10:37.240 --> 10:44.840 large percentage is board definitions in the RMM, and also, I think, we ported the console drivers, 10:44.840 --> 10:54.680 so the effective change is actually much smaller than this. Now the OpenCCA hardware project went 10:54.680 --> 10:59.720 through several iterations. We're now at the point where we have these stacked boxes in our 10:59.720 --> 11:07.880 lab that include a Raspberry Pi to streamline the firmware flashing and power management of the board, 11:08.200 --> 11:17.800 and it just makes working with the platform easier. Okay, so this brings me to 11:17.800 --> 11:24.200 the live demo, and the way I want to structure this is in two parts: first I will show you 11:24.200 --> 11:31.080 what the demo will show, and then I will tell you what I had to change on top of OpenCCA to make 11:31.240 --> 11:36.760 the demo work. So in the demo we'll boot a confidential virtual machine on OpenCCA, 11:36.760 --> 11:45.880 I think with one vCPU and 512 megabytes of RAM, and then we'll attach the Mali-G610, so there is 11:45.880 --> 11:54.680 an integrated GPU on the board. We'll attach that, and then in the CVM we'll start X and run some 11:54.680 --> 12:03.880 OpenGL benchmark on the GPU. And so the purpose of this demo is really to show how easy it is to 12:03.880 --> 12:13.880 prototype systems research ideas on OpenCCA. So disclaimer: this is not trusted I/O. 
On purpose, we 12:14.680 --> 12:21.720 leave the GPU MMIO hypervisor shared. This is mostly so that I don't have to change the RMM; 12:21.720 --> 12:31.080 this uses vanilla RMM APIs, and [unclear] goes to the hypervisor. And yeah, so I had to prototype 12:31.080 --> 12:37.720 a VFIO-inspired interrupt routing and then also create a stage 2 mapping for the GPU, so 12:37.720 --> 12:48.120 the driver can actually talk to the GPU in the CVM. Yeah, so there's a QR code for all the demo code 12:48.120 --> 12:57.720 that I wrote on top of OpenCCA here. All right, so now let's see if this works. 13:08.440 --> 13:25.640 So first I will mirror my screen. Okay, so what we're seeing here is that I now have UART output and 13:25.640 --> 13:32.920 I'm in the normal world in the untrusted hypervisor, so this is KVM Linux. And so as a first step 13:33.720 --> 13:43.560 I will detach the GPU, and then I will now use kvmtool to boot a realm VM 13:47.000 --> 13:53.720 with these changes that I introduced before. So we create a stage 2 mapping of the GPU MMIO and we have 13:53.720 --> 14:02.840 an interrupt router that now forwards the physical interrupts of the GPU into the CVM if they 14:02.840 --> 14:15.160 arrive. We also changed kvmtool, okay, to create a device tree entry for the GPU for the CVM. Okay, so 14:15.160 --> 14:21.560 we are now in the CVM, and we see that it did a bunch of realm management interface and 14:21.560 --> 14:30.120 realm services interface calls. Okay, let's go on. And so as the next step, now 14:30.920 --> 14:39.240 in the CVM, I will attach the Panthor driver and use the TigerVNC server to spawn X. 14:40.600 --> 14:45.480 So you can see here, since we created the mapping, Panthor, so that's the GPU driver, can actually 14:45.480 --> 14:56.920 talk to the GPU, and the TigerVNC server now exposes VNC over TCP/IP. So now on my laptop 14:58.120 --> 15:11.320 I can connect to the CVM over VNC. So now we are in X in the realm VM and 
I start some 15:11.320 --> 15:20.760 OpenGL demo for testing. If we now take a look at /proc/interrupts, so this does 15:21.480 --> 15:30.920 cat on /proc/interrupts, we see that the Panthor driver receives interrupts to submit or complete 15:30.920 --> 15:36.680 GPU rendering jobs. And so if I now hide this window, we see that it no longer 15:36.840 --> 15:44.040 receives interrupts, since the GPU is no longer used for this OpenGL rendering. 15:45.720 --> 15:52.760 And so now we are in the CVM. This runs a 6.12 kernel with, I think, CCA 15:52.760 --> 16:01.880 guest enlightenment version 7, and in dmesg we'll see that the RMM exposes 16:01.880 --> 16:05.880 the realm services interface version one to the guest. 16:10.120 --> 16:13.000 All right. 16:13.000 --> 16:42.000 And so this concludes my talk. So if you work with Arm CCA and are currently constrained by some of the limitations that software simulation gives 16:42.000 --> 16:50.000 you, or want to tinker with real devices in the context of Arm CCA, 16:50.000 --> 16:58.000 please check out OpenCCA. We have documentation on how to build one of these boxes online, 16:58.000 --> 17:05.000 and our forks of the upstream repos are on GitHub. We also have an academic publication that goes into more detail on 17:05.000 --> 17:13.000 how we built this, and if you're excited about this, please reach out. So thank you very much.