WEBVTT 00:00.000 --> 00:11.760 Okay, so keeping in line with this idea of various types of inference frameworks and what 00:11.760 --> 00:19.440 you can do, for the next talk we're going to the somewhat extreme low end of compute capabilities 00:19.440 --> 00:25.320 that can still do a lot of interesting things in AI and ML. 00:25.360 --> 00:29.640 With that, I would like to introduce Anastasia and Anastasia, who will be talking about 00:29.640 --> 00:33.480 TinyML, so take it away, guys. 00:33.480 --> 00:37.200 Thanks so much for the introduction. 00:37.200 --> 00:46.120 So I'm Anastasia, and along with my colleague Anastasia, we're going to talk about how we balance 00:46.120 --> 00:55.280 accuracy and inference latency on really resource-constrained boards, 00:55.280 --> 01:04.760 IoT boards like the ESP32, where we want to run ML inference. 01:04.760 --> 01:11.400 So, a bit of information about us: we're a really small team, we do research, we focus 01:11.400 --> 01:22.480 on hardware abstractions, we do low-level OS work, we have gone into the 01:22.480 --> 01:28.800 cloud-native space, we do containers and container runtimes, and we bridge all these things 01:28.800 --> 01:36.480 into a coherent end-to-end ecosystem, essentially. 01:36.480 --> 01:49.160 Of course, this work was also done by colleagues who no longer work with us. 01:50.120 --> 01:58.760 First, let's see why we think the issue we tackle is important.
01:58.760 --> 02:08.000 IoT devices are everywhere, there are sensors, they have some compute capability, but unfortunately 02:08.080 --> 02:22.960 we cannot run large models, we cannot run accurate inference with their compute 02:22.960 --> 02:32.440 capabilities. The reason for this issue is that we don't have so much memory, we don't 02:32.520 --> 02:42.880 have such compute capabilities, and of course there are issues with network latency when getting 02:42.880 --> 02:52.280 data from the sensor to a more compute-capable infrastructure. 02:52.280 --> 03:00.600 What happens at the moment is that we need to shrink the ML models that will run on IoT 03:00.680 --> 03:11.360 devices, and we do that using various methods: we quantize, we prune, we do a lot 03:11.360 --> 03:19.560 of stuff that actually hurts the accuracy of these models. If we don't do that, then we 03:19.560 --> 03:28.560 get a really increased latency. So there is a trade-off that we need to consider between the 03:28.640 --> 03:35.000 accuracy of the inference that we run on these devices and the response latency that we 03:35.000 --> 03:41.720 want. If we want real time, we cannot be accurate; if we want to be accurate, we cannot 03:41.720 --> 03:55.280 get real-time responses, in general. So, as we will see in one of the next 03:55.280 --> 04:05.400 slides as well, we consider IoT devices to be things like ESP32 devices, microcontrollers; we don't consider 04:05.400 --> 04:18.880 a Raspberry Pi an IoT device, we will consider that an edge device. In this landscape, let's 04:18.880 --> 04:29.120 say on the IoT device, we need some kind of compute capability to be able to process 04:29.120 --> 04:36.600 the input, and we consider using edge devices like a Raspberry Pi, or a Jetson 04:36.600 --> 04:43.960 with a GPU, or even a cloud device, to be able to do this actual computation.
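The accuracy cost of shrinking models that the speakers mention can be sketched numerically. Below is a minimal, self-contained illustration of affine int8 post-training quantization (an illustrative sketch, not the TFLite implementation the talk implicitly relies on): a float range is mapped onto 256 integer levels, so small values inside a wide activation range get rounded away.

```python
# Illustrative affine int8 quantization: map floats in [xmin, xmax] onto
# [-128, 127] with a scale and zero-point, then map back. The round-trip
# error is what hurts accuracy on microcontrollers.

def quantize_params(xmin, xmax, qmin=-128, qmax=127):
    """Derive scale and zero-point from an observed float range."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# A wide activation range quantizes coarsely: an 8.0-wide range shares
# only 256 representable values, so a small signal is lost entirely.
scale, zp = quantize_params(-4.0, 4.0)
x = 0.01                                         # a small signal
xq = dequantize(quantize(x, scale, zp), scale, zp)
err = abs(x - xq)                                # here the signal rounds to zero
```

This is exactly the failure mode discussed later for squeeze-and-excitation blocks, where a small scaling signal is quantized and the error is amplified across feature maps.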
Of course, 04:43.960 --> 04:51.600 in order to orchestrate all this deployment, let's say software deployment, and do the 04:51.600 --> 04:57.840 actual inference and get the results that we want, we need to consider the security 04:57.840 --> 05:06.480 aspects as well: is this device legit? Does it feed me data that I trust? Can I trust 05:06.720 --> 05:17.600 the other device that's running the actual inference? It's kind of a mess. I think that's 05:17.600 --> 05:26.560 what I kind of said. What we want, essentially, is to not change the application 05:26.560 --> 05:33.600 code that we deploy, whether it is on the IoT device, on the ESP32, or on the 05:33.600 --> 05:41.640 Raspberry Pi, on the Jetson, wherever; we want it to be secure by default. We want to be able 05:41.640 --> 05:50.160 to run in the whole continuum. It's a buzzword, but what we mean is that we want to 05:50.160 --> 05:56.000 be able to be cloud native, so deploy something and make sure that this can run on a cloud 05:56.000 --> 06:06.800 server, on a Raspberry Pi, on an IoT device, wherever, and we want to be able to run inference 06:06.800 --> 06:18.160 efficiently on the infrastructure that we have available. What we actually built for that 06:18.160 --> 06:28.160 is an ML IoT framework. This is a large framework; in this talk we're going to talk only about the 06:28.160 --> 06:35.880 inference offloading, but just to give you an idea: we securely onboard devices using this 06:35.880 --> 06:43.600 framework, we have a mechanism to attest both the device itself and the application that 06:43.600 --> 06:52.680 runs on this device, both with secure boot and with EATs (Entity Attestation Tokens), to make sure that we run 06:52.680 --> 06:58.800 what we want to run.
We package all this, all the firmware and the application that runs 06:58.800 --> 07:06.440 on an edge device, let's say, in an OCI image, so it's a container-based thing, and we 07:06.440 --> 07:14.760 have this framework, vAccel, which we use to transparently offload compute-intensive tasks 07:14.760 --> 07:24.440 like image inference to a node that is able to run these kinds of models. I'm not 07:24.440 --> 07:31.000 going to bore you with the cloud-native framework, I'm just going to say we use DICE 07:31.000 --> 07:38.440 for the device identification, to generate a certificate based on a unique device secret; 07:38.440 --> 07:45.640 some of the ESP32-based devices do support that, so we are able to make sure that the device 07:45.640 --> 07:53.240 is legit and we can onboard it into our cluster, into our trusted cluster. We use EATs 07:53.240 --> 08:06.200 to make sure that the application that runs on the device is legit, and using this cloud-native 08:06.200 --> 08:14.760 framework, where we built an open-source component called Akri to identify the devices, 08:14.760 --> 08:24.840 to do the inventory, we end up with a cluster of IoT devices, edge devices and 08:24.840 --> 08:31.800 cloud nodes, and we are able to run applications there and make them communicate with each 08:31.800 --> 08:40.360 other. This is a rough sketch of the architecture of the cloud-native framework. I'm not going 08:40.360 --> 08:49.320 to bore you with the details, just that we use Akri for the inventory, so we discover these 08:49.320 --> 08:59.640 devices, we onboard them using DICE and EATs, and then we can deploy over-the-air updates on 08:59.640 --> 09:09.880 these devices, so we can repurpose them purely in a cloud-native way.
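The DICE idea mentioned above (an identity derived from a unique device secret) can be sketched in a few lines. This is a hedged, one-layer illustration of the concept, not the TCG DICE specification or the speakers' implementation; the variable names and key sizes are assumptions.

```python
# Hedged sketch of one DICE layer: a compound identifier is derived from a
# Unique Device Secret (UDS) and a measurement (hash) of the firmware, so
# the identity changes whenever the firmware changes. Illustrative only --
# real DICE chains layers and feeds the result into certificate generation.
import hashlib
import hmac

def derive_cdi(uds: bytes, firmware: bytes) -> bytes:
    """Compound identifier = HMAC(UDS, SHA-256(firmware))."""
    measurement = hashlib.sha256(firmware).digest()
    return hmac.new(uds, measurement, hashlib.sha256).digest()

uds = b"per-device-secret-in-efuse"     # hypothetical burned-in secret
cdi_v1 = derive_cdi(uds, b"firmware v1")
cdi_v2 = derive_cdi(uds, b"firmware v2")
# Tampered or updated firmware yields a different identity, which is what
# lets the onboarding flow reject a device running unexpected code.
```

In the talk's flow, the certificate generated from such a derived identity is what lets the cluster decide the device is "legit" before onboarding it.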
We consider, as I 09:09.880 --> 09:16.600 mentioned earlier, these kinds of devices in the continuum: so we have the ESP32 as the IoT 09:16.600 --> 09:23.880 endpoint, we have the Raspberry Pi and the Jetson with a GPU as the edge devices, 09:23.880 --> 09:32.680 and the cloud server with a GPU, a desktop-class GPU. Now, the framework that we use for the API 09:32.760 --> 09:39.640 remoting operation is something that we have been building for quite some time now; 09:39.640 --> 09:47.320 we call it vAccel. We started developing it for GPU sharing in VMs, and we ended up using it in many 09:47.320 --> 09:55.560 ways. The core concept is that the application consumes a high-level API, so 09:55.560 --> 10:04.840 image inference, torch-load, torch-run, or TensorFlow load, or session-create and run, and stuff 10:04.840 --> 10:12.200 like that, and we match these API operations to the underlying plugins. So we have a hardware-specific 10:12.200 --> 10:20.600 plugin that could implement torch-run as the actual torch run; we have a hardware-based plugin 10:20.600 --> 10:27.480 that implements image inference using the Jetson inference framework, or using a Torch 10:27.480 --> 10:33.080 implementation to do image inference, and all that stuff. What is really interesting about this 10:33.080 --> 10:43.160 framework is that we can pack the actual operation into a small number of bytes and forward it 10:43.160 --> 10:50.120 to another instance of a vAccel application that has access to an accelerator, so we can 10:50.120 --> 11:00.760 do that completely remotely.
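The dispatch pattern described above (one high-level operation, many backend plugins) can be sketched as follows. The class and backend names here are illustrative assumptions, not the real vAccel API.

```python
# Sketch of an operation-to-plugin dispatch table: the application calls
# one high-level operation, and a registered plugin supplies the
# hardware-specific implementation (names are invented for illustration).

class ImageInference:
    _plugins = {}

    @classmethod
    def register(cls, name, fn):
        """Attach a backend implementation under a plugin name."""
        cls._plugins[name] = fn

    @classmethod
    def run(cls, backend, image_bytes):
        """Application-facing call: same signature for every backend."""
        return cls._plugins[backend](image_bytes)

# Two hypothetical backends exposing the same operation.
ImageInference.register("cpu-torch", lambda img: ("torch", len(img)))
ImageInference.register("jetson",    lambda img: ("tensorrt", len(img)))

# The application code is identical whichever backend serves the call.
result = ImageInference.run("jetson", b"raw-image")
```

The point the speakers make is exactly this decoupling: the caller never changes, only the plugin bound to the operation does.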
This is what we actually leverage here, both from ESP32 devices 11:00.760 --> 11:10.840 and from the Raspberry Pi: they don't have GPUs, but we can forward whatever we want to run 11:10.840 --> 11:19.080 on a GPU to a Jetson, or to a GPU in a cloud server, and get the result back, all with the 11:19.080 --> 11:27.080 same API. And that's what we consider important: we don't care where the actual 11:27.080 --> 11:36.520 inference runs, as long as we get our SLA, as long as we get low latency where we want it or 11:36.520 --> 11:44.520 more accuracy where we want it. A bit of information on how this RPC thing works: 11:44.520 --> 11:57.080 there's the actual vAccel API on top, and when we specify that an operation is going to run 11:57.080 --> 12:05.720 using the RPC plugin, we forward it using gRPC (we'll see more information about that in 12:05.720 --> 12:12.200 the next slide) to another vAccel application, which we call the vAccel agent, and this application 12:12.200 --> 12:22.760 essentially runs on the device with the actual hardware capabilities. Now, our initial implementation 12:22.760 --> 12:31.320 of the vAccel transport layer was with ttrpc in Rust, which is a lightweight version of gRPC, 12:31.320 --> 12:41.640 to be able to do things efficiently. However, in the special case of IoT devices, this was still a bit heavy, 12:42.200 --> 12:54.520 so what we did is slim the transport down to make sure that we can run lightweight remote operations 12:54.520 --> 13:04.760 from the IoT devices to a more capable node.
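Packing an operation "into a small number of bytes" for the agent, as described above, amounts to a length-prefixed serialization. The wire format below is invented for illustration; the real transport is ttrpc/gRPC with its own protobuf encoding.

```python
# Hedged sketch of serializing an operation for remote execution: a
# length-prefixed JSON header naming the operation, followed by the raw
# payload. The agent side unpacks and dispatches to real hardware.
import json
import struct

def pack_op(op_name: str, payload: bytes) -> bytes:
    """Client side: [4-byte header length][JSON header][payload]."""
    header = json.dumps({"op": op_name, "len": len(payload)}).encode()
    return struct.pack("!I", len(header)) + header + payload

def unpack_op(msg: bytes):
    """Agent side: recover the operation name and its payload."""
    (hlen,) = struct.unpack("!I", msg[:4])
    header = json.loads(msg[4:4 + hlen])
    payload = msg[4 + hlen:4 + hlen + header["len"]]
    return header["op"], payload

# Round trip: what the ESP32 would send, and what the agent would see.
msg = pack_op("image_inference", b"raw-jpeg-bytes")
op, payload = unpack_op(msg)
```

Keeping the header this small is the kind of saving that matters on a microcontroller, which is why the speakers found even ttrpc "still a bit heavy".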
What we tested — so we're going to show you some numbers, 13:04.760 --> 13:11.160 and Anastasia is going to elaborate on how we took them. What we did: we used the devices 13:11.160 --> 13:17.240 that I mentioned earlier, so the ESP32, the Raspberry Pi, the edge device and the cloud server, 13:17.240 --> 13:23.320 and we offloaded from the IoT device, from the Raspberry Pi and from the edge device to the cloud server, 13:24.040 --> 13:32.280 and we measured the accuracy-versus-latency trade-off, as well as the accuracy not only 13:32.280 --> 13:40.040 in terms of the model itself, but in terms of how many images we missed. Anastasia is going to 13:40.280 --> 13:52.840 elaborate more on that. Hello everyone. So for each model, we captured some metrics: we evaluated 13:52.840 --> 13:59.800 15 images from a MobileNet test set, we of course recorded the inference latency and 13:59.800 --> 14:05.080 end-to-end latency, the top-one and top-three agreement with the high-accuracy model — the model we took 14:05.160 --> 14:14.120 as the reference model is MobileNetV3-Large on an x86 — and we also computed an efficiency metric, 14:14.120 --> 14:19.640 which I'll talk about later, and the confidence deviation, which is the difference 14:19.640 --> 14:29.160 between the predicted class's softmax score produced by the reference model and by the quantized model. 14:29.240 --> 14:36.520 So here we see the key results that we got on the local versus offloaded implementation; 14:36.520 --> 14:41.320 note that we also tried pruning and ahead-of-time compilation, which produces a shared 14:41.320 --> 14:47.720 object and not a graph, which is better. We actually saw that the latency drops to approximately 14:47.720 --> 14:55.480 47 milliseconds for our best offloaded implementation, which was the MobileNetV3 model.
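The agreement and confidence-deviation metrics just described can be computed with a few small helpers. This is a sketch of the definitions as stated in the talk, not the speakers' evaluation harness, and the softmax vectors below are made-up examples.

```python
# Metrics from the talk: top-1/top-3 agreement of a quantized model with a
# reference model, and confidence deviation (difference in the softmax
# score of the reference's predicted class). Illustrative helpers only.

def topk(scores, k):
    """Indices of the k highest-scoring classes."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def top1_agreement(ref_scores, quant_scores):
    return topk(ref_scores, 1)[0] == topk(quant_scores, 1)[0]

def top3_agreement(ref_scores, quant_scores):
    return topk(ref_scores, 1)[0] in topk(quant_scores, 3)

def confidence_deviation(ref_scores, quant_scores):
    c = topk(ref_scores, 1)[0]          # class predicted by the reference
    return abs(ref_scores[c] - quant_scores[c])

ref   = [0.05, 0.80, 0.10, 0.05]        # made-up reference softmax
quant = [0.10, 0.55, 0.30, 0.05]        # made-up quantized-model softmax
```

With these example vectors, the quantized model still agrees on the class but its confidence in it has dropped by 0.25, which is the kind of degradation the deviation metric is meant to surface.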
14:56.280 --> 15:04.200 Yes, one thing that we learned along the way is that the post-training 15:04.200 --> 15:11.320 quantization that we use can be very unstable for some architectures. MobileNetV3 15:11.320 --> 15:18.200 is a good example: it uses squeeze-and-excitation blocks, which take a very small signal 15:18.200 --> 15:23.560 and use it to scale whole feature maps, and when that signal is quantized, numerical errors 15:23.640 --> 15:30.040 are amplified across the network. This is made worse by the hard-swish activations, which have very 15:30.040 --> 15:36.120 narrow ranges, and when we quantize, they're flattened. So in the figure on the right, many local 15:36.120 --> 15:42.040 microcontroller runs either miss the correct class entirely or predict the wrong one. 15:45.480 --> 15:50.600 This plot actually seems a bit complex due to the many different configurations 15:50.680 --> 15:55.880 we tried, but to make it more intuitive: the model runs on the lower right are the slow and 15:55.880 --> 16:02.360 inaccurate ones, then we have the slow and accurate on the upper right, and we finally see the both fast and 16:02.360 --> 16:10.920 accurate model runs on the upper left. The plot is for MobileNetV3, which had a high quantization 16:10.920 --> 16:19.160 error, as I said, and we actually tried to dispatch some layers to 16-bit integers, but we still 16:19.240 --> 16:26.040 had very inaccurate results. If anyone wants to recommend something for post-training quantization of this model, 16:26.040 --> 16:32.440 please feel free to contribute. Now, something that we wanted to show is a metric we derived, 16:34.040 --> 16:40.760 and we think it's very important for edge devices: the correct predictions per second. 16:40.760 --> 16:46.040 We want these devices to be as responsive as we can, but we don't want them to output garbage.
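The derived metric just introduced combines accuracy and responsiveness in one number. A minimal sketch, with made-up numbers chosen only to mimic the roughly two-times improvement reported for the ESP32 (these are not the measured results):

```python
# Correct predictions per second: accuracy times throughput. A fast model
# that outputs garbage and an accurate model that is too slow both score
# poorly. All figures below are invented for illustration.

def correct_predictions_per_second(accuracy, latency_ms):
    """accuracy is a fraction in [0, 1]; latency is per-inference, in ms."""
    return accuracy * 1000.0 / latency_ms

local   = correct_predictions_per_second(0.60, 250.0)  # local: slow, degraded
offload = correct_predictions_per_second(0.90, 180.0)  # remote: network cost,
                                                       # but full-precision model
speedup = offload / local                              # roughly a 2x improvement
```

The design choice is that dividing by latency penalizes slow local inference while multiplying by accuracy penalizes over-quantized models, so neither extreme of the trade-off can win by itself.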
16:47.000 --> 16:53.000 So for the ESP32 offloading, we're getting approximately a two times improvement, 16:53.000 --> 16:57.480 while for the Raspberry Pi and the Jetson, we get a three to five times improvement. 16:59.320 --> 17:04.840 So this is also a graph that shows the efficiency — of course, the higher the better — and the 17:04.840 --> 17:09.880 bar plot here proves that remote offloading is the winner. 17:10.840 --> 17:18.920 So what we learned: we learned that local ML on microcontrollers is fragile. We have severe 17:18.920 --> 17:25.320 memory constraints. Quantization can break semantic classes, even with mixed precision. 17:25.880 --> 17:31.880 And of course, we also showed that the network overhead is outweighed by the latency and accuracy gains. 17:32.040 --> 17:45.240 Thank you. Thanks so much. So, of course, the whole framework is open source. 17:46.360 --> 17:52.360 We believe it makes sense because we can actually do real deployments using it. 17:52.360 --> 18:01.400 We can scale the actual inference to many nodes. We can reuse hardware, in the sense that we 18:01.480 --> 18:09.000 can repurpose the actual device for a larger model, or a smaller model, or a completely different 18:09.720 --> 18:16.840 application, and using this vAccel framework, we don't have to rewrite the 18:16.840 --> 18:25.800 application. So, using the same API, inference can run locally or remotely, completely seamlessly. 18:26.120 --> 18:33.160 We have sent this work to a peer-reviewed scientific conference, and we're going 18:33.160 --> 18:39.960 to present it in Florence this May. So if you want more information about how we got the 18:39.960 --> 18:49.240 numbers and what we did, we're going to share the link to the actual 18:49.320 --> 18:57.080 paper in the slides that we will upload. Thanks very much for listening.
If we still have time — I think, if we have a couple 18:57.080 --> 19:00.520 of minutes, I can show a demo of this running. 19:19.240 --> 19:39.480 Okay, so this is the actual device. 19:42.280 --> 19:44.280 Yeah, it's okay. 19:49.240 --> 20:07.880 I'm going to show it regardless. So this is the local inference, and these are the 20:07.880 --> 20:15.240 seconds, this is the classification ID, the confidence score; this is the local run on the ESP. 20:15.240 --> 20:23.480 This is the local run on the ESP, and I issue a config command to say that I want this offloaded to this 20:23.480 --> 20:31.960 inference server, which is running in the background with this model. So I do that, and 20:31.960 --> 20:35.320 inference is like 50 milliseconds, 70 milliseconds, I'm going to tell you. 20:45.560 --> 20:53.160 [Audience, partly inaudible] Is the Jetson alone good enough for that, or is it the latency, 20:53.160 --> 20:59.880 or the overhead? — The Jetson is great for that, but on the ESP it was like that. 21:00.440 --> 21:04.920 It consumes like a hundredth of the power. 21:05.000 --> 21:11.000 Yes, but I don't know, but I suppose the Jetson [inaudible]. 21:11.000 --> 21:12.760 Correct, yes, yes, yes. 21:12.760 --> 21:15.560 But does it use less power in that way? 21:15.560 --> 21:24.360 For the actual ESP, yes. For the Jetson — imagine if you have [inaudible], 21:24.360 --> 21:30.840 a logistics-space environment, you have two directions, and a lot of these things moving around. 21:30.840 --> 21:32.840 So you don't have to put it down to the bottom.