WEBVTT 00:00.000 --> 00:11.760 Okay, so keeping in line with this idea of various types of inference frameworks and what 00:11.760 --> 00:19.440 you can do, for the next talk we're going to the somewhat extreme low end of compute capabilities 00:19.440 --> 00:25.320 that can still do a lot of interesting things in AI and ML. 00:25.360 --> 00:29.640 With that, I would like to introduce Anastasia and Anastasia, who will be talking about 00:29.640 --> 00:33.480 TinyML, so take it away, guys. 00:33.480 --> 00:37.200 Thanks so much for the introduction. 00:37.200 --> 00:46.120 So I'm Anastasia, and along with my colleague Anastasia, we're going to talk about how we balance 00:46.120 --> 00:55.280 accuracy and inference latency on really resource-constrained boards, 00:55.280 --> 01:04.760 IoT boards like the ESP32, where we want to run ML inference. 01:04.760 --> 01:11.400 So, a bit of information about us: we're a really small team, we do research, we focus 01:11.400 --> 01:22.480 on hardware abstractions, we do low-level OS work, we have gone into the 01:22.480 --> 01:28.800 cloud-native space, we do containers and container runtimes, and we bridge all these things 01:28.800 --> 01:36.480 into a coherent end-to-end ecosystem, essentially. 01:36.480 --> 01:49.160 Of course, this work was also done by colleagues who no longer work with us. 01:50.120 --> 01:58.760 First, let's see why we think the issue we tackle is important.
01:58.760 --> 02:08.000 IoT devices are everywhere, there are sensors, they have some compute capability, but unfortunately 02:08.080 --> 02:22.960 we cannot run large models, we cannot run accurate inference with their compute 02:22.960 --> 02:32.440 capabilities. The reason for this issue is that we don't have so much memory, we don't 02:32.520 --> 02:42.880 have such compute capabilities, and of course there are issues with network latency when getting 02:42.880 --> 02:52.280 data from the sensor to a more compute-capable infrastructure. 02:52.280 --> 03:00.600 What happens at the moment is that we need to shrink the ML models that will run on IoT 03:00.680 --> 03:11.360 devices, and we do that using various methods: we quantize, we prune, we do a lot 03:11.360 --> 03:19.560 of stuff that actually hurts the accuracy of these models. If we don't do that, then we 03:19.560 --> 03:28.560 get a really increased latency. So there is a trade-off that we need to consider between the 03:28.640 --> 03:35.000 accuracy of the inference that we run on these devices and the response latency that we 03:35.000 --> 03:41.720 want. If we want real time, we cannot be accurate; if we want to be accurate, we cannot 03:41.720 --> 03:55.280 get real-time responses, in general. So, as we will see in one of the next 03:55.280 --> 04:05.400 slides as well, we consider IoT devices to be things like ESP32 devices, microcontrollers; we don't consider 04:05.400 --> 04:18.880 a Raspberry Pi an IoT device, we will consider that an edge device. In this landscape, let's 04:18.880 --> 04:29.120 say on the IoT device, we need some kind of compute capability to be able to process 04:29.120 --> 04:36.600 the input, and we consider using edge devices like a Raspberry Pi, or a Jetson 04:36.600 --> 04:43.960 with a GPU, or even a cloud device, to be able to do this actual computation.
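The accuracy cost of shrinking models that the speakers mention can be sketched numerically. Below is a minimal, self-contained illustration of affine int8 post-training quantization (an illustrative sketch, not the TFLite implementation the talk implicitly relies on): a float range is mapped onto 256 integer levels, so small values inside a wide activation range get rounded away.

```python
# Illustrative affine int8 quantization: map floats in [xmin, xmax] onto
# [-128, 127] with a scale and zero-point, then map back. The round-trip
# error is what hurts accuracy on microcontrollers.

def quantize_params(xmin, xmax, qmin=-128, qmax=127):
    """Derive scale and zero-point from an observed float range."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# A wide activation range quantizes coarsely: an 8.0-wide range shares
# only 256 representable values, so a small signal is lost entirely.
scale, zp = quantize_params(-4.0, 4.0)
x = 0.01                                         # a small signal
xq = dequantize(quantize(x, scale, zp), scale, zp)
err = abs(x - xq)                                # here the signal rounds to zero
```

This is exactly the failure mode discussed later for squeeze-and-excitation blocks, where a small scaling signal is quantized and the error is amplified across feature maps.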
Of course, 04:43.960 --> 04:51.600 in order to orchestrate all this deployment, let's say software deployment, and do the 04:51.600 --> 04:57.840 actual inference and get the results that we want, we need to consider the security 04:57.840 --> 05:06.480 aspects as well: is this device legit? Does it feed me data that I trust? Can I trust 05:06.720 --> 05:17.600 the other device that's running the actual inference? It's kind of a mess. I think that's 05:17.600 --> 05:26.560 what I kind of said. What we want, essentially, is to not change the application 05:26.560 --> 05:33.600 code that we deploy, whether it is on the IoT device, on the ESP32, or on the 05:33.600 --> 05:41.640 Raspberry Pi, on the Jetson, wherever; we want it to be secure by default. We want to be able 05:41.640 --> 05:50.160 to run in the whole continuum. It's a buzzword, but what we mean is that we want to 05:50.160 --> 05:56.000 be able to be cloud native, so deploy something and make sure that this can run on a cloud 05:56.000 --> 06:06.800 server, on a Raspberry Pi, on an IoT device, wherever, and we want to be able to run inference 06:06.800 --> 06:18.160 efficiently on the infrastructure that we have available. What we actually built for that 06:18.160 --> 06:28.160 is an ML IoT framework. This is a large framework; in this talk we're going to talk only about the 06:28.160 --> 06:35.880 inference offloading, but just to give you an idea: we securely onboard devices using this 06:35.880 --> 06:43.600 framework, we have a mechanism to attest both the device itself and the application that 06:43.600 --> 06:52.680 runs on this device, both with secure boot and with EATs (Entity Attestation Tokens), to make sure that we run 06:52.680 --> 06:58.800 what we want to run.
We package all this, all the firmware and the application that runs 06:58.800 --> 07:06.440 on an edge device, let's say, in an OCI image, so it's a container-based thing, and we 07:06.440 --> 07:14.760 have this framework, vAccel, which we use to transparently offload compute-intensive tasks 07:14.760 --> 07:24.440 like image inference to a node that is able to run these kinds of models. I'm not 07:24.440 --> 07:31.000 going to bore you with the cloud-native framework, I'm just going to say we use DICE 07:31.000 --> 07:38.440 for the device identification, to generate a certificate based on a unique device secret; 07:38.440 --> 07:45.640 some of the ESP32-based devices do support that, so we are able to make sure that the device 07:45.640 --> 07:53.240 is legit and we can onboard it into our cluster, into our trusted cluster. We use EATs 07:53.240 --> 08:06.200 to make sure that the application that runs on the device is legit, and using this cloud-native 08:06.200 --> 08:14.760 framework, where we built an open-source component called Akri to identify the devices, 08:14.760 --> 08:24.840 to do the inventory, we end up with a cluster of IoT devices, edge devices and 08:24.840 --> 08:31.800 cloud nodes, and we are able to run applications there and make them communicate with each 08:31.800 --> 08:40.360 other. This is a rough sketch of the architecture of the cloud-native framework. I'm not going 08:40.360 --> 08:49.320 to bore you with the details, just that we use Akri for the inventory, so we discover these 08:49.320 --> 08:59.640 devices, we onboard them using DICE and EATs, and then we can deploy over-the-air updates on 08:59.640 --> 09:09.880 these devices, so we can repurpose them purely in a cloud-native way.
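The DICE idea mentioned above (an identity derived from a unique device secret) can be sketched in a few lines. This is a hedged, one-layer illustration of the concept, not the TCG DICE specification or the speakers' implementation; the variable names and key sizes are assumptions.

```python
# Hedged sketch of one DICE layer: a compound identifier is derived from a
# Unique Device Secret (UDS) and a measurement (hash) of the firmware, so
# the identity changes whenever the firmware changes. Illustrative only --
# real DICE chains layers and feeds the result into certificate generation.
import hashlib
import hmac

def derive_cdi(uds: bytes, firmware: bytes) -> bytes:
    """Compound identifier = HMAC(UDS, SHA-256(firmware))."""
    measurement = hashlib.sha256(firmware).digest()
    return hmac.new(uds, measurement, hashlib.sha256).digest()

uds = b"per-device-secret-in-efuse"     # hypothetical burned-in secret
cdi_v1 = derive_cdi(uds, b"firmware v1")
cdi_v2 = derive_cdi(uds, b"firmware v2")
# Tampered or updated firmware yields a different identity, which is what
# lets the onboarding flow reject a device running unexpected code.
```

In the talk's flow, the certificate generated from such a derived identity is what lets the cluster decide the device is "legit" before onboarding it.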
We consider, as I 09:09.880 --> 09:16.600 mentioned earlier, these kinds of devices in the continuum: so we have the ESP32 as the IoT 09:16.600 --> 09:23.880 endpoint, we have the Raspberry Pi and the Jetson with a GPU as the edge devices, 09:23.880 --> 09:32.680 and the cloud server with a GPU, a desktop-class GPU. Now, the framework that we use for the API 09:32.760 --> 09:39.640 remoting operation is something that we have been building for quite some time now; 09:39.640 --> 09:47.320 we call it vAccel. We started developing it for GPU sharing in VMs, and we ended up using it in many 09:47.320 --> 09:55.560 ways. The core concept is that the application consumes a high-level API, so 09:55.560 --> 10:04.840 image inference, torch-load, torch-run, or TensorFlow load, or session-create and run, and stuff 10:04.840 --> 10:12.200 like that, and we match these API operations to the underlying plugins. So we have a hardware-specific 10:12.200 --> 10:20.600 plugin that could implement torch-run as the actual torch run; we have a hardware-based plugin 10:20.600 --> 10:27.480 that implements image inference using the Jetson inference framework, or using a Torch 10:27.480 --> 10:33.080 implementation to do image inference, and all that stuff. What is really interesting about this 10:33.080 --> 10:43.160 framework is that we can pack the actual operation into a small number of bytes and forward it 10:43.160 --> 10:50.120 to another instance of a vAccel application that has access to an accelerator, so we can 10:50.120 --> 11:00.760 do that completely remotely.
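The dispatch pattern described above (one high-level operation, many backend plugins) can be sketched as follows. The class and backend names here are illustrative assumptions, not the real vAccel API.

```python
# Sketch of an operation-to-plugin dispatch table: the application calls
# one high-level operation, and a registered plugin supplies the
# hardware-specific implementation (names are invented for illustration).

class ImageInference:
    _plugins = {}

    @classmethod
    def register(cls, name, fn):
        """Attach a backend implementation under a plugin name."""
        cls._plugins[name] = fn

    @classmethod
    def run(cls, backend, image_bytes):
        """Application-facing call: same signature for every backend."""
        return cls._plugins[backend](image_bytes)

# Two hypothetical backends exposing the same operation.
ImageInference.register("cpu-torch", lambda img: ("torch", len(img)))
ImageInference.register("jetson",    lambda img: ("tensorrt", len(img)))

# The application code is identical whichever backend serves the call.
result = ImageInference.run("jetson", b"raw-image")
```

The point the speakers make is exactly this decoupling: the caller never changes, only the plugin bound to the operation does.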
This is what we actually leverage here, both from ESP32 devices 11:00.760 --> 11:10.840 and from the Raspberry Pi: they don't have GPUs, but we can forward whatever we want to run 11:10.840 --> 11:19.080 on a GPU to a Jetson, or to a GPU in a cloud server, and get the result back, all with the 11:19.080 --> 11:27.080 same API. And that's what we consider important: we don't care where the actual 11:27.080 --> 11:36.520 inference runs, as long as we get our SLA, as long as we get low latency where we want it or 11:36.520 --> 11:44.520 more accuracy where we want it. A bit of information on how this RPC thing works: 11:44.520 --> 11:57.080 there's the actual vAccel API on top, and when we specify that an operation is going to run 11:57.080 --> 12:05.720 using the RPC plugin, we forward it using gRPC (we'll see more information about that in 12:05.720 --> 12:12.200 the next slide) to another vAccel application, which we call the vAccel agent, and this application 12:12.200 --> 12:22.760 essentially runs on the device with the actual hardware capabilities. Now, our initial implementation 12:22.760 --> 12:31.320 of the vAccel transport layer was with ttrpc in Rust, which is a lightweight version of gRPC, 12:31.320 --> 12:41.640 to be able to do things efficiently. However, in the special case of IoT devices, this was still a bit heavy, 12:42.200 --> 12:54.520 so what we did is slim the transport down to make sure that we can run lightweight remote operations 12:54.520 --> 13:04.760 from the IoT devices to a more capable node.
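Packing an operation "into a small number of bytes" for the agent, as described above, amounts to a length-prefixed serialization. The wire format below is invented for illustration; the real transport is ttrpc/gRPC with its own protobuf encoding.

```python
# Hedged sketch of serializing an operation for remote execution: a
# length-prefixed JSON header naming the operation, followed by the raw
# payload. The agent side unpacks and dispatches to real hardware.
import json
import struct

def pack_op(op_name: str, payload: bytes) -> bytes:
    """Client side: [4-byte header length][JSON header][payload]."""
    header = json.dumps({"op": op_name, "len": len(payload)}).encode()
    return struct.pack("!I", len(header)) + header + payload

def unpack_op(msg: bytes):
    """Agent side: recover the operation name and its payload."""
    (hlen,) = struct.unpack("!I", msg[:4])
    header = json.loads(msg[4:4 + hlen])
    payload = msg[4 + hlen:4 + hlen + header["len"]]
    return header["op"], payload

# Round trip: what the ESP32 would send, and what the agent would see.
msg = pack_op("image_inference", b"raw-jpeg-bytes")
op, payload = unpack_op(msg)
```

Keeping the header this small is the kind of saving that matters on a microcontroller, which is why the speakers found even ttrpc "still a bit heavy".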
What we tested — so we're going to show you some numbers, 13:04.760 --> 13:11.160 and Anastasia is going to elaborate on how we took them. What we did: we used the devices 13:11.160 --> 13:17.240 that I mentioned earlier, so the ESP32, the Raspberry Pi, the edge device and the cloud server, 13:17.240 --> 13:23.320 and we offloaded from the IoT device, from the Raspberry Pi and from the edge device to the cloud server, 13:24.040 --> 13:32.280 and we measured the accuracy-versus-latency trade-off, as well as the accuracy not only 13:32.280 --> 13:40.040 in terms of the model itself, but in terms of how many images we missed. Anastasia is going to 13:40.280 --> 13:52.840 elaborate more on that. Hello everyone. So for each model, we captured some metrics: we evaluated 13:52.840 --> 13:59.800 15 images from a MobileNet test set, we of course recorded the inference latency and 13:59.800 --> 14:05.080 end-to-end latency, the top-one and top-three agreement with the high-accuracy model — the model we took 14:05.160 --> 14:14.120 as the reference model is MobileNetV3-Large on an x86 — and we also computed an efficiency metric, 14:14.120 --> 14:19.640 which I'll talk about later, and the confidence deviation, which is the difference 14:19.640 --> 14:29.160 between the predicted class's softmax score produced by the reference model and by the quantized model. 14:29.240 --> 14:36.520 So here we see the key results that we got on the local versus offloaded implementation; 14:36.520 --> 14:41.320 note that we also tried pruning and ahead-of-time compilation, which produces a shared 14:41.320 --> 14:47.720 object and not a graph, which is better. We actually saw that the latency drops to approximately 14:47.720 --> 14:55.480 47 milliseconds for our best offloaded implementation, which was the MobileNetV3 model.
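The agreement and confidence-deviation metrics just described can be computed with a few small helpers. This is a sketch of the definitions as stated in the talk, not the speakers' evaluation harness, and the softmax vectors below are made-up examples.

```python
# Metrics from the talk: top-1/top-3 agreement of a quantized model with a
# reference model, and confidence deviation (difference in the softmax
# score of the reference's predicted class). Illustrative helpers only.

def topk(scores, k):
    """Indices of the k highest-scoring classes."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def top1_agreement(ref_scores, quant_scores):
    return topk(ref_scores, 1)[0] == topk(quant_scores, 1)[0]

def top3_agreement(ref_scores, quant_scores):
    return topk(ref_scores, 1)[0] in topk(quant_scores, 3)

def confidence_deviation(ref_scores, quant_scores):
    c = topk(ref_scores, 1)[0]          # class predicted by the reference
    return abs(ref_scores[c] - quant_scores[c])

ref   = [0.05, 0.80, 0.10, 0.05]        # made-up reference softmax
quant = [0.10, 0.55, 0.30, 0.05]        # made-up quantized-model softmax
```

With these example vectors, the quantized model still agrees on the class but its confidence in it has dropped by 0.25, which is the kind of degradation the deviation metric is meant to surface.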
14:56.280 --> 15:04.200 Yes, one thing that we learned along the way is that the post-training 15:04.200 --> 15:11.320 quantization that we use can be very unstable for some architectures. MobileNetV3 15:11.320 --> 15:18.200 is a good example: it uses squeeze-and-excitation blocks, which take a very small signal 15:18.200 --> 15:23.560 and use it to scale whole feature maps, and when that signal is quantized, numerical errors 15:23.640 --> 15:30.040 are amplified across the network. This is made worse by the hard-swish activations, which have very 15:30.040 --> 15:36.120 narrow ranges, and when we quantize, they're flattened. So in the figure on the right, many local 15:36.120 --> 15:42.040 microcontroller runs either miss the correct class entirely or predict the wrong one. 15:45.480 --> 15:50.600 This plot actually seems a bit complex due to the many different configurations 15:50.680 --> 15:55.880 we tried, but to make it more intuitive: the model runs on the lower right are the slow and 15:55.880 --> 16:02.360 inaccurate ones, then we have the slow and accurate on the upper right, and we finally see the both fast and 16:02.360 --> 16:10.920 accurate model runs on the upper left. The plot is for MobileNetV3, which had a high quantization 16:10.920 --> 16:19.160 error, as I said, and we actually tried to dispatch some layers to 16-bit integers, but we still 16:19.240 --> 16:26.040 had very inaccurate results. If anyone wants to recommend something for post-training quantization of this model, 16:26.040 --> 16:32.440 please feel free to contribute. Now, something that we wanted to show is a metric we derived, 16:34.040 --> 16:40.760 and we think it's very important for edge devices: the correct predictions per second. 16:40.760 --> 16:46.040 We want these devices to be as responsive as we can, but we don't want them to output garbage.
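The derived metric just introduced combines accuracy and responsiveness in one number. A minimal sketch, with made-up numbers chosen only to mimic the roughly two-times improvement reported for the ESP32 (these are not the measured results):

```python
# Correct predictions per second: accuracy times throughput. A fast model
# that outputs garbage and an accurate model that is too slow both score
# poorly. All figures below are invented for illustration.

def correct_predictions_per_second(accuracy, latency_ms):
    """accuracy is a fraction in [0, 1]; latency is per-inference, in ms."""
    return accuracy * 1000.0 / latency_ms

local   = correct_predictions_per_second(0.60, 250.0)  # local: slow, degraded
offload = correct_predictions_per_second(0.90, 180.0)  # remote: network cost,
                                                       # but full-precision model
speedup = offload / local                              # roughly a 2x improvement
```

The design choice is that dividing by latency penalizes slow local inference while multiplying by accuracy penalizes over-quantized models, so neither extreme of the trade-off can win by itself.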
16:47.000 --> 16:53.000 So for the ESP32 offloading, we're getting approximately a two times improvement, 16:53.000 --> 16:57.480 while for the Raspberry Pi and the Jetson, we get a three to five times improvement. 16:59.320 --> 17:04.840 So this is also a graph that shows the efficiency — of course, the higher the better — and the 17:04.840 --> 17:09.880 bar plot here proves that remote offloading is the winner. 17:10.840 --> 17:18.920 So what we learned: we learned that local ML on microcontrollers is fragile. We have severe 17:18.920 --> 17:25.320 memory constraints. Quantization can break semantic classes, even with mixed precision. 17:25.880 --> 17:31.880 And of course, we also showed that the network overhead is outweighed by the latency and accuracy gains. 17:32.040 --> 17:45.240 Thank you. Thanks so much. So, of course, the whole framework is open source. 17:46.360 --> 17:52.360 We believe it makes sense because we can actually do real deployments using it. 17:52.360 --> 18:01.400 We can scale the actual inference to many nodes. We can reuse hardware, in the sense that we 18:01.480 --> 18:09.000 can repurpose the actual device for a larger model, or a smaller model, or a completely different 18:09.720 --> 18:16.840 application, and using this vAccel framework, we don't have to rewrite the 18:16.840 --> 18:25.800 application. So, using the same API, inference can run locally or remotely, completely seamlessly. 18:26.120 --> 18:33.160 We have sent this work to a peer-reviewed scientific conference, and we're going 18:33.160 --> 18:39.960 to present it in Florence this May. So if you want more information about how we got the 18:39.960 --> 18:49.240 numbers and what we did, we're going to share the link to the actual 18:49.320 --> 18:57.080 paper in the slides that we will upload. Thanks very much for listening.
If we still have time — I think, if we have a couple 18:57.080 --> 19:00.520 of minutes, I can show a demo of this running. 19:19.240 --> 19:39.480 Okay, so this is the actual device. 19:42.280 --> 19:44.280 Yeah, it's okay. 19:49.240 --> 20:07.880 I'm going to show it regardless. So this is the local inference, and these are the 20:07.880 --> 20:15.240 seconds, this is the classification ID, the confidence score; this is the local run on the ESP. 20:15.240 --> 20:23.480 This is the local run on the ESP, and I issue a config command to say that I want this offloaded to this 20:23.480 --> 20:31.960 inference server, which is running in the background with this model. So I do that, and 20:31.960 --> 20:35.320 inference is like 50 milliseconds, 70 milliseconds, I'm going to tell you. 20:45.560 --> 20:53.160 [Audience, partly inaudible] Is the Jetson alone good enough for that, or is it the latency, 20:53.160 --> 20:59.880 or the overhead? — The Jetson is great for that, but on the ESP it was like that. 21:00.440 --> 21:04.920 It consumes like a hundredth of the power. 21:05.000 --> 21:11.000 Yes, but I don't know, but I suppose the Jetson [inaudible]. 21:11.000 --> 21:12.760 Correct, yes, yes, yes. 21:12.760 --> 21:15.560 But does it use less power in that way? 21:15.560 --> 21:24.360 For the actual ESP, yes. For the Jetson — imagine if you have [inaudible], 21:24.360 --> 21:30.840 a logistics-space environment, you have two directions, and a lot of these things moving around. 21:30.840 --> 21:32.840 So you don't have to put it down to the bottom.