WEBVTT 00:00.000 --> 00:20.000 It's currently estimated that it takes about 50 terawatt hours of electricity to run the world's AI data centers, with estimates that this is going to go up to closer to 00:20.000 --> 00:29.000 400 terawatt hours by 2030. As well as the cost of running these data centers, we have the cost of getting data to the data centers. 00:29.000 --> 00:41.000 It is also often said that it takes about 0.03 kilowatt hours of electricity to transfer just one gigabyte of data, and if you're transferring data you have a cost in time as well. 00:42.000 --> 00:59.000 For many applications these trade-offs are just not acceptable, and a great deal of what we are talking about today is how you own your own AI and how you can do things locally yourselves. 00:59.000 --> 01:11.000 While there's a lot of very good tooling and infrastructure to do things locally and to do things yourselves, for some particular use cases the infrastructure is nascent and underdeveloped. 01:11.000 --> 01:18.000 And one particular use case that is of interest to us is the microcontroller class of processors. 01:18.000 --> 01:26.000 If you want to run things on a microcontroller, which doesn't necessarily have an operating system or a file system, 01:26.000 --> 01:38.000 it is possible at the moment, but it is certainly difficult, especially if you're trying, with your personal project or with your business, to do something interesting and novel. 01:38.000 --> 01:49.000 So, as we've been introduced, I'm Dr. William Jones, and I'm here with my colleagues James Lattery and Pietro Ferroa. 01:49.000 --> 02:04.000 And our goal today is to show you that, while the tooling for this microcontroller class of processors is nascent and underdeveloped, it is very possible to bring up an AI project that you're doing, personal or business, 02:04.000 --> 02:21.000 with a good AI modeling framework using only free and open source tooling. In our case we will be talking through a project we have done with a novel RISC-V processor with a custom accelerator, bringing up the ExecuTorch framework. 02:21.000 --> 02:25.000 And Pietro is going to start our talk on that. 02:34.000 --> 02:47.000 Okay, so, can you guys hear me well? 02:47.000 --> 02:53.000 Okay, so, what is ExecuTorch? 02:53.000 --> 03:02.000 At a high level, AI inference follows a predictable path, as most people may already know. 03:02.000 --> 03:06.000 A neural network is represented as a graph. 03:06.000 --> 03:20.000 So here we have a graph evaluator, which traverses the graph performing tensor assignment and operator evaluation, which means allocating memory and doing the maths. 03:20.000 --> 03:28.000 And then the dispatcher sends each task to the best hardware, the CPU or a specialized accelerator. 03:28.000 --> 03:33.000 Crucially, before ExecuTorch runs the program, 03:33.000 --> 03:45.000 we can apply graph-level transformations to simplify or fuse the graph, making the model leaner before it even reaches the chip. 03:45.000 --> 03:48.000 So why do we need a new tool? 03:48.000 --> 03:52.000 Standard PyTorch is massive. 03:52.000 --> 03:59.000 It's a Python-based framework built for both training and inference, 03:59.000 --> 04:04.000 which makes it too heavy for embedded systems. 04:04.000 --> 04:12.000 ExecuTorch, however, is its lightweight sibling, handling inference only.
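To make the "network as a graph" idea concrete, here is a minimal sketch, assuming a recent PyTorch with torch.export; the tiny model and its shapes are invented for illustration and are not the model used in the project.

```python
# Minimal sketch: capturing a PyTorch model as a graph with torch.export.
# The tiny CNN below is invented for illustration; it just stands in for
# whatever model would later be lowered to ExecuTorch.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

model = TinyCNN().eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# torch.export traces the model into an ExportedProgram: the graph form that
# graph-level simplification and fusion passes operate on before anything
# reaches the chip.
exported = torch.export.export(model, example_inputs)
print(exported)  # shows the captured operator graph
```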
04:13.000 --> 04:24.000 Our goal is to take a model trained in the flexible PyTorch environment and harden it to run on tiny, resource-constrained devices. 04:24.000 --> 04:29.000 ExecuTorch follows a build-then-customize approach. 04:29.000 --> 04:33.000 The build phase is handled by the ahead-of-time compiler. 04:33.000 --> 04:41.000 On your desktop, it does all the heavy lifting, simplifying the logic and stripping away Python so the chip doesn't have to. 04:42.000 --> 04:54.000 And then we have the customize phase, which is the part that lives on the hardware. 04:54.000 --> 05:03.000 To bridge the gap between the two, we use a backend delegate: during the AOT phase, the ahead-of-time phase, 05:03.000 --> 05:18.000 we tell ExecuTorch not to run these ops on the CPU but to package them for a specialized accelerator, allowing us to plug the hardware's unique brain into the ExecuTorch body. 05:18.000 --> 05:27.000 So, the build pipeline: the journey from research to production follows a linear path. 05:27.000 --> 05:35.000 We have the PyTorch model, which is the starting point and is full of Python overhead. 05:35.000 --> 05:43.000 Then we have the AOT export, which strips away all the Python and serializes the logic into a .pte file. 05:43.000 --> 05:51.000 The .pte file is the universal language for ExecuTorch; it contains the graph, the weights and the instructions. 05:51.000 --> 06:01.000 Then we have the runtime, which is the lightweight engine on the device that reads the .pte file, allocates the memory and executes. 06:01.000 --> 06:09.000 And the result of all of this is fast inference on devices that don't even have an operating system. 06:09.000 --> 06:13.000 We have some constraints. 06:13.000 --> 06:23.000 We applied all of this to a RISC-V-based platform in a bare-metal environment, which means the hardware constraints are absolute. 06:23.000 --> 06:37.000 Memory is a luxury: we work with megabytes, not gigabytes, and by using the AOT phase to pre-calculate memory offsets, we save the chip from having to figure them out at runtime. 06:37.000 --> 06:41.000 We have few cores, few accelerators, and most importantly, we don't have an OS. 06:41.000 --> 06:47.000 So there's no Linux, no file system, and no dynamic linker. 06:47.000 --> 06:53.000 So the ExecuTorch runtime is statically linked directly into our binary. 06:53.000 --> 07:01.000 And because it's modular and dependency-free, we can run the .pte model directly on the metal, 07:01.000 --> 07:11.000 turning a specialized chip into a dedicated AI engine with only minor changes. 07:11.000 --> 07:20.000 When it comes to customizing for performance, we have two main methods. The first is drop-in replacement: 07:20.000 --> 07:30.000 if you have a hand-tuned kernel for a standard operation, for example a convolution, you can simply swap the default version for the optimized one at build time. 07:30.000 --> 07:41.000 And we also have more advanced optimizations, for example graph-level optimizations that can fuse layers together. 07:41.000 --> 07:53.000 By the end of this pipeline, we move from a generic model to a highly tuned system where software and hardware act as a single cohesive unit. 07:53.000 --> 08:01.000 And now I'm going to pass over to my colleague Shane, who will go into more detail about the optimizations we have worked on. 08:01.000 --> 08:03.000 Thanks, Pietro.
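A minimal sketch of the AOT export flow described here, assuming the executorch Python package is installed; exact module paths can vary between ExecuTorch releases, the model and output file name are illustrative, and the backend-delegation step is only indicated as a comment because the partitioner is vendor specific.

```python
# Minimal sketch of the ahead-of-time (AOT) export to a .pte file, assuming
# the executorch Python package; module paths can differ between releases,
# and the model and file name are purely illustrative.
import torch
from executorch.exir import to_edge

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# 1. Capture the PyTorch model as a graph.
exported = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect, where graph-level simplification happens.
edge_program = to_edge(exported)

# 3. (Optional) Delegate subgraphs to a hardware backend here, e.g.
#    edge_program = edge_program.to_backend(SomeVendorPartitioner())
#    -- the partitioner is vendor specific, so it is only sketched.

# 4. Serialize to the .pte flatbuffer that the on-device runtime reads.
et_program = edge_program.to_executorch()
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```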
08:03.000 --> 08:10.000 So yeah, we've been working on, as Pietro said, a RISC-V processor with a custom NPU. 08:10.000 --> 08:22.000 And we've employed a number of optimization strategies in order to get the model working as fast as possible. 08:22.000 --> 08:37.000 So, the baseline of ExecuTorch essentially has the operators all built in from PyTorch. 08:37.000 --> 08:46.000 So for anything you bring down from PyTorch to ExecuTorch, it has an implementation for most of the main operators. 08:47.000 --> 08:52.000 But obviously, you want to optimize this. 08:52.000 --> 09:04.000 So the baseline here is where a single-core CPU will take the tensors and work on them. 09:04.000 --> 09:14.000 The first optimization strategy we employed was tiling. 09:14.000 --> 09:30.000 Tiling allows for breaking up the problem and essentially making it so that you can do things concurrently. 09:30.000 --> 09:41.000 And this has been pretty instrumental in terms of getting the operators to work as fast as possible. 09:41.000 --> 09:47.000 The next optimization strategy we worked on is multithreading. 09:47.000 --> 09:53.000 And this then allows you to take the tiles you've just created 09:54.000 --> 10:13.000 and split them across multiple cores on your CPU or on your cluster to allow for, again, working as fast as possible with the resources you have. 10:13.000 --> 10:23.000 A lot of the tensors in PyTorch and ExecuTorch are float32, which is 32 bits wide. 10:23.000 --> 10:30.000 And these are expensive to work with, especially on an embedded platform. 10:30.000 --> 10:37.000 So the other optimization strategy we have worked on is quantization. 10:37.000 --> 10:58.000 And this process is essentially where you take the 32-bit floats and transform them down into 8-bit integers for more optimal and faster performance. 10:58.000 --> 11:04.000 So, the first optimization is in regard to memory. 11:04.000 --> 11:15.000 This is where we have been using the multiple cores together with, specifically, L1 and L2 memory. 11:15.000 --> 11:27.000 Through the use of a DMA, we can break the tensors down into tiles and bring these into L1 memory for faster performance, 11:27.000 --> 11:35.000 and yeah, to work on things quicker. 11:35.000 --> 11:49.000 The tensors generally tend to live in L2 memory, because this is where you have the most memory, 11:50.000 --> 11:56.000 as your L1 tends to be pretty small. 11:56.000 --> 12:08.000 And with the use of the DMA, you're able to break these into tiles and use the tiling algorithms that we have been working on. 12:08.000 --> 12:26.000 So, for the likes of an image convolution, this lets you break the image down into sub-tiles and prefetch them. 12:26.000 --> 12:41.000 As you can see, this is a pretty straightforward algorithm where, as you are computing on tile n, you're able to use a different core to load tile n plus 1. 12:41.000 --> 12:47.000 So, in that sense, you are essentially double buffering. 12:48.000 --> 12:57.000 And with the use of the DMA, this lets you quickly swap between buffers as needed. 12:58.000 --> 13:10.000 So, as you're computing tile n, once you have your results you can store them into the output tensors. 13:10.000 --> 13:18.000 And then you can compute on your next buffer straight away. And this essentially loops around.
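An illustrative Python sketch of the double-buffered tiling pattern just described. On the device this would be hand-written kernels driven by a DMA engine moving tiles from L2 into two small L1 buffers; here threads and lists stand in for the DMA and the memories, purely to show the scheduling (prefetch tile n+1 while computing tile n). All names and sizes are invented.

```python
# Double-buffered tiling sketch: compute tile n while "DMA-loading" tile n+1.
import threading

TILE = 4

def split_into_tiles(data, tile_size):
    """Pretend 'L2 memory': the full tensor, chopped into tiles."""
    return [data[i:i + tile_size] for i in range(0, len(data), tile_size)]

def dma_load(tile, l1_slot):
    """Stand-in for a DMA transfer of one tile into an L1 buffer."""
    l1_slot.clear()
    l1_slot.extend(tile)

def compute(tile):
    """Stand-in for the kernel working on the tile held in L1."""
    return [x * 2 for x in tile]

def run_double_buffered(data):
    tiles = split_into_tiles(data, TILE)
    l1 = [[], []]                  # two L1 buffers: ping and pong
    output = []

    dma_load(tiles[0], l1[0])      # prime the first buffer
    for n in range(len(tiles)):
        cur, nxt = n % 2, (n + 1) % 2

        # Kick off the 'DMA' for tile n+1 while we compute tile n.
        loader = None
        if n + 1 < len(tiles):
            loader = threading.Thread(target=dma_load, args=(tiles[n + 1], l1[nxt]))
            loader.start()

        output.extend(compute(l1[cur]))  # compute on the current buffer

        if loader is not None:
            loader.join()                # wait for the prefetch to finish
    return output

print(run_double_buffered(list(range(10))))
```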
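And a rough sketch of the float32-to-int8 quantization mapping mentioned above and discussed further below: a scale and an offset (zero point) are computed from the tensor's range, then used to map values into 8-bit integers. The asymmetric scheme, guard values and numbers here are assumptions for illustration, not necessarily the exact scheme used on the hardware.

```python
# Post-training int8 quantization sketch: scale + zero point map float32 to int8.
import numpy as np

def quantize_params(x, qmin=-128, qmax=127):
    """Compute scale and zero point so that [x.min(), x.max()] maps onto int8."""
    x_min, x_max = float(x.min()), float(x.max())
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # keep 0.0 exactly representable
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
scale, zp = quantize_params(weights)
q_weights = quantize(weights, scale, zp)

# The int8 tensor is what the embedded kernels work on; dequantizing shows the
# (small) rounding error introduced by the 32-bit -> 8-bit mapping.
error = np.abs(dequantize(q_weights, scale, zp) - weights).max()
print(q_weights.dtype, f"max abs error: {error:.4f}")
```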
13:18.000 --> 13:31.000 You can load the next tile as you're waiting for the computation of the previous one. 13:31.000 --> 13:39.000 The next optimization that we employed is quantization. 13:39.000 --> 13:53.000 And as I previously mentioned, this allows you to take the 32-bit weights and biases and other tensors that you may have 13:53.000 --> 14:00.000 and break them down into 8-bit integers. 14:00.000 --> 14:08.000 This is a pretty core optimization that is very necessary for these embedded platforms, 14:08.000 --> 14:19.000 because working with 8-bit integers just tends to be much faster than using floats. 14:19.000 --> 14:40.000 So, you can calculate the scaling parameters needed and an offset, which will let you take the 32-bit floats and roughly map them to 8-bit integers. 14:40.000 --> 14:58.000 And this can then be applied post-training, fed into your operator, to get results out much faster. 14:58.000 --> 15:06.000 So, based on these optimizations, these are the benefits we've seen so far. 15:07.000 --> 15:12.000 This is ongoing work, so there are other optimizations that you can make. 15:12.000 --> 15:25.000 But with the current optimizations I've already outlined, as you can see, there's a roughly 3.5x performance benefit with a convolution, 15:25.000 --> 15:29.000 through the use of tiling and acceleration. 15:29.000 --> 15:49.000 And with softmax, there is at least a 2x benefit, but we've seen higher benefits with bigger tensor sizes, where the green is the baseline, which is the operators 15:49.000 --> 15:58.000 that ExecuTorch implements, and the blue is our versions of the operators. 15:58.000 --> 16:10.000 So, we'd like to thank the team at Mosaic SCC, who supported our work with this processor. 16:10.000 --> 16:29.000 And yes, so this allows us to take those big PyTorch models that are really energy-inefficient and make them energy-efficient on small embedded devices. 16:29.000 --> 16:31.000 Thank you. 16:31.000 --> 16:39.000 And I think we have five minutes for questions. 16:39.000 --> 17:06.000 Yes, if anyone has one. 17:06.000 --> 17:18.000 So, with great quantization comes great loss of accuracy, so I wonder just what you can run yet. 17:18.000 --> 17:35.000 Quantizing at eight bits, we have had similar issues, not being able to run, for example, sound-related models, so I wonder what your numbers are in this case. 17:37.000 --> 17:50.000 We're still fairly early on in developing this. Our numbers look good, but we can't conclusively say that we've solved the problems that everybody else is struggling with yet. 17:50.000 --> 17:54.000 Sorry, I don't have anything more concrete than that. 17:54.000 --> 18:03.000 Uh, something to add to that, while you find the next question. 18:03.000 --> 18:10.000 On this convolution graph here, where we're comparing data, this is quantized convolution versus quantized convolution. 18:10.000 --> 18:15.000 So, if we were comparing quantized versus unquantized, you would expect to see a fourfold increase 18:15.000 --> 18:18.000 at best, because we're going from 32 bits down to 8. 18:18.000 --> 18:26.000 Quantized versus quantized, it's a very, very close comparison. 18:26.000 --> 18:35.000 Uh, so I was wondering, 18:36.000 --> 18:43.000 have you looked at scaling this up to larger RISC-V systems and chips?
18:43.000 --> 18:57.000 We wouldn't use the ExecuTorch framework like this for larger RISC-V systems, because it's specifically aimed at doing things for the microcontroller class of processors. 18:57.000 --> 19:02.000 But we have, separately, explored doing things 19:02.000 --> 19:04.000 at a much larger scale. 19:04.000 --> 19:08.000 Um, and we would use a similar approach. 19:08.000 --> 19:13.000 We would use normal PyTorch to do it, I suppose, is where we would go with that. 19:13.000 --> 19:19.000 I think the system we looked at when we were looking at doing this at a larger scale had something like a thousand cores overall. 19:19.000 --> 19:21.000 So, yeah. 19:22.000 --> 19:31.000 Um, so the question was about TF Lite. 19:31.000 --> 19:44.000 So, the question was: 19:44.000 --> 19:56.000 with TensorFlow, you can convert all the operators into TensorFlow Lite 19:56.000 --> 20:02.000 when using TensorFlow Lite; is this the same with PyTorch to ExecuTorch? 20:02.000 --> 20:11.000 For the most part, yes, it's very capable of handling as many operators as you need, really. 20:11.000 --> 20:23.000 Our limitations, specifically, were operators that you'd use during training rather than inference, but I believe TF Lite tends to have those exact same issues. 20:23.000 --> 20:29.000 And it's still early days for ExecuTorch; I believe it's only a few years old. 20:29.000 --> 20:48.000 So, as PyTorch develops and as ExecuTorch develops, you're getting more and more optimizations, more and more operators, and, yeah. 20:49.000 --> 20:59.000 Something I'd say on that as well is that the way ExecuTorch does support for this is that it maps the very, very large set of all PyTorch operators down into a sort of smaller core set. 20:59.000 --> 21:04.000 And as Shane said, it's still early stages. It's pretty good at reducing things, 21:04.000 --> 21:09.000 mapping things from this big set of operators down to the smaller set, and in theory everything should be supported. 21:09.000 --> 21:14.000 Our experience is, it's a little rough around the edges because it's still being actively developed.