WEBVTT 00:00.000 --> 00:13.960 So, for the next talk, I have the pleasure to introduce Ramon, who will be talking about closing the loop 00:13.960 --> 00:16.360 and self-learning compilers. 00:16.360 --> 00:24.680 I've heard today that NVIDIA published an AI-agent-generated generator 00:24.680 --> 00:29.960 for models, which is six times slower than the one written by people. 00:29.960 --> 00:35.640 So, let's see if we can do a much better job with self-learning compilers for AI. 00:35.640 --> 00:37.640 Let's see. 00:37.640 --> 00:38.640 Hi. 00:38.640 --> 00:39.640 Hi. 00:39.640 --> 00:41.840 So, my name is Ramon Rush. 00:41.840 --> 00:46.520 I represent Daisy Tuner, where I work as a compiler engineer, and I'll be presenting 00:46.520 --> 00:49.520 "Closing the Loop: A Self-Learning Compiler for AI Accelerators". 00:49.520 --> 00:50.520 Okay. 00:50.520 --> 00:54.920 So, I think it's pretty safe to say we live in the age of hardware accelerators. 00:54.920 --> 01:01.000 Hardly a month goes by without news about a new accelerator or even a new architecture. 01:01.000 --> 01:03.440 So, what does that mean for the user? 01:03.440 --> 01:08.960 First, we have to choose what kind of accelerator we want to target, if we determine 01:08.960 --> 01:10.880 that we need acceleration. 01:10.880 --> 01:16.240 And then there's always the balance between portability and specialization. 01:16.240 --> 01:21.040 You can remain agnostic to the actual hardware accelerator and easily swap it out if you 01:21.040 --> 01:25.360 remain on the right side, but you'll probably sacrifice a lot of performance, and you 01:25.360 --> 01:31.040 can get the best performance on the left side, but then you're probably looking at vendor lock-in or 01:31.040 --> 01:33.040 API lock-in. 01:33.040 --> 01:40.600 So, let's look at an example stack: a user comes with a portable program that hasn't 01:40.600 --> 01:48.360 used an accelerator before, and the program has proven that it works on complex problems. 01:48.360 --> 01:52.000 And here we can still see the balance between generality and specialization. 01:52.000 --> 01:55.600 If you want the best performance, you would write directly for the instruction set architecture 01:55.600 --> 02:01.000 of the accelerator, but the heart of this problem is the mapping. 02:01.000 --> 02:05.240 Here we can gain all the performance or lose all the performance, but I'll get back to that 02:05.240 --> 02:06.240 later. 02:06.240 --> 02:08.920 First off, let's look at what we want to do. 02:08.920 --> 02:15.280 We want to build a toolbox that allows for optimized mapping between all sorts of different 02:15.280 --> 02:20.960 accelerators from basically any kind of input language or format. 02:20.960 --> 02:21.960 Yeah. 02:21.960 --> 02:24.600 So, let's look at how we do that. 02:24.600 --> 02:30.480 First off, the representation in which we think, so you get a better idea of the level at which 02:30.480 --> 02:32.520 we're tackling that mapping. 02:32.520 --> 02:38.440 So we call our representation structured stateful dataflow multigraphs. 02:38.440 --> 02:41.080 Let's start out with a simple dataflow graph. 02:41.080 --> 02:45.800 We have access nodes, which represent variables in this case. 02:45.800 --> 02:50.080 We have tasklets, which are the basic math operations, and we have the edges, which we call 02:50.080 --> 02:51.080 memlets. 02:51.080 --> 02:52.080 This would be a load. 02:52.080 --> 02:53.880 Here would be a store operation. 
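A minimal sketch of the simple dataflow graph just described, written as plain Python data. The encoding below is illustrative only and is not the actual SDFG schema or Daisy Tuner API; it just names the three ingredients: access nodes, a tasklet, and memlet edges for the load and store.

```python
# Illustrative sketch (hypothetical encoding): the dataflow graph for c = a + b.
graph = {
    "nodes": [
        {"id": "a",   "kind": "access"},                              # access node: variable a
        {"id": "b",   "kind": "access"},                              # access node: variable b
        {"id": "add", "kind": "tasklet", "code": "out = in1 + in2"},  # basic math operation
        {"id": "c",   "kind": "access"},                              # access node: variable c
    ],
    "edges": [  # memlets describe the data movement between nodes
        {"src": "a",   "dst": "add", "connector": "in1"},  # load
        {"src": "b",   "dst": "add", "connector": "in2"},  # load
        {"src": "add", "dst": "c",   "connector": "out"},  # store
    ],
}

if __name__ == "__main__":
    for edge in graph["edges"]:
        print(f"memlet: {edge['src']} -> {edge['dst']} via {edge['connector']}")
```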
02:53.880 --> 02:55.920 We can make this more complex. 02:55.920 --> 03:00.560 We can add dependent data flow, and it stays very simple and very easy to understand in this format. 03:00.560 --> 03:04.080 We can add parallel execution in that graph. 03:04.080 --> 03:08.720 We can add offsets and operate on arrays instead of scalar variables. 03:08.720 --> 03:10.280 Still looks pretty simple. 03:10.280 --> 03:13.000 Then we can wrap this in a loop. 03:13.000 --> 03:16.360 In this case, it would be a data-parallel loop. 03:16.360 --> 03:20.960 And the memlets are now indirect accesses, and we can take this further and wrap it into 03:20.960 --> 03:21.960 more loops. 03:21.960 --> 03:24.440 Then we have the naive matmul implementation. 03:24.440 --> 03:30.680 Here the memlets are multidimensional accesses, technically with types attached. 03:30.680 --> 03:37.200 So we can take this one abstraction further and just represent a call. 03:37.200 --> 03:42.440 In this case, our memlet edges become just address calculations, no memory access, and 03:42.440 --> 03:47.680 the call could be anything: to other graphs, to external functions in some kind of environment, 03:47.680 --> 03:52.800 or to intrinsics, which are just well-known functions, like GEMM, that we could replace with 03:52.800 --> 03:57.640 the underlying code, or just call something else. 03:57.640 --> 04:00.480 Then lastly, we have conditional execution. 04:00.480 --> 04:05.520 It's also pretty simple, and you can chain all of these elements for larger programs. 04:05.520 --> 04:11.400 To summarize, the format expresses data flow and control flow at the same time. 04:11.400 --> 04:12.640 It's structured. 04:12.640 --> 04:17.000 You can see the loops directly, so that makes it very convenient if you're talking about 04:17.000 --> 04:20.240 optimizing, transforming, changing loops. 04:20.240 --> 04:26.280 It's a lot more terse than other formats, and it's simpler and easier to analyze. 04:26.280 --> 04:28.320 What it doesn't allow is irregular control flow. 04:28.320 --> 04:32.920 So you cannot represent a program that contains arbitrary branches. 04:32.920 --> 04:37.960 You need to simplify that, but we don't think it's a large problem, because most code 04:37.960 --> 04:43.480 is structured and doesn't have that problem. 04:43.480 --> 04:47.800 Also, you might have noticed the address and condition calculations are in a different 04:47.800 --> 04:51.480 format than the data flow. Especially for parallel problems 04:51.480 --> 04:57.480 this is very convenient, and the more separate they are, the more parallel the problem 04:57.480 --> 05:00.240 tends to be. 05:00.280 --> 05:06.160 Okay, so what are the elements that we have in our toolbox so far? 05:06.160 --> 05:12.360 We have our SDFGs and APIs to build them; we have serialization, so far just JSON, 05:12.360 --> 05:14.320 more to come. 05:14.320 --> 05:21.440 Then we have analysis passes, similar to compilers, that operate on loops, data dependencies 05:21.440 --> 05:28.080 and parallelism, and we have optimization passes that can optimize the loops, transform 05:28.080 --> 05:35.200 them to parallel execution, and so on, and right now our output is C/C++ code generation. 05:35.200 --> 05:40.160 So this way we are currently very agnostic about what comes behind it; any C/C++ compiler 05:40.160 --> 05:46.920 could compile it to an architecture, just like what I showed in the stack graph. 
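A compact sketch of how those toolbox pieces fit together: build a graph, serialize it to JSON, run an analysis pass, and generate C code. The graph structure, function names, and generated code here are hypothetical stand-ins, not the library's real API, schema, or code generator.

```python
# Illustrative sketch only: the shape of the toolbox workflow described above,
# using a toy graph for the data-parallel loop c[i] = a[i] + b[i].
import json

def build_toy_graph():
    """Toy stand-in for the graph-building API."""
    return {
        "loop": {"var": "i", "start": 0, "end": "N", "parallel": True},
        "tasklet": "c[i] = a[i] + b[i];",
    }

def analyze_parallelism(g):
    """Toy analysis pass: report whether the loop can run in parallel."""
    return g["loop"]["parallel"]

def generate_c(g):
    """Toy C code generation, adding an OpenMP annotation for parallel loops."""
    l = g["loop"]
    pragma = "#pragma omp parallel for\n" if l["parallel"] else ""
    return (f'{pragma}for (int {l["var"]} = {l["start"]}; '
            f'{l["var"]} < {l["end"]}; {l["var"]}++) {{ {g["tasklet"]} }}')

if __name__ == "__main__":
    g = json.loads(json.dumps(build_toy_graph()))  # JSON round-trip (serialization)
    print("parallel loop:", analyze_parallelism(g))
    print(generate_c(g))
```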
05:46.920 --> 05:52.400 In terms of front ends, right now part of our library are Python bindings, so you can build 05:52.400 --> 05:58.000 those graphs from Python, and we also have a decorator, so let's look at an example. 05:58.000 --> 06:04.320 If we have this simple code, we can annotate it with our annotation, and when it executes, 06:04.320 --> 06:09.760 our library will parse that code, transform it into our graph representation, and if you let 06:09.760 --> 06:17.080 it do its default work, it will output C/C++ code, compile it, load it back into the runtime, 06:17.080 --> 06:22.800 and execute it, so it will just make simple Python code faster, so far, but it's a good 06:22.800 --> 06:26.480 platform to experiment with and look at what the representation does. 06:26.560 --> 06:36.560 So this would, for example, be our graph that then executes. Okay, we are also working on, 06:36.560 --> 06:43.600 I believe it's been pushed today into the Git repository, it's not really doing much so far, but 06:43.600 --> 06:50.720 more to come, a front end to ingest PyTorch models directly without having to go through Python 06:50.800 --> 06:59.200 code, and we are also working on ingesting MLIR graphs directly. On the target side, we have the generic 06:59.200 --> 07:04.160 host target, which basically uses the normal C/C++ code generation, so basically if you want 07:04.160 --> 07:10.640 to execute it on the CPU. We have an OpenMP back end, which basically adds OpenMP annotations 07:10.640 --> 07:17.120 to the output code, so it runs on multiple cores. We use Google Highway for vectorization, so we can 07:17.200 --> 07:28.720 use that, and we also have a CUDA back end, so let's take a look at that. Okay, so for CUDA, so far we had 07:28.720 --> 07:35.760 normal, linear code that runs on the CPU. Now we want to offload that, so we need to cut out a portion, 07:36.560 --> 07:41.760 we need to generate code that runs on the device, we need to generate the host code, 07:41.760 --> 07:48.160 and we need to generate memory management and transfers to the device and back. As a graph, 07:48.160 --> 07:53.600 this would look something like this in the middle: the same matmul graph from before and just the 07:53.600 --> 08:01.200 memory transfers above and below it, and what we also added on top is optimizations around the 08:01.200 --> 08:07.280 data structures. So if you were to execute this twice, the second one on the results of the first, 08:07.360 --> 08:11.920 it would be a waste to transfer that data back to the host and back to the accelerator again, 08:11.920 --> 08:16.160 and we can just remove those transfers. Similarly, read-only data that never changes 08:16.160 --> 08:22.800 can just remain on the device, and this optimization is also generic; it could apply to any 08:22.800 --> 08:30.240 accelerator that is modeled as a target for this library. Okay, so let's look at how well this works 08:30.240 --> 08:38.000 so far. This is NPBench, a popular Python benchmark suite; it uses a few different back ends, also with decorators, 08:38.000 --> 08:44.720 just as we do, so they just-in-time replace the code that executes and measure the performance, 08:45.440 --> 08:52.240 and NumPy is the baseline on the bottom in milliseconds, and everything else is relative to 08:52.800 --> 08:59.120 NumPy; green indicates a speedup, red indicates a slowdown. So these are just the frameworks that come 08:59.200 --> 09:08.880 with NPBench, and we run it at the paper size. Okay, so this is the baseline; we don't do any crazy 09:08.880 --> 09:14.000 
optimizations yet; it's basically just cleaning up what we get from Python, and the C code we compile 09:14.000 --> 09:22.000 with Clang at O3, and you can see we are achieving some speedups. There are a few gaps where 09:22.000 --> 09:28.480 we don't map all operations just yet, but we're actually pretty happy with how well that works. 09:28.640 --> 09:35.040 If we enable OpenMP, it still works; right now we're not seeing that much speedup because 09:35.040 --> 09:41.120 it's annotation-based, so if it's just a library call that's emitted, the annotation won't actually 09:41.120 --> 09:49.360 do anything to parallelize it. And the CUDA back end works and can offload code that was not optimized 09:49.360 --> 09:57.040 for any GPU or anything, and we are also pretty happy that in some cases we're already beating the CuPy implementation. 09:57.120 --> 10:05.120 So, but as I said in the intro, mapping is the heart of the problem, so how do we map 10:05.120 --> 10:11.680 something such as this to various accelerators? For example, here a regular CUDA architecture, 10:12.240 --> 10:18.320 which is actually pretty similar to a normal CPU in terms of memory topology and address space: 10:18.320 --> 10:23.760 it's one shared address space, any processor can access any data, it just determines the speed at 10:23.760 --> 10:29.920 which it does it. But we can also go crazy and, for example, look at the Tenstorrent architecture, 10:29.920 --> 10:35.760 which is just a mesh of many connected cores that only talk to their neighbors. So here, if we 10:35.760 --> 10:40.160 wanted to get access to some data, we need to move that data through all those cores, 10:40.160 --> 10:45.680 we need to consider all the bottlenecks we created on the way, on each core, and even if we get the 10:45.680 --> 10:50.720 data to its target, it's still not a normal core; it's five cores that need to move 10:50.880 --> 10:57.040 and stream data around between the various accelerators. So, the mapping is at its heart 10:57.040 --> 11:02.160 about parallelism on the different levels, CPU cores, threads, vectors; it's about the memory 11:02.160 --> 11:08.240 topology and locality, and it's about data movement, and you can also further tune the data format 11:08.240 --> 11:14.000 and precision, as we did before. And if you do all of this right, you get a well-performing kernel. 11:15.120 --> 11:20.560 Right now, the standard way to do this is to basically build a library of hand-coded or hand-optimized 11:20.560 --> 11:25.920 kernels from experts for a given accelerator. For very popular math functions, you can just build a 11:25.920 --> 11:33.200 library of a few of those and have good performance. The hard-coded, agnostic way, sorry, 11:33.200 --> 11:39.120 I turned those around. Yeah, there are already solutions for the agnostic way 11:39.120 --> 11:44.720 that automatically do this mapping, but they don't get close in performance to manual performance 11:44.960 --> 11:51.200 engineering. Okay, so to explain manual performance engineering, let's look at the short example 11:51.200 --> 11:57.200 again, the simple matmul implementation, and this would be the NVIDIA example of how to run 11:57.200 --> 12:04.560 that fast in CUDA, just to show you what goes into it. So, here's code to initialize the hardware 12:04.560 --> 12:08.960 and allocate memory. We have data transfers to the accelerator because it's a separate memory. 12:09.920 --> 12:17.120 We have to match the exact hardware size we have. 
Here, this hides the two outer loops because 12:17.120 --> 12:22.320 the hardware manages those. Then this is where the code that runs on the accelerator starts. 12:23.680 --> 12:30.080 We allocate new memory and prefetch data into that memory before we calculate. We prefetch 12:30.080 --> 12:34.320 in a different order to maximally utilize cache lines, so we fetch all the data linearly. 12:35.280 --> 12:39.920 We still accumulate in the existing order. We need to manage multiple threads that we didn't 12:39.920 --> 12:47.360 have before, and we save ourselves writes to memory. So, a lot of stuff, but we can also break 12:47.360 --> 12:54.960 this down into a collection of building blocks. Most of those optimizations are textbook optimizations. 12:57.040 --> 13:03.680 They have parameters, and you have to figure out what to apply them to and when, but it's basically building 13:04.320 --> 13:08.880 blocks. So if you are enough of an expert in the hardware, you can pick the standard textbook 13:08.880 --> 13:14.320 transformations, pick their parameters, pick which parts of the algorithm you apply them to, 13:15.120 --> 13:20.400 and then the order of it, and then instead of looking at the complex kernel I showed on the right 13:20.400 --> 13:26.960 side, you can look at the basic pseudocode on the left side and a recipe that is specialized 13:26.960 --> 13:32.080 for the algorithm and the specific hardware. That will get you the same result, but this way you can look 13:32.640 --> 13:37.280 at the base algorithm, and the recipe is what determines what hardware you can run it on. 13:37.280 --> 13:40.400 And as long as you have the recipe, it'll run anywhere as fast as it can. 13:42.640 --> 13:50.800 An example for a recipe: on the left is the simple code, and the transformation we're applying is loop 13:50.800 --> 13:55.920 tiling. So we're splitting one loop into two. The outer loop will make bigger jumps, blocks of 13:55.920 --> 14:00.000 32 in this case, and the inner loop will iterate over the elements sequentially as before. 14:00.800 --> 14:08.320 And we can do this to the outer loop as well. Then, for example, we could run loop interchange: 14:08.320 --> 14:13.040 we just swap two loops. They don't have code of their own between them, so we can do that without affecting 14:13.040 --> 14:21.440 the innermost code. We get that, and why do we do this? Okay, now here we can see the green loops 14:21.440 --> 14:26.800 basically just operate on the rectangle on the right side, and the outer loops work over the 14:26.800 --> 14:32.560 rest of the data. So what that gets us is very localized data, which for example could fit into 14:32.560 --> 14:40.000 the cache of one processor core. So basically this is what we would run on the multiple cores, 14:40.000 --> 14:46.560 and the inner code is what would run on one of the cores, to optimally parallelize something like this. 14:46.560 --> 14:51.680 And we can repeat the same concept on the inner loop to exploit vectorization, for example. 14:51.680 --> 14:59.440 But where do we get the fast recipe? We talked about how we can apply it, but not where we 14:59.440 --> 15:08.400 actually get it. We have the building blocks. Okay, there is a standard solution for this; 15:08.400 --> 15:13.440 it's called auto-tuning. But what auto-tuning is, is basically just trying all the possibilities: 15:14.080 --> 15:18.160 you spend a lot of time on it, and eventually you find the best performing solution for whatever 15:18.240 --> 15:24.400 performance target you had. 
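A minimal sketch of the two transformations just described, loop tiling with blocks of 32 and loop interchange, written out by hand for a plain Python matmul. The real toolbox applies these textbook building blocks on the graph representation and picks the parameters as part of a recipe; this only shows what the recipe does to the loop nest.

```python
# Illustrative sketch: loop tiling + interchange applied by hand to a naive matmul.
import numpy as np

TILE = 32  # recipe parameter: tile size, e.g. chosen so a tile fits in one core's cache

def matmul_naive(A, B, C):
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]

def matmul_tiled(A, B, C):
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    # Outer loops make bigger jumps over blocks of TILE (tiling, then interchange) ...
    for ii in range(0, n, TILE):
        for jj in range(0, m, TILE):
            # ... and the inner loops iterate over one small rectangle of the data,
            # which is localized enough to fit into a single core's cache.
            for i in range(ii, min(ii + TILE, n)):
                for j in range(jj, min(jj + TILE, m)):
                    for p in range(k):
                        C[i, j] += A[i, p] * B[p, j]

if __name__ == "__main__":
    A, B = np.random.rand(100, 80), np.random.rand(80, 90)
    C1, C2 = np.zeros((100, 90)), np.zeros((100, 90))
    matmul_naive(A, B, C1)
    matmul_tiled(A, B, C2)
    print("same result:", np.allclose(C1, C2))  # the recipe must not change semantics
```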
This takes a long time; it's infeasible to do this every time you 15:24.400 --> 15:29.760 want to compile something or build an application. So what we really want is to cache those results, 15:29.760 --> 15:35.680 the recipes, on a remote server that you don't have to see. So once we have it, 15:35.680 --> 15:40.720 everybody can use the recipe without having to know anything about the accelerator, and as long as 15:40.720 --> 15:44.800 the recipe exists, you can switch the accelerator and gain the best performance for that hardware 15:44.800 --> 15:52.960 as well. Okay, so, but there are many accelerators, many different architectures, and even for 15:52.960 --> 15:58.640 the same architecture, the optimizations can change, because there are microarchitectural details 15:58.640 --> 16:06.240 like cache sizes, which change how to get the best performance out of it. Okay, so what we want 16:06.240 --> 16:11.840 on top of the cache is similarity-based search. We don't just want to find the exact match; 16:11.920 --> 16:17.360 we want to find any algorithm that was similar, on similar hardware, and get the recipe for it, 16:18.160 --> 16:23.920 because it will get you similar performance on code we haven't seen before. This would look 16:23.920 --> 16:29.680 something like this: a database of similar algorithms, and we can port over the 16:29.680 --> 16:37.520 optimization recipes from a slightly different code. Okay, so we are using a neural network to 16:37.520 --> 16:44.320 classify the code, so to get the coordinates in that diagram, and we store for each point in 16:44.320 --> 16:49.360 the diagram the optimization recipe and the results we observed, and we can do that with multiple 16:50.080 --> 16:54.880 target architectures, and then you basically need a server farm to do auto-tuning nonstop 16:54.880 --> 16:58.720 for every architecture that you want to support, to find the recipes for every code you know of. 16:59.840 --> 17:05.600 And this is self-learning: if you feed it new code, it can figure out how to optimize that, 17:05.680 --> 17:10.560 and if you give it a new target architecture or a new accelerator, it can learn how to handle that as well. 17:12.560 --> 17:18.480 To give you a preliminary showcase of how well that is working out, we're using Polybench. 17:18.480 --> 17:24.880 This is running our internal C++ LLVM front end, so we're running on C code instead of the 17:24.880 --> 17:32.240 Python I showed before. It shows the speedup of code already vectorized by us if we apply our 17:32.320 --> 17:37.440 transfer tuning to it, and the transfer tuning database we use is currently very limited because 17:37.440 --> 17:42.880 it's very fresh. Okay, same type of graph, so you can see there are some benchmarks that 17:42.880 --> 17:49.840 profit by as much as a factor of two from the transfer tuning, and locally it was not much work for the 17:49.840 --> 17:55.120 compiler to do this because somebody else already ran the auto-tuning and figured out what to do, 17:56.080 --> 18:02.400 but there are some outliers where we make it slower by a lot. We hope to fix that soon. 18:03.840 --> 18:10.160 Okay, so in terms of building blocks, we're also adding the remote optimization, and our public 18:10.160 --> 18:16.640 repository includes a demo server that basically responds with preset responses of what to optimize for, 18:17.680 --> 18:21.680 because the server is not live yet in that way, but the client code is already there. 
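A toy sketch of the similarity-based recipe lookup just described. The embedding function, database contents, and recipe strings are all hypothetical stand-ins; in the real system a neural network maps the code to coordinates, and the database of auto-tuned recipes and observed results lives on a remote server.

```python
# Illustrative sketch: nearest-neighbour lookup of an optimization recipe
# for a program we haven't seen before, based on similarity to known programs.
import numpy as np

# Toy "database": embedding coordinates -> auto-tuned recipe for a given target.
RECIPE_DB = [
    (np.array([0.9, 0.1]), {"target": "cpu-openmp",
                            "recipe": ["tile i 32", "tile j 32", "interchange", "vectorize"]}),
    (np.array([0.1, 0.8]), {"target": "cpu-openmp",
                            "recipe": ["fuse loops", "parallelize outer"]}),
]

def embed(code: str) -> np.ndarray:
    """Stand-in for the neural-network classifier that maps code to coordinates."""
    # Hypothetical heuristic features; a real embedding would come from the trained model.
    return np.array([code.count("*"), code.count("+")], dtype=float) / max(len(code), 1) * 50

def lookup_recipe(code: str) -> dict:
    """Return the recipe of the most similar known program (nearest neighbour)."""
    q = embed(code)
    best = min(RECIPE_DB, key=lambda entry: np.linalg.norm(entry[0] - q))
    return best[1]

if __name__ == "__main__":
    new_kernel = "C[i,j] += A[i,p] * B[p,j]"
    print(lookup_recipe(new_kernel))  # port over the recipe from similar code
```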
18:22.000 --> 18:28.960 And internally, as I mentioned, we are already working on doing that from C++ code directly 18:28.960 --> 18:37.120 and on the transfer tuning database, but those are not ready yet. Yeah, and we hope to gain 18:37.120 --> 18:43.440 contributions for different accelerators, if we don't get to them ourselves, and new front ends, 18:43.440 --> 18:47.760 and if you're curious and want to give it a try, these are the repositories; the upper one is the tool, 18:48.720 --> 18:53.920 and just yesterday we published the Python project that does the transformation that I showed in the 18:53.920 --> 18:57.920 beginning. Thank you for your time.