Yes, thank you, Kenneth. You can hear me okay? Back there? All right, perfect. So I'm JP, some of you may know me from Twitter or Mastodon or whatever, and I'm going to talk about the programming models that we have in ROCm. Now, I somehow missed how many minutes I have, so I probably have way too many slides, so let's keep the pace up.

First off, AMD has too many compilers, or so some here complain, but really there are two compilers: for Instinct there's ROCm, and for EPYC there's AOCC. Don't confuse them. Anything else is for, like, development and research or whatever. So ROCm if you want to do GPU offloading, AOCC if you're targeting the EPYC processors, okay? Two compilers, not too many.

Okay, ROCm, what is ROCm? ROCm is our software stack, right? We go from deployment tools and low-level stuff at the bottom all the way up to actual applications and benchmarks, and there are libraries in between. We have support for different operating systems, and we're bringing in support for more operating systems. But today I will be mostly looking at this part of the stack: the programming models, which are HIP, OpenMP, and to some extent OpenCL, and then sitting on top of that is the library stack, right? So that's rocBLAS and rocFFT and so on.

Okay, so thinking about the programming models, there is HIP. HIP is our grid language. So basically, where you're used to doing CUDA, for an AMD GPU you would do HIP. It's a grid language, and I'm going to explain more what I mean by that. HIP is meant to give you maximum control over how you program your kernels, right? But it's, in a sense, vendor specific, albeit you can run HIP applications on NVIDIA hardware. So keep that in mind.

Then there's OpenMP. Now, OpenMP traditionally uses a fork-join model of thinking, even though it has transitioned more to a task-based idea, but it's standardized. So there's a standard, and we implement that standard, so it's not vendor specific. That's also important, keep that in mind.

And then there's OpenCL, which I tend to think of as a grid language as well, because you actually have more control than with OpenMP, but it's also standardized, actually by the Khronos Group. So you can rely on having a standardized thing: when another vendor implements the OpenCL specification, you can potentially port to that vendor.

Now, the other thing is, when you think about programming models, you also, or at least I also, think about the languages that you can use these programming models from, right? Because that's important. The programming model doesn't help you at all if you're programming in the wrong language, right?
So this is why I also have the programming languages here. In ROCm we support C++, C, and Fortran, so basically you can mix and match here. And then I also tend to think of what libraries I get, because library support is also important, since you don't necessarily want to redo all the work yourself, right? And so I have the ROCm libs here, kind of an umbrella term, and then we're going to look at stdpar, and also hipfort, which I'm going to touch on briefly later in the talk.

All right, so this matrix we're going to see further along, and I will highlight which parts of the stack I'm going to show you. But first off, let's look at the models and the languages. They actually all go through the same LLVM compiler backend. So if we improve some part of the compiler, that typically reflects on all programming models and all languages, right? That's also an important point. And if you download ROCm right now, that's not entirely true, because our Fortran compiler is based on an older version of LLVM, because it's still based on Classic Flang, but we're moving towards this. And you will get a link where you can actually get access to this stack later in the talk.

Okay, so as I said, I'm going to show this matrix more. So I will have examples for HIP, OpenMP, C++, Fortran, stdpar, the ROCm libs, and I think hipfort too. I may have missed making that visible. Thank you.

So first, let's look at HIP and C++. Getting back to the grid fundamentals, or grid programming languages, what do I mean by that? When you think about the grid programming model, it's how you have to program a kernel and basically map your problem to the GPU, right? Or how do you execute your program on the data on the GPU? And this is what it looks like: you have a grid, you have a grid of blocks, and in those blocks you have warps. And I know that's an NVIDIA term, we have similar things like wavefronts and whatever, but people are more used to this terminology, right?

Now, one distinction I'm going to make: whenever I speak of a lane, I actually mean what would be called a thread, because I prefer the term lane in this regard, right? So whenever I say lane, think of something like a CUDA thread, potentially.

So when you do HIP grid programming, what do you do? You write a kernel. So here's an example. You put a __global__ specifier here, and then we're going to have a running example of saxpy, because that's basically nice and easy enough to put on the slides, right? You're nodding, okay, that's right. So in HIP you would do saxpy, you get some floats, okay? That's great. Then you compute your specific lane ID in the whole grid using this.
You potentially want to check if you're outside of your data set, and if you're not, you're going to do the computation, right? And so here you see that all of these things are just identifying your specific lane, your specific work item, in this grid, to do the computation. For people who are used to programming with CUDA, this should not be a surprise to anyone, right? That's just how you do it, and it's also just what you do in HIP, right?

So that's the GPU part, and then how do you get that onto the GPU? You actually say, okay, I need some pointers, I need some memory, I need to copy some data, and then I'm going to execute the kernel on the GPU. And we also have the triple-chevron syntax here for actually launching the kernel. You can put that onto a stream, so you can have multiple streams in flight at the same time. So that's very much what you would expect from CUDA.

Speaking of CUDA, let's say you have a CUDA application, right, and you want to use that on an AMD GPU, what are you going to do? So we have something that works sometimes, that's called HIPIFY. I know it is not perfect, I know that, but it's there, and it gets you some way towards actually being able to run stuff on an AMD GPU. So that would then give you a HIP application, to a certain extent. The neat thing about HIPIFY and HIP, I think, is that you can actually do incremental porting, because you can get HIP to compile for NVIDIA. So you can replace parts of the application with calls to the HIP runtime and actual HIP things, and then at compile time say, oh, by the way, I want to run this on NVIDIA, and you would still be able to execute the same thing on NVIDIA. And you can mix and match this while you're actually doing the porting. And I think that's neat.

Now, there are two versions of HIPIFY. Of course, again, there are too many versions, right? There's a text-based translation, which is more for when you have a single file and you want to do this easily, in a sense; you would use the text-based version. And then there's a compiler-based version that's more elaborate, but there you also have to make sure that your application already compiles as a whole, right? Because you have to give all the include paths, all the definitions, everything you need to actually compile that thing, just for doing the HIP translation. So that's something to keep in mind.
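Going back to the saxpy kernel and its launch from a moment ago, here is a minimal sketch of what that HIP code could look like. The kernel name, array size, and launch configuration are illustrative choices of mine, not the exact code from the slides.

#include <hip/hip_runtime.h>
#include <vector>

// Kernel: each lane handles one element, with a bounds check against n.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lane ID within the whole grid
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    // Allocate device memory and copy the inputs over.
    float *dx = nullptr, *dy = nullptr;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Triple-chevron launch; a fourth chevron argument would select a specific stream.
    const int block = 256;
    const int grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(n, 2.0f, dx, dy);

    // Copy the result back and clean up.
    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(dx);
    hipFree(dy);
    return 0;
}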
Okay. Now OpenMP. I have to say, I am actually more on the OpenMP team, so maybe I'm going to show you more OpenMP than other people would, but anyway. So, OpenMP with C++ and Fortran next. First, OpenMP fundamentals. OpenMP, I said, fork-join model. What does that mean? That means that, theoretically, you start sequentially, then you fan out into a parallel region, and then you basically synchronize and get back to sequential. Yes, HIP basically does the same, right? You start sequentially, then you fan out and whatever. The difference is that in HIP, you are the person who decides how to map all the data to all the lanes. In OpenMP, it's the compiler. So it's not you. It's in a sense easier. You don't get the full control, that's right, but you also do not have to do all the work of, oh, which lane is going to write to which data item during the kernel, right? So that's, I think, a positive, albeit not suitable for every problem you have.

Okay, so OpenMP and C++. Saxpy again. So again we have the same signature. Oh boy, thank you, ten minutes. We have the loop that does the computation, and then we basically put a pragma omp target teams distribute parallel for on it. We map some data to the device with a "to" map, because we do not need to transfer the result back for the x part, and then we use a "tofrom" map for y. So we bring the data of y to the device and back, right? So that's saxpy running on the GPU with OpenMP and C++. And that makes it actually possible to have a main function that has the data, does the computation, and brings everything to the GPU. So that's working, that's the whole example, basically. Of course, I omitted some things for brevity here, so, you know, I'm not showing initialization here or printing out the values, but basically this is a fully functional GPU-offloading saxpy implementation. So that's great, I think that's great.

The same actually also applies to Fortran. I'm not a Fortran person, by the way. Okay, so I basically copy-pasted this, and let's see. We have some reals, we have some integers, we have a do loop, and then we say omp target teams distribute parallel do. Again we do the mapping, and we get a functioning GPU-offloading Fortran program that does the saxpy on the GPU. And I think that's neat.

All right, C++ stdpar and the ROCm libs. By the way, anybody here who prefers to write pure C++ and really dislikes OpenMP for the pragmas? Yes? Okay, okay. I assumed there were people in the audience like that. So, saxpy. Let's do a std::transform here. That's kind of nice, nice C++, potentially. How do you bring that to the GPU? Well, you could simply use another execution policy here, which is std::execution::par_unseq, give a compile flag, and we offload that transform to the GPU using some HIP magic. Okay, so if you prefer to stay within C++, that might be a way for you to go.
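As a rough sketch of those last two flavors, here is what the saxpy loop could look like with an OpenMP target pragma and with a stdpar std::transform. The function names are mine, and the exact build flags (an OpenMP offload flag such as -fopenmp plus an offload-arch option, and the C++ standard parallelism offload support in recent ROCm compilers) depend on your toolchain.

#include <algorithm>
#include <execution>

// OpenMP offload: x only goes to the device, y goes to the device and back.
void saxpy_omp(int n, float a, const float* x, float* y) {
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// stdpar: the parallel execution policy is what the compiler can offload to the GPU.
void saxpy_stdpar(int n, float a, const float* x, float* y) {
    std::transform(std::execution::par_unseq, x, x + n, y, y,
                   [a](float xi, float yi) { return a * xi + yi; });
}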
ROCm libraries. Maybe you don't actually want to write the kernels yourself, right? You can use the libraries. And so here, what you would do is use rocBLAS, for example. You would create a handle. Then you still need to do some memory allocations here and memory transfers, so you would need to interact with HIP in that sense, doing hipMalloc and hipMemcpy. But then basically you tell rocBLAS that it can access the alpha pointer on the host. We're using that just for the alpha value, which is a scalar, so we don't bother transferring that to the GPU ourselves. Then we call into the rocBLAS API for doing this saxpy, and we copy the result back to the host. Basically, all the GPU work happens here. And we have to destroy the handle again. So that's saxpy when you're using rocBLAS.
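For completeness, a minimal sketch of that rocBLAS flow, assuming the rocblas_saxpy call and the rocblas/rocblas.h header location; error checking is omitted here.

#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <vector>

int main() {
    const int n = 1 << 20;
    const float alpha = 2.0f;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    // Device allocations and transfers still go through HIP.
    float *dx = nullptr, *dy = nullptr;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    // The scalar alpha stays in host memory; tell rocBLAS to read it from there.
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);

    // All of the actual GPU work happens in this one call.
    rocblas_saxpy(handle, n, &alpha, dx, 1, dy, 1);

    // Bring the result back and clean up.
    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    rocblas_destroy_handle(handle);
    hipFree(dx);
    hipFree(dy);
    return 0;
}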
Okay. Now, I don't have the preview here, let's see. I think now I'm talking more about Fortran, and that's the next-generation Fortran compiler. Yeah, great. So that's been a journey for our team. The next-generation Fortran compiler: in 2017, upstream LLVM started a project to come up with the next-generation Fortran compiler. They started implementing the base language features. Then at some point they realized, okay, we are far enough when it comes to base language features, let's do some OpenMP host-side support. And then, what AMD has been doing for two years, two and a half years now, I think, is actively contributing OpenMP GPU support upstream, for OpenMP target offloading in upstream LLVM's Fortran compiler. And we also do that downstream.

So there was a recent blog post about the journey of some of the people from our apps team with the next-generation Fortran compiler. You can find the actual article, the blog post, at that link. They go through what they had to do, what worked, and what didn't, and I think it's worth a read. Everybody's done taking photos? Okay. And one more. Okay, all right.

So these code examples are taken from the blog post. Again, I'm not a Fortran person. The example is a Jacobi solver. So we have, okay, I think I can manage that. We have Jacobi, we have a module, we have some type here, and then we continue on the next slide. There are not too many surprises here. We simply allocate the components, and then here we actually map the components to the GPU using the OpenMP target data directives. Then we have some more code that's removed for brevity. And then in the actual Jacobi run routine, we have a do loop, excuse me, and we call some routines. I'm going to show you two routines from that example: update and norm. I'm only showing these because of the target annotation here, which brings this code to the GPU. And we have collapse here, so we can collapse the two loops. That's an important feature when you're going to the GPU, because you want to increase the iteration space, so you can map these things better to the teams that you're running on the GPU.

And the other part here is the norm, because we have the collapse here too, but we also have the reduction, and we want to reduce on the GPU too. We support that, of course; we actually have a pretty fast implementation for the reduction on the GPUs, both in Fortran and in C++. So that's something to keep in mind as well.
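The slide code is Fortran, but the same collapse-plus-reduction pattern, rendered here as a hypothetical C++ sketch of a norm computation rather than the actual routine from the blog post, looks roughly like this.

// Sum of squares over a 2D field, offloaded with collapsed loops and a GPU-side reduction.
double norm2(int nx, int ny, const double* u) {
    double res = 0.0;
    #pragma omp target teams distribute parallel for collapse(2) reduction(+:res) map(to: u[0:nx*ny])
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i)
            res += u[j * nx + i] * u[j * nx + i];
    return res;
}

The collapse(2) merges the two loops into one larger iteration space that can be spread across the teams, and the reduction clause lets the runtime combine the partial sums on the device.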
And here, that's where I actually wanted to get to. We now make preview versions of ROCm available for download through the Infinity Hub that have the next-generation Fortran compiler. So this is kind of a preview release. You can download it, you can install it, you can use it, and we are happy if you do that, evaluate it, and open tickets against any of the components on our GitHub. So you can download ROCm here as the preview build and do your experimentation.

In addition to, or sitting on top of, the programming models and the libraries, we also have other frameworks that we test internally to make sure that we hit the performance everyone wants, and you can make use of that. A famous one, for example, is Kokkos. Sometimes, at least if you're using the OpenMP target backend of Kokkos, the compile times can be a bit, sometimes a little, too long. But we enable you to use these frameworks and we test internally that, you know, you get the performance you want.

So, wrapping up: AMD ROCm is high performance, open source, and portable. High performance: it powers some of the TOP500 list leaders. We have solutions for HPC and AI. We have compilers, libraries, and frameworks as part of the stack. It's open source, and we are committed to the open ecosystem. We contribute a lot of work upstream, for LLVM at least; that's where I work, so that's what I'm most familiar with. We're active in community engagement, and we're driving development both in implementation and in standardization. We are also a member, for example, of the OpenMP ARB, and we contribute there to the standardization efforts. And it's portable, in the sense that we think portable models should be what everybody uses, so you can easily switch between vendors. I think that's a good solution for the evolving landscape of accelerators, too; it's not necessarily just tied to GPUs. And of course, ROCm and our products are also supported through these third-party frameworks, as I mentioned. So if you're on Kokkos, you can more or less easily move to our products, our GPUs. And with that, I think I have to show you this slide. Thank you very much, and I'm happy to take questions.

Any questions?

So the question is whether I can tell you if there's NPU support in ROCm. Yeah, I don't know, sorry. I would love to have the answer myself, but I don't know.

So the question is: LLVM's backend has been more opinionated towards CPUs, and what challenges do we face? I'm not necessarily a GPU person, but I believe some of the optimization passes do not necessarily anticipate some of the address-space difficulties that we face. We've stumbled over address-space problems every now and then that we needed to fix, because there were assumptions in the actual optimizations or in the codegen that we had to fix for targets that use more than one address space. Yeah. Thank you.

Yes. So, whether the Xilinx FPGAs will be supported in ROCm at some point: I don't know, sorry.

[Inaudible audience comment.]

Yeah, yeah. So the comment was that there's one programming model missing, which is SYCL. Well, these are the officially supported programming models, right? And I think there's a distinction to be made, because we actually have more stuff in ROCm that you can use, but the question is whether it's officially supported, because we inherited a bunch of upstream stuff, but we don't necessarily test all of it. So on AMD GPUs you can go the SYCL route through things like AdaptiveCpp, for example, which was formerly known as hipSYCL, and so you would get that too. But I don't think it's officially part of the ROCm stack. Good comment, though.

What's that? What about, so I mentioned that HIP supports NVIDIA, what about Intel? I don't know. Yeah, I'm sorry, it's a very short answer, but I don't know. Currently it doesn't, for sure, right? Okay.

[Inaudible audience question about portability and intermediate representations.]

Yeah. So the question was whether there are any plans for an intermediate representation so that we would be able to do JIT compilation, more or less, to other targets, because right now you basically always have to compile for a specific ISA, and that kind of limits portability.
So we currently land, or we recently landed, patches upstream that allow us to compile for generic targets, so you would say gfx10 generic, and that would give you access to all of the gfx10-series GPUs. You may still want to compile for gfx1030, for example, the most specific one, the one you care about most, so that you get the best performance on the target you care about most, but you could still target all the other GPUs too. For the SPIR-V question, I'm not completely sure what exactly the plans are, but of course we are looking into ways to solve exactly that problem, because it is a problem, even internally. Thank you.

I did not quite catch that question. Maybe we can take it offline.