Okay, quiet down everyone. The next talk is about to begin. Quiet, quiet down. Okay, good, good, good.

So it is a great pleasure for me to introduce our next speaker. Ruben came to this devroom last year and gave an absolutely amazing presentation on Vulkan and the use of Vulkan in llama.cpp, and ever since I've been bugging him to keep presenting on that lovely subject. So he agreed, and here he is. Take it away, Ruben.

Thank you. All right, so my name is Ruben Ortlam. I'm now a very fresh machine learning engineer at Red Hat, but my work on Vulkan has mostly happened in my free time, actually. I want to briefly introduce the Vulkan API: why should you even care about it, why is it relevant? Then briefly llama.cpp as well, and what we have done since last year; there was a ton of work on the llama.cpp Vulkan backend and a lot has changed since then. After that, some benchmarks that show how it actually compares to the usual suspects for running large language models on GPUs, the difficulties I'm still struggling with, and that you will also run into if you try to use Vulkan for something like this, and finally the conclusion: is it worth putting in the time to use Vulkan here?

So basically: what is Vulkan, isn't that a gaming API? Yes, it's an API for graphics. It's the successor to OpenGL, and the idea was to get rid of some of OpenGL's inefficiencies by making it a lot more abstract, and what they ended up with is basically a generic interface to GPUs. So you can use the same API calls and the same kind of "shader" code to run on all kinds of GPUs, not just the usual Nvidia graphics cards.

My interest here is mostly this: I don't have a huge $200,000 Nvidia server somewhere, I don't have data centers. I just have some PC with an old graphics card. How do I make that run a large language model, so that I can use it for something I don't want to share with the cloud? We can actually do that. You don't have to use the graphics part of Vulkan at all; you can just run compute shaders as a replacement for kernels, which is basically the same thing. And that way you can use it for machine learning.

What I did was add a Vulkan backend to llama.cpp, over two years ago now, and it has grown a lot since. You're probably familiar with llama.cpp, because other talks have already covered it, so I'm not going to go too deep into it. Basically the idea is that whatever hardware you have lying around, you should be able to run an LLM on it. Of course, what kind of LLM you can run depends on how much memory you have and how patient you are waiting for your responses.

Under the hood, llama.cpp, via the GGML library, is based on a static graph structure.
It's not that different from other approaches we've seen, but the cool thing, and this has also grown since last year, is that GGML abstracts all of the backend stuff away into something you can execute on various backends. The compute graph that contains all of the different operations gets sent to a backend, and it can even be split up and sent to multiple backends, so there's a lot of interesting stuff you can do here; there's a rough sketch of that idea in code a bit further down, after the flash attention part. There are quite a few backends at this point; the most relevant ones are of course CPU, CUDA, Metal and Vulkan. The ROCm one for AMD is basically built on top of the CUDA backend, so it reuses most of the code from that. Then there's an OpenCL-based one that's, I think, aimed at mobile phones, there's a CANN one for, I think, Huawei accelerators, there's WebGPU, which is also interesting although I'm not sure how usable it is yet, and some, like the BLAS and zDNN ones, that just try to offload large matrix multiplications to libraries that are more optimized, on the CPU for example.

So what has actually happened since last year? One of the most important things we've done is flash attention. If you've ever looked into attention and the way it's used in large language models, you've probably come across flash attention. The paper was hugely influential, there are multiple versions of it now, and the usual way to add it in other projects, I think, is to use the code from specific GitHub repos. We have a custom shader for it, which I actually didn't write; that was Jeff Bolz from Nvidia. Last year, I think, only one version of it existed, and it was Nvidia-specific: the cooperative matrix 2 variant. Cooperative matrices are a Vulkan abstraction for tensor cores, or any kind of matrix acceleration hardware. Since last year we've also worked on cooperative matrix 1, which is the Khronos version and not specific to Nvidia; that is the one that will run on modern AMD hardware, for example. And of course there's also a scalar version, so even if the GPU doesn't have any hardware acceleration for matrix multiplications we can still run flash attention.

It will actually give you a huge increase in performance once the context has grown very large, and that has become incredibly important because the context these models can support is extremely big. With flash attention you fuse a lot of operations into one, so you do not need huge intermediate buffers, and you run one kernel call instead of a bunch of separate operations; the core idea is sketched just below. With something like 128k of context you will see a huge difference from using this, so implementing it was very important, and making it available to more hardware was a huge step as well. It's one of the things that made the biggest difference for performance in the Vulkan backend.
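To give a feel for that fusion idea, here is a minimal scalar C++ sketch of the single-pass, online-softmax trick that flash attention is built on, for one query vector. This is not the actual Vulkan shader from llama.cpp, just an illustration of why no large intermediate score buffer is needed; a real kernel processes keys and values in tiles and in parallel, but the arithmetic is the same.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One-pass ("online softmax") attention for a single query vector.
// q: [d], K: n rows of [d], V: n rows of [d]; returns the attended output [d].
// The full n-sized score vector is never stored: a running max `m`, a running
// softmax denominator `l` and a running weighted sum `acc` are updated as each
// key/value row streams by.
std::vector<float> attend_one_pass(const std::vector<float>& q,
                                   const std::vector<std::vector<float>>& K,
                                   const std::vector<std::vector<float>>& V) {
    const size_t d = q.size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    float m = -INFINITY;              // running maximum of the scores
    float l = 0.0f;                   // running softmax denominator
    std::vector<float> acc(d, 0.0f);  // running weighted sum of V rows

    for (size_t i = 0; i < K.size(); ++i) {
        float s = 0.0f;
        for (size_t j = 0; j < d; ++j) s += q[j] * K[i][j];
        s *= scale;                   // score for key i

        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new);  // rescale what was summed so far
        const float p     = std::exp(s - m_new);

        l = l * corr + p;
        for (size_t j = 0; j < d; ++j) acc[j] = acc[j] * corr + p * V[i][j];
        m = m_new;
    }
    for (size_t j = 0; j < d; ++j) acc[j] /= l;  // final normalisation
    return acc;
}
```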
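And stepping back to the compute-graph-plus-backends split mentioned a bit earlier, here is a purely conceptual C++ sketch of how a static graph can be divided across backends. This is not GGML's actual API; all type and function names are invented for the example, and the real scheduler also has to move tensor data between devices, which is omitted here.

```cpp
#include <string>
#include <vector>

// Conceptual only: a static compute graph is a list of ops, each backend
// reports which ops it supports, and the scheduler hands every op to the
// first backend that can run it, so one graph can be split across devices
// (with a CPU fallback, assumed last in the list, that supports everything).
struct Op { std::string name; };

struct Backend {
    virtual ~Backend() = default;
    virtual bool supports(const Op& op) const = 0;
    virtual void run(const Op& op) = 0;  // would record/dispatch the GPU work
};

// Walk the graph in order and execute each op on the first capable backend.
void run_graph(const std::vector<Op>& graph,
               const std::vector<Backend*>& backends) {
    for (const Op& op : graph) {
        for (Backend* b : backends) {
            if (b->supports(op)) { b->run(op); break; }
        }
    }
}
```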
There's still a lot to do on flash attention, though. Just over the last few weeks I've spent time optimizing the version that runs on modern AMD hardware. I got a ton of performance out of that; I had some crazy reports of people getting around four times faster inference from it. At the same time, there's probably still a lot of optimization work left that can be done there. So yeah, if anyone else wants to take a look at it, I would be happy not to have to do all of it myself.

Another thing I worked on, maybe half a year ago, is using the int8 dot product acceleration. That's a hardware feature where you have a dot product of packed int8 values: you pack four of them into one int32, multiply them pairwise, add the result to another integer accumulator, and all of that happens in a single clock cycle. It's available on some GPUs that do not have hardware like tensor cores to accelerate matrix multiplications. This is very interesting for llama.cpp, because we're mostly focused on quantized models, and the quantization schemes we use allow you to do a lot of the work with int8 multiplications and additions, so you can use this and get a big performance increase.

The hardware this most affects is, for example, Nvidia Pascal, which was the last generation without tensor cores but does have this dot product support. AMD Vega 20 is also very interesting: if you want to do GPU acceleration for large language models at home, one of the cool accelerators you can import cheaply from Chinese data centers is the MI50, and that's Vega 20, so it's one of the things that profits a lot from this. Intel GPUs too; there this is about the most relevant acceleration feature we can use on Alchemist. I only had to add this to the code once and it runs on all of these GPUs. It also helped in some other cases, but those were unusable for other reasons.
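To make that concrete, here is a scalar C++ emulation of what such an instruction does, plus a toy block-quantized dot product that uses it. This is only an illustration: the block layout and function names are made up, not one of llama.cpp's real quantization formats, and on the GPU the inner loop maps to a single hardware instruction rather than actual C++ code.

```cpp
#include <cstdint>

// Scalar emulation of the packed int8 dot product described above: each 32-bit
// word holds four signed 8-bit values; the hardware multiplies the four pairs,
// sums them and adds an accumulator, all in a single instruction.
int32_t dp4a(uint32_t a_packed, uint32_t b_packed, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const int8_t a = static_cast<int8_t>((a_packed >> (8 * i)) & 0xFF);
        const int8_t b = static_cast<int8_t>((b_packed >> (8 * i)) & 0xFF);
        acc += static_cast<int32_t>(a) * static_cast<int32_t>(b);
    }
    return acc;
}

// Why that matters for quantized models: a block-quantized dot product can be
// done almost entirely in int8/int32 arithmetic, with one float multiply per
// block at the end. The block layout here is invented for illustration.
float quantized_dot_block(const uint32_t* a_q, float a_scale,
                          const uint32_t* b_q, float b_scale,
                          int words_per_block) {
    int32_t isum = 0;
    for (int w = 0; w < words_per_block; ++w)
        isum = dp4a(a_q[w], b_q[w], isum);               // fast integer path
    return a_scale * b_scale * static_cast<float>(isum);  // one float fixup
}
```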
One more thing that's very interesting, and that primarily Jeff Bolz has been working on, is operator fusion. In large language models, and this example is a bit exaggerated, I just made up some numbers, you often have the pattern that there's one big operation and then a few small follow-up transformations on the result of that big operation. If you do that in the regular way, each of those is its own dispatch with its own overhead: you have to load the data from memory, do the calculation, store it, and then load it again to do the next transformation on the same data. If you instead pull all of that into the big operation, you can save a lot of time by not storing the intermediate results and by not dispatching extra kernels, or compute shaders.

That's a very useful optimization, but it's also very specific: we don't have a generic way of doing this. If you add a new model architecture and it works a bit differently, it won't immediately benefit, because its operations don't fit what's already implemented, so the existing fusions won't apply to the new model. Someone has to go and look at the new model, figure out where there's potential for fusion, and then actually implement it. There are some cool ideas about how that could be done in a more dynamic way; that's one of the areas that would be interesting to look at, but someone has to find the time, of course.
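As a toy illustration of why fusion helps: the unfused version below walks over the data once per small operation, each pass standing in for a separate kernel dispatch with its own loads and stores, while the fused version applies the follow-up transformations while each value is still in registers. This is a made-up CPU-side analogy, not the actual Vulkan shader code, and the operations chosen (a bias add and a scale) are just examples.

```cpp
#include <cstddef>
#include <vector>

// Unfused: every small follow-up op is its own pass over the data (on a GPU,
// its own dispatch): load, transform, store, then load it all over again.
void add_bias(std::vector<float>& x, const std::vector<float>& b) {
    for (size_t i = 0; i < x.size(); ++i) x[i] += b[i];
}
void scale_all(std::vector<float>& x, float s) {
    for (float& v : x) v *= s;
}

// Fused: the follow-ups are applied while each element is still "hot" (in
// registers, inside the big op's kernel): one pass, no intermediate stores,
// no extra dispatches.
void add_bias_and_scale_fused(std::vector<float>& x,
                              const std::vector<float>& b, float s) {
    for (size_t i = 0; i < x.size(); ++i) x[i] = (x[i] + b[i]) * s;
}
```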
And of course there's much more that has happened. We also got bfloat16 support, which originally wasn't supported in Vulkan but came in through extensions. There was a lot of work on reducing CPU overhead: in the beginning, and last year even, we still had a kind of dry run where you had to go through the whole graph once to figure out how much memory is actually needed for the temporary compute buffers, allocate that, and then go through the whole graph again and actually run the compute. We found a way to reduce that by doing these steps on demand, so it basically just runs, figures things out on the fly, resizes the buffers as needed, and then continues. There was also some crazy work on fences. A fence is basically something you wait on: you wait for an operation on the GPU to finish. Someone figured out that if you just wait until the whole graph has been computed, the CPU has been sleeping for so long that waking it up again actually takes quite a bit of time. That was solved by adding a fence somewhere early in the graph and then busy-waiting at the end, so the CPU is no longer sleeping at that point. What else? Some stable diffusion operators, which isn't relevant for large language models, but it's very cool to be able to run stable diffusion on Vulkan as well. And a huge number of other things happened that I can't all cover here.

So now I want to show some benchmarks. On Nvidia, what I've basically done is run the llama-bench tool that's in the repo, in this case on my 3090, once with the CUDA backend from the llama.cpp repo and once with Vulkan. On the y-axis is how fast the result was, on the x-axis how much context was in the KV cache; that's exactly where flash attention becomes extremely important. You can see that the Vulkan backend is slower than the CUDA backend, but not by much, and across context depths it stays approximately within that range, so the performance is actually competitive, although slower. You might ask: why would I use Vulkan if I can also just use CUDA? But there are cases where it would make things a lot easier. For example, if you already have a game or some other application that you want to integrate AI into, you could add CUDA to it, but that would be a huge hassle, or you just use Vulkan, which you're already using for graphics, and still get pretty competitive performance.

That was prompt processing, the prefill; that's also where the tensor cores get used. For token generation there are also some differences: on gpt-oss there's still some optimization left, we're still lagging behind CUDA there. But on the DeepSeek 2 architecture, and that's actually the architecture of the recent 4.7 Flash model, for some reason we're currently faster in token generation on Vulkan than on CUDA, which is quite interesting.

More interesting for me personally is the AMD Ryzen AI Max 395, which is the Strix Halo GPU. You can get that with 120 gigabytes of available VRAM, which makes it very interesting for mixture-of-experts models. Here you can see that on the old Llama 8B model it's actually slightly slower in prompt processing, but on both the huge gpt-oss 120B and the new 4.7 it outperforms the ROCm backend in prompt processing. In token generation it's the same thing; even in the gpt-oss case there's actually a big difference, the Vulkan backend is currently quite a bit faster there. Only on the 4.7 it's slightly behind at longer context but faster at low context, so there's work left to be done there.

There are also cases where a lot of work remains. This one is the Vega 20, I think; you can see that the scalar flash attention implementation is just not optimized for that hardware, it doesn't run well there. So while we can be faster at small context, for example in the gpt-oss case, it drops off much faster, so at larger context you get much less performance. Token generation is similar, it drops off too fast, so there's optimization to be done if someone wants to look into it; I'm happy to help, otherwise I'm going to have to do that myself, I guess.

Another example here is Intel, and I wanted to show this because it's an example of the driver issues that I'm still running into.
You can see that the results I got don't make much sense; there's something that actually got faster at larger context, which shouldn't happen. This is basically an example of a driver issue: the Intel Linux Vulkan driver is just not optimized for this kind of thing yet, there's still a lot to do there, and I'm having a lot of issues optimizing for it, which leads to results like this. I've had issues with all drivers at this point; I think I've found bugs in all of them and reported them. So yeah, this is one of the things I'm dealing with.

The other thing I'm still struggling with is how to actually optimize a compute shader. One of the issues I have: Nvidia does provide a way, with Nsight Graphics, to get some insight there, but for AMD I don't have anything like that, and for Intel I don't either, so it's actually a lot of guesswork to optimize a shader here. You can apply the same techniques as for CUDA, but AMD doesn't behave the same way as Nvidia and is quite different, so I had to do a lot of guessing and a lot of trial and error to figure out what is actually fast on which hardware.

So, in conclusion: Vulkan is very interesting. You can get a lot of performance out of it, as you've seen; you can actually beat some of the proprietary APIs if you put in the work. The development side is harder than something like CUDA, because you have to do a lot more work on the host side, a lot of boilerplate, to actually get anywhere with Vulkan. The tooling is a lot more limited in comparison; I'm hoping that can be improved in the future, but that's how it is right now. The hardware compatibility is much, much broader than any of the other usual APIs, so that's the big advantage. Binary size is something that's often forgotten: you get much, much smaller binaries. If you download PyTorch for CUDA you get multiple gigabytes of device code; if you do the same thing with Vulkan, in theory, you get something very small, because the code gets compiled to device-specific code on demand at runtime. And the performance of Vulkan can actually be very good. There's always something you cannot do with Vulkan, so the potential is slightly lower, but as you've seen, you can get pretty close.

So yeah, I hope I've sparked some interest in using Vulkan and helping out on the backend, or maybe integrating it somewhere else. I hope that in the future we can use it more often and get to a point where we are not as limited to one single vendor or one single way of writing kernels, where we don't have to write completely new kernels just to use a different GPU or different hardware.

So yeah, thank you.