If you haven't seen the paper, I highly recommend you go read it, because it basically proves, once and for all, that this is really an engineering story. Everybody was talking about how much Nvidia's stock price dropped and how we're now entering a different era, but I think Jim Zemlin summed it up best. Jim Zemlin, Executive Director of the Linux Foundation — a name we all know in the open source community — basically made this point: this is a golden age for low-level open source AI engineering. Hearing it from him was especially meaningful, because it reminded me of something I experienced myself.

How many of you recognize this email that got sent around in 1991, to one particular mailing list? Does anybody in the audience recognize it at all? I see a few hands — I guess us old-timers still show up at these events. This is Linus Torvalds announcing Linux, at a time when nobody thought that a group of individuals, let alone one person, could compete with Microsoft. Because Microsoft and IBM at that point — not all of you will remember — were both out to own it all. That was the landscape of the computer industry at the time; there was Sun Microsystems on the fringes, but that was it. And that email changed everything — now Linux is everywhere.

If you compare it to — well, it wasn't really an email, it was more like a GitHub repo, the ggerganov repository — it's actually amazingly similar, right? The same principles: I'm doing this thing mostly for fun, it's simple enough — this is what true engineering looks like. And in fact, I will try to tie it all back to DeepSeek, because that's exactly what DeepSeek proved. If you read that paper, you come away with a really clear understanding that they didn't come up with many rocket-science-level tricks on the machine learning side; they really just did some clever, very sophisticated engineering. And that's what we're all here for: to understand how computers can execute these types of systems as well as possible.

So now that we're talking about it, and just as an introduction: what is it that we're actually talking about? Because everybody says machine learning, deep learning, AI, and people get confused. I really love this Little Book of Deep Learning, written by a professor who also happens to be a really good engineer.
So download it, read it — it's about a hundred pages long — and you will understand what AI is much better. I love its definition: AI is an application, basically a business use case, of this technology. So when you say "I'm doing AI," you're really talking about an application — and I don't really want to talk about the application here. Engineering-wise, we're talking about deep learning, which is a branch of machine learning focused on learning representations. And the way we do it, to put it in engineering speak, is that we take this thing called a neural network. It comes with some kind of an architecture, and that part is slightly out of scope for us engineers: that's where the machine learning people come up with the next best architecture — transformers versus Mamba versus whatever comes next — and then we optimize the heck out of it by running that thing on a computer.
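To put a bit of concrete engineering speak behind that, here is a deliberately tiny sketch — toy shapes, random weights, nothing taken from the talk's slides or from any particular framework — of what such a network boils down to. The layer shapes and the nonlinearity are the architecture the machine learning folks choose; executing the matrix multiplies as fast as the hardware allows is the part we engineers optimize.

```python
# A minimal sketch of a two-layer neural network in plain NumPy.
# The shapes (784 -> 256 -> 10) and the ReLU are "architecture" decisions;
# making the two matrix multiplies run fast on real hardware is the
# engineering problem the rest of the talk is about.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, 256))   # first layer weights (toy values)
W2 = rng.standard_normal((256, 10))    # second layer weights (toy values)

def forward(x):
    h = np.maximum(x @ W1, 0.0)        # matrix multiply + ReLU nonlinearity
    return h @ W2                      # matrix multiply -> 10 output scores

x = rng.standard_normal((1, 784))      # one toy input vector
print(forward(x).shape)                # -> (1, 10)
```

Everything that follows — kernels, GPUs, TPUs, the frameworks — is about making those `@` operations, and their billions-of-parameters equivalents, run as fast as possible.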
And we used to run those things on computers through a variety of different frameworks, mostly built by people in academia, and hence not really designed with the kind of production rigor I'd expect from anything that runs on a computer. That kind of framework is very flexible; it lets machine learning people prototype extremely quickly and get to results. It's a very valuable piece of software they've given us, but it's not really something that's optimized — that's not its main use case.

Although some people do try. There's actually an interesting computer company called Etched that is taking transformers and, well, etching them — hence the name — right into the silicon. They're basically producing chips that are only ever capable of running one architecture, and that is transformers. That is one way to optimize it. Of course, if your model is not a transformer, you're out of luck. To me this is an extreme case, because I'm a guy who likes general-purpose computing.

In terms of general-purpose computing, all of these frameworks went through iterations. There's a prehistory, because machine learning started way back in the sixties, and a lot of the architectures are still with us in essentially the same form in which they were introduced back then. Amazingly enough, the framework of that era was MATLAB, because MATLAB is actually super old — not a lot of people know that. It was kind of the PyTorch of its day: your grandfather would go to work at IBM and use MATLAB to basically do AI. I mean, they wouldn't have called it that back then, but that's what happened. Not too many artifacts survive, although if you Google the PhD theses from that era, it's amazing how much of what they developed back then is still applicable today — you just have to translate it through a lot of historical baggage.

Then, and this is what's interesting, we basically get this revival of AI on actual computers starting around 2012, because 2012 was when this guy realized he could run these types of architectures on a computing unit called the GPU. Before that it was general-purpose computers — fast, but not fast enough. All of a sudden he got his hands on a bunch of NVIDIA GPUs, and — some of you will probably recognize AlexNet, a very seminal paper. Ilya was the machine learning guy, and Alex — hence AlexNet — was more of an engineer's engineer who figured out all the details of how to run it as fast as possible on what was, at the time, a pretty large number of GPUs. And arguably, hand-written kernels were the main way to do it, because he didn't even have a framework at the time, right? If you go and look at the original source code, he literally programmed the kernels by hand.

It's actually funny, because if you go to GitHub, there are a lot of implementations of AlexNet in TensorFlow and in PyTorch, and issues complaining that they're slower than what Alex did back at the time. People take the original implementation, run it side by side with the framework version, and the framework is always slower. Like, what the hell — how can we get back to the original numbers?

Now, that was the time when Google realized we should probably run this on specialized hardware, and Google entered the race by designing not a GPU but a TPU — a tensor processing unit. Again, there's a PhD thesis behind it, but this time Google said: no way, we actually have to give people a framework so they can use this technology, because nobody is smart enough to figure out how to write compute kernels for it by hand.
So they basically came up with TensorFlow, and the rest is history. But in parallel, a team at Google was also working on a next-generation architecture, transformers, and all of those things combined to create a perfect storm: transformers came out of that effort. Finally, around 2015 OpenAI entered the picture, took the transformer architecture, and trained it on a whole bunch of data. That is what I call the ML PhD era: we were experimenting with networks, there were not really a lot of sophisticated optimizations put into it, and that was fine — and that is still what people associate with AI today.

What I want you to pay attention to is the last line on this slide. I call it the engineering era. If you look at the architecture of the networks, it's basically fixed by now, right? It's transformers, plus or minus — maybe it's Mamba, maybe it's something else — but we as engineers don't really have to worry about it too much. Mixture of experts is actually a great thing, because it lets you partition your network into smaller expert branches, and we should really do a lot of it.

Now, what is really super interesting to me is that a whole bunch of next-generation frameworks came out of the effort of looking at these architectures and asking: how can we run inference on them — maybe training too, but I think most of them actually focus on inference — as quickly as possible? These are the things we will be talking about throughout the day today: GGML, ZML, tinygrad, and projects I haven't mentioned. All of these frameworks are completely underappreciated, and yet they are basically the basis of the DeepSeek type of event, because that is where the engineering is happening. If you're a software engineer like myself, if you like tinkering with bits and bytes, this is where you need to spend your time.

Now, what are we going to run it on? I actually have a prediction to make — catch me here in five years and let's see how it pans out. I don't think we'll be running it on CPUs at all. I definitely don't think we'll be running a lot of it on GPUs. I think a transputer-like architecture with RISC-V main cores is an example of the kind of thing these types of frameworks will be optimized for.

So here is, at least in my view, the inference stack of the future. You basically have a piece of silicon that co-evolves with one of these frameworks — whether that's ZML, tinygrad, or GGML, I don't care. Then there's a bunch of things you have to build on top of it, because inferencing a model is not just multiplying matrices; there's a lot that comes before that.
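To make that concrete, here is a deliberately toy sketch — not the API of GGML, ZML, tinygrad, or any real engine; the character vocabulary and the fake random "forward pass" are made up purely for illustration — of the work an inference stack wraps around the matrix math: tokenization, a prefill pass, a KV cache, a sampling loop, and detokenization.

```python
# A toy illustration of an inference loop: the matrix math lives inside
# forward(); everything around it is the surrounding engineering.
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")           # toy character-level vocabulary
token_to_id = {ch: i for i, ch in enumerate(VOCAB)}

def tokenize(text):
    return [token_to_id[ch] for ch in text if ch in token_to_id]

def detokenize(ids):
    return "".join(VOCAB[i] for i in ids)

def forward(token_ids, kv_cache):
    """Stand-in for the model's forward pass. A real engine reuses kv_cache
    so each new token costs one step of attention, not a full re-run of the prompt."""
    kv_cache.extend(token_ids)                        # pretend we appended keys/values
    rng = np.random.default_rng(len(kv_cache))        # deterministic toy logits
    return rng.standard_normal(len(VOCAB))            # one logit per vocabulary entry

def generate(prompt, max_new_tokens=8):
    kv_cache = []                                     # in a real stack this lives in accelerator memory
    logits = forward(tokenize(prompt), kv_cache)      # prefill: process the whole prompt once
    out = []
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))              # greedy sampling; real engines offer top-k, top-p, ...
        out.append(next_id)
        logits = forward([next_id], kv_cache)         # decode: one token at a time
    return detokenize(out)

print(generate("hello world"))
```

The point is simply that the multiplies sit inside `forward`; the tokenizer, the cache management, the batching, and the sampler are the "bunch of things on top" that an inference product has to get right.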
And then you give your customers the APIs — user-facing APIs, fine-tuning APIs, enterprise APIs, management APIs — and you sell it as a product. So if you're interested in developing anything like that, I think this is roughly the scope you should be paying attention to. And that is exactly why the event today is so different from all of the AI conferences I've been to in the past few years. I hope the rest of the speakers will make the point I'm trying to make much more eloquently, but I just wanted you to know all of this. Thank you so much.

[Host] Thank you, Roman.