If you haven't seen the paper, I highly recommend you go read it, because it basically proves, once and for all, that this is really an engineering story. Everybody was talking about how much Nvidia's stock price dropped and how we're now entering a different era, but I think Jim Zemlin summed it up best. Jim Zemlin, Executive Director of the Linux Foundation — a name we all know in the open source community — basically made this point: this is a golden age for low-level open source AI engineering. Hearing it from him was especially meaningful, because it reminded me of something I experienced myself.

How many of you recognize this email that got sent around in 1991, to one particular mailing list? Does anybody in the audience recognize it at all? I see a few hands — I guess us old-timers still show up at these events. This is Linus Torvalds announcing Linux, at a time when nobody thought that a group of individuals, let alone one person, could compete with Microsoft. Because Microsoft and IBM at that point — not all of you will remember — were both out to own it all. That was the landscape of the computer industry at the time; there was Sun Microsystems on the fringes, but that was it. And that email changed everything — now Linux is everywhere.

If you compare it to — well, it wasn't really an email, it was more like a GitHub repo, the ggerganov repository — it's actually amazingly similar, right? The same principles: I'm doing this thing mostly for fun, it's simple enough — this is what true engineering looks like. And in fact, I will try to tie it all back to DeepSeek, because that's exactly what DeepSeek proved. If you read that paper, you come away with a really clear understanding that they didn't come up with many rocket-science-level tricks on the machine learning side; they really just did some clever, very sophisticated engineering. And that's what we're all here for: to understand how computers can execute these types of systems as well as possible.

So now that we're talking about it, and just as an introduction: what is it that we're actually talking about? Because everybody says machine learning, deep learning, AI, and people get confused. I really love this Little Book of Deep Learning, written by a professor who also happens to be a really good engineer.
So download it, read it — it's about a hundred pages long — and you will understand what AI is much better. I love its definition: AI is an application, basically a business use case, of this technology. So when you say "I'm doing AI," you're really talking about an application — and I don't really want to talk about the application here. Engineering-wise, we're talking about deep learning, which is a branch of machine learning focused on learning representations. And the way we do it, to put it in engineering speak, is that we take this thing called a neural network. It comes with some kind of an architecture, and that part is slightly out of scope for us engineers: that's where the machine learning people come up with the next best architecture — transformers versus Mamba versus whatever comes next — and then we optimize the heck out of it by running that thing on a computer.
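To put a bit of concrete engineering speak behind that, here is a deliberately tiny sketch — toy shapes, random weights, nothing taken from the talk's slides or from any particular framework — of what such a network boils down to. The layer shapes and the nonlinearity are the architecture the machine learning folks choose; executing the matrix multiplies as fast as the hardware allows is the part we engineers optimize.

```python
# A minimal sketch of a two-layer neural network in plain NumPy.
# The shapes (784 -> 256 -> 10) and the ReLU are "architecture" decisions;
# making the two matrix multiplies run fast on real hardware is the
# engineering problem the rest of the talk is about.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, 256))   # first layer weights (toy values)
W2 = rng.standard_normal((256, 10))    # second layer weights (toy values)

def forward(x):
    h = np.maximum(x @ W1, 0.0)        # matrix multiply + ReLU nonlinearity
    return h @ W2                      # matrix multiply -> 10 output scores

x = rng.standard_normal((1, 784))      # one toy input vector
print(forward(x).shape)                # -> (1, 10)
```

Everything that follows — kernels, GPUs, TPUs, the frameworks — is about making those `@` operations, and their billions-of-parameters equivalents, run as fast as possible.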
And we used to run those things on computers through a variety of different frameworks, mostly built by people in academia, and hence not really designed with the kind of production rigor I'd expect from anything that runs on a computer. That kind of framework is very flexible; it lets machine learning people prototype extremely quickly and get to results. It's a very valuable piece of software they've given us, but it's not really something that's optimized — that's not its main use case.

Although some people do try. There's actually an interesting computer company called Etched that is taking transformers and, well, etching them — hence the name — right into the silicon. They're basically producing chips that are only ever capable of running one architecture, and that is transformers. That is one way to optimize it. Of course, if your model is not a transformer, you're out of luck. To me this is an extreme case, because I'm a guy who likes general-purpose computing.

In terms of general-purpose computing, all of these frameworks went through iterations. There's a prehistory, because machine learning started way back in the sixties, and a lot of the architectures are still with us in essentially the same form in which they were introduced back then. Amazingly enough, the framework of that era was MATLAB, because MATLAB is actually super old — not a lot of people know that. It was kind of the PyTorch of its day: your grandfather would go to work at IBM and use MATLAB to basically do AI. I mean, they wouldn't have called it that back then, but that's what happened. Not too many artifacts survive, although if you Google the PhD theses from that era, it's amazing how much of what they developed back then is still applicable today — you just have to translate it through a lot of historical baggage.

Then, and this is what's interesting, we basically get this revival of AI on actual computers starting around 2012, because 2012 was when this guy realized he could run these types of architectures on a computing unit called the GPU. Before that it was general-purpose computers — fast, but not fast enough. All of a sudden he got his hands on a bunch of NVIDIA GPUs, and — some of you will probably recognize AlexNet, a very seminal paper. Ilya was the machine learning guy, and Alex — hence AlexNet — was more of an engineer's engineer who figured out all the details of how to run it as fast as possible on what was, at the time, a pretty large number of GPUs. And arguably, hand-written kernels were the main way to do it, because he didn't even have a framework at the time, right? If you go and look at the original source code, he literally programmed the kernels by hand.

It's actually funny, because if you go to GitHub, there are a lot of implementations of AlexNet in TensorFlow and in PyTorch, and issues complaining that they're slower than what Alex did back at the time. People take the original implementation, run it side by side with the framework version, and the framework is always slower. Like, what the hell — how can we get back to the original numbers?

Now, that was the time when Google realized we should probably run this on specialized hardware, and Google entered the race by designing not a GPU but a TPU — a tensor processing unit. Again, there's a PhD thesis behind it, but this time Google said: no way, we actually have to give people a framework so they can use this technology, because nobody is smart enough to figure out how to write compute kernels for it by hand.
So they basically came up with TensorFlow, and the rest is history. But in parallel, a team at Google was also working on a next-generation architecture, transformers, and all of those things combined to create a perfect storm: transformers came out of that effort. Finally, around 2015 OpenAI entered the picture, took the transformer architecture, and trained it on a whole bunch of data. That is what I call the ML PhD era: we were experimenting with networks, there were not really a lot of sophisticated optimizations put into it, and that was fine — and that is still what people associate with AI today.

What I want you to pay attention to is the last line on this slide. I call it the engineering era. If you look at the architecture of the networks, it's basically fixed by now, right? It's transformers, plus or minus — maybe it's Mamba, maybe it's something else — but we as engineers don't really have to worry about it too much. Mixture of experts is actually a great thing, because it lets you partition your network into smaller expert branches, and we should really do a lot of it.

Now, what is really super interesting to me is that a whole bunch of next-generation frameworks came out of the effort of looking at these architectures and asking: how can we run inference on them — maybe training too, but I think most of them actually focus on inference — as quickly as possible? These are the things we will be talking about throughout the day today: GGML, ZML, tinygrad, and projects I haven't mentioned. All of these frameworks are completely underappreciated, and yet they are basically the basis of the DeepSeek type of event, because that is where the engineering is happening. If you're a software engineer like myself, if you like tinkering with bits and bytes, this is where you need to spend your time.

Now, what are we going to run it on? I actually have a prediction to make — catch me here in five years and let's see how it pans out. I don't think we'll be running it on CPUs at all. I definitely don't think we'll be running a lot of it on GPUs. I think a transputer-like architecture with RISC-V main cores is an example of the kind of thing these types of frameworks will be optimized for.

So here is, at least in my view, the inference stack of the future. You basically have a piece of silicon that co-evolves with one of these frameworks — whether that's ZML, tinygrad, or GGML, I don't care. Then there's a bunch of things you have to build on top of it, because inferencing a model is not just multiplying matrices; there's a lot that comes before that.
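To make that concrete, here is a deliberately toy sketch — not the API of GGML, ZML, tinygrad, or any real engine; the character vocabulary and the fake random "forward pass" are made up purely for illustration — of the work an inference stack wraps around the matrix math: tokenization, a prefill pass, a KV cache, a sampling loop, and detokenization.

```python
# A toy illustration of an inference loop: the matrix math lives inside
# forward(); everything around it is the surrounding engineering.
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")           # toy character-level vocabulary
token_to_id = {ch: i for i, ch in enumerate(VOCAB)}

def tokenize(text):
    return [token_to_id[ch] for ch in text if ch in token_to_id]

def detokenize(ids):
    return "".join(VOCAB[i] for i in ids)

def forward(token_ids, kv_cache):
    """Stand-in for the model's forward pass. A real engine reuses kv_cache
    so each new token costs one step of attention, not a full re-run of the prompt."""
    kv_cache.extend(token_ids)                        # pretend we appended keys/values
    rng = np.random.default_rng(len(kv_cache))        # deterministic toy logits
    return rng.standard_normal(len(VOCAB))            # one logit per vocabulary entry

def generate(prompt, max_new_tokens=8):
    kv_cache = []                                     # in a real stack this lives in accelerator memory
    logits = forward(tokenize(prompt), kv_cache)      # prefill: process the whole prompt once
    out = []
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))              # greedy sampling; real engines offer top-k, top-p, ...
        out.append(next_id)
        logits = forward([next_id], kv_cache)         # decode: one token at a time
    return detokenize(out)

print(generate("hello world"))
```

The point is simply that the multiplies sit inside `forward`; the tokenizer, the cache management, the batching, and the sampler are the "bunch of things on top" that an inference product has to get right.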
And then you give your customers the APIs — user-facing APIs, fine-tuning APIs, enterprise APIs, management APIs — and you sell it as a product. So if you're interested in developing anything like that, I think this is roughly the scope you should be paying attention to. And that is exactly why the event today is so different from all of the AI conferences I've been to in the past few years. I hope the rest of the speakers will make the point I'm trying to make much more eloquently, but I just wanted you to know all of this. Thank you so much.

[Host] Thank you, Roman.