WEBVTT

00:00.000 --> 00:10.000
Yeah, you just turn it on.

00:10.000 --> 00:13.000
There you go.

00:13.000 --> 00:16.000
All right, everyone.

00:16.000 --> 00:18.000
This is Martin Chang.

00:18.000 --> 00:21.000
He will be speaking about, oh, you need your mic,

00:21.000 --> 00:28.000
building new GGML backends for novel accelerators,

00:28.000 --> 00:32.000
the challenges and opportunities. You're ready to go?

00:32.000 --> 00:34.000
All right, take it away, Martin.

00:34.000 --> 00:36.000
All right, I'm sorry for the crazy setup,

00:36.000 --> 00:39.000
but hopefully everything is clear right now.

00:39.000 --> 00:43.000
So let's talk about building new GGML backends.

00:43.000 --> 00:46.000
A lot of the hardware companies are starting to come up with

00:46.000 --> 00:48.000
new and different hardware

00:48.000 --> 00:53.000
that is supposedly vastly faster than GPUs.

00:53.000 --> 00:58.000
Yeah, but most of them have come with their own

00:58.000 --> 01:02.000
programming difficulties and mismatches.

01:02.000 --> 01:06.000
So for example, the hardware may not support a certain feature

01:06.000 --> 01:09.000
that is very common on GPUs or CPUs,

01:09.000 --> 01:12.000
and you just have to somehow deal with that.

01:12.000 --> 01:16.000
Yeah, this talk is about how to integrate

01:16.000 --> 01:20.000
these differences into GGML, the challenges

01:20.000 --> 01:24.000
or the programming mismatches that you can expect to see,

01:24.000 --> 01:30.000
and the opportunities, or things that we still have to do.

01:30.000 --> 01:32.000
There's a lot to go through in 20 minutes,

01:32.000 --> 01:36.000
so I'll try my best, but no guarantees that I will be able to do everything.

01:36.000 --> 01:38.000
But first of all, some disclosure,

01:38.000 --> 01:41.000
I'm currently sponsored by Tenstorrent,

01:41.000 --> 01:44.000
and this work wouldn't have happened without their support.

01:44.000 --> 01:47.000
I'm very grateful for their support,

01:47.000 --> 01:50.000
and their engineers are very, very helpful.

01:50.000 --> 01:53.000
And just about two minutes ago,

01:53.000 --> 01:55.000
they just told me that if anyone is interested in

01:55.000 --> 01:57.000
their hardware, find them outside the door,

01:57.000 --> 02:01.000
and they will be able to help you gain access to their hardware.

02:01.000 --> 02:03.000
Very good open source policy.

02:03.000 --> 02:05.000
First, some background on myself.

02:05.000 --> 02:06.000
Who am I?

02:06.000 --> 02:08.000
I do a lot of C++ and HPC.

02:08.000 --> 02:11.000
I'm a FOSS developer, obviously.

02:11.000 --> 02:13.000
And this is my idea of fun.

02:14.000 --> 02:17.000
There's a lot of stuff I do besides AI:

02:17.000 --> 02:19.000
I maintain web frameworks,

02:19.000 --> 02:21.000
I maintain a niche search engine for a

02:21.000 --> 02:23.000
niche internet protocol,

02:23.000 --> 02:27.000
and I also develop some libraries for security.

02:27.000 --> 02:31.000
Anyway, let's get back to the original topic,

02:31.000 --> 02:32.000
GGML.

02:32.000 --> 02:34.000
Hopefully everyone knows what it is.

02:34.000 --> 02:36.000
It's the backend of llama.cpp,

02:36.000 --> 02:39.000
which is what a lot of people use right now.

02:39.000 --> 02:41.000
It's very efficient for inference,

02:41.000 --> 02:43.000
and especially for large language models,

02:43.000 --> 02:45.000
it has very strong quantization support.

02:45.000 --> 02:47.000
It has a very good community, and it's very flexible.

02:47.000 --> 02:50.000
And most importantly for us,

02:50.000 --> 02:52.000
it's written in C or C++,

02:52.000 --> 02:55.000
which makes it very, very easy

02:55.000 --> 02:57.000
to integrate new hardware into it,

02:57.000 --> 03:01.000
and to expose the new capabilities of the hardware.

03:01.000 --> 03:04.000
Quickly, let's go through my journey.

03:04.000 --> 03:07.000
I started around 2022,

03:07.000 --> 03:09.000
when there was a new chip called the RK3588.

03:09.000 --> 03:12.000
There's an AI coprocessor on there.

03:12.000 --> 03:15.000
I really, really wanted to run LLMs locally,

03:15.000 --> 03:19.000
without a GPU running during the summer in Taiwan,

03:19.000 --> 03:21.000
which can hit 40 degrees Celsius,

03:21.000 --> 03:25.000
and I don't need another space heater.

03:25.000 --> 03:27.000
Either way, that sort of worked.

03:27.000 --> 03:31.000
I was able to integrate the coprocessor support

03:31.000 --> 03:33.000
into llama.cpp,

03:33.000 --> 03:35.000
but due to the architecture,

03:35.000 --> 03:37.000
it didn't really work out.

03:37.000 --> 03:40.000
So I decided to pivot,

03:40.000 --> 03:42.000
and at the time,

03:42.000 --> 03:44.000
Tenstorrent started to sell their first dev kits.

03:44.000 --> 03:45.000
I bought one, thinking,

03:45.000 --> 03:47.000
what's the worst that could happen?

03:47.000 --> 03:49.000
And that's how I started on this project.

03:49.000 --> 03:51.000
So in the last talk,

03:51.000 --> 03:53.000
people talked about,

03:53.000 --> 03:55.000
talked about Tenstorrent,

03:55.000 --> 03:56.000
and in depth,

03:56.000 --> 03:57.000
how their hardware works.

03:57.000 --> 03:58.000
But for everybody else,

03:58.000 --> 03:59.000
here is

03:59.000 --> 04:01.000
the very condensed version of it.

04:01.000 --> 04:02.000
It's a many-core processor,

04:02.000 --> 04:04.000
so it has a lot of RISC-V cores

04:04.000 --> 04:06.000
organized into a grid.

04:06.000 --> 04:09.000
These RISC-V cores are connected to different coprocessors,

04:09.000 --> 04:11.000
so you can do computation

04:11.000 --> 04:14.000
and tensor operations really efficiently.

04:14.000 --> 04:16.000
They call these RISC-V cores "Baby RISC-Vs",

04:16.000 --> 04:18.000
because they are really, really small,

04:18.000 --> 04:19.000
and by small,

04:19.000 --> 04:20.000
I mean, like,

04:20.000 --> 04:24.000
undergrad-textbook level of small.

04:24.000 --> 04:27.000
Yeah, it's a grid of cores,

04:27.000 --> 04:29.000
with a network-on-chip,

04:29.000 --> 04:31.000
so cores

04:31.000 --> 04:33.000
can talk to other cores.

04:33.000 --> 04:34.000
There's no global cache,

04:34.000 --> 04:35.000
or unified memory,

04:35.000 --> 04:38.000
so each core has access to its own memory,

04:38.000 --> 04:41.000
and you can ask the DRAM to send something to it,

04:41.000 --> 04:43.000
or send it somewhere else.

04:43.000 --> 04:45.000
There's no cache.

04:45.000 --> 04:47.000
Everything is explicit to help with management,

04:47.000 --> 04:50.000
which will become a bit of a factor later on.

04:50.000 --> 04:53.000
And yeah,

04:53.000 --> 04:57.000
it also scales to a very large array

04:57.000 --> 04:59.000
by just putting more chips together,

04:59.000 --> 05:01.000
communicating with each other,

05:01.000 --> 05:03.000
using one of the peripherals.

05:03.000 --> 05:05.000
I stole a slide from

05:05.000 --> 05:07.000
Tenstorrent's slide deck for

05:07.000 --> 05:09.000
their next-generation processor.

05:09.000 --> 05:10.000
As you can see,

05:10.000 --> 05:12.000
it's a grid of CPUs,

05:12.000 --> 05:15.000
and different CPUs serve different purposes.

05:15.000 --> 05:17.000
There's different core types.

05:17.000 --> 05:18.000
There's DRAM,

05:18.000 --> 05:19.000
there's Ethernet,

05:19.000 --> 05:20.000
and what they call Tensix,

05:20.000 --> 05:22.000
which is just the compute.

05:22.000 --> 05:24.000
The compute cores just do,

05:24.000 --> 05:25.000
just do compute,

05:25.000 --> 05:29.000
do the AI math we actually care about.

05:29.000 --> 05:31.000
The DRAM cores are interesting.

05:31.000 --> 05:32.000
The compute cores

05:32.000 --> 05:35.000
can either just request a DMA

05:35.000 --> 05:39.000
and ask the DRAM core to send data from DRAM

05:39.000 --> 05:41.000
into their local memory,

05:41.000 --> 05:43.000
and compute on that data,

05:43.000 --> 05:46.000
or the DRAM cores can actively push data from DRAM

05:46.000 --> 05:47.000
into other cores,

05:47.000 --> 05:48.000
so that when they need it,

05:48.000 --> 05:49.000
it's there.

05:49.000 --> 05:51.000
There's also the Ethernet core,

05:51.000 --> 05:53.000
which does all the communication,

05:53.000 --> 05:57.000
and that's how it scales beyond one chip,

05:57.000 --> 06:00.000
natively, without needing some crazy peripheral

06:00.000 --> 06:03.000
for that kind of work.

06:03.000 --> 06:05.000
Inside the compute core,

06:05.000 --> 06:09.000
there are really five cores working together.

06:09.000 --> 06:12.000
There are two RISC-Vs connected to the NoC router.

06:12.000 --> 06:17.000
These NoC routers are able to talk to the other cores,

06:17.000 --> 06:20.000
and most importantly DRAM,

06:20.000 --> 06:22.000
so they can access data,

06:22.000 --> 06:25.000
and only these two cores can do that programmatically.

06:25.000 --> 06:27.000
For the compute,

06:27.000 --> 06:29.000
there are three cores working together

06:29.000 --> 06:32.000
to perform computation.

06:32.000 --> 06:35.000
The compute side is really a tensor engine

06:35.000 --> 06:38.000
and a vector engine working together.

06:38.000 --> 06:41.000
Unlike a traditional multi-core,

06:41.000 --> 06:42.000
multi-processor

06:42.000 --> 06:44.000
architecture,

06:44.000 --> 06:48.000
these five cores are working together

06:48.000 --> 06:50.000
to control the peripherals

06:50.000 --> 06:55.000
instead of being the main system that's doing the computation.

06:55.000 --> 06:57.000
Because of this architecture,

06:57.000 --> 07:01.000
everything has to be explicitly managed,

07:01.000 --> 07:03.000
including the data flow.

07:03.000 --> 07:06.000
So instead of just the compute core saying,

07:06.000 --> 07:09.000
hey, I want the data from this DRAM address,

07:09.000 --> 07:10.000
and being able to fetch it,

07:10.000 --> 07:14.000
what has to happen is that one of the cores connected

07:14.000 --> 07:19.000
to the NoC has to request memory from the DRAM.

07:19.000 --> 07:23.000
The DRAM then sends the data to the core's local memory;

07:23.000 --> 07:26.000
the compute cores can then take that data,

07:26.000 --> 07:28.000
do the computation,

07:28.000 --> 07:30.000
and put the result back into local memory.

07:30.000 --> 07:34.000
And then the other DRAM,

07:34.000 --> 07:38.000
sorry, NoC-connected core, can then take that data

07:38.000 --> 07:40.000
and write it out to memory.

07:40.000 --> 07:43.000
That's how it's supposed to work.
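
For a flavor of what that explicit data movement looks like, here is a reader-kernel sketch in the style of tt-metal's example dataflow kernels (paraphrased from memory; the exact call signatures vary by SDK version, so treat the names as approximate):

```cpp
// Reader kernel: runs on one of the two NoC-connected "baby" RISC-Vs.
// It pulls tiles from DRAM into an L1 circular buffer, where the three
// compute RISC-Vs pick them up.
#include "dataflow_api.h"

void kernel_main() {
    uint32_t src_addr  = get_arg_val<uint32_t>(0);  // tensor base in DRAM
    uint32_t dram_x    = get_arg_val<uint32_t>(1);  // NoC coords of DRAM bank
    uint32_t dram_y    = get_arg_val<uint32_t>(2);
    uint32_t num_tiles = get_arg_val<uint32_t>(3);

    constexpr uint32_t cb_id      = 0;              // L1 circular buffer
    constexpr uint32_t tile_bytes = 32 * 32 * 2;    // one bfloat16 tile

    for (uint32_t i = 0; i < num_tiles; ++i) {
        cb_reserve_back(cb_id, 1);                  // wait for L1 space
        uint64_t src = get_noc_addr(dram_x, dram_y, src_addr + i * tile_bytes);
        noc_async_read(src, get_write_ptr(cb_id), tile_bytes);
        noc_async_read_barrier();                   // wait for the transfer
        cb_push_back(cb_id, 1);                     // hand the tile to compute
    }
}
```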

07:43.000 --> 07:46.000
And the Ethernet cores are able to

07:46.000 --> 07:48.000
communicate with other chips,

07:48.000 --> 07:52.000
scaling out the architecture.

07:52.000 --> 07:55.000
This is the current SDK stack.

07:55.000 --> 07:57.000
On the bottom, there's Metalium,

07:57.000 --> 08:01.000
which is a C++, OpenCL-like API,

08:01.000 --> 08:04.000
so you can directly program the device.

08:04.000 --> 08:06.000
On top of that, they build TTNN,

08:06.000 --> 08:11.000
which is their tensor and operator library.

08:11.000 --> 08:14.000
It's a tensor library, not unlike cuDNN,

08:14.000 --> 08:15.000
which they provide

08:15.000 --> 08:17.000
because they need something called

08:17.000 --> 08:19.000
tiles, which we'll touch on later;

08:19.000 --> 08:20.000
it's for hardware efficiency.

08:20.000 --> 08:21.000
On top,

08:21.000 --> 08:24.000
they have the MLIR stack

08:24.000 --> 08:26.000
and TT-Forge.

08:26.000 --> 08:28.000
All the software here is open source,

08:28.000 --> 08:30.000
so it's very hackable.

08:30.000 --> 08:32.000
During my development process,

08:32.000 --> 08:34.000
I also contributed a lot of features back

08:34.000 --> 08:36.000
into this software stack.

08:36.000 --> 08:39.000
And of course, this talk is about the llama.cpp

08:39.000 --> 08:42.000
and GGML support, so that's that.

08:42.000 --> 08:45.000
And that's the current community software.

08:45.000 --> 08:51.000
Yeah, let's take a look at what these accelerators expect.

08:51.000 --> 08:52.000
So unlike GPUs,

08:52.000 --> 08:55.000
where you can just directly program them to do array

08:55.000 --> 08:57.000
operations or something,

08:57.000 --> 09:02.000
these devices usually give you some sort of higher-level access.

09:02.000 --> 09:05.000
So, for example, in this code,

09:05.000 --> 09:07.000
we just open a device,

09:07.000 --> 09:09.000
create a tensor,

09:09.000 --> 09:10.000
and tilize it.

09:10.000 --> 09:11.000
This is a special operation

09:11.000 --> 09:14.000
just to make it more efficient for the hardware.

09:14.000 --> 09:15.000
We can move it into L1, not a cache,

09:15.000 --> 09:18.000
because caches do not exist on their hardware.

09:18.000 --> 09:20.000
While the tensor sits here,

09:20.000 --> 09:21.000
it's really high bandwidth,

09:21.000 --> 09:22.000
because it's directly on chip,

09:22.000 --> 09:25.000
but it's small: only about 100 megabytes, compared

09:25.000 --> 09:28.000
to their 12 gigabytes of DRAM.
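
To make the tilize step concrete, here is the idea in plain C++ (a conceptual sketch only; the real conversion runs on device, and the tiles are further subdivided internally):

```cpp
#include <cstddef>

// Conceptual tilize: reorder a row-major H x W matrix into contiguous
// 32 x 32 tiles, the layout the tensor engine consumes. Assumes H and W
// are multiples of 32 for brevity.
constexpr int TILE = 32;

void tilize(const float * src, float * dst, int H, int W) {
    const int wt = W / TILE;  // tiles per row
    for (int r = 0; r < H; ++r) {
        for (int c = 0; c < W; ++c) {
            const int tile = (r / TILE) * wt + (c / TILE);   // which tile
            const int off  = (r % TILE) * TILE + (c % TILE); // offset in tile
            dst[tile * TILE * TILE + off] = src[r * W + c];
        }
    }
}
```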

09:28.000 --> 09:31.000
Okay, that's a quick look at the processor,

09:31.000 --> 09:34.000
and now we can start building the GGML backend.

09:34.000 --> 09:36.000
So this is the general flow.

09:36.000 --> 09:37.000
You have to register your backend,

09:37.000 --> 09:39.000
declare what ops are supported,

09:39.000 --> 09:40.000
so GGML knows

09:40.000 --> 09:42.000
what you can support and what you don't.

09:42.000 --> 09:43.000
Then you

09:43.000 --> 09:45.000
have to accept tensors from GGML,

09:45.000 --> 09:47.000
upload them to the device,

09:47.000 --> 09:49.000
execute all the operations,

09:49.000 --> 09:51.000
and GGML can,

09:51.000 --> 09:52.000
sorry,

09:52.000 --> 09:54.000
after you do all the math,

09:54.000 --> 09:56.000
send the results back to GGML.

09:56.000 --> 09:58.000
But here's the problem

09:58.000 --> 10:01.000
that developers have to somehow fix.

10:01.000 --> 10:04.000
GGML expects row-major tensors,

10:04.000 --> 10:06.000
but TTNN has to use tiled tensors

10:06.000 --> 10:08.000
for hardware efficiency.

10:08.000 --> 10:12.000
But TTNN doesn't support device memory mapping,

10:12.000 --> 10:13.000
while GGML,

10:13.000 --> 10:15.000
because of how GPUs work,

10:15.000 --> 10:20.000
expects that the CPU can directly access device memory.

10:20.000 --> 10:23.000
And GGML also does byte-level allocation,

10:23.000 --> 10:25.000
which doesn't really work on Tenstorrent devices,

10:25.000 --> 10:28.000
because allocations are typed.

10:28.000 --> 10:30.000
So what do I mean by that?

10:30.000 --> 10:35.000
That means when I create a 4-gigabyte buffer,

10:35.000 --> 10:36.000
for example,

10:36.000 --> 10:41.000
I have to tell the device that it is holding 32-bit

10:41.000 --> 10:42.000
floating point values,

10:42.000 --> 10:44.000
or 16-bit floating point values.

10:44.000 --> 10:46.000
This information is missing

10:46.000 --> 10:47.000
from GGML,

10:47.000 --> 10:49.000
and it has to be

10:49.000 --> 10:50.000
dealt with somehow.

10:50.000 --> 10:52.000
There's also different quantization types,

10:52.000 --> 10:54.000
which are different from GGML's,

10:54.000 --> 10:56.000
and have to be handled somehow.
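
One way to deal with the missing type information, sketched hypothetically here (the names are invented for illustration, not the actual backend code): have the GGML buffer remember only its byte size, and defer the real typed device allocation until GGML's init_tensor callback reveals each tensor's type and shape.

```cpp
#include <cstddef>
#include <unordered_map>

struct ggml_tensor; // from ggml.h

// Hypothetical per-buffer context: GGML allocates "nbytes of something",
// so the typed device allocation is created lazily, per tensor, once
// init_tensor tells us the dtype and shape. The void* stands in for
// whatever device tensor handle the SDK provides.
struct tt_buffer_ctx {
    size_t nbytes = 0;
    std::unordered_map<const ggml_tensor *, void *> device_tensors;
};
```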

10:56.000 --> 10:57.000
Yep.

10:57.000 --> 10:59.000
So first of all,

10:59.000 --> 11:01.000
we have to register the device with,

11:01.000 --> 11:02.000
sorry,

11:02.000 --> 11:04.000
register the backend with GGML.

11:04.000 --> 11:06.000
So GGML has something called

11:06.000 --> 11:08.000
the GGML backend registry,

11:08.000 --> 11:10.000
which is just a big list of

11:10.000 --> 11:11.000
different backends.

11:11.000 --> 11:12.000
You can see,

11:12.000 --> 11:14.000
there's the CUDA backend,

11:14.000 --> 11:15.000
Metal,

11:15.000 --> 11:16.000
SYCL,

11:16.000 --> 11:17.000
Vulkan,

11:17.000 --> 11:18.000
etc.

11:18.000 --> 11:20.000
This is a very big list with a lot of

11:20.000 --> 11:21.000
ifdefs.

11:21.000 --> 11:23.000
So what you have to do is just

11:23.000 --> 11:26.000
add your own backend here,

11:26.000 --> 11:28.000
and register it with GGML.

11:28.000 --> 11:29.000
Afterwards,

11:29.000 --> 11:31.000
it calls your device registration function.

11:31.000 --> 11:35.000
Here you have to return a backend handle,

11:35.000 --> 11:38.000
a backend registry.
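
In code, that registration boils down to something like the following (paraphrased from GGML's backend registry source; the GGML_USE_TT flag and the tt names are hypothetical, and the exact struct layout differs between GGML versions):

```cpp
// Inside GGML's backend registry: one big #ifdef list of backends.
#ifdef GGML_USE_CUDA
    register_backend(ggml_backend_cuda_reg());
#endif
#ifdef GGML_USE_TT // hypothetical flag for the new backend
    register_backend(ggml_backend_tt_reg());
#endif

// The entry point returns the backend handle (the registry object):
ggml_backend_reg_t ggml_backend_tt_reg(void) {
    static struct ggml_backend_reg reg = {
        /* .api_version = */ GGML_BACKEND_API_VERSION,
        /* .iface       = */ ggml_backend_tt_reg_i, // the vtable, see below
        /* .context     = */ nullptr,
    };
    return &reg;
}
```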

11:39.000 --> 11:42.000
Each registry has a registry interface.

11:42.000 --> 11:45.000
This interface is a giant vtable

11:45.000 --> 11:47.000
that has

11:47.000 --> 11:50.000
some required and some optional functions

11:50.000 --> 11:52.000
that you have to implement yourself.

11:52.000 --> 11:55.000
And also a context that tells the

11:55.000 --> 11:56.000
backend

11:56.000 --> 11:59.000
what it's working on; otherwise, the backend

11:59.000 --> 12:02.000
doesn't know what kind of data it has,

12:02.000 --> 12:04.000
or what device it has.

12:04.000 --> 12:07.000
So let's look at the interface.

12:07.000 --> 12:10.000
The interface, like I said, is a giant vtable.

12:10.000 --> 12:13.000
GGML uses this pattern

12:13.000 --> 12:15.000
internally very, very much.

12:15.000 --> 12:19.000
And some of these interface functions

12:19.000 --> 12:23.000
can return more vtables, more interfaces.

12:23.000 --> 12:24.000
So for example,

12:24.000 --> 12:25.000
this one,

12:25.000 --> 12:28.000
get_device will return a device interface.
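
Paraphrased from GGML's ggml-backend-impl.h (field names may differ slightly between versions), the registry vtable looks roughly like this:

```cpp
// The registry interface: a plain vtable of function pointers. The
// ggml_backend_reg_t and ggml_backend_dev_t handle types come from the
// same header. Note how get_device returns yet another interface, which
// has its own vtable for buffers, capabilities, and so on.
struct ggml_backend_reg_i {
    const char *       (*get_name)(ggml_backend_reg_t reg);
    size_t             (*get_device_count)(ggml_backend_reg_t reg);
    ggml_backend_dev_t (*get_device)(ggml_backend_reg_t reg, size_t index);
    // optional: dynamic symbol lookup for custom extensions
    void *             (*get_proc_address)(ggml_backend_reg_t reg, const char * name);
};
```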

12:28.000 --> 12:30.000
There are a whole lot of interfaces

12:30.000 --> 12:33.000
like this in GGML: for devices,

12:33.000 --> 12:34.000
memory pools,

12:34.000 --> 12:35.000
buffer management,

12:35.000 --> 12:37.000
et cetera, et cetera, et cetera.

12:37.000 --> 12:38.000
It's a long list,

12:38.000 --> 12:40.000
and I don't have time to get into it.

12:40.000 --> 12:42.000
So you have to populate all the interfaces,

12:42.000 --> 12:44.000
and eventually you figure out

12:44.000 --> 12:46.000
most of the interfaces and

12:46.000 --> 12:48.000
things start working.

12:48.000 --> 12:50.000
And after you get that working,

12:50.000 --> 12:51.000
you have to implement

12:51.000 --> 12:53.000
the tensor and buffer management,

12:53.000 --> 12:55.000
and get the general plumbing right

12:55.000 --> 12:56.000
so you won't crash.

12:56.000 --> 12:59.000
And all the operators

12:59.000 --> 13:00.000
have to read data from

13:00.000 --> 13:02.000
the device to know the actual

13:02.000 --> 13:05.000
tensors, set up using the mapped address.

13:05.000 --> 13:07.000
Right.

13:07.000 --> 13:09.000
Okay, you can allocate

13:09.000 --> 13:10.000
tensors.

13:10.000 --> 13:12.000
You have to get data in and out.

13:12.000 --> 13:13.000
Here's the problem.

13:13.000 --> 13:14.000
Not all Tenstorrent processors

13:14.000 --> 13:17.000
support all GGML types.

13:17.000 --> 13:20.000
Just take the quantized data:

13:20.000 --> 13:22.000
the format is different.

13:22.000 --> 13:24.000
And the Tenstorrent processors

13:24.000 --> 13:25.000
are there, but

13:25.000 --> 13:27.000
the RISC-V cores are very small,

13:27.000 --> 13:28.000
very slow.

13:28.000 --> 13:30.000
They're designed to be controllers

13:30.000 --> 13:31.000
more than actual compute.

13:31.000 --> 13:33.000
So you cannot expect the CPU cores

13:33.000 --> 13:35.000
there to do all the dequantization

13:35.000 --> 13:37.000
like we do on the GPU.

13:37.000 --> 13:39.000
So instead, since we already

13:39.000 --> 13:40.000
deferred the actual allocation,

13:40.000 --> 13:41.000
what we had to do is

13:41.000 --> 13:45.000
dequantize into FP32 or FP16,

13:45.000 --> 13:47.000
upload that to the device,

13:47.000 --> 13:49.000
tilize it,

13:49.000 --> 13:51.000
and then cast manually

13:51.000 --> 13:53.000
to whatever type the Tenstorrent

13:53.000 --> 13:56.000
hardware supports on

13:56.000 --> 13:58.000
the device.

13:59.000 --> 14:00.000
And it's the same thing for

14:00.000 --> 14:01.000
get_tensor,

14:01.000 --> 14:03.000
which is the reverse operation.
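
Here is a sketch of that set_tensor path (ggml_get_type_traits and its to_float hook exist in recent GGML versions; the device-side steps are left as comments because the exact TTNN calls are SDK-specific, and the tt function name is hypothetical):

```cpp
#include <vector>
#include "ggml.h"

// Hypothetical upload path: dequantize on the host, because the baby
// RISC-V cores are too slow to do it on device, then hand off the floats.
static void tt_set_tensor(struct ggml_tensor * t, const void * data) {
    const int64_t n = ggml_nelements(t);
    std::vector<float> f32(n);
    if (ggml_is_quantized(t->type)) {
        // host-side dequantization using GGML's own type traits
        ggml_get_type_traits(t->type)->to_float(data, f32.data(), n);
    }
    // 1. upload the row-major f32 buffer into DRAM
    // 2. tilize it on device
    // 3. typecast to a supported device type (e.g. bfloat16)
    // get_tensor is the same three steps in reverse.
}
```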

14:03.000 --> 14:05.000
And then what you do is

14:05.000 --> 14:07.000
after you're able to do

14:07.000 --> 14:09.000
tensor upload and download,

14:09.000 --> 14:11.000
the next thing you do is

14:11.000 --> 14:12.000
tell GGML

14:12.000 --> 14:14.000
what operations you actually support.

14:14.000 --> 14:15.000
This is a quick,

14:15.000 --> 14:16.000
this is a toy version.

14:16.000 --> 14:18.000
Just tell GGML

14:18.000 --> 14:20.000
I support

14:20.000 --> 14:23.000
32-bit floating point tensors.

14:23.000 --> 14:26.000
The operator "none" is a special case

14:27.000 --> 14:29.000
that tells GGML,

14:29.000 --> 14:31.000
this tensor holds

14:31.000 --> 14:33.000
actual data instead of

14:33.000 --> 14:34.000
being an operation.

14:34.000 --> 14:35.000
And in this call,

14:35.000 --> 14:36.000
we also tell GGML,

14:36.000 --> 14:38.000
hey, we support the add operation

14:38.000 --> 14:41.000
if the number of dimensions is one.
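
In code, the toy version looks roughly like this (the signature is paraphrased; the real device interface passes a few more arguments):

```cpp
#include "ggml.h"

// Toy supports_op: data tensors (GGML_OP_NONE) are always fine, and the
// only computation we claim to support is a one-dimensional f32 add.
static bool tt_supports_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_NONE:
            return true; // holds data, not an operation
        case GGML_OP_ADD:
            return op->type == GGML_TYPE_F32 && ggml_n_dims(op) == 1;
        default:
            return false;
    }
}
```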

14:41.000 --> 14:44.000
GGML will then do

14:44.000 --> 14:46.000
operation scheduling

14:46.000 --> 14:48.000
and figure out who runs what.

14:48.000 --> 14:50.000
And then, in response,

14:50.000 --> 14:53.000
GGML will give you an array

14:53.000 --> 14:54.000
of nodes,

14:54.000 --> 14:56.000
and you just run them

14:56.000 --> 14:59.000
iteratively, one by one.
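
The matching compute entry point is simple in shape; a paraphrased sketch, with the actual dispatch bodies elided:

```cpp
#include "ggml.h"

// GGML hands the backend a (sub)graph of nodes it previously said it
// supports; the backend walks them in order and dispatches each one.
static enum ggml_status tt_graph_compute(struct ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        switch (node->op) {
            case GGML_OP_ADD:
                // launch the device add on node->src[0] and node->src[1]
                break;
            default:
                return GGML_STATUS_FAILED; // filtered out by supports_op
        }
    }
    return GGML_STATUS_SUCCESS;
}
```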

14:59.000 --> 15:01.000
One problem you have to deal with

15:01.000 --> 15:02.000
is views.

15:02.000 --> 15:04.000
Views are operations that

15:04.000 --> 15:05.000
create sub-views,

15:05.000 --> 15:07.000
or reads of the data tensor,

15:07.000 --> 15:09.000
but don't really

15:09.000 --> 15:10.000
do any

15:10.000 --> 15:11.000
computation.

15:11.000 --> 15:12.000
So, things like reshape,

15:12.000 --> 15:14.000
view, and transpose.

15:14.000 --> 15:16.000
And views

15:16.000 --> 15:18.000
heavily depend on the fact

15:18.000 --> 15:20.000
that tensor data is stored in

15:20.000 --> 15:21.000
row-major layout,

15:21.000 --> 15:23.000
which, again, is not the case here.
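
To see why views bake in the row-major assumption: in GGML, a view is just the parent's data pointer plus a byte offset and per-dimension byte strides (the nb array), so every element address is plain stride arithmetic, and on a tiled layout no such address exists:

```cpp
#include "ggml.h"

// Address of element (i0, i1, i2, i3) of any GGML tensor or view:
// pure stride arithmetic over row-major storage. A transpose or reshape
// only changes ne/nb, never the bytes themselves.
static void * ggml_elem_addr(const struct ggml_tensor * t,
                             int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return (char *) t->data
         + i0 * t->nb[0] + i1 * t->nb[1]
         + i2 * t->nb[2] + i3 * t->nb[3];
}
```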

15:23.000 --> 15:25.000
So initially, the solution was to

15:25.000 --> 15:27.000
just eagerly evaluate them.

15:27.000 --> 15:29.000
We see a view, we execute it.

15:29.000 --> 15:30.000
The problem with this approach is

15:30.000 --> 15:32.000
GGML would attempt to write data

15:32.000 --> 15:34.000
back into

15:34.000 --> 15:35.000
the viewed tensor's buffers,

15:35.000 --> 15:36.000
which, if you eagerly

15:36.000 --> 15:37.000
evaluated the view,

15:37.000 --> 15:38.000
means the write doesn't

15:38.000 --> 15:40.000
make it back to the original tensor.

15:40.000 --> 15:41.000
So the real solution is

15:41.000 --> 15:43.000
to do lazy

15:43.000 --> 15:44.000
evaluation.

15:44.000 --> 15:46.000
When GGML writes,

15:46.000 --> 15:48.000
you have to do the

15:48.000 --> 15:50.000
correct thing and write into the original

15:50.000 --> 15:51.000
buffers, instead of

15:52.000 --> 15:54.000
writing into a copy, which doesn't work.

15:54.000 --> 15:56.000
So that's,

15:56.000 --> 15:58.000
that's the gist of getting a backend

15:58.000 --> 16:00.000
up and working with GGML.

16:00.000 --> 16:02.000
In the development process,

16:02.000 --> 16:04.000
we figured out that there are a lot of

16:04.000 --> 16:05.000
missing features in

16:05.000 --> 16:07.000
GGML that would be very helpful for

16:07.000 --> 16:08.000
ASICs,

16:08.000 --> 16:10.000
but not necessarily as helpful

16:10.000 --> 16:13.000
for GPUs.

16:13.000 --> 16:14.000
First thing is,

16:14.000 --> 16:15.000
graph rewrites.

16:15.000 --> 16:16.000
GGML

16:16.000 --> 16:17.000
just

16:17.000 --> 16:18.000
doesn't have it.

16:18.000 --> 16:19.000
So like I said before,

16:19.000 --> 16:20.000
these devices just don't

16:20.000 --> 16:32.000
[inaudible]

16:38.000 --> 16:39.000
[inaudible]

16:39.000 --> 16:40.000
[inaudible]

16:40.000 --> 16:44.000
All right, that's as far as we can go.

16:44.000 --> 16:45.000
Thanks for this presentation.

16:45.000 --> 16:46.000
That was good,

16:46.000 --> 16:47.000
thank you.