WEBVTT 00:00.000 --> 00:10.000 Yeah, you just turn it on. 00:10.000 --> 00:13.000 There you go. 00:13.000 --> 00:16.000 All right, everyone. 00:16.000 --> 00:18.000 This is Martin Chang. 00:18.000 --> 00:21.000 He will be speaking about, oh, you need your mic, 00:21.000 --> 00:28.000 building new GGML backends for novel accelerators, 00:28.000 --> 00:32.000 the challenges and opportunities. You're ready to go? 00:32.000 --> 00:34.000 All right, take it away, Martin. 00:34.000 --> 00:36.000 All right, I'm sorry for the crazy setup, 00:36.000 --> 00:39.000 but hopefully everything is clear right now. 00:39.000 --> 00:43.000 So let's talk about building new GGML backends. 00:43.000 --> 00:46.000 A lot of hardware companies are starting to come up with 00:46.000 --> 00:48.000 new and different hardware 00:48.000 --> 00:53.000 that is supposedly vastly, vastly faster than GPUs. 00:53.000 --> 00:58.000 Yeah, but most of them come with their own 00:58.000 --> 01:02.000 programming difficulties and mismatches. 01:02.000 --> 01:06.000 So for example, the hardware may not support a certain feature 01:06.000 --> 01:09.000 that is very common on GPUs or CPUs, 01:09.000 --> 01:12.000 and you just have to somehow deal with that. 01:12.000 --> 01:16.000 Yeah, this talk is about how to integrate 01:16.000 --> 01:20.000 these differences into GGML, the challenges 01:20.000 --> 01:24.000 or the programming mismatches that you have to expect, 01:24.000 --> 01:30.000 and the opportunities, or things that we still have to do. 01:30.000 --> 01:32.000 There's a lot to go through in 20 minutes, 01:32.000 --> 01:36.000 so I'll try my best, but no guarantees that I will be able to cover everything. 01:36.000 --> 01:38.000 But first of all, some disclosure: 01:38.000 --> 01:41.000 I'm currently technically sponsored by Tenstorrent, 01:41.000 --> 01:44.000 and this work would not have happened without their support. 01:44.000 --> 01:47.000 I'm very grateful for their support, 01:47.000 --> 01:50.000 and their engineers are very, very helpful. 01:50.000 --> 01:53.000 And just about two minutes ago, 01:53.000 --> 01:55.000 they told me if anyone is interested in 01:55.000 --> 01:57.000 their hardware, find them outside the door, 01:57.000 --> 02:01.000 and they will be able to help you gain access to their hardware. 02:01.000 --> 02:03.000 They have a very good open source policy. 02:03.000 --> 02:05.000 First, some background on myself. 02:05.000 --> 02:06.000 Who am I? 02:06.000 --> 02:08.000 I do a lot of C++ and HPC. 02:08.000 --> 02:11.000 I'm a FOSS developer, obviously. 02:11.000 --> 02:13.000 And this is my idea of fun. 02:14.000 --> 02:17.000 There's a lot of stuff I do besides AI: 02:17.000 --> 02:19.000 I maintain web frameworks, 02:19.000 --> 02:21.000 I maintain a niche search engine for a 02:21.000 --> 02:23.000 niche internet protocol, 02:23.000 --> 02:27.000 and I also develop some libraries for security. 02:27.000 --> 02:31.000 Beyond that, let's get back to the original topic: 02:31.000 --> 02:32.000 GGML. 02:32.000 --> 02:34.000 Hopefully everyone knows what it is. 02:34.000 --> 02:36.000 It's the backend of llama.cpp, 02:36.000 --> 02:39.000 which is what a lot of people use right now. 02:39.000 --> 02:41.000 It's very efficient for inference, 02:41.000 --> 02:43.000 especially for large language models, 02:43.000 --> 02:45.000 and it has very strong quantization support.
02:45.000 --> 02:47.000 It has a very good community and is very flexible. 02:47.000 --> 02:50.000 And most importantly for us, 02:50.000 --> 02:52.000 it's written in C and C++, 02:52.000 --> 02:55.000 which makes it very, very easy 02:55.000 --> 02:57.000 to integrate new hardware into it 02:57.000 --> 03:01.000 and to use the new capabilities of that hardware. 03:01.000 --> 03:04.000 So let's go through my journey. 03:04.000 --> 03:07.000 I started about 2022, 03:07.000 --> 03:09.000 when there was a new chip called the RK3588. 03:09.000 --> 03:12.000 There's an AI coprocessor on there. 03:12.000 --> 03:15.000 I really, really wanted to run LLMs locally 03:15.000 --> 03:19.000 without a 3090 running in the summer in Taiwan, 03:19.000 --> 03:21.000 which can be 40 degrees Celsius; 03:21.000 --> 03:25.000 I don't need another space heater. 03:25.000 --> 03:27.000 Either way, that sort of worked. 03:27.000 --> 03:31.000 I was able to integrate the coprocessor support 03:31.000 --> 03:33.000 into llama.cpp, 03:33.000 --> 03:35.000 but due to the architecture, 03:35.000 --> 03:37.000 it didn't really work out. 03:37.000 --> 03:40.000 So I decided to pivot, 03:40.000 --> 03:42.000 and at the time, 03:42.000 --> 03:44.000 Tenstorrent started to sell their first dev kits. 03:44.000 --> 03:45.000 I got one thinking, 03:45.000 --> 03:47.000 what's the worst that could happen, 03:47.000 --> 03:49.000 and that's how I started on this project. 03:49.000 --> 03:51.000 So in the last talk, 03:51.000 --> 03:53.000 people talked in depth 03:53.000 --> 03:55.000 about Tenstorrent 03:55.000 --> 03:56.000 and 03:56.000 --> 03:57.000 how their hardware works. 03:57.000 --> 03:58.000 But 03:58.000 --> 03:59.000 for everyone else, 03:59.000 --> 04:01.000 this is the very digest version of it. 04:01.000 --> 04:02.000 It's a many-core processor, 04:02.000 --> 04:04.000 so it has a lot of RISC-V cores 04:04.000 --> 04:06.000 organized into a grid. 04:06.000 --> 04:09.000 These RISC-V cores are connected to different coprocessors, 04:09.000 --> 04:11.000 so you can do computation 04:11.000 --> 04:14.000 and tensor operations really efficiently. 04:14.000 --> 04:16.000 They call these RISC-V cores "baby" cores, 04:16.000 --> 04:18.000 because they are really, really small, 04:18.000 --> 04:19.000 and by small, 04:19.000 --> 04:20.000 I mean, like, 04:20.000 --> 04:24.000 undergrad-textbook level of small. 04:24.000 --> 04:27.000 Yeah, it's a grid of cores 04:27.000 --> 04:29.000 with a network-on-chip, 04:29.000 --> 04:31.000 so cores 04:31.000 --> 04:33.000 can talk to other cores. 04:33.000 --> 04:34.000 There's no global cache, 04:34.000 --> 04:35.000 only private memory, 04:35.000 --> 04:38.000 so each core has access to its own memory, 04:38.000 --> 04:41.000 and you can ask the DRAM to send something to it, 04:41.000 --> 04:43.000 or send it somewhere else. 04:43.000 --> 04:45.000 There's no cache. 04:45.000 --> 04:47.000 This is intended to help with memory management, 04:47.000 --> 04:50.000 which will come into play a little bit later on. 04:50.000 --> 04:53.000 And yeah, 04:53.000 --> 04:57.000 it also scales to a very large array 04:57.000 --> 04:59.000 by just adding more chips 04:59.000 --> 05:01.000 together, communicating with each other 05:01.000 --> 05:03.000 using one of the peripherals.
05:03.000 --> 05:05.000 I stole a slide from 05:05.000 --> 05:07.000 Tenstorrent's slide deck for 05:07.000 --> 05:09.000 their latest processor. 05:09.000 --> 05:10.000 As you can see, 05:10.000 --> 05:12.000 it's a grid of CPUs, 05:12.000 --> 05:15.000 and different CPUs serve different purposes. 05:15.000 --> 05:17.000 There are different core types. 05:17.000 --> 05:18.000 There's DRAM, 05:18.000 --> 05:19.000 there's Ethernet, 05:19.000 --> 05:20.000 and what they call Tensix, 05:20.000 --> 05:22.000 or just the compute cores. 05:22.000 --> 05:24.000 The compute cores just do, 05:24.000 --> 05:25.000 just do compute; 05:25.000 --> 05:29.000 they do the AI math we actually care about. 05:29.000 --> 05:31.000 The DRAM cores are interesting. 05:31.000 --> 05:32.000 The cores can, 05:32.000 --> 05:35.000 the compute core can either just request a DMA 05:35.000 --> 05:39.000 and ask the DRAM to send data from the DRAM into 05:39.000 --> 05:41.000 its local memory 05:41.000 --> 05:43.000 and compute on that data, 05:43.000 --> 05:46.000 or the DRAM cores can actively push data from DRAM 05:46.000 --> 05:47.000 into other cores, 05:47.000 --> 05:48.000 so when they need it, 05:48.000 --> 05:49.000 it's there. 05:49.000 --> 05:51.000 There's also the Ethernet core, 05:51.000 --> 05:53.000 which does all the communication, 05:53.000 --> 05:57.000 and that's how it scales beyond one chip 05:57.000 --> 06:00.000 natively and doesn't need some crazy peripheral 06:00.000 --> 06:03.000 for that kind of work. 06:03.000 --> 06:05.000 Inside the compute core, 06:05.000 --> 06:09.000 there are really five cores working together. 06:09.000 --> 06:12.000 There are two RISC-V cores connected to the NoC routers. 06:12.000 --> 06:17.000 These NoC routers are able to talk to the other cores 06:17.000 --> 06:20.000 and, most importantly, DRAM, 06:20.000 --> 06:22.000 so they can access data, 06:22.000 --> 06:25.000 and only these two cores can do that programmatically. 06:25.000 --> 06:27.000 For the compute, 06:27.000 --> 06:29.000 there are three cores working together 06:29.000 --> 06:32.000 to perform computation. 06:32.000 --> 06:35.000 The compute side is really a tensor engine 06:35.000 --> 06:38.000 and a vector engine working together. 06:38.000 --> 06:41.000 Unlike a traditional multi-core, 06:41.000 --> 06:42.000 multi-processor 06:42.000 --> 06:44.000 architecture, 06:44.000 --> 06:48.000 these five cores are working together 06:48.000 --> 06:50.000 to control the peripherals 06:50.000 --> 06:55.000 instead of being the main system that's doing the computation. 06:55.000 --> 06:57.000 Because of this architecture, 06:57.000 --> 07:01.000 everything has to be explicitly managed, 07:01.000 --> 07:03.000 including the data flow. 07:03.000 --> 07:06.000 So instead of the compute core just saying, 07:06.000 --> 07:09.000 hey, I want the data from this DRAM address, 07:09.000 --> 07:10.000 and being able to fetch it, 07:10.000 --> 07:14.000 what has to happen is one of the cores connected 07:14.000 --> 07:19.000 to the NoC has to request memory from the DRAM. 07:19.000 --> 07:23.000 The DRAM then sends the data to the core's local memory; 07:23.000 --> 07:26.000 the compute cores can then take that data, 07:26.000 --> 07:28.000 do the computation, 07:28.000 --> 07:30.000 and put the result back into local memory.
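To make that explicit data movement concrete, here is a minimal sketch of what a Metalium-style "reader" kernel for one of those NoC-connected baby cores could look like. The API names follow tt-metal's dataflow API as best I recall them and may not match the current SDK exactly, so treat this as illustrative rather than authoritative:

    // Sketch of a Metalium-style reader kernel: a NoC-connected baby RISC-V
    // core pulls tiles from DRAM into its local L1 memory, then hands them to
    // the compute cores through a circular buffer. Names assumed from
    // tt-metal's dataflow_api.h; details may differ in the current SDK.
    void kernel_main() {
        uint32_t dram_addr  = get_arg_val<uint32_t>(0); // DRAM byte address of the data
        uint32_t dram_noc_x = get_arg_val<uint32_t>(1); // which DRAM core to ask on the NoC grid
        uint32_t dram_noc_y = get_arg_val<uint32_t>(2);
        uint32_t num_tiles  = get_arg_val<uint32_t>(3);

        constexpr uint32_t cb_id = 0;                   // circular buffer shared with the compute cores
        const uint32_t tile_bytes = get_tile_size(cb_id);

        for (uint32_t i = 0; i < num_tiles; ++i) {
            cb_reserve_back(cb_id, 1);                  // wait until there is space in L1
            uint32_t l1_addr = get_write_ptr(cb_id);
            uint64_t src = get_noc_addr(dram_noc_x, dram_noc_y, dram_addr + i * tile_bytes);
            noc_async_read(src, l1_addr, tile_bytes);   // explicit DMA: DRAM -> this core's L1
            noc_async_read_barrier();                   // block until the transfer has landed
            cb_push_back(cb_id, 1);                     // signal compute: one tile is ready
        }
    }

A matching writer kernel on the other NoC-connected core would do the reverse with noc_async_write, which is the step described next.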
07:30.000 --> 07:34.000 And then the other DRAM, 07:34.000 --> 07:38.000 sorry, NoC-connected core can then take that data 07:38.000 --> 07:40.000 and write it out to memory. 07:40.000 --> 07:43.000 That's how it's supposed to work. 07:43.000 --> 07:46.000 And the Ethernet cores are able to scale this up, 07:46.000 --> 07:48.000 to communicate with other chips 07:48.000 --> 07:52.000 and scale out the architecture. 07:52.000 --> 07:55.000 This is the current SDK stack. 07:55.000 --> 07:57.000 On the bottom there's Metalium, 07:57.000 --> 08:01.000 which is a C++ OpenCL-like API, 08:01.000 --> 08:04.000 so you can directly program the device. 08:04.000 --> 08:06.000 On top of that they build TTNN, 08:06.000 --> 08:11.000 which is their tensor and operator library. 08:11.000 --> 08:14.000 It's a tensor library, like cuDNN, 08:14.000 --> 08:15.000 which they provide 08:15.000 --> 08:17.000 because they need something called 08:17.000 --> 08:19.000 Tile, which we'll touch on later, 08:19.000 --> 08:20.000 but it's for hardware efficiency. 08:20.000 --> 08:21.000 On top, 08:21.000 --> 08:24.000 they have their MLIR stack 08:24.000 --> 08:26.000 and their Forge frontend. 08:26.000 --> 08:28.000 All this software is open source, 08:28.000 --> 08:30.000 so it's very hackable. 08:30.000 --> 08:32.000 During my development, 08:32.000 --> 08:34.000 I also contributed a lot of features back 08:34.000 --> 08:36.000 into this software stack. 08:36.000 --> 08:39.000 And of course, this talk is about the llama.cpp 08:39.000 --> 08:42.000 and GGML support, so that's that. 08:42.000 --> 08:45.000 And that's the current community software. 08:45.000 --> 08:51.000 Yeah, let's take a look at what these accelerators expect. 08:51.000 --> 08:52.000 So unlike GPUs, 08:52.000 --> 08:55.000 where you can just directly program them to do array 08:55.000 --> 08:57.000 or element-wise operations, 08:57.000 --> 09:02.000 these devices usually give you some sort of higher-level access. 09:02.000 --> 09:05.000 So, for example, in this code, 09:05.000 --> 09:07.000 we just open a device, 09:07.000 --> 09:09.000 create a tensor, 09:09.000 --> 09:10.000 and tilize it. 09:10.000 --> 09:11.000 This is a special operation 09:11.000 --> 09:14.000 just to make it more hardware-efficient. 09:14.000 --> 09:15.000 We can put it in L1, 09:15.000 --> 09:18.000 which is not a cache; caches do not exist on their hardware. 09:18.000 --> 09:20.000 The tensors held here 09:20.000 --> 09:21.000 get really high bandwidth, 09:21.000 --> 09:22.000 because it's directly on chip, 09:22.000 --> 09:25.000 but it's small, only about 100 megabytes compared 09:25.000 --> 09:28.000 to the 12 gigabytes of DRAM. 09:28.000 --> 09:31.000 Okay, that's the coprocessor, 09:31.000 --> 09:34.000 and we can start building the GGML backend. 09:34.000 --> 09:36.000 So this is the general flow. 09:36.000 --> 09:37.000 You have to register your backend 09:37.000 --> 09:39.000 and declare the supported ops, 09:39.000 --> 09:40.000 so GGML knows 09:40.000 --> 09:42.000 what you can support and what you can't. 09:42.000 --> 09:43.000 Then 09:43.000 --> 09:45.000 you accept tensors from GGML, 09:45.000 --> 09:47.000 upload them to the device, 09:47.000 --> 09:49.000 execute all the operations, 09:49.000 --> 09:51.000 and GGML can, 09:51.000 --> 09:52.000 sorry, 09:52.000 --> 09:54.000 after you do all the math, 09:54.000 --> 09:56.000 you send the result back to GGML.
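As a rough illustration of the "execute all the operations" step in that flow, here is a minimal sketch of the graph-compute hook a GGML backend implements. The structure follows GGML's internal backend headers, whose field names can drift between versions; mybackend_add is a hypothetical helper standing in for the real device dispatch:

    // Minimal sketch of a backend's graph-compute entry point. GGML hands the
    // backend an array of graph nodes; the backend runs them one by one.
    // Modeled on ggml-backend-impl.h; mybackend_add is a hypothetical helper.
    #include "ggml-backend-impl.h"

    static enum ggml_status mybackend_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
        for (int i = 0; i < cgraph->n_nodes; i++) {
            struct ggml_tensor * node = cgraph->nodes[i];
            switch (node->op) {
                case GGML_OP_NONE:
                    break; // leaf tensor holding data: nothing to compute
                case GGML_OP_ADD:
                    // dispatch to the device: node->src[0] + node->src[1] -> node
                    mybackend_add(backend->context, node);
                    break;
                default:
                    // should not happen: supports_op() already declined these
                    return GGML_STATUS_FAILED;
            }
        }
        return GGML_STATUS_SUCCESS;
    }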
09:56.000 --> 09:58.000 But here are the problems 09:58.000 --> 10:01.000 that developers have to somehow fix. 10:01.000 --> 10:04.000 GGML expects row-major tensors, 10:04.000 --> 10:06.000 but TTNN has to use tiled tensors 10:06.000 --> 10:08.000 for hardware efficiency. 10:08.000 --> 10:12.000 And TTNN doesn't support device memory mapping, 10:12.000 --> 10:13.000 while GGML, 10:13.000 --> 10:15.000 because of how GPUs work, 10:15.000 --> 10:20.000 expects that the CPU can directly access the device memory. 10:20.000 --> 10:23.000 GGML also does untyped byte allocation, 10:23.000 --> 10:25.000 which doesn't really work on Tenstorrent devices, 10:25.000 --> 10:28.000 because allocations are typed. 10:28.000 --> 10:30.000 So what do I mean by that? 10:30.000 --> 10:35.000 That means when I create a 4 gigabyte buffer, 10:35.000 --> 10:36.000 for example, 10:36.000 --> 10:41.000 I have to tell the device that it is holding 32-bit 10:41.000 --> 10:42.000 floating point 10:42.000 --> 10:44.000 or 16-bit floating point. 10:44.000 --> 10:46.000 This information is missing 10:46.000 --> 10:47.000 from GGML 10:47.000 --> 10:49.000 and has to be, 10:49.000 --> 10:50.000 has to be dealt with somehow. 10:50.000 --> 10:52.000 There are also different quantization types, 10:52.000 --> 10:54.000 which are different from GGML's, 10:54.000 --> 10:56.000 and have to be handled somehow. 10:56.000 --> 10:57.000 Yep. 10:57.000 --> 10:59.000 So first of all, 10:59.000 --> 11:01.000 we have to register the device with, 11:01.000 --> 11:02.000 sorry, 11:02.000 --> 11:04.000 what does registration look like with GGML? 11:04.000 --> 11:06.000 So GGML has something called 11:06.000 --> 11:08.000 the GGML backend registry, 11:08.000 --> 11:10.000 which is just a big list of 11:10.000 --> 11:11.000 different backends. 11:11.000 --> 11:12.000 You can see, 11:12.000 --> 11:14.000 there's the CUDA backend, 11:14.000 --> 11:15.000 Metal, 11:15.000 --> 11:16.000 SYCL, 11:16.000 --> 11:17.000 Vulkan, 11:17.000 --> 11:18.000 etc. 11:18.000 --> 11:20.000 This is a very big list with a lot of 11:20.000 --> 11:21.000 ifdefs. 11:21.000 --> 11:23.000 So what you have to do is just 11:23.000 --> 11:26.000 add your own backend here 11:26.000 --> 11:28.000 and register it with GGML. 11:28.000 --> 11:29.000 Afterwards, 11:29.000 --> 11:31.000 it calls your device registration function. 11:31.000 --> 11:35.000 Here you have to return a backend handle, 11:35.000 --> 11:38.000 a backend registry. 11:39.000 --> 11:42.000 Each registry has a registry interface. 11:42.000 --> 11:45.000 This interface is a giant vtable 11:45.000 --> 11:47.000 that has some, 11:47.000 --> 11:50.000 some required and some optional functions 11:50.000 --> 11:52.000 that you have to implement yourself. 11:52.000 --> 11:55.000 And also a context that tells the 11:55.000 --> 11:56.000 backend 11:56.000 --> 11:59.000 what it's working on; otherwise the backend 11:59.000 --> 12:02.000 doesn't know what kind of data it has 12:02.000 --> 12:04.000 or what device it has. 12:04.000 --> 12:07.000 So let's look at the interface. 12:07.000 --> 12:10.000 The interface, like I said, is a giant vtable. 12:10.000 --> 12:13.000 GGML uses this pattern 12:13.000 --> 12:15.000 internally very, very much. 12:15.000 --> 12:19.000 And some of these interface functions 12:19.000 --> 12:23.000 can return more vtables or more interfaces.
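For illustration, here is a minimal sketch of that vtable pattern, modeled on GGML's ggml_backend_reg_i. The exact set of members varies across GGML versions, and g_mybackend_device is a hypothetical device object, so take the details as assumptions:

    // The registry vtable: a struct of function pointers the backend fills in.
    // Modeled on ggml_backend_reg_i from ggml-backend-impl.h.
    static const char * mybackend_reg_get_name(ggml_backend_reg_t reg) {
        return "MyBackend";
    }

    static size_t mybackend_reg_get_device_count(ggml_backend_reg_t reg) {
        return 1; // e.g. one accelerator card
    }

    static ggml_backend_dev_t mybackend_reg_get_device(ggml_backend_reg_t reg, size_t index) {
        // returns another vtable-carrying handle: the device interface
        return &g_mybackend_device; // hypothetical global device object
    }

    static const struct ggml_backend_reg_i mybackend_reg_i = {
        /* .get_name         = */ mybackend_reg_get_name,
        /* .get_device_count = */ mybackend_reg_get_device_count,
        /* .get_device       = */ mybackend_reg_get_device,
        /* .get_proc_address = */ NULL, // optional entries may be left NULL
    };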
12:23.000 --> 12:24.000 So for example, 12:24.000 --> 12:25.000 this one, 12:25.000 --> 12:28.000 get_device, will return a device interface. 12:28.000 --> 12:30.000 There's a whole lot of interfaces 12:30.000 --> 12:33.000 like this in GGML for devices, 12:33.000 --> 12:34.000 memory pools, 12:34.000 --> 12:35.000 buffer management, 12:35.000 --> 12:37.000 et cetera, et cetera, et cetera. 12:37.000 --> 12:38.000 It's a long list, 12:38.000 --> 12:40.000 and I don't have time to get into it. 12:40.000 --> 12:42.000 So you have to populate all the interfaces, 12:42.000 --> 12:44.000 and eventually, once you've figured out 12:44.000 --> 12:46.000 most of the interfaces, stuff 12:46.000 --> 12:48.000 starts working. 12:48.000 --> 12:50.000 And after you get that working, 12:50.000 --> 12:51.000 you have to implement 12:51.000 --> 12:53.000 tensor and buffer management, 12:53.000 --> 12:55.000 and handle the general problems, 12:55.000 --> 12:56.000 so you won't crash. 12:56.000 --> 12:59.000 And all the operators 12:59.000 --> 13:00.000 have to resolve, from 13:00.000 --> 13:02.000 the data pointer, the actual 13:02.000 --> 13:05.000 device tensor that was set up using the mapped address. 13:05.000 --> 13:07.000 Right. 13:07.000 --> 13:09.000 Okay, now you can allocate 13:09.000 --> 13:10.000 tensors. 13:10.000 --> 13:12.000 You have to get data in and out. 13:12.000 --> 13:13.000 Here's the problem. 13:13.000 --> 13:14.000 Not all Tenstorrent processors 13:14.000 --> 13:17.000 support all GGML types. 13:17.000 --> 13:20.000 Just take the quantized data: 13:20.000 --> 13:22.000 the format is different. 13:22.000 --> 13:24.000 And on the Tenstorrent processors, 13:24.000 --> 13:25.000 as I said, 13:25.000 --> 13:27.000 the RISC-V cores are very small, 13:27.000 --> 13:28.000 very slow. 13:28.000 --> 13:30.000 They're designed to act 13:30.000 --> 13:31.000 more as controllers for the actual compute engines. 13:31.000 --> 13:33.000 So you cannot expect the CPU cores 13:33.000 --> 13:35.000 there to do all the dequantization 13:35.000 --> 13:37.000 like we do on the GPU. 13:37.000 --> 13:39.000 So instead, since we already 13:39.000 --> 13:40.000 know the type at allocation time, 13:40.000 --> 13:41.000 what we have to do is 13:41.000 --> 13:45.000 dequantize into FP32 or FP16, 13:45.000 --> 13:47.000 upload that to the device, 13:47.000 --> 13:49.000 tilize it, 13:49.000 --> 13:51.000 and then cast manually 13:51.000 --> 13:53.000 to whatever type the Tenstorrent 13:53.000 --> 13:56.000 hardware supports on 13:56.000 --> 13:58.000 the device. And 13:59.000 --> 14:00.000 it's the same thing 14:00.000 --> 14:01.000 for get tensor, 14:01.000 --> 14:03.000 which is the reverse operation. 14:03.000 --> 14:05.000 And then what you do, 14:05.000 --> 14:07.000 after you're able to do 14:07.000 --> 14:09.000 tensor upload and download, 14:09.000 --> 14:11.000 the next thing you do is 14:11.000 --> 14:12.000 tell GGML 14:12.000 --> 14:14.000 what operations you actually support. 14:14.000 --> 14:15.000 This is a quick, 14:15.000 --> 14:16.000 this is a toy version. 14:16.000 --> 14:18.000 Just tell GGML 14:18.000 --> 14:20.000 I support 14:20.000 --> 14:23.000 32-bit floating point tensors. 14:23.000 --> 14:26.000 The operator NONE is a special case 14:27.000 --> 14:29.000 that tells GGML 14:29.000 --> 14:31.000 this tensor holds 14:31.000 --> 14:33.000 actual data instead of being 14:33.000 --> 14:34.000 an operation.
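A toy supports_op along these lines might look like the following sketch; it is modeled on GGML's device interface, and the FP32-only and dimension checks are purely illustrative, not the real backend's policy:

    // Toy op-support hook: accept only FP32 tensors, data (NONE) nodes, and a
    // restricted add. GGML routes anything we decline to another backend.
    static bool mybackend_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
        if (op->type != GGML_TYPE_F32) {
            return false;      // toy backend: 32-bit floats only
        }
        switch (op->op) {
            case GGML_OP_NONE:
                return true;   // leaf tensor holding data, not an operation
            case GGML_OP_ADD:
                // only claim support under some dimension restriction, as
                // mentioned next (the exact condition is backend-specific)
                return op->ne[1] == 1;
            default:
                return false;  // GGML falls back to the CPU backend for these
        }
    }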
14:34.000 --> 14:35.000 And in this call, 14:35.000 --> 14:36.000 we also tell GGML, 14:36.000 --> 14:38.000 hey, we support the add operation 14:38.000 --> 14:41.000 if the dimension is one. 14:41.000 --> 14:44.000 GGML will then do 14:44.000 --> 14:46.000 operation scheduling 14:46.000 --> 14:48.000 and figure out who runs what. 14:48.000 --> 14:50.000 And then, 14:50.000 --> 14:53.000 GGML will give you an array 14:53.000 --> 14:54.000 of 14:54.000 --> 14:56.000 nodes to run, and you just run them 14:56.000 --> 14:59.000 iteratively, one by one. 14:59.000 --> 15:01.000 One problem you have to deal with 15:01.000 --> 15:02.000 is views. 15:02.000 --> 15:04.000 Views are operations that 15:04.000 --> 15:05.000 create sub- 15:05.000 --> 15:07.000 ranges of the parent tensor, 15:07.000 --> 15:09.000 but they don't 15:09.000 --> 15:10.000 do any real 15:10.000 --> 15:11.000 computation. 15:11.000 --> 15:12.000 So things like reshape, 15:12.000 --> 15:14.000 view, and transpose. 15:14.000 --> 15:16.000 These views 15:16.000 --> 15:18.000 heavily depend on the fact 15:18.000 --> 15:20.000 that tensor data is stored in 15:20.000 --> 15:21.000 row-major layout, 15:21.000 --> 15:23.000 which, again, is not the case here. 15:23.000 --> 15:25.000 So initially, the solution was to 15:25.000 --> 15:27.000 just eagerly evaluate them. 15:27.000 --> 15:29.000 We see a view, we execute it. 15:29.000 --> 15:30.000 The problem with this approach is 15:30.000 --> 15:32.000 GGML will attempt to write data 15:32.000 --> 15:34.000 back through views into, 15:34.000 --> 15:35.000 into tensor buffers, 15:35.000 --> 15:36.000 and if you 15:36.000 --> 15:37.000 eagerly evaluated them, 15:37.000 --> 15:38.000 those writes will never 15:38.000 --> 15:40.000 make it back to the original tensor. 15:40.000 --> 15:41.000 So the real solution is 15:41.000 --> 15:43.000 you have to do lazy 15:43.000 --> 15:44.000 evaluation. 15:44.000 --> 15:46.000 When GGML is writing, 15:46.000 --> 15:48.000 you have to do the 15:48.000 --> 15:50.000 correct thing and write into the original 15:50.000 --> 15:51.000 buffer instead of 15:52.000 --> 15:54.000 writing into a copy, which doesn't work. 15:54.000 --> 15:56.000 So that's, 15:56.000 --> 15:58.000 that's the gist of getting a backend 15:58.000 --> 16:00.000 up and working with GGML. 16:00.000 --> 16:02.000 In the development process, 16:02.000 --> 16:04.000 we figured out that there are a lot of 16:04.000 --> 16:05.000 missing features in 16:05.000 --> 16:07.000 GGML that would be very helpful for 16:07.000 --> 16:08.000 ASICs, 16:08.000 --> 16:10.000 but not necessarily as helpful 16:10.000 --> 16:13.000 for GPUs. 16:13.000 --> 16:14.000 First thing is 16:14.000 --> 16:15.000 graph rewrites. 16:15.000 --> 16:16.000 We, 16:16.000 --> 16:17.000 GGML 16:17.000 --> 16:18.000 doesn't really have it. 16:18.000 --> 16:19.000 So like I said before, 16:19.000 --> 16:32.000 [inaudible] 16:38.000 --> 16:40.000 [inaudible] 16:40.000 --> 16:44.000 All right, let's wrap it up here. 16:44.000 --> 16:45.000 Thanks for this presentation. 16:45.000 --> 16:46.000 That was great. 16:46.000 --> 16:47.000 Thank you.
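To make the lazy view handling described above concrete, here is a minimal sketch of the idea. Only the view_src and view_offs fields come from GGML itself; the ViewInfo bookkeeping and the device_write helper are hypothetical, and a real tiled backend would need layout-aware offset translation rather than raw byte offsets:

    // Sketch of lazy view evaluation: view nodes are never materialized.
    // Each view remembers its parent, and reads/writes are redirected up the
    // chain so they land in the original device buffer, not a stale copy.
    #include "ggml.h"
    #include <cstddef>
    #include <map>

    struct ViewInfo {
        struct ggml_tensor * parent; // tensor that owns the real device data
        size_t offset;               // where this view starts inside the parent
    };

    static std::map<const struct ggml_tensor *, ViewInfo> g_views;

    // while walking the graph: record views instead of executing them
    static void handle_view(struct ggml_tensor * node) {
        g_views[node] = ViewInfo{ node->view_src, (size_t) node->view_offs };
    }

    // when GGML writes into a tensor: resolve through the view chain first
    static void mybackend_write(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
        const struct ggml_tensor * t = tensor;
        while (g_views.count(t)) {
            offset += g_views[t].offset;
            t = g_views[t].parent;
        }
        device_write(t, data, offset, size); // hypothetical device upload
    }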