WEBVTT 00:00.000 --> 00:13.400 Welcome to my talk. Thank you for attending it. I'm Vladislav. I'm a rendering 00:13.400 --> 00:22.880 team lead, and in my free time I'm working as a developer on 0 A.D. 0 A.D. is a free and open source 00:22.880 --> 00:28.440 cross-platform game. If you like games like Age of Empires, Empire Earth or StarCraft, then you 00:28.440 --> 00:35.920 might find 0 A.D. interesting as well. It is a cross-platform game, so it works on Linux, 00:35.920 --> 00:42.600 macOS and Windows. It works on different architectures. We use our own custom 00:42.600 --> 00:48.080 engine. It's called Pyrogenesis. It's written in C++, and we have some technologies 00:48.080 --> 00:57.280 under the hood. About rendering: we have an abstract rendering interface. It helps us to 00:57.280 --> 01:02.720 have multiple backends: OpenGL, Vulkan, and dummy. The last one is for tests and CPU performance 01:02.720 --> 01:09.920 checks. Also, we use MoltenVK to be able to run Vulkan on macOS. It converts Vulkan API 01:09.920 --> 01:18.200 calls to Metal API calls. So that's how it looks. We were trying to design our rendering 01:18.200 --> 01:24.840 interface as close as possible to the Vulkan API, but we still have limitations because we support 01:24.840 --> 01:31.920 OpenGL. For example, we still have plain UploadTexture and UploadBuffer, so no ring buffers or 01:31.920 --> 01:37.440 an upload buffer with a uniform buffer on top. Also, we bind resources directly 01:37.440 --> 01:43.320 in the device command context, so you might see SetTexture and SetUniform; we bind by 01:43.320 --> 01:52.320 slot. So we have several limitations. So let's talk about Vulkan. We added Vulkan 01:52.320 --> 02:00.280 in 2023, and the first version of the game with Vulkan was released in 2025. That was 02:00.280 --> 02:07.160 Alpha 27. By the way, we are currently preparing the next version, 02:07.160 --> 02:17.280 Alpha 28.
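The backend abstraction described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Pyrogenesis API: class and method names (`IDeviceCommandContext`, `SetTexture`, `SetUniform`, `Draw`) are stand-ins inspired by the talk's description of slot-based binding and a dummy backend used for tests and CPU-only performance checks.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of a backend-agnostic device command context:
// one interface, several backends (OpenGL, Vulkan, dummy).
class IDeviceCommandContext {
public:
    virtual ~IDeviceCommandContext() = default;
    // Resources are bound directly by slot, as described in the talk.
    virtual void SetTexture(uint32_t slot, const std::string& name) = 0;
    virtual void SetUniform(uint32_t slot, float value) = 0;
    virtual void Draw(uint32_t vertexCount) = 0;
};

// The dummy backend records calls instead of touching a GPU, which is
// what makes it usable for tests and CPU performance checks.
class DummyDeviceCommandContext final : public IDeviceCommandContext {
public:
    void SetTexture(uint32_t, const std::string&) override { ++m_Calls; }
    void SetUniform(uint32_t, float) override { ++m_Calls; }
    void Draw(uint32_t) override { ++m_Draws; ++m_Calls; }
    uint32_t Calls() const { return m_Calls; }
    uint32_t Draws() const { return m_Draws; }
private:
    uint32_t m_Calls = 0;
    uint32_t m_Draws = 0;
};
```

Keeping the interface close to Vulkan while still expressible in OpenGL 2.1 is exactly the design tension the talk describes: no ring buffers, plain upload calls, slot binding.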
We are hoping to release it soon. So that's how the timeline looks. You might 02:17.280 --> 02:25.000 see that in the submit timeline we have reordered our uploads, because uploads are not 02:25.000 --> 02:30.320 allowed inside render passes with our render-pass approach, because we don't support dynamic rendering for all 02:30.320 --> 02:36.320 platforms, but we need to support Vulkan for all platforms. So we split them into two 02:36.320 --> 02:43.360 VkCommandBuffers: prepend and main. Why do we have that? Because we still have some legacy 02:43.360 --> 02:50.520 components. For example, the user interface, where a component might load and use a texture 02:50.520 --> 02:57.320 right before the draw call. So it might trigger a texture upload right before a draw call 02:57.320 --> 03:02.480 inside a render pass. So we have to upload them separately, before the main VkCommand 03:02.480 --> 03:08.760 Buffer. And because of that, we don't allow uploading the same texture or the same buffer 03:08.760 --> 03:14.880 multiple times during a single render pass. So let's talk about obstacles we had. 03:14.880 --> 03:23.320 The first one: Vulkan support detection. We tried to do that with SDL. Nothing really hard 03:23.320 --> 03:30.960 here, just trying to load the library and get instance support. But it was crashing for some 03:30.960 --> 03:38.520 drivers, because if you try to load Vulkan and then OpenGL, some drivers crash. 03:38.520 --> 03:43.600 So we just disabled that. Unfortunately, we don't have any suitable solution for that. 03:43.600 --> 03:50.200 So now users can switch backends in the options. Device detection. We have multiple physical 03:50.200 --> 03:58.080 devices, and we have to choose one. Because we have OpenGL, we had a problem that OpenGL 03:58.080 --> 04:03.440 sometimes uses the integrated GPU instead of the discrete one. So we thought that it would be 04:03.520 --> 04:12.440 great to sort all devices. So first we sort by type:
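The upload reordering just described can be modeled in a few lines. This is a conceptual sketch, not real Vulkan or Pyrogenesis code: the point is only that commands recorded during a frame get split into a "prepend" list (uploads, which are hoisted out of the render pass) and a "main" list (draws), submitted in that order.

```cpp
#include <string>
#include <vector>

// Conceptual model of the two-command-buffer split from the talk.
struct FrameRecorder {
    std::vector<std::string> prependCmds; // upload commands
    std::vector<std::string> mainCmds;    // draw commands

    // Even if a legacy GUI component requests an upload right before a
    // draw call, the upload is hoisted into the prepend buffer.
    void UploadTexture(const std::string& name) { prependCmds.push_back("upload:" + name); }
    void Draw(const std::string& name) { mainCmds.push_back("draw:" + name); }

    // Final submission order: all uploads first, then all draws. This
    // is also why the same resource must not be uploaded twice within
    // one render pass: both uploads would land before any draw.
    std::vector<std::string> Submit() const {
        std::vector<std::string> order = prependCmds;
        order.insert(order.end(), mainCmds.begin(), mainCmds.end());
        return order;
    }
};
```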
discrete GPU, integrated GPU, 04:12.440 --> 04:18.560 CPU and virtual. Then by device-local memory. And then by initial order. So the first 04:18.560 --> 04:25.320 error was using device-local memory, because some drivers, like wrappers, report more memory 04:25.320 --> 04:36.280 than native cards. The second problem was this one. So that's the initial order reported 04:36.280 --> 04:42.160 by the driver. We were selecting the second one, the discrete one. But it wasn't working, because 04:42.160 --> 04:49.160 it failed during initialization. So we had to remove sorting by type as well. So we 04:49.240 --> 04:57.400 always use index 0, the first one from the list as reported by the driver. So the problem 04:57.400 --> 05:09.880 is that with vkEnumeratePhysicalDevices, devices might be listed even if they're not going to work. So another problem: 05:09.880 --> 05:16.040 texture compression. Some platforms don't support all BC formats, and because of 05:16.120 --> 05:21.880 that, the textureCompressionBC feature might be false. So because of that, we were rejecting devices. 05:21.880 --> 05:28.280 That's not correct, because actually those platforms did support all the BC formats we need. 05:28.280 --> 05:37.560 So instead of checking this property, we just check all needed BC formats individually. Out of memory. 05:38.440 --> 05:43.240 In our game, we have multiple options to control quality. We have shadow quality, 05:43.240 --> 05:48.040 texture quality; we can enable water refraction, water reflection and so on. And it 05:48.040 --> 05:55.720 consumes memory. And we need to handle that situation. So when might that occur? Usually when 05:55.720 --> 06:01.560 we allocate resources, with vkCreate-something. It's easy to fall back in that case, 06:01.560 --> 06:08.840 because we can return the error to the caller. But there are rarer situations, when it 06:08.840 --> 06:15.640 can fail in vkAcquireNextImageKHR, vkQueuePresentKHR, vkQueueSubmit and vkWaitForFences.
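The device ordering the talk says was tried first (and later abandoned in favour of simply taking device 0 as reported by the driver) can be sketched like this. The struct and comparator are illustrative, not the engine's actual code; in real Vulkan the inputs would come from vkEnumeratePhysicalDevices and vkGetPhysicalDeviceProperties.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative model of the abandoned sorting heuristic:
// by type (discrete first), then by device-local memory (bigger first),
// then keep the driver's initial order (via stable_sort).
enum class DeviceType { Discrete = 0, Integrated = 1, Cpu = 2, Virtual = 3 };

struct PhysicalDeviceInfo {
    DeviceType type;
    uint64_t deviceLocalMemory; // bytes; wrappers may over-report this
    size_t initialIndex;        // order reported by the driver
};

void SortDevices(std::vector<PhysicalDeviceInfo>& devices) {
    std::stable_sort(devices.begin(), devices.end(),
        [](const PhysicalDeviceInfo& a, const PhysicalDeviceInfo& b) {
            if (a.type != b.type)
                return a.type < b.type;                   // discrete first
            return a.deviceLocalMemory > b.deviceLocalMemory; // more memory first
        });
}
```

Both sort keys turned out to be unreliable in practice, per the talk: wrappers over-report memory, and a discrete GPU listed by the driver may still fail to initialize, which is why the engine ended up just taking the first listed device.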
In those situations, 06:15.640 --> 06:20.840 we unfortunately can't handle it. So currently, in those situations, we just crash. 06:21.880 --> 06:27.960 Because it happens at a low level, but the information about quality settings is at a high level, 06:27.960 --> 06:34.280 and the low level can't access the high level. So currently we just crash. No proper solution for that. 06:39.000 --> 06:45.800 We use the Vulkan Memory Allocator. It's another helper for us. It really 06:45.800 --> 06:53.160 simplifies some code related to memory allocations. But at the same time, if we free memory, 06:53.160 --> 06:59.800 it doesn't go back to the GPU immediately, because the allocator uses intermediate buffers: 06:59.800 --> 07:07.160 bigger buffers to allocate smaller buffers from. We tried to use VK_EXT_memory_budget, 07:07.240 --> 07:16.360 but it doesn't really help in those situations. The only solution we have is to use some ratio, 07:16.360 --> 07:23.960 like 80% of total available memory, and the rest should be on the operating system side. 07:26.360 --> 07:32.680 GPU skinning artifacts. Skinning is the process of applying skeletal animation to a model, 07:32.680 --> 07:39.160 to a mesh. And this is how the frames should look: it's a regular frame from our game. 07:39.720 --> 07:45.800 And that's how it looks with the bug. And another one. So it might look like a driver bug, 07:45.800 --> 07:51.240 or like some synchronization problem. But actually, it was pretty simple. It was just 07:51.240 --> 07:58.600 incorrectly selected data. We didn't invalidate a flag when the user was switching from 07:58.680 --> 08:05.400 CPU skinning to GPU skinning. But it was looking like a serious bug, which we weren't able to 08:05.400 --> 08:11.800 reproduce. We are collecting GPU statistics to be able to optimize our game and to know the 08:11.800 --> 08:19.720 corner cases.
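The budget heuristic mentioned above (treat only a fixed share of reported memory as usable, since neither the allocator nor VK_EXT_memory_budget reliably prevents out-of-memory) reduces to simple arithmetic. The 80% ratio comes from the talk; the function names are illustrative.

```cpp
#include <cstdint>

// Sketch of the ratio-based memory budget from the talk: only
// ratioPercent of the reported heap size is treated as usable.
constexpr uint64_t UsableBudget(uint64_t totalHeapBytes, uint32_t ratioPercent = 80) {
    return totalHeapBytes / 100 * ratioPercent;
}

// An allocation request is rejected early if it would push the running
// total past the budget, so the error can be returned to the caller
// instead of surfacing later as a failure in vkQueueSubmit,
// vkQueuePresentKHR or vkWaitForFences, where it cannot be handled.
bool CanAllocate(uint64_t currentUsage, uint64_t requestBytes, uint64_t totalHeapBytes) {
    return currentUsage + requestBytes <= UsableBudget(totalHeapBytes);
}
```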
And so, for example, for OpenGL, we have the following reported names for the same 08:20.600 --> 08:30.120 game version, the same GPU, the same platform. With Vulkan it's much better: 08:31.960 --> 08:41.320 the same game version, the same GPU, but across all supported platforms. It's better, it's simpler to 08:41.320 --> 08:46.440 parse. It still has some problems, like it includes additional information and we need to 08:46.440 --> 08:56.440 remove trademark markers and so forth, but it's much simpler. There are some helpers that might help 08:56.440 --> 09:01.960 you to distinguish different GPUs. For example, deviceID, but it's not enough, because it might 09:01.960 --> 09:08.280 be equal for different GPUs. deviceUUID is possible if it's present, because in some cases it might 09:08.280 --> 09:13.640 be just zeroed. So it's not really helpful for that case. So the final solution: we just 09:13.720 --> 09:21.400 parse the device name, but it's much simpler than for OpenGL. Now, debugging. That's the most 09:21.400 --> 09:29.640 interesting part for us. We have a lot of players, but not many of them have programming skills 09:29.640 --> 09:35.560 or can build the game or debug the game. So we had to introduce some helpers, configuration options, 09:35.560 --> 09:42.680 to be able to retrieve some information useful for us to debug the game. For example, we have 09:42.760 --> 09:48.440 RenderDoc helpers, so we can teach someone how to make a capture. 09:48.440 --> 09:58.280 We can enable debug labels. We can enable debug scope labels. So each resource in our 09:58.280 --> 10:05.640 engine is marked with a constant name, so we can distinguish them. Also, we can enable messages, 10:05.640 --> 10:12.120 if the driver has something to report to us. And we have a debug context; it enables different features, 10:12.120 --> 10:18.200 including validation layers, if they are present on the platform, on the user's platform.
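The device-name cleanup needed for aggregating GPU statistics can be sketched as a small normalization pass. The marker list below is an example, not the engine's actual list, and the function name is hypothetical; the point is just that Vulkan's deviceName strings are uniform enough that stripping trademark markers is most of the work, unlike the OpenGL renderer strings.

```cpp
#include <string>

// Illustrative cleanup of reported device names before aggregation.
std::string NormalizeDeviceName(std::string name) {
    // Example trademark markers; real-world strings may contain more.
    const std::string markers[] = {"(TM)", "(R)", "\u2122", "\u00AE"};
    for (const std::string& marker : markers) {
        for (size_t pos; (pos = name.find(marker)) != std::string::npos;)
            name.erase(pos, marker.size());
    }
    // Collapse double spaces left behind by removed markers.
    for (size_t pos; (pos = name.find("  ")) != std::string::npos;)
        name.erase(pos, 1);
    return name;
}
```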
10:19.320 --> 10:24.680 Also, currently, by default, we are using descriptor indexing where available. 10:25.640 --> 10:31.720 But there is a corner case: when you enable validation without GPU-assisted validation, 10:32.840 --> 10:38.760 the validation layers might complain, because they don't really know when a resource will be accessed. 10:38.760 --> 10:44.440 So, for example, in a single descriptor set you might have a framebuffer target and some 10:44.440 --> 10:52.760 sampled texture, and there will be a validation error for it. But in the real case, it's not a problem. 10:53.320 --> 11:02.680 So we need to be able to disable that. Also, we would like to have in the future an option to 11:02.760 --> 11:07.640 choose the GPU, not only the backend but also the GPU, from a list. But we don't have that in the UI yet, 11:07.640 --> 11:14.680 so we use a configuration option. And the last one helps us debug 11:14.680 --> 11:23.080 different synchronization problems and driver issues: we are able to insert a debug barrier. It's a barrier 11:23.080 --> 11:28.680 from all stages to all stages, from all access masks to all access masks. So it's really a hard execution 11:28.680 --> 11:36.680 and memory barrier. Also, we can wait for different stages, for example around present, before and 11:36.680 --> 11:47.720 after. Back to artifacts. In the beginning of 2025, I found visual artifacts in the main menu 11:47.720 --> 11:53.960 on Raspberry Pi 4, with Mesa 24. So on the left, with the bug; on the right, without the bug. 11:54.840 --> 12:01.960 I had an investigation: enabling the mentioned debug barriers, for all stages and masks, 12:01.960 --> 12:08.360 didn't help. Trying vkDeviceWaitIdle didn't help. The only thing 12:09.880 --> 12:17.320 that was working was to split the vkQueueSubmit in two, with a semaphore relation between them. 12:17.960 --> 12:26.280 So the code was looking like this.
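The slide's actual code is not reproduced in the transcript, but the workaround can be modeled like this. The types below are deliberately mock stand-ins, not the real Vulkan API: instead of one vkQueueSubmit carrying both command buffers, the submission becomes two batches with a semaphore dependency between them.

```cpp
#include <string>
#include <vector>

// Mock reconstruction of the Raspberry Pi 4 workaround from the talk.
// Strings stand in for VkCommandBuffer and VkSemaphore handles.
struct SubmitBatch {
    std::vector<std::string> commandBuffers;
    std::vector<std::string> waitSemaphores;
    std::vector<std::string> signalSemaphores;
};

// Before: one submit with both command buffers in a single batch.
std::vector<SubmitBatch> SubmitCombined() {
    return {{{"prepend", "main"}, {}, {}}};
}

// After: two submits, ordered through the "uploadsDone" semaphore.
std::vector<SubmitBatch> SubmitSplit() {
    return {
        {{"prepend"}, {}, {"uploadsDone"}},
        {{"main"}, {"uploadsDone"}, {}},
    };
}
```

Semantically the two forms should be equivalent, which is why the difference in behavior pointed at driver-internal state rather than application synchronization.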
And actually, it was a driver bug, and thanks to Sam and 12:26.280 --> 12:33.720 his colleagues. By the way, he has a talk today about Raspberry Pi. It was, in my opinion, 12:35.880 --> 12:45.160 the fastest fix I have seen from a driver vendor or driver author. So very much thanks to them. 12:48.040 --> 12:53.000 The main conclusion is that the more unusually an application uses the Vulkan API, 12:54.280 --> 13:01.240 the more likely a driver error will occur. So if you are doing simple quad rendering, 13:01.240 --> 13:08.920 then the chance that you will have an error, an artifact or something like that, is very 13:08.920 --> 13:15.720 low. But if you are doing something specific, for example, like we do: as I mentioned on the timeline, 13:16.040 --> 13:21.640 we split our device command context in two, so we have a prepend VkCommandBuffer and a 13:21.640 --> 13:28.200 main VkCommandBuffer, and we have synchronization between them. That already isn't such usual behavior, 13:28.200 --> 13:35.240 for some platforms at least, because the usual recommendation is to avoid multiple vkQueue 13:35.240 --> 13:43.240 Submits or multiple command buffers: use only one, but that's not possible for all cases. And another thing that 13:43.240 --> 13:50.840 really helped in that situation is that I was able to reproduce the bug myself on my Raspberry Pi 4. 13:50.840 --> 13:58.920 And after a few evenings of debugging, I finally figured out that it would be a driver bug, and I made an 13:58.920 --> 14:07.800 issue for Mesa. And it was fixed really fast. I am really glad about that. And the last thing: 14:08.120 --> 14:14.440 GPU performance measurements. Because we have players with different hardware, from low-end to high-end, 14:15.000 --> 14:22.760 we need to be able to measure how expensive our frame is and what we need to optimize. 14:24.760 --> 14:30.280 Usually we prefer using tools from vendors.
So if we are debugging locally, we try to use 14:31.240 --> 14:40.120 tools that are provided by vendors, if available. Not all platforms have such tools. 14:41.320 --> 14:50.840 Otherwise, we fall back to timestamp queries. They have limitations in terms of how they measure data, 14:50.840 --> 14:58.760 because when you insert a timestamp query, you pass the stage at which you want to capture. So for example, 14:58.760 --> 15:07.080 if you have two overlapping jobs, you can't really distinguish them; you measure only a single range. 15:09.240 --> 15:15.880 And actually, timestamp measurements might be affected by other factors. For example, they will be affected by other processes, 15:15.880 --> 15:24.440 because they are using your GPU as well. They can be affected by temperature; that's not so 15:24.520 --> 15:32.200 relevant for discrete GPUs, it's mostly for mobile GPUs or energy-saving ones. 15:33.880 --> 15:46.760 And sometimes you might get measurements that are slower than the previous ones, 15:46.840 --> 15:54.360 even though you are sure that you are using better code. That happens because the GPU driver 15:55.560 --> 16:05.320 sees the real load, and it thinks that it might make sense to decrease the 16:05.320 --> 16:16.280 GPU frequency: we are still at 60 FPS, but we are using less energy. So better code doesn't mean 16:16.360 --> 16:24.760 that there will be better measured performance. But in terms of energy consumption, it will be better anyway. 16:25.640 --> 16:29.240 So that was the last one. Thank you very much. 16:46.840 --> 16:58.840 You need tips about measuring GPU performance, right? 17:04.120 --> 17:13.400 It's really not a simple question, because usually you need to take a look at each platform independently. 17:13.960 --> 17:21.720 For example, some vendors provide special functions which can fix your GPU frequency.
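Converting raw timestamp-query results into wall-clock time is a one-liner worth spelling out: Vulkan timestamps are in device ticks, and VkPhysicalDeviceLimits::timestampPeriod gives nanoseconds per tick. The function name is illustrative; the values in the test are made up.

```cpp
#include <cstdint>

// Sketch: turn a pair of GPU timestamps (in ticks) into nanoseconds.
// timestampPeriodNs comes from VkPhysicalDeviceLimits::timestampPeriod.
// Note the caveats from the talk: overlapping GPU jobs cannot be
// separated this way, and frequency scaling skews comparisons between
// runs, so deltas are only comparable under similar GPU clock states.
double TimestampDeltaNs(uint64_t beginTicks, uint64_t endTicks, float timestampPeriodNs) {
    return static_cast<double>(endTicks - beginTicks) * timestampPeriodNs;
}
```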
17:23.000 --> 17:31.880 In that case, or, for example, not fixing the GPU frequency, but switching from energy-saving to performance mode, 17:31.880 --> 17:39.080 where it might be fixed. Not will, but might. Also, some vendor tools 17:39.720 --> 17:46.840 help you to get more metrics, for example those which are not available in the Vulkan API: 17:47.720 --> 17:53.000 frame statistics or something else, beyond timestamp queries. 17:53.960 --> 17:58.920 So: each platform independently, and using vendor tools. 17:59.880 --> 18:08.600 Looking back over the last three years now, was it worth switching to Vulkan 18:09.880 --> 18:14.840 and putting the effort into it? If you had to do it again, would you do it again? 18:16.680 --> 18:26.280 Was it worth it to switch to Vulkan? I had a talk in 2024, and yes, I said that for some platforms 18:26.280 --> 18:33.640 we get up to 300% performance improvement. The best improvement was for macOS, 18:34.840 --> 18:40.920 because we were using OpenGL there, and their OpenGL implementation is far from ideal. 18:42.680 --> 18:54.200 We get 10% to 300% improvement, and we get more stable performance, fewer fluctuations across frames. 18:54.920 --> 19:02.680 So yes, it's worth it. Another reason why we switched to Vulkan is that I was really interested in Vulkan, 19:02.680 --> 19:09.320 so I had internal motivation to add it. If you don't have that, or you don't have time, 19:09.320 --> 19:18.600 then it might be worth taking a look at some other libraries that might be integrated into your 19:18.920 --> 19:25.960 application. We were limited, because we still support OpenGL 2.1, which has limitations, 19:25.960 --> 19:32.920 so modern libraries like ANGLE or bgfx are not really usable for us yet. 19:34.760 --> 19:40.120 So the short answer: yes, it was worth it. Yes? 19:48.600 --> 20:11.000 You mean like choosing different GPUs, how does it affect performance?
20:19.560 --> 20:33.880 Do we notice any difference for similar GPUs, or for the same? 20:34.280 --> 20:51.960 I think no, because we don't have many GPUs available to us, so generally we have developer machines, 20:52.680 --> 20:58.520 and they are pretty limited. We don't have many of them, maybe 10 to 20, not more, 20:59.160 --> 21:05.560 so we don't have much variety. From users we usually get pretty rough statistics, 21:06.280 --> 21:12.120 I mean from their reports, because they report to us voluntarily; we are not 21:12.120 --> 21:21.160 collecting any data without their consent. So usually the most useful data we have 21:21.160 --> 21:27.720 is FPS, or frame time, or something like that. 21:41.000 --> 21:48.280 Yes, in terms of supported platforms, because, for example, we still support OpenGL 2.1. 21:48.360 --> 21:55.880 And, yeah, the question: do we notice any difference between 21:55.880 --> 22:02.920 commercial engine implementations and open source ones? So, in the variety of support: 22:03.560 --> 22:11.000 commercial engines usually try to avoid platforms where there aren't many people 22:11.880 --> 22:16.120 or where there isn't money, or something like that. In open source, it's vice versa: 22:17.080 --> 22:23.560 where we have many people, we have much support. So in that case we need to support a 22:23.560 --> 22:37.720 much wider range of hardware, and much older hardware. It's really interesting; they are 22:37.720 --> 22:48.760 like two different areas to investigate. So it's not worse, not better, but different; 22:49.320 --> 22:52.920 for me, both of those areas are interesting. 22:57.400 --> 22:59.960 Yep. So, what was the problem on the Raspberry Pi? 23:00.920 --> 23:07.320 I can show you the ticket in Mesa. What was the problem with the Raspberry Pi?
23:08.120 --> 23:17.160 If I'm not mistaken, it was that with two different command buffers, the internal state 23:17.720 --> 23:27.160 wasn't synchronized around some pending barriers, but I can show you more details. 23:27.560 --> 23:31.560 So, that's all? 23:31.560 --> 23:38.600 You mentioned you use bindless textures; when you have to debug something, do you switch to 23:40.760 --> 23:46.440 non-bindless? As far as I know, with RenderDoc it's a little bit difficult 23:46.440 --> 23:49.400 to debug if you enable bindless textures. 23:49.400 --> 24:01.400 Yes, it's possible. Are we trying to disable bindless, 24:01.400 --> 24:08.360 descriptor indexing, if we have a bug? Yes, it's the first step. So if we have a bug, we try to 24:08.360 --> 24:13.800 disable descriptor indexing first. If it still reproduces, then we work on that. 24:13.800 --> 24:21.160 If not, then we try to investigate the descriptor indexing path. But if I'm not mistaken, 24:21.160 --> 24:28.840 we had only one bug related to descriptor indexing, so most of our bugs reproduce in both cases, 24:30.360 --> 24:35.880 because the client code which is calling our backends is absolutely the same; 24:36.600 --> 24:38.520 just the binding of resources is different.