Good evening everyone. My name is Yosh, I work at Percona as a quality engineer, and recently I have been experimenting with GPUs. This session covers what I learned: how you can partition a GPU using the MIG approach and what you can do with it. It includes setting things up and understanding how to monitor MIG as well.

Here is the overview of the session. We will look at the available GPU-sharing methods, then at how MIG works and the MIG concepts you should be aware of if you want to partition a GPU. We will also explore a workload, video generation, and see what happened under both the MPS and the MIG partitioning methods. Then we will explore ways to monitor MIG instances, and we will conclude with suggestions.

So, in what ways can you share a GPU? Broadly there are two types, temporal and spatial. With time sharing, a process shares the full GPU with other processes over time. Then there is MPS, where an entire GPU is shared by processes running within it: they all work at the same time and share the GPU's resources. That is a software-based approach. And then there is MIG, the thing we will be discussing today. MIG is actual isolation of your GPU: if you have a large GPU, you can partition it in a strictly isolated manner and run processes on it.
It is hardware-based GPU isolation. What do we mean by hardware-based? The hardware boundaries are there, but you enable them using the MIG commands, so it is hardware-based yet controlled through the CLI.

This is a broad comparison of how MIG differs from time slicing and MPS. With MIG, the GPU is split into isolated portions and there is no context-switch overhead. In time slicing, your process shares the entire GPU's memory and compute with other processes, so there is a lot of switching between processes. MPS is a shared pool, and there is some context switching there as well. In MIG, your process is completely isolated: it does not share anything, so there is no context switching within that partition. It is also a fixed resource setup, so you do not have to worry about time or about other workloads hogging your resources.

There are limitations too. MIG gives you at most 7 small slices. Time slicing effectively runs a single workload at a time, so you cannot do much with it; MPS lets many processes use a single GPU with quick switching; with MIG, if you have a workload you want to distribute, you can have a maximum of 7 partitions, and you adjust their sizes to the workload. For strict isolation requirements MIG is best, because it also guarantees quality of service.
Now, what do we mean by quality of service? Your bandwidth and your SM utilisation are nearly guaranteed within a MIG partition. There are certain things that are not one hundred percent guaranteed, but most of the time it will work as expected.

So what does MIG look like? On the left-hand side you can see a GPU partitioned into 7 GPU instances; this is an example of the smallest slice of a MIG instance. On the right-hand side there is a diagram of an NVIDIA A100 GPU. It has 8 slices of memory and 7 slices of compute. A GPU instance is essentially divided into two parts: GPU slices and GPU engines. The GPU slices consist of the memory slices and compute slices in the diagram above, while the GPU engines are separate units allotted to the GPU instance based on the portion you partition. Note that although there are 8 memory slices and 7 compute slices, they are not exactly 1/8 and 1/7 divisions; it is approximately that, and we will see later how the slicing works out.

The slice hierarchy is as follows: you partition memory first, and then you assign compute to it. It is not the other way around; you cannot slice compute first and then assign memory. This is a two-level hierarchy, and you have to follow the steps strictly.
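On a driver with MIG support, this two-level order is visible in the CLI itself: GPU instance (memory) profiles are listed separately from the compute instance profiles that can be placed inside them. A minimal sketch, assuming a MIG-capable GPU:

```shell
# Step 1 choices: GPU instance profiles (memory slices plus engines)
nvidia-smi mig -lgip
# Step 2 choices: compute instance profiles that fit inside a created GPU instance
nvidia-smi mig -lcip
```

Both commands are read-only, so they are safe to run before deciding on a layout.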
So what partitions can we create? On the left-hand side you can see the smallest partition: we slice off 5 GB of memory, the smallest available partition on these GPUs, and allot 1 compute slice to it, giving the smallest profile (1g.5gb) as an isolated enclosure. There is also a layout where a memory pool is shared by multiple compute instances: in the middle figure (figure H) you can see a large chunk, around 20 GB of memory, allocated to 4 compute instances. Each has its own compute instance, but they share the memory pool. And lastly, you can have a large GPU instance with large memory and large compute: still an isolated instance, just bigger than the smallest size shown in figure G.

So what can happen with each? With the smallest size you might have issues because your workload may require more resources, so it is not good for that, but for small workloads it works really well because you can have 7 such instances. On the right-hand side you have a big instance, but it may not be evenly utilised. What do we mean by unevenly utilised? If your workload uses only 10 GB of memory and two compute slices, the remaining memory and compute sit idle inside the partition, so there is potential under-use of the compute.
In the middle layout, where multiple compute instances share a big memory chunk, you can have out-of-memory issues: all the compute instances compete for memory, and eventually some processes crash because no memory is left.

So what are the overheads in MIG? You are getting full isolation, so there must be something you are compromising. You compromise on exact compute divisions. As I said earlier, it is not an exact division into 7. In the H100 example, the smallest slice, one compute slice, has around 16 SMs, but the overall GPU has around 132 SMs, so it is not exactly 132 divided by 7; you are missing around 2 SMs. You can also see in the middle that around 60 SMs are available for the 3g and 4g profiles, which is not exactly half of 132 but half of 120, so you are leaving around 12 SMs unused. So there are a couple of things you compromise if you use MIG, and you should keep that in consideration.

Then let us see how you can create a partition. We have access to a GPU, and we list the available partitions using the command on the left-hand side in figure L. It shows you which divisions are available; using the profile IDs you can create MIG profiles. As the steps show, first of all you need to enable MIG mode: GPUs do not come with MIG enabled by default. You do not have to install new utilities for recent GPUs, as they are installed by default, so you just have to enable the mode.
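Enabling the mode and listing the profiles looks roughly like this (a sketch assuming a single supported GPU at index 0; enabling MIG requires root privileges and may require a GPU reset):

```shell
sudo nvidia-smi -i 0 -mig 1    # enable MIG mode on GPU 0
nvidia-smi mig -lgip           # list available GPU instance profiles and their IDs
```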
Once the mode is enabled, you first create a GPU instance on the particular GPU. You can see from figure K that we have GPU 0, and in step 2 we use GPU 0 to create two GPU instances with profile ID 9, so we get two 3g.40gb profiles from step 2. In the third step we create and assign a compute instance to each; here we assign the entire compute capacity supported by those two GPU instances. But if you want to be even more specific and divide a GPU instance among multiple compute instances, you can do that with the small step at the bottom: you give the GPU instance ID and then assign the particular compute profiles that you want. You can also view the created compute instances and GPU instances using the commands shown.

Okay, so what does it actually look like after creation? Using the nvidia-smi -L command you can see that there are multiple MIG instances available, along with the UUIDs we will use to run our processes in the further steps. In figure L you can also see that there are no free instances left: we are done, having created the maximum number of profiles we possibly can. So after creation, this command can help you identify which slots are available for partitioning and which are not.

Okay, now let us look at the combinations. Beyond the overhead that we saw earlier, there are certain partitioning combinations that waste compute and memory.
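The steps above can be sketched as follows (profile ID 9 corresponds to 3g.40gb in the talk's example; the GPU instance ID in the commented command is illustrative):

```shell
# Step 2: create two 3g.40gb GPU instances on GPU 0; -C also creates the default compute instances
sudo nvidia-smi mig -i 0 -cgi 9,9 -C
# Finer-grained alternative: place several compute instances inside one GPU instance
# sudo nvidia-smi mig -i 0 -gi 1 -cci 0,0,0
# Inspect what was created
nvidia-smi mig -lgi
nvidia-smi mig -lci
nvidia-smi -L    # shows MIG device UUIDs (MIG-...)
```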
We will look into that, and we will also see which combinations use all the available profile slices, so that your GPU is at least used completely apart from the inherent overhead.

Here is one example. This is an H100 GPU with around 80 GB of memory, and we divided it into two portions of three compute slices and 40 GB each. You can see there is one compute slice that is wasted, not allotted to anyone. That is one thing that can happen: if you partition things in certain ways, you miss out on compute capacity. It is around one seventh of your GPU, about 14 percent, that you are simply wasting without using it anywhere. It is a similar story for memory: if you have seven identical instances with small compute, you have 10 GB of memory wasted, so you are using only seven eighths of your GPU's capacity.

This table, taken from the documentation, shows certain combinations where compute misses out. If you observe it closely, the pattern is around six: whenever you use a combination summing to six compute slices, such as three plus three, three plus two plus one, or three plus one plus one plus one, you are missing one compute slice and simply wasting it. So which combinations don't leave any slices unused? These are the combinations I have listed; you can also see in this table that six is missing, so those combinations do not work out.
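A toy way to check whether a proposed combination of compute-slice sizes fills all 7 slices of such a GPU (the slice sizes here are illustrative):

```shell
combo="3 2 1 1"    # compute-slice sizes of the chosen profiles; 3+2+1+1 = 7
total=0
for s in $combo; do
  total=$((total + s))
done
if [ "$total" -eq 7 ]; then
  echo "no compute slices wasted"
else
  echo "$((7 - total)) compute slice(s) wasted"
fi
```

With `combo="3 3"` the same check reports one wasted slice, matching the table's pattern around six.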
Okay, this is the same slide, so I will let it go. So how can we execute things on our MIG instance? We created a MIG instance and now we want to run our workload on it. Using the nvidia-smi -L command you can view the multiple available instances. You take the UUIDs shown after creating your MIG instances and simply export one as a variable: there is the CUDA_VISIBLE_DEVICES variable, where you assign your particular MIG device UUID, and then you run your process. The highlighted one is the UUID of the GPU itself, not of the MIG device.

Okay, the B200 also has profiles similar to the ones we saw earlier for the H100. We are showing this because we will present the performance of the video generation models on the B200 and H200. So these are the partitioning methods available to us on the B200.

Now for the good part: video generation. We have two models that we will test. The first, 14.1, is a very large video generation model; it uses a prompt, an image, and audio to generate a video. The other is a text-to-video generator, a smaller model of 5 billion parameters. Well, it is not small, but in the video generation context it is the smaller one. We also observed the peak and constant VRAM utilisation for these models; they are mentioned here.

So what did we observe? The baseline compute is an H100 GPU. At the end there is an observation where we simply run MIG instances on the B200 and show it alongside. As you can see, on certain workloads it takes a long time and you are only able to run one of them.
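Launching a process pinned to one MIG slice looks roughly like this (the UUID and the workload script are illustrative, not from the talk):

```shell
nvidia-smi -L    # copy a MIG device UUID from this output
export CUDA_VISIBLE_DEVICES=MIG-4f9d3c2a-1b2c-3d4e-5f60-708192a3b4c5    # illustrative UUID
python generate_video.py    # hypothetical workload; it will only see the selected MIG slice
```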
Meanwhile, for the other workloads on the B200, you are able to run multiple models in parallel. I will show an even more detailed view on the next slide. So, how many videos were we able to generate? Using the MPS method, isolation was not guaranteed, but it worked: around two instances, and three in a staggered manner. We run one process, wait for one minute, run another, and then the third; it worked in that manner. For the MIG part we had guaranteed isolation on both the H100 and the B200, so there was no need for that kind of staggering. However, with the larger B200 GPU we might assume we will have a lot more slices available, but that is not the case. The slice combinations we get provide quite a bit more memory than we require: for the 5-billion-parameter model on B200 MIG, we waste around 10 GB of memory per model generated in parallel. So there are certain combinations where things do not work out, and there is one more catch: you cannot allow other work to run on this unused part of the GPU, whereas with MPS you can. That is the con, but isolation is completely guaranteed in the MIG setup, and you will not have any interference failures.

As for possible failures: I tried running a variable workload where a model briefly uses more memory and then stops, a spike of about 5 seconds. It is a small window, but it still fails. So for a variable workload with high variance, even over a short time, MIG partitioning is not a good idea.
Also, out-of-memory occurs in a similar fashion whether you have a single GPU instance or multiple GPU instances.

Okay, so now we have MIG set up and we understand what it is. How can we monitor the instances? There are two ways. The first is nvidia-smi, if you want a quick CLI-based setup: you can monitor it using watch. The other is the DCGM exporter. This exporter collects metrics just like any other exporter, such as node exporter, and you can then combine it with Prometheus and Grafana. So that is the monitoring setup.

What else can we do? Besides slicing the GPU and creating MIG instances, we can also combine things together: a MIG instance can run the MPS daemon, because MPS is essentially a service. You simply enable it within a MIG instance and it works flawlessly. But note one thing: you cannot create MIG partitions after enabling MPS. You need to disable MPS first, and then you can create the MIG partitions.

Now the conclusions. For high-variance workloads, provision profiles with a large buffer; yes, this will leave memory unused, but it will still provide isolation and get the job done. For stable workloads MIG works well; for high variance, use MPS whenever possible and do not rely too much on MIG, because you will simply waste resources that other processes could have used. And for the large video generation models, an H100 is not enough: you need at least a B200 to generate two to three videos in parallel. So that is all, thank you very much.
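Both monitoring routes, plus the MIG-with-MPS combination, can be sketched as follows (the exporter image, its default port 9400, and the placeholder UUID are assumptions, not details from the talk):

```shell
# Quick CLI route: refresh per-MIG-device utilisation every 2 seconds
watch -n 2 nvidia-smi

# DCGM exporter route: serve Prometheus-format GPU/MIG metrics for Grafana dashboards
docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter
curl localhost:9400/metrics

# MPS inside a MIG slice: select the slice first, then start the MPS daemon
# export CUDA_VISIBLE_DEVICES=MIG-<uuid>   # placeholder UUID
# nvidia-cuda-mps-control -d               # stop later with: echo quit | nvidia-cuda-mps-control
```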
Yes, you have one question. I think it is because of the architecture. Yeah, can you repeat the question? So the question is: why do we have these particular partitions, and when can we expect resources to be utilised properly within the partitions? Well, I do not have an exact answer, but it is because of the architecture. I believe that if NVIDIA improves the GPU architecture, we might get even better profiles that are more dynamic. So hopefully that happens.

We have another question: are the profiles the same for all the GPUs, like the H100? Are the available combinations the same for all GPUs, or do different GPUs have different profiles? Well, the answer is that there are certain combinations which are similar within a given architecture, but different architectures have different combinations available.

One more question: no, I have not explored AMD or Intel; I have worked with NVIDIA, hopefully. Okay, so we have two more questions. Can you repeat that, and speak a bit louder?

So I tried generating videos on the different MIG partitions compared to MPS. The MIG ones had slight delays. The reason was that they were not using the complete set of SMs: certain SMs were left over in those partitions, and these video generation models are SM-intensive on top of their VRAM requirement. That caused around two to three minutes of delay in generating them.
So I generated around one minute, no, one minute forty seconds of video, and it was incurring those delays. Okay, we have another question. So the question was: does AMD have something similar? Yes, they do; one person from the audience explained the ways in which those GPUs can be partitioned, so thank you for that.

Okay, we have one more question. So the question is: can we have multiple GPUs where you provision workloads using MIG partitions? Yes, it can be done if you have, for example, 4x H100s available. In this case it was a single instance, but you can orchestrate your workload across different MIG partitions by enabling and tuning those partitions. No, there are many offerings out there, like 4x H100, and for the B200 there are also certain combinations; it depends on the provider. I used Worlda, and they also provide combinations in smaller portions, so I used one of those.

Thank you.