WEBVTT 00:00.000 --> 00:09.840 Welcome everybody. We stay in the cloud-native world, but with a different project and 00:09.840 --> 00:13.160 the same fun as always, so a round of applause. 00:13.160 --> 00:20.800 Hello everyone, I'm Kabatis Pistadinos, and today I'm going to show you work done by 00:20.800 --> 00:24.440 me and my colleagues at FORTH in Greece. 00:25.400 --> 00:29.400 It's called optimizing Longhorn for high-performance hardware. 00:29.400 --> 00:34.600 Okay, so first of all, I think most of you know about Kubernetes, but I have an introduction 00:34.600 --> 00:35.600 slide. 00:35.600 --> 00:41.440 Most production software now runs in containers, so you build a container and you ship it, 00:41.440 --> 00:47.980 you know the flow of things, and most of the work is in the cloud, so you have 00:47.980 --> 00:53.640 Kubernetes for this, orchestrating your containers; it's kind of like a cloud operating 00:53.640 --> 00:54.640 system. 00:54.640 --> 01:00.840 It does the resource management and the networking, and it handles failures, so you just 01:00.840 --> 01:06.680 describe what you want; there's an abstraction based on primitives and conventions. 01:06.680 --> 01:14.520 You describe what you want from the hardware perspective and you get it, and you get 01:14.520 --> 01:19.360 a managed platform for deployment and development. 01:19.360 --> 01:27.440 So you can have Kubernetes locally, on-prem, or in the cloud, and because of this, there are 01:27.440 --> 01:34.080 a lot of extensions and third-party tools implemented. 01:34.080 --> 01:42.680 In this presentation we'll talk about storage. The basic component in Kubernetes is the 01:42.680 --> 01:51.360 volume object; there is a standard API, there are storage classes, you have persistent 01:51.360 --> 01:56.840 volume claims and persistent volumes. You make a request using a persistent volume claim, and 01:56.840 --> 02:03.680 the persistent volume is the actual storage; the storage class determines 02:03.840 --> 02:10.120 how you will get the actual storage. 02:10.120 --> 02:19.120 Because there are a lot of storage interfaces, and a lot of implementations have been done 02:19.120 --> 02:27.520 in recent years, there is the Container Storage Interface (CSI). It's an API, 02:27.520 --> 02:34.480 an interface, for you to introduce your custom storage offering: you can add 02:34.480 --> 02:41.200 and remove your storage, you can implement your own lifecycle for the volume, and you can 02:41.200 --> 02:50.200 do snapshots and cloning. There are a lot of implementations, and one of those is 02:50.200 --> 02:51.200 Longhorn.
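To make the CSI contract described above concrete, here is a minimal Go sketch of a controller service, assuming the generated bindings from the CSI spec repository (and a recent version of them that provides UnimplementedControllerServer). The driver type, the ID scheme, and the provisioning logic are illustrative placeholders, not Longhorn's code.

```go
package demo

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// demoDriver is a hypothetical skeleton of a CSI controller service.
// A real driver also implements the Identity and Node services and is
// registered with the kubelet; all of that is omitted here.
type demoDriver struct {
	csi.UnimplementedControllerServer
}

// CreateVolume is the RPC a provisioner calls when a PersistentVolumeClaim
// references a StorageClass served by this driver.
func (d *demoDriver) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// A real driver would allocate backing storage here (a device, a
	// logical volume, a replicated engine, ...).
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      "demo-" + req.GetName(),                   // placeholder ID scheme
			CapacityBytes: req.GetCapacityRange().GetRequiredBytes(), // 0 if no range was given
		},
	}, nil
}
```

The same service would also implement DeleteVolume, CreateSnapshot, and the other lifecycle RPCs the talk mentions.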
02:51.200 --> 02:57.040 Longhorn is an open-source project, part of the CNCF; it's an incubating project. It 02:57.040 --> 03:02.880 is a complete software-defined storage solution; in comparison with other storage 03:02.880 --> 03:12.040 solutions, it's complete, meaning that it doesn't rely on any other third-party tools 03:12.040 --> 03:17.560 and providers, and it also has advanced features like snapshots and backups. There are 03:17.560 --> 03:27.880 a few technologies involved; you can see here in the figure the overview of the architecture. You 03:27.880 --> 03:35.800 have the CSI plugin, which communicates with Kubernetes, and a UI for the user: you 03:35.800 --> 03:42.280 can describe from the UI what volume you need and what node you want it on, and the 03:42.280 --> 03:51.120 manager gives it to you. The actual data I/O path is done in the 03:51.120 --> 03:58.000 engine, where you have the controller. When you need a volume, you issue a request, and the controller 03:58.000 --> 04:06.960 runs on the node you want the volume on, and there are replicas running on the 04:06.960 --> 04:14.440 same node or on other nodes; they communicate through a TCP protocol, and the replica is where 04:14.440 --> 04:24.040 the actual I/O is happening, and it uses SSDs or whatever actual storage device you want. 04:24.040 --> 04:32.880 So the engine is the core of the implementation; it is where the actual work happens, and that's 04:32.880 --> 04:42.520 where the main focus of our work is. Each volume is implemented by only one controller, 04:42.520 --> 04:49.360 so each volume has one controller, and you can attach as many replicas as you want. Okay, 04:49.360 --> 04:57.080 so let's look at the performance of the system. Nowadays most people use a cloud, 04:57.080 --> 05:12.040 like AWS or Azure, to run Kubernetes, and there the performance is okay, 05:12.040 --> 05:19.040 because there are limitations in AWS, around 40K IOPS; you need to pay a lot more in order 05:19.040 --> 05:24.720 not to have that limitation. But when you are on-prem, where there are high-speed networks 05:24.720 --> 05:31.520 and NVMe and you have a better computer, you are limited to about 50K read and 25K write IOPS 05:31.520 --> 05:39.200 where your actual device supports 400K IOPS. So in order to find the bottlenecks and what 05:39.200 --> 05:43.600 we can do about them, we isolated the engine and split it into three basic parts: 05:43.600 --> 05:50.600 the frontend, the controller, and the replica, and we found three major bottlenecks. From 05:50.680 --> 05:57.320 the frontend perspective, what is used there is the iSCSI implementation; it is an interface for the 05:57.320 --> 06:05.960 kernel to expose a block device to the user, and it connects the I/O path and redirects it to your application, 06:05.960 --> 06:15.840 but it is actually really slow. The controller has to forward all the requests 06:15.920 --> 06:23.200 to the replica, and they use their own communication protocol, which serializes all the operations, 06:23.200 --> 06:28.800 and that's a big bottleneck too. And at the bottom part of the engine, we have the replica, 06:28.800 --> 06:36.560 where the actual reads and writes happen; it uses a sparse file implementation, where writes 06:38.160 --> 06:44.800 have very, very limited performance, especially when using multiple snapshots. Okay, 06:44.800 --> 06:52.160 so what can we do about it? We explored our alternatives.
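As background for the controller-replica path just described, here is a rough Go sketch (our illustration, not Longhorn's engine code) of the fan-out pattern: the single controller forwards each write to every replica over its TCP connection and completes the request only when all replicas acknowledge.

```go
package engine

import (
	"context"
	"fmt"
	"net"
)

// writeToReplicas forwards one write to every replica connection in
// parallel and waits for all acknowledgements. Error handling is
// simplified: a real engine would mark a failed replica as faulted
// and keep serving I/O from the healthy ones.
func writeToReplicas(ctx context.Context, replicas []net.Conn, payload []byte) error {
	errs := make(chan error, len(replicas))
	for _, conn := range replicas {
		go func(c net.Conn) {
			_, err := c.Write(payload) // every replica receives the same payload
			errs <- err
		}(conn)
	}
	for range replicas {
		select {
		case err := <-errs:
			if err != nil {
				return fmt.Errorf("replica write failed: %w", err)
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```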
We found that in the frontend part 06:52.160 --> 06:58.720 we can use ublk. It's a new interface, based on io_uring, which is available in the latest Linux 06:58.720 --> 07:08.320 kernels; it's in the new Ubuntu too. In the controller part, we found what 07:08.320 --> 07:13.840 the problem is, and we reimplemented the logic of the communication. In the replica part, we implemented 07:14.000 --> 07:22.480 a different storage mechanism; we call it the Direct Block Store, a custom direct-to-disk storage layer. 07:24.400 --> 07:31.280 Okay, so let's see, part by part, what the improvements are. So, for ublk: 07:32.560 --> 07:38.480 ublk consists of two major components, the ublk driver, which is in the kernel, 07:38.800 --> 07:44.800 where you can issue a request in order to expose a block device 07:44.800 --> 07:51.920 in user space, and ublksrv, which is in user space and implements 07:51.920 --> 08:00.000 the I/O; it redirects the data wherever you want. You can write a custom target in order to 08:00.000 --> 08:10.640 get it to do what you want: you can, for example, implement a loop target connected to 08:10.640 --> 08:14.880 a loop device, or you can implement it to send the data to the Longhorn engine. 08:16.880 --> 08:26.080 It is very modular; you know, you just fill in some functions, hand them to ublk, and they are used 08:26.080 --> 08:34.800 for the I/O, and that makes tasks like adding more devices easy. And why is it that fast? It's based on 08:34.800 --> 08:39.920 io_uring, which is a Linux kernel system-call interface; it supports asynchronous operations. 08:40.640 --> 08:47.280 It's based on the logic of two circular buffers, one for requests and one for replies; 08:47.600 --> 08:57.360 the application and the kernel can communicate through them without actual system calls, 08:57.360 --> 09:08.640 and that means you can do batching and fewer memory copies, and that's why it's really fast; 09:08.640 --> 09:15.040 you will see the numbers in a minute. Okay, for the second part of the engine, 09:15.040 --> 09:19.760 the controller-replica communication: in the top part of the figure you can see the original 09:19.760 --> 09:28.080 implementation. For every request, a Golang thread builds a request and uses a request channel, 09:28.080 --> 09:32.720 and the problem is that here the loop thread is just a single thread, 09:34.000 --> 09:41.520 running and serving every request, so they use a message map to store all the requests, 09:41.520 --> 09:49.520 in order to identify which one is the one coming back in the response, and that operation is synchronous, 09:50.560 --> 09:58.800 meaning you have a drop in performance because of the single loop thread, and then you 09:58.800 --> 10:04.000 forward the message to the write thread, and the write thread issues the request to the replica 10:04.080 --> 10:12.720 and waits for the reply.
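Before moving on to the fix for this controller bottleneck, a quick aside on the io_uring mechanism behind the ublk frontend described a moment ago. The two circular buffers can be sketched as follows; this is a conceptual single-producer/single-consumer model only, since the real rings live in memory shared between the application and the kernel and use atomic acquire/release accesses on the head and tail, which is what lets submissions and completions avoid a system call per I/O.

```go
package rings

// ring is a minimal single-producer/single-consumer circular buffer,
// illustrating the structure behind io_uring's submission and
// completion queues.
type ring struct {
	entries []uint64 // slots (io_uring keeps SQE indexes / CQEs here)
	head    uint32   // consumer position; only the consumer advances it
	tail    uint32   // producer position; only the producer advances it
	mask    uint32   // len(entries)-1; the size must be a power of two
}

func newRing(size uint32) *ring {
	return &ring{entries: make([]uint64, size), mask: size - 1}
}

// push enqueues one entry; it returns false if the ring is full.
func (r *ring) push(e uint64) bool {
	if r.tail-r.head == uint32(len(r.entries)) {
		return false // full: the producer must wait for completions
	}
	r.entries[r.tail&r.mask] = e
	r.tail++ // publishing the new tail is what "submits" the entry
	return true
}

// pop dequeues one entry; it returns false if the ring is empty.
func (r *ring) pop() (uint64, bool) {
	if r.head == r.tail {
		return 0, false // empty
	}
	e := r.entries[r.head&r.mask]
	r.head++
	return e, true
}
```

One ring carries requests from the application to the kernel and the other carries completions back; because both sides poll the shared counters, batches of I/O can flow with few or no system calls and fewer copies, which is the speed-up the talk attributes to ublk.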
In our approach, we saw that the serialized operations are not good, 10:12.720 --> 10:23.440 so we used an ID channel and a message array. When a request is built, you get an ID from a 10:23.440 --> 10:30.560 Golang channel, and you use it as an index into this message array; that way Golang channels 10:30.640 --> 10:42.240 assure you that you don't have conflicts, so you can have concurrent reads and writes in the array using the 10:42.240 --> 10:48.160 index, and that way every request is issued to the write thread immediately, not waiting for an 10:48.160 --> 10:56.560 iteration of the loop thread. Okay, in the last part of the engine, we have 10:56.560 --> 11:04.320 the Direct Block Store, or DBS for short. It can be used on a file or directly on the device, 11:04.320 --> 11:11.280 and it supports multiple volumes. It has an access API: you can use it from the CLI 11:11.280 --> 11:20.640 or call the API in your code. The main point of it is that it divides the storage into four regions: 11:20.640 --> 11:26.720 the superblock, the volume metadata, the extent metadata, and the user data. 11:29.280 --> 11:37.600 It's really lightweight and fast, and the extent maps are kept in memory; 11:37.600 --> 11:44.960 as you can see, it's around 40 megabytes per one-terabyte volume, so these operations don't 11:45.040 --> 11:50.800 actually touch the disk, they happen in memory, which is really fast, and there is extensive use of 11:50.800 --> 11:56.720 bitmaps here, so it's actually very fast. It's written in Golang, and it's open source. 12:00.000 --> 12:08.560 Okay, so in the left part is the original implementation: the user has an iSCSI block device 12:08.560 --> 12:17.680 and the tgt daemon in the server; here is the old communication implementation, and right here 12:17.680 --> 12:24.240 the sparse-file storage goes through the filesystem to the SSD. In our implementation, we use a ublk 12:24.240 --> 12:30.720 block device communicating with a ublksrv daemon, which communicates with the frontend; 12:30.720 --> 12:38.160 here the controller has the different communication approach, and here the direct block storage is implemented. 12:38.720 --> 12:49.920 So, using the whole Longhorn engine, you can see you have 50K read and 25K write IOPS. 12:51.760 --> 13:07.440 Okay, in the first layer here, we do a no-op operation for this part of the system, 13:07.520 --> 13:15.360 so in this experiment only the frontend is used; after the frontend, 13:16.560 --> 13:23.760 no work is actually done. iSCSI tgt can hold up to 60K IOPS, but ublk can go up to 13:24.960 --> 13:31.840 500K IOPS; as you can see, the ublk frontend has way more capabilities compared 13:31.920 --> 13:38.720 to iSCSI tgt. Okay, so now, because I think it's kind of complicated: 13:39.760 --> 13:48.800 for this metric, we replaced the frontend part, and we still used the old communication protocol, 13:50.240 --> 13:56.080 but we didn't do the actual I/O; we stopped the data path here, and a 13:56.960 --> 14:08.960 no-op is done. So here we see we lose some of the IOPS of ublk; we're down to 100K IOPS. 14:09.920 --> 14:16.400 Replacing the old communication method with our new communication, we go up a lot in IOPS, 14:16.400 --> 14:29.040 and now, doing the actual I/O: with the Linux sparse file implementation, versus our custom implementation, 14:29.040 --> 14:38.800 you drop back to 128K IOPS read and 38K IOPS write.
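A minimal Go sketch of the ID-channel plus message-array scheme just described (identifier names are ours, not Longhorn's, and the real change also has to handle timeouts and connection errors): a buffered channel is pre-filled with every slot index, so receiving an ID reserves a slot, and two goroutines can never write the same array entry concurrently, with no map or global lock on the hot path.

```go
package engine

// Message is a simplified in-flight request; the real message type
// carries offsets, buffers, and protocol headers.
type Message struct {
	ID   int
	Data []byte
}

// tracker hands out slot IDs through a buffered channel pre-filled
// with every index of a fixed array.
type tracker struct {
	ids      chan int
	inflight []*Message
}

func newTracker(slots int) *tracker {
	t := &tracker{ids: make(chan int, slots), inflight: make([]*Message, slots)}
	for i := 0; i < slots; i++ {
		t.ids <- i // every slot starts out free
	}
	return t
}

// send reserves a slot, records the message, and forwards it to the
// write thread right away, with no round-trip through a central loop.
func (t *tracker) send(m *Message, toWriter chan<- *Message) {
	id := <-t.ids // blocks only when all slots are in flight
	m.ID = id     // the ID travels with the request and returns in the reply
	t.inflight[id] = m
	toWriter <- m
}

// complete matches a reply to its request by ID and recycles the slot.
func (t *tracker) complete(id int) *Message {
	m := t.inflight[id]
	t.inflight[id] = nil
	t.ids <- id
	return m
}
```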
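And a hypothetical sketch of the DBS ideas described above: four on-disk regions, with extent allocation served from an in-memory bitmap so that allocating space never has to touch the disk. All names, sizes, and offsets here are illustrative assumptions, not DBS's actual on-disk format.

```go
package dbs

import "math/bits"

// layout marks the start of the four regions the talk describes.
type layout struct {
	superblockOff uint64 // device identity and format version
	volumeMetaOff uint64 // one record per volume
	extentMetaOff uint64 // persistent extent ownership records
	dataOff       uint64 // the extents holding user data
}

// extentBitmap is an in-memory free/used map of extents; keeping it in
// memory is what lets allocation avoid disk I/O entirely.
type extentBitmap []uint64

// allocate finds and claims the first free extent, returning its index.
func (b extentBitmap) allocate() (int, bool) {
	for w, word := range b {
		if word != ^uint64(0) { // this word still has a zero (free) bit
			bit := bits.TrailingZeros64(^word)
			b[w] |= 1 << uint(bit)
			return w*64 + bit, true
		}
	}
	return -1, false // no free extent left
}

// free releases an extent back to the bitmap.
func (b extentBitmap) free(idx int) {
	b[idx/64] &^= 1 << uint(idx%64)
}
```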
14:39.440 --> 14:50.800 Introducing DBS in this implementation takes the system to 150K IOPS, so as you can see, we practically 14:50.800 --> 15:01.680 triple the performance of the system using three major changes. Okay, so in conclusion: 15:02.560 --> 15:08.960 ublk has the biggest impact in accelerating I/O; it's a new technology using a state-of-the-art 15:08.960 --> 15:22.960 solution for doing I/O. So the question is: is NVMe-oF necessary? When you are not doing over-the-network 15:22.960 --> 15:30.720 operations, it's actually not, as ublk has the capabilities to outperform it in some situations. 15:31.680 --> 15:37.520 Second, you can do further performance improvements, especially in the controller; as you can see, 15:37.520 --> 15:44.400 the performance dropped massively inside the controller, and we are working on optimizations and adding more 15:44.400 --> 15:52.560 features. And in the last part, DBS is actually nothing novel, but it provides a helpful utility for other 15:52.560 --> 16:04.160 projects too. Okay, so some pull requests have been submitted; DBS and the other projects can be found 16:04.160 --> 16:13.440 on the CARV-ICS-FORTH GitHub, where you can also find the modified code of Longhorn 16:13.600 --> 16:27.040 as well. I want to acknowledge the backers funding this work. Thank you. Any questions? 16:27.520 --> 16:29.520 Yes? 16:33.520 --> 16:40.480 Sorry, I don't know Longhorn very well, but you said a volume is implemented by one controller and then a number 16:40.480 --> 16:47.120 of replicas. Yes? To support high availability, so on failure or something? 16:48.080 --> 16:52.720 Whether you have availability on failure? Yes, yes, yes, all that is done by the manager. 16:53.680 --> 17:01.280 Longhorn is... I don't know how to put it. Okay, can you state the question again for me a 17:01.280 --> 17:11.440 bit? Well, I'll explain what I mean again: you have the single controller and 17:11.440 --> 17:18.880 multiple replicas of your data, and that allows for the distributed high availability of the volume, 17:18.880 --> 17:25.040 is that right? Yes. Can you repeat the question? Yeah, you have to 17:25.040 --> 17:30.640 repeat it for the microphone. I get it, but I don't remember the whole question. Do you see the same 17:30.640 --> 17:35.600 performance improvements when you're using distributed multiple replicas? 17:35.600 --> 17:42.880 Multiple replicas. Okay, so the question is if using multiple replicas has the same effect, if 17:43.840 --> 17:50.480 our change has the same effect using multiple replicas. Okay, we haven't used multiple replicas; 17:50.480 --> 17:58.880 the only bottleneck when using multiple replicas is the network. Using a bigger, better connection 17:58.880 --> 18:05.040 for the network, it's not the bottleneck anymore, so yeah, you're going to have the same effects. 18:06.000 --> 18:12.000 Thank you. But I haven't done any actual numbers for that. 18:18.240 --> 18:24.960 Any other questions? Yes? Do these performance improvements carry over, given 18:25.040 --> 18:29.760 the direction the project is taking? What was the question? 18:31.760 --> 18:39.600 Ah, you asked me about the comparison to version two, how it compares to 18:39.600 --> 18:45.920 what we did on version one and the performance we achieved? Okay, the question is if this is 18:45.920 --> 18:51.680 version one or version two of Longhorn, right?
Yeah, this is version one; version two of 18:52.320 --> 19:00.240 Longhorn will use SPDK and NVMe-oF. On GitHub, I can see the issues, and they want to improve 19:00.240 --> 19:07.520 version one too, so that's why we focused on version one. Version two, with SPDK and 19:07.520 --> 19:13.200 NVMe-oF, will solve the problem of the frontend, but most of the other parts are not updated. 19:13.200 --> 19:18.800 I mean, the communication between the controller and the replica is not part of the improvements 19:18.800 --> 19:24.000 in version two, and they will also still use the Linux sparse files in the end. 19:26.160 --> 19:33.280 So there's also a need to improve version one, but version two is really good. 19:37.920 --> 19:38.800 Yes? 19:38.800 --> 19:53.600 I think that's configurable. Sorry, you asked me if NVMe-oF is over TCP? 19:54.160 --> 20:00.320 In Mayastor, no? Mayastor, from OpenEBS. Mayastor is using something like that. 20:02.640 --> 20:13.280 Okay, I think when you are on-prem, when you have the controller on the same 20:13.280 --> 20:17.680 node, you don't need to use TCP; you can just access it directly. 20:23.600 --> 20:36.960 I'm not sure, I'm not familiar with Mayastor's implementation.