WEBVTT 00:00.000 --> 00:09.840 Welcome everybody. We stay in the cloud-native world, but with a different project and 00:09.840 --> 00:13.160 the same fun as always, so a round of applause. 00:13.160 --> 00:20.800 Hello everyone, I'm Kabatis Pistadinos, and today I'm going to show you work done by 00:20.800 --> 00:24.440 me and my colleagues at FORTH in Greece. 00:25.400 --> 00:29.400 It's called optimizing Longhorn for high-performance hardware. 00:29.400 --> 00:34.600 Okay, so first of all, I think most of you know about Kubernetes, but I have an introduction 00:34.600 --> 00:35.600 slide. 00:35.600 --> 00:41.440 Most production software now runs in containers, so you build a container and you ship it, 00:41.440 --> 00:47.980 you know the flow of things, and most of the work is in the cloud, so you have 00:47.980 --> 00:53.640 Kubernetes for this, orchestrating your containers; it's kind of like a cloud operating 00:53.640 --> 00:54.640 system. 00:54.640 --> 01:00.840 It does the resource management and the networking, and it handles failures, so you just 01:00.840 --> 01:06.680 describe what you want; there's an abstraction based on primitives and conventions. 01:06.680 --> 01:14.520 You describe what you want from the hardware perspective and you get it, and you get 01:14.520 --> 01:19.360 a managed platform for deployment and development. 01:19.360 --> 01:27.440 So you can have Kubernetes locally, on-prem, or in the cloud, and because of this, there are 01:27.440 --> 01:34.080 a lot of extensions and third-party tools implemented. 01:34.080 --> 01:42.680 In this presentation we'll talk about storage. The basic component in Kubernetes is the 01:42.680 --> 01:51.360 volume object; there is a standard API, there are storage classes, you have persistent 01:51.360 --> 01:56.840 volume claims and persistent volumes. You make a request using a persistent volume claim, and 01:56.840 --> 02:03.680 the persistent volume is the actual storage; the storage class determines 02:03.840 --> 02:10.120 how you will get the actual storage. 02:10.120 --> 02:19.120 Because there are a lot of storage interfaces, and a lot of implementations have been done 02:19.120 --> 02:27.520 in recent years, there is the Container Storage Interface (CSI). It's an API, 02:27.520 --> 02:34.480 an interface, for you to introduce your custom storage offering: you can add 02:34.480 --> 02:41.200 and remove your storage, you can implement your own lifecycle for the volume, and you can 02:41.200 --> 02:50.200 do snapshots and cloning. There are a lot of implementations, and one of those is 02:50.200 --> 02:51.200 Longhorn.
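To make the CSI contract described above concrete, here is a minimal Go sketch of a controller service, assuming the generated bindings from the CSI spec repository (and a recent version of them that provides UnimplementedControllerServer). The driver type, the ID scheme, and the provisioning logic are illustrative placeholders, not Longhorn's code.

```go
package demo

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// demoDriver is a hypothetical skeleton of a CSI controller service.
// A real driver also implements the Identity and Node services and is
// registered with the kubelet; all of that is omitted here.
type demoDriver struct {
	csi.UnimplementedControllerServer
}

// CreateVolume is the RPC a provisioner calls when a PersistentVolumeClaim
// references a StorageClass served by this driver.
func (d *demoDriver) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// A real driver would allocate backing storage here (a device, a
	// logical volume, a replicated engine, ...).
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      "demo-" + req.GetName(),                   // placeholder ID scheme
			CapacityBytes: req.GetCapacityRange().GetRequiredBytes(), // 0 if no range was given
		},
	}, nil
}
```

The same service would also implement DeleteVolume, CreateSnapshot, and the other lifecycle RPCs the talk mentions.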
02:51.200 --> 02:57.040 Longhorn is an open-source project, part of the CNCF; it's an incubating project. It 02:57.040 --> 03:02.880 is a complete software-defined storage solution; in comparison with other storage 03:02.880 --> 03:12.040 solutions, it's complete, meaning that it doesn't rely on any other third-party tools 03:12.040 --> 03:17.560 and providers, and it also has advanced features like snapshots and backups. There are 03:17.560 --> 03:27.880 a few technologies involved; you can see here in the figure the overview of the architecture. You 03:27.880 --> 03:35.800 have the CSI plugin, which communicates with Kubernetes, and a UI for the user: you 03:35.800 --> 03:42.280 can describe from the UI what volume you need and what node you want it on, and the 03:42.280 --> 03:51.120 manager gives it to you. The actual data I/O path is done in the 03:51.120 --> 03:58.000 engine, where you have the controller. When you need a volume, you issue a request, and the controller 03:58.000 --> 04:06.960 runs on the node you want the volume on, and there are replicas running on the 04:06.960 --> 04:14.440 same node or on other nodes; they communicate through a TCP protocol, and the replica is where 04:14.440 --> 04:24.040 the actual I/O is happening, and it uses SSDs or whatever actual storage device you want. 04:24.040 --> 04:32.880 So the engine is the core of the implementation; it is where the actual work happens, and that's 04:32.880 --> 04:42.520 where the main focus of our work is. Each volume is implemented by only one controller, 04:42.520 --> 04:49.360 so each volume has one controller, and you can attach as many replicas as you want. Okay, 04:49.360 --> 04:57.080 so let's look at the performance of the system. Nowadays most people use a cloud, 04:57.080 --> 05:12.040 like AWS or Azure, to run Kubernetes, and there the performance is okay, 05:12.040 --> 05:19.040 because there are limitations in AWS, around 40K IOPS; you need to pay a lot more in order 05:19.040 --> 05:24.720 not to have that limitation. But when you are on-prem, where there are high-speed networks 05:24.720 --> 05:31.520 and NVMe and you have a better computer, you are limited to about 50K read and 25K write IOPS 05:31.520 --> 05:39.200 where your actual device supports 400K IOPS. So in order to find the bottlenecks and what 05:39.200 --> 05:43.600 we can do about them, we isolated the engine and split it into three basic parts: 05:43.600 --> 05:50.600 the frontend, the controller, and the replica, and we found three major bottlenecks. From 05:50.680 --> 05:57.320 the frontend perspective, what is used there is the iSCSI implementation; it is an interface for the 05:57.320 --> 06:05.960 kernel to expose a block device to the user, and it connects the I/O path and redirects it to your application, 06:05.960 --> 06:15.840 but it is actually really slow. The controller has to forward all the requests 06:15.920 --> 06:23.200 to the replica, and they use their own communication protocol, which serializes all the operations, 06:23.200 --> 06:28.800 and that's a big bottleneck too. And at the bottom part of the engine, we have the replica, 06:28.800 --> 06:36.560 where the actual reads and writes happen; it uses a sparse file implementation, where writes 06:38.160 --> 06:44.800 have very, very limited performance, especially when using multiple snapshots. Okay, 06:44.800 --> 06:52.160 so what can we do about it? We explored our alternatives.
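As background for the controller-replica path just described, here is a rough Go sketch (our illustration, not Longhorn's engine code) of the fan-out pattern: the single controller forwards each write to every replica over its TCP connection and completes the request only when all replicas acknowledge.

```go
package engine

import (
	"context"
	"fmt"
	"net"
)

// writeToReplicas forwards one write to every replica connection in
// parallel and waits for all acknowledgements. Error handling is
// simplified: a real engine would mark a failed replica as faulted
// and keep serving I/O from the healthy ones.
func writeToReplicas(ctx context.Context, replicas []net.Conn, payload []byte) error {
	errs := make(chan error, len(replicas))
	for _, conn := range replicas {
		go func(c net.Conn) {
			_, err := c.Write(payload) // every replica receives the same payload
			errs <- err
		}(conn)
	}
	for range replicas {
		select {
		case err := <-errs:
			if err != nil {
				return fmt.Errorf("replica write failed: %w", err)
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```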
We found that in the frontend part 06:52.160 --> 06:58.720 we can use ublk. It's a new interface, based on io_uring, which is available in the latest Linux 06:58.720 --> 07:08.320 kernels; it's in the new Ubuntu too. In the controller part, we found what 07:08.320 --> 07:13.840 the problem is, and we reimplemented the logic of the communication. In the replica part, we implemented 07:14.000 --> 07:22.480 a different storage mechanism; we call it the Direct Block Store, a custom direct-to-disk storage layer. 07:24.400 --> 07:31.280 Okay, so let's see, part by part, what the improvements are. So, for ublk: 07:32.560 --> 07:38.480 ublk consists of two major components, the ublk driver, which is in the kernel, 07:38.800 --> 07:44.800 where you can issue a request in order to expose a block device 07:44.800 --> 07:51.920 in user space, and ublksrv, which is in user space and implements 07:51.920 --> 08:00.000 the I/O; it redirects the data wherever you want. You can write a custom target in order to 08:00.000 --> 08:10.640 get it to do what you want: you can, for example, implement a loop target connected to 08:10.640 --> 08:14.880 a loop device, or you can implement it to send the data to the Longhorn engine. 08:16.880 --> 08:26.080 It is very modular; you know, you just fill in some functions, hand them to ublk, and they are used 08:26.080 --> 08:34.800 for the I/O, and that makes tasks like adding more devices easy. And why is it that fast? It's based on 08:34.800 --> 08:39.920 io_uring, which is a Linux kernel system-call interface; it supports asynchronous operations. 08:40.640 --> 08:47.280 It's based on the logic of two circular buffers, one for requests and one for replies; 08:47.600 --> 08:57.360 the application and the kernel can communicate through them without actual system calls, 08:57.360 --> 09:08.640 and that means you can do batching and fewer memory copies, and that's why it's really fast; 09:08.640 --> 09:15.040 you will see the numbers in a minute. Okay, for the second part of the engine, 09:15.040 --> 09:19.760 the controller-replica communication: in the top part of the figure you can see the original 09:19.760 --> 09:28.080 implementation. For every request, a Golang thread builds a request and uses a request channel, 09:28.080 --> 09:32.720 and the problem is that here the loop thread is just a single thread, 09:34.000 --> 09:41.520 running and serving every request, so they use a message map to store all the requests, 09:41.520 --> 09:49.520 in order to identify which one is the one coming back in the response, and that operation is synchronous, 09:50.560 --> 09:58.800 meaning you have a drop in performance because of the single loop thread, and then you 09:58.800 --> 10:04.000 forward the message to the write thread, and the write thread issues the request to the replica 10:04.080 --> 10:12.720 and waits for the reply.
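Before moving on to the fix for this controller bottleneck, a quick aside on the io_uring mechanism behind the ublk frontend described a moment ago. The two circular buffers can be sketched as follows; this is a conceptual single-producer/single-consumer model only, since the real rings live in memory shared between the application and the kernel and use atomic acquire/release accesses on the head and tail, which is what lets submissions and completions avoid a system call per I/O.

```go
package rings

// ring is a minimal single-producer/single-consumer circular buffer,
// illustrating the structure behind io_uring's submission and
// completion queues.
type ring struct {
	entries []uint64 // slots (io_uring keeps SQE indexes / CQEs here)
	head    uint32   // consumer position; only the consumer advances it
	tail    uint32   // producer position; only the producer advances it
	mask    uint32   // len(entries)-1; the size must be a power of two
}

func newRing(size uint32) *ring {
	return &ring{entries: make([]uint64, size), mask: size - 1}
}

// push enqueues one entry; it returns false if the ring is full.
func (r *ring) push(e uint64) bool {
	if r.tail-r.head == uint32(len(r.entries)) {
		return false // full: the producer must wait for completions
	}
	r.entries[r.tail&r.mask] = e
	r.tail++ // publishing the new tail is what "submits" the entry
	return true
}

// pop dequeues one entry; it returns false if the ring is empty.
func (r *ring) pop() (uint64, bool) {
	if r.head == r.tail {
		return 0, false // empty
	}
	e := r.entries[r.head&r.mask]
	r.head++
	return e, true
}
```

One ring carries requests from the application to the kernel and the other carries completions back; because both sides poll the shared counters, batches of I/O can flow with few or no system calls and fewer copies, which is the speed-up the talk attributes to ublk.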
In our approach, we saw that the serialized operations are not good, 10:12.720 --> 10:23.440 so we used an ID channel and a message array. When a request is built, you get an ID from a 10:23.440 --> 10:30.560 Golang channel, and you use it as an index into this message array; that way Golang channels 10:30.640 --> 10:42.240 assure you that you don't have conflicts, so you can have concurrent reads and writes in the array using the 10:42.240 --> 10:48.160 index, and that way every request is issued to the write thread immediately, not waiting for an 10:48.160 --> 10:56.560 iteration of the loop thread. Okay, in the last part of the engine, we have 10:56.560 --> 11:04.320 the Direct Block Store, or DBS for short. It can be used on a file or directly on the device, 11:04.320 --> 11:11.280 and it supports multiple volumes. It has an access API: you can use it from the CLI 11:11.280 --> 11:20.640 or call the API in your code. The main point of it is that it divides the storage into four regions: 11:20.640 --> 11:26.720 the superblock, the volume metadata, the extent metadata, and the user data. 11:29.280 --> 11:37.600 It's really lightweight and fast, and the extent maps are kept in memory; 11:37.600 --> 11:44.960 as you can see, it's around 40 megabytes per one-terabyte volume, so these operations don't 11:45.040 --> 11:50.800 actually touch the disk, they happen in memory, which is really fast, and there is extensive use of 11:50.800 --> 11:56.720 bitmaps here, so it's actually very fast. It's written in Golang, and it's open source. 12:00.000 --> 12:08.560 Okay, so in the left part is the original implementation: the user has an iSCSI block device 12:08.560 --> 12:17.680 and the tgt daemon in the server; here is the old communication implementation, and right here 12:17.680 --> 12:24.240 the sparse-file storage goes through the filesystem to the SSD. In our implementation, we use a ublk 12:24.240 --> 12:30.720 block device communicating with a ublksrv daemon, which communicates with the frontend; 12:30.720 --> 12:38.160 here the controller has the different communication approach, and here the direct block storage is implemented. 12:38.720 --> 12:49.920 So, using the whole Longhorn engine, you can see you have 50K read and 25K write IOPS. 12:51.760 --> 13:07.440 Okay, in the first layer here, we do a no-op operation for this part of the system, 13:07.520 --> 13:15.360 so in this experiment only the frontend is used; after the frontend, 13:16.560 --> 13:23.760 no work is actually done. iSCSI tgt can hold up to 60K IOPS, but ublk can go up to 13:24.960 --> 13:31.840 500K IOPS; as you can see, the ublk frontend has way more capabilities compared 13:31.920 --> 13:38.720 to iSCSI tgt. Okay, so now, because I think it's kind of complicated: 13:39.760 --> 13:48.800 for this metric, we replaced the frontend part, and we still used the old communication protocol, 13:50.240 --> 13:56.080 but we didn't do the actual I/O; we stopped the data path here, and a 13:56.960 --> 14:08.960 no-op is done. So here we see we lose some of the IOPS of ublk; we're down to 100K IOPS. 14:09.920 --> 14:16.400 Replacing the old communication method with our new communication, we go up a lot in IOPS, 14:16.400 --> 14:29.040 and now, doing the actual I/O: with the Linux sparse file implementation, versus our custom implementation, 14:29.040 --> 14:38.800 you drop back to 128K IOPS read and 38K IOPS write.
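A minimal Go sketch of the ID-channel plus message-array scheme just described (identifier names are ours, not Longhorn's, and the real change also has to handle timeouts and connection errors): a buffered channel is pre-filled with every slot index, so receiving an ID reserves a slot, and two goroutines can never write the same array entry concurrently, with no map or global lock on the hot path.

```go
package engine

// Message is a simplified in-flight request; the real message type
// carries offsets, buffers, and protocol headers.
type Message struct {
	ID   int
	Data []byte
}

// tracker hands out slot IDs through a buffered channel pre-filled
// with every index of a fixed array.
type tracker struct {
	ids      chan int
	inflight []*Message
}

func newTracker(slots int) *tracker {
	t := &tracker{ids: make(chan int, slots), inflight: make([]*Message, slots)}
	for i := 0; i < slots; i++ {
		t.ids <- i // every slot starts out free
	}
	return t
}

// send reserves a slot, records the message, and forwards it to the
// write thread right away, with no round-trip through a central loop.
func (t *tracker) send(m *Message, toWriter chan<- *Message) {
	id := <-t.ids // blocks only when all slots are in flight
	m.ID = id     // the ID travels with the request and returns in the reply
	t.inflight[id] = m
	toWriter <- m
}

// complete matches a reply to its request by ID and recycles the slot.
func (t *tracker) complete(id int) *Message {
	m := t.inflight[id]
	t.inflight[id] = nil
	t.ids <- id
	return m
}
```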
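And a hypothetical sketch of the DBS ideas described above: four on-disk regions, with extent allocation served from an in-memory bitmap so that allocating space never has to touch the disk. All names, sizes, and offsets here are illustrative assumptions, not DBS's actual on-disk format.

```go
package dbs

import "math/bits"

// layout marks the start of the four regions the talk describes.
type layout struct {
	superblockOff uint64 // device identity and format version
	volumeMetaOff uint64 // one record per volume
	extentMetaOff uint64 // persistent extent ownership records
	dataOff       uint64 // the extents holding user data
}

// extentBitmap is an in-memory free/used map of extents; keeping it in
// memory is what lets allocation avoid disk I/O entirely.
type extentBitmap []uint64

// allocate finds and claims the first free extent, returning its index.
func (b extentBitmap) allocate() (int, bool) {
	for w, word := range b {
		if word != ^uint64(0) { // this word still has a zero (free) bit
			bit := bits.TrailingZeros64(^word)
			b[w] |= 1 << uint(bit)
			return w*64 + bit, true
		}
	}
	return -1, false // no free extent left
}

// free releases an extent back to the bitmap.
func (b extentBitmap) free(idx int) {
	b[idx/64] &^= 1 << uint(idx%64)
}
```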
14:39.440 --> 14:50.800 Introducing DBS in this implementation takes the system to 150K IOPS, so as you can see, we practically 14:50.800 --> 15:01.680 triple the performance of the system using three major changes. Okay, so in conclusion: 15:02.560 --> 15:08.960 ublk has the biggest impact in accelerating I/O; it's a new technology using a state-of-the-art 15:08.960 --> 15:22.960 solution for doing I/O. So the question is: is NVMe-oF necessary? When you are not doing over-the-network 15:22.960 --> 15:30.720 operations, it's actually not, as ublk has the capabilities to outperform it in some situations. 15:31.680 --> 15:37.520 Second, you can do further performance improvements, especially in the controller; as you can see, 15:37.520 --> 15:44.400 the performance dropped massively inside the controller, and we are working on optimizations and adding more 15:44.400 --> 15:52.560 features. And in the last part, DBS is actually nothing novel, but it provides a helpful utility for other 15:52.560 --> 16:04.160 projects too. Okay, so some pull requests have been submitted; DBS and the other projects can be found 16:04.160 --> 16:13.440 on the CARV-ICS-FORTH GitHub, where you can also find the modified code of Longhorn 16:13.600 --> 16:27.040 as well. I want to acknowledge the backers funding this work. Thank you. Any questions? 16:27.520 --> 16:29.520 Yes? 16:33.520 --> 16:40.480 Sorry, I don't know Longhorn very well, but you said a volume is implemented by one controller and then a number 16:40.480 --> 16:47.120 of replicas. Yes? To support high availability, so on failure or something? 16:48.080 --> 16:52.720 Whether you have availability on failure? Yes, yes, yes, all that is done by the manager. 16:53.680 --> 17:01.280 Longhorn is... I don't know how to put it. Okay, can you state the question again for me a 17:01.280 --> 17:11.440 bit? Well, I'll explain what I mean again: you have the single controller and 17:11.440 --> 17:18.880 multiple replicas of your data, and that allows for the distributed high availability of the volume, 17:18.880 --> 17:25.040 is that right? Yes. Can you repeat the question? Yeah, you have to 17:25.040 --> 17:30.640 repeat it for the microphone. I get it, but I don't remember the whole question. Do you see the same 17:30.640 --> 17:35.600 performance improvements when you're using distributed multiple replicas? 17:35.600 --> 17:42.880 Multiple replicas. Okay, so the question is if using multiple replicas has the same effect, if 17:43.840 --> 17:50.480 our change has the same effect using multiple replicas. Okay, we haven't used multiple replicas; 17:50.480 --> 17:58.880 the only bottleneck when using multiple replicas is the network. Using a bigger, better connection 17:58.880 --> 18:05.040 for the network, it's not the bottleneck anymore, so yeah, you're going to have the same effects. 18:06.000 --> 18:12.000 Thank you. But I haven't done any actual numbers for that. 18:18.240 --> 18:24.960 Any other questions? Yes? Do these performance improvements carry over, given 18:25.040 --> 18:29.760 the direction the project is taking? What was the question? 18:31.760 --> 18:39.600 Ah, you asked me about the comparison to version two, how it compares to 18:39.600 --> 18:45.920 what we did on version one and the performance we achieved? Okay, the question is if this is 18:45.920 --> 18:51.680 version one or version two of Longhorn, right?
Yeah, this is version one; version two of 18:52.320 --> 19:00.240 Longhorn will use SPDK and NVMe-oF. On GitHub, I can see the issues, and they want to improve 19:00.240 --> 19:07.520 version one too, so that's why we focused on version one. Version two, with SPDK and 19:07.520 --> 19:13.200 NVMe-oF, will solve the problem of the frontend, but most of the other parts are not updated. 19:13.200 --> 19:18.800 I mean, the communication between the controller and the replica is not part of the improvements 19:18.800 --> 19:24.000 in version two, and they will also still use the Linux sparse files in the end. 19:26.160 --> 19:33.280 So there's also a need to improve version one, but version two is really good. 19:37.920 --> 19:38.800 Yes? 19:38.800 --> 19:53.600 I think that's configurable. Sorry, you asked me if NVMe-oF is over TCP? 19:54.160 --> 20:00.320 In Mayastor, no? Mayastor, from OpenEBS. Mayastor is using something like that. 20:02.640 --> 20:13.280 Okay, I think when you are on-prem, when you have the controller on the same 20:13.280 --> 20:17.680 node, you don't need to use TCP; you can just access it directly. 20:23.600 --> 20:36.960 I'm not sure, I'm not familiar with Mayastor's implementation.