WEBVTT

00:00.000 --> 00:29.960
You guys can hear me, everyone?

00:30.400 --> 00:32.240
Should I be louder?

00:32.240 --> 00:33.920
All good.

00:33.920 --> 00:42.520
Hello everyone, I'm Haris and this is Jinpu, and we are kernel maintainers for RNBD and RTRS,

00:42.520 --> 00:49.280
kernel modules that work in the storage and RDMA domain together, and we work for IONOS Cloud

00:49.280 --> 00:54.840
Germany. We work from the Berlin office, and today we will be presenting one of the kernel

00:54.840 --> 01:00.360
modules that I just mentioned: RTRS, a reliable high-speed transport library

01:00.360 --> 01:04.360
over RDMA.

01:04.360 --> 01:09.760
So everyone would have heard about RDMA. It may be the buzzword nowadays, but it's

01:09.760 --> 01:15.240
a very important technology that is being used. It's one of the biggest standards

01:15.240 --> 01:22.160
being used in the high-performance domain and the AI domain; all the machine learning things that

01:22.160 --> 01:29.160
are being developed right now are mostly based on RDMA technology, and for good reason, because

01:29.160 --> 01:35.560
it is low latency, high throughput, it uses very little CPU, and you can basically pump a large

01:35.560 --> 01:40.760
amount of data through the network. You can do zero copy: it doesn't have to

01:40.760 --> 01:46.680
copy data, it can just map the buffers that you want to send through RDMA.

01:46.760 --> 01:53.680
That's all well and good, it's fast, it's nice, but it is a little difficult to work with

01:53.680 --> 01:57.760
because there are a lot of things that you need to do. For starters, for example, you have to start

01:57.760 --> 02:02.200
working with protection domains. When you want to establish the sessions, you might have

02:02.200 --> 02:08.640
to create queue pairs, you have to create completion queues, shared queues if you want that kind of

02:08.640 --> 02:13.600
completion technology, and that's just the setup part, right? When you want to send data,

02:13.600 --> 02:21.240
when you send IO, you will have to do your own DMA mapping, you have to manage memory regions,

02:21.240 --> 02:27.400
and that is the good part. And then you have events that are happening in the network, for

02:27.400 --> 02:31.200
example a disconnect event or an error event. You have to manage all those events through

02:31.200 --> 02:36.840
the connection manager, so it's all a little difficult to handle, and you obviously have

02:36.840 --> 02:43.560
to react to those network events and do your own managing of the reconnection and error parts. And

02:43.600 --> 02:50.240
the code is also sensitive: you see, it's designed in a way to work with an SG list, but that

02:50.240 --> 02:54.160
shouldn't stop anyone, because you can basically just map a buffer, whatever you want to

02:54.160 --> 03:02.040
send, your header or your IO, in an SG list and send it across. And so using RTRS you see

03:02.040 --> 03:08.280
that all the complex things that you want to avoid while using RDMA are just hidden

03:08.280 --> 03:14.200
below, and RTRS basically does everything for you: it does the connection management,

03:14.200 --> 03:22.280
it does the completion queue management, it manages memory regions for you, and all the

03:22.280 --> 03:31.400
DMA mappings and RDMA events that I was speaking about, right? So RTRS is

03:31.480 --> 03:38.040
basically a very simple client-server architecture, and you have
the module on the client side, where

03:38.040 --> 03:42.360
you have to do an open, just like a socket. You have the module on the server side also, where

03:42.360 --> 03:48.520
you also have to do an open to start listening for your connection, and basically that's it. If you

03:48.520 --> 03:53.640
do an open on the server side and then an open on the client side, giving the IP address or

03:53.640 --> 03:57.960
the GID, for example, in RDMA, you'll have a connection in between, and then you can start sending

03:58.200 --> 04:07.720
files. RTRS allows you to have multiple paths for multipathing, and you can basically design

04:07.720 --> 04:13.960
your architecture in a way that different paths go through different network links, so you have

04:13.960 --> 04:23.720
redundancy in the network also. And the paths internally have connections that are per

04:24.120 --> 04:29.000
CPU, so you can have IRQ pinning for efficient performance and data transfer.

04:31.560 --> 04:39.000
So, as I was saying, the main, highest logical relationship is the session,

04:39.000 --> 04:46.040
which you can establish by giving a unique name, and internally it has its own UUID.

04:46.040 --> 04:50.600
And in the session you can have multiple paths, which I just mentioned, through different

04:50.760 --> 04:57.080
network ports and network paths if you want, and each path will then have a queue pair which is linked

04:57.080 --> 05:02.440
per CPU, so for every CPU on the client side you create a connection.

05:02.440 --> 05:12.760
So it basically utilizes all your resources and also improves your completions.

05:12.760 --> 05:22.120
So generally this is how the header is laid out. You have the RTRS message, but you don't have to

05:22.120 --> 05:27.960
worry about that. What you have to send is an optional header if you want; for example, you have

05:27.960 --> 05:31.800
modules on both sides that want to communicate what you're sending, whether
it's an IO message

05:31.800 --> 05:36.760
or a management message. So you can have your own header and you can have your own data:

05:36.760 --> 05:42.360
you send the header through a vector and you send the data through an SG list, which you can create

05:42.360 --> 05:52.040
out of a buffer or anything. And this is the general handshake protocol. You have the

05:53.320 --> 05:56.920
(not very important, because it's all being handled internally) connection

05:56.920 --> 06:05.400
request and response, and then an info exchange in between. Yes, so one more important thing is that

06:05.400 --> 06:11.320
RTRS internally has its own heartbeat mechanism, which lets you make sure that the connection is on

06:11.400 --> 06:15.800
all the time, even when you're not using it. And suppose your IO fails: what it's

06:15.800 --> 06:19.560
going to do, if you have multiple paths, is internally fail over to the other path, so

06:19.560 --> 06:24.680
you won't even know that something has happened. Whatever in-flight IO you had on

06:24.680 --> 06:29.640
the failed path, it's going to stop; it sees that the path has failed, it fails over to

06:29.640 --> 06:34.520
the other path, and it also initiates a reconnect mechanism for the failed path. If you had a

06:34.520 --> 06:40.040
network disruption and it comes back, your path would basically be established again successfully.

06:42.760 --> 06:49.160
Yeah, I will take over for the second part. So

06:49.160 --> 06:56.920
RTRS supports different multipath policies. We have round-robin, which just cycles around

06:57.000 --> 07:06.760
the paths available, and we also have min-inflight, where RTRS tracks the path with the fewest active

07:06.760 --> 07:13.880
requests and will pick that path for the next IO, and the third is latency-based: it

07:13.880 --> 07:20.920
internally uses this heartbeat mechanism to track the minimal latency for the

07:20.920 --> 07:33.160
fastest
path. So when the heartbeat detects there's a failure on the

07:33.160 --> 07:40.760
path, it will automatically pick the healthy path, so it will fail over directly to

07:41.480 --> 07:47.000
the healthy path, and so the customer won't notice there's a failure.

07:47.720 --> 07:55.880
And when the path is recovered, it will automatically reconnect, and so it gets to

07:58.600 --> 08:12.120
so multipath works again. Yeah, we have an important

08:12.120 --> 08:20.360
configuration knob, which is a tradeoff between performance and security, called always invalidate.

08:21.160 --> 08:30.440
It means for every IO the server side will invalidate the buffer, and so

08:30.760 --> 08:43.640
it's indicated by a message called RTRS message rkey response, and it defaults to

08:43.640 --> 08:52.920
on, so it's safer, and this has its own performance cost. When you finish your

08:53.880 --> 08:59.720
transfer, you just call this RTRS client close to close that session.

09:03.720 --> 09:13.000
Here, the server side is mostly similar, just a different structure. So there's an RDMA

09:13.000 --> 09:21.400
event callback and there's a link event callback. So in the RDMA field there are different

09:21.400 --> 09:30.840
events generated, so you just need to handle them. RTRS does handle it transparently inside.

09:33.640 --> 09:44.280
So you first need to define the structure and then call the RTRS

09:44.280 --> 09:54.520
server open call, and you will get back this RTRS context to handle the data transfer.

09:55.880 --> 10:04.920
Yeah, the link event simply handles the events generated from the underlying low-level

10:04.920 --> 10:15.160
data events, and when you finish, you call RTRS server close to close the session.

10:21.640 --> 10:30.200
Yeah, we get to this deep dive into the IO path. So basically you need to define the message

10:30.200 --> 10:39.960
and define the confirmation callback; this is the message, this is our confirmation callback, and
we then

10:39.960 --> 10:48.840
send the request. Because we reserve a resource on the server side, a memory resource to hold

10:48.840 --> 10:58.680
this IO, you need to get a permit, so you don't overload the buffer. And you define

11:00.840 --> 11:09.800
the type of connection: there are different types, called IO connection or admin connection. The

11:09.800 --> 11:23.400
admin one is mostly for the management commands, and the other is for general IO. It also has different modes: you can

11:23.400 --> 11:35.080
wait or not wait. [inaudible] servers in our data center, and you can also see

11:35.080 --> 11:44.120
in the performance graph that it scales pretty well: as the number of jobs increases or the number of devices increases,

11:44.120 --> 11:55.480
it's mostly linear, but when there's a multi-NUMA architecture the slope is not so steep.

11:58.920 --> 12:09.240
Yeah, so basically the key use case for RTRS is to give a general

12:10.200 --> 12:23.480
in-kernel library to do RDMA, and you can also use it for AI and machine learning

12:23.480 --> 12:32.920
scenarios to transfer data. And because it's mostly generic (we currently use it for

12:33.640 --> 12:43.320
RNBD; there's a separate module, RNBD), it can be reused for other modules too.

12:45.880 --> 12:55.640
We just call on others to get familiar with the code; you can just run and test it, and also reuse the

12:55.640 --> 13:13.080
code as needed. I think we're almost done, and if there's any question you can ask. Here are our contacts,

13:13.080 --> 13:15.080
the emails, and thank you.

13:30.280 --> 13:38.760
Hi, thanks for the talk. I have a quick question: is RTRS available to user space through,

13:38.760 --> 13:42.760
you know, the uverbs interface, or is it only a kernel-level feature?

13:44.760 --> 13:50.760
It's in-kernel right now. We do plan to have an interface, basically either using,

13:50.760 --> 13:54.760
you know, something, or maybe a user space block device through ublk,

13:54.760 --> 14:01.000
to start using RTRS in kernel,
but right now it's not available to user space. And

14:01.000 --> 14:06.840
for the fabric, do you rely on InfiniBand or something else? It doesn't matter, you can use

14:06.920 --> 14:10.840
InfiniBand, you can use RoCE; it doesn't matter whether you're using

14:10.840 --> 14:18.840
Mellanox or Broadcom, it just needs to be available to use through the IB verbs.

14:18.840 --> 14:28.840
Okay, thank you. Any other questions, anyone? Okay.

14:28.840 --> 14:40.600
Maybe I missed something: is this library designed to export a block device at all? Can I

14:41.400 --> 14:49.640
read or write any block from this device? Yes, so this was designed

14:52.600 --> 14:58.360
paired with another kernel module called RNBD, and you would see that. So basically when you use

14:58.440 --> 15:05.000
RNBD it internally uses RTRS, and through that you can export block devices through RDMA. It's

15:05.000 --> 15:12.920
very similar to NVMe-oF. The difference is that our performance tests show that it

15:12.920 --> 15:19.320
performs better than NVMe-oF, basically. So that other module is also available in the

15:19.320 --> 15:26.360
kernel; if you want to export block devices across an RDMA network you can try that. It's RNBD over RTRS.

15:26.760 --> 15:28.360
Okay, thank you.

15:34.200 --> 15:36.200
Okay, thank you very much. Round of applause.
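
The path selection behaviour described in the talk (cycling over the available paths, preferring the path with the fewest active requests, preferring the lowest measured latency, and skipping a failed path until it reconnects) can be illustrated with a small toy model. This is only a sketch in Python, not RTRS code: the class names, field names, and policy strings below are hypothetical and chosen for illustration, and the real kernel module works with RDMA connections rather than plain objects.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical stand-in for one path of a session."""
    name: str
    connected: bool = True     # cleared when a failure is detected
    inflight: int = 0          # requests submitted but not yet completed
    latency_us: float = 0.0    # last measured latency, e.g. from a heartbeat


class Session:
    """Toy session that picks one healthy path per IO request."""

    def __init__(self, paths, policy="round-robin"):
        self.paths = list(paths)
        self.policy = policy
        self._cursor = 0  # next index to try for round-robin

    def pick_path(self):
        healthy = [p for p in self.paths if p.connected]
        if not healthy:
            raise ConnectionError("no connected path in session")

        if self.policy == "round-robin":
            # Walk the full list from the cursor and skip failed paths,
            # so a disconnected path is simply passed over (failover).
            n = len(self.paths)
            for i in range(n):
                idx = (self._cursor + i) % n
                if self.paths[idx].connected:
                    self._cursor = (idx + 1) % n
                    return self.paths[idx]
        if self.policy == "min-inflight":
            return min(healthy, key=lambda p: p.inflight)
        if self.policy == "min-latency":
            return min(healthy, key=lambda p: p.latency_us)
        raise ValueError(f"unknown policy: {self.policy}")


if __name__ == "__main__":
    a = Path("a", latency_us=50.0)
    b = Path("b", latency_us=20.0)
    session = Session([a, b])
    print([session.pick_path().name for _ in range(4)])
    b.connected = False          # simulate a failure on path b
    print([session.pick_path().name for _ in range(2)])
```

In this model, failover and recovery both come from the `connected` flag: clearing it on a path takes that path out of every policy's candidate set, and setting it again (the toy equivalent of a successful reconnect) makes the path eligible for the next request.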