Thank you, everyone, for coming to this talk. The objective of this talk is to give a few pointers and ideas on how you can maximize performance when using gRPC, and specifically the Go implementation of gRPC.

A lot of people are already familiar with gRPC, but I will start with a short introduction to re-explain what it is. gRPC is an RPC protocol made by Google that takes a schema-first approach. You first define your service specification in a protobuf file: you list all of the endpoints that exist on your service and all of the message types that they take as input and produce as output. From this specification you can then generate client and server code in any language that you want. It means that with gRPC it is easy to have cross-language support, to have a client written in one language talking to a server in another language, without having to do much work. gRPC is built on top of HTTP/2 and it uses binary encoding for the payloads. It supports either unary RPCs or streams.

So let's have a look at how a unary request works. I define in a proto file my service that has a single endpoint, CreateUser. It takes a specific message type as input, CreateUserRequest, and returns a specific message type as output. From this protobuf specification I can invoke the protobuf compiler to generate the associated Go code, and it will generate a Go struct with all of the fields that I have defined in the protobuf definition.

With this generated code it is easy to implement the server itself. I just need to create a type that has a CreateUser method whose signature takes my message type as input and returns the message type as output. The implementation here is very dumb: it just echoes back all of the fields of the request in the response.

Now that we have this implementation, we can run a quick benchmark to see how efficient it is. To do that, I use the standard Go benchmark tooling and create a benchmark function. First I set up the server: I bind to a localhost socket, create a new gRPC server, attach the implementation that I showed previously, and ask it to serve requests.
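Before looking at the client side, here is a minimal sketch of an echo-style handler like the one described above. The package path, service name, and the Name and Email fields are assumptions for illustration, not the talk's actual code.

```go
import (
	"context"

	pb "example.com/myservice/gen" // hypothetical import path of the generated code
)

// userServer implements the generated UserServiceServer interface.
type userServer struct {
	pb.UnimplementedUserServiceServer
}

// CreateUser echoes the request fields back in the response.
func (s *userServer) CreateUser(ctx context.Context, req *pb.CreateUserRequest) (*pb.CreateUserResponse, error) {
	return &pb.CreateUserResponse{Name: req.GetName(), Email: req.GetEmail()}, nil
}
```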
On the client side, I connect to the same localhost socket and create a client object from it. With this client object I can then benchmark the act of calling client.CreateUser. While this looks like a regular function call, it is actually doing a remote procedure call, with all of the following steps. It first marshals the request object to transform it into a byte slice, which is its wire representation, and sends that over the network connection. The server receives it, unmarshals it to get back the request object, and executes our handler implementation, which was the basic echo that I showed earlier. It then gets a response object that it marshals to its wire representation and sends over the network. The client receives it and finally unmarshals it to get a response object that can be returned from the function.

We can run the benchmark to get a rough idea of the performance of this basic implementation. It takes roughly 36 microseconds to do all of these steps. So the overhead of a gRPC call seems to be around 36 microseconds, because I do nothing interesting in my handler. That's actually not that slow, which is good. But it is possible to reduce this time.

If we look back at all of the steps taken in our benchmark, a lot of them are about marshalling and unmarshalling objects to and from their wire representation. By default, what the gRPC library does is use reflection: it iterates over all of the public fields of the Go struct and marshals them one by one. While this works, reflection isn't the fastest thing. So it is possible to change the codec that is used to marshal and unmarshal objects, and use a different implementation that may be faster. To do that, grpc-go as a library exposes a codec interface that has two methods, Marshal and Unmarshal, and it is possible to pass a user-defined codec implementation to do this transformation.

It's not that hard, and it's possible to make it faster. To do that, I will introduce the vtprotobuf plugin, which will help us do faster marshalling and unmarshalling. This is an open-source project; it is a protobuf compiler plugin, so it takes your protobuf definition and generates additional code.
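For reference, the unary benchmark described a moment ago could be sketched as follows. The listener address, the registration helpers, and the request fields are assumptions; the overall shape (an in-process server, a client, and a b.N loop around the RPC) is what the talk describes.

```go
import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/myservice/gen" // hypothetical generated package
)

func BenchmarkCreateUser(b *testing.B) {
	// Server side: listen on a local socket and serve the echo implementation.
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		b.Fatal(err)
	}
	srv := grpc.NewServer()
	pb.RegisterUserServiceServer(srv, &userServer{})
	go srv.Serve(lis)
	defer srv.Stop()

	// Client side: dial the same socket and build a client object.
	conn, err := grpc.NewClient(lis.Addr().String(), grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		b.Fatal(err)
	}
	defer conn.Close()
	client := pb.NewUserServiceClient(conn)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Each iteration performs the full marshal / network / unmarshal round trip.
		if _, err := client.CreateUser(context.Background(), &pb.CreateUserRequest{Name: "test"}); err != nil {
			b.Fatal(err)
		}
	}
}
```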
Using vtprotobuf is fairly straightforward. In your build setup, where you invoke the protobuf compiler with the Go plugin and the gRPC plugin, you just add one more plugin, the vtprotobuf plugin. When we generate the Go code from our proto file, it will generate an additional file called my_service_vtproto.pb.go. This file contains additional methods on top of the Go structs that represent our protobuf objects, and these methods can be used to implement a custom codec that is faster.

So we want to implement a codec that has a Marshal method and an Unmarshal method. We create such a type and delegate the actual marshalling and unmarshalling to specific methods of the protobuf Go types, namely MarshalVT and UnmarshalVT. These two methods are generated by the vtprotobuf plugin. They contain a hand-rolled implementation of the actual marshalling and unmarshalling that does not rely on reflection. It means that we don't have to use reflection, and this implementation is much more friendly to the inliner and to the compiler.

Once we've defined this type that implements the codec interface, we can tell the gRPC library to actually use it. This can be done at any time, here with a call to encoding.RegisterCodec, or it can also be done when you create a gRPC server object or a client object.

With this change we can run our benchmark again. While it used to take around 36 microseconds to do the whole RPC sequence, it now takes around 33 microseconds. To make sure that we have a real difference and it's not just noise in the measurement, we can run the benchmark several times with the -count option and save the output to a file. Then we can ask benchstat to compare the results that we got: benchstat will do a statistical analysis on those results to confirm or deny that there is indeed a difference between our two implementations. In this case it confirms that there is a statistically significant difference, and that with the vtprotobuf codec we use 12% less CPU time when running our benchmark.

So with very little code, just enabling one more plugin and writing a simple codec implementation, we managed to reduce a bit of the overhead of the gRPC stack. And there was no need to change our handler implementation.
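As a concrete illustration, a codec along those lines might look like the sketch below. The vtMessage interface is my own naming for the methods vtprotobuf generates; registering under the name "proto" overrides the default reflection-based codec.

```go
import (
	"fmt"

	"google.golang.org/grpc/encoding"
)

// vtMessage is satisfied by types generated with the vtprotobuf plugin.
type vtMessage interface {
	MarshalVT() ([]byte, error)
	UnmarshalVT(data []byte) error
}

type vtCodec struct{}

// Name returns "proto" so that this codec replaces the default one.
func (vtCodec) Name() string { return "proto" }

func (vtCodec) Marshal(v any) ([]byte, error) {
	m, ok := v.(vtMessage)
	if !ok {
		return nil, fmt.Errorf("message %T does not implement vtprotobuf methods", v)
	}
	return m.MarshalVT()
}

func (vtCodec) Unmarshal(data []byte, v any) error {
	m, ok := v.(vtMessage)
	if !ok {
		return fmt.Errorf("message %T does not implement vtprotobuf methods", v)
	}
	return m.UnmarshalVT(data)
}

func init() {
	encoding.RegisterCodec(vtCodec{})
}
```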
So this was for unary requests. Now let's have a look at gRPC streams, and specifically at gRPC streams that send a larger amount of data. In my service definition I have two RPCs: Put, where we send a stream from the client to the server, and Get, where we send a stream from the server to the client. In gRPC a stream is just a sequence of protobuf messages, and specifically here what we consider to be a message in our stream is a Chunk that has a single field, which is a byte slice.

From this definition we can write a basic implementation. In Put we just iterate over the stream and discard what we receive on it, up until the stream gets closed. In Get we generate a random byte slice and send it over the stream. We will work with chunks of 16 kilobytes, for a total stream size of 16 megabytes.

So we can have a look at how fast that basic implementation of gRPC streams is. This time I will not use the Go benchmark tooling; I will have a separate client and server and look at system metrics from the machines as well as Go runtime metrics. Our test setup has two virtual machines, one for the client and one for the server. Each of them has two virtual CPUs, and they share a five gigabit per second network link.

When stressing the gRPC server with a load generator, we can see that we manage to saturate this five gigabit per second network link, and that to do so we consume around 1.3 CPU cores. So that's good: our basic, naive gRPC implementation can saturate a fairly high-speed network link, but it costs some CPU. If we compare with another project, for example Caddy, which is a popular HTTP/2 server written in Go, we can see that Caddy is able to saturate the same network link for only about 0.2 CPU cores. So there is a large discrepancy between what our gRPC implementation does and what Caddy, which is well optimized, does. Even though gRPC is a somewhat more involved protocol than plain HTTP/2, it would still be nice if we were able to reduce the CPU that we consume to serve it.

The problem is even worse with the Put stream, when we push data to the server: when the server receives data, it needs to consume 1.6 CPU cores to saturate this network link. So the question is: where is the CPU consumed? What is our CPU doing on this workload to consume that much? The good thing about Go is that answering this question is fairly straightforward.
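Before profiling it, here is roughly what the naive streaming handlers described above could look like. The service and message names (DataService, Chunk, and the request and response types) are assumptions, and error handling is kept minimal.

```go
import (
	"crypto/rand"
	"io"

	pb "example.com/myservice/gen" // hypothetical generated package
)

const (
	chunkSize = 16 * 1024        // 16 KiB per message
	totalSize = 16 * 1024 * 1024 // 16 MiB per stream
)

type dataServer struct {
	pb.UnimplementedDataServiceServer
}

// Put drains the client-to-server stream, discarding every chunk.
func (s *dataServer) Put(stream pb.DataService_PutServer) error {
	for {
		_, err := stream.Recv()
		if err == io.EOF {
			return stream.SendAndClose(&pb.PutResponse{})
		}
		if err != nil {
			return err
		}
	}
}

// Get sends a random payload to the client, one chunk at a time.
func (s *dataServer) Get(req *pb.GetRequest, stream pb.DataService_GetServer) error {
	buf := make([]byte, chunkSize)
	if _, err := rand.Read(buf); err != nil {
		return err
	}
	for sent := 0; sent < totalSize; sent += chunkSize {
		if err := stream.Send(&pb.Chunk{Data: buf}); err != nil {
			return err
		}
	}
	return nil
}
```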
You can just run the CPU profiler on your running gRPC server to get an answer. If you do that, you get this kind of flame graph. There is one big tower in the middle, which is actually doing the syscalls to read data. While it could be optimized, I will not have a look at that today; this is kernel CPU time. But I will have a look at the other towers, and specifically this one: they are all related to garbage collection or memory allocation.

And indeed, if we look at Go runtime metrics, you can see that our gRPC server, while under load, allocates at a rate of 11 gigabits per second. When we are serving 5 gigabits per second of throughput, we are actually internally allocating 11 gigabits per second of data. Due to this high allocation rate, we need to run the GC very often: during our stress test the GC was running between 4 and 5 times per second. This also comes from the Go runtime metrics. And while the Go GC is quite efficient and fast, if you run it several times per second it starts to add up and become a significant amount of CPU time.

So now the question is: where does this memory allocation come from, and can we reduce it? Once again, in Go it is very straightforward to get an answer: we can just run a memory profile on our running gRPC server. Specifically, the allocations seem to come from a single receive function inside the gRPC library. This function has two parts. The first, receive and decompress, reads data from the network connection and puts it into a byte buffer: for each incoming message a buffer is allocated on the heap and the wire representation is written into it. The second part unmarshals, that is, it turns this wire representation into a Go struct.

Specifically, for the receive-and-decompress part, by default every time a protobuf message is received by the gRPC library, it allocates a byte slice and writes the message's wire format to it. So for every message that is received by the gRPC library, we need to heap allocate. It would be nice if there was a way to tell the gRPC library to not heap allocate for every new message that is received, but instead to try to reuse memory internally. And that is actually something that is possible: you can configure the grpc-go library to do so.
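As an aside, profiles like the ones above are typically gathered by exposing the pprof endpoints on the server; this is a generic sketch, not necessarily the setup used in the talk.

```go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

// startProfiling serves pprof on a side port next to the gRPC server.
func startProfiling() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

// Then, from a shell:
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30   # CPU profile
//   go tool pprof http://localhost:6060/debug/pprof/heap                 # memory profile
```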
If you are using grpc-go with a version lower than 1.66, there is an experimental option that can be used when creating a server or a client. If you use experimental.RecvBufferPool, you tell the gRPC library to use a memory buffer pool internally when receiving new messages. If you are using a recent enough version of grpc-go, which I would strongly advise, then you actually don't have to do anything: it is now the default behavior. With this memory pool we don't have to heap allocate for every new message that is received by the gRPC library; the memory can be reused across different messages. And if we run our benchmark setup again, you can see that our allocation rate is lower: we only allocate at around 5 gigabits per second, compared to the 11 gigabits per second that we had previously. And as a result we consume less CPU time when serving the same amount of traffic.

So this was one half of the answer. For the other half, we need to look at how to reduce memory allocation when we unmarshal. In our basic implementation, what we did was call stream.Recv in a loop, up until the gRPC stream gets closed. What stream.Recv does is return a new heap-allocated Chunk, whose single data field is a byte slice that contains what was sent over the network. So every time we call stream.Recv, we heap allocate a Chunk, and we also need to heap allocate when we unmarshal: the slice that contains the actual payload has to be heap allocated so that it can be linked to the data field.

It is possible to write a different implementation of Put that does not have this issue. To do that, we stop using the stream.Recv API and use stream.RecvMsg instead. The big difference is that stream.RecvMsg takes a Chunk as an argument and unmarshals into it. If we are able to pass a Chunk whose data field is a byte slice of length zero but with free capacity, then we can unmarshal into this free capacity without having to do a heap allocation. To do that, we pool our Chunk messages using the vtprotobuf pool helpers.

When we get a Chunk from the vtprotobuf pool, there are two options. If the pool was empty, we heap allocate a new Chunk: its data field is the nil slice, with length zero and capacity zero, so when we unmarshal we need to create a new slice and attach it to the data field, which is exactly what was happening before. Or, if we get lucky, the Chunk we get from the pool has a data field that is a slice of length zero but with some extra free capacity that was previously allocated. If that is the case, when we unmarshal into this slice we can write directly into the free capacity, without having to ask the runtime for a new heap allocation. And so with this approach we don't have to make a new heap allocation for every message that we receive on the stream.
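A sketch of what this revised Put handler can look like, assuming the Chunk message was generated with the vtprotobuf pool feature (which provides ChunkFromVTPool and ReturnToVTPool) and that the vtprotobuf codec from earlier is registered, since its UnmarshalVT reuses the existing capacity of the data field.

```go
import (
	"io"

	pb "example.com/myservice/gen" // hypothetical generated package
)

// Put receives chunks into pooled messages instead of allocating a new one per Recv.
func (s *dataServer) Put(stream pb.DataService_PutServer) error {
	for {
		chunk := pb.ChunkFromVTPool() // reuses a previously returned Chunk when possible
		err := stream.RecvMsg(chunk)  // unmarshals into the pooled message, reusing its capacity
		if err == io.EOF {
			chunk.ReturnToVTPool()
			return stream.SendAndClose(&pb.PutResponse{})
		}
		if err != nil {
			chunk.ReturnToVTPool()
			return err
		}
		// ... process chunk.Data ...
		chunk.ReturnToVTPool() // hand the backing buffer back for the next message
	}
}
```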
By running our benchmark again, we can see that with this change we have dramatically reduced the allocations of our gRPC server. We now only allocate around 20 to 30 megabits per second of data, where we used to allocate several gigabits per second. Thanks to this change, the GC rate is much more reasonable: it only runs around once per second. And with that, the CPU consumed when saturating the network is around 0.7 CPU cores. It used to be 1.6, so with this simple change we more than halved the amount of CPU that is spent in the gRPC internals.

We can now have a look at the Get workload, which suffers from roughly the same issue. On the Get workload, when there is a stream from the server to the client, it allocates a lot of memory: around 5 gigabits per second when we serve 5 gigabits per second. As we run this workload, it consumes a lot of CPU, and it's the same issue as previously, but this time the memory allocation comes from a different place. All of it happens when calling Marshal to transform our Go structs into their wire representation and send them over the network.

The reason for that can be found in the API that the gRPC library uses for the codec. Specifically, Marshal takes a Go struct and returns a byte slice. As may have been shown in previous talks, returning a byte slice means that you need to heap allocate it: there is no way for the caller and the callee of this API to cooperate to avoid the memory allocation. So every time we marshal an object through this interface, we need to heap allocate as much memory as its wire representation takes.

Thankfully, gRPC recently introduced a CodecV2 interface. It's roughly the same thing as the codec interface: there is still a Marshal method, but this time, rather than using a standard byte slice, it uses a gRPC-specific type called mem.BufferSlice. A mem.BufferSlice is basically a reference-counted byte slice that the gRPC library can reuse and pool internally, so that it does not have to heap allocate new buffers every time. So we can write a codec that implements this CodecV2 interface, once again with the help of the vtprotobuf helpers.

To implement Marshal, we first ask what the size of the wire representation of the object will be. We then get a piece of memory from an internal gRPC memory buffer pool that can hold this wire representation. Then we can just marshal into that slice and return it wrapped in a mem.BufferSlice. This means that the gRPC library can now send this slice of memory, which contains the wire representation of our Go struct, and when it is done with it, it can just decrement a reference count and, if it reaches zero, put the buffer back into the pool so that it can be reused later. So with this change we don't have to make a heap allocation every time we call Marshal.
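A sketch of such a CodecV2, put together from the vtprotobuf-generated SizeVT and MarshalToSizedBufferVT methods and the grpc-go mem package (grpc-go 1.66 or later); the vtMessageV2 interface name is mine, and the exact pooling details may differ from the talk's code.

```go
import (
	"fmt"

	"google.golang.org/grpc/encoding"
	"google.golang.org/grpc/mem"
)

// vtMessageV2 is satisfied by types generated with the vtprotobuf plugin.
type vtMessageV2 interface {
	SizeVT() int
	MarshalToSizedBufferVT(data []byte) (int, error)
	UnmarshalVT(data []byte) error
}

type vtCodecV2 struct{}

func (vtCodecV2) Name() string { return "proto" }

func (vtCodecV2) Marshal(v any) (mem.BufferSlice, error) {
	m, ok := v.(vtMessageV2)
	if !ok {
		return nil, fmt.Errorf("message %T does not implement vtprotobuf methods", v)
	}
	size := m.SizeVT()
	pool := mem.DefaultBufferPool()
	buf := pool.Get(size) // pooled backing array sized for the wire representation
	if _, err := m.MarshalToSizedBufferVT((*buf)[:size]); err != nil {
		pool.Put(buf)
		return nil, err
	}
	// gRPC decrements the reference count when it is done sending, and the
	// buffer then goes back into the pool.
	return mem.BufferSlice{mem.NewBuffer(buf, pool)}, nil
}

func (vtCodecV2) Unmarshal(data mem.BufferSlice, v any) error {
	m, ok := v.(vtMessageV2)
	if !ok {
		return fmt.Errorf("message %T does not implement vtprotobuf methods", v)
	}
	buf := data.MaterializeToBuffer(mem.DefaultBufferPool())
	defer buf.Free()
	return m.UnmarshalVT(buf.ReadOnlyData())
}

func init() {
	encoding.RegisterCodecV2(vtCodecV2{})
}
```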
And so we can run our benchmark again, and we can see that once again our allocation rate has been dramatically reduced. We only allocate a few dozen megabytes per second when serving the same gigabits-per-second load, so the GC rate is much lower and we consume less CPU.

As a small summary, we've seen that it's possible to make a few changes to how you use the gRPC library to reduce CPU consumption. For unary requests, using the vtprotobuf codec lets you replace the default implementation that is based on reflection, and by doing so you can save around 10% of CPU time. For egress streams, streams moving out of the gRPC server, you can reduce memory allocation and thus CPU consumption by a significant factor by just using a recent gRPC version and having a CodecV2 implementation that does not heap allocate for every call. For ingress streams, you can also reduce memory allocation and CPU usage by using either a recent gRPC version or the experimental option on older ones. And if you want to go further, you can change the handler implementation to pool the messages that you receive from the stream.

Apart from this very last point, everything that I explained does not require you to change the actual implementation of your handlers. You just need to change a few lines in your build setup, in how you generate the gRPC code from your proto files, and have one good codec implementation. So with a very small amount of work and code, you can get a significant reduction in CPU.

Thank you for listening. I think we have a few minutes for questions. We have space for one very short question.