[Host] I'm going to do the introduction to this one now. We are still in the deep-dive Go track, and you've heard us talk about garbage collection a lot, which reminds me: if you have garbage you want to collect, the garbage bin is over there. We are a garbage-collected language; I don't want to see any garbage at the end of the day, unless you do reference counting. But we are going to talk about going easy on memory, and Sümer here is an amazing person to talk about this, with a very cute graphic, and he will explain how we can treat the memory of our servers and our computers much better. Round of applause!

[Sümer] Thanks, thank you, I'm really happy to be here. My name is Sümer Cip. I am a software engineer and a full-time dad of three monsters. I contribute to open source as much as I can, but these days it's a bit hard. I'm mostly interested in observability, distributed systems, and databases, and I have spent basically all my life coding; I'm a systems guy. Before Go I was writing profilers in Python as a C extension, but then we as a company decided to implement a continuous profiling ingestion pipeline in Go, which turned out to be an excellent idea.

My main motivation for this talk is that there is tons of material on how garbage collection works and what types of garbage collectors exist, but there is much less emphasis on writing garbage-collection-friendly code. What I mean is: when you're writing code, you should know what kind of tricks you have in your bag, what you should do to write code that is more efficient in terms of garbage collection. And even though I'll use Go, the topic is mostly language-agnostic; you see largely the same things in Java as well.
I aim to be as practical as possible, so let me first start with some real-world data. This is from one of my favorite conferences: Pekka, the CTO of Turso, a horizontally scalable SQLite database, says that avoiding dynamic memory management is the only thing that matters for low latency. It's a bold statement, but I think after some point it becomes true: you optimize all the low-hanging fruit in your code, you do lots of work on your algorithms and data structures, and then what you're left with is memory management, and you need to do much better in that area as well.

This slide is from Uber, like some previous talks; M3 is their horizontally scalable metrics database. I'm not sure you can read everything, but if you sum some of these numbers you will see that half of the time is actually spent on garbage collection, and this is a production system. One other example is from another talk's recording; this is a pprof output, and the first function alone takes about 40% of the time: it's runtime.gcBgMarkWorker, the marking goroutine of the garbage collector. And if you want more, this is from our own continuous profiling ingestion pipeline: about half of the time is the handler code, the ingestion itself, and you can see that 25% of the time is spent on garbage collection as well. And this is a recent Datadog blog post, again about a metrics database (by the way, the Uber profile I showed was M3DB, their metrics database as well); they switched to Rust partly because around 30% of resources were going to garbage collection.

And there are all these GC-friendly libraries, lots of them, zerolog and friends, all advertising themselves as zero-allocation and that kind of thing, and people just download them, which means there is a real need for this. I don't know if you have ever heard the term "death spiral"; that was a term I heard a lot in Java (that was the folks sitting behind me, not myself). A death spiral means your application is basically doing nothing useful, just trying to keep up with the pace of the allocations.

So, a very quick overview, and I'm going fast here, of how memory works in Go. Basically we have two types, stack and heap, and both live in RAM, of course. The stack is pre-allocated memory that grows dynamically: a goroutine starts with a small stack (a couple of kilobytes by default) and it can grow as needed.
Allocating on the stack is fast, because it's just incrementing and decrementing a pointer. It's fast also because the access patterns are well known: most of the time, when you enter a function, you access variables in the same stack frame, which means that by nature your stack memory will be cache-friendly; it will be sitting in L1, L2, whatever. And it's managed by the compiler in Go, which means that by the time you've compiled the binary, the stack layout is known. The heap, by contrast, is dynamically allocated and managed by the garbage collector. The one thing I'd like you to remember is that whenever a variable escapes to the heap, you can see that as pressure on the garbage collector.

One more thing I really want to mention: because the stack is cache-friendly, it's mostly in L1; you are accessing the same variables again and again, which keeps them close to the CPU; that's how CPUs work. And if you look at the numbers, you can see that going to RAM is much, much worse than staying in cache, and these costs are usually hidden; you cannot see them in your CPU profiles at all. So what I'm trying to say is: do your best to stay on the stack. While writing code, if there is any possibility of keeping a variable on the stack, please keep it on the stack.
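To make the stack-versus-heap point concrete, here is a minimal sketch (hypothetical names, not from the talk) that you can check against the compiler's escape-analysis output:

```go
package main

// Build with: go build -gcflags='-m' .
// The compiler prints its escape-analysis decisions.

type point struct{ x, y int }

// Returning a pointer to a local means the value outlives the call,
// so the compiler reports "moved to heap: p" (unless the call gets
// inlined and the result provably does not escape).
func newPoint() *point {
	p := point{1, 2}
	return &p
}

// Plain values passed in and out never outlive the call,
// so p stays on the goroutine's stack: no GC pressure.
func sum(p point) int { return p.x + p.y }

func main() {
	_ = newPoint()
	_ = sum(point{3, 4})
}
```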
Before moving on to the tricks, I'd also like to say a little bit about understanding the garbage collector, how garbage collection works. First I just want to say that it is a very complex piece of software, and that is probably due to the inherently wide range of requirements of all the applications it needs to support: the allocation rate of an application can be very high, the volume can be very high, there might be a few goroutines or there might be millions of goroutines running in your application, there are fragmentation issues it needs to consider, and there is also pacing: it needs to say no to allocations when the time comes, and it needs to do all of that with minimal latency, otherwise people just switch to other languages like Rust.

So how does garbage collection work? It runs in three phases; this is a 2,000-foot overview, and I don't want to do a disservice to the people who work on this, because it is extremely complex software. The first phase is called the initial mark phase, and there is a stop-the-world here. Stop-the-world means the runtime stops your whole application; nothing runs during this period. This is done because it needs a consistent snapshot of the variables that point into the heap. So it stops the application, and then it walks the stacks of your goroutines, trying to identify which slots are pointers. There are lots of optimizations in how this is done, but the point is that this initial phase stops the world, so it matters a lot for latency. The second phase can then run concurrently, because in the initial phase write barriers were enabled. What write barriers mean is that during this concurrent mark phase, while the barriers are in place, the runtime gets notified of updates or writes to those pointers and can detect them; you can think of a write barrier as a callback on your pointer changes. All of this happens concurrently, and after all the reachable objects in the heap have been marked, the unreachable ones are swept, which is the third phase of garbage collection.

What I would like to highlight is that it can be very unpredictable. Because of the stop-the-world you literally stop the application, and garbage collection is very dependent on your application, because its work is linear in the number of pointers you have on the heap. It also behaves very differently under pressure: as you approach the limit, it kicks in more and more. And one thing that is overlooked is the CPU cache flushing I talked about before: before garbage collection everything is green and works perfectly, but after garbage collection the caches are basically empty, because if you think about it, garbage collection traverses the whole heap, so it is about as cache-unfriendly as software gets.

To wrap this part up: garbage collection can have a high impact; we saw real-world data where it makes up 40 to 50 percent; it usually goes unnoticed; and it can be unpredictable. So let's keep the garbage collector happy.

OK, I shamelessly copied this from Bryan's talk (hello, Bryan): reduce, reuse, recycle. It's a motto about eliminating waste, and I think it's a very nice analogy for what we're doing. So let's start with reduce. By the way, reduce is not only about reducing allocations or pointers: reducing any size in your memory always has compounding benefits. I was a Python guy, and I know that over a series of releases they shrank the base object (everything is an object in Python, by the way) by something like 95 percent, and saw something like 60 percent better runtime without doing any other optimization; that's how much you can benefit from cache locality and that kind of thing. Mechanical sympathy is very important; I have written a short blog post about it.

So the first thing I would like to mention is stack versus heap. Whenever you return some reference type or pointer, it escapes to the heap. So, when possible, try to design your APIs so that you accept a buffer instead of returning one; the Reader interface is a good example. I saw this example in Jacob's talk: he points out that when you return it, the allocation is already done, it has escaped to the heap; but if you accept it, there is a chance the value can be allocated on the stack, and when you call the function it may still be on the stack. There is no absolute rule here, but: returning escapes to the heap, calling does not.
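Here is a small sketch of that API shape (hypothetical functions in the spirit of io.Reader, not code from the talk). The first version allocates on every call; the second lets one caller-owned buffer serve the whole loop:

```go
import "io"

// Returning a fresh slice: the result outlives the call,
// so every invocation allocates on the heap.
func readChunk(src io.Reader) ([]byte, error) {
	buf := make([]byte, 4096) // escapes: it is returned
	n, err := src.Read(buf)
	return buf[:n], err
}

// Accepting the buffer, io.Reader style: the caller owns it,
// allocates it once, and may even keep it on the stack.
func readChunkInto(src io.Reader, buf []byte) (int, error) {
	return src.Read(buf)
}

func consume(src io.Reader) error {
	buf := make([]byte, 4096) // one buffer, reused for every read
	for {
		n, err := readChunkInto(src, buf)
		if n > 0 {
			process(buf[:n])
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func process(chunk []byte) { /* hypothetical consumer */ }
```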
Closure variables are tricky, too: if you use closure variables, just be cautious about them; they might escape to the heap. Interfaces and generics escape to the heap as well, because the compiler doesn't know the concrete type and therefore the size; they are slower on hot paths, so please prefer concrete types.

The most important line of this talk, I think, is this one: avoid pointers. I was talking about mindset before the talk, and this is what I meant: whenever you write a pointer, whenever you use a pointer, try to be mindful about it, because it means more garbage collection pressure. The collector's work is linear in the number of pointers, and when you don't use pointers, it can skip entire regions that contain none; I'm talking about the garbage collector here. Not using pointers is also more cache-friendly, by the way, and the compiler has to generate extra checks for pointers, nil checks that can panic, which end up in the compiled output.

Sometimes these pointers go unnoticed. I didn't know before preparing this talk that time.Time, for example, has a pointer inside; so if you have, say, a slice of time.Time objects, each one contains a pointer, which means garbage collection pressure. Strings and time.Time both contain pointers, so be careful, and the same goes for maps with reference types: slice values, string keys, and many more. One technique being used here: if you have a struct like this one, basically two integers, a region and a tenant ID, then instead of formatting them into a string and using that string as the map key, please just use the struct as the key, because you will be avoiding pointers for free.

This one was also interesting for me. I don't know if you remember the Swiss-maps talk from Bryan just before this one, where he was measuring the buckets: if your key or value struct is larger than 128 bytes, the map implementation has to allocate it separately and store a pointer instead of inlining the value inside the bucket; but if you stay under that special value, it will be inlined directly, so there is no extra allocation. You can test this yourself: just benchmark the code with allocation reporting and you will see the difference.
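A sketch of that key trick (the region/tenant-ID fields are from the talk's example; the counters map is hypothetical). The struct key contains no pointers, and at 16 bytes it also sits comfortably under the 128-byte inlining threshold just mentioned:

```go
// Two integers: a pointer-free, 16-byte key that is stored inline
// in the map's buckets, so the GC has nothing to trace here.
type tenantKey struct {
	region   int
	tenantID int
}

var counters = make(map[tenantKey]int64)

func record(region, tenantID int) {
	counters[tenantKey{region, tenantID}]++
}

// The string alternative allocates a fresh key on every call and
// adds one pointer per map entry for the GC to follow:
//
//	counters[fmt.Sprintf("%d:%d", region, tenantID)]++
```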
I'll speed up here. One more thing: for people coming from the C world, like me, there is this myth that copying is expensive. But if you think about it in terms of the CPU, it really is a myth: copying 64 bytes costs about the same as copying a pointer, because the CPU and RAM operate at cache-line granularity. So prefer the non-pointer versions of data structures, linked lists for example; you would be amazed how many data structures can be implemented without pointers.

This next one is not about avoiding pointers, but I wanted to mention it because it bit me in production. There was a big pprof payload we used in our ingestion pipeline, and we were caching a small struct from inside it; it wasn't even a pointer. If you do this kind of thing, the big struct never gets deallocated. It's obvious in hindsight, but keep it in mind while doing these kinds of things, because it got us: it took us a long time to find, and our application was going out of memory like crazy. So: avoid holding references that live inside large objects.

And remember the zero-allocation libraries: use them, they're awesome, and there are lots of them. If you wonder how they work, the main trick is pre-allocating memory and using integer indexes to reference objects, so it's basically just another way of avoiding pointers; that's all.

OK, reuse. I think this has to start with sync.Pool, because it's the basic tool. Again for those coming from the C world: sync.Pool is not a free list; you cannot just put something in and expect it to be there whenever you request it. It is different, because values you put into a sync.Pool can themselves be garbage collected. So the main rule is: use sync.Pool when you have allocations you would like to reuse between two garbage collection cycles; I say two because whenever you put something into the pool, it survives one more collection cycle. One more thing: it is defined in the sync package. That sounds trivial, but if you think about it, it's telling: it's not called a memory pool or something like that, it's sync.Pool, because it is extremely optimized for concurrent access; there are no locks inside the implementation. There is a very good blog post about this from VictoriaMetrics that I just read: under the hood, the runtime keeps small per-P caches and somehow manages to do all of it without any mutexes. So it is very useful, but a bit misunderstood; you need to understand it very well.

One of the tricky parts: be careful with putting non-pointers. Why? Look at the example in red here: if you take a slice header and Put it like that, then because Put accepts an interface, there is a conversion, and the slice header itself escapes to the heap, which means you have another, unnecessary allocation. The whole reason you use sync.Pool is to reuse instead of allocating, but now you are allocating more; it's exactly the opposite: you optimized one allocation away and introduced one for the next cycle. This can be hard to spot in production, and that's why we have the staticcheck rule for it; please run staticcheck on your code as well. If you read the check's description: when passing a value that is not a pointer to a function taking an interface, the value needs to be placed on the heap, which means an additional allocation; that is the reason.

I'm not going into much detail on this next one, it's very simple: if you allocate, say, a slice of numbers, and you have any chance of reusing it, please reuse it; you can just reset to the start of the slice with the s = s[:0] trick.
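Putting those pieces together, here is a minimal sync.Pool sketch (the handler and buffer size are hypothetical). It stores a *[]byte rather than a bare slice, so Put does not trigger the SA6002 allocation, and it reuses the backing array with the s[:0] reset:

```go
import "sync"

var bufPool = sync.Pool{
	// New runs only when the pool has nothing to hand out.
	New: func() any {
		b := make([]byte, 0, 4096)
		// Store a pointer: putting a plain []byte into the
		// interface would allocate (staticcheck SA6002).
		return &b
	},
}

func handle(payload []byte) {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // reuse the backing array, length back to zero

	buf = append(buf, payload...)
	// ... work with buf ...

	*bp = buf // keep a grown backing array for the next user
	bufPool.Put(bp)
}
```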
And again: whether it's big or small doesn't matter, if you allocate a map or a slice, please pre-allocate, because if you let your slice or map grow dynamically, it can lead to fragmentation as well. And strings.Builder, bytes.Buffer, all that stuff: they are optimized to make no intermediate allocations, so use them.

OK, recycle. For me, recycle is all about tuning the garbage collector, tuning its parameters. I really think the Go team has done a great job of abstracting all the garbage collection details into two basic configuration parameters. One of them is GOGC: if you set it to 100 percent, it means, "take my current heap, x bytes; whenever it grows by 100 percent more, that is, whenever it doubles, trigger the GC." That's what it does. And GOMEMLIMIT basically tells the Go runtime, "this is the limit; do not exceed this threshold." It's also a nice, soft safety margin with respect to the operating system, because you would like to avoid out-of-memory errors: when an OOM kill occurs, things can end very weirdly, and this gives you a chance to set a proper value so the runtime can react before the OOM. By the way, when garbage collection gets near this limit, near GOMEMLIMIT, it acts more aggressively to try to reduce the heap. There is a nice interactive tool that shows this; you can just set GOGC and the memory limit and play with them.
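Both knobs can be set from the environment or from code; a minimal sketch (the 4 GiB figure is just an example value):

```go
import "runtime/debug"

func init() {
	// Equivalent to running with: GOGC=100 GOMEMLIMIT=4GiB
	debug.SetGCPercent(100)       // collect when the live heap doubles
	debug.SetMemoryLimit(4 << 30) // soft limit in bytes; the collector
	// works more and more aggressively as the heap approaches it
}
```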
OK, on to tools; I have a few minutes left, so this is an overview. I think Go is unparalleled when it comes to observability: the runtime gives you lots and lots of tools. I'm not going into much detail on them, there's tons of material; I just want to mention a few things. Profiling memory: you can use the heap profile (live object counts and sizes) to debug memory leaks, and you can get allocation counts per function and even per line. Line level is very interesting, by the way: you can observe allocation frequency line by line and maybe decide a pool is worth it. And this one, again, I shamelessly took from Bryan's talk (I was really too lazy to redo it myself): this is line profiling, and I'm still amazed by it; it shows you per-line allocation information.

Then escape analysis: I said try to be on the stack, not the heap, and this is your tool for that. If you build with the compiler's escape-analysis flag, the one we saw earlier (-gcflags='-m'), it tells you whether something will be on the stack or not, so you can use it to decide; sometimes it's not easy to spot by eye.

And one tool I'd like to mention because I feel it's underrated: the execution tracer. I find it the most cinematic visualization, and what I mean by that is that it offers the most realistic view of what happens inside your application, because it shows time, and it shows the important events happening in your application: a lock acquire, a garbage collection, whatever; you see a timeline of events. You can also see garbage collection pressure, the phases, and the latency: the stop-the-world, every phase I just mentioned, you can see and visualize, and you can debug lock contention issues as well. And it is safe in production, kind of safe. Why do I say that? Because with Go 1.21 the overhead dropped to one or two percent of run time; kudos to Felix Geisendörfer, who works on these kinds of topics; they basically optimized the stack unwinding code to bring it down from around ten percent. So it is kind of safe to use in production. This is how it looks, if you haven't seen it already. And for me there is one more thing, an environment variable, GODEBUG=gctrace=1; for me it's kind of a CLI way of doing what the execution tracer does, so I just wanted to mention it.
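For reference, a minimal way to capture a trace yourself (standard runtime/trace API; the file name is arbitrary) and then open the timeline shown on the slide:

```go
package main

import (
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// ... run the workload: GC phases, stop-the-world pauses,
	// and lock contention all show up on the timeline ...
}

// Inspect with: go tool trace trace.out
```

On a live service, net/http/pprof exposes the same data over HTTP at /debug/pprof/trace.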
OK, to wrap up. Reduce: prefer the stack over the heap when you can, and avoid pointers; make it a habit. I mean, there is no way to avoid pointers entirely; I'm just saying, stay in the mindset that they put pressure on the garbage collector. Use interfaces and generics sparingly. sync.Pool is your friend, but understand it very well. Reuse: pre-allocate maps and slices whenever possible. And use the observability tools; I think any time spent with the observability tools, profiles, benchmarks, the execution tracer, is time well spent.

And one bonus item. I saw the morning talk from Martin about memory regions, and until an hour ago I couldn't decide whether to include this. Before regions, arenas were proposed, which were a way of, in some sense, implementing your own stack: you basically say, "I don't want the garbage collector involved in this region at all." Regions are awesome, and they are better than arenas, because arenas turned out to be less ergonomic in the previous implementation. I don't know how it will end up, but I think it will be good, and it has minimal garbage collection impact as well. With that, thank you very much.