My presentation is about probably the most pointless thing software can do: moving bytes around in memory, that is, memory copies.

In general, let's take a look at how to write fast code. You can do high-level optimizations and low-level optimizations. High-level optimizations are about choosing the right algorithms, the right data structures, the right interfaces. Low-level optimization is about removing unnecessary copying, using the right instruction set, making sure the data layout in memory is right.

But actually, sometimes it is not easy to figure out what is a high-level and what is a low-level optimization. For example, imagine you profile your simple C++ code and you see this: a terrible dynamic_cast, and again a dynamic_cast, and string serialization, and std::string copies, and string streams, and whatever.

[checking the microphone] How do I change it? I don't know. Do I speak loud enough? Is it okay?

So, you typically see this when you profile some C++ code that not you but someone else has written. And this is my impression when I see it. So the first step in optimization is removing trash: removing all of these dynamic_casts, std::string copies, string streams and so on. But this is just the basics. And when you optimize away all of this trash, it is unclear whether that was high-level or low-level optimization; it is just removing trash.
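As an aside, the kind of "trash removal" meant here can be illustrated with a small sketch. The function names are made up for illustration: both functions build the same string, but the second avoids the stringstream's allocation and locale machinery.

```cpp
#include <sstream>
#include <string>

// The "trash" variant: a std::ostringstream round-trip just to glue
// an int and a string together (heap allocation plus locale setup).
std::string format_slow(int id, const std::string& name) {
    std::ostringstream oss;
    oss << id << ":" << name;
    return oss.str();
}

// The cleaned-up variant: direct string building, no stream machinery.
std::string format_fast(int id, const std::string& name) {
    std::string out = std::to_string(id);
    out += ':';
    out += name;
    return out;
}
```

Both produce identical output; only the second avoids showing up in a profile as allocator and stream overhead.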
When you do this, the profile might look like this. And here at the top we have memcpy. The second function is copy_user_generic_string; do you know what that is? It is also memcpy, but it happens in the Linux kernel. And then we have some real work, like decompression and string serialization. But it is typical to find memcpy at the top when you profile your code.

So, a reasonable question: should we optimize memcpy? Not necessarily. Sometimes you can remove memcpy from the profile just by copying less data, which is not always possible and not always needed. Do fewer copies, and your code will be faster. But sometimes you need to do memory copies precisely because some algorithms expect the data to be laid out in a certain way: you arrange the data in a certain way with memory copies, and then you run some fast algorithm that processes the data contiguously, in a single batch. So sometimes you deliberately copy data exactly to optimize your code.

And the second argument against trying to optimize memcpy is that this function is so popular, so often at the top of profiles, that maybe thousands of people have already tried to optimize it. Maybe it will be pointless, and maybe you think too much of yourself if you think that you can optimize it.

But let's dig into the details. memcpy is the world's most popular function, one all programmers love.
Actually, I disagree. I don't love this function: every time I see it in a profile, I want to get rid of it. How can I get this function out of there? But okay.

Let's start looking at interesting facts about memory copies. The first fact is that sometimes this function is not actually called; it is not even invoked. Instead it is just a built-in. When the compiler sees a memcpy with a constant size, like 8 bytes in this example, it generates this assembly code without any function call. You can disable this if you want, using -fno-builtin-memcpy; you can enable it explicitly, but it is enabled by default with optimizations. And you can also invoke the compiler built-in explicitly, using __builtin_memcpy. But if the size is unknown at compile time, this will not be effective: it will still do a function call.

Okay, the next fact: this works even for unusual, uneven sizes. If you do a 16-byte copy, you will see the compiler use SIMD instructions, at least SSE, to copy 16 bytes at once. The compiler has to use unaligned instructions, because it does not know whether the data is aligned. If you copy 15 bytes, the compiler will do something interesting: it will do two copies of 8 bytes that overlap. One of these bytes will be copied twice, but it does not matter.
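The two-overlapping-8-byte-copies trick can be written out in portable C++. This is only a sketch of what the compiler emits (copy15 is an illustrative name): each fixed-size memcpy here compiles down to a single 8-byte load or store.

```cpp
#include <cstring>
#include <cstdint>

// Copy exactly 15 bytes the way an optimizing compiler typically does it:
// two overlapping 8-byte moves instead of a byte loop. Byte 7 is covered
// by both moves, which is harmless because it is written twice with the
// same value.
inline void copy15(void* dst, const void* src) {
    std::uint64_t head, tail;
    std::memcpy(&head, src, 8);                                // bytes 0..7
    std::memcpy(&tail, static_cast<const char*>(src) + 7, 8);  // bytes 7..14
    std::memcpy(dst, &head, 8);
    std::memcpy(static_cast<char*>(dst) + 7, &tail, 8);
}
```

Loading into temporaries first keeps the code free of alignment and aliasing problems while still producing plain mov instructions.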
It is fairly optimal, and the compiler is probably smart enough to choose this specific code. You can always check it on Godbolt, for example, to see the assembly.

Okay, the next fact. If you try to write your own memcpy, like this function f with a loop that copies data byte by byte, the compiler can recognize this pattern and replace it with just a call to memcpy. So you try to write your own memcpy, and the compiler compiles it into a call to the very memcpy you wanted to avoid, and your code is pointless in this case. You can disable this behavior using, again, -fno-builtin-memcpy, or -fno-tree-loop-distribute-patterns.

A question for the audience: what does memcpy@plt mean? [audience: procedure linkage table] Exactly: the procedure linkage table. For what? For a call, yes. But why does the compiler have to use the procedure linkage table? [audience: relocations for shared libraries] Yes, exactly: relocations for shared libraries, which is not exactly optimal, but we will get to this later.

So, another interesting fact: what will happen if you write your own memcpy like this? You will get a segmentation fault from stack overflow, because this memcpy function will itself invoke memcpy, again and again.

Okay, the next fact: glibc, which is used by default if you compile your program on almost any environment on Linux (this is specific to glibc), can actually invoke one of, let me count, one, two, three, four, five, six... nine different implementations of memcpy. And it does it with a
very interesting trick: it uses the shared-library loading mechanism to substitute the supposedly most optimal implementation at the time of relocations. So at load time the program has to process these relocations for shared libraries, and glibc can substitute the function using the IFUNC mechanism. So glibc is already really smart. It can use, for example, __memcpy_avx512_unaligned, __memcpy_avx512_unaligned_erms, and __memcpy_avx512_no_vzeroupper, and it would be very interesting to figure out why it has all three of these functions.

Let's take a look at which instructions to use for memcpy. Actually, an implementation of memcpy already exists inside the CPU. Intel CPUs since 1978 have had the instruction rep movsb, which actually does a memory copy: it does something like move a byte, increment a counter, repeat until some condition happens. But it is a memory copy, and it has existed since 1978, which means that this memcpy is older than me. But it is not so simple, because in 2012 CPUs got a new flag named ERMS, which means Enhanced REP MOVSB. So the previous implementation was really slow, and they enhanced it. But it was not enhanced enough, and in about 2017 some CPUs got a flag named FSRM, Fast Short REP MOVSB, which I will tell you about in just a few minutes.

Okay, how to optimize memcpy? I would say that every self-respecting developer has already tried to optimize memcpy at least once in their life. By the way, how many self-respecting developers are in the room? Quite a few of you. Actually, I like it. I would like to know how
you succeeded; after this presentation I would like to ask you. Okay, so let's try to optimize memcpy. But first you should be prepared: you should have a reproducible performance test, and ideally on some production workload. Because otherwise you will over-optimize memcpy for a particular scenario, like copying one megabyte of memory, or copying from four kilobytes to eight kilobytes with a uniform random distribution. And a random distribution is not representative, a constant distribution is not representative; the only representative distribution is the distribution from your production. And it depends on many, many factors: where exactly the data will be copied, how frequently it will be accessed, at what time after copying it will be accessed, and so on.

And often you don't need to optimize memcpy, because sometimes you need something other than memcpy, some different variation of this function. memcpy does exactly what its name says: copy a specific number of bytes from one specific location to another specific location. But what if you have some assumptions in the code? Say you can copy this chunk of memory to another chunk of memory, and then immediately copy another chunk straight after that. Then you can use a function that may overwrite some memory past the end of the requested buffer, because the next copy will overwrite that memory anyway. With the standard memcpy function you cannot make this assumption inside your code, but maybe you can take a few
assumptions and use a different function; it's like a whole family of functions in that direction. Okay, our goal is to optimize the standard function.

Now, the question: which language should we use? Probably we should imagine ourselves as really hardcore, gray-beard engineers and write it in assembly. If you write it in assembly, you create a file like memcpy.S. The upside: you basically control everything, every single instruction. But this function will not inline, and sometimes, specifically for small sizes, it is really important to make sure that some invocations get inlined. And link-time optimization also will not work; maybe if you write it in LLVM assembly you will get something like working link-time optimization, but that is out of scope. Also you can write it in C, maybe with some inline assembly. Actually, you can write memcpy in C++: just use extern "C". And then there is your favorite language, you know; I'm not sure about it: should you write memcpy in Rust?

Let's take a look at what the essence of memcpy is. Typically it has some instructions to process the tail or the head of your data. Say you have to copy a chunk of memory of 130 bytes. That typically means you can do a fast loop over 128 bytes, and you also have a tail of two bytes. Or maybe you have a head of a few bytes, if you want to use aligned instructions: you do something like loop peeling to align the main loop. So you have the main loop, the head, and the tail, and you have to optimize all of them. Okay.
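The head/main-loop/tail structure can be sketched like this; a minimal, untuned illustration (copy_with_tail is a made-up name; real implementations use SIMD registers and alignment peeling for the head).

```cpp
#include <cstring>
#include <cstddef>

// Main loop over whole 16-byte chunks, plus an overlapping 16-byte tail:
// a 130-byte copy becomes 8 chunk iterations and one final chunk that
// overlaps the previous one. The fixed-size memcpy calls are lowered by
// the compiler to plain 16-byte loads and stores.
void copy_with_tail(char* dst, const char* src, std::size_t n) {
    if (n < 16) {                        // small sizes need their own path
        for (std::size_t i = 0; i != n; ++i)
            dst[i] = src[i];
        return;
    }
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)         // main loop: whole 16-byte chunks
        std::memcpy(dst + i, src + i, 16);
    if (i != n)                          // overlapping tail: last 16 bytes
        std::memcpy(dst + n - 16, src + n - 16, 16);
}
```

The overlapping tail rewrites a few bytes the main loop already copied, which is harmless because the values are identical; this avoids a byte-by-byte tail loop.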
So, what should we focus on? Oftentimes memcpy is invoked on small sizes, for example for std::string: when you copy a std::string, which you should not do, but it is inevitable in many cases. And you store something like, I don't know, URLs, or names, or company names in the system; it will be maybe 50 bytes, maybe 100 bytes, something like that. At these sizes it depends on the cost of actually calling memcpy, the cost of the function call. If the function is located in a shared library, you pay for the procedure linkage table, which is not a big cost, but it is still noticeable at these small sizes. And if the function is in a different translation unit, then without link-time optimization it is not going to be inlined. Even the calling convention matters, because you have to save registers and so on.

What matters for large sizes? The instruction set: should you use SSE, as was demonstrated in the previous presentation, should you use AVX, should you use AVX-512? It's a big question. Should you use this rep movsb? Loop unrolling matters. The order of copying data matters: should you copy forward or backwards? There is an interesting story here. memcpy used to be implemented as copying byte by byte in the forward direction, and some people abused memcpy to copy overlapping ranges, which is not allowed by the standard: it is undefined behavior. But some people did it, and when the implementation of memcpy was changed to copy data backwards, they were really disappointed by the bugs in their programs
and even tried to convince the glibc authors to change it back, to not optimize memcpy. Fortunately they did not convince them; the glibc authors are really hardcore guys, difficult to convince.

Okay, and there is a lot of variability. Should we use aligned or unaligned instructions? Should we use non-temporal stores? There are specific instructions that say: I am writing data somewhere, and this data should not go into the CPU cache, it should go directly into memory, because this data will not be accessed soon.

And everything depends on the particular CPU model and the particular data distribution, so the testing will be especially hard. The first example: the single instruction, rep movsb. In old CPUs it is implemented with microcode, and it is really slow; you should not use it. In more modern CPUs it works fast, but only for large sizes: it has a large startup overhead. And in new CPUs with this FSRM flag it supposedly should work well in all cases, but it is still not the fastest option. And you still have to figure out which CPU you are running on, just to decide whether you should invoke this instruction or do something else.

Another example: if you use AVX instructions for memory copies, on most CPUs they will be faster, but only if you execute the vzeroupper instruction after finishing your AVX code, because otherwise all other code, non-AVX code, maybe SSE, will suddenly slow down. You do your fast memory copy, and all other code slows down. But on newer CPUs you don't have to use this vzeroupper
instruction, because the CPUs were optimized and now it works fine. And this is why there are so many different implementations. AVX-512 is also not easy: on many not-so-modern CPUs it was not faster than AVX, for multiple reasons. An interesting fact is that today, on AMD CPUs, AVX-512 often works better than on Intel CPUs, where it originated; but you can still get latency spikes due to the usage of these instructions. And this is my reaction to all of this: it is so complex, so many difficult details. So what to do, what should we do?

Another example: non-temporal stores. Should we use non-temporal stores to bypass the cache? If you run your benchmark and test your memcpy, most likely you will get the impression that yes, non-temporal stores improve everything. Until you put this implementation into production, where it will actually slow down the code. Why? Because non-temporal stores are intended for the case when the data is not accessed after it was copied, at least not for a long time. In production code, the data is typically copied in order to be used almost immediately. So it is almost completely pointless in production.

Okay, so better not to optimize memcpy, not even to try: you will optimize it for your machine, you will write a blog post about your amazing benchmark and your amazing implementation, and your software will slow down on someone else's machine. But I will try to optimize memcpy anyway.

Okay, so for the benchmark we will use different buffer sizes, different sizes of memcpy calls, different random distributions, with actually different probability
distributions; different numbers of threads; and different directions, different relative positions of the buffers. And we have a lot of existing implementations to compare: nine variants from glibc; memcpy from Cosmopolitan Libc; rep movsb; a simple loop with different options; and an implementation I found on a Chinese website. Yes, two different Chinese variants: one traditional Chinese and one simplified Chinese. Plus my own implementation, and also an implementation from the StringZilla library.

And I have to test it. For testing I have machines from AWS, all recent generations, which includes at least four different Intel CPUs and at least three different AMD CPUs. For this benchmark I have 42 thousand measurements in total to compare: I basically test all variants for realistic data sizes, multiplied by the number of machines. And when I tested all of this, the result was that none of these implementations is actually the best. There are good implementations, but under certain conditions one implementation wins, and under other conditions another wins. For example, memcpy from Cosmopolitan Libc is really good, but not for small sizes. StringZilla is really great, and glibc is also not much slower, and sometimes faster. But glibc's memcpy cannot be inlined, and I really want to inline my memcpy implementation for small sizes. So I still have hope that I can make my own, the best, memcpy.

How to do it, if none of these implementations is the best? The idea is to make a generalized memcpy. So I will generate this code in C,
which is similar to C++, because I use something similar to templates, but with macros: I just include the same file many, many times with different values of these macros, the vector size, whether the vzeroupper instruction is enabled, how many times to unroll the loop, and something else. And this C code generates code with inline assembly. So I have macros for the register name, for the instruction name, for whether it is aligned, for the size, and so on. And it looks something like this: we have a function with inline assembly, and the inline assembly can itself have macros, which is really handy for me. Can you write this code in Rust? Yes, yes. I wanted to say that there is at least something where C++ and C are better than Rust in this case, but maybe there is nothing. It's just normal code.

And different options work better under different conditions. So I thought: maybe I can implement a self-tuning memcpy. Because even if I do dynamic dispatch by CPU model, I still cannot decide; there would be too many things to decide, and it would inevitably become outdated with the next CPU model. So what about one interesting trick to make this memcpy self-driven? For small sizes we will have small, constant, inlined code, but for large sizes we will invoke different variants and calculate statistics: we will really quickly record which variant took which number of instructions, and then we will converge to the statistically best option, and all of this at runtime. So let's do it for sizes of at least 30
kilobytes: if the buffer is smaller than 30 kilobytes, we will just use the regular implementation. And why 30 kilobytes? Because I calculated that memcpy in L1 cache should run at something like 100 gigabytes per second at least (actually it should be more today, because this calculation was done a long time ago), and we have a budget for our statistics bookkeeping of just a few tens of nanoseconds at most.

So it looks as follows. In my memcpy I have a counter, a thread-local variable like count, and depending on a threshold I invoke either exploration or exploitation mode. In exploitation mode I invoke the currently selected variant. And in exploration mode there is a function explore that calculates a hash, and depending on this hash it invokes a random variant, and it basically measures the time. This is imprecise, and it is probably not the right way to measure time, but okay. And guess what: it must definitely be the best option, I must definitely have invented the fastest memcpy. And it worked: it was successfully the best on the benchmark. Then I put it into production and started testing real queries, and it was slow. Actually it was quite okay, comparable to other memcpy implementations, but there was no speedup, no expected speedup. So the self-tuning, magical memcpy that I spent so much time implementing was almost completely useless.

And this is my impression. Why, why was it useless? One hypothesis is that if you have a
memory-bound workload... Say you have a CPU with, let's say, 64 cores, 128 threads, something like an AMD EPYC with eight memory channels. Let it be DDR5, and DDR5 has, whatever, let's say 50 gigabytes per second per channel. If you multiply by eight, we get 400, is it 400? yes, 400 gigabytes per second per socket. But if you take 128 threads (and here hyperthreading is fine: logical threads can issue these memory instructions), you will get saturated by memory bandwidth. And in this memory-bound workload it is not really important whether you switch from one memcpy implementation to another for large sizes. But for small sizes it is really important, and inlining memcpy makes sense.

So was all of this useless? No, not quite. I still replaced one memcpy implementation with another that I have written, not with inline assembly but in actual C++. So it is a function with extern "C", it is entirely in the header file, so it is inlined, and it uses SIMD intrinsics; today it actually uses just SSE2, so nothing fancy. But again, it is not final: we found out that if I use masked loads and stores from AVX-512, it gives a benefit, and maybe I should continue this work. But the problem is that all of our production in ClickHouse Cloud runs on ARM, and I have spent so much time on this, and I would have to spend even more to optimize memcpy on ARM.

So, conclusions: do optimize your code, don't be afraid, try crazy ideas, and never give up.

[Host] Thank you for the talk. Now it's time for questions. Raise your hand and I will pass my
phone around. I see, I see, I'm coming.

Q: Hi, thanks, that was fantastic. Okay, I have two questions. At the beginning you showed a profile made with perf, with memcpy being, I think, 10% or more of the runtime. Does that include inlined memcpys?

A: No. If it is inlined, it does not show up, at least in perf; at least by default it will show up inside different functions.

Q: And my second question: would you be able to record the real-world distribution of the memcpy sizes?

A: Yeah, that's a really good question, and I thought about it. I even tried to do it: record the addresses and sizes into a file. But I did not get to incorporating it into the benchmark, although it's a really good idea that I wanted to try.

Q: One thing that I have seen in my application is that the compiler is clever as long as it knows the size. But if you don't have the size, and you know as a programmer that you have varying but small sizes, then it can be better to do the loop unrolling yourself. For example, I have some code that commonly takes between 6 and 50 bytes, so I made an outer check for the size, and if it is a small size I manually unroll it, which is much faster than using the function you said cannot be inlined, the memcpy AVX variant that is normally used.

A: It's really, really hard to hear the question,
but I almost got it. So, what to do if the size is small but not known at compile time? Interesting: in this case memcpy is, in one sense, similar to Duff's device. You have a few jumps, depending on the remainder of the size, into a loop that has a few wide instructions, like a few SIMD instructions. And this pattern is typical in many implementations, including my implementation and Cosmopolitan Libc's: all of them have something very similar to Duff's device. But I'm afraid maybe I did not hear the question well.

Q: Another question. Sorry, I cannot hear... maybe like this, can you hear me? Okay. Did you look at any interprocedural optimization with the compiler, to see if you could avoid some of the memcpys? You do have some interprocedural optimization, right?

A: The question was whether we build the compiler with IPO, interprocedural optimization. I can answer about interprocedural optimization: we use ThinLTO in clang in the release build. It does not help with memcpy, but it helps with many other functions.

Q: Maybe, can you hear me? Maybe just to follow up on the question just asked: if you happen to use -O2, interprocedural optimization has probably been carried out by the compiler. So, if you
tested your code and the original memcpy slowed down, did you enable the optimizations, the interprocedural optimizations that are part of the optimization pipeline at -O1, -O2, -O3?

A: Yeah, actually we use -O3 in clang, and we try to enable some extra stuff. We also use some things like disabling ABI compatibility, because we have all the source code bundled, and we enable just a bit of the new stuff that appears in clang. I'm not sure if we include that particular optimization that you mentioned. Maybe we can ask the room, the C++ experts, about specific new interprocedural optimizations that are not enabled by default at -O3.

Q: Sorry, just the last one, a fun fact: have you tried the memcpy attribute? In GCC there is a compiler attribute that marks a function as a memcpy, and then the compiler understands that it is a memcpy, so it can do even more optimization things on top. So it's just a fun fact: if you want to do more compiler stuff on top, just add the attribute and see if it performs even better; it can do things like eliding memcpys and stuff like that.

A: Yeah, it's really interesting, although we stopped using GCC a few years ago, so now I would have to ask whether clang has a compatible optimization. Also, if you are a fan of GCC, please don't be worried: it's good, actually. And I think maybe it's planned, since they normally make the attributes compatible. But thanks.

[Host] I'm sure we are out of time, so let's move on. Thank you for the talk.