My presentation is about probably the most pointless thing software can do: moving bytes around in memory, that is, memory copies.

In general, let's take a look at how to write fast code. You can do high-level optimizations and low-level optimizations. High-level optimizations are about choosing the right algorithms, the right data structures, the right interfaces. Low-level optimization is about removing unnecessary copying, using the right instruction set, making sure the data layout in memory is right.

But actually, sometimes it is not easy to figure out what is a high-level and what is a low-level optimization. For example, imagine you profile your simple C++ code and you see this: a terrible dynamic_cast, and again a dynamic_cast, and string serialization, and std::string copies, and string streams, and whatever.

[checking the microphone] How do I change it? I don't know. Do I speak loud enough? Is it okay?

So, you typically see this when you profile some C++ code that not you but someone else has written. And this is my impression when I see it. So the first step in optimization is removing trash: removing all of these dynamic_casts, std::string copies, string streams and so on. But this is just the basics. And when you optimize away all of this trash, it is unclear whether that was high-level or low-level optimization; it is just removing trash.
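As an aside, the kind of "trash removal" meant here can be illustrated with a small sketch. The function names are made up for illustration: both functions build the same string, but the second avoids the stringstream's allocation and locale machinery.

```cpp
#include <sstream>
#include <string>

// The "trash" variant: a std::ostringstream round-trip just to glue
// an int and a string together (heap allocation plus locale setup).
std::string format_slow(int id, const std::string& name) {
    std::ostringstream oss;
    oss << id << ":" << name;
    return oss.str();
}

// The cleaned-up variant: direct string building, no stream machinery.
std::string format_fast(int id, const std::string& name) {
    std::string out = std::to_string(id);
    out += ':';
    out += name;
    return out;
}
```

Both produce identical output; only the second avoids showing up in a profile as allocator and stream overhead.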
When you do this, the profile might look like this. And here at the top we have memcpy. The second function is copy_user_generic_string; do you know what that is? It is also memcpy, but it happens in the Linux kernel. And then we have some real work, like decompression and string serialization. But it is typical to find memcpy at the top when you profile your code.

So, a reasonable question: should we optimize memcpy? Not necessarily. Sometimes you can remove memcpy from the profile just by copying less data, which is not always possible and not always needed. Do fewer copies, and your code will be faster. But sometimes you need to do memory copies precisely because some algorithms expect the data to be laid out in a certain way: you arrange the data in a certain way with memory copies, and then you run some fast algorithm that processes the data contiguously, in a single batch. So sometimes you deliberately copy data exactly to optimize your code.

And the second argument against trying to optimize memcpy is that this function is so popular, so often at the top of profiles, that maybe thousands of people have already tried to optimize it. Maybe it will be pointless, and maybe you think too much of yourself if you think that you can optimize it.

But let's dig into the details. memcpy is the world's most popular function, one all programmers love.
Actually, I disagree. I don't love this function: every time I see it in a profile, I want to get rid of it. How can I get this function out of there? But okay.

Let's start looking at interesting facts about memory copies. The first fact is that sometimes this function is not actually called; it is not even invoked. Instead it is just a built-in. When the compiler sees a memcpy with a constant size, like 8 bytes in this example, it generates this assembly code without any function call. You can disable this if you want, using -fno-builtin-memcpy; you can enable it explicitly, but it is enabled by default with optimizations. And you can also invoke the compiler built-in explicitly, using __builtin_memcpy. But if the size is unknown at compile time, this will not be effective: it will still do a function call.

Okay, the next fact: this works even for unusual, uneven sizes. If you do a 16-byte copy, you will see the compiler use SIMD instructions, at least SSE, to copy 16 bytes at once. The compiler has to use unaligned instructions, because it does not know whether the data is aligned. If you copy 15 bytes, the compiler will do something interesting: it will do two copies of 8 bytes that overlap. One of these bytes will be copied twice, but it does not matter.
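The two-overlapping-8-byte-copies trick can be written out in portable C++. This is only a sketch of what the compiler emits (copy15 is an illustrative name): each fixed-size memcpy here compiles down to a single 8-byte load or store.

```cpp
#include <cstring>
#include <cstdint>

// Copy exactly 15 bytes the way an optimizing compiler typically does it:
// two overlapping 8-byte moves instead of a byte loop. Byte 7 is covered
// by both moves, which is harmless because it is written twice with the
// same value.
inline void copy15(void* dst, const void* src) {
    std::uint64_t head, tail;
    std::memcpy(&head, src, 8);                                // bytes 0..7
    std::memcpy(&tail, static_cast<const char*>(src) + 7, 8);  // bytes 7..14
    std::memcpy(dst, &head, 8);
    std::memcpy(static_cast<char*>(dst) + 7, &tail, 8);
}
```

Loading into temporaries first keeps the code free of alignment and aliasing problems while still producing plain mov instructions.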
It is fairly optimal, and the compiler is probably smart enough to choose this specific code. You can always check it on Godbolt, for example, to see the assembly.

Okay, the next fact. If you try to write your own memcpy, like this function f with a loop that copies data byte by byte, the compiler can recognize this pattern and replace it with just a call to memcpy. So you try to write your own memcpy, and the compiler compiles it into a call to the very memcpy you wanted to avoid, and your code is pointless in this case. You can disable this behavior using, again, -fno-builtin-memcpy, or -fno-tree-loop-distribute-patterns.

A question for the audience: what does memcpy@plt mean? [audience: procedure linkage table] Exactly: the procedure linkage table. For what? For a call, yes. But why does the compiler have to use the procedure linkage table? [audience: relocations for shared libraries] Yes, exactly: relocations for shared libraries, which is not exactly optimal, but we will get to this later.

So, another interesting fact: what will happen if you write your own memcpy like this? You will get a segmentation fault from stack overflow, because this memcpy function will itself invoke memcpy, again and again.

Okay, the next fact: glibc, which is used by default if you compile your program on almost any environment on Linux (this is specific to glibc), can actually invoke one of, let me count, one, two, three, four, five, six... nine different implementations of memcpy. And it does it with a
very interesting trick: it uses the shared-library loading mechanism to substitute the supposedly most optimal implementation at the time of relocations. So at load time the program has to process these relocations for shared libraries, and glibc can substitute the function using the IFUNC mechanism. So glibc is already really smart. It can use, for example, __memcpy_avx512_unaligned, __memcpy_avx512_unaligned_erms, and __memcpy_avx512_no_vzeroupper, and it would be very interesting to figure out why it has all three of these functions.

Let's take a look at which instructions to use for memcpy. Actually, an implementation of memcpy already exists inside the CPU. Intel CPUs since 1978 have had the instruction rep movsb, which actually does a memory copy: it does something like move a byte, increment a counter, repeat until some condition happens. But it is a memory copy, and it has existed since 1978, which means that this memcpy is older than me. But it is not so simple, because in 2012 CPUs got a new flag named ERMS, which means Enhanced REP MOVSB. So the previous implementation was really slow, and they enhanced it. But it was not enhanced enough, and in about 2017 some CPUs got a flag named FSRM, Fast Short REP MOVSB, which I will tell you about in just a few minutes.

Okay, how to optimize memcpy? I would say that every self-respecting developer has already tried to optimize memcpy at least once in their life. By the way, how many self-respecting developers are in the room? Quite a few of you. Actually, I like it. I would like to know how
you succeeded; after this presentation I would like to ask you. Okay, so let's try to optimize memcpy. But first you should be prepared: you should have a reproducible performance test, and ideally on some production workload. Because otherwise you will over-optimize memcpy for a particular scenario, like copying one megabyte of memory, or copying from four kilobytes to eight kilobytes with a uniform random distribution. And a random distribution is not representative, a constant distribution is not representative; the only representative distribution is the distribution from your production. And it depends on many, many factors: where exactly the data will be copied, how frequently it will be accessed, at what time after copying it will be accessed, and so on.

And often you don't need to optimize memcpy, because sometimes you need something other than memcpy, some different variation of this function. memcpy does exactly what its name says: copy a specific number of bytes from one specific location to another specific location. But what if you have some assumptions in the code? Say you can copy this chunk of memory to another chunk of memory, and then immediately copy another chunk straight after that. Then you can use a function that may overwrite some memory past the end of the requested buffer, because the next copy will overwrite that memory anyway. With the standard memcpy function you cannot make this assumption inside your code, but maybe you can take a few
assumptions and use a different function; it's like a whole family of functions in that direction. Okay, our goal is to optimize the standard function.

Now, the question: which language should we use? Probably we should imagine ourselves as really hardcore, gray-beard engineers and write it in assembly. If you write it in assembly, you create a file like memcpy.S. The upside: you basically control everything, every single instruction. But this function will not inline, and sometimes, specifically for small sizes, it is really important to make sure that some invocations get inlined. And link-time optimization also will not work; maybe if you write it in LLVM assembly you will get something like working link-time optimization, but that is out of scope. Also you can write it in C, maybe with some inline assembly. Actually, you can write memcpy in C++: just use extern "C". And then there is your favorite language, you know; I'm not sure about it: should you write memcpy in Rust?

Let's take a look at what the essence of memcpy is. Typically it has some instructions to process the tail or the head of your data. Say you have to copy a chunk of memory of 130 bytes. That typically means you can do a fast loop over 128 bytes, and you also have a tail of two bytes. Or maybe you have a head of a few bytes, if you want to use aligned instructions: you do something like loop peeling to align the main loop. So you have the main loop, the head, and the tail, and you have to optimize all of them. Okay.
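The head/main-loop/tail structure can be sketched like this; a minimal, untuned illustration (copy_with_tail is a made-up name; real implementations use SIMD registers and alignment peeling for the head).

```cpp
#include <cstring>
#include <cstddef>

// Main loop over whole 16-byte chunks, plus an overlapping 16-byte tail:
// a 130-byte copy becomes 8 chunk iterations and one final chunk that
// overlaps the previous one. The fixed-size memcpy calls are lowered by
// the compiler to plain 16-byte loads and stores.
void copy_with_tail(char* dst, const char* src, std::size_t n) {
    if (n < 16) {                        // small sizes need their own path
        for (std::size_t i = 0; i != n; ++i)
            dst[i] = src[i];
        return;
    }
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)         // main loop: whole 16-byte chunks
        std::memcpy(dst + i, src + i, 16);
    if (i != n)                          // overlapping tail: last 16 bytes
        std::memcpy(dst + n - 16, src + n - 16, 16);
}
```

The overlapping tail rewrites a few bytes the main loop already copied, which is harmless because the values are identical; this avoids a byte-by-byte tail loop.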
So, what should we focus on? Oftentimes memcpy is invoked on small sizes, for example for std::string: when you copy a std::string, which you should not do, but it is inevitable in many cases. And you store something like, I don't know, URLs, or names, or company names in the system; it will be maybe 50 bytes, maybe 100 bytes, something like that. At these sizes it depends on the cost of actually calling memcpy, the cost of the function call. If the function is located in a shared library, you pay for the procedure linkage table, which is not a big cost, but it is still noticeable at these small sizes. And if the function is in a different translation unit, then without link-time optimization it is not going to be inlined. Even the calling convention matters, because you have to save registers and so on.

What matters for large sizes? The instruction set: should you use SSE, as was demonstrated in the previous presentation, should you use AVX, should you use AVX-512? It's a big question. Should you use this rep movsb? Loop unrolling matters. The order of copying data matters: should you copy forward or backwards? There is an interesting story here. memcpy used to be implemented as copying byte by byte in the forward direction, and some people abused memcpy to copy overlapping ranges, which is not allowed by the standard: it is undefined behavior. But some people did it, and when the implementation of memcpy was changed to copy data backwards, they were really disappointed by the bugs in their programs
and even tried to convince the glibc authors to change it back, to not optimize memcpy. Fortunately they did not convince them; the glibc authors are really hardcore guys, difficult to convince.

Okay, and there is a lot of variability. Should we use aligned or unaligned instructions? Should we use non-temporal stores? There are specific instructions that say: I am writing data somewhere, and this data should not go into the CPU cache, it should go directly into memory, because this data will not be accessed soon.

And everything depends on the particular CPU model and the particular data distribution, so the testing will be especially hard. The first example: the single instruction, rep movsb. In old CPUs it is implemented with microcode, and it is really slow; you should not use it. In more modern CPUs it works fast, but only for large sizes: it has a large startup overhead. And in new CPUs with this FSRM flag it supposedly should work well in all cases, but it is still not the fastest option. And you still have to figure out which CPU you are running on, just to decide whether you should invoke this instruction or do something else.

Another example: if you use AVX instructions for memory copies, on most CPUs they will be faster, but only if you execute the vzeroupper instruction after finishing your AVX code, because otherwise all other code, non-AVX code, maybe SSE, will suddenly slow down. You do your fast memory copy, and all other code slows down. But on newer CPUs you don't have to use this vzeroupper
instruction, because the CPUs were optimized and now it works fine. And this is why there are so many different implementations. AVX-512 is also not easy: on many not-so-modern CPUs it was not faster than AVX, for multiple reasons. An interesting fact is that today, on AMD CPUs, AVX-512 often works better than on Intel CPUs, where it originated; but you can still get latency spikes due to the usage of these instructions. And this is my reaction to all of this: it is so complex, so many difficult details. So what to do, what should we do?

Another example: non-temporal stores. Should we use non-temporal stores to bypass the cache? If you run your benchmark and test your memcpy, most likely you will get the impression that yes, non-temporal stores improve everything. Until you put this implementation into production, where it will actually slow down the code. Why? Because non-temporal stores are intended for the case when the data is not accessed after it was copied, at least not for a long time. In production code, the data is typically copied in order to be used almost immediately. So it is almost completely pointless in production.

Okay, so better not to optimize memcpy, not even to try: you will optimize it for your machine, you will write a blog post about your amazing benchmark and your amazing implementation, and your software will slow down on someone else's machine. But I will try to optimize memcpy anyway.

Okay, so for the benchmark we will use different buffer sizes, different sizes of memcpy calls, different random distributions, with actually different probability
distributions; different numbers of threads; and different directions, different relative positions of the buffers. And we have a lot of existing implementations to compare: nine variants from glibc; memcpy from Cosmopolitan Libc; rep movsb; a simple loop with different options; and an implementation I found on a Chinese website. Yes, two different Chinese variants: one traditional Chinese and one simplified Chinese. Plus my own implementation, and also an implementation from the StringZilla library.

And I have to test it. For testing I have machines from AWS, all recent generations, which includes at least four different Intel CPUs and at least three different AMD CPUs. For this benchmark I have 42 thousand measurements in total to compare: I basically test all variants for realistic data sizes, multiplied by the number of machines. And when I tested all of this, the result was that none of these implementations is actually the best. There are good implementations, but under certain conditions one implementation wins, and under other conditions another wins. For example, memcpy from Cosmopolitan Libc is really good, but not for small sizes. StringZilla is really great, and glibc is also not much slower, and sometimes faster. But glibc's memcpy cannot be inlined, and I really want to inline my memcpy implementation for small sizes. So I still have hope that I can make my own, the best, memcpy.

How to do it, if none of these implementations is the best? The idea is to make a generalized memcpy. So I will generate this code in C,
which is similar to C++, because I use something similar to templates, but with macros: I just include the same file many, many times with different values of these macros, the vector size, whether the vzeroupper instruction is enabled, how many times to unroll the loop, and something else. And this C code generates code with inline assembly. So I have macros for the register name, for the instruction name, for whether it is aligned, for the size, and so on. And it looks something like this: we have a function with inline assembly, and the inline assembly can itself have macros, which is really handy for me. Can you write this code in Rust? Yes, yes. I wanted to say that there is at least something where C++ and C are better than Rust in this case, but maybe there is nothing. It's just normal code.

And different options work better under different conditions. So I thought: maybe I can implement a self-tuning memcpy. Because even if I do dynamic dispatch by CPU model, I still cannot decide; there would be too many things to decide, and it would inevitably become outdated with the next CPU model. So what about one interesting trick to make this memcpy self-driven? For small sizes we will have small, constant, inlined code, but for large sizes we will invoke different variants and calculate statistics: we will really quickly record which variant took which number of instructions, and then we will converge to the statistically best option, and all of this at runtime. So let's do it for sizes of at least 30
kilobytes: if the buffer is smaller than 30 kilobytes, we will just use the regular implementation. And why 30 kilobytes? Because I calculated that memcpy in L1 cache should run at something like 100 gigabytes per second at least (actually it should be more today, because this calculation was done a long time ago), and we have a budget for our statistics bookkeeping of just a few tens of nanoseconds at most.

So it looks as follows. In my memcpy I have a counter, a thread-local variable like count, and depending on a threshold I invoke either exploration or exploitation mode. In exploitation mode I invoke the currently selected variant. And in exploration mode there is a function explore that calculates a hash, and depending on this hash it invokes a random variant, and it basically measures the time. This is imprecise, and it is probably not the right way to measure time, but okay. And guess what: it must definitely be the best option, I must definitely have invented the fastest memcpy. And it worked: it was successfully the best on the benchmark. Then I put it into production and started testing real queries, and it was slow. Actually it was quite okay, comparable to other memcpy implementations, but there was no speedup, no expected speedup. So the self-tuning, magical memcpy that I spent so much time implementing was almost completely useless.

And this is my impression. Why, why was it useless? One hypothesis is that if you have a
memory-bound workload... Say you have a CPU with, let's say, 64 cores, 128 threads, something like an AMD EPYC with eight memory channels. Let it be DDR5, and DDR5 has, whatever, let's say 50 gigabytes per second per channel. If you multiply by eight, we get 400, is it 400? yes, 400 gigabytes per second per socket. But if you take 128 threads (and here hyperthreading is fine: logical threads can issue these memory instructions), you will get saturated by memory bandwidth. And in this memory-bound workload it is not really important whether you switch from one memcpy implementation to another for large sizes. But for small sizes it is really important, and inlining memcpy makes sense.

So was all of this useless? No, not quite. I still replaced one memcpy implementation with another that I have written, not with inline assembly but in actual C++. So it is a function with extern "C", it is entirely in the header file, so it is inlined, and it uses SIMD intrinsics; today it actually uses just SSE2, so nothing fancy. But again, it is not final: we found out that if I use masked loads and stores from AVX-512, it gives a benefit, and maybe I should continue this work. But the problem is that all of our production in ClickHouse Cloud runs on ARM, and I have spent so much time on this, and I would have to spend even more to optimize memcpy on ARM.

So, conclusions: do optimize your code, don't be afraid, try crazy ideas, and never give up.

[Host] Thank you for the talk. Now it's time for questions. Raise your hand and I will pass my
phone around. I see, I see, I'm coming.

Q: Hi, thanks, that was fantastic. Okay, I have two questions. At the beginning you showed a profile made with perf, with memcpy being, I think, 10% or more of the runtime. Does that include inlined memcpys?

A: No. If it is inlined, it does not show up, at least in perf; at least by default it will show up inside different functions.

Q: And my second question: would you be able to record the real-world distribution of the memcpy sizes?

A: Yeah, that's a really good question, and I thought about it. I even tried to do it: record the addresses and sizes into a file. But I did not get to incorporating it into the benchmark, although it's a really good idea that I wanted to try.

Q: One thing that I have seen in my application is that the compiler is clever as long as it knows the size. But if you don't have the size, and you know as a programmer that you have varying but small sizes, then it can be better to do the loop unrolling yourself. For example, I have some code that commonly takes between 6 and 50 bytes, so I made an outer check for the size, and if it is a small size I manually unroll it, which is much faster than using the function you said cannot be inlined, the memcpy AVX variant that is normally used.

A: It's really, really hard to hear the question,
but I almost got it. So, what to do if the size is small but not known at compile time? Interesting: in this case memcpy is, in one sense, similar to Duff's device. You have a few jumps, depending on the remainder of the size, into a loop that has a few wide instructions, like a few SIMD instructions. And this pattern is typical in many implementations, including my implementation and Cosmopolitan Libc's: all of them have something very similar to Duff's device. But I'm afraid maybe I did not hear the question well.

Q: Another question. Sorry, I cannot hear... maybe like this, can you hear me? Okay. Did you look at any interprocedural optimization with the compiler, to see if you could avoid some of the memcpys? You do have some interprocedural optimization, right?

A: The question was whether we build the compiler with IPO, interprocedural optimization. I can answer about interprocedural optimization: we use ThinLTO in clang in the release build. It does not help with memcpy, but it helps with many other functions.

Q: Maybe, can you hear me? Maybe just to follow up on the question just asked: if you happen to use -O2, interprocedural optimization has probably been carried out by the compiler. So, if you
tested your code and the original memcpy slowed down, did you enable the optimizations, the interprocedural optimizations that are part of the optimization pipeline at -O1, -O2, -O3?

A: Yeah, actually we use -O3 in clang, and we try to enable some extra stuff. We also use some things like disabling ABI compatibility, because we have all the source code bundled, and we enable just a bit of the new stuff that appears in clang. I'm not sure if we include that particular optimization that you mentioned. Maybe we can ask the room, the C++ experts, about specific new interprocedural optimizations that are not enabled by default at -O3.

Q: Sorry, just the last one, a fun fact: have you tried the memcpy attribute? In GCC there is a compiler attribute that marks a function as a memcpy, and then the compiler understands that it is a memcpy, so it can do even more optimization things on top. So it's just a fun fact: if you want to do more compiler stuff on top, just add the attribute and see if it performs even better; it can do things like eliding memcpys and stuff like that.

A: Yeah, it's really interesting, although we stopped using GCC a few years ago, so now I would have to ask whether clang has a compatible optimization. Also, if you are a fan of GCC, please don't be worried: it's good, actually. And I think maybe it's planned, since they normally make the attributes compatible. But thanks.

[Host] I'm sure we are out of time, so let's move on. Thank you for the talk.