My next speaker is Bryan. Bryan has spoken in the Go devroom before, even online if I remember correctly, on some amazingly complex subjects about how Go works internally, and he has been doing an amazing job. So I invited him back this year, well, actually, he asked, to talk about Swiss maps, which I previously thought meant Switzerland's map. He's going to talk about the actual Swiss map, so a round of applause for Swiss maps.

This is a Swiss map. All right, no, well, okay. So Swiss map is the new map implementation in Go 1.24, and that's why we're here.

Hello, my name is Bryan Boreham. I work at Grafana Labs, where pretty much all of our backend code is written in Go. I've been working in Go for 10 years or so, and at work I work on these things: open source projects for storing metrics, logs, and traces.

Why am I talking about this? Well, I watched this talk by Matt Kulukundis. Is Matt in the room? No, okay. Is anyone who worked on this in the room? No? That's good, I can just tell you anything, okay.

So yeah, I watched this talk, which is about the C++ implementation of Swiss maps, and I found it very engaging and funny. Matt's talk is an hour long, and in fact it's part two, so he took two hours, and I have like 25 minutes, so we're going to go fast.

So, there's a map, right?
There's a map in Go. A map is a key-value store in memory, and the one thing that we need to get out of a map is that everything is pretty much constant time: putting things in, looking things up, and deleting things, all constant time. So how do we do that?

This is, kind of abstractly, how a map works, a hash table. How does a hash table work? Well, we take the key (I can't see the red dot, oh well), the key, which is over on the left, and we put it through a hash function. The hash function generates a 64-bit number, and that number changes for every different key, hopefully. 64 bits might be smaller than your key, so two keys might sometimes map onto the same hash, but hopefully you get a wide diversity of hashes. Then you take that number, modulo the number of buckets, and if two things map onto the same bucket, then you make a chain with pointers. This is called open hashing, the kind of classical technique that you might have learned in university, something like that.

There is another way of doing it, called closed hashing, where instead of making a chain you just put the entry in the next slot. That's not entirely true; I'm going to lie from time to time in order to go faster and simplify what really happens.
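To make the two schemes concrete, here is a minimal sketch of closed hashing in Go. The type name, the hash function, and the tiny fixed-size table are all my own invention for illustration; the real runtime map is far more elaborate.

```go
package main

import "fmt"

// tinyMap is a toy closed-hashing table: on a collision we probe the
// next slot instead of chaining with pointers. Illustration only.
type tinyMap struct {
	keys [8]string
	vals [8]int
	used [8]bool
}

// hash is FNV-1a, just to have a deterministic hash for the sketch.
func hash(s string) uint64 {
	h := uint64(14695981039346656037)
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211
	}
	return h
}

func (m *tinyMap) put(k string, v int) {
	i := hash(k) & 7 // power-of-two size, so modulus is a cheap mask
	for m.used[i] && m.keys[i] != k {
		i = (i + 1) & 7 // closed hashing: try the next slot
	}
	m.keys[i], m.vals[i], m.used[i] = k, v, true
}

func (m *tinyMap) get(k string) (int, bool) {
	i := hash(k) & 7
	for m.used[i] {
		if m.keys[i] == k {
			return m.vals[i], true
		}
		i = (i + 1) & 7
	}
	return 0, false
}

func main() {
	var m tinyMap
	m.put("metrics", 1)
	m.put("logs", 2)
	v, ok := m.get("logs")
	fmt.Println(v, ok) // 2 true
}
```

With open hashing the collision would instead allocate a chain node and follow pointers; the probing version keeps everything in one flat array, which is what makes it so cache-friendly.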
So this one is called closed hashing. I put those words in big text; focus on the first two letters. This is why it's a Swiss map: because CH is the country code for Switzerland. It didn't hurt that the people mainly working on it worked at Google Zurich, but apparently this is the story: it's called a Swiss map because it uses closed hashing.

All right, let's get into it. The previous version, the map up to Go 1.23, just for contrast: how does that work? I'm going to have a spirited attempt to get a dot... no, how about if I move this somewhere over here? Yeah, right. So this is the data structure in memory. It starts with a little bit of header information, and there is one slice of buckets. Just like we saw before, we take the hash value; the number of buckets is always a power of two, so the modulus is cheaper, because we're very, very focused on performance in this whole thing. And what does a bucket look like? In this case, the Go map bucket has up to eight things in it, and they are laid out as all the keys, then all the values, and a little bit of metadata for the bucket. If you manage to fill the bucket, another one will be created with an overflow pointer, so there's this chaining thing going on, but eight at a time. And if you get a lot of overflows, then the whole table is resized, doubled in size. So that's how the previous Go map worked.

This is the new one. It's pretty similar, but there are a couple of main important changes. The overflow pointer has disappeared, so, true to the name, we're using closed
hashing: we do not make chains of buckets. But the other big change, I mean, the keys now go key, value, key, value, that's a little bit of a change, but the big, big change is that instead of just one table of buckets, there are multiple tables. We'll look a little bit more at how that works, but basically what that allows is for the structure to grow more gently: each table can grow independently, so we don't have to double the size of the whole thing all at once.

Right, let's look in a little bit more detail. First of all, let's just talk about what that metadata is. Okay, so in this example I've got four entries in the directory and three tables, but two of the entries point to the same table; that's what allows it to grow gently. You can have one table, two, four, eight, whatever; you can have five tables, and the directory doubles in size as these things grow. We take a certain number of bits off the top of the hash in order to look up the directory. So in this case we've got four entries in the directory, so we take two bits. It's going to get more complicated, strap in. We take two bits off the top of the hash to look up the directory; we take the next bits, down to bit 57, to pick a bucket, by modulus with the size of the table. And then what do we do with the other seven bits?
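The hash-splitting just described can be sketched like this. This is a toy illustration under my own naming; the exact bit positions in the real runtime differ, and the function is not lifted from its source.

```go
package main

import "fmt"

// split carves one 64-bit hash into the pieces the talk describes: the
// top globalDepth bits index the directory, the middle bits pick a
// bucket within the chosen table (a power-of-two count, so modulus is
// just a mask), and seven low bits are left over for later use.
func split(h uint64, globalDepth uint, nBuckets uint64) (dir, bucket uint64, low7 uint8) {
	dir = h >> (64 - globalDepth)      // top bits -> directory entry
	bucket = (h >> 7) & (nBuckets - 1) // nBuckets must be a power of two
	low7 = uint8(h & 0x7f)             // the other seven bits
	return
}

func main() {
	dir, bucket, low7 := split(0xC0FFEE0000000042, 2, 8)
	fmt.Println(dir, bucket, low7) // 3 0 66
}
```

With two directory bits, four directory entries can share tables, and when a table splits, only the affected entries are updated, which is the gentle growth described above.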
Oh, we'll come back to that. I thought you'd be wanting to see some benchmarks, so I put a benchmark in next. It's faster. The x-axis is the number of elements. This is a benchmark from the Go runtime: we make a map, add a certain number of elements to it, like a million or whatever, and then we look up all of the elements, so every lookup in this benchmark is a hit. We run this as a Go benchmark, so we get nanoseconds per op.

Just to point out the numbers: it's basically the same for small maps. And the big picture, why does it slow down? Because the map no longer fits in cache. Your CPU has an L1 cache, which is tiny, a few tens of kilobytes, something like that; it has a level 2 cache, which is maybe a megabyte; it has a level 3 cache, which might be 32 megabytes. Once you get out of the fast L1 cache, it starts to slow down. So basically, with everything in cache they're going at about the same speed; as the map gets bigger there's this bit in the middle where it's like two or three times faster; and then it asymptotically reaches a few percent faster, on this benchmark.

Benchmarks are basically lies, right? So I ran it with a real program, which is Prometheus, which I work on, and unfortunately it's about the same: very little difference observed between Go 1.23 and Go 1.24 RC2. The reasons for that are probably complicated, and I have had some discussions with Michael Pratt, who did a lot of the work on the Go version of this, so hopefully we can clear this up. But yeah, that's
a little disappointing, so I thought I'd stick that in the middle, so we get the sadness over with.

Okay, so where's my clicker? Right, back to the metadata, and what we're going to do with those seven bits: we're going to store them in the metadata. Once again we take the key and we hash it, just the once, on the way in; well, we need to rehash all the elements if we did this thing of doubling the table, but generally speaking just once. We call the hash function, we get 64 bits, and we take the bottom seven bits and stick them in one byte of the metadata word. Why seven? Well, we need one more bit, which tells us if there's anything there at all: that one bit says whether the slot is empty or deleted, and if it's not empty or deleted, the other seven bits are from the hash. One byte per element: I've got three elements in this bucket, so we get three hash values, and then the other five bytes here are going to say that the slot is empty; that's the bit pattern with a one at the top, which means empty.

Let's look at an example of how this works in the code. I've made up some numbers here; once again, the 80 means empty. We're going to look for something whose seven bits are 03. Can anyone see it? Yes, it's easy if you're a human, but if you're a computer? The obvious way to do this, and in fact what the Go 1.23 implementation does, is to loop: a little for loop that says, is it the first one, is it the second one, is it the third one. Loops are kind of slow; cache misses and branches are the two things that are slow right at the lowest
level in your CPU. So we do something extremely funky, and I say "we", but I didn't write this: we replicate that byte eight times. We exclusive-or. Seriously. So the one that we're looking for is the one where we got zero, and then we do a funky thing to basically turn the zero into FF, and then we mask that down to just one bit. So in just a few instructions, with no branches, we have found the element we're looking for.

And I thought you might think I was lying, so I used Godbolt. This is the actual machine code for this function. The Go code is on the left; it kind of looks like noise. The machine code is easier to understand, and that is on the right here. And yeah, so we take the thing, we multiply it by that number, I mean, that's the decimal version of that 0101... pattern; we do a multiply, an XOR, an add, a NOT, an AND, and done. No branches. So that's pretty cool, but there's more.

If you read the description of this talk, I promised you SIMD. What is SIMD? Well, this is not SIMD: the typical, traditional way a computer works, you have one stream of instructions, you have one stream of data, and you compute one stream of results. The green bit is the logical processing unit in the middle of your computer. This is SIMD, single instruction, multiple data: we still have one stream of instructions, but we have multiple sets of data coming in at once and producing multiple sets of results. This kind of idea was invented really for graphics, or it can also be used in sound processing, but once it's in the CPU, people use it for all kinds of things, and we're going
to use it to look things up in a map. How do you do that? How do you write SIMD in Go? Well, you can't. The only way this works is inside the compiler. This is in the Go compiler source code, internal/ssagen/intrinsics.go: it matches on the name of that function which we had earlier. And if you set this environment variable, GOAMD64, which takes a value v1, v2, v3, or v4, that tells the Go compiler how capable your processor is. The default is v1, which is like, I don't know, a Pentium from 1980 or something that nobody uses; pretty much every processor has got these nice SIMD instructions, but you do need to set the variable to v2 to convince the compiler that it should be using them.

And yeah, I couldn't use Godbolt for this, because you can't set the environment variable, so you can't persuade Godbolt to compile that exact function, so I can't show you it that way. But I'll show you in this visualization how the SIMD instructions work. The basic point is that things like that multiply we saw earlier, a multiply is a pretty slow instruction in the CPU, it might take 14 nanoseconds, whereas most of the instructions we're going to look at here take one nanosecond, plus or minus. I lie a lot.

Okay, so we saw this before, right? These are the seven-bit hashes, three of them, and 80 means empty. This instruction is called a broadcast: we go from one byte to eight bytes, all the same, in a single instruction, in a single clock cycle.
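As an aside, the scalar multiply-and-XOR trick shown earlier on Godbolt can be written out in plain Go. This is my own sketch of the technique, not the runtime's actual code; the constant names are mine, and like the real thing it can produce occasional false positives, which is fine because the full key is compared after a match anyway.

```go
package main

import "fmt"

const (
	lsb = 0x0101010101010101 // low bit of each byte
	msb = 0x8080808080808080 // high bit of each byte
)

// matchByte returns a mask with the high bit of byte i set when byte i
// of group equals h: replicate h into all eight bytes via the multiply,
// XOR so matching bytes become zero, then detect the zero bytes without
// any branches. Rare false positives are possible, as in the real code.
func matchByte(group uint64, h uint8) uint64 {
	v := group ^ (lsb * uint64(h)) // matching bytes are now 0x00
	return (v - lsb) &^ v & msb    // branch-free zero-byte detector
}

func main() {
	// A group of eight metadata bytes, lowest-order byte first:
	// 0x03, 0x40, 0x12 are 7-bit hashes; 0x80 marks an empty slot.
	group := uint64(0x8080808080124003)
	fmt.Printf("%016x\n", matchByte(group, 0x03)) // 0000000000000080
}
```

The SIMD version described next does the same job, but with dedicated broadcast and compare instructions instead of this multiply, XOR, subtract, and-not, and sequence.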
We don't have to do an XOR and an AND and a NOT, because there's a single instruction that compares each byte individually, so we find the one that we're looking for, and then we squish that down to a single byte, where one bit tells us which position we found it at. And it's a little confusing because the bytes and the bits are the opposite ways round, so blame Intel.

So, yeah, this is about as complicated as my talk gets. What are we looking at here? We've used SIMD instructions, where each instruction executes in more or less one clock cycle, and from a bucket of up to eight elements we've found the one, in my example, that matches. Because there are only seven bits, there's a one in 128 chance that you get a false positive, so a little bit better than 99% of the time you find the exact right one instantly. We need to check the whole key anyway once we find a match, because the hashes are small. So false positives are a little bit of a drag, but less than one percent of the time. And this blows my mind; this is really why I wanted to do this talk, because this code is so clever and so tight, and I believe this is down to Jeff Dean and Sanjay again, who are the sort of superbeings of Google.

All right, what else have we got? More benchmarks. This is memory size. The 1.24 map is generally a bit smaller, and sometimes it's a lot smaller. So let me just explain how I drew this chart, how I measured this.
Actually, when you run Go benchmarks, it gives you a memory number, but that includes memory that was allocated and freed. This is just the allocated memory, the stuff that's left live, what your program actually needs to run. And what I do is call make, and I don't give it the right size; I don't give it any size at all, so the map has to grow. What we see with those humps is all of those overflow buckets, and then the doubling; those two things give you this massive jump. So the Go 1.24 map can be a lot smaller, just depending on how unlucky you get. It tends to always be a bit smaller, because it is packed tighter; what we call the load factor, how many elements a map will allow before doubling in size, went up slightly. Yeah, that's the overflow buckets; that's why those big humps are there. Okay, shut up.

This is the worst case for memory. This is a map of an empty struct, right, a struct with nothing in it, and people use this as a set type: we just want to know whether something is in the map or not. The 1.24 map ends up allocating eight bytes for each empty struct, which is a bit of a mistake. Sorry. So yeah, this is showing the Go 1.24 Swiss map in the worst possible light; from all the benchmarks and all the investigation that I did, this is the worst thing it does. Probably this will get fixed. Oh, by the way, the release. Oh yeah, let me talk about that in a second. This is the issue to track, 71368, if you want to know when they fix the empty struct bug, or inefficiency.
The release of 1.24 is delayed because I found this bug. It's not a bug in Go; it's a bug in a library called modern-go/reflect2, which I had not heard of, but it is used in json-iterator, which I use all the time. json-iterator was using unsafe techniques to look at the internals of map; the internals have changed a lot, so it's broken. So there's going to be an RC3, Go 1.24 RC3, on Tuesday, roughly, and that pushes back the release by another week or so. This is the tracking bug for where the problem exists, but that library seems to be abandoned, so, you know, just stop using it.

In Go, they basically added an extra layer of indirection, because that's the way you fix every problem in computing. So when you use this, or any library that goes behind the back of map to look at the internals, it actually gets a compatibility layer, which now works. But adding the compatibility layer slows you down; it only slows down the people using the back-door access, the unsafe access. That's kind of worth knowing about.

All right, we're basically done. I wanted to put up all the people that really worked on it, just to be clear that I did not work on this. Very, very clever people. And yeah, I could take questions. I have some links here, and some credits.

Are there any pressing questions? I see a raised hand right over there. Please speak as close as you can to the microphone.

Thank you for the talk.
If we go down to the level of those instructions, does it mean there's an architecture dependency on the processor?

Yeah, yeah. So the code I showed was Intel/AMD code, and they have not written this code for ARM or any other architecture yet. But it can be written. And the other version, with the multiply, that does work on ARM, so it's already faster on ARM; we just don't have the SIMD implementation on ARM yet.

Thank you. If you have more questions, Bryan is going to be in the hallway. We have to switch speakers now.