WEBVTT 00:00.000 --> 00:11.520 Okay, so welcome to my talk, called UltraFast LuaJIT JSON Parsing. 00:11.520 --> 00:17.240 It's called UltraFast because at the time when I was submitting the proposal, I still 00:17.240 --> 00:22.400 wasn't sure how fast it was going to be, so we will see. 00:22.400 --> 00:28.480 So yeah, a bit about me, and why I'm interested in Lua and JSON. 00:28.560 --> 00:36.320 I'm a software engineer at BeamNG. It's a video game; we have a closed-source 00:36.960 --> 00:44.400 engine, unfortunately, but the Lua code, which is most of the logic, ships with an open 00:44.400 --> 00:52.000 license; you have to buy the game though. And my focus there is on the Linux port and security, so we are 00:52.240 --> 00:59.600 sandboxing some user code, and also a bit of hacking on LuaJIT, so if you want to talk about 00:59.600 --> 01:09.200 any of this, I would be happy to. So what is LuaJIT, if you haven't heard of it? It's a just-in-time 01:09.200 --> 01:16.640 compiler for the Lua programming language, and Lua is really commonly used in game development. 01:17.600 --> 01:24.240 LuaJIT also comes with a high-performance interpreter, and it's compatible with 01:24.240 --> 01:33.680 the Lua API. And why do we use it for the game? Because these Lua virtual machines are 01:33.680 --> 01:40.960 very small, and we run lots of them in threads, so for every vehicle we run a single Lua VM, 01:41.600 --> 01:50.320 and they communicate with each other. LuaJIT is also fast; we have quite a high update 01:50.320 --> 01:59.520 rate, so some systems in the game run at up to two kilohertz, and this is the problem. 01:59.840 --> 02:12.080 Okay, this might be bad. Okay, this is the problem.
So JSON, you know, it's a notation 02:12.080 --> 02:18.640 or a format for exchanging data; it's a text format. And Lua is a programming language, 02:18.640 --> 02:27.040 and this programming language has basically only one complex data type, the table, which is both 02:27.280 --> 02:37.760 an associative array and also a sequential array, so it has keys and values, 02:39.520 --> 02:47.040 and yeah, we are not only interested in the pure speed of the JSON 02:47.040 --> 02:55.200 parsing, but also in the creation of the Lua objects, because in the end both of these parts 02:55.760 --> 03:04.320 need to be fast enough for the use case, because we need to work with this object, not just 03:04.320 --> 03:14.640 parse the representation into an intermediate structure. So of course, there are many JSON 03:14.640 --> 03:23.040 libraries available in C++, and even in Lua there are a lot of libraries available, or you can 03:23.120 --> 03:33.760 combine these, because Lua has a C API, so some people made such bindings publicly available, or you can 03:33.760 --> 03:42.160 also do it on your own: you can take any C++ library and then write some Lua bindings 03:42.160 --> 03:52.000 in C or C++. They all exist, of course, but let's go through them one by one. So with the pure Lua libraries 03:52.640 --> 03:59.120 which are available on the internet, you can integrate them into your code really easily, 04:00.080 --> 04:09.680 but at least the ones that I tested are just not performant enough, because Lua is still a dynamic 04:09.680 --> 04:16.960 scripting language, and these implementations are frequently not focused on performance. 04:17.280 --> 04:27.520 So let's go to the C++ libraries, and here I think the state of the art is simdjson, 04:27.520 --> 04:36.800 which is very fast based on their own benchmarks. They say they can parse JSON at gigabytes per second.
04:38.080 --> 04:45.840 Of course, for this you need the extra bindings, which might make the whole thing slower. 04:47.280 --> 04:54.560 And the biggest problem for our use case: of course you can use simdjson for JSON parsing, 04:54.560 --> 05:02.640 but our JSON format is actually not strict JSON, and we have some extra features in there, 05:03.680 --> 05:13.200 which I will talk about later. And then you have the middle ground, the bindings; 05:13.280 --> 05:21.280 they suffer from the same problems as the pure C++ libraries. It means, for example, 05:21.280 --> 05:27.840 if you extend lua-simdjson, you still have the problem that it's a strict parser. 05:28.880 --> 05:34.480 And some of them, for example also lua-simdjson, which is publicly available, 05:34.480 --> 05:40.800 use the Lua C API, and this might be a surprise, but using the Lua C API is 05:41.600 --> 05:46.320 not so performant, and it actually hurts the parsing a lot, as we will see. 05:48.000 --> 05:52.000 So it's time to reinvent the wheel and write a JSON parser again. 05:53.040 --> 06:00.400 And this is our format, it's called jbeam, but in the end it's just JSON with comments, 06:00.400 --> 06:08.480 or SJSON. If you see this file, it kind of looks weird, so let's check all the 06:08.560 --> 06:14.240 differences. You have comments, which don't do anything, but you have to parse them; 06:15.440 --> 06:21.280 you can have equals signs instead of colons, just for whatever sake; 06:22.160 --> 06:28.320 commas are fully optional, they are treated as whitespace; and also you don't need to quote your 06:28.400 --> 06:38.720 strings when they are the keys of the dictionaries. That's basically it. Yes, so you can find 06:38.720 --> 06:45.680 this format, it was used in multiple other projects first, but there is not one single 06:45.680 --> 06:54.160 exact specification for it.
So we wrote our own parser, and first we wrote it in Lua, 06:54.240 --> 07:05.440 because that is easy. And it has already been used in the game for the past 10 years, for example, 07:05.440 --> 07:10.320 so it works well, and it's small: you can write the parser in 300 lines of code. 07:12.160 --> 07:20.000 And it also parses JSON, because SJSON, or jbeam, 07:20.160 --> 07:30.720 is a superset of JSON. But of course it has some caveats: it is not a validating parser, 07:30.720 --> 07:37.520 because, for example, it fully ignores commas, so it cannot tell you whether a JSON file 07:37.520 --> 07:43.120 follows the specification or not, so it also parses some files which are not strict JSON. 07:43.200 --> 07:50.320 And it was written to be JIT-friendly, as we use LuaJIT. So let's just, yes, there is going to be 07:50.320 --> 08:02.400 quite a bit of source code. So let's see the basics. This is the basic function, and 08:04.240 --> 08:12.400 yeah, we start doing some tricks to be faster. So LuaJIT, or Lua, is a garbage-collected 08:12.480 --> 08:19.600 language, so for the duration of the parsing we disable garbage collection, because it's 08:19.600 --> 08:28.720 much faster when we are working with lots of new objects; so that is trick number one. 08:30.320 --> 08:39.760 And then skipping whitespace is a big part of the JSON parsing, because there can be quite 08:40.080 --> 08:47.120 a bit of useless whitespace in JSON, of course not if your JSON is minified, but in this case 08:47.120 --> 08:55.120 we have semantic meaning in all these comments, so we want to keep them. And JSON parsing 08:55.120 --> 09:03.280 is just going through all the characters, basically, and checking whether they are a space or 09:03.520 --> 09:11.120 whether they are a comma.
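The whitespace-skipping trick described here can be sketched in C; the function name and the exact character set are illustrative, not the talk's actual code. The key idea is comparing raw byte values directly, and that jbeam treats commas as whitespace:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: skip whitespace by comparing raw byte values.
 * The comma is included here because the jbeam format from the talk
 * treats commas as whitespace. */
static size_t skip_whitespace(const uint8_t *buf, size_t pos, size_t len) {
    while (pos < len) {
        uint8_t c = buf[pos];
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == ',')
            pos++;
        else
            break;
    }
    return pos;
}
```

In the Lua version the same comparison is done against numeric byte values obtained with `string.byte`, for the reason explained next.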
As you can see, we use the ASCII values for this, because Lua 09:11.120 --> 09:18.560 just doesn't have a character data type, so you would have to use a one-character string, which is 09:18.560 --> 09:27.600 much slower; so to go faster, use numbers. And the important takeaway: whitespace skipping can 09:27.680 --> 09:39.520 take a lot of time. We will see later that some optimizations are related to whitespace skipping. 09:40.480 --> 09:49.440 And the core of this JSON parsing algorithm is this peek table, which only looks 09:49.440 --> 09:58.160 at the next character and decides what you need to parse. If you have written a parser before: 10:00.000 --> 10:08.000 JSON is really easy, you don't need to look far ahead, you just get to see the one single 10:08.000 --> 10:17.040 next character, and this is the full implementation of the parsing. 10:17.040 --> 10:24.240 Like, if you see 'n' as the next character, it can only be null or it's invalid JSON; the same for 10:24.240 --> 10:29.840 the Boolean values, and 'i' especially, because you can have Infinity, which is a number; 10:30.880 --> 10:39.120 and then the strings, and for our use case, comments. So our peek table just looks again 10:39.120 --> 10:46.560 at the ASCII values, and based on that, for example, if it encounters the value of 't', then it just 10:46.640 --> 10:57.520 checks if the next three characters are 'rue', so true, and if not, then it fails. And the only 10:57.520 --> 11:08.560 more complex functions are the ones that parse your objects or arrays. So you see that first 11:08.640 --> 11:16.560 we allocate a new table, then just read the key, which is a string in JSON, 11:16.560 --> 11:23.600 then skip whitespace, read the value, which can be anything, so here is the recursion, and 11:25.200 --> 11:34.400 that's basically it, but again, there are some optimizations done.
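The single-character dispatch described above can be sketched like this in C (the names and the token enum are made up for illustration; the talk's parser is in Lua and dispatches through a table indexed by byte value). JSON never needs more than one character of lookahead to decide what value comes next:

```c
#include <string.h>

/* Illustrative peek-dispatch: the first character of a JSON value
 * fully determines what must follow, so 't' commits us to "rue". */
typedef enum { TOK_NULL, TOK_TRUE, TOK_FALSE, TOK_STRING, TOK_NUMBER,
               TOK_OBJECT, TOK_ARRAY, TOK_ERROR } token_kind;

static token_kind peek_value(const char *p) {
    switch (*p) {
    case 'n': return strncmp(p + 1, "ull", 3) == 0 ? TOK_NULL : TOK_ERROR;
    case 't': return strncmp(p + 1, "rue", 3) == 0 ? TOK_TRUE : TOK_ERROR;
    case 'f': return strncmp(p + 1, "alse", 4) == 0 ? TOK_FALSE : TOK_ERROR;
    case '"': return TOK_STRING;
    case '{': return TOK_OBJECT;
    case '[': return TOK_ARRAY;
    default:
        if ((*p >= '0' && *p <= '9') || *p == '-') return TOK_NUMBER;
        return TOK_ERROR;
    }
}
```

Objects and arrays then recurse back into this dispatch for each contained value, exactly as the talk describes.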
For example, this 11:35.360 --> 11:45.600 table.new function: you specify the number of values that should be preallocated. 11:45.600 --> 11:52.160 LuaJIT tables are special in this way, in that they have the sequential array part 11:52.160 --> 11:58.880 and then the hash part, which is the associative array. And we specify, 11:58.960 --> 12:09.360 yeah, we preallocate three values, because that was a value selected by benchmarking, of course, 12:09.360 --> 12:18.480 but it's important to focus on this if you are optimizing. And another of the tricks: 12:19.600 --> 12:25.760 we actually have the skip-whitespace function, I showed it to you, but it proved to be faster 12:25.760 --> 12:35.920 to copy the skip-whitespace implementation into the calling code, because Lua, or LuaJIT in general, 12:35.920 --> 12:46.000 has overhead for function calls. So, sorry, so inlining gives you some extra performance boost. 12:47.600 --> 12:54.160 Now let's focus on the benchmarking. I used the latest commit of LuaJIT, because it's 12:54.480 --> 12:59.600 a continuously developed project; this is my CPU; my system doesn't matter too much. 13:00.640 --> 13:10.480 I made this benchmarking protocol on the go, yeah, this is kind of my first time benchmarking 13:10.480 --> 13:16.720 a project, so of course, you have probably seen the talk before about how to measure performance 13:16.800 --> 13:24.240 reliably, so I followed it. But in the end, I got some of these things right.
I run passes 13:24.240 --> 13:32.160 over some JSON data sets, and I run until either I reach some number of these passes or a 13:32.160 --> 13:43.520 timeout runs out, and I do some things so that the runs do not depend too much 13:43.680 --> 13:50.480 on each other: I collect the garbage before I measure, and I also flush the JIT cache, 13:50.480 --> 13:58.400 which means the JIT has a clean state, so in theory it shouldn't be influenced by the fact 13:58.400 --> 14:05.040 that I already parsed this JSON file one thousand times before, because otherwise we would 14:05.040 --> 14:11.200 just have everything in the JIT cache. I take the mean time of the passes, and I repeat everything, 14:11.200 --> 14:18.720 so I have these outer runs; I repeat everything five times, take the mean of 14:18.720 --> 14:26.400 the means, and calculate throughput out of it. I was also looking at the minimum and maximum, 14:26.400 --> 14:32.880 but I didn't include them in these numbers, too many numbers. And to further reduce the variance 14:32.880 --> 14:43.920 on my machine, yes, I set the priority to the maximum, and it also helped a bit to 14:43.920 --> 14:53.760 always force the same CPU core, like the same CPU affinity, using taskset. So let's talk a bit about 14:53.760 --> 14:59.920 the data sets I used. The first one is the jbeam data set, so these are the files that we actually 14:59.920 --> 15:10.640 want to parse. These are SJSON files, which are mostly numbers, some positions, and short strings, 15:11.680 --> 15:23.040 and they represent the physical structure of these vehicles. So they tell 15:23.120 --> 15:31.840 which of these nodes, which have IDs, are connected to each other, and the strength of these 15:31.840 --> 15:41.200 connections, for the purpose of the simulation.
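The aggregation part of the protocol (mean pass time per outer run, five outer runs, mean of means, then throughput) could be sketched as follows; the function names are invented for illustration and the timing, GC, and JIT-flush steps are omitted:

```c
#include <stddef.h>

/* Sketch of the talk's aggregation: each outer run yields a mean pass
 * time in seconds; the reported figure is the mean of those means,
 * converted into throughput (bytes per second). */
static double mean(const double *xs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += xs[i];
    return sum / (double)n;
}

static double throughput_bps(const double *run_means_sec, size_t runs,
                             double bytes_per_pass) {
    return bytes_per_pass / mean(run_means_sec, runs);
}
```

Averaging the per-run means rather than pooling all passes keeps each outer run (with its fresh GC and JIT state) weighted equally.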
And I did the benchmark, and our parser is the fastest, 15:41.200 --> 15:48.480 but only because the other parsers cannot parse SJSON, since there are comments, so this does not tell 15:48.560 --> 15:58.640 us much. Therefore, I made the JSON data set, which contains some testing files which 15:58.640 --> 16:07.280 simdjson is also using. So, 20 files with different sizes, different content, different structures, 16:08.800 --> 16:17.280 in total 15 megabytes, and I ran it again, and our parser is not the fastest anymore. 16:18.000 --> 16:28.400 You see, with JIT it still reaches reasonable performance, but lua-simdjson is faster. We don't 16:28.400 --> 16:41.280 get the gigabytes per second they were advertising, but in the end it's okay; still, we were not happy 16:41.280 --> 16:47.520 with it. Also, when you look at the number with JIT off, which is a realistic situation, 16:47.520 --> 16:53.120 as there are some platforms where the JIT is not available, the performance is really bad. 16:54.720 --> 17:03.600 And yeah, this is just an extra one-file data set. I wanted to see the difference between these 17:03.600 --> 17:09.600 20 files and one file where I already understand what's inside, so these are just 17:10.240 --> 17:18.240 Twitter statuses, it's really commonly used, and the numbers are quite similar, or at least the 17:20.960 --> 17:31.120 positions of these parsers. So, we have our Lua parser, which beats the other 17:31.120 --> 17:38.000 pure Lua parsers, but the simdjson binding is faster, and also the performance of the Lua parser 17:38.000 --> 17:46.800 relies on the JIT. And this is the point of the talk: whether we can go closer to the source code and 17:47.680 --> 17:56.080 get a faster implementation.
Just a bit about the LuaJIT source code: it's not large, it's 17:56.160 --> 18:06.400 around 70 C files, and some handwritten assembly, but in the end, if we just want to 18:06.400 --> 18:12.960 hack on it a bit, we are interested in two files: the one which does serialization of an 18:12.960 --> 18:24.560 internal LuaJIT format, and string buffers. So, string buffers are, let's say, a module in 18:24.560 --> 18:32.960 LuaJIT which allows you to have mutable strings, because normal Lua strings are not mutable 18:32.960 --> 18:43.520 and are also interned. And this is the serialization function that LuaJIT already has. So, you 18:43.520 --> 18:51.520 encode a table, you see this is the same as encoding a table, for example, to JSON, but it 18:51.600 --> 18:59.120 encodes to a different format, and you can also take this format and decode it back. But we can 18:59.120 --> 19:07.840 steal the source code of this and basically adapt it to generate JSON and decode JSON instead. 19:07.840 --> 19:14.960 So, it's almost the same: you see the same primitives, skipping whitespace, and again 19:14.960 --> 19:23.440 peeking at the next value is the same, here we go, character by character. 19:25.120 --> 19:34.960 So, here are some simple things, I will start skipping a bit, okay, and here it is really similar 19:34.960 --> 19:43.120 to the Lua implementation. Again, you need to preallocate some memory with the table-new function; 19:43.200 --> 19:51.680 it does the same thing, except the number in the C function is one more, because the 19:51.680 --> 20:01.120 C API indexes from zero, and the Lua one from one, and then it writes the Lua values.
20:01.920 --> 20:08.720 So, we did this naive C implementation, and it's much better when the JIT is off, 20:08.720 --> 20:15.920 because the performance doesn't depend on the JIT anymore; but with the JIT on, we didn't get a 20:15.920 --> 20:24.320 speed-up, or it's even worse. So, what optimizations can we do? Branch predictor hints are 20:24.320 --> 20:33.040 already part of LuaJIT, and we can basically give a hint to the compiler that this 20:34.000 --> 20:42.240 shouldn't happen, so you should always expect that the likely part 20:42.240 --> 20:49.200 is happening. So, for example, here, every time we encounter an error flow, like invalid 20:49.200 --> 20:55.840 JSON, or the end of the file, we don't need to care about whether it's slow or fast; 20:55.920 --> 21:05.920 it's unlikely to happen for us. Then, the number parsing: initially I used the 21:05.920 --> 21:15.440 LuaJIT function which deals with all sorts of formats, but I did some profiling, and it showed 21:15.440 --> 21:23.600 that it's not super performant. So, I simplified the code, I really wrote a 21:23.600 --> 21:31.680 naive implementation which just parses the digits until it encounters the 21:31.680 --> 21:40.800 dot, and then parses the fraction part; it was much faster. For the 21:40.960 --> 21:48.320 scientific-notation numbers, I still use the original function, because I'm lazy, but these numbers 21:48.320 --> 21:56.480 are not common in our use case.
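Both tricks from this part of the talk can be sketched together in C. The hint macros below mimic the spirit of LuaJIT's likely/unlikely macros but are defined here for illustration (using the GCC/Clang `__builtin_expect` builtin), and the number parser is a hypothetical fast path in the style described, not the talk's actual code:

```c
#include <stddef.h>

/* Illustrative branch-prediction hints in the style of LuaJIT's macros. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Naive fast path for plain decimal numbers: digits, optional dot,
 * fraction digits. Scientific notation would fall back to a fully
 * general routine (not shown). Returns bytes consumed, 0 on error. */
static size_t parse_number(const char *p, size_t len, double *out) {
    size_t i = 0;
    double value = 0.0;
    if (UNLIKELY(i >= len || p[i] < '0' || p[i] > '9'))
        return 0;                       /* error path, hinted as unlikely */
    while (i < len && p[i] >= '0' && p[i] <= '9')
        value = value * 10.0 + (p[i++] - '0');
    if (i < len && p[i] == '.') {       /* fraction part */
        double scale = 0.1;
        i++;
        while (i < len && LIKELY(p[i] >= '0' && p[i] <= '9')) {
            value += (p[i++] - '0') * scale;
            scale *= 0.1;
        }
    }
    *out = value;
    return i;
}
```

Note that accumulating the fraction with repeated multiplication by 0.1 loses a little precision compared to a full `strtod`-style conversion; that trade-off is acceptable when, as in the talk's files, the numbers are short.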
Then, inlining: we saw it in the Lua implementation, but here LuaJIT 21:56.480 --> 22:04.240 already contains the macros, so we can use them, and I went really hard with the 22:04.240 --> 22:12.080 inlining, because I can, and because I also wanted to keep the functions modular; but in the 22:12.080 --> 22:18.560 end, in the compiled code, I want everything to be inlined, so most of the functions in the parser 22:18.560 --> 22:27.280 are inlined. And then, yeah, don't forget to use a profiler: I saw that a lot of the time 22:27.280 --> 22:35.680 is spent resizing the tables, the Lua tables, and this happens because when I'm 22:35.680 --> 22:44.240 parsing, I set the initial size of the table, and then I parse the children, but there can be, 22:44.240 --> 22:51.200 for example, 5,000 children, and I only allocate the space for three elements, so this means 22:51.280 --> 22:58.400 I will have to reallocate from three, probably doubling the size a lot of times, and it's 22:58.400 --> 23:12.400 going to copy a lot of values. So, there is this scratch buffer, which is just 23:12.400 --> 23:18.320 a dynamic stack for these temporary values. It's important that, yeah, it is shared by all 23:18.400 --> 23:27.280 the recursion levels, because we parse up to 100 levels of recursion depth, and you still, of course, 23:27.280 --> 23:34.320 need to copy the values from the buffer, but you don't reallocate anymore, so it helps. 23:35.360 --> 23:41.920 And then, yeah, this is the cherry on top: for whitespace skipping and other routines, 23:41.920 --> 23:52.560 such as string parsing, do some intrinsics. There are two functions where it proved to be useful; 23:52.560 --> 24:01.440 there might be more, but again, I'm lazy.
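The scratch-buffer idea can be sketched like this; the names and the use of `double` values are invented for illustration (the real parser pushes Lua values). Children are pushed onto one growable stack shared by all recursion levels, then copied out once the final count is known, so the destination table can be sized exactly instead of being repeatedly reallocated:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative scratch buffer: one dynamic stack shared by every
 * recursion level of the parser. */
typedef struct {
    double *items;
    size_t  len, cap;
} scratch_buf;

static void scratch_push(scratch_buf *b, double v) {
    if (b->len == b->cap) {             /* grow by doubling, rarely hit */
        b->cap = b->cap ? b->cap * 2 : 16;
        b->items = realloc(b->items, b->cap * sizeof(double));
    }
    b->items[b->len++] = v;
}

/* Copy the values pushed since `start` into an exactly-sized array and
 * pop them off the stack; `start` marks this recursion level's base. */
static double *scratch_take(scratch_buf *b, size_t start, size_t *out_n) {
    size_t n = b->len - start;
    double *out = malloc(n * sizeof(double));
    memcpy(out, b->items + start, n * sizeof(double));
    b->len = start;                     /* pop this recursion level */
    *out_n = n;
    return out;
}
```

As the talk says, one copy out of the buffer still happens, but the many intermediate grow-and-copy steps on the destination are gone.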
So, yeah, for whitespace skipping, when you are parsing comments, 24:01.440 --> 24:06.240 you just need to find either the end of the line, or for multi-line comments, you need to find the 24:06.320 --> 24:14.560 next star, asterisk; and for strings, you either need to find the matching quote, or there 24:14.560 --> 24:23.200 can be escape characters, and then you go to the slow path. So, yeah, I just show the SSE4 24:23.200 --> 24:31.520 version, where basically you set up the search set of the values you want to find, 24:31.600 --> 24:38.720 and the improvement is that I scan through 16 values at once: this primitive gives me the 24:38.720 --> 24:44.480 index of the first match, or if I don't find a match, then I go on to the next 16 characters. 24:44.480 --> 24:51.600 Because it operates on 16 bytes at once, it provides some speed-up. The same benchmark, 24:52.560 --> 25:03.600 and our parser is faster. Now, we got, yeah, a 33% improvement, but 25:03.600 --> 25:10.400 which one of these optimizations was helpful, you might ask, or not. So I did an 25:10.400 --> 25:18.240 ablation study: the final version is the one that is published, and I disabled 25:18.640 --> 25:26.320 these optimizations one by one, and basically you can see that probably all of them are in 25:26.320 --> 25:35.280 some way helpful, and if you go to the SSE4 instructions, you even might get some small boost, 25:35.280 --> 25:41.520 but the default version uses SSE2, because that is available on every x64 CPU. 25:41.520 --> 25:52.800 And on the JSON parsing data set, this is important, we beat lua-simdjson, so out of the 25:52.800 --> 25:59.360 parsers that I knew at the time of making this, this should still be the fastest 25:59.440 --> 26:14.800 parser for LuaJIT, and on the Twitter data set it is the same.
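A sketch of the 16-bytes-at-a-time scan, in the SSE2 flavor the talk says is used by default (this is illustrative, not LuaJIT's or the talk's actual code). SSE2 is baseline on x86-64, so no extra compiler flags are needed there, and the `#ifdef` keeps a scalar fallback for other targets; `__builtin_ctz` is a GCC/Clang builtin:

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Illustrative SSE2 scan for a single byte, 16 bytes at a time, as one
 * might use to find the '*' that may end a multi-line comment or the
 * closing quote of a string. Returns a pointer to the first match, or
 * NULL if the byte does not occur in [p, end). */
static const char *find_byte(const char *p, const char *end, char needle) {
#if defined(__SSE2__)
    __m128i pat = _mm_set1_epi8(needle);
    while (end - p >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, pat));
        if (mask != 0)                  /* lowest set bit = first match */
            return p + __builtin_ctz((unsigned)mask);
        p += 16;
    }
#endif
    for (; p < end; p++)                /* scalar tail / fallback */
        if (*p == needle)
            return p;
    return NULL;
}
```

The SSE4 variant the talk shows instead sets up a whole search set of bytes with one string-compare primitive; the structure (test 16 bytes, take the index of the first hit or advance) is the same.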
So, we gained some speed-up: a 50% speed-up when the JIT is on, 26:14.800 --> 26:21.440 but when the JIT is off, which might be really important for some platforms, we get more than 10 times 26:21.520 --> 26:30.400 the speed, so parsing these files can take, yeah, instead of three seconds, 26:31.120 --> 26:38.560 it can take 0.3 seconds. And on the JSON data set, we are faster, and much faster than the 26:38.560 --> 26:45.760 other implementations. So, some tips at the end: try to combine these tricks, but you have to 26:45.840 --> 26:54.320 measure them; use the profiler; minimize copying; and yeah, focus on your files. If you are 26:54.320 --> 27:01.680 writing a parser yourself, and your files are not using scientific numbers or escape characters, 27:01.680 --> 27:09.520 then you can leave those paths of the implementation slow, and you will save some time. So, there are also limitations: 27:09.520 --> 27:19.120 yeah, the only thing we really care about is the parsing of a file for the 27:19.120 --> 27:26.240 first time, but it's very hard to measure the cold parsing time. So, in the end, we just take 27:26.240 --> 27:33.920 this proxy measurement; of course, it's a big limitation. And yeah, I had problems with some 27:33.920 --> 27:38.640 unstable performance numbers; in the end, lots of these tricks helped, but there are still 27:38.720 --> 27:44.320 problems. If you run the benchmark yourself, you will probably get a bit different values, 27:45.520 --> 27:54.400 even if you had the same CPU and system as me. So, I think that is it. Are we finished? The 27:54.400 --> 28:00.240 slides would say no, but I say yes, because we still have JSON encoding, but it's really simple. 28:01.040 --> 28:08.080 Our encoder happens to also be fast, but I never put any thought into optimizing it. So, 28:08.160 --> 28:14.640 thank you for bearing with me. Yeah, and if you have questions, then.
28:18.960 --> 28:21.200 Now it's time for a question, if you have one. 28:28.320 --> 28:35.600 Just a small question. So, for the assembly part, did you rely on compiler intrinsics, or did you use 28:35.600 --> 28:44.080 like loop autovectorization stuff? Oh, yes, I relied entirely on intrinsics, yeah, for example, 28:44.080 --> 28:49.760 in the SSE4 version I'm using the intrinsics here, no other compiler options or anything. 28:53.120 --> 28:59.280 I'm not sure if it's relevant, but did you consider, did you have cases with 28:59.360 --> 29:06.080 chunked parsing, where you had to parse a part, and then maybe wait on some data, and then 29:06.080 --> 29:15.440 had to parse again? So, like stream processing? Yeah, okay. In this case, in this case, no, 29:15.440 --> 29:21.280 you already need to have all the contents in the buffer. It would be an interesting optimization 29:21.360 --> 29:29.280 in those cases, yeah, it is something to look into, but no, this works only when everything 29:29.280 --> 29:30.560 is already in memory. 29:43.520 --> 29:50.640 Have you considered totally ditching the JSON part, and just writing your data in a Lua file 29:50.960 --> 29:58.720 directly? Of course, there might be a more optimal format. 29:58.720 --> 30:04.240 The problem is, these JSON files have been there since the beginning; the codebase is, 30:04.240 --> 30:09.200 like, more than 10 years old. So, in the end, the format is not going away. 30:09.200 --> 30:14.400 We might introduce something like jbeam version 2, which might be a more performant data format, 30:15.360 --> 30:22.400 but, like, from a backward-compatibility perspective, it would be hard to transition 30:22.400 --> 30:25.680 the format. So, we are staying with JSON. 30:31.520 --> 30:40.080 Are there more questions? So, it looks like we are done. Thank you again. Okay, thank you.