WEBVTT

00:00.000 --> 00:12.120
Okay, so we have our next talk. Alexey is going to tell us about Rust in ClickHouse, so please

00:12.120 --> 00:21.440
give him a warm welcome.

00:21.440 --> 00:27.560
My presentation today is not about how to rewrite everything in Rust.

00:27.560 --> 00:35.120
It's more about how to not waste years on rewriting, but do something more sane.

00:35.120 --> 00:40.520
It's about our approach to Rust in ClickHouse.

00:40.520 --> 00:42.000
So what is ClickHouse?

00:42.000 --> 00:50.600
It's an open-source project. It's pretty big: almost 2,000 contributors, 43,000 stars.

00:50.600 --> 00:57.160
It's a C++ code base, mostly C++, I would say 99%, maybe 95%.

00:58.160 --> 01:04.960
One and a half million lines of code. It has existed since 2009, and today it's the most popular

01:04.960 --> 01:07.960
open-source analytical database.

01:07.960 --> 01:17.560
And I can say C++ is not the nicest language, but in 2009 neither was Rust.

01:17.560 --> 01:25.480
So we started with C++, and it was quite a good choice back then, and it's still

01:25.480 --> 01:31.600
quite popular in databases, database management systems, in graphics applications,

01:31.600 --> 01:42.600
video games, computer-aided design, operating systems, drivers, scientific data analysis.

01:42.600 --> 01:50.960
So it still has its place. But the question is: if we started today, should we write ClickHouse

01:50.960 --> 01:53.040
in Rust, not in C++?

01:54.000 --> 01:58.480
Let's take a look.

01:58.480 --> 02:07.440
And the first question: is C++ a pain? Yes, it is.

02:07.440 --> 02:16.160
In big projects, people try to soothe this pain a little by adding even more pain:

02:16.160 --> 02:20.960
adding many, many ways to test the code base.

02:20.960 --> 02:28.500
So they have segmentation faults, they have data races, and they have to use all types of

02:28.500 --> 02:34.520
sanitizers: address sanitizer, thread sanitizer, memory and undefined behavior sanitizers.

02:34.520 --> 02:41.960
They have to use fuzzing, and we have to run all types of tests, including randomized

02:42.040 --> 02:47.720
tests, stress tests, performance tests, functional and integration tests, compatibility tests,

02:47.720 --> 02:54.080
logic tests, Jepsen tests for the correctness of distributed applications.

02:54.080 --> 03:01.640
And so on. We run about 10 to 30 million test invocations each day.

03:01.640 --> 03:08.640
So it's quite a big pain. And you know what? We still have segfaults in production.

03:09.640 --> 03:13.640
So, what are the benefits of Rust?

03:13.640 --> 03:19.640
There are quite a lot of reasons for using Rust today.

03:19.640 --> 03:26.640
The obvious one is memory and thread safety, but that is only the most obvious.

03:26.640 --> 03:34.640
Another reason is that many modern libraries, many modern projects, exist only in Rust.

03:34.640 --> 03:42.640
For example, libraries for data lakes. For Iceberg, for a long, long time,

03:42.640 --> 03:45.640
the only library was in Java.

03:45.640 --> 03:54.640
The same was true for Delta Lake, but then Databricks implemented a Rust library for Delta Lake.

03:54.640 --> 03:59.640
But a C++ library, a good C++ library, still does not exist.

03:59.640 --> 04:07.640
If we don't use either Java or Rust, how can we use Delta Lake?

04:07.640 --> 04:11.640
And another reason is that there is a lot of hype around Rust.

04:11.640 --> 04:17.640
I see this full audience, and I would say, yes, this is true.

04:17.640 --> 04:25.640
If this talk were about some boring C++ stuff, I would probably have just a few people in the audience.

04:25.640 --> 04:29.640
Maybe not here. But there are a few arguments against rewriting in Rust.

04:29.640 --> 04:34.640
The main argument: if we do a full rewrite, it will take years.

04:34.640 --> 04:40.640
Imagine your database, one and a half million lines of code. How do you do a full rewrite?

04:40.640 --> 04:45.640
We cannot just stop and allocate a year for doing this.

04:45.640 --> 04:48.640
And it will take not a year; it will take many years.

04:48.640 --> 04:58.640
Another reason is that using Rust is simple, and using C++ is slightly less so.

04:58.640 --> 05:04.640
But if you use both C++ and Rust, you will get pain from both languages.

05:04.640 --> 05:12.640
So maybe it's not a good choice to use two languages at the same time.

05:13.640 --> 05:18.640
And there is a lot of accumulated knowledge about C++, so why should we throw away this knowledge

05:18.640 --> 05:19.640
just to rewrite?

05:19.640 --> 05:23.640
And there is too much drama around Rust.

05:23.640 --> 05:33.640
What if we write a database and someone trolls us just because we use unsafe in one file

05:33.640 --> 05:38.640
and unwrap in another file, and people hate us?

05:38.640 --> 05:41.640
But I want something boring.

05:41.640 --> 05:45.640
I want something sane.

05:45.640 --> 05:46.640
Okay.

05:46.640 --> 05:50.640
By the way, performance and efficiency are not the deciding factor.

05:50.640 --> 05:54.640
You can write equally performant code in both C++ and Rust.

05:54.640 --> 05:59.640
Sometimes it will be easier in Rust, because you can do quicker iterations.

05:59.640 --> 06:01.640
You can optimize quickly sometimes.

06:01.640 --> 06:07.640
Sometimes you can do it faster in C++, just by avoiding complications with the borrow checker.

06:07.640 --> 06:10.640
Okay.

06:10.640 --> 06:16.640
So, the approach is not to do a full rewrite, but to do iterative development:

06:16.640 --> 06:25.640
to find some library, some small feature that we don't really care about,

06:25.640 --> 06:32.640
and just to test whether we can use a Rust library integrated into C++,

06:32.640 --> 06:37.640
and use it as a gateway for our Rust development.

06:37.640 --> 06:41.640
If it succeeds, we will add more and more libraries,

06:41.640 --> 06:44.640
and maybe we will attract more engineers.

06:44.640 --> 06:49.640
Maybe we will gradually rewrite our code.

06:49.640 --> 06:55.640
We use the CMake build system, and CMake has a way to do this integration.

06:55.640 --> 07:02.640
In 2022, it was a library named Corrosion.

07:02.640 --> 07:04.640
So, we selected the library.

07:04.640 --> 07:13.640
Then we found one student, just because we did not want our full-time employees to lose their sanity

07:13.640 --> 07:18.640
if we asked them to drop C++ and write in Rust.

07:18.640 --> 07:23.640
And we found one function that we don't really need.

07:23.640 --> 07:27.640
There was a Rust library for BLAKE3.

07:27.640 --> 07:35.640
Then we decided: why not add this function to SQL and see what happens?

07:35.640 --> 07:37.640
And actually, it succeeded.

07:37.640 --> 07:44.640
We integrated this library and even wrote an article about how BLAKE3 in Rust is faster.

07:44.640 --> 07:48.640
By the way, we later replaced it with an implementation from LLVM

07:48.640 --> 07:52.640
that is in C++, so it did not really matter.

07:52.640 --> 08:03.640
But the point is, this was the first thing we integrated with Rust code.

08:03.640 --> 08:08.640
Here is a pull request from this guy.

08:08.640 --> 08:11.640
And what was the second?

08:11.640 --> 08:14.640
I heard that Rust is good for terminal applications.

08:14.640 --> 08:19.640
Sometimes I think that the only thing that people do in Rust is writing

08:19.640 --> 08:23.640
and rewriting terminal applications.

08:23.640 --> 08:26.640
So I decided: we have a terminal application,

08:26.640 --> 08:27.640
clickhouse-client.

08:27.640 --> 08:31.640
Why not improve it with Rust?

08:31.640 --> 08:37.640
There is a nice library for history search.

08:37.640 --> 08:42.640
And we decided to try.

08:42.640 --> 08:46.640
And it was also made by an external developer.

08:46.640 --> 08:57.640
He works for ClickHouse, and he writes in C++.

08:57.640 --> 09:04.640
There were some problems: we integrated it, and our terminal application crashed,

09:04.640 --> 09:12.640
because the library had a panic, and we had to patch this library just to avoid this crash.

09:13.640 --> 09:19.640
Eventually, the usability improved, so it was worth it.

09:19.640 --> 09:22.640
Actually, it was not so easy.

09:22.640 --> 09:27.640
You can see the history: we added the library,

09:27.640 --> 09:34.640
then we improved it and fixed the build, then we reverted the library due to a crash,

09:34.640 --> 09:39.640
then we changed the build system, and so on.

09:39.640 --> 09:41.640
But anyway, it worked.

09:41.640 --> 09:47.640
The next step was to add an entirely new language to ClickHouse.

09:47.640 --> 09:51.640
ClickHouse is a SQL database; you write queries in SQL.

09:51.640 --> 09:56.640
But there is an alternative database language named

09:56.640 --> 10:00.640
PRQL, Pipelined Relational Query Language.

10:00.640 --> 10:04.640
You can see how it looks. On the left, we have C++.

10:04.640 --> 10:08.640
Sorry, on the left, we have SQL in the ClickHouse dialect;

10:08.640 --> 10:12.640
on the right, we have PRQL.

10:12.640 --> 10:15.640
To be honest, I like SQL more.

10:15.640 --> 10:22.640
But anyway, PRQL was very fashionable, very hyped.

10:22.640 --> 10:26.640
And I decided: why not add it, just in case?

10:26.640 --> 10:31.640
Maybe people will prefer to write queries in PRQL.

10:31.640 --> 10:36.640
So, again, we found one student who did not mind

10:36.640 --> 10:39.640
taking it as coursework.

10:39.640 --> 10:43.640
And the point is, this is not a small library.

10:43.640 --> 10:48.640
It's a full transpiler from PRQL to SQL.

10:48.640 --> 10:53.640
And maybe we would find many details in the build system

10:53.640 --> 11:01.640
that we would have to fix just to integrate this library.

11:01.640 --> 11:06.640
And what happened? Actually, it was integrated.

11:06.640 --> 11:11.640
No one used PRQL, but anyway.

11:11.640 --> 11:14.640
The next step was to make something practical.

11:14.640 --> 11:18.640
We had already integrated three libraries,

11:18.640 --> 11:21.640
and we did not really need them.

11:21.640 --> 11:25.640
Now, something that we actually needed:

11:25.640 --> 11:31.640
a library to support the Delta format for data lakes.

11:31.640 --> 11:33.640
What is a data lake?

11:33.640 --> 11:39.640
It's a way to represent a database in a disaggregated form.

11:39.640 --> 11:45.640
The data format is one thing, the query language is another thing.

11:45.640 --> 11:49.640
Sorry, the query engine is another thing.

11:49.640 --> 11:53.640
And you can use different query engines, like ClickHouse, DataFusion,

11:54.640 --> 11:57.640
and Databricks, on the same data format.

11:57.640 --> 12:01.640
And there are a few data lake formats:

12:01.640 --> 12:06.640
Iceberg and Delta Lake.

12:06.640 --> 12:13.640
And as I said, there were no implementations in C++, only in Java.

12:13.640 --> 12:16.640
And when the first library in Rust appeared,

12:16.640 --> 12:18.640
it was published by Databricks.

12:18.640 --> 12:22.640
We decided: let's try to integrate it.

12:22.640 --> 12:26.640
And we integrated this library.

12:26.640 --> 12:30.640
And now we have support for Delta Lake.

12:30.640 --> 12:31.640
OK.

12:31.640 --> 12:38.640
But during all this integration we faced quite a few problems.

12:38.640 --> 12:41.640
Let me go through these problems.

12:41.640 --> 12:43.640
What is wrong with Rust?

12:43.640 --> 12:45.640
It's such a nice programming language.

12:45.640 --> 12:48.640
Everyone loves it.

12:48.640 --> 12:52.640
Rust might be a perfect language.

12:52.640 --> 12:55.640
But the problems appear when you integrate,

12:55.640 --> 13:02.640
when ClickHouse integrates Rust and C++ together.

13:02.640 --> 13:07.640
And the first problem is how to get reproducible builds.

13:07.640 --> 13:12.640
Things like making sure that all dependencies are pinned

13:12.640 --> 13:15.640
and all dependencies are in the source code,

13:15.640 --> 13:22.640
so that nothing is exposed to supply chain attacks.

13:22.640 --> 13:26.640
How to avoid things like the build system

13:26.640 --> 13:30.640
downloading something from the internet, from untrusted sources.

13:30.640 --> 13:34.640
It is not easy to solve in C++,

13:34.640 --> 13:38.640
but we solved it a long time ago.

13:38.640 --> 13:42.640
And in Rust it is typically not a problem.

13:42.640 --> 13:50.640
But how to ensure that when you integrate it into CMake,

13:50.640 --> 13:52.640
it does not download crates,

13:52.640 --> 13:54.640
that it vendors all the crates?

13:54.640 --> 13:56.640
It was not trivial.

14:02.640 --> 14:06.640
Another problem is that when you combine two languages,

14:06.640 --> 14:08.640
you have to write wrappers

14:08.640 --> 14:13.640
to call Rust code from C++.

14:13.640 --> 14:15.640
You have to figure out the interface,

14:15.640 --> 14:18.640
you have to figure out who allocates memory

14:18.640 --> 14:20.640
and who deallocates memory.

14:20.640 --> 14:23.640
And when we tried the first time,

14:23.640 --> 14:30.640
our test system immediately found that we did it wrong.

14:30.640 --> 14:32.640
There were crashes and so on.

14:32.640 --> 14:36.640
It was really error-prone and not really safe.

14:38.640 --> 14:41.640
Fortunately, we already had fuzzing in

14:41.640 --> 14:44.640
our continuous integration system, so it saved us.

14:48.640 --> 14:56.640
Another problem is how errors are handled in C++ and in Rust.

14:56.640 --> 15:00.640
In C++ we use exceptions.

15:00.640 --> 15:02.640
Actually, I like exceptions.

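
NOTE
Editor's illustration, not ClickHouse's actual wrapper code. The wrapper problem mentioned above
(who allocates and who deallocates memory across the C++/Rust boundary) can be sketched in Rust
under one assumed convention: whatever Rust allocates is handed back to Rust to be freed.
The names example_to_upper and example_free are hypothetical. A real build would also mark both
functions no_mangle so that C++ can link to them by name.
```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
/// Hypothetical wrapper: upper-cases a C string and returns a new string allocated by Rust.
/// Returns null if the input is null.
pub extern "C" fn example_to_upper(input: *const c_char) -> *mut c_char {
    if input.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: the caller promises that `input` is a valid NUL-terminated string.
    let text = unsafe { CStr::from_ptr(input) }.to_string_lossy().to_uppercase();
    // Ownership of the buffer moves to the caller here; Rust will not free it on its own.
    match CString::new(text) {
        Ok(s) => s.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}
/// Hypothetical counterpart: the C++ side calls this exactly once per non-null result.
/// Freeing the pointer with C++ `delete` or `free` instead would be undefined behavior.
pub extern "C" fn example_free(ptr: *mut c_char) {
    if !ptr.is_null() {
        // SAFETY: `ptr` came from `CString::into_raw` in `example_to_upper`.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```
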
15:02.640 --> 15:06.640
Maybe you have prepared some, I don't know, rotten tomatoes for me.

15:06.640 --> 15:08.640
I use exceptions.

15:12.640 --> 15:16.640
In Rust, people typically don't use exceptions.

15:16.640 --> 15:21.640
You can get something close to exceptions in Rust,

15:21.640 --> 15:23.640
but it will not be easy.

15:26.640 --> 15:33.640
And sometimes, instead of error handling, people just use panic.

15:33.640 --> 15:39.640
And it is okay for applications that do something like batch processing,

15:39.640 --> 15:45.640
for applications that are invoked once, do their stuff,

15:45.640 --> 15:48.640
and go away.

15:48.640 --> 15:52.640
For server applications, it is quite controversial.

15:53.640 --> 15:58.640
You don't want some third-party library to just terminate your server.

15:58.640 --> 16:01.640
So you have to fix all these libraries.

16:05.640 --> 16:10.640
And I would like to say that yes, panic is memory safe,

16:10.640 --> 16:15.640
but it is memory safe in the same way as abort

16:15.640 --> 16:19.640
or std::terminate in C++, or even a null pointer dereference.

16:19.640 --> 16:23.640
Those are memory safe in the same way as panic.

16:23.640 --> 16:28.640
But typically panics are used to indicate bugs,

16:28.640 --> 16:31.640
failed assertions.

16:31.640 --> 16:36.640
And when you get serious about fuzzing,

16:36.640 --> 16:39.640
you will almost certainly find some corner cases

16:39.640 --> 16:46.640
and uncover some bugs in Rust libraries.

16:46.640 --> 16:51.640
And in this way, the fact that we write code in C++

16:51.640 --> 16:56.640
and have to pay for all this testing

16:56.640 --> 17:02.640
helps us with Rust as well.

17:02.640 --> 17:06.640
One example is in PRQL:

17:06.640 --> 17:11.640
we immediately found that if you write a query, something like X,

17:11.640 --> 17:14.640
or Y, it will crash.

17:14.640 --> 17:16.640
And we had to fix it.

17:16.640 --> 17:21.640
Not a big problem, but okay.

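
NOTE
Editor's illustration, not the fix ClickHouse applied (the talk says the libraries were patched).
One common way to keep a panicking third-party library from terminating a server is to catch the
panic at the FFI boundary and turn it into an error code. The sketch below assumes the default
panic=unwind setting; third_party_parse and example_parse are hypothetical names.
```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
/// Hypothetical stand-in for a third-party parser that panics on some inputs.
fn third_party_parse(query: &str) -> usize {
    if query.is_empty() {
        panic!("unexpected empty query");
    }
    query.len()
}
/// Hypothetical wrapper: returns 0 and writes the result to `out` on success,
/// returns -1 on null arguments or if the library panicked.
pub extern "C" fn example_parse(data: *const u8, len: usize, out: *mut usize) -> i32 {
    if data.is_null() || out.is_null() {
        return -1;
    }
    // SAFETY: the caller promises that `data` points to `len` readable bytes.
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    // The panic is stopped here instead of unwinding into C++ frames.
    let result = catch_unwind(AssertUnwindSafe(|| {
        let query = std::str::from_utf8(bytes).unwrap_or("");
        third_party_parse(query)
    }));
    match result {
        Ok(value) => {
            // SAFETY: `out` is non-null and the caller promises it is writable.
            unsafe { *out = value };
            0
        }
        Err(_) => -1,
    }
}
```
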
17:21.640 --> 17:25.640
Another thing is sanitizers.

17:25.640 --> 17:28.640
Maybe you want to say that you don't need sanitizers

17:28.640 --> 17:31.640
in Rust because it is so safe.

17:31.640 --> 17:34.640
Why do you need address sanitizer in Rust?

17:34.640 --> 17:36.640
Why do you need memory sanitizer?

17:37.640 --> 17:41.640
But we build all our code with sanitizers,

17:41.640 --> 17:48.640
and we want all our builds to continue to be tested with sanitizers.

17:48.640 --> 17:51.640
So all the code must be sanitized.

17:51.640 --> 17:57.640
For memory sanitizer, it's important that all code that writes

17:57.640 --> 18:02.640
to or reads from memory is sanitized.

18:02.640 --> 18:09.640
And initially we had to switch to the nightly toolchain

18:09.640 --> 18:12.640
for Rust just to get memory sanitizer.

18:12.640 --> 18:17.640
For some reason, we still have some problems with it.

18:17.640 --> 18:22.640
Some Rust libraries are disabled under memory sanitizer,

18:22.640 --> 18:26.640
just because they don't provide some symbols that are required

18:26.640 --> 18:29.640
to compile them this way.

18:29.640 --> 18:37.640
But today it is mostly not a problem.

18:37.640 --> 18:43.640
What about cross-compilation?

18:43.640 --> 18:47.640
Again, I can say that cross-compilation in Rust

18:47.640 --> 18:50.640
is much better than in C++.

18:50.640 --> 18:54.640
The only problem is that, again,

18:54.640 --> 18:58.640
we had put a huge amount of effort into making it work with C++.

18:59.640 --> 19:05.640
We had to provide custom toolchains,

19:05.640 --> 19:12.640
custom headers from libc for every system.

19:12.640 --> 19:17.640
But now we had to solve this problem again.

19:17.640 --> 19:24.640
And it was again not easy.

19:25.640 --> 19:29.640
What about dependencies?

19:29.640 --> 19:35.640
For example, we prefer to link everything statically.

19:35.640 --> 19:38.640
We statically link OpenSSL.

19:38.640 --> 19:42.640
And we started to link delta-kernel-rs from Rust.

19:42.640 --> 19:46.640
And it depends on another library named reqwest.

19:46.640 --> 19:50.640
And reqwest also depends on OpenSSL.

19:50.640 --> 19:54.640
And for some reason, we now had two different versions

19:54.640 --> 19:56.640
of OpenSSL in the binary.

19:56.640 --> 19:59.640
And actually not just in the binary:

19:59.640 --> 20:05.640
one was statically linked and the other was dynamically linked at runtime.

20:05.640 --> 20:10.640
And it broke our hermetic builds.

20:10.640 --> 20:14.640
We found a configuration option to switch reqwest

20:14.640 --> 20:17.640
to use rustls.

20:18.640 --> 20:22.640
But the problem was that rustls was not FIPS compliant.

20:22.640 --> 20:24.640
So we had to switch it back to OpenSSL

20:24.640 --> 20:27.640
and ensure that it used the same OpenSSL,

20:27.640 --> 20:30.640
the one that is FIPS compliant.

20:30.640 --> 20:31.640
Okay, problem solved.

20:31.640 --> 20:34.640
What is next?

20:34.640 --> 20:37.640
Composability of the code.

20:37.640 --> 20:43.640
The question is: what conventions do we use for the libraries?

20:43.640 --> 20:45.640
How should each library allocate memory?

20:45.640 --> 20:48.640
How should it spawn threads?

20:48.640 --> 20:52.640
How should it maintain connection pools or caches?

20:52.640 --> 20:57.640
If it makes HTTP requests, how does it manage retries?

20:57.640 --> 21:00.640
And if we do it one way in our C++ code,

21:00.640 --> 21:04.640
how do we ensure that other libraries do it the same way?

21:04.640 --> 21:07.640
And the answer: we cannot ensure it.

21:07.640 --> 21:10.640
Either we just patch these libraries,

21:11.640 --> 21:17.640
or we get away with different ways of managing these things.

21:21.640 --> 21:23.640
Small surprises.

21:23.640 --> 21:31.640
For example, when we added PRQL, we found that some symbols in the binary

21:31.640 --> 21:35.640
now take 50 kilobytes just for the name.

21:35.640 --> 21:39.640
And this is one of the names of these symbols.

21:39.640 --> 21:44.640
I see chumsky repeated like 20 times.

21:44.640 --> 21:46.640
It looks broken.

21:46.640 --> 21:50.640
It's some kind of monomorphization, like C++ templates.

21:50.640 --> 21:52.640
I was not expecting that.

21:52.640 --> 22:00.640
So I would say, no, this is not better than C++ templates.

22:00.640 --> 22:03.640
What about dependencies?

22:03.640 --> 22:06.640
Like software composition analysis.

22:06.640 --> 22:11.640
If we list the C++ libraries that we depend on,

22:11.640 --> 22:15.640
there will be 20 to maybe 30 libraries.

22:15.640 --> 22:26.640
If we list the Rust libraries, there will be 156

22:26.640 --> 22:31.640
direct dependencies and 672 indirect dependencies.

22:31.640 --> 22:34.640
I would say it's not that bad.

22:34.640 --> 22:39.640
It's not as bad as it is in, say, Node.js.

22:39.640 --> 22:45.640
If we used Node.js, with npm, maybe we would have thousands of dependencies.

22:45.640 --> 22:47.640
No, 672.

22:47.640 --> 22:56.640
But it is not as boring as in C++,

22:56.640 --> 22:58.640
where you cannot just add a library.

22:58.640 --> 23:01.640
You have to integrate it into the build system.

23:01.640 --> 23:08.640
For this reason, you cannot have too many libraries.

23:08.640 --> 23:11.640
OK, actually, all the problems have been solved.

23:11.640 --> 23:14.640
And now we have just a bit of Rust code inside our C++ code base.

23:14.640 --> 23:17.640
We did not rewrite it in Rust yet.

23:17.640 --> 23:19.640
Maybe there is a chance.

23:19.640 --> 23:25.640
It depends on the enthusiasm of our engineers.

23:25.640 --> 23:29.640
I don't see a lot of enthusiasm, by the way.

23:29.640 --> 23:32.640
But there is still a chance.

23:32.640 --> 23:34.640
So, what are the takeaways?

23:34.640 --> 23:37.640
Rust is actually a great language.

23:37.640 --> 23:43.640
And you can write in C++ and Rust in the same project.

23:43.640 --> 23:49.640
And if you like Rust, you are welcome to become a ClickHouse contributor.

23:49.640 --> 23:51.640
Thank you.

23:59.640 --> 24:01.640
Thank you.