WEBVTT 00:00.000 --> 00:11.040 Hi everyone, I'm Jake Hillion. I was hoping to present this with Johannes, but he hasn't 00:11.040 --> 00:15.920 quite made it yet due to Deutsche Bahn delays mainly, but he should be here hopefully by 00:15.920 --> 00:20.960 the end to say hello. We're here to talk today about a side project we're working on 00:20.960 --> 00:26.720 involving concurrency testing using custom Linux schedulers. I work at Meta and I work on 00:26.720 --> 00:32.320 schedulers, mainly custom Linux schedulers, so this is pretty related to what I do. Johannes 00:32.320 --> 00:37.280 is an OpenJDK developer who recently had to spend a lot of time debugging a race condition. 00:37.280 --> 00:41.840 So we hoped we could put those two things together, and we've got a bit of a proof of concept today 00:41.840 --> 00:48.320 that we can show you and explain how it works, to attempt to make these a little bit more likely 00:48.320 --> 00:56.640 to occur, which should make them easier to debug. So Heisenbugs — I imagine lots of us are familiar: 00:56.960 --> 01:01.680 you've got the same input, you hope for the same output, but instead you're going to crash. 01:01.680 --> 01:07.520 This is not great and it's especially not great when it happens 1 in 10,000 or 1 in 100,000, 01:07.520 --> 01:14.080 the 1 in a million invocations. As an application owner, debugging that from reports is very tricky. 01:14.080 --> 01:20.160 We'll go for a simple example now, very simple because these things get complex in reality, 01:20.160 --> 01:24.160 but imagine we've got some data being produced from a producer thread and we're consuming it in a 01:24.160 --> 01:29.680 consumer thread. In our case, and in our example later on, there's an explicit expiry date on 01:29.680 --> 01:34.720 that data, which isn't what really happens in production; more likely you've got some reference 01:34.720 --> 01:39.520 to a pointer that you might clear in some other thread. All of these are expiry reasons: that data 01:39.520 --> 01:46.800 might no longer be valid at some point in the future. It doesn't crash the vast majority of the time. 01:47.600 --> 01:54.640 The reason for this is that schedulers are pretty good, but when that interaction happens, 01:54.640 --> 01:58.480 when your machine gets a bit busy, when some processes get in the way that you weren't expecting, 01:58.480 --> 02:04.480 when the network gets slow, all of these things can just add extra delay. So a large reason for 02:04.480 --> 02:09.440 these conditions is scheduling. We see this, for example, in rr: the debugger has a chaos mode that 02:09.440 --> 02:16.000 tries to make these a lot more likely, too, but that has its own issues. What is scheduling 02:16.000 --> 02:20.560 then? What do we actually do? In this case, we're talking about CPU scheduling. It's one of the 02:20.560 --> 02:25.760 more common types. The problem we have: we've got many processes, likely in the order of thousands, 02:25.760 --> 02:32.080 and some number of CPUs, likely in the order of tens nowadays, and we need to somehow make sure 02:32.080 --> 02:37.360 all those processes work successfully on those CPUs to share the system. The simplest way we might 02:37.360 --> 02:42.880 do this is to just schedule process A, and whenever it stops, we'll schedule process B. 02:42.880 --> 02:47.680 Unfortunately, that really does not work.
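The producer/consumer pattern just described — items stamped with an expiry, and a consumer that only fails if the scheduler parks it for longer than that expiry — can be sketched minimally as below. This is an illustrative C sketch only (the demo shown later in the talk is written in Java); the names and the 50 ms window are made up for the example.

```c
/* Illustrative sketch of the expiry-based Heisenbug described above.
 * The consumer aborts only if the scheduler delays it past the item's
 * expiry, which almost never happens on a quiet machine. */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define EXPIRY_MS 50                    /* hypothetical validity window */

struct item {
    uint64_t expires_at_ns;             /* consume before this time */
    int payload;
};

static struct item slot;                /* single-slot "queue" for brevity */
static int ready;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void *producer(void *arg)
{
    for (int i = 0; ; i++) {
        pthread_mutex_lock(&lock);
        slot.payload = i;
        slot.expires_at_ns = now_ns() + EXPIRY_MS * 1000000ull;
        ready = 1;
        pthread_mutex_unlock(&lock);
        usleep(1000);                   /* produce roughly every 1 ms */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        if (ready) {
            /* Fires only if we were descheduled for > EXPIRY_MS between
             * production and consumption: the Heisenbug. */
            assert(now_ns() < slot.expires_at_ns && "consumed expired item");
            ready = 0;
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);              /* never returns; threads run forever */
    return 0;
}
```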
There are classes of non-preemptive schedulers, 02:47.680 --> 02:50.960 sometimes it makes sense, but the vast majority of the time we're going to need to schedule 02:50.960 --> 02:54.560 B a bit sooner, or the issues we were talking about before will happen more often: 02:54.560 --> 02:59.920 network timeouts, all that sort of thing. So instead, we slice up time, and we'll stop scheduling 02:59.920 --> 03:04.640 A for a bit. B might not be ready the next time, we might schedule A again. We'll schedule B, 03:04.640 --> 03:08.080 we'll schedule A, we'll flip back and forth, and we're doing this on the scale of many 03:08.080 --> 03:13.680 thousands of processes, likely with several ready at any point in time. On an actual system, 03:13.680 --> 03:17.120 it might look something like this. We're not going to be able to look at any of the detail on 03:17.120 --> 03:22.720 this chart, but on the left, on the y-axis, we have which CPU we're looking at; as we go across, 03:22.720 --> 03:27.280 we're looking at what's happening on that CPU, whether a process is scheduled; the different colors 03:27.280 --> 03:32.080 are the different processes. It gets quite complex, but these charts are super interesting. This 03:32.080 --> 03:38.080 is a 6.12 Linux system, just running EEVDF, the default scheduler. We can see processes are 03:38.080 --> 03:41.040 darting around all over the place, they're coming in, they're running for a short time, 03:41.040 --> 03:45.120 some of them are long running, they move about a bit, there's all sorts of complexity, 03:45.120 --> 03:50.640 and this is even on a pretty quiet system. When you start looking at big systems, hundreds of CPUs, 03:50.640 --> 03:57.040 all the interactions just get way more complicated. So when we look at our race conditions, 03:57.040 --> 04:02.000 you're replicating this on your lovely dev machine, you've got a 32-core processor, 04:02.000 --> 04:05.200 it's nice and quiet, you don't want anything getting in the way of your testing, 04:05.200 --> 04:11.280 and the bug never happens, ever. It's a nightmare: you know it's happening, people are reporting it, 04:11.280 --> 04:15.440 you're running the standard scheduler, and the bug never happens. You can try running stress tools 04:15.440 --> 04:20.000 in the background, and that might make it a little bit more likely, but the bug still never happens. 04:21.040 --> 04:25.600 Working on custom schedulers at Meta, I got to work with some schedulers that are not too good, 04:26.160 --> 04:29.680 which is great. It turns out when you write a scheduler yourself and you have loads of 04:29.680 --> 04:35.520 configuration options, there are many ways to configure that scheduler badly, and I found one 04:35.520 --> 04:39.520 a little while ago. I was working on a service, I was writing a scheduler, I got the configuration 04:39.520 --> 04:45.520 terribly wrong and the service failed, it really didn't work. But there were three parts to the service: 04:46.160 --> 04:50.560 two of them hit massive timeout errors, but they came back to life; one of them crashed, 04:51.360 --> 04:57.440 so 250 hosts died all at once because of my scheduler. It turns out this was a race condition: 04:58.080 --> 05:01.920 someone was storing the value of a shared pointer instead of copying the shared pointer in C++ 05:01.920 --> 05:06.000 for efficiency reasons, and they got it wrong, so if there was a long enough delay in 05:06.000 --> 05:11.120 scheduling, the service would crash. This does happen in the real
application, but it's so rare that 05:11.120 --> 05:15.920 nobody would ever look at it, or even if you did, you would really struggle to replicate it. 05:16.720 --> 05:22.560 So what if we wrote a scheduler that was deliberately bad? What if it was deliberately erratic and 05:22.560 --> 05:28.080 got us into these states more often where these errors are likely to happen? That's what we've 05:28.080 --> 05:33.200 got a demo of today. But how would you write an erratic scheduler? Well, there are some options here. 05:34.000 --> 05:40.080 You could write it in the Linux kernel. You might have a hard time doing that in general: 05:40.080 --> 05:44.320 the scheduler is very sensitive, and if you get it slightly wrong, your system will hit the soft lock 05:44.320 --> 05:49.840 up detector and immediately reboot, which is a bit of a pain; it's hard to do; if you get it wrong 05:49.920 --> 05:55.200 in memory-unsafe ways, your system will crash even more quickly; and if you get it right, 05:55.200 --> 06:00.480 you're still waiting, well, maybe in the tens of seconds for a kexec every time you want to change 06:00.480 --> 06:06.160 your kernel. This is a bit awkward. Nowadays, we can do it in user space. Johannes 06:06.160 --> 06:09.760 was supposed to talk about Java; he's got a project that I'm not going to give enough credit 06:09.760 --> 06:13.120 in this presentation because I don't know enough about it, where you can write these schedulers 06:13.120 --> 06:17.360 in Java. I'm more familiar with the Rust ones, and you can also do it in C, if that's what you'd 06:17.440 --> 06:25.280 like to. And it's all because of BPF, and sched_ext, which I think we're supposed 06:25.280 --> 06:32.480 to call "sked-ext", which is an interesting choice of logo we've got there, and then this is 06:32.480 --> 06:40.240 the additional logo on the right. This is a photo of Brendan Gregg, supposedly shouting at hard 06:40.240 --> 06:46.640 drives, but there's a quote about putting JavaScript into the Linux kernel here: many similarities 06:46.720 --> 06:50.880 between eBPF, the way it runs in the kernel, and a virtual machine for JavaScript you might 06:50.880 --> 06:56.240 have in your browser. Here's a photo of him looking slightly more normal, I think he'd prefer that one. 06:57.440 --> 07:02.480 eBPF — we're not going to go into it, it's not super important how it works, but there are 07:02.480 --> 07:07.120 a few details that we need to cover just for understanding. When you develop an eBPF program, 07:07.120 --> 07:10.960 you're going to write your source code in some language. There are a few options we have: 07:11.440 --> 07:18.160 C is the standard one, Rust works reasonably well, there are some more academic languages 07:18.160 --> 07:22.640 you can choose as well, and then there's the Java transpiler that Johannes has got, which is quite 07:22.640 --> 07:28.320 exciting. You compile that into BPF bytecode; it's like assembly, but it's its own language 07:28.320 --> 07:35.760 that works on all the Linux systems effectively. We make a syscall to bpf() to ask it to load our 07:35.760 --> 07:39.920 program into the kernel. You need a lot of privilege for this, it's a root-only operation again 07:39.920 --> 07:44.960 now I think; it kind of got opened up to users for a while, but now it's all root.
It goes through the Verifier; 07:44.960 --> 07:51.760 the Verifier is a magic black box that's supposed to make sure your program is safe in certain ways, 07:51.760 --> 07:56.240 so you can't access memory in a bad way that will cause your system to crash, 07:56.960 --> 08:00.720 you can't have unbounded loops that don't terminate — because we're running this in the scheduler 08:00.720 --> 08:06.720 hot path, if you have non-terminating code there, you're in trouble — stuff like that. 08:06.720 --> 08:13.840 It's a bit of a beast to work with, but once you're verified, you get JIT compiled, loaded 08:13.840 --> 08:19.760 into the kernel, and you can look at sockets, network interfaces, scheduling now. You're loaded as an x86 08:19.760 --> 08:26.800 program, or ARM, whichever system you're on; there's no further runtime basically attached. 08:26.880 --> 08:30.800 And then you communicate with that, mostly using syscalls at the minute — we're getting some new 08:30.800 --> 08:35.280 stuff called arenas which are more like mapped memory — and you can communicate back to user space, 08:35.280 --> 08:40.720 so we can write an application across user space and kernel space, which is pretty cool. 08:40.720 --> 08:45.120 So the general way we write our production schedulers is we write some Rust that talks to the BPF, 08:45.120 --> 08:49.040 and then the BPF runs in the kernel and makes quick scheduling decisions. 08:51.120 --> 08:55.200 So that's BPF; how do we use that for scheduling? It's recent — 08:55.200 --> 09:01.120 I mentioned that chart was kernel 6.12, and we're now on 6.13 — so it's pretty recent: sched_ext. sched_ext 09:01.120 --> 09:06.080 is the extension framework for jumping in as a scheduler from BPF. 09:07.520 --> 09:12.800 This is Tejun, the creator. There are a few key features. I mentioned some of the troubles 09:12.800 --> 09:17.280 of working in the kernel before, and the idea is that sched_ext makes them better — 09:17.280 --> 09:21.520 it's by no means perfect, but it certainly makes them better. So, ease of experimentation: 09:21.520 --> 09:27.840 we have a repo with in the order of 10 schedulers now, maybe a few more. The Linux kernel has 09:28.400 --> 09:33.600 two-ish schedulers, and even then the old one has to be ripped out to make way for the new one, so we don't, 09:33.600 --> 09:38.960 we don't have a lot of optionality in the kernel. But you can run many different sched_ext schedulers on 09:38.960 --> 09:44.240 your machine, switching between them just by running a program and pressing Ctrl-C; it's super easy.
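That "run a program, press Ctrl-C" workflow can be sketched roughly as follows on the user-space side. This is a hedged illustration using plain libbpf calls; real sched_ext schedulers normally go through a generated BPF skeleton and the helpers in the scx repository instead, and the object and map names here (erratic.bpf.o, erratic_ops) are hypothetical.

```c
/* Hypothetical minimal user-space loader for a sched_ext BPF scheduler.
 * Real scx schedulers use generated skeletons and scx helper macros; this
 * just shows the load -> attach -> Ctrl-C -> detach lifecycle with libbpf. */
#include <bpf/libbpf.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t exiting;
static void on_sigint(int sig) { exiting = 1; }

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("erratic.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))   /* bpf() syscall: verifier + JIT, needs root */
        return 1;

    struct bpf_map *ops = bpf_object__find_map_by_name(obj, "erratic_ops");
    if (!ops)
        return 1;
    struct bpf_link *link = bpf_map__attach_struct_ops(ops);
    if (!link)                           /* attaching hands scheduling over to our BPF code */
        return 1;

    signal(SIGINT, on_sigint);
    while (!exiting)                     /* Ctrl-C: detach, fall back to the default scheduler */
        pause();

    bpf_link__destroy(link);
    bpf_object__close(obj);
    return 0;
}
```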
09:44.960 --> 09:51.040 Customisation, too: we can talk to user space, and you can do basically anything you want in these schedulers. 09:51.120 --> 09:55.840 Sure, some of it has to avoid the hot path, and you've got to communicate with user space a little 09:55.840 --> 10:00.480 bit — turns out that's not as bad as we might think — but you can make loads of choices, you can use 10:00.480 --> 10:05.360 information from nvidia-smi that the Linux kernel is never going to use, and stuff like that. 10:05.360 --> 10:10.480 And finally, rapid scheduler deployments: deploying a new kernel at scale is tricky, we have to 10:10.480 --> 10:16.320 get it to millions of machines, and it takes in the order of weeks to get that kernel out. Deploying 10:16.400 --> 10:22.000 a new scheduler can take a day, it's really easy, and running it and stopping it is also easy, 10:22.000 --> 10:26.240 you don't have to reboot. So if we find out weeks later that our scheduler is kind of bad, 10:26.240 --> 10:30.080 we can just stop it, and we go back to the default, and everyone's safe; we don't have to, 10:30.080 --> 10:32.720 er, we don't have to worry about how much we've broken all the systems. 10:35.520 --> 10:40.640 In a sched_ext scheduler — again, maybe don't worry too much about the details — we have a few 10:40.640 --> 10:45.280 bits that we have to worry about. On each CPU, we have a local FIFO queue, it's just first in, 10:45.280 --> 10:50.720 first out, and that's effectively read from by the kernel. If you've put stuff in that queue, 10:50.720 --> 10:56.560 the kernel side of SCX will make sure it gets run, on that CPU, in that order — quite convenient. 10:57.680 --> 11:03.120 In SCX, we generally have global queues as well, when we write our own schedulers; 11:03.120 --> 11:07.520 in this picture we've got one, you can have a dozen, you can have as many as you like, and 11:07.520 --> 11:11.120 those queues can mean different things. So on some schedulers, we might have a different 11:11.200 --> 11:16.160 queue per LLC; on some schedulers, we have a different queue per how much we want to prioritise 11:16.160 --> 11:22.240 the workload, and various different things like this. The job of the scheduler we write in SCX 11:22.240 --> 11:27.520 is to move things from global queues into local queues, and to accept new processes, make decisions 11:27.520 --> 11:33.280 based on them, and let them run in the order we like. Let's view a super simple scheduler 11:33.280 --> 11:38.880 in the Java side of this framework. First step — well, first step is the license: everything 11:38.880 --> 11:44.720 is GPL, that's an absolute requirement with BPF, which is pretty cool. License sorted, 11:44.720 --> 11:50.640 we've got this constant of a shared DSQ ID, nice and easy. We'll create a shared DSQ, 11:50.640 --> 11:56.800 which we need to be able to handle tasks in a more uniform way — handling it per CPU would end up 11:56.800 --> 12:01.840 with separate scheduling issues — so we'll create that DSQ, and now we've got our queue, and that's it. 12:01.840 --> 12:08.320 Next one, enqueue. This happens when you receive a task that is now runnable: you've got 12:08.320 --> 12:13.440 a task, ideally you want to put it on a CPU, but if you can't put it on a CPU, we're going to 12:13.440 --> 12:18.720 enqueue it.
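For reference, here is roughly what that whole minimal scheduler looks like in the C flavour (the talk walks through the Java version), including the enqueue and dispatch callbacks described next. This is a sketch modelled on the upstream scx_simple example, assuming the kernel 6.12-era kfunc names (scx_bpf_create_dsq, scx_bpf_dispatch, scx_bpf_consume); the scx repository normally wraps the registration in an SCX_OPS_DEFINE macro, shown here in expanded form, and the minimal_* names are made up.

```c
/* Sketch of a minimal sched_ext scheduler in BPF C, modelled on scx_simple.
 * Assumes the scx common headers and 6.12-era kfunc names. */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";       /* GPL is mandatory for BPF */

#define SHARED_DSQ 0                           /* ID of our single global queue */

/* init: create the shared dispatch queue (DSQ) that every CPU pulls from,
 * rather than keeping strictly per-CPU queues. */
s32 BPF_STRUCT_OPS_SLEEPABLE(minimal_init)
{
    return scx_bpf_create_dsq(SHARED_DSQ, -1 /* any NUMA node */);
}

/* enqueue: called when a task becomes runnable but wasn't placed straight
 * onto a CPU. Push it onto the shared DSQ with a 5 ms slice. */
void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
    u64 slice_ns = 5 * 1000 * 1000;            /* 5 ms, as in the talk */

    scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, enq_flags);
}

/* dispatch: called when a CPU's local queue is empty. Pull the next task
 * from the shared DSQ and run it on this CPU. */
void BPF_STRUCT_OPS(minimal_dispatch, s32 cpu, struct task_struct *prev)
{
    scx_bpf_consume(SHARED_DSQ);
}

SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
    .enqueue  = (void *)minimal_enqueue,
    .dispatch = (void *)minimal_dispatch,
    .init     = (void *)minimal_init,
    .name     = "minimal",
};
```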
In this case, we're using another kfunc, scx_bpf_dispatch: we're taking our task, 12:19.280 --> 12:23.440 we're putting it in our shared DSQ, we're saying next time it runs, it can have up to five 12:23.440 --> 12:28.080 milliseconds, and then we're just passing through these flags. It's pretty simple too, so far. 12:30.000 --> 12:34.080 And the final one is dispatch. This is what's called when a CPU goes idle: 12:34.080 --> 12:37.680 you have your CPU, it's finished doing whatever it was doing. It doesn't know what to run next, 12:37.680 --> 12:42.720 because its little queue is empty, so we just run scx_bpf_consume on the shared queue, 12:42.720 --> 12:46.400 which takes the task from the shared queue and just runs it on that CPU for us. 12:47.360 --> 12:51.600 That's it, that's a whole scheduler. It's not a very good scheduler. We're using 12:51.600 --> 12:56.480 FIFO queues everywhere, there's no priority for any processes, everything is completely equivalent, 12:57.200 --> 13:01.200 which it turns out doesn't work very well, and there's also only one global queue, so if you're 13:01.200 --> 13:06.240 on any sort of complicated CPU, that will really struggle. If you've got two sockets on certain 13:06.240 --> 13:12.080 Intel machines, this will kill the machine, because their cross-socket communication is so slow 13:12.080 --> 13:17.120 that if you try and run the scheduler, you hit the soft lockup detector before the sched_ext 13:17.120 --> 13:21.360 thing can get kicked out. It's normally very safe — normally with sched_ext, if you don't schedule stuff, 13:21.360 --> 13:25.920 it just gets kicked out, and you go back to normal — but those Intel machines are so slow, 13:25.920 --> 13:29.520 you can't actually get kicked out, because that bit of kernel code can't run in time. 13:30.320 --> 13:34.640 So that's quite interesting, but in the general case, you're pretty safe, this will run, 13:35.280 --> 13:41.360 and then you can extend it as you like. Producing erratic scheduling orders: that's what 13:41.360 --> 13:47.280 this was all about. How can we make our race condition fire more likely? We have a, let's see, 13:48.320 --> 13:53.360 let's go first of all, this is the example. We have an example here written, I believe, in Java 13:53.360 --> 14:00.000 again. It's a super simple thing to crash: we just consume things from a queue that are only 14:00.000 --> 14:04.400 valid for a certain amount of time. It's missing a little bit of the code — 14:04.640 --> 14:08.800 you always need a bit of plumbing to make these things work — but effectively 14:08.800 --> 14:13.200 we get a task come from this producer thread. We've set a time on it, that's just the limit, 14:13.200 --> 14:17.440 and if we try and read it beyond that, we're going to crash, and then we just keep reading it. 14:17.440 --> 14:22.320 On a quiet system, this is fine. This will run for days at a time, and it will never crash. 14:22.880 --> 14:27.040 Even on a busy system, we haven't yet seen a crash, but it can theoretically happen. 14:28.240 --> 14:32.640 We had to get quite simple with these examples to make them fit, basically. 14:33.600 --> 14:38.400 I've got a video to show you from Johannes that I'm going to have to talk over, I believe. 14:40.960 --> 14:47.440 There we go. Okay, so we started with our scheduler: we've got scheduler.sh, which just 14:47.440 --> 14:53.040 launches our scheduler with the correct arguments, and samples/runqueue.sh.
Here's our sample script, 14:53.040 --> 14:57.040 the Java race example. And we're also getting some extra verbosity out of it: 14:57.440 --> 15:02.000 every time we make a scheduling decision, we're printing it here. And the way we've set this up 15:02.080 --> 15:07.760 is it's going to sleep things for just way longer than it needs to. We'll take runnable tasks 15:07.760 --> 15:13.040 that would get a CPU immediately on a normal scheduler and not schedule them, for whatever amount 15:13.040 --> 15:17.280 of time this is saying. So this is going between half a second and a second and a half of not 15:17.280 --> 15:22.480 scheduling our tasks that could be scheduled. Then when it runs it, we run it for 80 milliseconds, 15:22.480 --> 15:29.600 something like that, and it normally finishes. And we saw it crash, which is pretty good. 15:30.560 --> 15:35.040 This program, again — hours at a time we've left it running on these machines, and it doesn't crash. 15:35.040 --> 15:39.680 It just never hits these edge cases, but those edge cases are there, and they can be hit, 15:39.680 --> 15:45.360 and if you scaled this sufficiently, if that thread with the random delay was actually a network 15:45.360 --> 15:51.280 request, and your systems were slow and queueing on those, this would crash. So we've got, 15:51.280 --> 15:55.840 we've got a lot of times where this could happen. If someone reported it, you wouldn't be able to debug 15:55.920 --> 16:03.520 it comfortably on your machine. And that's it, we've got a crash. I think we've got a few minutes 16:03.520 --> 16:09.920 where I can briefly scroll through the code. There isn't too much of it, surprisingly, again. 16:10.320 --> 16:20.320 Let's go. So we go back to... 16:24.880 --> 16:28.160 I'll take it back. Before I do that, have we got any questions? 16:28.240 --> 16:42.560 So the question was, are we running that in the CI or just locally at the minute? At this point, 16:42.560 --> 16:48.080 this is very local. It's quite constrained, this scheduler, it only works on small machines 16:48.080 --> 16:54.320 at the minute, we use a lot of those FIFO queues. It's very new, it's very early. What we were excited 16:54.320 --> 16:59.760 about seeing was whether we could make it crash, and we can. So the next steps with this would be to 16:59.760 --> 17:04.880 productionise it a bit more, get it able to run on a big service. That example I mentioned earlier — 17:04.880 --> 17:08.640 if I try to run it on that machine, it's one of the new AMD chips, so it's got loads of LLCs, 17:08.640 --> 17:12.000 it's all a bit complicated. If we try and run this scheduler on there, it just doesn't work. 17:12.880 --> 17:17.520 It gets kicked out. The machine survives, but it doesn't work. So we need a more complex hierarchy 17:17.520 --> 17:22.640 in the scheduler, and then to inject the randomness. We also need some seeding and bits like that 17:22.640 --> 17:27.440 to try and get it more consistent, and probably a bit of searching to find the right conditions 17:27.440 --> 17:31.520 to make it crash. So this is still very early. We'd be happy about contributions too. 17:35.520 --> 17:40.560 Have we tried it on an ARM machine? Not this one, but it does work. sched_ext in general we have 17:40.560 --> 17:46.160 tried on an ARM machine. It works fine. There's nothing super worrying about it, which is great, 17:46.160 --> 17:49.920 because we've got a lot of ARM to cover at the minute. 17:52.640 --> 17:57.600 Could you, from the scheduler,
look at, for example, the process memory, to see what state 17:57.600 --> 18:02.400 it's in, what testing purpose it has, these kinds of bits, so that you know whether to 18:03.920 --> 18:12.240 schedule it and give it a CPU like another process? So the question was about looking at the 18:12.240 --> 18:16.640 memory of the process — is there more information we can use from it to make a scheduling decision? 18:16.640 --> 18:21.600 That's super interesting. We haven't looked at that yet. We can do the filtering at the minute, 18:21.600 --> 18:27.520 it's based on parent PIDs effectively. The way we're running the 18:27.520 --> 18:32.480 scheduler is we schedule the whole machine, but we only care about messing up this specific process, 18:32.480 --> 18:37.440 because otherwise we'll start finding race conditions in the shell, and then we'll be in trouble. 18:38.480 --> 18:44.080 So the reason we do it like that is it's just easier. We use the parent PID at the minute; 18:44.080 --> 18:49.680 in other schedulers, we filter on things like comm, the process name, thread groups, all these things. 18:49.680 --> 18:54.000 But I've never seen the option of actually looking at process memory, that's super 18:54.000 --> 18:59.760 interesting. We are doing some stuff where the application can tell the scheduler what it wants 18:59.760 --> 19:05.920 in a more fine-grained way. Currently we just use niceness, which is a bit weak, it's not very 19:05.920 --> 19:11.200 rich. So we're doing more communication from the process to the scheduler in our production 19:11.200 --> 19:16.080 schedulers, and I think that would be possible here too. We're running as BPF, it's full 19:16.240 --> 19:18.240 privileges, you can kind of do what you want, which is cool. 19:25.920 --> 19:31.840 It's an excellent question. The question was about reproducing the crashes, and how we can 19:31.840 --> 19:36.880 make that happen. The answer at the minute is, no, we don't have that. It's a lot of the goal of this 19:36.880 --> 19:41.680 project, but when we were running through it, we started looking at how we could build this scheduler 19:41.680 --> 19:50.000 to get rid of a huge amount of the non-determinism in the process. The way we saw of scheduling 19:50.000 --> 19:54.240 it, we were going to slow it down too much if we tried to get rid of too much of the non-determinism, 19:54.240 --> 19:58.080 because we want to be in the position long term where we can run this on production applications 19:58.080 --> 20:03.840 without slowing them down to the point where they stop serving traffic. And that meant we 20:03.840 --> 20:10.400 made some compromises. The main one is we schedule things on cores pretty quickly now, when the 20:10.400 --> 20:15.440 original plan was to go one thread at a time, but it just isn't scalable at that point. So there's 20:15.440 --> 20:19.120 definitely some work to be done to get this seeded and make it more reproducible. 20:22.960 --> 20:27.840 I was wondering about the scheduling points where you can decide to preempt. It sounds like 20:27.840 --> 20:32.160 only when the process enters the kernel is the point where you can really reschedule, 20:32.160 --> 20:36.560 but if you have some code that, let's say, increments a variable that should be atomic, 20:36.560 --> 20:41.280 but the increment isn't done in an atomic fashion, how do you make those 20:41.280 --> 20:47.280 interleavings happen? Yes.
So the question, effectively summarised, was about when we can preempt things. 20:48.160 --> 20:54.320 Any time — which is good. We have full control over it basically. So we can cut the slices down, 20:54.320 --> 21:00.560 which helps, but we also have a kfunc called scx_bpf_kick_cpu that kicks a CPU quickly, 21:01.520 --> 21:04.960 which is pretty cool. How we'd integrate that is a different question. We haven't done it yet; 21:04.960 --> 21:08.960 we're purely working with slices at the minute. So that does — sorry — 21:10.320 --> 21:14.640 that does get the preemptive scheduling to kick in and stop the process, and we will get these 21:14.640 --> 21:18.880 interleavings eventually, but if you had something more — if you could look at the memory and see a bit 21:18.880 --> 21:24.320 that's flipped, and then kick it — we have that option, which would be very exciting in the future. 21:24.320 --> 21:28.960 I'm hoping that we've opened up sched_ext to the world of testing, and that everyone will have 21:28.960 --> 21:33.920 great ideas now, because I'm a scheduler developer and Johannes is an OpenJDK developer that 21:34.000 --> 21:39.600 likes schedulers, so it would be really cool for other people to see that schedulers are 21:39.600 --> 21:45.440 available to them, and not completely impossible to write now, and use that in testing more widely. 21:45.520 --> 21:56.640 Can you prevent this soft lockup from killing Linux? Say you want to explore all 21:56.640 --> 22:05.440 possible schedules which can occur in the system. There's two parts to that. We have two of these 22:05.440 --> 22:10.480 kinds of lockup detectors. I glossed over it earlier, but sched_ext itself: if you're given 22:10.480 --> 22:14.560 a task that's runnable, and you wait more than 30 seconds and don't run it, the SCX scheduler will 22:14.560 --> 22:19.120 get kicked, and all those tasks will move back to the fair scheduler in the kernel. There's also the 22:19.120 --> 22:23.120 soft lockup detector, which happens a bit later — I'm not too 22:23.120 --> 22:27.120 super sure on the details, I just know we hit it — if the machine isn't making reasonable progress, 22:27.120 --> 22:32.480 and it's not much later, I think it's maybe 40 seconds, 45, then it just reboots the machine, 22:32.480 --> 22:39.440 and that one, I would say turning that off probably isn't super productive, because if you were to hit 22:39.520 --> 22:43.200 that with any normal scheduler, without your scheduler, the machine would do the same thing. 22:44.480 --> 22:52.000 The SCX one, we haven't needed to turn it off, because 30 seconds is such a long time. Technically, 22:52.000 --> 22:55.360 if we were making network requests of some kind, they could take longer to come back than that, 22:55.360 --> 23:00.000 but for the vast majority of bugs, 30 seconds should be plenty. If you do want to change it, 23:00.000 --> 23:04.080 there's a number in the kernel, and you can always recompile it, and that will get longer, 23:04.080 --> 23:07.360 but you've got to be careful — there are many systems that come in to make things stop. 23:24.800 --> 23:29.440 Yeah, that's a good question. The question was about more erratic behavior, instead of just scheduling 23:29.520 --> 23:35.680 timing. The short answer is no, basically. There's stuff where we're interested in how memory 23:35.680 --> 23:41.200 latency changes on systems as they get more loaded. We haven't done any work on anything like that
We haven't done any work to train calls like 23:41.200 --> 23:46.000 yet, and it's not easy to do that with the SCX scheduler. There are ways to do it. You can kind of 23:46.000 --> 23:51.280 force things to mess up their cashiers more often with scheduling decisions, and introduce extra 23:51.280 --> 23:56.560 processes that do that too, but I think those races are a lot finer-grained, and we haven't 23:56.560 --> 24:03.360 decided looking about yet. That's great. Thank you very much.