WEBVTT 00:00.000 --> 00:11.040 Hi everyone, I'm Jake Hillion. I was hoping to present this with Johannes, but he hasn't 00:11.040 --> 00:15.920 quite made it yet due to Deutsche Bahn delays mainly, but he should be here hopefully by 00:15.920 --> 00:20.960 the end to say hello. We're here to talk today about a side project we're working on 00:20.960 --> 00:26.720 involving concurrency testing using custom Linux schedulers. I work at Meta and I work on 00:26.720 --> 00:32.320 schedulers, mainly custom Linux schedulers, so this is pretty related to what I do. Johannes 00:32.320 --> 00:37.280 is an OpenJDK developer who recently had to spend a lot of time debugging a race condition. 00:37.280 --> 00:41.840 So we hoped we could put those two things together, and we've got a bit of a proof of concept today 00:41.840 --> 00:48.320 that we can show you and explain how it works, to attempt to make these a little bit more likely 00:48.320 --> 00:56.640 to occur, which should make them easier to debug. So Heisenbugs — I imagine lots of us are familiar: 00:56.960 --> 01:01.680 you've got the same input, you hope for the same output, but instead you're going to crash. 01:01.680 --> 01:07.520 This is not great and it's especially not great when it happens 1 in 10,000 or 1 in 100,000, 01:07.520 --> 01:14.080 the 1 in a million invocations. As an application owner, debugging that from reports is very tricky. 01:14.080 --> 01:20.160 We'll go for a simple example now, very simple because these things get complex in reality, 01:20.160 --> 01:24.160 but imagine we've got some data being produced from a producer thread and we're consuming it in a 01:24.160 --> 01:29.680 consumer thread. In our case, and in our example later on, there's an explicit expiry date on 01:29.680 --> 01:34.720 that data, which isn't what really happens in production; more likely you've got some reference 01:34.720 --> 01:39.520 to a pointer that you might clear in some other thread. All of these are expiry reasons: that data 01:39.520 --> 01:46.800 might no longer be valid at some point in the future. It doesn't crash the vast majority of the time. 01:47.600 --> 01:54.640 The reason for this is that schedulers are pretty good, but when that interaction happens, 01:54.640 --> 01:58.480 when your machine gets a bit busy, when some processes get in the way that you weren't expecting, 01:58.480 --> 02:04.480 when the network gets slow, all of these things can just add extra delay. So a large reason for 02:04.480 --> 02:09.440 these conditions is scheduling. We see this, for example, in rr: the debugger has a chaos mode that 02:09.440 --> 02:16.000 tries to make these a lot more likely, too, but that has its own issues. What is scheduling 02:16.000 --> 02:20.560 then? What do we actually do? In this case, we're talking about CPU scheduling. It's one of the 02:20.560 --> 02:25.760 more common types. The problem we have: we've got many processes, likely in the order of thousands, 02:25.760 --> 02:32.080 and some number of CPUs, likely in the order of tens nowadays, and we need to somehow make sure 02:32.080 --> 02:37.360 all those processes work successfully on those CPUs to share the system. The simplest way we might 02:37.360 --> 02:42.880 do this is to just schedule process A, and whenever it stops, we'll schedule process B. 02:42.880 --> 02:47.680 Unfortunately, that really does not work.
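The producer/consumer pattern just described — items stamped with an expiry, and a consumer that only fails if the scheduler parks it for longer than that expiry — can be sketched minimally as below. This is an illustrative C sketch only (the demo shown later in the talk is written in Java); the names and the 50 ms window are made up for the example.

```c
/* Illustrative sketch of the expiry-based Heisenbug described above.
 * The consumer aborts only if the scheduler delays it past the item's
 * expiry, which almost never happens on a quiet machine. */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define EXPIRY_MS 50                    /* hypothetical validity window */

struct item {
    uint64_t expires_at_ns;             /* consume before this time */
    int payload;
};

static struct item slot;                /* single-slot "queue" for brevity */
static int ready;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void *producer(void *arg)
{
    for (int i = 0; ; i++) {
        pthread_mutex_lock(&lock);
        slot.payload = i;
        slot.expires_at_ns = now_ns() + EXPIRY_MS * 1000000ull;
        ready = 1;
        pthread_mutex_unlock(&lock);
        usleep(1000);                   /* produce roughly every 1 ms */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        if (ready) {
            /* Fires only if we were descheduled for > EXPIRY_MS between
             * production and consumption: the Heisenbug. */
            assert(now_ns() < slot.expires_at_ns && "consumed expired item");
            ready = 0;
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);              /* never returns; threads run forever */
    return 0;
}
```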
There are classes of non-preemptive schedulers, 02:47.680 --> 02:50.960 sometimes it makes sense, but the vast majority of the time we're going to need to schedule 02:50.960 --> 02:54.560 B a bit sooner, or the issues we were talking about before will happen more often: 02:54.560 --> 02:59.920 network timeouts, all that sort of thing. So instead, we slice up time, and we'll stop scheduling 02:59.920 --> 03:04.640 A for a bit. B might not be ready the next time, we might schedule A again. We'll schedule B, 03:04.640 --> 03:08.080 we'll schedule A, we'll flip back and forth, and we're doing this on the scale of many 03:08.080 --> 03:13.680 thousands of processes, likely with several ready at any point in time. On an actual system, 03:13.680 --> 03:17.120 it might look something like this. We're not going to be able to look at any of the detail on 03:17.120 --> 03:22.720 this chart, but on the left, on the y-axis, we have which CPU we're looking at; as we go across, 03:22.720 --> 03:27.280 we're looking at what's happening on that CPU, whether a process is scheduled; the different colors 03:27.280 --> 03:32.080 are the different processes. It gets quite complex, but these charts are super interesting. This 03:32.080 --> 03:38.080 is a 6.12 Linux system, just running EEVDF, the default scheduler. We can see processes are 03:38.080 --> 03:41.040 darting around all over the place, they're coming in, they're running for a short time, 03:41.040 --> 03:45.120 some of them are long running, they move about a bit, there's all sorts of complexity, 03:45.120 --> 03:50.640 and this is even on a pretty quiet system. When you start looking at big systems, hundreds of CPUs, 03:50.640 --> 03:57.040 all the interactions just get way more complicated. So when we look at our race conditions, 03:57.040 --> 04:02.000 you're replicating this on your lovely dev machine, you've got a 32-core processor, 04:02.000 --> 04:05.200 it's nice and quiet, you don't want anything getting in the way of your testing, 04:05.200 --> 04:11.280 and the bug never happens, ever. It's a nightmare: you know it's happening, people are reporting it, 04:11.280 --> 04:15.440 you're running the standard scheduler, and the bug never happens. You can try running stress tools 04:15.440 --> 04:20.000 in the background, and that might make it a little bit more likely, but the bug still never happens. 04:21.040 --> 04:25.600 Working on custom schedulers at Meta, I got to work with some schedulers that are not too good, 04:26.160 --> 04:29.680 which is great. It turns out when you write a scheduler yourself and you have loads of 04:29.680 --> 04:35.520 configuration options, there are many ways to configure that scheduler badly, and I found one 04:35.520 --> 04:39.520 a little while ago. I was working on a service, I was writing a scheduler, I got the configuration 04:39.520 --> 04:45.520 terribly wrong and the service failed, it really didn't work. But there were three parts to the service: 04:46.160 --> 04:50.560 two of them hit massive timeout errors, but they came back to life; one of them crashed, 04:51.360 --> 04:57.440 so 250 hosts died all at once because of my scheduler. It turns out this was a race condition: 04:58.080 --> 05:01.920 someone was storing the value of a shared pointer instead of copying the shared pointer in C++ 05:01.920 --> 05:06.000 for efficiency reasons, and they got it wrong, so if there was a long enough delay in 05:06.000 --> 05:11.120 scheduling, the service would crash. This does happen in the real
application, but it's so rare that 05:11.120 --> 05:15.920 nobody would ever look at it, or even if you did, you would really struggle to replicate it. 05:16.720 --> 05:22.560 So what if we wrote a scheduler that was deliberately bad? What if it was deliberately erratic and 05:22.560 --> 05:28.080 got us into these states more often where these errors are likely to happen? That's what we've 05:28.080 --> 05:33.200 got a demo of today. But how would you write an erratic scheduler? Well, there are some options here. 05:34.000 --> 05:40.080 You could write it in the Linux kernel. You might have a hard time doing that in general: 05:40.080 --> 05:44.320 the scheduler is very sensitive, and if you get it slightly wrong, your system will hit the soft lock 05:44.320 --> 05:49.840 up detector and immediately reboot, which is a bit of a pain; it's hard to do; if you get it wrong 05:49.920 --> 05:55.200 in memory-unsafe ways, your system will crash even more quickly; and if you get it right, 05:55.200 --> 06:00.480 you're still waiting, well, maybe in the tens of seconds for a kexec every time you want to change 06:00.480 --> 06:06.160 your kernel. This is a bit awkward. Nowadays, we can do it in user space. Johannes 06:06.160 --> 06:09.760 was supposed to talk about Java; he's got a project that I'm not going to give enough credit 06:09.760 --> 06:13.120 in this presentation because I don't know enough about it, where you can write these schedulers 06:13.120 --> 06:17.360 in Java. I'm more familiar with the Rust ones, and you can also do it in C, if that's what you'd 06:17.440 --> 06:25.280 like to. And it's all because of BPF, and sched_ext, which I think we're supposed 06:25.280 --> 06:32.480 to call "sked-ext", which is an interesting choice of logo we've got there, and then this is 06:32.480 --> 06:40.240 the additional logo on the right. This is a photo of Brendan Gregg, supposedly shouting at hard 06:40.240 --> 06:46.640 drives, but there's a quote about putting JavaScript into the Linux kernel here: many similarities 06:46.720 --> 06:50.880 between eBPF, the way it runs in the kernel, and a virtual machine for JavaScript you might 06:50.880 --> 06:56.240 have in your browser. Here's a photo of him looking slightly more normal, I think he'd prefer that one. 06:57.440 --> 07:02.480 eBPF — we're not going to go into it, it's not super important how it works, but there are 07:02.480 --> 07:07.120 a few details that we need to cover just for understanding. When you develop an eBPF program, 07:07.120 --> 07:10.960 you're going to write your source code in some language. There are a few options we have: 07:11.440 --> 07:18.160 C is the standard one, Rust works reasonably well, there are some more academic languages 07:18.160 --> 07:22.640 you can choose as well, and then there's the Java transpiler that Johannes has got, which is quite 07:22.640 --> 07:28.320 exciting. You compile that into BPF bytecode; it's like assembly, but it's its own language 07:28.320 --> 07:35.760 that works on all the Linux systems effectively. We make a syscall to bpf() to ask it to load our 07:35.760 --> 07:39.920 program into the kernel. You need a lot of privilege for this, it's a root-only operation again 07:39.920 --> 07:44.960 now I think; it kind of got opened up to users for a while, but now it's all root.
It goes through the Verifier; 07:44.960 --> 07:51.760 the Verifier is a magic black box that's supposed to make sure your program is safe in certain ways, 07:51.760 --> 07:56.240 so you can't access memory in a bad way that will cause your system to crash, 07:56.960 --> 08:00.720 you can't have unbounded loops that don't terminate — because we're running this in the scheduler 08:00.720 --> 08:06.720 hot path, if you have non-terminating code there, you're in trouble — stuff like that. 08:06.720 --> 08:13.840 It's a bit of a beast to work with, but once you're verified, you get JIT compiled, loaded 08:13.840 --> 08:19.760 into the kernel, and you can look at sockets, network interfaces, scheduling now. You're loaded as an x86 08:19.760 --> 08:26.800 program, or ARM, whichever system you're on; there's no further runtime basically attached. 08:26.880 --> 08:30.800 And then you communicate with that, mostly using syscalls at the minute — we're getting some new 08:30.800 --> 08:35.280 stuff called arenas which are more like mapped memory — and you can communicate back to user space, 08:35.280 --> 08:40.720 so we can write an application across user space and kernel space, which is pretty cool. 08:40.720 --> 08:45.120 So the general way we write our production schedulers is we write some Rust that talks to the BPF, 08:45.120 --> 08:49.040 and then the BPF runs in the kernel and makes quick scheduling decisions. 08:51.120 --> 08:55.200 So that's BPF; how do we use that for scheduling? It's recent — 08:55.200 --> 09:01.120 I mentioned that chart was kernel 6.12, and we're now on 6.13 — so it's pretty recent: sched_ext. sched_ext 09:01.120 --> 09:06.080 is the extension framework for jumping in as a scheduler from BPF. 09:07.520 --> 09:12.800 This is Tejun, the creator. There are a few key features. I mentioned some of the troubles 09:12.800 --> 09:17.280 of working in the kernel before, and the idea is that sched_ext makes them better — 09:17.280 --> 09:21.520 it's by no means perfect, but it certainly makes them better. So, ease of experimentation: 09:21.520 --> 09:27.840 we have a repo with in the order of 10 schedulers now, maybe a few more. The Linux kernel has 09:28.400 --> 09:33.600 two-ish schedulers, and even then the old one has to be ripped out to make way for the new one, so we don't, 09:33.600 --> 09:38.960 we don't have a lot of optionality in the kernel. But you can run many different sched_ext schedulers on 09:38.960 --> 09:44.240 your machine, switching between them just by running a program and pressing Ctrl-C; it's super easy.
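That "run a program, press Ctrl-C" workflow can be sketched roughly as follows on the user-space side. This is a hedged illustration using plain libbpf calls; real sched_ext schedulers normally go through a generated BPF skeleton and the helpers in the scx repository instead, and the object and map names here (erratic.bpf.o, erratic_ops) are hypothetical.

```c
/* Hypothetical minimal user-space loader for a sched_ext BPF scheduler.
 * Real scx schedulers use generated skeletons and scx helper macros; this
 * just shows the load -> attach -> Ctrl-C -> detach lifecycle with libbpf. */
#include <bpf/libbpf.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t exiting;
static void on_sigint(int sig) { exiting = 1; }

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("erratic.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))   /* bpf() syscall: verifier + JIT, needs root */
        return 1;

    struct bpf_map *ops = bpf_object__find_map_by_name(obj, "erratic_ops");
    if (!ops)
        return 1;
    struct bpf_link *link = bpf_map__attach_struct_ops(ops);
    if (!link)                           /* attaching hands scheduling over to our BPF code */
        return 1;

    signal(SIGINT, on_sigint);
    while (!exiting)                     /* Ctrl-C: detach, fall back to the default scheduler */
        pause();

    bpf_link__destroy(link);
    bpf_object__close(obj);
    return 0;
}
```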
09:44.960 --> 09:51.040 Customisation, too: we can talk to user space, and you can do basically anything you want in these schedulers. 09:51.120 --> 09:55.840 Sure, some of it has to avoid the hot path, and you've got to communicate with user space a little 09:55.840 --> 10:00.480 bit — turns out that's not as bad as we might think — but you can make loads of choices, you can use 10:00.480 --> 10:05.360 information from nvidia-smi that the Linux kernel is never going to use, and stuff like that. 10:05.360 --> 10:10.480 And finally, rapid scheduler deployments: deploying a new kernel at scale is tricky, we have to 10:10.480 --> 10:16.320 get it to millions of machines, and it takes in the order of weeks to get that kernel out. Deploying 10:16.400 --> 10:22.000 a new scheduler can take a day, it's really easy, and running it and stopping it is also easy, 10:22.000 --> 10:26.240 you don't have to reboot. So if we find out weeks later that our scheduler is kind of bad, 10:26.240 --> 10:30.080 we can just stop it, and we go back to the default, and everyone's safe; we don't have to, 10:30.080 --> 10:32.720 er, we don't have to worry about how much we've broken all the systems. 10:35.520 --> 10:40.640 In a sched_ext scheduler — again, maybe don't worry too much about the details — we have a few 10:40.640 --> 10:45.280 bits that we have to worry about. On each CPU, we have a local FIFO queue, it's just first in, 10:45.280 --> 10:50.720 first out, and that's effectively read from by the kernel. If you've put stuff in that queue, 10:50.720 --> 10:56.560 the kernel side of SCX will make sure it gets run, on that CPU, in that order — quite convenient. 10:57.680 --> 11:03.120 In SCX, we generally have global queues as well, when we write our own schedulers; 11:03.120 --> 11:07.520 in this picture we've got one, you can have a dozen, you can have as many as you like, and 11:07.520 --> 11:11.120 those queues can mean different things. So on some schedulers, we might have a different 11:11.200 --> 11:16.160 queue per LLC; on some schedulers, we have a different queue per how much we want to prioritise 11:16.160 --> 11:22.240 the workload, and various different things like this. The job of the scheduler we write in SCX 11:22.240 --> 11:27.520 is to move things from global queues into local queues, and to accept new processes, make decisions 11:27.520 --> 11:33.280 based on them, and let them run in the order we like. Let's view a super simple scheduler 11:33.280 --> 11:38.880 in the Java side of this framework. First step — well, first step is the license: everything 11:38.880 --> 11:44.720 is GPL, that's an absolute requirement with BPF, which is pretty cool. License sorted, 11:44.720 --> 11:50.640 we've got this constant of a shared DSQ ID, nice and easy. We'll create a shared DSQ, 11:50.640 --> 11:56.800 which we need to be able to handle tasks in a more uniform way — handling it per CPU would end up 11:56.800 --> 12:01.840 with separate scheduling issues — so we'll create that DSQ, and now we've got our queue, and that's it. 12:01.840 --> 12:08.320 Next one, enqueue. This happens when you receive a task that is now runnable: you've got 12:08.320 --> 12:13.440 a task, ideally you want to put it on a CPU, but if you can't put it on a CPU, we're going to 12:13.440 --> 12:18.720 enqueue it.
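For reference, here is roughly what that whole minimal scheduler looks like in the C flavour (the talk walks through the Java version), including the enqueue and dispatch callbacks described next. This is a sketch modelled on the upstream scx_simple example, assuming the kernel 6.12-era kfunc names (scx_bpf_create_dsq, scx_bpf_dispatch, scx_bpf_consume); the scx repository normally wraps the registration in an SCX_OPS_DEFINE macro, shown here in expanded form, and the minimal_* names are made up.

```c
/* Sketch of a minimal sched_ext scheduler in BPF C, modelled on scx_simple.
 * Assumes the scx common headers and 6.12-era kfunc names. */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";       /* GPL is mandatory for BPF */

#define SHARED_DSQ 0                           /* ID of our single global queue */

/* init: create the shared dispatch queue (DSQ) that every CPU pulls from,
 * rather than keeping strictly per-CPU queues. */
s32 BPF_STRUCT_OPS_SLEEPABLE(minimal_init)
{
    return scx_bpf_create_dsq(SHARED_DSQ, -1 /* any NUMA node */);
}

/* enqueue: called when a task becomes runnable but wasn't placed straight
 * onto a CPU. Push it onto the shared DSQ with a 5 ms slice. */
void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
    u64 slice_ns = 5 * 1000 * 1000;            /* 5 ms, as in the talk */

    scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, enq_flags);
}

/* dispatch: called when a CPU's local queue is empty. Pull the next task
 * from the shared DSQ and run it on this CPU. */
void BPF_STRUCT_OPS(minimal_dispatch, s32 cpu, struct task_struct *prev)
{
    scx_bpf_consume(SHARED_DSQ);
}

SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
    .enqueue  = (void *)minimal_enqueue,
    .dispatch = (void *)minimal_dispatch,
    .init     = (void *)minimal_init,
    .name     = "minimal",
};
```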
In this case, we're using another kfunc, scx_bpf_dispatch: we're taking our task, 12:19.280 --> 12:23.440 we're putting it in our shared DSQ, we're saying next time it runs, it can have up to five 12:23.440 --> 12:28.080 milliseconds, and then we're just passing through these flags. It's pretty simple too, so far. 12:30.000 --> 12:34.080 And the final one is dispatch. This is what's called when a CPU goes idle: 12:34.080 --> 12:37.680 you have your CPU, it's finished doing whatever it was doing. It doesn't know what to run next, 12:37.680 --> 12:42.720 because its little queue is empty, so we just run scx_bpf_consume on the shared queue, 12:42.720 --> 12:46.400 which takes the task from the shared queue and just runs it on that CPU for us. 12:47.360 --> 12:51.600 That's it, that's a whole scheduler. It's not a very good scheduler. We're using 12:51.600 --> 12:56.480 FIFO queues everywhere, there's no priority for any processes, everything is completely equivalent, 12:57.200 --> 13:01.200 which it turns out doesn't work very well, and there's also only one global queue, so if you're 13:01.200 --> 13:06.240 on any sort of complicated CPU, that will really struggle. If you've got two sockets on certain 13:06.240 --> 13:12.080 Intel machines, this will kill the machine, because their cross-socket communication is so slow 13:12.080 --> 13:17.120 that if you try and run the scheduler, you hit the soft lockup detector before the sched_ext 13:17.120 --> 13:21.360 thing can get kicked out. It's normally very safe — normally with sched_ext, if you don't schedule stuff, 13:21.360 --> 13:25.920 it just gets kicked out, and you go back to normal — but those Intel machines are so slow, 13:25.920 --> 13:29.520 you can't actually get kicked out, because that bit of kernel code can't run in time. 13:30.320 --> 13:34.640 So that's quite interesting, but in the general case, you're pretty safe, this will run, 13:35.280 --> 13:41.360 and then you can extend it as you like. Producing erratic scheduling orders: that's what 13:41.360 --> 13:47.280 this was all about. How can we make our race condition fire more likely? We have a, let's see, 13:48.320 --> 13:53.360 let's go first of all, this is the example. We have an example here written, I believe, in Java 13:53.360 --> 14:00.000 again. It's a super simple thing to crash: we just consume things from a queue that are only 14:00.000 --> 14:04.400 valid for a certain amount of time. It's missing a little bit of the code — 14:04.640 --> 14:08.800 you always need a bit of plumbing to make these things work — but effectively 14:08.800 --> 14:13.200 we get a task come from this producer thread. We've set a time on it, that's just the limit, 14:13.200 --> 14:17.440 and if we try and read it beyond that, we're going to crash, and then we just keep reading it. 14:17.440 --> 14:22.320 On a quiet system, this is fine. This will run for days at a time, and it will never crash. 14:22.880 --> 14:27.040 Even on a busy system, we haven't yet seen a crash, but it can theoretically happen. 14:28.240 --> 14:32.640 We had to get quite simple with these examples to make them fit, basically. 14:33.600 --> 14:38.400 I've got a video to show you from Johannes that I'm going to have to talk over, I believe. 14:40.960 --> 14:47.440 There we go. Okay, so we started with our scheduler: we've got scheduler.sh, which just 14:47.440 --> 14:53.040 launches our scheduler with the correct arguments, and samples/runqueue.sh.
Here's our sample script, 14:53.040 --> 14:57.040 the Java race example. And we're also getting some extra verbosity out of it: 14:57.440 --> 15:02.000 every time we make a scheduling decision, we're printing it here. And the way we've set this up 15:02.080 --> 15:07.760 is it's going to sleep things for just way longer than it needs to. We'll take runnable tasks 15:07.760 --> 15:13.040 that would get a CPU immediately on a normal scheduler and not schedule them, for whatever amount 15:13.040 --> 15:17.280 of time this is saying. So this is going between half a second and a second and a half of not 15:17.280 --> 15:22.480 scheduling our tasks that could be scheduled. Then when it runs it, we run it for 80 milliseconds, 15:22.480 --> 15:29.600 something like that, and it normally finishes. And we saw it crash, which is pretty good. 15:30.560 --> 15:35.040 This program, again — hours at a time we've left it running on these machines, and it doesn't crash. 15:35.040 --> 15:39.680 It just never hits these edge cases, but those edge cases are there, and they can be hit, 15:39.680 --> 15:45.360 and if you scaled this sufficiently, if that thread with the random delay was actually a network 15:45.360 --> 15:51.280 request, and your systems were slow and queueing on those, this would crash. So we've got, 15:51.280 --> 15:55.840 we've got a lot of times where this could happen. If someone reported it, you wouldn't be able to debug 15:55.920 --> 16:03.520 it comfortably on your machine. And that's it, we've got a crash. I think we've got a few minutes 16:03.520 --> 16:09.920 where I can briefly scroll through the code. There isn't too much of it, surprisingly, again. 16:10.320 --> 16:20.320 Let's go. So we go back to... 16:24.880 --> 16:28.160 I'll take it back. Before I do that, have we got any questions? 16:28.240 --> 16:42.560 So the question was, are we running that in the CI or just locally at the minute? At this point, 16:42.560 --> 16:48.080 this is very local. It's quite constrained, this scheduler, it only works on small machines 16:48.080 --> 16:54.320 at the minute, we use a lot of those FIFO queues. It's very new, it's very early. What we were excited 16:54.320 --> 16:59.760 about seeing was whether we could make it crash, and we can. So the next steps with this would be to 16:59.760 --> 17:04.880 productionise it a bit more, get it able to run on a big service. That example I mentioned earlier — 17:04.880 --> 17:08.640 if I try to run it on that machine, it's one of the new AMD chips, so it's got loads of LLCs, 17:08.640 --> 17:12.000 it's all a bit complicated. If we try and run this scheduler on there, it just doesn't work. 17:12.880 --> 17:17.520 It gets kicked out. The machine survives, but it doesn't work. So we need a more complex hierarchy 17:17.520 --> 17:22.640 in the scheduler, and then to inject the randomness. We also need some seeding and bits like that 17:22.640 --> 17:27.440 to try and get it more consistent, and probably a bit of searching to find the right conditions 17:27.440 --> 17:31.520 to make it crash. So this is still very early. We'd be happy about contributions too. 17:35.520 --> 17:40.560 Have we tried it on an ARM machine? Not this one, but it does work. sched_ext in general we have 17:40.560 --> 17:46.160 tried on an ARM machine. It works fine. There's nothing super worrying about it, which is great, 17:46.160 --> 17:49.920 because we've got a lot of ARM to cover at the minute. 17:52.640 --> 17:57.600 Could you, from the scheduler,
look at, for example, the process memory, to see what state 17:57.600 --> 18:02.400 it's in, what testing purpose it has, these kinds of bits, so that you know whether to 18:03.920 --> 18:12.240 schedule it and give it a CPU like another process? So the question was about looking at the 18:12.240 --> 18:16.640 memory of the process — is there more information we can use from it to make a scheduling decision? 18:16.640 --> 18:21.600 That's super interesting. We haven't looked at that yet. We can do the filtering at the minute, 18:21.600 --> 18:27.520 it's based on parent PIDs effectively. The way we're running the 18:27.520 --> 18:32.480 scheduler is we schedule the whole machine, but we only care about messing up this specific process, 18:32.480 --> 18:37.440 because otherwise we'll start finding race conditions in the shell, and then we'll be in trouble. 18:38.480 --> 18:44.080 So the reason we do it like that is it's just easier. We use the parent PID at the minute; 18:44.080 --> 18:49.680 in other schedulers, we filter on things like comm, the process name, thread groups, all these things. 18:49.680 --> 18:54.000 But I've never seen the option of actually looking at process memory, that's super 18:54.000 --> 18:59.760 interesting. We are doing some stuff where the application can tell the scheduler what it wants 18:59.760 --> 19:05.920 in a more fine-grained way. Currently we just use niceness, which is a bit weak, it's not very 19:05.920 --> 19:11.200 rich. So we're doing more communication from the process to the scheduler in our production 19:11.200 --> 19:16.080 schedulers, and I think that would be possible here too. We're running as BPF, it's full 19:16.240 --> 19:18.240 privileges, you can kind of do what you want, which is cool. 19:25.920 --> 19:31.840 It's an excellent question. The question was about reproducing the crashes, and how we can 19:31.840 --> 19:36.880 make that happen. The answer at the minute is, no, we don't have that. It's a lot of the goal of this 19:36.880 --> 19:41.680 project, but when we were running through it, we started looking at how we could build this scheduler 19:41.680 --> 19:50.000 to get rid of a huge amount of the non-determinism in the process. The way we saw of scheduling 19:50.000 --> 19:54.240 it, we were going to slow it down too much if we tried to get rid of too much of the non-determinism, 19:54.240 --> 19:58.080 because we want to be in the position long term where we can run this on production applications 19:58.080 --> 20:03.840 without slowing them down to the point where they stop serving traffic. And that meant we 20:03.840 --> 20:10.400 made some compromises. The main one is we schedule things on cores pretty quickly now, when the 20:10.400 --> 20:15.440 original plan was to go one thread at a time, but it just isn't scalable at that point. So there's 20:15.440 --> 20:19.120 definitely some work to be done to get this seeded and make it more reproducible. 20:22.960 --> 20:27.840 I was wondering about the scheduling points where you can decide to preempt. It sounds like 20:27.840 --> 20:32.160 only when the process enters the kernel is the point where you can really reschedule, 20:32.160 --> 20:36.560 but if you have some code that, let's say, increments a variable that should be atomic, 20:36.560 --> 20:41.280 but the increment isn't done in an atomic fashion, how do you make those 20:41.280 --> 20:47.280 interleavings happen? Yes.
So the question, effectively summarised, was about when we can preempt things. 20:48.160 --> 20:54.320 Any time — which is good. We have full control over it basically. So we can cut the slices down, 20:54.320 --> 21:00.560 which helps, but we also have a kfunc called scx_bpf_kick_cpu that kicks a CPU quickly, 21:01.520 --> 21:04.960 which is pretty cool. How we'd integrate that is a different question. We haven't done it yet; 21:04.960 --> 21:08.960 we're purely working with slices at the minute. So that does — sorry — 21:10.320 --> 21:14.640 that does get the preemptive scheduling to kick in and stop the process, and we will get these 21:14.640 --> 21:18.880 interleavings eventually, but if you had something more — if you could look at the memory and see a bit 21:18.880 --> 21:24.320 that's flipped, and then kick it — we have that option, which would be very exciting in the future. 21:24.320 --> 21:28.960 I'm hoping that we've opened up sched_ext to the world of testing, and that everyone will have 21:28.960 --> 21:33.920 great ideas now, because I'm a scheduler developer and Johannes is an OpenJDK developer that 21:34.000 --> 21:39.600 likes schedulers, so it would be really cool for other people to see that schedulers are 21:39.600 --> 21:45.440 available to them, and not completely impossible to write now, and use that in testing more widely. 21:45.520 --> 21:56.640 Can you prevent this soft lockup from killing Linux? Say you want to explore all 21:56.640 --> 22:05.440 possible schedules which can occur in the system. There's two parts to that. We have two of these 22:05.440 --> 22:10.480 kinds of lockup detectors. I glossed over it earlier, but sched_ext itself: if you're given 22:10.480 --> 22:14.560 a task that's runnable, and you wait more than 30 seconds and don't run it, the SCX scheduler will 22:14.560 --> 22:19.120 get kicked, and all those tasks will move back to the fair scheduler in the kernel. There's also the 22:19.120 --> 22:23.120 soft lockup detector, which happens a bit later — I'm not too 22:23.120 --> 22:27.120 super sure on the details, I just know we hit it — if the machine isn't making reasonable progress, 22:27.120 --> 22:32.480 and it's not much later, I think it's maybe 40 seconds, 45, then it just reboots the machine, 22:32.480 --> 22:39.440 and that one, I would say turning that off probably isn't super productive, because if you were to hit 22:39.520 --> 22:43.200 that with any normal scheduler, without your scheduler, the machine would do the same thing. 22:44.480 --> 22:52.000 The SCX one, we haven't needed to turn it off, because 30 seconds is such a long time. Technically, 22:52.000 --> 22:55.360 if we were making network requests of some kind, they could take longer to come back than that, 22:55.360 --> 23:00.000 but for the vast majority of bugs, 30 seconds should be plenty. If you do want to change it, 23:00.000 --> 23:04.080 there's a number in the kernel, and you can always recompile it, and that will get longer, 23:04.080 --> 23:07.360 but you've got to be careful — there are many systems that come in to make things stop. 23:24.800 --> 23:29.440 Yeah, that's a good question. The question was about more erratic behavior, instead of just scheduling 23:29.520 --> 23:35.680 timing. The short answer is no, basically. There's stuff where we're interested in how memory 23:35.680 --> 23:41.200 latency changes on systems as they get more loaded. We haven't done any work on anything like that
We haven't done any work to train calls like 23:41.200 --> 23:46.000 yet, and it's not easy to do that with the SCX scheduler. There are ways to do it. You can kind of 23:46.000 --> 23:51.280 force things to mess up their cashiers more often with scheduling decisions, and introduce extra 23:51.280 --> 23:56.560 processes that do that too, but I think those races are a lot finer-grained, and we haven't 23:56.560 --> 24:03.360 decided looking about yet. That's great. Thank you very much.