WEBVTT 00:00.000 --> 00:10.000 For the next presentation, we've got Sarmat talking about performance and reliability 00:10.000 --> 00:11.000 analysis. 00:11.000 --> 00:12.000 I hope you feel sorry 00:12.000 --> 00:13.000 for eBPF. 00:13.000 --> 00:18.000 There you go. 00:18.000 --> 00:27.000 Hello everyone. 00:27.000 --> 00:32.000 Thank you so much for taking out the time to be present at this talk. 00:32.000 --> 00:37.000 My name is Sarmat Socket and I'll be talking about the performance and reliability 00:37.000 --> 00:39.000 pitfalls of eBPF. 00:39.000 --> 00:41.000 So just a little bit about me. 00:41.000 --> 00:43.000 I work as a software engineer at Datadog. 00:43.000 --> 00:48.000 I work on eBPF as part of the eBPF platform team. 00:48.000 --> 00:52.600 I enjoy exploring and working with kernel internals, and that's the direction I'll be 00:52.600 --> 00:55.000 approaching this talk from. 00:55.000 --> 01:00.000 So the way I'll be covering this topic of performance and reliability will be through 01:00.000 --> 01:04.000 three stories about some issues that we ran into. 01:04.000 --> 01:11.000 And hopefully I'll be able to thread a common theme through them and talk about what is shared 01:11.000 --> 01:17.000 between all of these that could be considered like the general theme of the problems that we're addressing. 01:17.000 --> 01:22.000 So we'll be talking about some of the pains with scaling kretprobes that we've experienced, 01:22.000 --> 01:30.000 this deadlock in fentry that we came across once, and recently a production incident that we were trying to resolve 01:30.000 --> 01:33.000 where uprobes were at the center of everything. 01:33.000 --> 01:35.000 Well, kind of next to the center. 01:35.000 --> 01:38.000 There was something else in the center. 01:38.000 --> 01:43.000 So starting with the scaling pains of kretprobes: some time ago, 01:43.000 --> 01:49.000 we got this customer issue where they were complaining that when they install our agent, 01:49.000 --> 01:55.000 suddenly CPU spikes and their network throughput is completely decimated. 01:55.000 --> 02:01.000 The customer was running a pretty old kernel version, kind of 4.14 with Amazon Linux 2. 02:01.000 --> 02:05.000 And they had a very large machine with 96 cores. 02:05.000 --> 02:09.000 So we set up a similar environment, 02:09.000 --> 02:16.000 ran our agent, and started looking at perf to see if we could kind of just easily pinpoint where the problem was coming from. 02:16.000 --> 02:18.000 And that turned out to be pretty helpful. 02:18.000 --> 02:27.000 So before I go into what the actual problem was, just to give you a brief overview of how kretprobes function. 02:27.000 --> 02:34.000 As mentioned before, the optimized kretprobes are what you commonly see, so that's the scheme I'll be talking about today. 02:35.000 --> 02:40.000 So kretprobes basically instrument a trampoline instruction at the beginning of a function. 02:40.000 --> 02:47.000 This trampoline goes to a trampoline function which is responsible for overriding the return address on the stack, 02:47.000 --> 02:50.000 so that when the function returns you go to your hook. 02:50.000 --> 02:57.000 And it also saves some invocation-specific state in this structure called the kretprobe instance structure. 02:57.000 --> 03:03.000 Some of the state that it saves is the original return address, so you can eventually return back to it.
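To make the kretprobe setup concrete, here is a minimal libbpf-style sketch of the kind of probe involved; this is illustrative only, not the agent's actual code, and tcp_recvmsg is simply the example function used in the talk. Every invocation of the probed function forces the kernel-side kretprobe machinery to hand out one of those per-invocation instance structs, which is where the scaling story below begins.

```c
// Illustrative kretprobe program (libbpf CO-RE style); not Datadog's agent code.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Runs when tcp_recvmsg returns. To get here, the kernel had to stash the
// original return address and other state in a per-invocation
// kretprobe instance struct, as described above.
SEC("kretprobe/tcp_recvmsg")
int BPF_KRETPROBE(tcp_recvmsg_exit, int ret)
{
    bpf_printk("tcp_recvmsg returned %d", ret);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```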
03:03.000 --> 03:11.000 And the important thing to note over here is that these kretprobe instance structs are invocation-specific. 03:11.000 --> 03:15.000 So each invocation of the function gets its own struct. 03:15.000 --> 03:22.000 These structs are kept in the kretprobe structure as a linked list. 03:22.000 --> 03:26.000 And in order to synchronize access to this linked list, 03:26.000 --> 03:29.000 the kernel, on 4.14, was using a spinlock. 03:29.000 --> 03:36.000 So you can imagine that if you have this massive machine, let's say running a bunch of tcp_recvmsg calls in parallel, 03:36.000 --> 03:41.000 all of these invocations are serializing on this spinlock and that becomes a massive, massive bottleneck. 03:41.000 --> 03:49.000 And when we ran perf top, it reflected this observation: we saw a large amount of CPU time being spent in this spinlock. 03:49.000 --> 03:54.000 And that obviously explains the massive drop in throughput. 03:54.000 --> 04:02.000 So unfortunately, since this is a problem (as you'll see later on, all of these were problems with the infrastructure surrounding eBPF), 04:02.000 --> 04:07.000 the only recommendation that we can give is to upgrade your kernel version. 04:07.000 --> 04:12.000 Unfortunately, pretty soon we found that was only enough up until a certain point. 04:12.000 --> 04:15.000 Soon afterwards we received another customer report. 04:15.000 --> 04:22.000 And this time the customer was running kernel 6.6, again with a very large instance type and again with the same symptom: 04:22.000 --> 04:25.000 network gets destroyed. 04:25.000 --> 04:28.000 Like, the throughput gets destroyed. 04:28.000 --> 04:33.000 This time around, perf wasn't very helpful in pinpointing anything. 04:33.000 --> 04:36.000 So we started looking into what changes had happened 04:36.000 --> 04:40.000 in the kernel, especially in the kretprobe infrastructure part, since then. 04:40.000 --> 04:51.000 And we found that in order to address that spinlock contention issue, the kernel developers had come up with a solution of putting the kretprobe instance 04:51.000 --> 04:55.000 structures in this lockless freelist. 04:55.000 --> 05:00.000 So basically there's no lock, so there should be no contention. 05:00.000 --> 05:06.000 But if you look at the implementation of this freelist, right at the top of the page in the header comment, 05:06.000 --> 05:09.000 you see this very helpful hint as to what the problem could be: 05:09.000 --> 05:13.000 it's not the fastest thing in the world under heavy contention. 05:13.000 --> 05:16.000 And that's precisely the scenario that we're running into. 05:16.000 --> 05:26.000 So what happens over here? You have this lockless freelist, but you have this shared memory region that is being written to simultaneously by a large number of CPUs. 05:26.000 --> 05:32.000 So memory itself at this point becomes a bottleneck because of excessive cache coherency traffic. 05:32.000 --> 05:36.000 And obviously this gets worse as you add more and more CPUs. 05:36.000 --> 05:42.000 And since we were again running on a massive instance, the effect was particularly dramatic.
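As a rough illustration of why even a lockless list bottlenecks here, the following user-space C11 sketch has the same shape as the problem described above: a single shared head pointer that every CPU pushes to and pops from with compare-and-swap, so the cache line holding the head bounces between all cores. This is not the kernel's freelist code, and it deliberately ignores ABA and reclamation issues.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; };

/* One shared head pointer: every push/pop from every CPU hits this same
 * cache line, which is exactly the cache coherency traffic that becomes
 * the bottleneck on a 96-core machine. */
static _Atomic(struct node *) freelist_head;

static void freelist_push(struct node *n)
{
    struct node *old = atomic_load(&freelist_head);
    do {
        n->next = old;
        /* On failure, old is reloaded and we retry; each attempt invalidates
         * the head's cache line on every other CPU. */
    } while (!atomic_compare_exchange_weak(&freelist_head, &old, n));
}

static struct node *freelist_pop(void)
{
    struct node *old = atomic_load(&freelist_head);
    while (old && !atomic_compare_exchange_weak(&freelist_head, &old, old->next))
        ;
    return old;
}

int main(void)
{
    static struct node n1, n2;
    freelist_push(&n1);
    freelist_push(&n2);
    return freelist_pop() == &n2 ? 0 : 1; /* LIFO: n2 comes back first */
}
```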
05:42.000 --> 05:51.000 So I realized that I did not attach the perf output in the slide, but ultimately, when we used perf to record some of the, 05:51.000 --> 05:56.000 like, actual performance counters, we in fact saw that we were bottlenecking on memory; 05:56.000 --> 05:59.000 like, we were spending most of our time waiting for memory to come in. 05:59.000 --> 06:04.000 So again, the recommendation is to upgrade your kernel, sorry. 06:05.000 --> 06:15.000 And luckily enough, after kernel 6.6, in fact I think in the releases right after kernel 6.6, this problem was ultimately addressed. 06:15.000 --> 06:24.000 And now kretprobes use this objpool mechanism, which uses per-CPU rings, to store the kretprobe instance structures. 06:24.000 --> 06:29.000 And since it's per-CPU, it's very scalable and you don't run into this problem. 06:29.000 --> 06:39.000 So while we were investigating these performance characteristics of kretprobes, we were also realizing that, hey, we don't want to be running into these problems in our infrastructure. 06:39.000 --> 06:44.000 And back then, unfortunately, we were behind this optimization as well. 06:44.000 --> 06:51.000 So we decided, let's do away with kretprobes, let's move into the modern world, let's use fentry. 06:51.000 --> 07:03.000 We did that, we made the implementation, and then pretty soon we started getting reports of some failures when the agent container was exiting. 07:03.000 --> 07:09.000 So we set up an environment, launched the container, stopped the container, and you see this error: 07:09.000 --> 07:12.000 tried to kill container, but did not receive an exit event. 07:12.000 --> 07:15.000 Kind of gives you the hint that something is stuck somewhere. 07:15.000 --> 07:24.000 So we used sysrq to dump all tasks in uninterruptible sleep, to see who's the one causing the problem. 07:24.000 --> 07:29.000 And lo and behold, you do have some tasks stuck in uninterruptible sleep. 07:29.000 --> 07:31.000 And in this case, you have two. 07:31.000 --> 07:35.000 So on the left-hand side, you see system-probe. 07:35.000 --> 07:40.000 System-probe is a component of the agent which is responsible for managing eBPF and stuff. 07:40.000 --> 07:46.000 And on the right-hand side, you have this rcu_tasks kthread, and it's stuck somewhere as well. 07:46.000 --> 07:50.000 So system-probe is trying to detach fentry. 07:50.000 --> 07:56.000 And while it's detaching fentry, it calls this function, synchronize_rcu_tasks. 07:56.000 --> 07:59.000 And once it calls that, it gets put to sleep. 07:59.000 --> 08:04.000 And on the right-hand side, you have this kernel thread, which is doing some kind of waiting. 08:04.000 --> 08:10.000 And in the middle of this waiting, it calls this other synchronize RCU function and gets put to sleep. 08:10.000 --> 08:15.000 At this point, you can already see that there's some kind of dependency over here, like system-probe is obviously 08:15.000 --> 08:17.000 depending on this kernel thread. 08:17.000 --> 08:21.000 But what's preventing this kernel thread from making forward progress? 08:21.000 --> 08:28.000 Whatever it was, it wasn't showing up in the list of tasks that are stuck in uninterruptible sleep. 08:28.000 --> 08:31.000 So we said, let's dump everything. 08:31.000 --> 08:37.000 All right, pass a t to sysrq and the kernel just shows you all the tasks that are on all CPUs. 08:37.000 --> 08:43.000 And scrolling through that pretty long list, we noticed this, which catches our eye:
08:43.000 --> 08:45.000 s6-svscan. 08:45.000 --> 08:51.000 So s6 is the set of process management utilities that Docker uses. 08:51.000 --> 08:54.000 And by the way, this was a Docker container. 08:54.000 --> 08:57.000 So, the process management utilities that Docker uses. 08:57.000 --> 09:03.000 It caught our eye because this is from Docker, and we saw the issue happen inside Docker. 09:03.000 --> 09:05.000 And also, this is the parent of system-probe. 09:05.000 --> 09:11.000 Not only is it the parent of system-probe, it's also PID 1 in the PID namespace of the container. 09:11.000 --> 09:17.000 So if you look at what it's up to, it's trying to destroy the PID namespace and, as part of that, 09:17.000 --> 09:20.000 it's waiting on something. 09:20.000 --> 09:23.000 So now you have these three entities that in some way seem 09:23.000 --> 09:28.000 to be related to each other, but we don't really know what's precisely causing the problem. 09:28.000 --> 09:38.000 In order to understand that, we need to take a step back and actually see how things work under the hood for each of the different entities that we're trying to understand. 09:38.000 --> 09:40.000 So with system-probe, it's working with fentry. 09:40.000 --> 09:42.000 So let's take a look at what fentry is. 09:42.000 --> 09:48.000 Fentry, very similar to kretprobes, instruments a trampoline instruction at the top of the function. 09:48.000 --> 09:54.000 But unlike kretprobes, where you jump into some shared infrastructure code and you have all of the state management stuff, 09:54.000 --> 09:56.000 this does something very simple. 09:56.000 --> 09:59.000 It just jumps to a trampoline image. 09:59.000 --> 10:02.000 At the beginning of the trampoline image, you run the fentry program. 10:02.000 --> 10:06.000 You call the function, and if you have an fexit program, you call that. 10:06.000 --> 10:08.000 It's very simple. 10:08.000 --> 10:11.000 The trampoline image is generated at load time. 10:11.000 --> 10:14.000 So there's literally a stub associated with each trampoline. 10:14.000 --> 10:18.000 So this scales very well as well, because there's no shared state, no coordination problem 10:18.000 --> 10:23.000 they need to solve, just read-only memory. Amazing. 10:23.000 --> 10:27.000 Now, on the other side, so that was system-probe doing fentry stuff. 10:27.000 --> 10:32.000 On the other side, we had this rcu_tasks kthread. 10:32.000 --> 10:33.000 What is RCU Tasks? 10:33.000 --> 10:36.000 So this is a subsystem of the Linux kernel. 10:36.000 --> 10:41.000 I don't know if we can call it a subsystem, but it's a different RCU flavor. 10:41.000 --> 10:48.000 And the facility that it provides to the kernel is to basically allow it to figure out 10:48.000 --> 10:52.000 when no process is holding a reference to some structure. 10:52.000 --> 10:56.000 So the way normal RCU works is also similar conceptually, 10:56.000 --> 11:00.000 but the problem it's trying to solve is to provide the kernel with a mechanism 11:00.000 --> 11:04.000 to figure out when no CPU is holding a reference to a structure. 11:04.000 --> 11:09.000 So there are some invariants that the code has to observe when it's using normal RCU, 11:09.000 --> 11:15.000 which is that only one process on a CPU can hold a reference. 11:15.000 --> 11:22.000 In contrast, with RCU Tasks, multiple processes on a CPU can hold a reference to some structure.
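For reference, the fentry side that system-probe moved to looks roughly like this minimal libbpf-style sketch; it is illustrative, not our production program, and the argument shown assumes recent kernels where tcp_recvmsg's first parameter is the struct sock pointer. Each such program gets its own generated trampoline, with no shared per-invocation state to contend on.

```c
// Illustrative fentry program (libbpf CO-RE style); not Datadog's agent code.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Attached via a per-function BPF trampoline generated at load time, as
// described above; there is no kretprobe-style shared instance list.
SEC("fentry/tcp_recvmsg")
int BPF_PROG(tcp_recvmsg_enter, struct sock *sk)
{
    bpf_printk("tcp_recvmsg called on sk %p", sk);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```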
11:22.000 --> 11:31.000 And the way the waiting for the release of the structure happens is that 11:31.000 --> 11:36.000 all of the tasks that were running at the start of the wait need to have performed a voluntary context switch 11:36.000 --> 11:39.000 before the wait ends. 11:39.000 --> 11:42.000 So not preemption, but a voluntary context switch. 11:42.000 --> 11:44.000 So how does this work with fentry? 11:44.000 --> 11:46.000 Let's say you have this trampoline image. 11:46.000 --> 11:48.000 Similar to that tcp_recvmsg 11:48.000 --> 11:54.000 example, a bunch of parallel executions where a lot of tasks are executing in this code region. 11:54.000 --> 11:58.000 Let's say something happens, like you need to update that trampoline image. 11:58.000 --> 12:02.000 What the kernel will do is it will generate the new trampoline image 12:02.000 --> 12:04.000 and point the trampoline to that, 12:04.000 --> 12:10.000 so that all subsequent executions start executing in this new stub. 12:10.000 --> 12:14.000 But now the problem that the kernel needs to solve is how does it know that the old 12:14.000 --> 12:17.000 trampoline image is safe to release. 12:17.000 --> 12:22.000 So what could actually happen is, let's say some task is executing in this old 12:22.000 --> 12:24.000 trampoline image, and it gets preempted. 12:24.000 --> 12:28.000 And by the time it comes back online, the kernel may have deleted the 12:28.000 --> 12:29.000 trampoline image from underneath it. 12:29.000 --> 12:33.000 And suddenly it's executing junk instructions and you get a crash. 12:33.000 --> 12:37.000 That's not what we want, so this is where RCU Tasks comes in. 12:37.000 --> 12:47.000 As I said, RCU Tasks allows the kernel to figure out when no process is holding a reference to some structure. 12:47.000 --> 12:49.000 And how does it do that? 12:49.000 --> 12:54.000 So basically the algorithm is broadly comprised of three parts. 12:54.000 --> 12:59.000 The first is that it scans the task list for all tasks on the run queue and 12:59.000 --> 13:01.000 adds them to this holdout list. 13:01.000 --> 13:07.000 The holdout list is all the tasks which have not yet performed a voluntary context switch. 13:07.000 --> 13:13.000 The second part, which is highlighted because this is involved in the bug, is that it 13:13.000 --> 13:16.000 also has to synchronize with late-stage do_exit. 13:16.000 --> 13:21.000 And the reason it needs to do that is that in late-stage do_exit, 13:21.000 --> 13:26.000 a task executing in late-stage do_exit is no longer on the task list. 13:26.000 --> 13:31.000 So the scan of the task list doesn't make that late-stage do_exit task 13:31.000 --> 13:33.000 visible to the RCU Tasks subsystem. 13:33.000 --> 13:40.000 So what can happen over here, if there's no synchronization, is that if you have a probe somewhere in this late-stage do_exit, 13:40.000 --> 13:43.000 the task gets preempted out, 13:43.000 --> 13:47.000 the RCU Tasks subsystem does not see it, so the kernel removes the trampoline image, 13:47.000 --> 13:50.000 it executes junk instructions and it dies. 13:50.000 --> 13:58.000 So in order to prevent that, you need to synchronize with this stage, and to do this the kernel brings in another RCU variant: 13:58.000 --> 14:01.000 this is the SRCU subsystem. 14:01.000 --> 14:09.000 And basically the idea over here is that the RCU Tasks subsystem waits until all tasks that were performing an exit 14:09.000 --> 14:14.000 when the wait started have finished the exit by the time the wait ends.
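A hedged kernel-side sketch of the update-then-wait sequence just described; this is simplified, not the actual BPF trampoline code, and struct tramp_image is made up for the example. The point is only the ordering: publish the new image, wait for an RCU Tasks grace period so that any task preempted inside the old image has voluntarily switched out, and only then free the old image.

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Hypothetical trampoline image type, just for this illustration. */
struct tramp_image { void *text; };

static void tramp_replace(struct tramp_image **slot, struct tramp_image *new_img)
{
    struct tramp_image *old = *slot;

    WRITE_ONCE(*slot, new_img);  /* new invocations jump to the new image     */
    synchronize_rcu_tasks();     /* wait for every task that might have been
                                  * preempted inside the old image to do a
                                  * voluntary context switch                  */
    kfree(old);                  /* safe: nobody can still be executing it    */
}
```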
14:15.000 --> 14:21.000 And it uses this tasks_rcu_exit_srcu lock to do that. 14:21.000 --> 14:28.000 Finally, the kernel loops over the holdout list until all tasks have been seen to perform a voluntary context switch. 14:28.000 --> 14:33.000 And then it's safe to delete what you were trying to delete, 14:33.000 --> 14:37.000 because at that point you know that nobody has a reference to it. 14:37.000 --> 14:43.000 So one crucial thing is that this wasn't part of the original RCU Tasks implementation, as far as I can tell. 14:43.000 --> 14:46.000 At least as far as I could see from the kernel history, 14:46.000 --> 14:53.000 this was explicitly added to resolve this problem of coordination with late-stage do_exit. 14:53.000 --> 14:56.000 So finally, we know what system-probe is up to. 14:56.000 --> 14:58.000 It's doing some fentry stuff. 14:58.000 --> 15:01.000 We know how this rcu_tasks kthread is working. 15:01.000 --> 15:04.000 And the final entity that we were dealing with was s6-svscan. 15:04.000 --> 15:06.000 So, the zapper for the PID namespace. 15:06.000 --> 15:10.000 What it does is straightforward: it sends a kill signal to all the children 15:10.000 --> 15:12.000 and waits until they're dead. 15:12.000 --> 15:19.000 But crucially, it does this waiting inside of the tasks_rcu_exit_srcu, that's a mouthful, 15:19.000 --> 15:21.000 read-side critical section. 15:21.000 --> 15:27.000 So now we kind of have a more precise picture forming of what the dependencies are. 15:27.000 --> 15:31.000 We have system-probe, which calls synchronize_rcu_tasks, 15:31.000 --> 15:35.000 and it's waiting on the rcu_tasks kthread to make forward progress. 15:35.000 --> 15:40.000 The rcu_tasks kthread adds all runnable tasks to the holdout list. 15:40.000 --> 15:47.000 It calls that synchronize SRCU to wait for tasks in late-stage do_exit to finish. 15:47.000 --> 15:53.000 And we have this s6-svscan, which is also in late-stage do_exit. 15:53.000 --> 15:57.000 It is waiting for all processes in its PID namespace to exit, 15:57.000 --> 16:02.000 and it does this waiting inside of the SRCU read-side critical section. 16:02.000 --> 16:05.000 So now you see you have the circular dependency. 16:05.000 --> 16:08.000 One task is waiting for another to make forward progress, 16:08.000 --> 16:11.000 that task is waiting for the other one to make forward progress, 16:11.000 --> 16:14.000 and none of them can, because this is a cycle. 16:14.000 --> 16:16.000 And that was the bug. 16:16.000 --> 16:26.000 So unfortunately, the fix for this came pretty late relative to the kernel versions from that time: 16:26.000 --> 16:28.000 in kernel 6.9, 16:28.000 --> 16:36.000 the author of this patch removed this SRCU mechanism for coordination entirely and came up with a different way, 16:36.000 --> 16:39.000 so that this problem no longer exists. 16:39.000 --> 16:43.000 And when we were trying to port over to fentry, this patch had not been backported. 16:43.000 --> 16:45.000 But at this point it has been backported. 16:45.000 --> 16:49.000 So it's safe to use fentry; I don't know how far back it's been backported, 16:49.000 --> 16:52.000 but it's generally safe to use fentry. 16:52.000 --> 16:56.000 From this patch thread, we also found another investigation into this issue. 16:56.000 --> 17:04.000 In that case, the author ran into this same deadlock 17:04.000 --> 17:07.000 when dealing with PID namespaces. 17:07.000 --> 17:14.000 But as far as I know, I haven't seen any documentation of anybody running into this problem with fentry before.
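To summarize the shape of the deadlock in code, here is a hedged, simplified illustration; these are not the real kernel functions (wait_for_children and finish_grace_period are made-up stand-ins), but the SRCU calls are the real API, and the cycle is the same: one side sleeps inside an SRCU read-side critical section waiting for the other, while the other side calls synchronize_srcu on the same srcu_struct and therefore waits for the first side to leave its read section.

```c
#include <linux/srcu.h>

DEFINE_SRCU(exit_srcu); /* stand-in for tasks_rcu_exit_srcu */

/* Hypothetical helpers, stubbed only so the sketch is self-contained. */
static void wait_for_children(void) { /* in reality: blocks until children exit */ }
static void finish_grace_period(void) { /* in reality: ends the grace period */ }

/* Side A: the PID-1 zapper (s6-svscan), waiting for its children to die while
 * still inside the read-side critical section taken around late do_exit(). */
static void zapper_side(void)
{
    int idx = srcu_read_lock(&exit_srcu);
    wait_for_children();               /* ends up waiting on side B's progress */
    srcu_read_unlock(&exit_srcu, idx);
}

/* Side B: the rcu_tasks kthread, which system-probe's fentry detach is itself
 * waiting on via synchronize_rcu_tasks(). */
static void grace_period_side(void)
{
    synchronize_srcu(&exit_srcu);      /* blocks until side A leaves its section */
    finish_grace_period();
}
```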
17:14.000 --> 17:19.000 So that was our foray into trying to scale kretprobes. 17:19.000 --> 17:23.000 We tried fentry and failed because of a bug. 17:23.000 --> 17:31.000 And now we can change tack a little bit and go to this other subsystem of the Linux kernel, which is the uprobe subsystem. 17:31.000 --> 17:38.000 We recently were trying to resolve this production incident where we were getting reports of service 17:38.000 --> 17:42.000 unavailability on a few nodes in production. 17:42.000 --> 17:53.000 And eventually somebody figured out that khungtaskd was reporting that a bunch of tasks were just stuck. 17:53.000 --> 17:56.000 They were waiting for more than two minutes to proceed. 17:56.000 --> 17:59.000 Some of these tasks were specific to the production application. 17:59.000 --> 18:03.000 Some of them were just random system tasks like bash, systemd, runc, etc. 18:03.000 --> 18:08.000 And we got pulled into this because amongst the stack traces that khungtaskd was reporting, 18:08.000 --> 18:12.000 there were stack traces pointing to uprobe registration and uprobe firing. 18:12.000 --> 18:19.000 So we started investigating, looking into whether the agent could somehow be triggering the issue. 18:19.000 --> 18:27.000 So we started the investigation, going through each of the hypotheses. Is it a deadlock? 18:27.000 --> 18:32.000 We saw an almost circular lock dependency, but not really. 18:32.000 --> 18:34.000 So we could not conclude it was a deadlock. 18:34.000 --> 18:42.000 Is it a memory corruption? Because we found a patch that the kernel we were running was missing, 18:42.000 --> 18:45.000 and it was talking about a memory corruption in a struct. 18:45.000 --> 18:53.000 So we built a custom kernel with some modifications to trigger this corruption deterministically, to see what happens. 18:53.000 --> 18:58.000 But the symptoms that we saw in that case did not match the symptoms that we were seeing in production. 18:58.000 --> 19:00.000 So not that. 19:00.000 --> 19:02.000 And then we had this wild hypothesis: 19:02.000 --> 19:06.000 some kind of livelock due to memory pressure, where the kernel is trying to allocate memory 19:06.000 --> 19:09.000 but failing due to memory pressure and looping somehow. 19:09.000 --> 19:16.000 But we evaluated all page allocation paths in uprobe registration and uprobe firing, and this did not seem possible. 19:17.000 --> 19:24.000 So we had a lucky break when, in order to test the deadlock hypothesis, 19:24.000 --> 19:37.000 we had built this very simple eBPF-based utility that was leveraging these contention_begin and contention_end tracepoints to basically track lock contention in the kernel. 19:37.000 --> 19:44.000 And we stumbled across something that gave us the hint as to what could be going wrong. 19:44.000 --> 19:49.000 So, in order to kind of set the stage for that: 19:49.000 --> 19:55.000 I was going to talk a little bit about how uprobe registration works, but I'll skip this because it's not really that important. 19:55.000 --> 19:59.000 I'll go over to this one, which is more relevant: 19:59.000 --> 20:03.000 when the uprobe registration code path is happening, you take a bunch of locks. 20:03.000 --> 20:08.000 So you take the register rwsem lock on the uprobe that you're dealing with. 20:08.000 --> 20:13.000 You take this interesting dup_mmap_sem lock. 20:13.000 --> 20:17.000 So this lock is very interesting because it synchronizes
20:17.000 --> 20:21.000 uprobe registration with forks, and the problem it's trying to solve is that 20:21.000 --> 20:25.000 while an unregistration is happening and a fork happens simultaneously, 20:25.000 --> 20:31.000 a child mm struct may inherit the debug trap that uprobes put in, 20:31.000 --> 20:33.000 without an associated uprobe. 20:33.000 --> 20:40.000 So this has to synchronize the unregistration with the creation of the child mm struct. 20:40.000 --> 20:48.000 And then finally, what this was basically showing is that, using the inode of the file, 20:48.000 --> 20:52.000 the kernel discovers all of the mm structs, 20:52.000 --> 20:56.000 so all of the address spaces that are mapping the file that we're trying to probe. 20:56.000 --> 20:59.000 It takes a write lock on each one, does some work and unlocks it. 20:59.000 --> 21:05.000 The reason it takes a write lock on the mm struct is to synchronize with uprobe firing. 21:05.000 --> 21:10.000 So the problem it's trying to solve over here is again with unregistration: 21:10.000 --> 21:15.000 you don't want the uprobe firing path to discover a uprobe that you have 21:15.000 --> 21:20.000 unregistered since then, so that it runs and it doesn't find 21:20.000 --> 21:23.000 the probe down the line and returns an error. 21:23.000 --> 21:27.000 So you need to have synchronization going on over there. 21:27.000 --> 21:32.000 Essentially, we tried to graph what was going on with the uprobe registration path. 21:32.000 --> 21:38.000 And you see that in system-probe one execution is going down this path, and you see a bunch of locks get taken. 21:38.000 --> 21:47.000 And in every single case we were seeing it just get blocked trying to acquire a write lock on the mmap lock. 21:47.000 --> 21:58.000 Going over to uprobe firing, the important thing over here is that it works with the address space of the process in which the uprobe is firing. 21:58.000 --> 22:01.000 And it takes a read lock on it. 22:01.000 --> 22:07.000 And in the case of uprobe firing, we were seeing a large number of tasks blocked in find_active_uprobe. 22:07.000 --> 22:10.000 And this is the function where you take the read lock on the mm struct. 22:10.000 --> 22:14.000 And again, it was blocked trying to acquire this read lock. 22:14.000 --> 22:20.000 And lots of application threads were piling up waiting for this lock. 22:20.000 --> 22:22.000 So what is this mmap lock? 22:22.000 --> 22:27.000 Well, it's this read-write semaphore: it allows concurrent readers but exclusive writers. 22:27.000 --> 22:34.000 So the eBPF utility that we had been talking about, we modified it to try to get some questions answered about what's going on. 22:34.000 --> 22:40.000 The important thing over here is what the depth of the wait queue is when these issues happen. 22:40.000 --> 22:46.000 So the wait queue is, like, the tasks waiting to acquire the semaphore. 22:46.000 --> 22:50.000 What's the contention duration? Like, how long is the contention going on for? 22:50.000 --> 22:52.000 And whether forward progress is always made. 22:52.000 --> 23:00.000 This was added to rule out deadlocks. And we figured out that even though things were waiting for two minutes plus, they were making forward progress. 23:00.000 --> 23:06.000 So eventually what we saw was that the wait queue became very, very deep. 23:06.000 --> 23:12.000 And the pattern was that there were lots of readers with a few writers interspersed everywhere.
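Here is a minimal sketch of the kind of lock contention tracker described above; it is not Datadog's actual utility, just an illustration built on the lock:contention_begin and lock:contention_end tracepoints. It records how long each thread waits on a kernel lock; the wait-queue depth and forward-progress checks mentioned in the talk would be layered on top of something like this.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* thread id -> timestamp when it started waiting on a lock */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, u64);
} wait_start SEC(".maps");

SEC("tp_btf/contention_begin")
int BPF_PROG(on_contention_begin, void *lock, unsigned int flags)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 now = bpf_ktime_get_ns();

    bpf_map_update_elem(&wait_start, &tid, &now, BPF_ANY);
    return 0;
}

SEC("tp_btf/contention_end")
int BPF_PROG(on_contention_end, void *lock, int ret)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 *start = bpf_map_lookup_elem(&wait_start, &tid);

    if (start) {
        u64 waited_ns = bpf_ktime_get_ns() - *start;

        /* Report pathological waits (here: longer than one second). */
        if (waited_ns > 1000000000ULL)
            bpf_printk("lock %p: waited %llu ns", lock, waited_ns);
        bpf_map_delete_elem(&wait_start, &tid);
    }
    return 0;
}
```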
23:12.000 --> 23:19.000 And these writers were coming from mmap and madvise operations in the tasks, and from uprobe registration. 23:19.000 --> 23:31.000 So the way read-write semaphores work is that every time a writer gives up the lock, it obviously wakes up the readers in the wait queue. 23:31.000 --> 23:37.000 But it only wakes up 250 readers at the same time, to avoid, like, a thundering herd situation. 23:37.000 --> 23:42.000 So this means that the wait queue gets broken up, and then batches of readers become serialized. 23:42.000 --> 23:53.000 And not only that, in order to prevent writer starvation, if there's ever a writer at the beginning of the wait queue, no concurrent readers are allowed. 23:53.000 --> 23:57.000 So this creates a situation where readers start to pile up. 23:57.000 --> 24:04.000 And if there are writers interspersed here and there, you have these readers serialized in batches. 24:04.000 --> 24:09.000 And then eventually a writer makes it to the head of the wait queue and suddenly no more readers. 24:09.000 --> 24:15.000 And if you have a uprobe in the hot path, suddenly a lot of readers are collecting. 24:15.000 --> 24:24.000 So this diagram is supposed to explain why we were seeing weird tasks like bash and runc getting stuck. 24:24.000 --> 24:33.000 So, as I mentioned, in the uprobe registration path, let's imagine you're waiting for two minutes to take this mmap lock because the wait queue is so deep. 24:33.000 --> 24:37.000 But note that you've taken this dup_mmap_sem, which synchronizes with fork. 24:37.000 --> 24:43.000 So suddenly, all forks in the system are waiting on you to get this mmap write lock. 24:43.000 --> 24:48.000 So suddenly, if you do SSH, a bash comes up, it can't fork, nothing happens. 24:48.000 --> 24:53.000 systemd can't do anything, runc can't do anything, your system is pretty much bricked. 24:53.000 --> 25:01.000 And in the case of uprobe firing, it's blocked on getting a read lock, but it's still blocked. 25:01.000 --> 25:09.000 In this case, application performance suffers because application threads just start to pile up in the kernel and they can't make forward progress. 25:09.000 --> 25:15.000 So bad things happen, the application goes down, there's unavailability, etc. 25:15.000 --> 25:28.000 So eventually there's a fix in the kernel, and right now, in the very latest versions, I think they have this speculative acquisition approach for getting the mmap lock. 25:28.000 --> 25:36.000 So they don't take it unnecessarily; they only take it if they see that something has changed underneath them, and then they repeat the operation with the lock in place. 25:36.000 --> 25:40.000 So this really improves the scalability of the operation. 25:40.000 --> 25:50.000 So basically, the common thread that I was trying to get to was that concurrency is very hard. 25:50.000 --> 26:00.000 And even though eBPF is meant to be very usable and simple, sometimes the complexity of the kernel leaks into it, especially when you're trying to do things at scale. 26:00.000 --> 26:10.000 And sometimes you have to go beyond the eBPF program to figure out where some problem may be coming from. 26:10.000 --> 26:16.000 So thank you, thank you for taking the time to listen. 26:16.000 --> 26:20.000 If we have time for questions, I'll take them. 26:20.000 --> 26:24.000 Yeah, we do have time for a couple of questions. 26:24.000 --> 26:26.000 Thank you. 26:26.000 --> 26:28.000 Just curious about the preempt.
26:28.000 --> 26:38.000 If you could talk a little bit more on that, I'm not sure if I quite got the relevance of the preempt on the slide you were discussing. 26:38.000 --> 26:48.000 So, in this last part. 26:48.000 --> 26:49.000 Over here. 26:49.000 --> 27:07.000 So basically, the problem you're trying to solve is that in certain configurations of the kernel, especially when the kernel can be preempted, what may happen is that while a task is executing kernel code, it can get scheduled out. 27:07.000 --> 27:13.000 So imagine that a process gets scheduled out while it's executing the fentry mechanism. 27:13.000 --> 27:19.000 It technically still has a reference to the trampoline image, but normal RCU cannot track that. 27:19.000 --> 27:22.000 Well, in some cases, in some cases it can. 27:22.000 --> 27:36.000 So what you're trying to get assurance of is that the task that was executing fentry has performed a voluntary context switch, because at that point you know it has completed the fentry trampoline image 27:36.000 --> 27:42.000 and given up execution, because it's done with it. 27:42.000 --> 27:44.000 Thank you. 27:44.000 --> 27:48.000 One last question. 27:48.000 --> 27:54.000 I guess the other conclusion is to always use the latest kernel. 27:54.000 --> 27:59.000 One other question: is it also fexit-related? 27:59.000 --> 28:04.000 Did it also expose the bug, or was it really only specific to fentry? 28:04.000 --> 28:07.000 It's the whole fentry, fexit... 28:07.000 --> 28:08.000 Okay. 28:08.000 --> 28:12.000 Because if you look at the diagrams over here, it's the trampoline. 28:12.000 --> 28:13.000 Yeah, the trampoline infrastructure, that is. 28:13.000 --> 28:14.000 I get it, yeah. 28:14.000 --> 28:15.000 I agree. 28:18.000 --> 28:19.000 Thank you. 28:22.000 --> 28:23.000 Thank you. 28:23.000 --> 28:24.000 Thank you.