Welcome. I'm going to do the first presentation, so I think everything is going to go in crescendo. I'm going to present statistical path coverage. This is a technique, or a method, that we started working on in the SIL2LinuxMP project, led by OSADL. That was a project based on research into Linux safety, and within it we started with statistical path coverage. It was very focused on safety-related systems, but we think it is interesting for all types of critical systems, not only safety ones, and it may be interesting for improving the testing strategy of these systems.

So, first of all, let's put it a little bit in context. The critical systems that we are building today — we say the next-generation critical systems, but we are already building them — are running deep learning algorithms, especially with the rise of AI. Also, they have...

Okay, okay — louder? Okay, I'm going to start again. So, first of all, the context is that the critical systems we are building currently have different requirements: some of them are running deep learning algorithms, security plays a key role in all these applications, and they have high performance requirements. Different industries have started to employ commercial off-the-shelf multi-core processors for the performance they provide, but this hardware was designed for average or maximum performance, not for critical systems.
For example, for safety-related systems, right now we don't have certified multi-core hardware that we can just use. With all these points, in SIL2LinuxMP we saw that there was a need for an operating system able to run complex algorithms with high performance requirements and also with security requirements. If we check the Linux kernel, I would say we can see that it is the king of the non-critical domain. It is the leading operating system in embedded systems and in smartphones, it is almost the only solution in supercomputers, and it also runs the majority of web servers. There are different reasons for this, but the main ones are: it has wide hardware support — we have lots of drivers in the Linux kernel, and this provides all that support; it has really good, significantly better, multi-core capabilities than the alternatives; and its security capabilities are important too, which is why different companies, and even governments, choose Linux. Also, if we developed a new system from scratch, we would not have a developer community like the one Linux has, right? So the question that arises here is: can we use Linux for critical systems? Right now we have thousands of Linux-based computers orbiting the Earth, so different companies and governments rely on Linux to build their satellites. We can find Linux even in a space rocket, in the control units of a space rocket, and we can find Linux even on other planets, like on Mars: NASA deployed a small drone, a small helicopter, based on Linux and a Snapdragon multi-core processor.
Also, different governments and companies rely on Linux for telecommunications, critical servers and, for instance, banking systems. So, can we use it for critical systems? I would say that, yeah, we are already using it. But then, how do we test Linux for this? Here, when we started with Linux, we identified some problems. We know that the testing and analysis of a critical system is crucial, and that we need to quantify this testing effort: when we are doing testing, we need to know how much we have tested. Traditionally, the objective of the testing process was to get 100% test coverage, but we are going to see in this presentation that sometimes this is not possible, not feasible, and not even desirable — we don't need 100% test coverage. I'm going to give some definitions from functional safety. They are only for functional safety, but I think it is interesting to have these ideas for all critical systems. When we talk about functional safety, we are talking about systems with an acceptable risk, or freedom from unacceptable risk. Sometimes we hear from some people that safety systems are systems without any risk — and any developer in this room will tell you that no system has zero bugs, right? So every system has some risk, but we need to achieve an acceptable risk. Therefore, in system safety, the objective of the engineering processes and tools is to achieve this acceptable risk.
For that, we use testing and analysis. Testing helps us to fix bugs, or mitigate them, and we also understand the system better while we are doing this testing. Consequently, we need to quantify this testing effort, this testing process: we need to know how much we know about the system, or how well tested it is. To quantify this testing effort we always talk about test coverage, and there are different metrics for it: we can find branch coverage, function coverage, line coverage — there are different types. For the SIL2LinuxMP project, given its objective, we focused on path coverage, but that is not the important point here: the presentation talks about path coverage, but the idea is not tied to it. The important note here is that, for instance, IEC 61508 — which we can consider the generic standard for functional safety systems — notes that where 100% coverage cannot be achieved, an appropriate explanation should be given. This is one point that we need to remember throughout the presentation. So, let's try to get 100% test coverage, right? We take the Linux kernel — let's download it and check. Linux has over 27 million lines of code, so it is a huge project. Almost 13 million lines of it — maybe by now even more, I don't know — roughly half, is drivers, hardware support, so we are not going to use all these 27 million lines of code, right?
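To make concrete why path coverage in particular becomes infeasible at this scale, here is a small sketch — not from the talk, the function and numbers are purely illustrative. A routine with k independent two-way branches needs only 2·k outcomes for full branch coverage, but 2^k distinct paths for full path coverage:

```python
# Branch coverage grows linearly with the number of branches, while
# path coverage grows exponentially: k sequential if-statements give
# 2*k branch outcomes but 2**k distinct execution paths.

def coverage_targets(k: int) -> tuple[int, int]:
    """Return (branch outcomes, distinct paths) for k two-way branches."""
    return 2 * k, 2 ** k

for k in (10, 20, 30):
    branches, paths = coverage_targets(k)
    print(f"{k:2d} branches: {branches} outcomes vs {paths} paths")
```

Already at 30 branches there are over a billion paths, which is one reason the standard allows arguing about coverage instead of demanding 100%.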
Also, our application is going to use only part of this kernel, only the features that we need for our application, so it is not going to exercise all of it. It is also important to know that Linux is continuously evolving. We said already that these critical systems need to be updated: traditionally, critical systems were frozen and we did not update them, but now updating the system is going to be important for the security requirements. Linux has an update rate of around six patches per hour, and it is developed every day of the year — weekdays, weekends, every day. So, what is going to be exercised in the Linux kernel by our critical application? Starting like that, it was like: okay, let's just do static code analysis and see what is going to be exercised. We got this huge call graph — enormous call graphs that were not useful for anything. You cannot deal with that; it is enormous, it makes no sense. Moreover, these call graphs give only partial results, because they cannot resolve decisions that are taken at runtime: they cannot resolve indirect calls, they cannot resolve aliases, they include code that is never going to be used and therefore is never going to be tested, and they cannot handle assembly code. To this problem we need to add another feature of the Linux kernel: it is non-deterministic. The Linux kernel being non-deterministic means that the same application with the same inputs may follow different execution paths. Traditionally, we had deterministic systems.
That meant that, for the same input, the executed function sequence was always the same, and therefore it was easier to test. But the Linux kernel is non-deterministic, and this is due to the global state of the system: the kernel will select which execution path to follow depending on the global state of the system. As an example, if we make an application that writes to /dev/null — we write some string to /dev/null — we get a really clean and nice-looking execution trace like this one. It is very short which, taking into account that the Linux kernel is designed and developed from a performance point of view, makes perfect sense. But sometimes asynchronous events will happen, like RCU callbacks, which we can find almost anywhere in the execution trace. So, regarding non-determinism, it is important to take into account that it is not possible to force the execution of a specific path, because it does not depend only on the input. We cannot force the execution because, for the same input, there can be many execution paths; it depends on the state of the system. And this state is generally not reproducible, due to its complexity: there is an enormous amount of asynchronous events and concurrency going on in hardware and software. With all this, we identified the many issues we would have in testing. We saw that one iteration of a test is not enough, and this is why many kernel developers struggle reproducing some bugs: they know that from time to time a bug is happening, but they are not able to force the execution path of that bug.
They hit it maybe once per week and know that something is going on. Therefore, we need to execute all the tests repeatedly, and one conclusion is that we need continuous testing — continuous testing is okay. But the problem is that we do not know which traces, which execution paths, can be executed. So the questions that arise here are: which traces need to be tested, how many traces do we need to test, and how do we quantify this testing effort? That is why we started thinking about it and said: let's go to a statistical world, let's use probabilities for this. So we are going to show a statistical path coverage that is based on statistical analysis, to do test quantification and also to be able to estimate the residual risk in software. The approach is based on probabilities and not only on possibilities: we are going to focus on the credible risks of our critical system and not on all risks, because that makes no sense. For doing that, we are going to record the behavior of our system; with this recording we get data on which we can perform statistical analysis, and finally we can quantify the testing and the likelihood of software execution. Statistical path coverage can be divided into three phases: data collection, data modeling and risk estimation. For the first one, data collection, we built a tool, DV4C2, based on dynamic data collection with ftrace, which is well known among kernel developers.
ftrace is included in the Linux kernel and allows us to record the execution that is going on inside the kernel. The tool is publicly available in our repository on GitLab, so you can view it or test it if you want. Basically, it has a client and a server. On the client we have our critical system, recording what is going on. Once we have all these recordings, we send them to the server and post-process them. In this post-processing we identify the system calls — right now; this is not complete, but as a prototype it works well — because we consider the system calls the entry points into the kernel. We do an analysis of independence between system calls, and we check that system calls are independent. And we calculate the MD5 hash of these execution traces. Why do we calculate the hash? Because it makes the analysis much easier: we can see how often each sequence appears, the frequency of each system call. It makes the later statistical analysis much easier, so we work with MD5 hashes. We are also working on a graphical user interface for the tool, to make it easier to use. It is not yet publicly available, but the idea is to have it. There we can show different data, for instance the sequence diagrams — the execution sequence diagrams per system call. So we have the system calls; in black lines we find the common path, the one I showed you before, for instance.
And in red lines, the rare ones — the darker the color, the rarer they are — we identify the subpaths that have happened, when they happened and with which frequency. We are also able to plot histograms, so we see how many times each trace has been executed within the testing process. And we get an idea: since Linux is developed from a performance point of view, the common path is normally executed a lot of times, but we have some rare cases. Sometimes there are rare traces happening, and there is a huge amount of these rare traces. Once we have all this data, we can continue to the next phase, which is data modeling. For this data model we chose a parametric approach. Why a parametric approach? Because if we get a model with fixed parameters that describes the behavior of the system, we are able to extrapolate from the model we have. So if we get this model, we can extrapolate and say: okay, I got this decay during 10,000 hours of testing; let's go and extend it, right? For these events — I'm not going to go deep into the statistics; if you want, you can ask me or send me an email — we chose a rare-events approach with a Poisson distribution, but for that we need to focus on the execution traces that happen rarely. To select these execution traces, the rare ones, we use entropy, information theory: basically, we divide the traces into two groups carrying the same amount of information, and we can focus on the rare group.
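A minimal sketch of the post-processing idea described above — this is not the project's actual code; the trace format, function names and the exact split rule are my assumptions. Each recorded system-call path is collapsed to an MD5 hash so identical paths can be counted instead of stored, and the observed traces are then split into two groups carrying roughly equal Shannon information, isolating the rare tail:

```python
import hashlib
import math
from collections import Counter

def trace_id(syscall_path):
    """Collapse one recorded execution path to a short MD5 identifier,
    so identical paths collapse to one countable key."""
    return hashlib.md5("->".join(syscall_path).encode()).hexdigest()

def entropy_split(counts):
    """Split trace ids into two groups carrying roughly equal Shannon
    information: the frequent traces vs the long tail of rare ones."""
    total = sum(counts.values())
    info = {t: -(c / total) * math.log2(c / total) for t, c in counts.items()}
    target = sum(info.values()) / 2
    frequent, acc = [], 0.0
    for t, _ in counts.most_common():      # most frequent first
        if acc >= target:
            break
        frequent.append(t)
        acc += info[t]
    rare = [t for t in counts if t not in frequent]
    return frequent, rare

# Toy recordings: one dominant write path plus a few interrupted variants.
runs = [["openat", "write", "close"]] * 90 + \
       [["openat", "irq_entry", "write", "close"]] * 9 + \
       [["openat", "write", "irq_entry", "close"]]
counts = Counter(trace_id(r) for r in runs)
frequent, rare = entropy_split(counts)
```

The hash makes frequency counting cheap even when millions of raw trace records come in from the client.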
Here, in the plot, we see the number of rare traces that have appeared during the testing process, during the test cycles, the test campaigns, that we ran. And we see that while we are testing the system, the number of these new rare traces keeps decreasing — it makes sense, right? Because we know more and more about our system. Therefore, we can fit a model to this decay. Afterwards, we can extrapolate this model and ask: okay, instead of 250 test campaigns, test cycles, if we go to infinity, how many traces would appear? Just by doing an improper integral, calculating the area, we can make this estimation. In the use case we had, which was an autonomous emergency braking system, we got a test coverage of 85%. So 15% was not tested in 10,000 hours of testing. Is this 15% an acceptable risk or not? That is the question, right? We have 85%, but what does 85% mean in this case? We need to know whether it is an acceptable risk or not, so for that we need to estimate the risk. And if you remember: where 100% coverage cannot be achieved, an appropriate explanation should be given. So let's calculate the risk as that explanation, and then we can see whether we have freedom from unacceptable risk or not. Risk can be calculated as probability times severity. Let's go to the worst-case scenario, where the severity is one, because we consider that the execution of an untested trace is unacceptable — it is catastrophic, right?
We need to calculate the probability for that. And to calculate the probability of an event that did not happen, we can use Simple Good–Turing estimation, which is well known in statistics, and with it we calculate that probability. So now we know the probability of executing one of the untested traces, right? In hardware, it is common to find in different standards or manuals values for the probability of failure per hour. In software this is not that common: there are some standards, but software developers do not talk about it that much, right? And I think that for this case it is interesting to have these ranges, so we can take a range and say: okay, this is acceptable, or it is not acceptable. Furthermore, this probability can be improved, and it can be improved by capitalizing on the complexity of the system. If we know that the Linux kernel is non-deterministic, we can use redundant architectures, which are well known in critical and safety systems, and say: okay, execute the application at the same time in two containers, for instance. The probability that both containers execute an untested trace at the same time is lower, so we can reduce this probability by capitalizing on the non-determinism of the Linux kernel, using redundant architectures. Instead of two channels, if we have three channels or four channels, this risk will keep decreasing.
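These two steps can be sketched like this — again illustrative, not the project's code. The Simple Good–Turing estimate assigns the unseen traces a total probability mass of N1/N (the number of traces seen exactly once, divided by all observations), and if two redundant channels follow independent execution paths, the chance that both sit inside an untested trace at the same moment is that mass squared. Real channels are never perfectly independent, so the squared figure is an optimistic bound:

```python
from collections import Counter

def unseen_mass(trace_counts):
    """Simple Good-Turing estimate of the total probability of
    never-observed traces: P0 = N1 / N, where N1 is the number of
    traces seen exactly once and N the total number of observations."""
    n1 = sum(1 for c in trace_counts.values() if c == 1)
    return n1 / sum(trace_counts.values())

def redundant_risk(p0, channels=2):
    """Probability that every redundant channel executes an untested
    trace simultaneously, assuming independent execution paths."""
    return p0 ** channels

# Toy campaign: one dominant trace, a few rare ones, three singletons.
counts = Counter({"t0": 5000, "t1": 120, "t2": 40, "t3": 1, "t4": 1, "t5": 1})
p0 = unseen_mass(counts)        # 3 singletons / 5163 observations
```

The resulting number is what can then be compared against probability-of-failure-per-hour ranges from the hardware standards.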
So, just to end, some conclusions about the presentation and the method. We proposed this statistical method to estimate the number of traces that we are going to exercise and that have a relevant probability of being executed. We can estimate the execution probability of what we have not tested, and therefore we can also calculate the residual risk of these untested traces. I would say that this can be a problem, but also an opportunity: if we capitalize on complexity and focus not on all possible risks but on credible risks, not on all possible paths but on paths with a relevant probability of being executed, we can move to a probabilistic world and capitalize on that. So it would be an opportunity and not only a problem. And we are open to other statistics: the idea was not to select which statistic to use, the idea was to see whether it is feasible to use a statistical approach for this, and we think the objective was achieved. We know that the technique is only possible if continuous monitoring is done, but I think that for the next-generation critical systems continuous monitoring will be mandatory anyway. And it would be great to have additional expert reviews, or reviews by certification authorities. As future work, we want to extend the analysis beyond system calls: we want to include all the calls that are in the Linux kernel; we know that the current analysis is limited.
We want to publish the graphical interface in the public repository soon, once it is working adequately. We also noted that this is an argument that must be updated continuously, and our statistical model has some parameters that will detect whether an update changes the behavior of the system. Right now it is working correctly, but we need further analysis on this too, and that is the next big step we are taking in this statistical path coverage. Also, this is an in-context technique that depends on the use case, so it would be great to have different use cases to analyze, with continuous monitoring in place.

So, thank you very much. If anyone has any question, you can ask it right now, or you can send me an email if you want.

Q: I would like to ask about the unique traces and the experiment as a whole. Were there updates to the kernel version or the user space during these 10,000 hours?

A: No. Sorry, I'll repeat the question: he asked whether the system was being updated during the experiment, and it was not; it was a static kernel that was not being updated. The update part is being tested right now; we are dealing with that. All these statistical models have parameters that will detect a change in the behavior of the system if it happens. But in these results, it was not updated.

Q: Okay, thank you. All clear.

Okay, great — next question?

Q: Do you have a target for the reduction of the number of tests, of your testing time — do you expect to save, say, 20 percent of the time?
A: No — it is not about the percentage of test coverage, it is about the residual risk. That is why I talked about probability of failure per hour. In hardware there are well-known values — IEC 61508, the generic standard, gives some — and we did the comparison with those values; using a redundant architecture we can achieve them in this case. But there are no well-known ranges selected for software right now.

Q: You gave an example and showed us that, for the test you did, you got about 85 percent test coverage, and you asked whether that was enough. Do you have any metrics to find out whether or not it is enough — did you get any results on that?

A: Can you repeat it?

Q: Do you have any indication of whether or not 85 percent is a good number for testing this kernel, whether it is enough for the use cases?

A: Okay — sorry, he asked whether 85 percent test coverage is enough or not. Traditionally, one would say that if you do not have 100 percent, it is not enough. I say that this is why we do the risk estimation, and that is the important thing. The test coverage can give you some feeling of whether it is enough or not, but the important thing is the risk. I can say that it is 99 percent, but if that 1 percent is catastrophic, 99 percent is not enough; and if a missing 10 percent is non-catastrophic, it is enough. So it is not about getting the number — and this happens to every developer: when you are testing, the objective of testing is to find and fix bugs, not to reach 100 percent, but sometimes you are doing the tests and you need to
You are doing the test to get the number and that doesn't make sense, right? 25:45.880 --> 25:52.840 It's not about getting the number, it's making the system safe. Thank you. 25:53.880 --> 26:00.920 Yeah? You talked about needs continuous monitoring, what you mean by continuous monitoring 26:01.160 --> 26:07.880 and what purpose? Okay, so yes, asking about continuous monitoring. 26:10.840 --> 26:16.040 So continuous monitoring here should be, I think, deciding in context of the use case, 26:16.040 --> 26:20.760 we need to doubt that because the test will be designed depending on the use case, 26:21.720 --> 26:28.360 but like in real-time Linux they are using continuous monitoring to see how it's 26:28.360 --> 26:34.200 pre-altime Linux working on, for that we also will need to do that for these systems. 26:34.200 --> 26:42.440 And before for instance, we have to unafter to the critical system that it's in the route or in the 26:42.440 --> 26:47.720 street or in an industry, we need to do this continuous monitoring and our labs. So that's why 26:47.720 --> 26:52.360 we need these systems running in our labs, being tested continuously and checking all these 26:52.360 --> 27:03.720 topics, it's going to be okay, you're not okay. That's it. Thank you very much.