WEBVTT 00:00.000 --> 00:20.000 Thank you very much for your patience. First of all, I've already asked who's heard of MSF, 00:20.000 --> 00:26.680 but for the people who haven't heard of MSF, this is us: an international, independent, medical humanitarian 00:26.680 --> 00:34.200 organisation. The independence is really important; that is what enables us to speak out when 00:34.200 --> 00:39.360 we witness, for example, violations of humanitarian law, war crimes, and that kind of stuff, 00:39.360 --> 00:44.600 and I'll talk a bit more about that in a minute. I'm going to talk very briefly about who we are, 00:44.600 --> 00:49.080 what we do, where we use NixOS, why and how, and then if I move really quickly there'll be 00:49.080 --> 00:58.600 time for questions. I'm Ian, by the way, this is Shaw, and Ramses is over here. So, we're quite 00:58.600 --> 01:05.320 proud of what we do. We work all over the world, and almost all of our funding comes from individual 01:05.320 --> 01:11.480 donors, and that's what means we can be independent. So we don't have to worry about governments 01:11.480 --> 01:16.440 cutting off our funding because they don't like us denouncing their migration policy or enabling 01:16.440 --> 01:20.280 or committing war crimes, for example, which you might have read about in the news recently 01:20.280 --> 01:27.080 and which gets more and more important. So, Shaw and myself, we work for MSF; our payroll is up there 01:28.200 --> 01:33.640 in the general administration section. So, yeah, we do a lot of stuff. You might be wondering, 01:33.640 --> 01:38.280 okay, fine, vaccinations, but what does NixOS have to do with that? The answer is that you'll find 01:38.280 --> 01:43.720 NixOS in our systems deployed in the cloud, and also deployed in the healthcare facilities 01:44.680 --> 01:51.240 all over the world where we work.
So, for example, in an ITFC, that's an intensive therapeutic 01:51.240 --> 01:56.840 feeding center, which is used for feeding people who are either very small children or who are 01:56.840 --> 02:03.000 too malnourished, and I'll show you some photos of them in a minute. This is a picture of two 02:03.000 --> 02:08.920 ambulances in Haiti. The reason I'm showing you these is because there's a story behind them. At Christmas 02:08.920 --> 02:15.960 2023, we were forced to stop activity in Haiti because our patients were being killed in transit 02:15.960 --> 02:21.880 between hospitals. They'd be dragged out of the ambulance and shot. So, while we couldn't guarantee 02:21.880 --> 02:27.320 the safety of our patients or our staff, we had to stop work. While we stopped work, and while 02:27.320 --> 02:33.560 our hospitals were empty, we went in there with two of our guys, Carl and Daniel, our X-ray and 02:33.560 --> 02:38.200 network technicians. We upgraded the network, and then we installed a virtual machine running NixOS, 02:38.200 --> 02:44.520 and on top of that, an X-ray image archiving system. And when the hospital came back, it came back 02:44.520 --> 02:53.080 better. The surgeons were better able to look at the X-rays for these kinds of operations. MSF is a 02:53.080 --> 03:02.680 big organization. Think Fortune 500 company size, but without the fortune. Shaw and I work 03:02.680 --> 03:09.400 for a part of MSF called the Operational Center Brussels, which in total is about 15% of the 03:09.400 --> 03:17.960 global staff. And IT inside the HQ is 505 people. Five people work on NixOS. So what I'm trying to 03:17.960 --> 03:26.040 convey here is that we do not have a choice. We have to work at scale. Right. Next slide. These are 03:26.040 --> 03:31.720 some of the cases where we use centralized data systems. I'm going to explain one of 03:31.880 --> 03:37.560 those systems to you now. So this is the [unclear] intensive therapeutic feeding center.
You can see the solar panels 03:37.560 --> 03:43.480 on the roofs. Those are new. The deployment of solar panels across all our hospitals 03:43.480 --> 03:48.920 has been enabled by data-driven decisions, which are enabled by a platform called ThingsBoard, 03:48.920 --> 03:54.600 which is an Apache-licensed MQTT dashboard. That's deployed on top of NixOS, and we 03:54.600 --> 03:59.800 centralize power consumption information from all over the world on that platform. And for 03:59.800 --> 04:06.280 example, the solar panels here have cut the generators' diesel consumption by more than half. There's no grid. 04:06.280 --> 04:15.000 It's diesel, solar, or nothing. Next. So inside that hospital, you'll see kit like this. 04:16.680 --> 04:22.520 That's a ruggedized field kit. It's a kind of mini rack setup with a UPS and this kind of stuff. 04:23.080 --> 04:27.560 They're very, very rugged. I only know about one which broke, and the reason it broke is because 04:27.560 --> 04:34.280 a snake crawled inside the power supply. The snake died. The power supply died. Very bad. 04:35.160 --> 04:43.560 And then on top of that, we also deploy smaller pieces of kit. So this is a NUC. It's a small industrial 04:43.560 --> 04:50.360 Intel PC. They're very powerful for their size, and they fit in a backpack. So you can ship 04:50.360 --> 04:55.800 them very easily to wherever you have to go. And we use them, for example, when we want to isolate 04:55.800 --> 04:59.480 patient data, and you can also do things like use them for active-passive replication. 05:02.120 --> 05:07.800 This is an example of this inside the hospital. Here, for example, you would see 05:10.040 --> 05:16.120 nurses and health care workers walking around with tablets. Inside the therapeutic 05:16.120 --> 05:21.000 feeding centers, it's really important to keep track of who you've fed, what, and how.
05:21.800 --> 05:27.160 The tablets themselves are linked to an application for DHIS2, which is running on a pair of NUCs. 05:29.000 --> 05:36.120 And that's inside that. These are two of our favorite pictures in IT at MSF. On the left 05:36.120 --> 05:42.440 is Alex. He helped design the field network kit. On the right here is John. He's working on a vaccination 05:42.440 --> 05:48.600 campaign in RDC against measles, where a quarter of a million people every year are still infected 05:48.600 --> 05:55.640 and more than 5,000 people a year still die. This is the platform. So we have the platform and then 05:55.640 --> 06:00.520 we have the applications. You should recognize most of these logos, especially the first one. 06:01.640 --> 06:07.640 But you might be wondering what Ansible is doing here. And the answer is that we started using 06:07.640 --> 06:14.440 NixOS before sops-nix became available. So we use Ansible to keep our secrets encrypted. 06:15.400 --> 06:23.320 So those are the platform components. Then the applications that we deploy. DHIS2 is a public health 06:23.320 --> 06:29.720 management information system. You can use it to track patient data. You can also use it to do things 06:29.720 --> 06:35.720 like record information about epidemics and this kind of thing. Orthanc is a piece of software 06:35.720 --> 06:39.720 for managing X-ray imagery and this kind of thing. It's also open source, and it's 06:39.720 --> 06:46.440 Belgian. And then Bahmni is an electronic medical record system. It's used to run hospitals. And 06:46.440 --> 06:52.120 with that I'm going to introduce Shaw, an expert in Bahmni. He deployed medical record systems 06:52.120 --> 06:57.000 throughout Bangladesh before he worked for MSF. And then when he came to MSF, he started deploying 06:57.000 --> 07:03.480 DHIS2 all over the world without NixOS, which is what makes him ideal to explain just how much time 07:03.480 --> 07:13.480 it saves. Over to you.
Any time you do the mic once, it's okay. 07:13.480 --> 07:43.400 Okay, thank you. Good morning. So, why do we use NixOS? MSF is a big organization. 07:43.480 --> 07:52.920 So, we have to manage complex IT operations. And these complex IT operations need a system that 07:52.920 --> 08:01.160 is robust and resilient. And we wanted a system that can evolve quickly by itself with minimal 08:01.160 --> 08:10.520 maintenance. And we also wanted to be able to run the same system both for our field 08:10.520 --> 08:17.320 operation management and for headquarters operation management. So, we selected NixOS, 08:18.360 --> 08:25.080 as it supports declarative configuration management, as you all know, and 08:25.080 --> 08:33.560 infrastructure as code. So, now I will show how NixOS helps us to overcome many of our 08:33.560 --> 08:39.400 operational challenges. So, when we started this setup, we initially 08:39.400 --> 08:46.680 defined what we wanted from our platform, as everyone does. So, we wanted our 08:46.680 --> 08:55.000 configuration to be centrally managed, secure, and declaratively managed. 08:56.120 --> 09:02.040 Then we wanted changes to our configuration to be automatically tested 09:02.040 --> 09:11.480 and then deployed automatically. Then there are security patch updates. Then, 09:12.840 --> 09:18.520 operationally, we wanted to manage everything with very minimal maintenance 09:18.520 --> 09:26.520 effort. Even access control to the servers we wanted to manage automatically and declaratively. 09:27.080 --> 09:36.680 And as our servers are distributed across different places, we wanted easy access to our 09:36.680 --> 09:45.640 servers with minimal dependency on the network, and we wanted to deploy containerized applications 09:45.640 --> 09:52.520 to the NixOS servers.
So, we will see how, using NixOS, we have achieved the majority of 09:52.520 --> 09:59.640 the design goals that we set initially. So, let us go through a bit of history about our use of NixOS. 09:59.640 --> 10:08.200 In 2018, we first deployed our custom NixOS platform to manage a fleet of 10:08.200 --> 10:15.000 Linux servers, and since then we have written our machine definitions in Nix code, 10:15.880 --> 10:21.800 along with the application configuration and application secrets, and saved it all together in a 10:22.680 --> 10:30.040 GitHub repository. We can make all this possible because NixOS supports declarative 10:30.040 --> 10:38.120 configuration management. And then the central repository that we use is connected with all 10:38.120 --> 10:47.320 the servers we use in different areas. So, the servers run on a schedule, pull the 10:47.320 --> 10:56.680 Nix code, build it themselves, and get the configuration updates. So, here is a 10:56.680 --> 11:03.240 code snippet showing how we define our servers in Nix code. Everything about the server, 11:03.240 --> 11:10.040 the server time zone, other settings, even the disk partitions, boot mode, and the services we 11:10.040 --> 11:21.800 want to run inside the server, is all defined declaratively. So, here is an example. Here is a 11:23.160 --> 11:30.760 piece of configuration code. For example, to all those servers we want to 11:30.760 --> 11:39.880 distribute this content as configuration, and this part defines that, under the name of 11:39.880 --> 11:45.960 this file, this content will be distributed across all those servers. So, this is how we define our 11:45.960 --> 11:54.680 configuration in YAML. Similarly, we define our secrets in a YAML file, but that is encrypted 11:54.760 --> 12:03.720 with Ansible Vault. So, before moving any further, I just wanted to show you this picture. 12:03.720 --> 12:11.320 This is where our field servers run.
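As a rough illustration of the kind of declarative machine definition described above, the following is a minimal NixOS module sketch. It is not MSF's actual code: the host name, device labels, and the distributed file name are assumptions made up for this example. Only standard NixOS options are used.

```nix
# Hypothetical host definition; every concrete value here is illustrative.
{ config, pkgs, ... }:
{
  networking.hostName = "field-server-01"; # assumed name
  time.timeZone = "Asia/Dhaka";            # server time zone

  # Boot mode: UEFI via systemd-boot.
  boot.loader.systemd-boot.enable = true;
  boot.loader.efi.canTouchEfiVariables = true;

  # Disk layout, referenced by label (assumed labels).
  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "ext4";
  };
  fileSystems."/boot" = {
    device = "/dev/disk/by-label/boot";
    fsType = "vfat";
  };

  # Services that should run on the server.
  services.openssh.enable = true;

  # A file whose content is distributed to every server importing this module.
  environment.etc."example/settings.yml".text = ''
    site: example
    sync_interval_minutes: 15
  '';
}
```

Because the whole definition lives in one file under version control, a rebuilt or replaced server converges to the same state as the original.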
So, it is a picture of the Jamtoli 12:11.320 --> 12:17.880 refugee camp in Cox's Bazar, Bangladesh. It is a very remote area, where even the essentials needed to run a 12:17.960 --> 12:25.400 server, like power and internet supply, are very difficult. So, we need to 12:25.400 --> 12:34.040 run those servers under those extreme constraints, and we will see how, using NixOS, we overcome those 12:34.040 --> 12:42.680 challenges. So, we have everything in code, and this gives us significant advantages, right? 12:43.080 --> 12:57.480 So, we can take advantage of, for example, every change being tracked. Then, for 13:01.320 --> 13:08.200 the deployment, we can take advantage of Git automation for deployment and for testing, 13:09.000 --> 13:19.480 and we can run GitOps operations over it. So, now I am going to talk about how we regularly 13:20.360 --> 13:31.880 upgrade and get security patch updates for our servers. As our servers are connected to our central 13:31.960 --> 13:41.080 configuration repository, each one pulls the updates from that central repository, rebuilds itself, and 13:41.080 --> 13:50.200 upgrades itself. In our platform we also use Nix flakes, and every week 13:50.200 --> 13:59.160 we do a flake lock bump, which helps us to pull in the security updates from upstream. And when 13:59.240 --> 14:05.960 the NixOS version is upgraded, twice a year, we also do the platform upgrade twice a year. 14:06.600 --> 14:13.960 But we run our upgrades in three steps, or three waves: first wave, 14:13.960 --> 14:18.280 middle wave, and final wave. In the first wave we run the 14:19.240 --> 14:26.280 operation on our relays and dev machines, in the middle wave on UAT and test servers and low-SLA production 14:26.280 --> 14:33.640 servers, and in the final wave we run it on the servers hosting our mission- 14:33.640 --> 14:42.600 critical applications.
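The pull-based, wave-staggered upgrade described above can be approximated with the stock `system.autoUpgrade` module. This is a sketch under assumptions, not the platform's own mechanism: the talk describes a custom setup, and the `wave` variable, the schedule table, and the flake URL below are all hypothetical.

```nix
# Hypothetical sketch: stagger automatic upgrades by wave.
{ config, lib, ... }:
let
  # Assumed per-host setting: "first", "middle", or "final".
  wave = "first";

  # Assumed schedule (systemd calendar expressions), one slot per wave,
  # so problems surface on first-wave hosts before critical servers upgrade.
  schedule = {
    first = "Mon 02:00";
    middle = "Wed 02:00";
    final = "Fri 02:00";
  };
in
{
  system.autoUpgrade = {
    enable = true;
    # Placeholder flake reference for the central configuration repository.
    flake = "github:example-org/nixos-config";
    dates = schedule.${wave};
    # Spread load so hosts in one wave do not all pull at the same moment.
    randomizedDelaySec = "45min";
    allowReboot = false;
  };
}
```

The weekly lock bump itself would happen in the central repository, for example by running `nix flake update` and committing the resulting `flake.lock`; each host then picks up the new lock file at its next scheduled pull.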
So, this process gives us a lot of advantages. Why? Because we want to 14:42.680 --> 14:51.800 disrupt our operations in the field as little as possible, and if an application inherits any faults, we want 14:51.800 --> 15:00.520 them to surface in the first wave and not impact the final-wave servers, the servers running critical applications. 15:00.520 --> 15:09.960 So, this is where the wave starts: our first-wave 15:10.920 --> 15:20.360 nodes are just next to my desk, and the final wave ends at the field servers where 15:20.360 --> 15:25.720 our mission-critical applications run. As you see, the doctor is 15:25.720 --> 15:33.560 preparing for surgery, and we want him to be disturbed as little as possible. So, that is the ultimate goal 15:33.560 --> 15:39.320 of everything we do. We do not want to show off technical excellence; we want to 15:39.400 --> 15:50.120 disturb our field operations as little as possible. So, how do we test our Nix code? Similarly, 15:50.120 --> 15:56.520 [unclear]. As everything is in code, for every change we make in our NixOS 15:56.520 --> 16:03.480 configuration, we run the build tests, validation tests, and integration tests. 16:03.560 --> 16:09.640 Recently we started to use VM-based tests, and we want to run our tests in an environment as similar to 16:09.640 --> 16:15.480 production as possible. The goal is ultimately the same: we want no 16:15.480 --> 16:23.160 surprises after deployment. We also manage our code using a staging branch and a main branch. 16:24.520 --> 16:31.080 In particular, the major critical changes in the Nix code progress through the 16:31.080 --> 16:38.840 staging branch, and the regular operational changes, like changes in configuration which are less 16:38.840 --> 16:47.160 impactful, progress through the main branch. So, we follow this Git workflow to decouple 16:47.160 --> 16:56.600 the development from the operation.
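To make the VM-based testing mentioned above more concrete, here is a minimal sketch using the NixOS test framework from nixpkgs. The test name, the node name, and the choice of checking SSH are assumptions for illustration; the talk does not say what the real tests assert.

```nix
# Hypothetical VM test: boot a machine and check that SSH comes up.
{ pkgs ? import <nixpkgs> { } }:
pkgs.testers.runNixOSTest {
  name = "ssh-reachable";

  # One virtual machine, configured like a (very small) server.
  nodes.server = { ... }: {
    services.openssh.enable = true;
  };

  # Python test script executed by the NixOS test driver.
  testScript = ''
    server.start()
    server.wait_for_unit("sshd.service")
    server.wait_for_open_port(22)
  '';
}
```

Saved as, say, `test.nix`, this can be built with `nix-build test.nix`, which boots the VM and fails the build if the script fails. In a CI pipeline, importing the same modules that production hosts use keeps the test environment close to the real one.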
Because, for example, when I am making some critical changes 16:56.680 --> 17:02.360 while a colleague needs to deploy some small configuration changes, 17:02.360 --> 17:06.360 I do not want to block him. Because my changes would be more 17:06.360 --> 17:11.080 impactful, they need to be tested, so they go through the staging 17:11.080 --> 17:20.760 branch, and the normal changes go through the main branch. So, how do we manage remote access 17:20.840 --> 17:26.840 to our servers? We manage remote access to our servers using relays. Even this simple task 17:26.840 --> 17:32.040 is not simple for us, because the servers are distributed across many places, not in a single 17:32.040 --> 17:37.400 network but in different networks. Some are inside a field network, some are [unclear], 17:37.400 --> 17:45.320 some are, for example, in a hybrid cloud, some are in a cloud. So, with 17:45.400 --> 17:52.760 relays, we overcome these challenges, we become less dependent on the network, and we can ensure 17:52.760 --> 18:03.400 access to our servers. So, this is how we declaratively manage user access control. 18:05.160 --> 18:11.720 In a simple file we define the users, the user roles, and the roles' access to the servers. 18:11.800 --> 18:17.240 That is transformed by Nix code: the Nix code transforms the 18:17.240 --> 18:27.080 JSON file into Nix options. Then, when it gets built, we get the users inside our 18:27.080 --> 18:33.640 servers. Which means we can also manage our access control declaratively. 18:33.640 --> 18:43.640 This gives us a couple of advantages. First of all, we can prevent configuration 18:43.640 --> 18:53.240 drift, and then other applications, like IAM, can also use the same 18:53.240 --> 18:58.920 definition, which is defined declaratively.
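The JSON-to-Nix-options transformation described above might look roughly like the sketch below. The file name `users.json` and its schema (a `users` map with `role` and `keys`, and a `roles` map with a `sudo` flag) are assumptions invented for this example; the real file format is not shown in the talk.

```nix
# Hypothetical sketch: derive NixOS user accounts from a JSON definition.
#
# Assumed users.json shape:
# {
#   "roles": { "admin": { "sudo": true }, "operator": { "sudo": false } },
#   "users": {
#     "alice": { "role": "admin",    "keys": [ "ssh-ed25519 AAAA... alice" ] },
#     "bob":   { "role": "operator", "keys": [ "ssh-ed25519 AAAA... bob" ] }
#   }
# }
{ lib, ... }:
let
  access = builtins.fromJSON (builtins.readFile ./users.json);

  # Turn one JSON user entry into a NixOS user definition.
  mkUser = name: user: {
    isNormalUser = true;
    # Only roles flagged with "sudo": true get wheel membership.
    extraGroups = lib.optional (access.roles.${user.role}.sudo or false) "wheel";
    openssh.authorizedKeys.keys = user.keys;
  };
in
{
  users.users = lib.mapAttrs mkUser access.users;
}
```

Since the JSON file is plain data, other tools can read the same definition without knowing anything about Nix, which matches the point about reusing it for other applications.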
This is how we define our containerized applications 18:59.240 --> 19:06.600 on the NixOS servers. We declare the application in a Docker container. We intentionally chose 19:06.600 --> 19:13.080 this so that, for example, even if I do not have any experience with Nix or Nix code, and I know 19:13.080 --> 19:18.680 nothing about it, I can still deploy my application on those NixOS 19:19.560 --> 19:25.560 servers, as the deployment process is automated. 19:28.840 --> 19:35.080 So, this is how we define our deployment service. As you can see here, I just need to say 19:35.080 --> 19:42.440 which application I want to deploy. So, I just write the repo name, then the branch name, 19:42.520 --> 19:50.840 and then the target machine where I want to enable this option. 19:50.840 --> 19:57.960 That is it. Then the automated deployment just checks out the repository and then 19:57.960 --> 20:08.760 deploys it to the correct target machine. And then, after deployment, our application 20:08.760 --> 20:19.240 is ready to be used. How do we manage installation on a new machine? We basically use 20:19.240 --> 20:24.920 nixos-anywhere and disko. We compose these two into an installation script; as long as the 20:24.920 --> 20:34.360 server is reachable by SSH, we can install NixOS on it. We encrypt our data partition 20:34.360 --> 20:42.040 with LUKS, because our sensitive patient files are inside it. So, now I am 20:42.040 --> 20:51.240 handing over to Ramses, who is basically the main architect behind this system, and 20:51.240 --> 20:54.520 he will talk more about our next improvement plans. 20:54.520 --> 20:56.520 Yeah, go ahead. 21:01.320 --> 21:06.520 Yeah, maybe just very quickly because we are kind of running out of time. So, there are a couple 21:06.520 --> 21:12.360 of improvements that we see in the system.
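For readers unfamiliar with running containers declaratively on NixOS, the sketch below uses the stock `virtualisation.oci-containers` module. It is a stand-in, not the deployment service from the talk, which is a custom module driven by a repository name, branch name, and target machine. The container name, image, port, and paths are hypothetical.

```nix
# Hypothetical sketch: one Docker container managed as a systemd service.
{ ... }:
{
  virtualisation.oci-containers = {
    backend = "docker";

    containers.example-app = {
      # Placeholder image reference.
      image = "registry.example.org/example-app:1.0";
      # host:container port mapping.
      ports = [ "8080:8080" ];
      # Persist application data on the host (assumed path).
      volumes = [ "/var/lib/example-app:/data" ];
      # Secrets are supplied at runtime, not stored in the Nix store (assumed path).
      environmentFiles = [ "/run/secrets/example-app.env" ];
    };
  };
}
```

With a definition like this, application teams only touch the image and its settings, while the host's base system continues to be handled by the platform's regular upgrade process.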
So, as we mentioned, we started in 2018. So, the whole 21:12.360 --> 21:18.120 ecosystem was a lot smaller back then. Things like sops-nix and such didn't exist. 21:18.120 --> 21:22.680 nixos-anywhere didn't exist. So, it is a bit weird, maybe, that we are using Ansible Vault for 21:22.760 --> 21:27.800 secrets, but that is because we kind of had to roll our own system at the time. So, 21:27.800 --> 21:35.080 some day, we should improve that. That is also linked with the whole encryption key thing because there was, 21:35.080 --> 21:41.400 yeah. It is a bit weird that we use the same keys for both secrets and SSH. There are a couple 21:41.400 --> 21:46.360 of things that we should probably migrate to, like the systemd initrd. We tried it once. 21:46.440 --> 21:52.120 It didn't work. We should probably try it again. We would really like to use 21:52.120 --> 21:57.240 verified boot and measured boot to secure our servers in the field, which we didn't 21:57.240 --> 22:01.960 manage so far. I think there have been quite a few advancements in the ecosystem so far, 22:01.960 --> 22:09.800 but it is still not the most straightforward thing to do. We have been working on VM tests, 22:10.680 --> 22:17.000 which are very helpful to avoid issues. And then finally, one thing that we still do is we 22:17.000 --> 22:22.680 eval and build everything on the actual machines, which has a couple of advantages for us, 22:23.880 --> 22:25.720 but eventually we probably want to move away from that. 22:29.080 --> 22:33.480 Okay, I'm going to skip very quickly past our next slides because I think we are completely out of time. 22:33.480 --> 22:39.160 So, I just want to say, first of all, we have had some challenges, but some of the books have 22:39.160 --> 22:44.760 been helpful. I want to say, first of all, NixOS is an amazing technology and it's a good 22:44.760 --> 22:50.520 community.
You all deserve a big round of applause for the products that you've managed to create here. 22:50.520 --> 22:55.000 It's really a really good technology. If you have to choose one thing, and I'm going to duck here, 22:55.000 --> 23:02.120 it's: stabilize flakes. Yes, that's, yeah. And I'd also like to give a big thanks to Ramses and 23:02.200 --> 23:07.400 Numtide for their help. Ramses was the original architect of the system. We're 23:07.400 --> 23:12.440 operationally independent there, but he's been great for helping us with our evolution. And that's it.