WEBVTT 00:00.000 --> 00:10.000 All right, we're going to start with the zero touch HPC talk. 00:10.000 --> 00:11.000 Thank you. 00:11.000 --> 00:12.000 Hello, everyone. 00:12.000 --> 00:18.000 I'm Ümit and I'm here with Leon, and we are here to talk about our journey to zero touch HPC. 00:18.000 --> 00:27.000 We manage scientific computing for a diverse group of researchers, and as our scale grew, 00:27.000 --> 00:32.000 our old ways of managing the infrastructure by hand didn't really keep up. 00:32.000 --> 00:37.000 In 2020, we presented our work on running OpenStack on top of, sorry, 00:37.000 --> 00:40.000 HPC on top of OpenStack here at FOSDEM. 00:40.000 --> 00:45.000 And while the payload, the Slurm cluster, was automated using OpenStack's 00:45.000 --> 00:50.000 orchestration engine Heat and Ansible, the underlying infrastructure was a different story. 00:50.000 --> 00:53.000 And as the infrastructure side was managed manually by hand, 00:53.000 --> 00:58.000 the knowledge about that infrastructure was just in the heads of a few individuals. 00:58.000 --> 01:03.000 And there was no reproducible way of rebuilding the underlying infrastructure. 01:03.000 --> 01:05.000 This is the heart of our talk. 01:05.000 --> 01:12.000 The journey takes us from the HPC hardware racked into our data center to running jobs on a Slurm cluster, 01:12.000 --> 01:18.000 ideally with no human intervention, or just a little human intervention. 01:18.000 --> 01:26.000 Our first step in this journey starts with the hardware: once we receive the hardware, it goes into NetBox. 01:26.000 --> 01:31.000 NetBox is a popular open source data center infrastructure management, or DCIM for short, 01:31.000 --> 01:37.000 system. It can help track and manage both the physical layer of a data center, 01:37.000 --> 01:43.000 such as data center floors, racks, PDUs, patch panels, as well as the physical devices, 01:43.000 --> 01:47.000 such as storage systems, servers, switches, and so on. 01:47.000 --> 01:53.000 In addition, it provides IPAM functionality, as well as the ability to manage all kinds of connections, 01:53.000 --> 01:59.000 ranging from simple power cables to network connections to circuits, aka internet uplinks. 01:59.000 --> 02:05.000 We currently track over 300 servers across 100 racks and 2,500 cables. 02:05.000 --> 02:10.000 If it's not in NetBox, it basically doesn't exist in our infrastructure. 02:10.000 --> 02:16.000 NetBox also provides custom scripts that we use for all kinds of automation, 02:16.000 --> 02:25.000 such as drift detection and import and export functionality, and programmatic access is provided by a GraphQL and a REST API. 02:25.000 --> 02:30.000 Before NetBox, we did our rack management in Excel; you can see an example on the left side. 02:30.000 --> 02:33.000 On the right side, you see the same racks in NetBox. 02:33.000 --> 02:39.000 The nice thing is you can click on any of these servers, and you will get a detailed view of the server with the network cards, 02:39.000 --> 02:45.000 the top-of-rack switch to which they are cabled, and any assigned IP addresses. 02:45.000 --> 02:48.000 So, NetBox doesn't live in a vacuum. 02:48.000 --> 02:54.000 We use custom drift checks and scripts to ensure data consistency across all of our vendor appliances, 02:54.000 --> 03:00.000 like, for example, Lenovo XClarity, Dell OpenManage, and our IPAM system, which is Infoblox.
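As a rough illustration of this kind of automation, here is a minimal sketch of what a NetBox custom drift-check script could look like in Python. It is not the actual script from the talk: the appliance export file and its format are assumptions for the example.

```python
# Hypothetical NetBox custom script: compare device serial numbers in NetBox
# against serials exported from a vendor appliance (illustrative sketch only).
import json

from dcim.models import Device
from extras.scripts import Script


class SerialDriftCheck(Script):
    class Meta:
        name = "Serial number drift check"
        description = "Verify NetBox serials against a vendor appliance export"

    def run(self, data, commit):
        # Assumed format: {"hostname": "serial", ...}, e.g. exported from XClarity.
        with open("/opt/netbox/appliance_serials.json") as fh:
            appliance = json.load(fh)

        for device in Device.objects.filter(name__in=appliance.keys()):
            expected = appliance[device.name]
            if device.serial != expected:
                self.log_failure(
                    f"{device.name}: NetBox has '{device.serial}', appliance reports '{expected}'"
                )
            else:
                self.log_success(f"{device.name}: serial matches")
```

Because a script like this only reads data, it can be run regularly as a report-style check; the real checks described next also cover MAC addresses and LLDP neighbours.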
03:00.000 --> 03:07.000 These scripts automatically verify serial numbers, MAC addresses, and the LLDP neighbour information from our software-defined network, 03:07.000 --> 03:12.000 to ensure that the physical reality matches the reality that we have in NetBox. 03:12.000 --> 03:16.000 We also use them to import data from those appliances. 03:16.000 --> 03:24.000 For example, we can import MAC addresses for all the network interfaces of all the servers that these appliances manage, 03:24.000 --> 03:35.000 and then later we can use those MAC addresses together with IP addresses to provision the DHCP reservations in our IPAM solution, using OpenTofu for example. 03:36.000 --> 03:45.000 Once everything is stored and linked together in NetBox, it provides end-to-end visibility and allows us to drill down from different perspectives. 03:45.000 --> 03:52.000 So, for example, I can search for an IP address, and I see it is assigned to a certain interface on a compute node, 03:52.000 --> 03:57.000 and this compute node is cabled through two patch panels to this top-of-rack switch. 03:57.000 --> 04:05.000 And on the right side, you can see the visual cable trace for our border gateway that is connected to our internet uplink via two patch panels. 04:05.000 --> 04:08.000 NetBox can also be extended with plugins. 04:08.000 --> 04:09.000 We use two plugins. 04:09.000 --> 04:11.000 One is the floor plan plugin. 04:11.000 --> 04:13.000 As the name implies, you can draw floor plans. 04:13.000 --> 04:19.000 On the left side, as you see, is the floor plan manually drawn for a data center room, and on the right side, you see the same 04:19.000 --> 04:24.000 in NetBox. Again, you can click on the racks, and you get to the detailed rack view. 04:25.000 --> 04:34.000 Another plugin is the topology plugin. It basically allows you to have live, self-updating documentation of your logical network connections in the data center. 04:34.000 --> 04:40.000 So, for example, you can see how your core switch is connected to the border gateway and to the internet uplink. 04:40.000 --> 04:49.000 And if you change any cable or connection in NetBox, this will get reflected in the interactive visualization on the right side, 04:49.000 --> 04:54.000 while in the old way you would have to update the diagram on the left side. 04:54.000 --> 05:01.000 And this is a final example of our access layer in the campus network. 05:01.000 --> 05:06.000 All right, so at this point, we have basically established NetBox as our source of truth, 05:06.000 --> 05:09.000 and we will later use this to provision hosts. 05:09.000 --> 05:13.000 But another thing that we need is an actual operating system to run on these nodes. 05:13.000 --> 05:24.000 So, in our old system, we would run a base operating system and then execute long-running Ansible playbooks against the nodes, and we would do so manually. 05:24.000 --> 05:33.000 And apart from being slow, this would also introduce configuration drift easily, and it made rollbacks basically infeasible. 05:33.000 --> 05:44.000 So, in the new system, we want to be able to build an image, customize it to our needs, and then deploy it without having to run any steps manually. 05:44.000 --> 05:55.000 And this approach has several advantages. To begin with, we know that with all images that we build this way, all nodes that are running on this image are going to have the same initial state.
05:55.000 --> 06:02.000 Secondly, provisioning should also be a lot faster, because there are actually a lot of customization steps where we know 06:02.000 --> 06:10.000 we need to run them on all the nodes anyway, and so we can do this during image build time instead of during node setup later. 06:10.000 --> 06:17.000 Also, of course, we have all the customizations saved as code, and the image is going to be reproducible. 06:17.000 --> 06:25.000 And then the images kind of serve as versioned artifacts that we can reference, and because they are reproducible, and because of the other advantages, 06:26.000 --> 06:29.000 rollbacks should be at least less cumbersome. 06:29.000 --> 06:33.000 So, the question is, how can we implement this? 06:33.000 --> 06:40.000 What we would like to be able to do is to take a base upstream operating system that we want to customize, 06:40.000 --> 06:49.000 then our image build tool chain should be able to pick up on the distribution and the version that we are working on, perform customizations depending on that, 06:49.000 --> 06:53.000 and also be able to add any other customizations that we need. 06:53.000 --> 06:59.000 And after this, it should be able to push the image to the various endpoints that we are going to use. 06:59.000 --> 07:06.000 And the tool that we have picked for this is called Packer, and we chose it for several reasons. 07:06.000 --> 07:10.000 One reason being, it supports many platforms that we can use out of the box. 07:10.000 --> 07:18.000 Then, pleasantly for us, it has a configuration syntax that is very similar to OpenTofu, which we already use anyway. 07:18.000 --> 07:24.000 And most importantly, it supports many tools that we can use to customize the image that we are working on. 07:24.000 --> 07:30.000 So, we have a short example here of how such a Packer configuration template might look. 07:30.000 --> 07:35.000 At the very top, we define the plugins that we are going to use, and we can pin the versions. 07:35.000 --> 07:44.000 Then, the first interesting blocks, maybe, are these source blocks, and these basically describe where the image comes from that we are going to work on, 07:44.000 --> 07:52.000 and the builder that is going to configure it; here we have Docker, but it could also be QEMU or many other things. 07:52.000 --> 08:00.000 Then, the most interesting block is this build block here, and this is basically what defines the actual customizations that we are doing on the image. 08:00.000 --> 08:10.000 And here, we first have this sources attribute, which basically defines which of the sources that we defined earlier the customizations should run on, because there can be many of them. 08:10.000 --> 08:18.000 And then, you can have an arbitrary number of provisioner blocks. The provisioner block is what actually performs the customizations. 08:18.000 --> 08:27.000 So, here we have a very simple example of an Ansible provisioner that just runs a playbook without any further configuration; of course, we have many more options there. 08:27.000 --> 08:36.000 And there are also other provisioners available, like the shell provisioner that can run arbitrary shell commands, or the file provisioner that can copy files into the image. 08:36.000 --> 08:43.000 Once all of these provisioner blocks have finished executing, we can then still run post-processor blocks. 08:43.000 --> 08:55.000 And these are used to manage the artifacts that we produce, to produce a manifest file, or, in our case here, to again run arbitrary shell commands.
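To give a concrete feel for the artifact-handling side, here is a minimal Python sketch of one step such a final stage could trigger: uploading the finished image to the OpenStack image service with openstacksdk. This only illustrates the "push the image to the endpoints" idea; the cloud name, image name, and file path are assumptions, not the speakers' actual pipeline.

```python
# Illustrative sketch: upload a freshly built image to the OpenStack image
# service (Glance) so it can later be deployed to nodes. The cloud name,
# image name, and file path are assumptions for the example.
import openstack

conn = openstack.connect(cloud="hpc")  # credentials come from clouds.yaml

image = conn.image.create_image(
    name="hpc-node-rocky9-2024-06",    # hypothetical versioned artifact name
    filename="output/hpc-node.qcow2",  # artifact produced by the build
    disk_format="qcow2",
    container_format="bare",
    visibility="private",
)
print(f"Uploaded image {image.name} with id {image.id}")
```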
08:55.000 --> 09:02.000 Now I want to give you a quick overview of how we have implemented this image building pipeline at our site. 09:02.000 --> 09:18.000 So at the heart of this, we have a GitHub workflow that can either be called from the base repository where we keep this Packer code, or from a remote repository that can then basically contain extra Ansible files that will also be used to customize the image. 09:18.000 --> 09:24.000 Packer will then look for variables that describe the distribution and version that we are working on. 09:24.000 --> 09:31.000 And it will then spin up a QEMU VM using the respective Packer source. 09:31.000 --> 09:37.000 We can also do a manual invocation, using what you see on the right here, if we want to override this for some reason. 09:37.000 --> 09:44.000 Packer will then move on to run the Ansible playbooks, and we have multiple steps here. 09:44.000 --> 09:52.000 First, we have a range of common customizations that we know we need in any case, and we perform them without any conditions. 09:52.000 --> 09:59.000 Then we have these distribution- and version-specific playbooks that we will then run if they are needed. 09:59.000 --> 10:04.000 And then, most interestingly, we have this remote entry point. 10:04.000 --> 10:13.000 So if we call this from a remote repository and this repository has a directory called Packer, we can then include any files that are relevant to Ansible there. 10:13.000 --> 10:19.000 Packer will pick up on them and then also perform the things that we define there. 10:19.000 --> 10:27.000 At this point we are nearly finished, and we just need to make sure that once we boot this image up, it's going to look like a fresh Linux installation. 10:27.000 --> 10:33.000 So we clear out some logs, we set the machine-id to uninitialized, things like this. 10:33.000 --> 10:42.000 And then the image is actually finished; we just need to make sure that it is tagged properly and then upload it to wherever we need it. 10:42.000 --> 10:54.000 And now this all sounds very great, but actually this still has a kind of problem, and that is that not all of these customizations can or should run during image build time. 10:54.000 --> 11:03.000 For example, tasks that involve secrets, or any tasks that depend on the role of the node in the Slurm cluster, for example. 11:03.000 --> 11:14.000 When we then inspected our existing Ansible code, we noticed that most of it is held in Ansible roles, and many of these roles have tasks that fall into this category. 11:14.000 --> 11:20.000 But there are also many tasks that we could apply to all the nodes unconditionally. 11:20.000 --> 11:26.000 So our solution to this was to split the Ansible roles into two parts: install and configure. 11:26.000 --> 11:32.000 And the install tasks are the tasks that we perform during image build time. 11:32.000 --> 11:40.000 Basically, these are all the tasks that are common to all nodes: installing most packages, setting up some directory structures, things like this. 11:40.000 --> 11:53.000 And the configure tasks of the roles we are going to run when the node first boots, and these are the tasks that are done to specialize the node towards its exact role, basically. 11:53.000 --> 11:59.000 These are also the tasks that involve secrets, things that we don't want to have in the image, basically. 11:59.000 --> 12:05.000 And at the top you see an example of this, where we include the role as usual, but we only include the install tasks here.
12:05.000 --> 12:11.000 So this is the simplest example of what you would also see in our image build pipeline. 12:12.000 --> 12:26.000 Thank you. So now that we have our custom image, how do we deploy it? We use two infrastructure-as-code tools to declaratively define and deploy our HPC system: OpenTofu, which is a fork of Terraform, and Terragrunt. 12:26.000 --> 12:33.000 We won't go too much into technical details, but I want to highlight some of the ways we use these two tools. 12:33.000 --> 12:41.000 We define reusable infrastructure components in OpenTofu and store them in an infrastructure module catalog, which is basically just a Git repo. 12:41.000 --> 12:52.000 These are basic building blocks in a component library that allow downstream users to define all kinds of infrastructure, not only for the HPC use case. 12:52.000 --> 12:59.000 And these modules encapsulate common infrastructure pieces such as networks and clusters. 12:59.000 --> 13:08.000 For example, the network module will create a network, an IPv4 subnet, an optional IPv6 subnet, and a router using the OpenStack provider. 13:08.000 --> 13:15.000 Although OpenTofu modules provide a way to abstract these common infrastructure pieces, 13:15.000 --> 13:22.000 it doesn't really provide a DRY way to customize and parameterize them across different environments, 13:22.000 --> 13:25.000 for example dev, staging, or production. 13:25.000 --> 13:30.000 For this reason we use Terragrunt, which is a wrapper around OpenTofu/Terraform. 13:30.000 --> 13:39.000 It's basically a code generator that fills this niche by allowing you to define units and stacks that reference the corresponding OpenTofu modules. 13:39.000 --> 13:42.000 And this makes it possible to keep the code DRY. 13:42.000 --> 13:53.000 So here you can see our HPC Terragrunt Git repo, where we define virtual and bare metal HPC clusters across three environments. 13:53.000 --> 14:06.000 And the Terragrunt stack HCL file references this OpenTofu cluster module and allows us to customize the flavor, the number of nodes, and the OpenStack cloud 14:06.000 --> 14:10.000 where the HPC Slurm cluster should be set up. 14:10.000 --> 14:19.000 Now that we have explained which orchestration tool we use to deploy the HPC cluster, let us also briefly explain how it is actually done. 14:19.000 --> 14:24.000 For that we use the OpenStack bare metal service Ironic, together with NetBox. 14:24.000 --> 14:31.000 So in NetBox we define custom export templates that provide the YAML that Ironic expects to onboard the bare metal nodes. 14:31.000 --> 14:35.000 Then Ironic manages the power state and can deploy our custom image, 14:35.000 --> 14:45.000 using the various drivers that it supports, ranging from legacy PXE and TFTP methods all the way to modern Redfish and virtual media. 14:45.000 --> 14:51.000 This is an example of this YAML for one of our four-way blade servers. 14:51.000 --> 14:56.000 Because Ironic can manage chassis and node assignments, and we model this information in NetBox, 14:56.000 --> 15:01.000 we can just include it in the YAML. So you see at the top there's a chassis definition with the four nodes. 15:01.000 --> 15:10.000 There's one node defined with the BMC credentials and some meta information that is used to connect the flavor to the bare metal node. 15:10.000 --> 15:19.000 And then there is information about which top-of-rack switch and which port this compute node is connected to, because our OpenStack integrates with our SDN.
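To make the onboarding step a bit more tangible, here is a minimal Python sketch of enrolling one such node into Ironic with openstacksdk, starting from data like that in the NetBox export. The cloud name, field names, and export format are assumptions for the example; the real export templates and tooling are site-specific.

```python
# Illustrative sketch: enroll a bare metal node in Ironic from data exported
# out of NetBox. Field names and the export format are assumptions; the real
# export template and onboarding tooling differ per site.
import openstack
import yaml

conn = openstack.connect(cloud="hpc")  # hypothetical clouds.yaml entry

# Assume NetBox's export template produced something like this YAML document.
with open("exports/compute-0101.yaml") as fh:
    node_def = yaml.safe_load(fh)

node = conn.baremetal.create_node(
    name=node_def["name"],
    driver="redfish",
    driver_info={
        "redfish_address": node_def["bmc_address"],
        "redfish_username": node_def["bmc_username"],
        "redfish_password": node_def["bmc_password"],
    },
    resource_class=node_def["flavor"],  # ties the flavor to the bare metal node
)

# Register the NIC so the networking service can wire the node correctly.
conn.baremetal.create_port(node_id=node.id, address=node_def["mac_address"])

# Move the node through Ironic's states until it is available for deployment.
conn.baremetal.set_node_provision_state(node, "manage")
conn.baremetal.set_node_provision_state(node, "provide")
```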
15:19.000 --> 15:29.000 That way, the networking service in OpenStack can provide connectivity during deployment and operation. 15:29.000 --> 15:36.000 Alright, so at this point we have built an image and we have also spun up the nodes, but all these nodes are currently in the same state, 15:36.000 --> 15:42.000 the exact state that we defined in the image, and now we need to figure out how we specialize them. 15:42.000 --> 15:48.000 And the problem is: how can we do it without having to do anything manually? 15:48.000 --> 15:54.000 So what we have introduced here is what we call the Ansible initialization system, or ansible-init for short. 15:54.000 --> 15:58.000 Props for this go to StackHPC; they have given us the initial idea for this. 15:58.000 --> 16:04.000 So basically it's very simple: this is just a systemd service that will run on the node's very first boot. 16:04.000 --> 16:12.000 It looks for a sentinel file, and if it doesn't exist, it then goes on to run an arbitrary number of Ansible playbooks. 16:12.000 --> 16:18.000 And what it really needs to do is run these configure tasks that I mentioned earlier. 16:18.000 --> 16:27.000 So this is what we need to really specialize the nodes in the way that we need, and also apply tasks that depend on secrets. 16:27.000 --> 16:33.000 So one question here is how we can specify which roles should be run by this service. 16:33.000 --> 16:41.000 And we have two mechanisms for this: either it finds an inventory directly, and then it just uses this, 16:41.000 --> 16:47.000 or it uses metadata that it can get either from a config drive or from the cloud metadata endpoint. 16:47.000 --> 16:53.000 And then it sees from the metadata that it should run these certain roles, and it does so. 16:53.000 --> 17:01.000 And if there are no problems, it will then write the sentinel file, and on subsequent boots this service will not run anymore, 17:01.000 --> 17:03.000 so that we are not doing this needlessly. 17:03.000 --> 17:09.000 To be noted here is maybe that all of the playbooks that we use to customize these nodes 17:09.000 --> 17:16.000 we already bake into the image when we build it, so we don't need to rely on the ansible-pull mechanism or something like that to do this. 17:16.000 --> 17:24.000 So now what I still want to quickly touch on is how we manage secrets injection into these nodes. 17:24.000 --> 17:28.000 And there are basically two thoughts here. 17:28.000 --> 17:34.000 First, we don't want to have any long-lived credentials for the nodes to access any vault instance. 17:34.000 --> 17:40.000 And secondly, obviously, we don't want to have any secrets baked into the image directly. 17:40.000 --> 17:51.000 What we would like to have is that each node, upon provisioning, gets a credential that it can use for a short time to access the secrets that it needs to configure itself. 17:51.000 --> 17:57.000 And then after a short time, like one hour or so, this credential should no longer be valid. 17:57.000 --> 18:09.000 So what we are doing is that we replicate a subset of our big organizational vault into a vault with the secrets that the nodes need to configure themselves. 18:10.000 --> 18:23.000 And upon provisioning, for each node we create a unique credential that this node can then use to access this smaller vault that has the node secrets. 18:23.000 --> 18:27.000 And that is basically the system that we're using here.
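As a rough illustration of this pattern, here is a minimal Python sketch using the hvac client for HashiCorp Vault: at provisioning time a short-lived, use-limited token is minted for one node, and the node then uses it on first boot to read its secrets. The talk does not name the exact tooling; the Vault address, policy name, mount point, and paths are assumptions.

```python
# Illustrative sketch of per-node, short-lived Vault credentials using hvac.
# Policy name, secret paths, and addresses are assumptions for the example.
import hvac

# --- Provisioning side: mint a short-lived token scoped to node secrets ---
admin = hvac.Client(url="https://vault.example.org", token="ADMIN_TOKEN")
resp = admin.auth.token.create(
    policies=["node-config"],  # hypothetical policy allowing reads of node secrets
    ttl="1h",                  # credential expires after one hour
    num_uses=10,               # and after a handful of requests
    display_name="node-c0101",
)
node_token = resp["auth"]["client_token"]
# node_token is then handed to the node, e.g. via config drive or metadata.

# --- Node side (first boot): read the secrets needed for configuration ---
node = hvac.Client(url="https://vault.example.org", token=node_token)
secret = node.secrets.kv.v2.read_secret_version(
    path="nodes/c0101", mount_point="hpc"
)
print(secret["data"]["data"])  # the actual key/value pairs
```

Because the token carries both a TTL and a use limit, it becomes useless shortly after the node has configured itself, which matches the "no long-lived credentials" goal described in the talk.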
18:27.000 --> 18:38.000 So with all of these methods combined, if everything is correct, you should then be able to spin up a cluster without having to intervene manually. 18:38.000 --> 18:48.000 And what remains for me is to thank the HPC team at the Vienna BioCenter, and apart from that, if there are any questions, we're happy to discuss them with you. 18:48.000 --> 18:51.000 And thank you for coming to the talk. 18:51.000 --> 19:17.000 So the question is if we are aware of Warewulf, and why we went with this instead of Warewulf. 19:17.000 --> 19:27.000 And we are aware of Warewulf, but our choice is a bit historically based. 19:27.000 --> 19:43.000 We wanted to use OpenStack for more of a software-defined data center, and because we already had this in place and we have the experience or expertise, we decided to use it also for the bare metal, or for the HPC, part. 19:43.000 --> 20:10.000 So the question is if we redeploy the nodes whenever we have a configuration change. It's a good question; we haven't really decided. We now can basically have immutable node images, where we just build a new image, redeploy the nodes, and everything works. 20:10.000 --> 20:18.000 I think it's kind of a question of trade-offs: if it's a simple configuration change, you could roll it out with Ansible. 20:18.000 --> 20:27.000 But then you have to make sure that the image also contains it. We haven't really thought about this completely, and we haven't decided on it yet, but it is a valid point. 20:41.000 --> 20:53.000 Yes, so the question is why we don't use the automatic enrollment method of Ironic, to first enroll the nodes in Ironic and then import them into NetBox. 20:53.000 --> 20:56.000 It's a valid approach; the only problem is that 20:56.000 --> 21:15.000 because of our SDN network it's not that easy, and we anyway need to put them in NetBox and at least have the MAC addresses of the BMCs to create the DHCP reservations so that they get IP addresses. There are ways around this. 21:15.000 --> 21:19.000 That would also be a valid approach, but we... 21:19.000 --> 21:32.000 Whenever we get hardware, the first thing is we onboard it in NetBox, so we have it documented, also for the cabling, and so we use it as the source of truth for everything. 21:32.000 --> 21:37.000 I don't know which came first, but... 21:37.000 --> 21:41.000 So, the question is whether we have a multi-tenant setup in OpenStack. 21:41.000 --> 22:03.000 Basically we are the only tenant, so you could argue maybe it's overkill to use OpenStack, but we use it because of the integration with our SDN, and so we don't really support multi-tenancy, because we as the HPC team are using it alone, kind of. 22:03.000 --> 22:07.000 So, the question is if we are doing any changes to the ACI 22:07.000 --> 22:12.000 during the provisioning. So we also drive some of the Cisco ACI 22:12.000 --> 22:28.000 fabric-side configuration for the bare metal nodes through NetBox using Tofu, and also, because OpenStack's Neutron integrates with the Cisco SDN, during deployment the node gets plugged into a provisioning network. 22:28.000 --> 22:40.000 And then we plug it into the tenant network, so it's fully integrated. 22:40.000 --> 22:46.000 So the question is if the templates are available on GitHub, and which templates exactly. 22:46.000 --> 22:48.000 We have them internally. 22:49.000 --> 22:53.000 They are not really polished yet; eventually we would like to 22:53.000 --> 22:57.000 maybe make them public. The only problem is they are a bit specific to our
22:57.000 --> 23:13.000 system with the Cisco SDN, because of the contracts that you have to consume between endpoints, so it might not be really usable for other people, but we can publish it as a reference, basically, yes. 23:13.000 --> 23:22.000 Yeah, yeah, it's a good question. I mean, you could even have Tofu do changes in NetBox. 23:22.000 --> 23:26.000 Sorry, the question is how we handle changes in NetBox. 23:26.000 --> 23:42.000 Currently you change it in NetBox. You could theoretically drive NetBox through Tofu, but then you have multiple states: you have a state in OpenTofu for NetBox and then the one in NetBox itself. I mean, yes, I think at some point you need to change it. 23:42.000 --> 23:52.000 Where you store the change is always a question; we use NetBox as the source of truth. 23:52.000 --> 23:54.000 If it can. 23:54.000 --> 24:04.000 So NetBox has an audit trail, so you can see every change, even programmatic changes. I'm not 100% sure if you can roll back. What NetBox implemented as a plugin is 24:04.000 --> 24:11.000 versioning like in Git, where, for example, imagine you restructure one of your data centers. 24:11.000 --> 24:23.000 Back in the day you would set up a second NetBox, try it out, and then apply it; now you can basically fork the NetBox state, apply the changes, and then commit them. 24:23.000 --> 24:30.000 So there is an option, but I think it used to be a plugin; maybe it is integrated by now. 24:30.000 --> 24:45.000 That's a good question; this is currently also work in progress, so we are not far enough along to have experienced this in real life. 24:45.000 --> 25:05.000 So, the question is how we deal with node failures during boot and configuration. We don't really have any experience with that yet, because this is still work in progress. So this is maybe a caveat: we are working on this, we don't have it in production yet, but that is the plan. 25:05.000 --> 25:17.000 If you have any ideas or any tips on how to deal with this, we can maybe talk later and exchange. 25:17.000 --> 25:27.000 Sorry. 25:27.000 --> 25:32.000 Something else, like... 25:32.000 --> 25:45.000 So, the question is if the networking service of OpenStack, which in our case currently talks to the Cisco SDN, can also talk to other networking systems. 25:45.000 --> 25:52.000 It's a good question. There are tools that can automatically scrape your network, 25:52.000 --> 25:55.000 your network 25:55.000 --> 26:06.000 system, and put the data into NetBox. We haven't looked into those. I guess for really big networks, where it doesn't make sense to manually 26:06.000 --> 26:18.000 manage all the cables and connections, that might be a useful approach, also maybe for brownfield systems where you add NetBox later. 26:18.000 --> 26:22.000 The time is up; thank you very much for coming to the talk.