WEBVTT 00:00.000 --> 00:10.000 All right, we're going to start with the zero touch HPC talk. 00:10.000 --> 00:11.000 Thank you. 00:11.000 --> 00:12.000 Hello, everyone. 00:12.000 --> 00:18.000 I'm Ümit and I'm here with Leon, and we are here to talk about our journey to zero touch HPC. 00:18.000 --> 00:27.000 We manage scientific computing for a diverse group of researchers, and as our scale grew, 00:27.000 --> 00:32.000 our old ways of managing the infrastructure by hand didn't really keep up. 00:32.000 --> 00:37.000 In 2020, we presented our work on running OpenStack on top of, sorry, 00:37.000 --> 00:40.000 HPC on top of OpenStack here at FOSDEM. 00:40.000 --> 00:45.000 And while the payload, the Slurm cluster, was automated using OpenStack's 00:45.000 --> 00:50.000 orchestration engine Heat and Ansible, the underlying infrastructure was a different story. 00:50.000 --> 00:53.000 And as the infrastructure side was managed manually by hand, 00:53.000 --> 00:58.000 the knowledge about that infrastructure was just in the heads of a few individuals. 00:58.000 --> 01:03.000 And there was no reproducible way of rebuilding the underlying infrastructure. 01:03.000 --> 01:05.000 This is the heart of our talk. 01:05.000 --> 01:12.000 The journey takes us from the HPC hardware racked into our data center to running jobs on a Slurm cluster, 01:12.000 --> 01:18.000 ideally with no human intervention, or just a little human intervention. 01:18.000 --> 01:26.000 Our first step in this journey starts with the hardware: once we receive the hardware, it goes into NetBox. 01:26.000 --> 01:31.000 NetBox is a popular open source data center infrastructure management, or DCIM for short, 01:31.000 --> 01:37.000 system. It can help track and manage both the physical layer of a data center, 01:37.000 --> 01:43.000 such as data center floors, racks, PDUs, patch panels, as well as the physical devices, 01:43.000 --> 01:47.000 such as storage systems, servers, switches, and so on. 01:47.000 --> 01:53.000 In addition, it provides IPAM functionality, as well as the ability to manage all kinds of connections, 01:53.000 --> 01:59.000 ranging from simple power cables to network connections to circuits, aka internet uplinks. 01:59.000 --> 02:05.000 We currently track over 300 servers across 100 racks and 2,500 cables. 02:05.000 --> 02:10.000 If it's not in NetBox, it basically doesn't exist in our infrastructure. 02:10.000 --> 02:16.000 NetBox also provides custom scripts that we use for all kinds of automation, 02:16.000 --> 02:25.000 such as drift detection and import and export functionality, and programmatic access is provided by a GraphQL and a REST API. 02:25.000 --> 02:30.000 Before NetBox, we did our rack management in Excel; you can see an example on the left side. 02:30.000 --> 02:33.000 On the right side, you see the same racks in NetBox. 02:33.000 --> 02:39.000 The nice thing is you can click on any of these servers, and you will get a detailed view of the server with the network cards, 02:39.000 --> 02:45.000 the top-of-rack switch to which they are cabled, and any assigned IP addresses. 02:45.000 --> 02:48.000 So, NetBox doesn't live in a vacuum. 02:48.000 --> 02:54.000 We use custom drift checks and scripts to ensure data consistency across all of our vendor appliances, 02:54.000 --> 03:00.000 like, for example, Lenovo XClarity, Dell OpenManage, and our IPAM system, which is Infoblox.
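As a rough illustration of this kind of automation, here is a minimal sketch of what a NetBox custom drift-check script could look like in Python. It is not the actual script from the talk: the appliance export file and its format are assumptions for the example.

```python
# Hypothetical NetBox custom script: compare device serial numbers in NetBox
# against serials exported from a vendor appliance (illustrative sketch only).
import json

from dcim.models import Device
from extras.scripts import Script


class SerialDriftCheck(Script):
    class Meta:
        name = "Serial number drift check"
        description = "Verify NetBox serials against a vendor appliance export"

    def run(self, data, commit):
        # Assumed format: {"hostname": "serial", ...}, e.g. exported from XClarity.
        with open("/opt/netbox/appliance_serials.json") as fh:
            appliance = json.load(fh)

        for device in Device.objects.filter(name__in=appliance.keys()):
            expected = appliance[device.name]
            if device.serial != expected:
                self.log_failure(
                    f"{device.name}: NetBox has '{device.serial}', appliance reports '{expected}'"
                )
            else:
                self.log_success(f"{device.name}: serial matches")
```

Because a script like this only reads data, it can be run regularly as a report-style check; the real checks described next also cover MAC addresses and LLDP neighbours.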
03:00.000 --> 03:07.000 These scripts automatically verify serial numbers, MAC addresses, and the LLDP neighbour information from our software-defined network, 03:07.000 --> 03:12.000 to ensure that the physical reality matches the reality that we have in NetBox. 03:12.000 --> 03:16.000 We also use them to import data from those appliances. 03:16.000 --> 03:24.000 For example, we can import MAC addresses for all the network interfaces of all the servers that these appliances manage, 03:24.000 --> 03:35.000 and then later we can use those MAC addresses together with IP addresses to provision the DHCP reservations in our IPAM solution, using OpenTofu for example. 03:36.000 --> 03:45.000 Once everything is stored and linked together in NetBox, it provides end-to-end visibility and allows us to drill down from different perspectives. 03:45.000 --> 03:52.000 So, for example, I can search for an IP address, and I see it is assigned to a certain interface on a compute node, 03:52.000 --> 03:57.000 and this compute node is cabled through two patch panels to this top-of-rack switch. 03:57.000 --> 04:05.000 And on the right side, you can see the visual cable trace for our border gateway that is connected to our internet uplink via two patch panels. 04:05.000 --> 04:08.000 NetBox can also be extended with plugins. 04:08.000 --> 04:09.000 We use two plugins. 04:09.000 --> 04:11.000 One is the floor plan plugin. 04:11.000 --> 04:13.000 As the name implies, you can draw floor plans. 04:13.000 --> 04:19.000 On the left side, as you see, is the floor plan manually drawn for a data center room, and on the right side, you see the same 04:19.000 --> 04:24.000 in NetBox. Again, you can click on the racks, and you get to the detailed rack view. 04:25.000 --> 04:34.000 Another plugin is the topology plugin. It basically allows you to have live, self-updating documentation of your logical network connections in the data center. 04:34.000 --> 04:40.000 So, for example, you can see how your core switch is connected to the border gateway and to the internet uplink. 04:40.000 --> 04:49.000 And if you change any cable or connection in NetBox, this will get reflected in the interactive visualization on the right side, 04:49.000 --> 04:54.000 while in the old way you would have to update the diagram on the left side. 04:54.000 --> 05:01.000 And this is a final example of our access layer in the campus network. 05:01.000 --> 05:06.000 All right, so at this point, we have basically established NetBox as our source of truth, 05:06.000 --> 05:09.000 and we will later use this to provision hosts. 05:09.000 --> 05:13.000 But another thing that we need is an actual operating system to run on these nodes. 05:13.000 --> 05:24.000 So, in our old system, we would run a base operating system and then execute long-running Ansible playbooks against the nodes, and we would do so manually. 05:24.000 --> 05:33.000 And apart from being slow, this would also introduce configuration drift easily, and it made rollbacks basically infeasible. 05:33.000 --> 05:44.000 So, in the new system, we want to be able to build an image, customize it to our needs, and then deploy it without having to run any steps manually. 05:44.000 --> 05:55.000 And this approach has several advantages. To begin with, we know that with all images that we build this way, all nodes that are running on this image are going to have the same initial state.
05:55.000 --> 06:02.000 Secondly, provisioning should also be a lot faster, because there are actually a lot of customization steps where we know 06:02.000 --> 06:10.000 we need to run them on all the nodes anyway, and so we can do this during image build time instead of during node setup later. 06:10.000 --> 06:17.000 Also, of course, we have all the customizations saved as code, and the image is going to be reproducible. 06:17.000 --> 06:25.000 And then the images kind of serve as versioned artifacts that we can reference, and because they are reproducible, and because of the other advantages, 06:26.000 --> 06:29.000 rollbacks should be at least less cumbersome. 06:29.000 --> 06:33.000 So, the question is, how can we implement this? 06:33.000 --> 06:40.000 What we would like to be able to do is to take a base upstream operating system that we want to customize, 06:40.000 --> 06:49.000 then our image build tool chain should be able to pick up on the distribution and the version that we are working on, perform customizations depending on that, 06:49.000 --> 06:53.000 and also be able to add any other customizations that we need. 06:53.000 --> 06:59.000 And after this, it should be able to push the image to the various endpoints that we are going to use. 06:59.000 --> 07:06.000 And the tool that we have picked for this is called Packer, and we chose it for several reasons. 07:06.000 --> 07:10.000 One reason being, it supports many platforms that we can use out of the box. 07:10.000 --> 07:18.000 Then, pleasantly for us, it has a configuration syntax that is very similar to OpenTofu, which we already use anyway. 07:18.000 --> 07:24.000 And most importantly, it supports many tools that we can use to customize the image that we are working on. 07:24.000 --> 07:30.000 So, we have a short example here of how such a Packer configuration template might look. 07:30.000 --> 07:35.000 At the very top, we define the plugins that we are going to use, and we can pin the versions. 07:35.000 --> 07:44.000 Then, the first interesting blocks, maybe, are these source blocks, and these basically describe where the image comes from that we are going to work on, 07:44.000 --> 07:52.000 and the builder that is going to configure it; here we have Docker, but it could also be QEMU or many other things. 07:52.000 --> 08:00.000 Then, the most interesting block is this build block here, and this is basically what defines the actual customizations that we are doing on the image. 08:00.000 --> 08:10.000 And here, we first have this sources attribute, which basically defines which of the sources that we defined earlier the customizations should run on, because there can be many of them. 08:10.000 --> 08:18.000 And then, you can have an arbitrary number of provisioner blocks. The provisioner block is what actually performs the customizations. 08:18.000 --> 08:27.000 So, here we have a very simple example of an Ansible provisioner that just runs a playbook without any further configuration; of course, we have many more options there. 08:27.000 --> 08:36.000 And there are also other provisioners available, like the shell provisioner that can run arbitrary shell commands, or the file provisioner that can copy files into the image. 08:36.000 --> 08:43.000 Once all of these provisioner blocks have finished executing, we can then still run post-processor blocks. 08:43.000 --> 08:55.000 And these are used to manage the artifacts that we produce, to produce a manifest file, or, in our case here, to again run arbitrary shell commands.
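To give a concrete feel for the artifact-handling side, here is a minimal Python sketch of one step such a final stage could trigger: uploading the finished image to the OpenStack image service with openstacksdk. This only illustrates the "push the image to the endpoints" idea; the cloud name, image name, and file path are assumptions, not the speakers' actual pipeline.

```python
# Illustrative sketch: upload a freshly built image to the OpenStack image
# service (Glance) so it can later be deployed to nodes. The cloud name,
# image name, and file path are assumptions for the example.
import openstack

conn = openstack.connect(cloud="hpc")  # credentials come from clouds.yaml

image = conn.image.create_image(
    name="hpc-node-rocky9-2024-06",    # hypothetical versioned artifact name
    filename="output/hpc-node.qcow2",  # artifact produced by the build
    disk_format="qcow2",
    container_format="bare",
    visibility="private",
)
print(f"Uploaded image {image.name} with id {image.id}")
```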
08:55.000 --> 09:02.000 Now I want to give you a quick overview of how we have implemented this image building pipeline at our site. 09:02.000 --> 09:18.000 So at the heart of this, we have a GitHub workflow that can either be called from the base repository where we keep this Packer code, or from a remote repository that can then basically contain extra Ansible files that will also be used to customize the image. 09:18.000 --> 09:24.000 Packer will then look for variables that describe the distribution and version that we are working on. 09:24.000 --> 09:31.000 And it will then spin up a QEMU VM using the respective Packer source. 09:31.000 --> 09:37.000 We can also do a manual invocation, using what you see on the right here, if we want to override this for some reason. 09:37.000 --> 09:44.000 Packer will then move on to run the Ansible playbooks, and we have multiple steps here. 09:44.000 --> 09:52.000 First, we have a range of common customizations that we know we need in any case, and we perform them without any conditions. 09:52.000 --> 09:59.000 Then we have these distribution- and version-specific playbooks that we will then run if they are needed. 09:59.000 --> 10:04.000 And then, most interestingly, we have this remote entry point. 10:04.000 --> 10:13.000 So if we call this from a remote repository and this repository has a directory called Packer, we can then include any files that are relevant to Ansible there. 10:13.000 --> 10:19.000 Packer will pick up on them and then also perform the things that we define there. 10:19.000 --> 10:27.000 At this point we are nearly finished, and we just need to make sure that once we boot this image up, it's going to look like a fresh Linux installation. 10:27.000 --> 10:33.000 So we clear out some logs, we set the machine-id to uninitialized, things like this. 10:33.000 --> 10:42.000 And then the image is actually finished; we just need to make sure that it is tagged properly and then upload it to wherever we need it. 10:42.000 --> 10:54.000 And now this all sounds very great, but actually this still has a kind of problem, and that is that not all of these customizations can or should run during image build time. 10:54.000 --> 11:03.000 For example, tasks that involve secrets, or any tasks that depend on the role of the node in the Slurm cluster, for example. 11:03.000 --> 11:14.000 When we then inspected our existing Ansible code, we noticed that most of it is held in Ansible roles, and many of these roles have tasks that fall into this category. 11:14.000 --> 11:20.000 But there are also many tasks that we could apply to all the nodes unconditionally. 11:20.000 --> 11:26.000 So our solution to this was to split the Ansible roles into two parts: install and configure. 11:26.000 --> 11:32.000 And the install tasks are the tasks that we perform during image build time. 11:32.000 --> 11:40.000 Basically, these are all the tasks that are common to all nodes: installing most packages, setting up some directory structures, things like this. 11:40.000 --> 11:53.000 And the configure tasks of the roles we are going to run when the node first boots, and these are the tasks that are done to specialize the node towards its exact role, basically. 11:53.000 --> 11:59.000 These are also the tasks that involve secrets, things that we don't want to have in the image, basically. 11:59.000 --> 12:05.000 And at the top you see an example of this, where we include the role as usual, but we only include the install tasks here.
12:05.000 --> 12:11.000 So this is the simplest example of what you would also see in our image build pipeline. 12:12.000 --> 12:26.000 Thank you. So now that we have our custom image, how do we deploy it? We use two infrastructure-as-code tools to declaratively define and deploy our HPC system: OpenTofu, which is a fork of Terraform, and Terragrunt. 12:26.000 --> 12:33.000 We won't go too much into technical details, but I want to highlight some of the ways we use these two tools. 12:33.000 --> 12:41.000 We define reusable infrastructure components in OpenTofu and store them in an infrastructure module catalog, which is basically just a Git repo. 12:41.000 --> 12:52.000 These are basic building blocks in a component library that allow downstream users to define all kinds of infrastructure, not only for the HPC use case. 12:52.000 --> 12:59.000 And these modules encapsulate common infrastructure pieces such as networks and clusters. 12:59.000 --> 13:08.000 For example, the network module will create a network, an IPv4 subnet, an optional IPv6 subnet, and a router using the OpenStack provider. 13:08.000 --> 13:15.000 Although OpenTofu modules provide a way to abstract these common infrastructure pieces, 13:15.000 --> 13:22.000 it doesn't really provide a DRY way to customize and parameterize them across different environments, 13:22.000 --> 13:25.000 for example dev, staging, or production. 13:25.000 --> 13:30.000 For this reason we use Terragrunt, which is a wrapper around OpenTofu/Terraform. 13:30.000 --> 13:39.000 It's basically a code generator that fills this niche by allowing you to define units and stacks that reference the corresponding OpenTofu modules. 13:39.000 --> 13:42.000 And this makes it possible to keep the code DRY. 13:42.000 --> 13:53.000 So here you can see our HPC Terragrunt Git repo, where we define virtual and bare metal HPC clusters across three environments. 13:53.000 --> 14:06.000 And the Terragrunt stack HCL file references this OpenTofu cluster module and allows us to customize the flavor, the number of nodes, and the OpenStack cloud 14:06.000 --> 14:10.000 where the HPC Slurm cluster should be set up. 14:10.000 --> 14:19.000 Now that we have explained which orchestration tool we use to deploy the HPC cluster, let us also briefly explain how it is actually done. 14:19.000 --> 14:24.000 For that we use the OpenStack bare metal service Ironic, together with NetBox. 14:24.000 --> 14:31.000 So in NetBox we define custom export templates that provide the YAML that Ironic expects to onboard the bare metal nodes. 14:31.000 --> 14:35.000 Then Ironic manages the power state and can deploy our custom image, 14:35.000 --> 14:45.000 using the various drivers that it supports, ranging from legacy PXE and TFTP methods all the way to modern Redfish and virtual media. 14:45.000 --> 14:51.000 This is an example of this YAML for one of our four-way blade servers. 14:51.000 --> 14:56.000 Because Ironic can manage chassis and node assignments, and we model this information in NetBox, 14:56.000 --> 15:01.000 we can just include it in the YAML. So you see at the top there's a chassis definition with the four nodes. 15:01.000 --> 15:10.000 There's one node defined with the BMC credentials and some meta information that is used to connect the flavor to the bare metal node. 15:10.000 --> 15:19.000 And then there is information about which top-of-rack switch and which port this compute node is connected to, because our OpenStack integrates with our SDN.
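To make the onboarding step a bit more tangible, here is a minimal Python sketch of enrolling one such node into Ironic with openstacksdk, starting from data like that in the NetBox export. The cloud name, field names, and export format are assumptions for the example; the real export templates and tooling are site-specific.

```python
# Illustrative sketch: enroll a bare metal node in Ironic from data exported
# out of NetBox. Field names and the export format are assumptions; the real
# export template and onboarding tooling differ per site.
import openstack
import yaml

conn = openstack.connect(cloud="hpc")  # hypothetical clouds.yaml entry

# Assume NetBox's export template produced something like this YAML document.
with open("exports/compute-0101.yaml") as fh:
    node_def = yaml.safe_load(fh)

node = conn.baremetal.create_node(
    name=node_def["name"],
    driver="redfish",
    driver_info={
        "redfish_address": node_def["bmc_address"],
        "redfish_username": node_def["bmc_username"],
        "redfish_password": node_def["bmc_password"],
    },
    resource_class=node_def["flavor"],  # ties the flavor to the bare metal node
)

# Register the NIC so the networking service can wire the node correctly.
conn.baremetal.create_port(node_id=node.id, address=node_def["mac_address"])

# Move the node through Ironic's states until it is available for deployment.
conn.baremetal.set_node_provision_state(node, "manage")
conn.baremetal.set_node_provision_state(node, "provide")
```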
15:19.000 --> 15:29.000 That way, the networking service in OpenStack can provide connectivity during deployment and operation. 15:29.000 --> 15:36.000 Alright, so at this point we have built an image and we have also spun up the nodes, but all these nodes are currently in the same state, 15:36.000 --> 15:42.000 the exact state that we defined in the image, and now we need to figure out how we specialize them. 15:42.000 --> 15:48.000 And the problem is: how can we do it without having to do anything manually? 15:48.000 --> 15:54.000 So what we have introduced here is what we call the Ansible initialization system, or ansible-init for short. 15:54.000 --> 15:58.000 Props for this go to StackHPC; they have given us the initial idea for this. 15:58.000 --> 16:04.000 So basically it's very simple: this is just a systemd service that will run on the node's very first boot. 16:04.000 --> 16:12.000 It looks for a sentinel file, and if it doesn't exist, it then goes on to run an arbitrary number of Ansible playbooks. 16:12.000 --> 16:18.000 And what it really needs to do is run these configure tasks that I mentioned earlier. 16:18.000 --> 16:27.000 So this is what we need to really specialize the nodes in the way that we need, and also apply tasks that depend on secrets. 16:27.000 --> 16:33.000 So one question here is how we can specify which roles should be run by this service. 16:33.000 --> 16:41.000 And we have two mechanisms for this: either it finds an inventory directly, and then it just uses this, 16:41.000 --> 16:47.000 or it uses metadata that it can get either from a config drive or from the cloud metadata endpoint. 16:47.000 --> 16:53.000 And then it sees from the metadata that it should run these certain roles, and it does so. 16:53.000 --> 17:01.000 And if there are no problems, it will then write the sentinel file, and on subsequent boots this service will not run anymore, 17:01.000 --> 17:03.000 so that we are not doing this needlessly. 17:03.000 --> 17:09.000 To be noted here is maybe that all of the playbooks that we use to customize these nodes 17:09.000 --> 17:16.000 we already bake into the image when we build it, so we don't need to rely on the ansible-pull mechanism or something like that to do this. 17:16.000 --> 17:24.000 So now what I still want to quickly touch on is how we manage secrets injection into these nodes. 17:24.000 --> 17:28.000 And there are basically two thoughts here. 17:28.000 --> 17:34.000 First, we don't want to have any long-lived credentials for the nodes to access any vault instance. 17:34.000 --> 17:40.000 And secondly, obviously, we don't want to have any secrets baked into the image directly. 17:40.000 --> 17:51.000 What we would like to have is that each node, upon provisioning, gets a credential that it can use for a short time to access the secrets that it needs to configure itself. 17:51.000 --> 17:57.000 And then after a short time, like one hour or so, this credential should no longer be valid. 17:57.000 --> 18:09.000 So what we are doing is that we replicate a subset of our big organizational vault into a vault with the secrets that the nodes need to configure themselves. 18:10.000 --> 18:23.000 And upon provisioning, for each node we create a unique credential that this node can then use to access this smaller vault that has the node secrets. 18:23.000 --> 18:27.000 And that is basically the system that we're using here.
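As a rough illustration of this pattern, here is a minimal Python sketch using the hvac client for HashiCorp Vault: at provisioning time a short-lived, use-limited token is minted for one node, and the node then uses it on first boot to read its secrets. The talk does not name the exact tooling; the Vault address, policy name, mount point, and paths are assumptions.

```python
# Illustrative sketch of per-node, short-lived Vault credentials using hvac.
# Policy name, secret paths, and addresses are assumptions for the example.
import hvac

# --- Provisioning side: mint a short-lived token scoped to node secrets ---
admin = hvac.Client(url="https://vault.example.org", token="ADMIN_TOKEN")
resp = admin.auth.token.create(
    policies=["node-config"],  # hypothetical policy allowing reads of node secrets
    ttl="1h",                  # credential expires after one hour
    num_uses=10,               # and after a handful of requests
    display_name="node-c0101",
)
node_token = resp["auth"]["client_token"]
# node_token is then handed to the node, e.g. via config drive or metadata.

# --- Node side (first boot): read the secrets needed for configuration ---
node = hvac.Client(url="https://vault.example.org", token=node_token)
secret = node.secrets.kv.v2.read_secret_version(
    path="nodes/c0101", mount_point="hpc"
)
print(secret["data"]["data"])  # the actual key/value pairs
```

Because the token carries both a TTL and a use limit, it becomes useless shortly after the node has configured itself, which matches the "no long-lived credentials" goal described in the talk.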
18:27.000 --> 18:38.000 So with all of these methods combined, if everything is correct, you should then be able to spin up a cluster without having to intervene manually. 18:38.000 --> 18:48.000 And what remains for me is to thank the HPC team at the Vienna BioCenter, and apart from that, if there are any questions, we're happy to discuss them with you. 18:48.000 --> 18:51.000 And thank you for coming to the talk. 18:51.000 --> 19:17.000 So the question is if we are aware of Warewulf, and why we went with this instead of Warewulf. 19:17.000 --> 19:27.000 And we are aware of Warewulf, but our choice is a bit historically based. 19:27.000 --> 19:43.000 We wanted to use OpenStack for more of a software-defined data center, and because we already had this in place and we have the experience or expertise, we decided to use it also for the bare metal, or for the HPC, part. 19:43.000 --> 20:10.000 So the question is if we redeploy the nodes whenever we have a configuration change. It's a good question; we haven't really decided. We now can basically have immutable node images, where we just build a new image, redeploy the nodes, and everything works. 20:10.000 --> 20:18.000 I think it's kind of a question of trade-offs: if it's a simple configuration change, you could roll it out with Ansible. 20:18.000 --> 20:27.000 But then you have to make sure that the image also contains it. We haven't really thought about this completely, and we haven't decided on it yet, but it is a valid point. 20:41.000 --> 20:53.000 Yes, so the question is why we don't use the automatic enrollment method of Ironic, to first enroll the nodes in Ironic and then import them into NetBox. 20:53.000 --> 20:56.000 It's a valid approach; the only problem is that 20:56.000 --> 21:15.000 because of our SDN network it's not that easy, and we anyway need to put them in NetBox and at least have the MAC addresses of the BMCs to create the DHCP reservations so that they get IP addresses. There are ways around this. 21:15.000 --> 21:19.000 That would also be a valid approach, but we... 21:19.000 --> 21:32.000 Whenever we get hardware, the first thing is we onboard it in NetBox, so we have it documented, also for the cabling, and so we use it as the source of truth for everything. 21:32.000 --> 21:37.000 I don't know which came first, but... 21:37.000 --> 21:41.000 So, the question is whether we have a multi-tenant setup in OpenStack. 21:41.000 --> 22:03.000 Basically we are the only tenant, so you could argue maybe it's overkill to use OpenStack, but we use it because of the integration with our SDN, and so we don't really support multi-tenancy, because we as the HPC team are using it alone, kind of. 22:03.000 --> 22:07.000 So, the question is if we are doing any changes to the ACI 22:07.000 --> 22:12.000 during the provisioning. So we also drive some of the Cisco ACI 22:12.000 --> 22:28.000 fabric-side configuration for the bare metal nodes through NetBox using Tofu, and also, because OpenStack's Neutron integrates with the Cisco SDN, during deployment the node gets plugged into a provisioning network. 22:28.000 --> 22:40.000 And then we plug it into the tenant network, so it's fully integrated. 22:40.000 --> 22:46.000 So the question is if the templates are available on GitHub, and which templates exactly. 22:46.000 --> 22:48.000 We have them internally. 22:49.000 --> 22:53.000 They are not really polished yet; eventually we would like to 22:53.000 --> 22:57.000 maybe make them public. The only problem is they are a bit specific to our
22:57.000 --> 23:13.000 system with the Cisco SDN, because of the contracts that you have to consume between endpoints, so it might not be really usable for other people, but we can publish it as a reference, basically, yes. 23:13.000 --> 23:22.000 Yeah, yeah, it's a good question. I mean, you could even have Tofu do changes in NetBox. 23:22.000 --> 23:26.000 Sorry, the question is how we handle changes in NetBox. 23:26.000 --> 23:42.000 Currently you change it in NetBox. You could theoretically drive NetBox through Tofu, but then you have multiple states: you have a state in OpenTofu for NetBox and then the one in NetBox itself. I mean, yes, I think at some point you need to change it. 23:42.000 --> 23:52.000 Where you store the change is always a question; we use NetBox as the source of truth. 23:52.000 --> 23:54.000 If it can. 23:54.000 --> 24:04.000 So NetBox has an audit trail, so you can see every change, even programmatic changes. I'm not 100% sure if you can roll back. What NetBox implemented as a plugin is 24:04.000 --> 24:11.000 versioning like in Git, where, for example, imagine you restructure one of your data centers. 24:11.000 --> 24:23.000 Back in the day you would set up a second NetBox, try it out, and then apply it; now you can basically fork the NetBox state, apply the changes, and then commit them. 24:23.000 --> 24:30.000 So there is an option, but I think it used to be a plugin; maybe it is integrated by now. 24:30.000 --> 24:45.000 That's a good question; this is currently also work in progress, so we are not far enough along to have experienced this in real life. 24:45.000 --> 25:05.000 So, the question is how we deal with node failures during boot and configuration. We don't really have any experience with that yet, because this is still work in progress. So this is maybe a caveat: we are working on this, we don't have it in production yet, but that is the plan. 25:05.000 --> 25:17.000 If you have any ideas or any tips on how to deal with this, we can maybe talk later and exchange. 25:17.000 --> 25:27.000 Sorry. 25:27.000 --> 25:32.000 Something else, like... 25:32.000 --> 25:45.000 So, the question is if the networking service of OpenStack, which in our case currently talks to the Cisco SDN, can also talk to other networking systems. 25:45.000 --> 25:52.000 It's a good question. There are tools that can automatically scrape your network, 25:52.000 --> 25:55.000 your network 25:55.000 --> 26:06.000 system, and put the data into NetBox. We haven't looked into those. I guess for really big networks, where it doesn't make sense to manually 26:06.000 --> 26:18.000 manage all the cables and connections, that might be a useful approach, also maybe for brownfield systems where you add NetBox later. 26:18.000 --> 26:22.000 The time is up; thank you very much for coming to the talk.