WEBVTT

00:00.000 --> 00:07.000 Let's see.
00:07.000 --> 00:10.000 All right, hey everyone.
00:10.000 --> 00:11.000 My name is Jan-Patrick Lehr.
00:11.000 --> 00:14.000 I'm the fourth presenter that has this logo on the slide.
00:14.000 --> 00:18.000 So apologies for taking over the dev room, but apparently we're somewhat active here.
00:18.000 --> 00:20.000 So I'm going to talk about testing: it's fantastic.
00:20.000 --> 00:21.000 It's needed.
00:21.000 --> 00:25.000 And, for us, you know, you can fill in the blank with whatever you think it is for us.
00:25.000 --> 00:29.000 I think it's needed, but I would also hope it's somebody else's job.
00:29.000 --> 00:33.000 But guess what?
00:33.000 --> 00:37.000 It kind of ended up being mine anyway.
00:37.000 --> 00:39.000 Why do we actually care?
00:39.000 --> 00:40.000 Why do we care about this stuff?
00:40.000 --> 00:43.000 So we test upstream LLVM for two reasons.
00:43.000 --> 00:47.000 One is we have many people working directly in upstream LLVM.
00:47.000 --> 00:49.000 Of course, we want to support the development.
00:49.000 --> 00:52.000 We want to support our upstream developers.
00:52.000 --> 00:56.000 Most of our backend work for AMDGPU is actually done upstream.
00:56.000 --> 01:04.000 But we also want to guard downstream ROCm, because we actually merge upstream changes into downstream ROCm three times a day.
01:04.000 --> 01:11.000 And whenever something breaks upstream and it hits a certain window where we pull in the changes for the merges,
01:11.000 --> 01:13.000 we break our internal build of ROCm too.
01:13.000 --> 01:15.000 And we don't want that.
01:15.000 --> 01:20.000 And this is why we actually test upstream, and test upstream more and more and more.
01:20.000 --> 01:23.000 And so we're bringing up more upstream testing.
01:23.000 --> 01:27.000 Now, in this presentation, I'm going to talk about my journey,
01:27.000 --> 01:33.000 from when I kind of inherited the then-existing buildbots,
01:33.000 --> 01:39.000 towards bringing online more buildbots, and working towards also potentially
01:39.000 --> 01:43.000 bringing online pre-commit testing
01:43.000 --> 01:49.000 when you open a PR, such that we actually run on-GPU tests if you touch stuff that we care about.
01:49.000 --> 01:53.000 Okay, so since it's about my journey, first things first:
01:53.000 --> 01:57.000 the LLVM testing landscape. There are three different technologies in use here.
01:57.000 --> 02:02.000 One is GitHub, with GitHub Actions and GitHub runners for some of the testing.
02:02.000 --> 02:09.000 So, for example, I think libc++ fully relies on GitHub runners and GitHub Actions for its testing.
02:09.000 --> 02:11.000 Then there's Buildkite.
02:11.000 --> 02:14.000 That's where the current pre-commit testing is actually done.
02:14.000 --> 02:20.000 There's a GitHub workflow that puts together a string, which is a Buildkite pipeline definition,
02:20.000 --> 02:25.000 then sends it off to Buildkite to test your PR changes through Buildkite,
02:25.000 --> 02:28.000 and then reports the result back to GitHub.
02:28.000 --> 02:32.000 I think it's going to go away at some point, but it's still there right now.
02:32.000 --> 02:39.000 So that's pre-commit, and then there's Buildbot, and Buildbot is the whole post-commit fleet.
02:39.000 --> 02:42.000 And this talk is mostly about Buildbot.
02:43.000 --> 02:48.000 Simply because that's where I've put most of my energy so far,
02:48.000 --> 02:53.000 and pre-commit is a little newer.
02:53.000 --> 02:55.000 So first of all, Buildbot.
02:55.000 --> 02:59.000 Let's talk a little bit about terminology, and I have this slide.
02:59.000 --> 03:03.000 There was a talk from David Spickett at the 2022 LLVM Developers' Meeting.
03:03.000 --> 03:06.000 If you're interested in the buildbots, I would highly recommend this talk,
03:06.000 --> 03:11.000 because it goes into much more detail about what it is to be a buildbot maintainer.
03:11.000 --> 03:15.000 But just terminology-wise: Buildbot is orchestrated.
03:15.000 --> 03:20.000 You have a buildmaster; that's a server that's run by Galina, so thank you, Galina.
03:20.000 --> 03:23.000 And then you have what are called builders.
03:23.000 --> 03:30.000 Builders are kind of the logical entity that typically builds one configuration of LLVM.
03:30.000 --> 03:35.000 And these builders then run on a worker, and a worker is more or less the physical
03:35.000 --> 03:39.000 machine, the actual thing that does the computing.
03:39.000 --> 03:42.000 So you can have a builder-to-worker mapping of one-to-one.
03:42.000 --> 03:45.000 You can have one-to-N. You can have, you know, you see it there.
03:45.000 --> 03:50.000 In our case, we always run a single builder on a single worker.
03:50.000 --> 03:55.000 So there's none of the crazier setups for us.
03:55.000 --> 03:59.000 If you scan the QR code, that's the link to David's talk.
03:59.000 --> 04:03.000 Just beware, I have another couple of slides with QR codes.
04:03.000 --> 04:06.000 So this is the YouTube link to his talk.
04:07.000 --> 04:16.000 Oh, one thing to remember is the specification of the setup of these builders and these workers.
04:16.000 --> 04:19.000 That's maintained in what is called the llvm-zorg repository.
04:19.000 --> 04:22.000 So there's a separate repository for that.
04:22.000 --> 04:25.000 Okay, more terminology.
04:25.000 --> 04:30.000 In that repository, you will find building blocks that allow you to, you know, define your builders.
04:30.000 --> 04:33.000 So, for example, there's something that's called an annotated builder.
04:33.000 --> 04:36.000 That's kind of the build-your-own-builder thing.
04:36.000 --> 04:41.000 You have to script everything and tell it exactly the steps that this thing should do.
04:41.000 --> 04:44.000 And then there are more convenient, kind of off-the-shelf things.
04:44.000 --> 04:47.000 One, for example, is the OpenMP builder.
04:47.000 --> 04:52.000 This is kind of a standard way of building OpenMP in LLVM.
04:52.000 --> 04:57.000 And the way this looks is you have some Python code where you say:
04:57.000 --> 05:04.000 I have a builder with a name, on a worker, and it has a certain build directory.
05:04.000 --> 05:09.000 That has to be unique across the whole fleet.
05:09.000 --> 05:14.000 So that's typically why you make the build directory the same as the builder name.
05:14.000 --> 05:20.000 And then it depends on certain projects and you invoke a script, and so that's one of the annotated builders,
05:20.000 --> 05:22.000 in case the OpenMP builder doesn't fit your needs.
05:22.000 --> 05:26.000 But anyway, our current buildbot fleet roughly looks like this.
05:26.000 --> 05:31.000 This is production; we have about eight production bots right now.
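NOTE For illustration, a builder entry of the kind described above roughly looks like the Python sketch below. This is a sketch, not the exact production configuration: the builder and worker names are made up, and the factory arguments may differ from what the real bots use. It assumes an llvm-zorg checkout on PYTHONPATH so the zorg imports resolve.

from zorg.buildbot.builders import OpenMPBuilder

example_builder = {
    # Builder name; by convention the build directory is the same string,
    # since it has to be unique across the whole fleet.
    "name": "openmp-offload-amdgpu-example",
    "builddir": "openmp-offload-amdgpu-example",
    "tags": ["openmp", "offload"],
    # The worker (the machine, bare metal or container) this builder runs on.
    "workernames": ["amdgpu-example-worker-1"],
    # Off-the-shelf factory: the "standard way of building OpenMP in LLVM".
    "factory": OpenMPBuilder.getOpenMPCMakeBuildFactory(
        clean=True,
        depends_on_projects=["llvm", "clang", "lld", "openmp", "offload"],
    ),
}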
05:31.000 --> 05:36.000 And I'm now going to talk you through how we arrived here.
05:36.000 --> 05:45.000 So let's first look at what I put up here. What I call the inherited time is the initial time when I actually inherited these buildbots to take care of.
05:45.000 --> 05:51.000 Then there's an intermediate time where I played around, there's the current time, which is now, and then there's a bright future, right?
05:51.000 --> 05:56.000 Everybody likes a bright future. Okay, so let's start with the inherited time.
05:56.000 --> 06:00.000 In the inherited time we had three machines, basically.
06:00.000 --> 06:15.000 One for HIP, one for OpenMP, and one that basically did the same thing as the OpenMP one; the third machine mirrored what the OpenMP machine was doing.
06:15.000 --> 06:34.000 And then when you wanted to make changes, you would bring up a PR to, again, llvm-zorg, you would land it, and then from zorg it would trickle down to the builders. For the staging bots that would happen every two hours; for the production bots you would need to send Galina an email so that she brings it into the production bots.
06:34.000 --> 06:40.000 And this merging into staging happens, again, every two hours, unless there's a problem
06:40.000 --> 06:44.000 in the zorg repository, but you don't know,
06:44.000 --> 06:50.000 because there was no linting or testing or anything. So it was like: I submitted a change to zorg eight hours ago,
06:50.000 --> 06:54.000 it didn't trickle down; did I do something wrong, or did somebody else put up a patch?
06:54.000 --> 07:05.000 So sometimes you just got an email from Galina telling you, oh, I landed your patch because there was something broken a few commits before yours, and it was kind of a mess.
07:05.000 --> 07:12.000 But anyway, typically it worked, and the builders didn't need too much attention, so everything was fine; we had three machines.
07:12.000 --> 07:25.000 But then we discussed: there's two machines doing the same thing, and that's kind of annoying, because we were bringing in more stuff. For example, Joseph Huber started to work on libc for GPU, and so we wanted to test that, right?
07:25.000 --> 07:28.000 Okay, that's...
07:28.000 --> 07:30.000 Does it work?
07:30.000 --> 07:32.000 The intermediate time.
07:32.000 --> 07:33.000 Okay.
07:33.000 --> 07:39.000 So we had that setup and then libc came along. Okay, so let's just make this thing also test libc.
07:39.000 --> 07:43.000 Libc on GPU; so libc on GPU is tested on AMDGPU.
07:43.000 --> 07:49.000 And then there was a change that broke our internal ROCm build, but on SUSE.
07:49.000 --> 07:54.000 So on SLES 15, what's the default GCC version, anybody know?
07:54.000 --> 07:58.000 7.5.
07:58.000 --> 08:03.000 Typically developers do not use GCC 7.5 on the regular.
08:03.000 --> 08:08.000 So they use things from C++17 that are not present in GCC 7.5, right?
08:08.000 --> 08:09.000 And so that would break.
08:09.000 --> 08:15.000 So we brought online more builders, specifically a SLES builder,
08:15.000 --> 08:18.000 so we would actually catch these things when they land.
08:18.000 --> 08:21.000 We also then put in RHEL 8 and RHEL 9,
08:21.000 --> 08:25.000 and one that also does Flang testing.
08:25.000 --> 08:27.000 So we test Flang too.
08:27.000 --> 08:30.000 And that's great.
08:30.000 --> 08:35.000 But bringing all these machines online was done manually.
08:35.000 --> 08:45.000 And, you know, I leave it up to you to figure out who that person is that's represented by two people here.
08:45.000 --> 08:46.000 Okay.
08:46.000 --> 08:50.000 But one of the things you notice is that there are different icons here, right?
08:50.000 --> 08:55.000 So this one up above, that's a container, right?
08:55.000 --> 08:56.000 It's a box.
08:56.000 --> 08:57.000 It's a container.
08:57.000 --> 08:58.000 And this is bare metal.
08:58.000 --> 09:05.000 So we would actually bring in the new builders more and more containerized.
09:05.000 --> 09:10.000 And that has advantages to some extent.
09:10.000 --> 09:14.000 A couple of them: we have more machines that are larger,
09:14.000 --> 09:18.000 and with containers, we can nicely separate different builders
09:18.000 --> 09:26.000 with different OSes, put them on one machine, and organize for throughput, basically.
09:26.000 --> 09:28.000 And that's actually one of the reasons.
09:28.000 --> 09:33.000 And that brings us to the current time.
09:33.000 --> 09:34.000 Okay.
09:34.000 --> 09:41.000 So here's this.
09:41.000 --> 09:42.000 Let's see.
09:42.000 --> 09:43.000 I don't know what's coming next.
09:43.000 --> 09:44.000 There we go.
09:44.000 --> 09:45.000 Okay.
09:45.000 --> 09:47.000 Now I can talk.
09:47.000 --> 09:51.000 So then I was working on actually containerizing the other half, too.
09:51.000 --> 09:54.000 And that's nice, because of the HIP bot.
09:54.000 --> 09:57.000 That's the one that I'm least familiar with.
09:57.000 --> 09:59.000 And then we found a problem there,
09:59.000 --> 10:02.000 in the sense that it went red,
10:02.000 --> 10:06.000 and we couldn't reproduce the issue locally.
10:06.000 --> 10:11.000 So we put it back into staging and I put up a container.
10:11.000 --> 10:17.000 I created it because how the HIP bot was set up was barely documented,
10:17.000 --> 10:20.000 and only internally documented.
10:20.000 --> 10:25.000 And since I couldn't reproduce the issue locally in this containerized environment,
10:25.000 --> 10:26.000 we were like, okay,
10:26.000 --> 10:28.000 the ROCm version on the host is different,
10:28.000 --> 10:31.000 so let's update the ROCm version on the
10:31.000 --> 10:33.000 actual buildbot worker.
10:33.000 --> 10:34.000 Okay. Great.
10:34.000 --> 10:35.000 Sure.
10:35.000 --> 10:36.000 Let's do this.
10:36.000 --> 10:39.000 So it runs the updates and then it has to restart services.
10:39.000 --> 10:40.000 Okay.
10:40.000 --> 10:42.000 And then: connection timed out.
10:42.000 --> 10:43.000 Oops.
10:43.000 --> 10:45.000 Okay.
10:45.000 --> 10:47.000 I have IPMI information,
10:47.000 --> 10:52.000 so I can log into the node, except that IPMI information apparently is outdated.
10:52.000 --> 10:54.000 So I couldn't log into that node again,
10:54.000 --> 10:55.000 and I lost the builder.
10:55.000 --> 10:57.000 But thankfully
10:57.000 --> 10:58.000 I had containerized this,
10:58.000 --> 11:02.000 and I actually had put together some Ansible playbooks to deploy these things.
11:02.000 --> 11:06.000 So I was actually able to
11:06.000 --> 11:11.000 build on the work that I did and in about 30 minutes turn this thing into a container.
11:11.000 --> 11:15.000 And now it's back in production, and we are testing again,
11:15.000 --> 11:24.000 also our HIP stuff, and we're moving more and more towards this more automated approach of containerized builders,
11:24.000 --> 11:32.000 deployed via Ansible onto the actual machines.
11:32.000 --> 11:35.000 So let's talk about the bright future.
11:35.000 --> 11:37.000 Maybe just the future, but hopefully it's bright.
11:37.000 --> 11:45.000 This is where I would like to move our testing towards, because I recognize some of the problems that we have.
11:45.000 --> 11:47.000 So first, for post-commit:
11:47.000 --> 11:54.000 I would like to get everything containerized and then make a distinction between what I call slow bots and fast bots.
11:54.000 --> 11:59.000 Fast bots basically build only, with only little testing, nothing else.
11:59.000 --> 12:06.000 The idea is that those things are fast enough to test every single commit, nothing batched.
12:06.000 --> 12:08.000 Every single commit.
12:08.000 --> 12:14.000 Because if the testing is batched, which means, you know, you have three or four commits and your bot turns red,
12:14.000 --> 12:16.000 you don't know which commit broke it.
12:16.000 --> 12:21.000 So as a maintainer, you then have to go locally, revert, check:
12:21.000 --> 12:25.000 is this the breaking change? Re-apply, revert something else.
12:25.000 --> 12:28.000 And that's kind of annoying.
12:29.000 --> 12:39.000 The slow bots will actually be batching things, but will probably have a turnaround time between 45 minutes and an hour.
12:39.000 --> 12:41.000 So we could actually run some workloads through them
12:41.000 --> 12:47.000 and see, you know, what the impact is on, for example, a SPEC CPU suite or whatever.
12:47.000 --> 12:51.000 And then, of course, there's the bright future
12:52.000 --> 12:57.000 with GitHub Actions Runner Controller on Kubernetes.
12:57.000 --> 13:05.000 And so that would give us pre-commit build and pre-commit test on GPU.
13:05.000 --> 13:12.000 That's what I'm currently working on, and that's also been a journey, but that's a different topic.
13:12.000 --> 13:18.000 Okay, so, lessons learned, best practices.
13:19.000 --> 13:22.000 Mostly lessons learned for me, I guess.
13:22.000 --> 13:28.000 When I inherited the single builder: it's easy enough to maintain a single thing.
13:28.000 --> 13:31.000 And it's quite well documented, actually.
13:31.000 --> 13:35.000 I think.
13:35.000 --> 13:42.000 When I started, testing changes to the buildbot configuration was a nightmare, because it was basically non-existent.
13:42.000 --> 13:44.000 That's now much, much better.
13:44.000 --> 13:47.000 I don't remember who landed the changes, but thank you so much.
13:47.000 --> 13:50.000 It's working great.
13:50.000 --> 13:57.000 Also, production builders will send you emails if the build breaks,
13:57.000 --> 14:04.000 but only if you actually mark them as sending emails in this file; that's something I learned about six weeks ago.
14:04.000 --> 14:06.000 So this is why I put this here:
14:06.000 --> 14:12.000 if you want your builders to send you emails as the maintainer, they have to be in this file.
14:12.000 --> 14:22.000 And the next thing that I learned is: you want to have your builds, or the builders, actually the buildbots, as reproducible as possible.
14:22.000 --> 14:27.000 Because you will have a contributor asking: how can I reproduce that issue?
14:27.000 --> 14:29.000 How can I reproduce the failing build?
14:29.000 --> 14:35.000 And then you say: well, I have this machine, you know?
14:36.000 --> 14:42.000 So I've now containerized all the environments that we use,
14:42.000 --> 14:47.000 and you will find the Dockerfiles through this link here.
14:47.000 --> 14:55.000 And I've also started to put together CMake cache files that are in LLVM trunk, in the offload project.
14:55.000 --> 14:59.000 They are used as the build configs on our buildbots.
14:59.000 --> 15:03.000 You can actually download the Dockerfile, build the Docker image locally,
15:03.000 --> 15:10.000 and then simply configure with cmake -C pointing at the CMake cache file and get the exact build config that our buildbot runs.
15:10.000 --> 15:15.000 So that should make it fairly easy to at least reproduce the build environment,
15:15.000 --> 15:23.000 albeit you might not have an AMD GPU to then execute the tests the same way we do.
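NOTE As a rough sketch of that reproduction flow: the steps below show the general idea (build the published container image, then configure llvm-project inside it with the bot's CMake cache file). The image tag, Dockerfile name, cache file name, and paths are placeholders, not the exact names used by the production bots.

import subprocess

IMAGE = "llvm-offload-bot"                        # placeholder local image tag
DOCKERFILE = "Dockerfile.example"                 # placeholder; use one of the published Dockerfiles
CACHE = "offload/cmake/caches/ExampleBot.cmake"   # placeholder cache file inside llvm-project
LLVM_SRC = "/path/to/llvm-project"                # local llvm-project checkout

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Build the same container image the builder runs in.
run(["docker", "build", "-t", IMAGE, "-f", DOCKERFILE, "."])

# Configure and build llvm-project inside that image using the bot's cache file.
run(["docker", "run", "--rm", "-v", f"{LLVM_SRC}:/src", "-w", "/src", IMAGE,
     "bash", "-c", f"cmake -G Ninja -S llvm -B build -C {CACHE} && ninja -C build"])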
15:23.000 --> 15:32.000 And then finally, one of the things that I realized in the end is that if you're actually running a fleet of buildbots, that's harder.
15:32.000 --> 15:36.000 But it's not harder because of buildbots; it's because you're actually managing a fleet of stuff.
15:36.000 --> 15:38.000 Right? And that's a different thing.
15:38.000 --> 15:47.000 You have to think not only about, oh, how do I make this reproducible or whatever, but how do I manage eight machines, 16 deployments, whatever.
15:47.000 --> 15:59.000 And that's why I think you should automate deployments, through Ansible for example, because that can serve as documentation but also makes it easier in the long run.
15:59.000 --> 16:03.000 And if you found any of this interesting, we're hiring.
16:03.000 --> 16:07.000 All right, so come talk to me, come talk to other folks from AMD.
16:07.000 --> 16:10.000 I think I'm supposed to show you this.
16:10.000 --> 16:13.000 And thank you so much, I'm very happy to take questions.
16:13.000 --> 16:26.000 How specific is Buildbot to LLVM?
16:26.000 --> 16:33.000 How specific is Buildbot to LLVM? I think, according to the website, not at all.
16:33.000 --> 16:40.000 It could be used for some totally different project.
16:40.000 --> 16:45.000 And that's because it's more of a toolbox to build your own CI,
16:45.000 --> 16:49.000 kind of thing.
16:49.000 --> 16:52.000 Yeah, Buildbot might be older than LLVM, was a comment here.
16:52.000 --> 16:59.000 Okay, yes.
16:59.000 --> 17:07.000 We use ccache. The question was whether we use any caching techniques; yes, there's no way around ccache for us.
17:07.000 --> 17:11.000 Yes.
17:11.000 --> 17:16.000 The question is, do we also use it for kernel or Mesa changes? We don't.
17:16.000 --> 17:25.000 What I showed is just LLVM upstream and nothing else.
17:25.000 --> 17:29.000 No, this is not ROCm. This is pure upstream.
17:29.000 --> 17:33.000 Yeah. What we do internally, that's a whole different story, but this is pure upstream.
17:33.000 --> 17:35.000 There's nothing ROCm-specific.
17:35.000 --> 17:40.000 What we need is the kernel fusion driver for actually getting access to the GPUs.
17:40.000 --> 17:42.000 And then we have the HSA runtime on top of that.
17:42.000 --> 17:46.000 So we can run offloading tests, GPU offloading tests, but that's it.
17:46.000 --> 17:50.000 The buildbots don't actually have large ROCm installations.
17:50.000 --> 17:55.000 Yes, the HSA runtime is fixed.
17:56.000 --> 18:02.000 And we update it every now and then, and, you know, that's reflected in the builder description.
18:02.000 --> 18:10.000 So we note which ROCm version that builder is running, and so you could install the same thing.
18:10.000 --> 18:11.000 Thank you.
18:11.000 --> 18:16.000 Cool.
18:16.000 --> 18:18.000 All right.
18:25.000 --> 18:32.000 Thanks for watching.