WEBVTT

00:00.000 --> 00:07.000 Let's see.
00:07.000 --> 00:10.000 All right, hey everyone.
00:10.000 --> 00:11.000 My name is Jan-Patrick Lehr.
00:11.000 --> 00:14.000 I'm the fourth presenter that has this logo on the slide.
00:14.000 --> 00:18.000 So apologies for taking over the dev room, but apparently we're somewhat active here.
00:18.000 --> 00:20.000 So I'm going to talk about testing: it's fantastic.
00:20.000 --> 00:21.000 It's needed.
00:21.000 --> 00:25.000 And, for us, you know, you can fill in the blank with whatever you think it is for us.
00:25.000 --> 00:29.000 I think it's needed, but I would also hope it's somebody else's job.
00:29.000 --> 00:33.000 But guess what?
00:33.000 --> 00:37.000 It kind of ended up being mine anyway.
00:37.000 --> 00:39.000 Why do we actually care?
00:39.000 --> 00:40.000 Why do we care about this stuff?
00:40.000 --> 00:43.000 So we test upstream LLVM for two reasons.
00:43.000 --> 00:47.000 One is we have many people working directly in upstream LLVM.
00:47.000 --> 00:49.000 Of course, we want to support the development.
00:49.000 --> 00:52.000 We want to support our upstream developers.
00:52.000 --> 00:56.000 Most of our backend work for AMDGPU is actually done upstream.
00:56.000 --> 01:04.000 But we also want to guard downstream ROCm, because we actually merge upstream changes into downstream ROCm three times a day.
01:04.000 --> 01:11.000 And whenever something breaks upstream and it hits a certain window where we pull in the changes for the merges,
01:11.000 --> 01:13.000 we break our internal build of ROCm too.
01:13.000 --> 01:15.000 And we don't want that.
01:15.000 --> 01:20.000 And this is why we actually test upstream, and test upstream more and more and more.
01:20.000 --> 01:23.000 And so we're bringing up more upstream testing.
01:23.000 --> 01:27.000 Now, in this presentation, I'm going to talk about my journey,
01:27.000 --> 01:33.000 from when I kind of inherited the then-existing buildbots,
01:33.000 --> 01:39.000 towards bringing online more buildbots, and working towards also potentially
01:39.000 --> 01:43.000 bringing online pre-commit testing
01:43.000 --> 01:49.000 when you open a PR, such that we actually run on-GPU tests if you touch stuff that we care about.
01:49.000 --> 01:53.000 Okay, so since it's about my journey, first things first:
01:53.000 --> 01:57.000 the LLVM testing landscape. There are three different technologies in use here.
01:57.000 --> 02:02.000 One is GitHub, with GitHub Actions and GitHub runners for some of the testing.
02:02.000 --> 02:09.000 So, for example, I think libc++ fully relies on GitHub runners and GitHub Actions for its testing.
02:09.000 --> 02:11.000 Then there's Buildkite.
02:11.000 --> 02:14.000 That's where the current pre-commit testing is actually done.
02:14.000 --> 02:20.000 There's a GitHub workflow that puts together a string, which is a Buildkite pipeline definition,
02:20.000 --> 02:25.000 then sends it off to Buildkite to test your PR changes through Buildkite,
02:25.000 --> 02:28.000 and then reports the result back to GitHub.
02:28.000 --> 02:32.000 I think it's going to go away at some point, but it's still there right now.
02:32.000 --> 02:39.000 So that's pre-commit, and then there's Buildbot, and Buildbot is the whole post-commit fleet.
02:39.000 --> 02:42.000 And this talk is mostly about Buildbot.
02:43.000 --> 02:48.000 Simply because that's where I've put most of my energy so far,
02:48.000 --> 02:53.000 and pre-commit is a little newer.
02:53.000 --> 02:55.000 So first of all, Buildbot.
02:55.000 --> 02:59.000 Let's talk a little bit about terminology, and I have this slide.
02:59.000 --> 03:03.000 There was a talk from David Spickett at the 2022 LLVM Developers' Meeting.
03:03.000 --> 03:06.000 If you're interested in the buildbots, I would highly recommend this talk,
03:06.000 --> 03:11.000 because it goes into much more detail about what it is to be a buildbot maintainer.
03:11.000 --> 03:15.000 But just terminology-wise: Buildbot is orchestrated.
03:15.000 --> 03:20.000 You have a buildmaster; that's a server that's run by Galina, so thank you, Galina.
03:20.000 --> 03:23.000 And then you have what are called builders.
03:23.000 --> 03:30.000 Builders are kind of the logical entity that typically builds one configuration of LLVM.
03:30.000 --> 03:35.000 And these builders then run on a worker, and a worker is more or less the physical
03:35.000 --> 03:39.000 machine, the actual thing that does the computing.
03:39.000 --> 03:42.000 So you can have a builder-to-worker mapping of one-to-one.
03:42.000 --> 03:45.000 You can have one-to-N. You can have, you know, you see it there.
03:45.000 --> 03:50.000 In our case, we always run a single builder on a single worker.
03:50.000 --> 03:55.000 So there's none of the crazier setups for us.
03:55.000 --> 03:59.000 If you scan the QR code, that's the link to David's talk.
03:59.000 --> 04:03.000 Just beware, I have another couple of slides with QR codes.
04:03.000 --> 04:06.000 So this is the YouTube link to his talk.
04:07.000 --> 04:16.000 Oh, one thing to remember is the specification of the setup of these builders and these workers.
04:16.000 --> 04:19.000 That's maintained in what is called the llvm-zorg repository.
04:19.000 --> 04:22.000 So there's a separate repository for that.
04:22.000 --> 04:25.000 Okay, more terminology.
04:25.000 --> 04:30.000 In that repository, you will find building blocks that allow you to, you know, define your builders.
04:30.000 --> 04:33.000 So, for example, there's something that's called an annotated builder.
04:33.000 --> 04:36.000 That's kind of the build-your-own-builder thing.
04:36.000 --> 04:41.000 You have to script everything and tell it exactly the steps that this thing should do.
04:41.000 --> 04:44.000 And then there are more convenient, kind of off-the-shelf things.
04:44.000 --> 04:47.000 One, for example, is the OpenMP builder.
04:47.000 --> 04:52.000 This is kind of a standard way of building OpenMP in LLVM.
04:52.000 --> 04:57.000 And the way this looks is you have some Python code where you say:
04:57.000 --> 05:04.000 I have a builder with a name, on a worker, and it has a certain build directory.
05:04.000 --> 05:09.000 That has to be unique across the whole fleet.
05:09.000 --> 05:14.000 So that's typically why you make the build directory the same as the builder name.
05:14.000 --> 05:20.000 And then it depends on certain projects and you invoke a script, and so that's one of the annotated builders,
05:20.000 --> 05:22.000 in case the OpenMP builder doesn't fit your needs.
05:22.000 --> 05:26.000 But anyway, our current buildbot fleet roughly looks like this.
05:26.000 --> 05:31.000 This is production; we have about eight production bots right now.
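NOTE For illustration, a builder entry of the kind described above roughly looks like the Python sketch below. This is a sketch, not the exact production configuration: the builder and worker names are made up, and the factory arguments may differ from what the real bots use. It assumes an llvm-zorg checkout on PYTHONPATH so the zorg imports resolve.

from zorg.buildbot.builders import OpenMPBuilder

example_builder = {
    # Builder name; by convention the build directory is the same string,
    # since it has to be unique across the whole fleet.
    "name": "openmp-offload-amdgpu-example",
    "builddir": "openmp-offload-amdgpu-example",
    "tags": ["openmp", "offload"],
    # The worker (the machine, bare metal or container) this builder runs on.
    "workernames": ["amdgpu-example-worker-1"],
    # Off-the-shelf factory: the "standard way of building OpenMP in LLVM".
    "factory": OpenMPBuilder.getOpenMPCMakeBuildFactory(
        clean=True,
        depends_on_projects=["llvm", "clang", "lld", "openmp", "offload"],
    ),
}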
05:31.000 --> 05:36.000 And I'm now going to talk you through how we arrived here.
05:36.000 --> 05:45.000 So let's first look at what I put up here. What I call the inherited time is the initial time when I actually inherited these buildbots to take care of.
05:45.000 --> 05:51.000 Then there's an intermediate time where I played around, there's the current time, which is now, and then there's a bright future, right?
05:51.000 --> 05:56.000 Everybody likes a bright future. Okay, so let's start with the inherited time.
05:56.000 --> 06:00.000 In the inherited time we had three machines, basically.
06:00.000 --> 06:15.000 One for HIP, one for OpenMP, and one that basically did the same thing as the OpenMP one; the third machine mirrored what the OpenMP machine was doing.
06:15.000 --> 06:34.000 And then when you wanted to make changes, you would bring up a PR to, again, llvm-zorg, you would land it, and then from zorg it would trickle down to the builders. For the staging bots that would happen every two hours; for the production bots you would need to send Galina an email so that she brings it into the production bots.
06:34.000 --> 06:40.000 And this merging into staging happens, again, every two hours, unless there's a problem
06:40.000 --> 06:44.000 in the zorg repository, but you don't know,
06:44.000 --> 06:50.000 because there was no linting or testing or anything. So it was like: I submitted a change to zorg eight hours ago,
06:50.000 --> 06:54.000 it didn't trickle down; did I do something wrong, or did somebody else put up a patch?
06:54.000 --> 07:05.000 So sometimes you just got an email from Galina telling you, oh, I landed your patch because there was something broken a few commits before yours, and it was kind of a mess.
07:05.000 --> 07:12.000 But anyway, typically it worked, and the builders didn't need too much attention, so everything was fine; we had three machines.
07:12.000 --> 07:25.000 But then we discussed: there's two machines doing the same thing, and that's kind of annoying, because we were bringing in more stuff. For example, Joseph Huber started to work on libc for GPU, and so we wanted to test that, right?
07:25.000 --> 07:28.000 Okay, that's...
07:28.000 --> 07:30.000 Does it work?
07:30.000 --> 07:32.000 The intermediate time.
07:32.000 --> 07:33.000 Okay.
07:33.000 --> 07:39.000 So we had that setup and then libc came along. Okay, so let's just make this thing also test libc.
07:39.000 --> 07:43.000 Libc on GPU; so libc on GPU is tested on AMDGPU.
07:43.000 --> 07:49.000 And then there was a change that broke our internal ROCm build, but on SUSE.
07:49.000 --> 07:54.000 So on SLES 15, what's the default GCC version, anybody know?
07:54.000 --> 07:58.000 7.5.
07:58.000 --> 08:03.000 Typically developers do not use GCC 7.5 on the regular.
08:03.000 --> 08:08.000 So they use things from C++17 that are not present in GCC 7.5, right?
08:08.000 --> 08:09.000 And so that would break.
08:09.000 --> 08:15.000 So we brought online more builders, specifically a SLES builder,
08:15.000 --> 08:18.000 so we would actually catch these things when they land.
08:18.000 --> 08:21.000 We also then put in RHEL 8 and RHEL 9,
08:21.000 --> 08:25.000 and one that also does Flang testing.
08:25.000 --> 08:27.000 So we test Flang too.
08:27.000 --> 08:30.000 And that's great.
08:30.000 --> 08:35.000 But bringing all these machines online was done manually.
08:35.000 --> 08:45.000 And, you know, I leave it up to you to figure out who that person is that's represented by two people here.
08:45.000 --> 08:46.000 Okay.
08:46.000 --> 08:50.000 But one of the things you notice is that there are different icons here, right?
08:50.000 --> 08:55.000 So this one up above, that's a container, right?
08:55.000 --> 08:56.000 It's a box.
08:56.000 --> 08:57.000 It's a container.
08:57.000 --> 08:58.000 And this is bare metal.
08:58.000 --> 09:05.000 So we would actually bring in the new builders more and more containerized.
09:05.000 --> 09:10.000 And that has advantages to some extent.
09:10.000 --> 09:14.000 A couple of them: we have more machines that are larger,
09:14.000 --> 09:18.000 and with containers, we can nicely separate different builders
09:18.000 --> 09:26.000 with different OSes, put them on one machine, and organize for throughput, basically.
09:26.000 --> 09:28.000 And that's actually one of the reasons.
09:28.000 --> 09:33.000 And that brings us to the current time.
09:33.000 --> 09:34.000 Okay.
09:34.000 --> 09:41.000 So here's this.
09:41.000 --> 09:42.000 Let's see.
09:42.000 --> 09:43.000 I don't know what's coming next.
09:43.000 --> 09:44.000 There we go.
09:44.000 --> 09:45.000 Okay.
09:45.000 --> 09:47.000 Now I can talk.
09:47.000 --> 09:51.000 So then I was working on actually containerizing the other half, too.
09:51.000 --> 09:54.000 And that's nice, because of the HIP bot.
09:54.000 --> 09:57.000 That's the one that I'm least familiar with.
09:57.000 --> 09:59.000 And then we found a problem there,
09:59.000 --> 10:02.000 in the sense that it went red,
10:02.000 --> 10:06.000 and we couldn't reproduce the issue locally.
10:06.000 --> 10:11.000 So we put it back into staging and I put up a container.
10:11.000 --> 10:17.000 I created it because how the HIP bot was set up was barely documented,
10:17.000 --> 10:20.000 and only internally documented.
10:20.000 --> 10:25.000 And since I couldn't reproduce the issue locally in this containerized environment,
10:25.000 --> 10:26.000 we were like, okay,
10:26.000 --> 10:28.000 the ROCm version on the host is different,
10:28.000 --> 10:31.000 so let's update the ROCm version on the
10:31.000 --> 10:33.000 actual buildbot worker.
10:33.000 --> 10:34.000 Okay. Great.
10:34.000 --> 10:35.000 Sure.
10:35.000 --> 10:36.000 Let's do this.
10:36.000 --> 10:39.000 So it runs the updates and then it has to restart services.
10:39.000 --> 10:40.000 Okay.
10:40.000 --> 10:42.000 And then: connection timed out.
10:42.000 --> 10:43.000 Oops.
10:43.000 --> 10:45.000 Okay.
10:45.000 --> 10:47.000 I have IPMI information,
10:47.000 --> 10:52.000 so I can log into the node, except that IPMI information apparently is outdated.
10:52.000 --> 10:54.000 So I couldn't log into that node again,
10:54.000 --> 10:55.000 and I lost the builder.
10:55.000 --> 10:57.000 But thankfully
10:57.000 --> 10:58.000 I had containerized this,
10:58.000 --> 11:02.000 and I actually had put together some Ansible playbooks to deploy these things.
11:02.000 --> 11:06.000 So I was actually able to
11:06.000 --> 11:11.000 build on the work that I did and in about 30 minutes turn this thing into a container.
11:11.000 --> 11:15.000 And now it's back in production, and we are testing again,
11:15.000 --> 11:24.000 also our HIP stuff, and we're moving more and more towards this more automated approach of containerized builders,
11:24.000 --> 11:32.000 deployed via Ansible onto the actual machines.
11:32.000 --> 11:35.000 So let's talk about the bright future.
11:35.000 --> 11:37.000 Maybe just the future, but hopefully it's bright.
11:37.000 --> 11:45.000 This is where I would like to move our testing towards, because I recognize some of the problems that we have.
11:45.000 --> 11:47.000 So first, for post-commit:
11:47.000 --> 11:54.000 I would like to get everything containerized and then make a distinction between what I call slow bots and fast bots.
11:54.000 --> 11:59.000 Fast bots basically build only, with only little testing, nothing else.
11:59.000 --> 12:06.000 The idea is that those things are fast enough to test every single commit, nothing batched.
12:06.000 --> 12:08.000 Every single commit.
12:08.000 --> 12:14.000 Because if the testing is batched, which means, you know, you have three or four commits and your bot turns red,
12:14.000 --> 12:16.000 you don't know which commit broke it.
12:16.000 --> 12:21.000 So as a maintainer, you then have to go locally, revert, check:
12:21.000 --> 12:25.000 is this the breaking change? Re-apply, revert something else.
12:25.000 --> 12:28.000 And that's kind of annoying.
12:29.000 --> 12:39.000 The slow bots will actually be batching things, but will probably have a turnaround time between 45 minutes and an hour.
12:39.000 --> 12:41.000 So we could actually run some workloads through them
12:41.000 --> 12:47.000 and see, you know, what the impact is on, for example, a SPEC CPU suite or whatever.
12:47.000 --> 12:51.000 And then, of course, there's the bright future
12:52.000 --> 12:57.000 with GitHub Actions Runner Controller on Kubernetes.
12:57.000 --> 13:05.000 And so that would give us pre-commit build and pre-commit test on GPU.
13:05.000 --> 13:12.000 That's what I'm currently working on, and that's also been a journey, but that's a different topic.
13:12.000 --> 13:18.000 Okay, so, lessons learned, best practices.
13:19.000 --> 13:22.000 Mostly lessons learned for me, I guess.
13:22.000 --> 13:28.000 When I inherited the single builder: it's easy enough to maintain a single thing.
13:28.000 --> 13:31.000 And it's quite well documented, actually.
13:31.000 --> 13:35.000 I think.
13:35.000 --> 13:42.000 When I started, testing changes to the buildbot configuration was a nightmare, because it was basically non-existent.
13:42.000 --> 13:44.000 That's now much, much better.
13:44.000 --> 13:47.000 I don't remember who landed the changes, but thank you so much.
13:47.000 --> 13:50.000 It's working great.
13:50.000 --> 13:57.000 Also, production builders will send you emails if the build breaks,
13:57.000 --> 14:04.000 but only if you actually mark them as sending emails in this file; that's something I learned about six weeks ago.
14:04.000 --> 14:06.000 So this is why I put this here:
14:06.000 --> 14:12.000 if you want your builders to send you emails as the maintainer, they have to be in this file.
14:12.000 --> 14:22.000 And the next thing that I learned is: you want to have your builds, or the builders, actually the buildbots, as reproducible as possible.
14:22.000 --> 14:27.000 Because you will have a contributor asking: how can I reproduce that issue?
14:27.000 --> 14:29.000 How can I reproduce the failing build?
14:29.000 --> 14:35.000 And then you say: well, I have this machine, you know?
14:36.000 --> 14:42.000 So I've now containerized all the environments that we use,
14:42.000 --> 14:47.000 and you will find the Dockerfiles through this link here.
14:47.000 --> 14:55.000 And I've also started to put together CMake cache files that are in LLVM trunk, in the offload project.
14:55.000 --> 14:59.000 They are used as the build configs on our buildbots.
14:59.000 --> 15:03.000 You can actually download the Dockerfile, build the Docker image locally,
15:03.000 --> 15:10.000 and then simply configure with cmake -C pointing at the CMake cache file and get the exact build config that our buildbot runs.
15:10.000 --> 15:15.000 So that should make it fairly easy to at least reproduce the build environment,
15:15.000 --> 15:23.000 albeit you might not have an AMD GPU to then execute the tests the same way we do.
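NOTE As a rough sketch of that reproduction flow: the steps below show the general idea (build the published container image, then configure llvm-project inside it with the bot's CMake cache file). The image tag, Dockerfile name, cache file name, and paths are placeholders, not the exact names used by the production bots.

import subprocess

IMAGE = "llvm-offload-bot"                        # placeholder local image tag
DOCKERFILE = "Dockerfile.example"                 # placeholder; use one of the published Dockerfiles
CACHE = "offload/cmake/caches/ExampleBot.cmake"   # placeholder cache file inside llvm-project
LLVM_SRC = "/path/to/llvm-project"                # local llvm-project checkout

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Build the same container image the builder runs in.
run(["docker", "build", "-t", IMAGE, "-f", DOCKERFILE, "."])

# Configure and build llvm-project inside that image using the bot's cache file.
run(["docker", "run", "--rm", "-v", f"{LLVM_SRC}:/src", "-w", "/src", IMAGE,
     "bash", "-c", f"cmake -G Ninja -S llvm -B build -C {CACHE} && ninja -C build"])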
15:23.000 --> 15:32.000 And then finally, one of the things that I realized in the end is that if you're actually running a fleet of buildbots, that's harder.
15:32.000 --> 15:36.000 But it's not harder because of buildbots; it's because you're actually managing a fleet of stuff.
15:36.000 --> 15:38.000 Right? And that's a different thing.
15:38.000 --> 15:47.000 You have to think not only about, oh, how do I make this reproducible or whatever, but how do I manage eight machines, 16 deployments, whatever.
15:47.000 --> 15:59.000 And that's why I think you should automate deployments, through Ansible for example, because that can serve as documentation but also makes it easier in the long run.
15:59.000 --> 16:03.000 And if you found any of this interesting, we're hiring.
16:03.000 --> 16:07.000 All right, so come talk to me, come talk to other folks from AMD.
16:07.000 --> 16:10.000 I think I'm supposed to show you this.
16:10.000 --> 16:13.000 And thank you so much, I'm very happy to take questions.
16:13.000 --> 16:26.000 How specific is Buildbot to LLVM?
16:26.000 --> 16:33.000 How specific is Buildbot to LLVM? I think, according to the website, not at all.
16:33.000 --> 16:40.000 It could be used for some totally different project.
16:40.000 --> 16:45.000 And that's because it's more of a toolbox to build your own CI,
16:45.000 --> 16:49.000 kind of thing.
16:49.000 --> 16:52.000 Yeah, Buildbot might be older than LLVM, was a comment here.
16:52.000 --> 16:59.000 Okay, yes.
16:59.000 --> 17:07.000 We use ccache. The question was whether we use any caching techniques; yes, there's no way around ccache for us.
17:07.000 --> 17:11.000 Yes.
17:11.000 --> 17:16.000 The question is, do we also use it for kernel or Mesa changes? We don't.
17:16.000 --> 17:25.000 What I showed is just LLVM upstream and nothing else.
17:25.000 --> 17:29.000 No, this is not ROCm. This is pure upstream.
17:29.000 --> 17:33.000 Yeah. What we do internally, that's a whole different story, but this is pure upstream.
17:33.000 --> 17:35.000 There's nothing ROCm-specific.
17:35.000 --> 17:40.000 What we need is the kernel fusion driver for actually getting access to the GPUs.
17:40.000 --> 17:42.000 And then we have the HSA runtime on top of that.
17:42.000 --> 17:46.000 So we can run offloading tests, GPU offloading tests, but that's it.
17:46.000 --> 17:50.000 The buildbots don't actually have large ROCm installations.
17:50.000 --> 17:55.000 Yes, the HSA runtime is fixed.
17:56.000 --> 18:02.000 And we update it every now and then, and, you know, that's reflected in the builder description.
18:02.000 --> 18:10.000 So we note which ROCm version that builder is running, and so you could install the same thing.
18:10.000 --> 18:11.000 Thank you.
18:11.000 --> 18:16.000 Cool.
18:16.000 --> 18:18.000 All right.
18:25.000 --> 18:32.000 Thanks for watching.