Hello everyone, and thank you for joining this presentation. My name is Stefan. I'm currently working at Analog Devices, and today I'll present an automated way of testing a custom Linux distribution across multiple hardware platforms.

The presentation is focused on system-level testing, mostly showing its importance for the integration between hardware and software: to ensure that those components work together, a robust testing infrastructure is required. Key attributes like automation, integration, and rigorous testing are must-haves for obtaining that quality. A reliable testing infrastructure not only optimizes and decreases the overall testing time, but also reduces the probability of bugs being found in production.

Our custom Linux distribution is called Analog Devices Kuiper Linux. It's a free, open-source Linux distribution and comes with a lot of prebuilt pieces: different applications, Linux drivers, libraries, and a lot of examples. Kuiper supports multiple hardware platforms: AMD and Intel FPGAs, Raspberry Pi, and recently NXP. It also incorporates a Linux package repository, which is included by default in the distribution image, so it makes the whole update process much easier. For more details about Kuiper, my colleague Andrea will have a talk at 6 p.m., in this same track on continuous testing, about how we optimized the whole release.

The CI flow for individual components uses the classical tools: Jenkins pipelines, GitHub Actions, and so on. The CI builds the software components across multiple operating systems, and the resulting binaries, which can be Linux packages, installers, or other kinds of binaries or files, are saved directly in GitHub repositories, in the package repository, or on internal servers. This approach not only improves efficiency, but also ensures traceability, version control, and artifact accessibility, so artifacts can be reached in multiple ways from multiple other machines. Of course, that at some point required scaling the system to more and more repositories.

The testing process for hardware closely mirrors the one for software. The workflow begins by writing specific boot files to the hardware. Most of the boards support SD cards, and in a few slides I will present which solution helps with that. The tests are executed in parallel on multiple platforms, reducing the overall testing time and increasing efficiency; a small sketch of that idea follows below.
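As a minimal sketch of what "executing in parallel on multiple platforms" can look like, assuming a simple board list and per-board pytest markers (all names here are illustrative, not the actual harness code):

```python
# Minimal sketch: run the same test suite against several boards at once.
# Board names and the marker-based filtering are illustrative placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

BOARDS = ["zcu102", "zc706", "rpi4"]  # hypothetical board identifiers

def run_tests_on_board(board: str) -> tuple[str, int]:
    # Each board gets its own pytest invocation, filtered by a tag/marker,
    # so only tests compatible with that hardware are selected.
    result = subprocess.run(
        ["pytest", "tests/", "-m", board],
        capture_output=True, text=True,
    )
    return board, result.returncode

if __name__ == "__main__":
    # Threads are enough here: the real work happens on the boards,
    # not on this host.
    with ThreadPoolExecutor(max_workers=len(BOARDS)) as pool:
        for board, rc in pool.map(run_tests_on_board, BOARDS):
            print(f"{board}: {'PASS' if rc == 0 else 'FAIL'}")
```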
Most of the popular testing frameworks are focused on the application level, or on just one specific hardware platform. So that's the main reason we created our own testing framework, called the test harness, the hardware test harness. This hardware test harness is designed to unify testing across a wide range of platforms while staying consistent overall.

So, as you can see, builds are triggered by pull requests and pushes in GitHub; we also added a manual trigger and a cron trigger in Jenkins. There is a main server, and a JFrog Artifactory server that just keeps the artifacts, acting as a buffer. Once a build passes, the resulting binaries are sent to the main Jenkins server, to the main Jenkins job, and from there they are distributed and executed on multiple agents. Each agent has hardware platforms attached, usually different ones. Most of the tests are written in Python, and they have tags that say which hardware they are compatible with and where they can run.

As I said, for the intermediate server we use Artifactory from JFrog. The main reason is to have it as a kind of buffer, because we are using multiple repositories, to eliminate the race condition of having pull requests or pushes at the same time. Another reason is to be able to have multiple instances of the test harness: the Artifactory server acts as a buffer, and every instance just downloads the files from there for testing. We also use it to organize the files by timestamps and by versions, and to increase the scalability of the whole framework.

The Jenkins agents also need to be physical machines, and here is why: the hardware is wired to them, so they cannot be in the cloud, or VMs, or anything else. They handle all the connections (USB, UART, SSH), sending scripts to the hardware, overwriting the files on the SD card where SD cards are used, and so on. The build servers in our case also need to be physical machines; it's not enough to have a cloud one or a VM. One reason is that a lot of computational resources are involved, and another is, for example, the special licenses for different tools like Xilinx, Intel Quartus, MATLAB, and so on. It's pretty hard to have a VM with all those tools preinstalled and licensed.

So a good framework should be adaptable to different hardware types and testing scenarios, and modular enough to accommodate all the changes. By changes I'm referring to the number of repositories that can be added to be tested on this hardware setup, the number of devices under test and other hardware boards, or the number of automated tests, as they get modified or expanded. Many times a test manager is necessary, to queue tests, to distribute and execute them, and to collect the results (a rough sketch of that role follows below). I'll go first through some implementation details and come back to this test manager afterwards.
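As a rough illustration of what such a test manager has to do, queue jobs, hold a lock on a board while a job runs, and collect results, here is a minimal single-process sketch; the real harness distributes this across Jenkins agents, and all names are illustrative:

```python
# Toy test manager: queue jobs per board, lock each board while in use,
# collect results. Illustrative only; the real harness spreads this work
# across Jenkins agents rather than running it in one process.
import queue
import threading
from collections import defaultdict

results = []
results_lock = threading.Lock()
board_locks = defaultdict(threading.Lock)  # one lock per physical board
jobs: "queue.Queue[tuple[str, str]]" = queue.Queue()

def run_job(board: str, test: str) -> str:
    return f"{test} on {board}: PASS"  # placeholder for the real execution

def worker() -> None:
    while True:
        try:
            board, test = jobs.get_nowait()
        except queue.Empty:
            return
        with board_locks[board]:       # a board runs one job at a time
            outcome = run_job(board, test)
        with results_lock:
            results.append(outcome)
        jobs.task_done()

for board in ("zcu102", "rpi4"):
    for test in ("test_boot", "test_dma"):
        jobs.put((board, test))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("\n".join(results))
```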
Some of the tools that we use: the main one is Jenkins. There are some advantages: it can be hosted on its own, without depending on any cloud or on other sub-tools or dependencies. It also integrates easily with GitHub, with Artifactory, and with other additional tools, and there is a very big online community that helps with issues. Some other Jenkins features, like shared libraries, the dynamic scripting language, and resource locking, proved to be very useful in our use case.

The next one is an in-house developed tool called Nebula. This is actually a collection of Python scripts that manage the hardware connections, as I said: UART, Ethernet, JTAG, USB, and so on. By using PDUs (power distribution units) and USB SD card muxes, we managed to get full remote control of all the hardware; both are reachable over the network. We use them to ensure that the hardware does not remain hanging in an unusable state. More than that, both devices, the PDU and the SD card mux, are controlled through Python, so they got integrated into Nebula. So now we have a single set of commands used to do everything with the hardware.

NetBox is another tool; probably some of you know about it. It was initially designed for modeling and documenting network racks; that's why those pictures of network racks are in the background. But it fits our use case very well, because in the end we managed to put our hardware into those network racks. This became necessary once we scaled up and started to add more and more hardware, so that it would not just be lying around on desks and shelves and things like that. More than that, we use NetBox to generate the Nebula config files: for each piece of hardware we have all the information regarding connections, the device under test, and what tags or attributes that hardware has (a sketch of that generation step follows at the end of this section). The information in NetBox is touched only when a new setup or device under test needs to be added, modified, or rearranged, so regular runs don't imply manual modifications.

As I said, Jenkins shared libraries are a very good way to centralize the Groovy scripts in one single place. It can be a repo, and we can have tests on that repo. In our case it contains the definitions of common functions and pipeline steps that can be shared across multiple Jenkins servers and agents. We use it to update the agents, to update the tools, to manage the hardware, and so on. This approach ensures an efficient process of updating and maintaining the same functionality across all the machines in the test harness.

By combining the diagrams of the component CI with continuous testing, we obtain something like this. Behind it there are about 100 CI pipelines and over 10 build servers.
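Going back to the NetBox step for a moment: here is a hedged sketch of how such config generation could look. It uses the real pynetbox client, but the URL, token, role slug, custom-field names, and output schema are all assumptions; the actual Nebula config format may differ.

```python
# Sketch: pull device-under-test records from NetBox and emit a per-board
# config file. URL, token, role slug, and the output schema are assumptions.
import pynetbox
import yaml

nb = pynetbox.api("https://netbox.example.com", token="0123456789abcdef")

config = {}
for device in nb.dcim.devices.filter(role="device-under-test"):
    config[device.name] = {
        # primary_ip can be unset in NetBox, so guard against None
        "ip": str(device.primary_ip.address).split("/")[0]
              if device.primary_ip else None,
        # custom fields could hold the UART port, PDU outlet, SD-mux id...
        "uart": device.custom_fields.get("uart_port"),
        "pdu_outlet": device.custom_fields.get("pdu_outlet"),
    }

with open("nebula-config.yaml", "w") as fh:
    yaml.safe_dump(config, fh, sort_keys=True)
```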
We are using Azure, GitHub Actions, Jenkins, Docker, and so on. For most of the repositories, besides the build, we also feed the hardware test results from the board farm back to GitHub. Some components, like libraries and other things that can be tested individually, directly on the hardware on different platforms, have their results sent straight back to GitHub. Others, which have one or multiple dependencies, save the output of the build on internal servers, even if that internal server is the package manager or JFrog Artifactory, just because they need to be combined with other components before being tested.

So Linux packages are created automatically for multiple Linux distributions. And in that package repository we have two environments, testing and production, with an easy switch from one to the other.

Now let's see how the results look. First of all, there is the main Jenkins output; the standard one was not enough, of course, so we switched to something a bit better, which was Blue Ocean. Blue Ocean is just a Jenkins plugin for viewing the results of different pipelines. This helped us identify exactly the stages and the hardware where a problem was found, but it was still not enough. Then we switched to JUnit reports with some graphs, and in the end we moved to Logstash for processing results, Elasticsearch for storing them in a database, and Kibana for generating all the graphs and so on. Even so, none of those was good enough, because developers still needed to go into Jenkins, or into a database of results, and look for their pull request results to know whether it was OK to merge or not, even with the results shown in graphs and dashboards.

So the final step was to somehow close the loop. This was one of the most important features of the test harness, and the main challenge here was to ensure that private data is not exposed in the public repositories: any kind of Jenkins link, internal site, IP, and so on. The solution here was gists: posting a gist so that, besides the build status, the hardware test results are also available. With this system in place, we were finally able to mark the CI as required to pass in GitHub, so that we ensure the stability of the branches.

As I said, we can recover hardware setups with that PDU and SD card mux. This mattered because it was a common issue that the boot files produced by the CI were not good enough; in that case, hardware setups were hanging and required manual intervention. Now, with this system in place, the framework detects if the board is not booting, whether that board is an FPGA, a Raspberry Pi, or another type. And we have a kind of golden file system, a reliable baseline of boot files that we overwrite every time a boot passes, as sketched below.
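The recovery flow is easier to see as a hedged Python sketch; every helper name here (the mux, PDU, and boot-detection calls) is an illustrative stand-in for the real Nebula tooling, not its actual API:

```python
# Sketch of the recovery loop: if the freshly built boot files don't boot,
# fall back to the known-good "golden" set so the setup never stays hung.
# All helpers marked "stand-in" are placeholders for the real commands.
import shutil
from pathlib import Path

GOLDEN = Path("/srv/golden/zcu102")     # last known-good boot files
STAGING = Path("/srv/staging/zcu102")   # boot files produced by this CI run

def sd_mux_to_host(): ...    # stand-in: expose the SD card to this machine
def sd_mux_to_board(): ...   # stand-in: hand the SD card back to the board
def power_cycle(): ...       # stand-in: toggle the PDU outlet
def board_booted(timeout: int = 120) -> bool:
    ...                      # stand-in: e.g. watch UART for a login prompt

def write_boot_files(src: Path, sd_mount: Path) -> None:
    sd_mux_to_host()
    for f in src.iterdir():
        shutil.copy2(f, sd_mount / f.name)
    sd_mux_to_board()

def boot_with_recovery(sd_mount: Path) -> bool:
    write_boot_files(STAGING, sd_mount)
    power_cycle()
    if board_booted():
        # Boot passed: promote these files to the new golden baseline.
        shutil.copytree(STAGING, GOLDEN, dirs_exist_ok=True)
        return True
    # Boot failed: restore the golden files and power-cycle again.
    write_boot_files(GOLDEN, sd_mount)
    power_cycle()
    return False
```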
So every time a boot passes, we overwrite those golden files. However, there are still some rare scenarios where the hardware is broken or fully disconnected and requires manual intervention. But we reduced manual intervention a lot with this.

So, now that I have been through all the main tools that we used, let's see the overview diagram. On the left side you can see the triggering mechanisms, the Jenkinsfiles, and the Jenkins server; the main Jenkins server manages the testing requests from multiple GitHub repositories. We have multiple test harness instances, as I said, and all the results are collected together with Elasticsearch and Kibana, and most of them are sent back to GitHub. The test harness now supports tests written in different languages: Python, C++, MATLAB, and so on. Hardware boards are locked only while tests are running; otherwise they remain accessible remotely, so people can connect to the boards and do debugging or development. And the results are well structured and presented clearly, as you saw, with gists back in GitHub.

And now let's see how it looks in the real world. This was the prototype a few years ago, in the early stages, with the boards connected to each other, lying around on a desk there. At that point we were working on adding support for multiple platforms, ARM64, everything related to the FPGAs from Intel, and so on; we also tried at some point to use Raspberry Pis as Jenkins agents. And this is how it looks now: we have two racks filled with hardware, almost fully controlled remotely. It supports Kuiper Linux, the distribution that I talked about in the beginning, and we are working on adding Yocto support on more hardware, and on a somewhat better distribution of the hardware across multiple physical locations, so as to have one single setup, or just a few setups, per location, connected to the main Jenkins servers, with the results collected together and sent back to GitHub.

In conclusion, we have managed to implement a complex testing framework that can be triggered from multiple GitHub repos, while still keeping the cron and manual triggering from Jenkins. Hardware setups remain accessible through remote connections, allowing colleagues to do development or debugging on them. It supports multiple platforms and can run tests in different languages. The resources got optimized by using Jenkins agents inside Docker containers, which means that to every physical machine we can connect multiple hardware setups, like three or four. There is a robust recovery mechanism in place, and manual intervention got reduced a lot. And the test results are well structured and piped back to GitHub, ensuring in this way that bugs are found as early as possible.
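To make that last point concrete, posting results back to GitHub, here is a hedged sketch using GitHub's public gist and commit-status REST endpoints; the token, repository, SHA, and report text are all placeholders, and the harness's real payloads may differ:

```python
# Sketch: publish a test summary as a secret gist, then point a commit
# status at it so the pull request shows pass/fail without exposing any
# internal Jenkins links. Token, repo, SHA and report are placeholders.
import requests

TOKEN = "ghp_..."  # placeholder token with gist and repo:status scopes
HEADERS = {"Authorization": f"token {TOKEN}",
           "Accept": "application/vnd.github+json"}

def post_gist(report: str) -> str:
    resp = requests.post(
        "https://api.github.com/gists",
        headers=HEADERS,
        json={"description": "Hardware test results",
              "public": False,
              "files": {"results.md": {"content": report}}},
    )
    resp.raise_for_status()
    return resp.json()["html_url"]

def set_commit_status(repo: str, sha: str, ok: bool, url: str) -> None:
    requests.post(
        f"https://api.github.com/repos/{repo}/statuses/{sha}",
        headers=HEADERS,
        json={"state": "success" if ok else "failure",
              "target_url": url,  # reviewers land on the gist, not Jenkins
              "context": "hardware-tests"},
    ).raise_for_status()

if __name__ == "__main__":
    gist = post_gist("zcu102: 42 passed, 0 failed\nrpi4: 40 passed, 2 failed")
    set_commit_status("example-org/example-repo", "deadbeef" * 5, False, gist)
```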
So that's how we streamlined the software and hardware integration and the testing process. Thank you.

[Audience question, partly inaudible, about pushing buttons on hardware.]

Sure, for pushing buttons on hardware, yeah, there are multiple solutions. So the question was whether we have a solution for pushing buttons. Yes, there are multiple solutions. You can connect wires there and use a Raspberry Pi, as the case may be; but the device that I showed on a previous slide, the power distribution unit, is able to do the hardware reboot, or power off and power on. If you only need power control, it's OK to use that power distribution unit. If you really need to push the buttons and you cannot do it from software, you can use a Raspberry Pi, a microcontroller, or some other hardware, and connect wires to the buttons. Yeah. No, it's not required in our use case.

[Audience question, partly inaudible.]

Yeah, I will repeat the question. So the question was whether there is any solution to flash the images onto disks. Yes: we have that SD card mux, if you are referring to SD cards or anything similar. We have platforms that support SD cards, and that SD card mux works like a mux between the real SD card and USB. So we can switch it so the card is seen by the host, a laptop or machine, as a drive attached to the host, or we can switch it back to acting as the card attached to the hardware. And since we control the switch between the two modes, combined with the power distribution unit, we know exactly when to reboot the board, a hardware reboot, and we can do that. For other boards that do not have an SD card and work just by being programmed directly, we also have solutions over JTAG: there needs to be a host, a laptop or some kind of physical machine, that the hardware is connected to, and through JTAG or other solutions we push files, MicroBlaze applications for example; for Yocto it's the same. And in the worst case, where there is no SD card, the boards usually have a manual switch that needs to be put into programming mode or booting mode, let's say. That manual switch is handled with wires, just wires soldered there, which can then be toggled remotely, with a Raspberry Pi for example.

So, just one small question about the SD wire module: do you plug it directly into the dev board? Yes. Or have you ever tried using an SD card extender? Because I have a bunch of them; they work when I plug them in directly, but I need the SD card extender to reach into some of our modules, and then it doesn't work.

I'll repeat the question. So the question is about the SD card mux, from this picture on this slide: whether we plug it directly into the boards. The answer is yes. As you can see here, it has the shape of an SD card; here is the real SD card, and here the USB from the host.
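To make the earlier flashing answer concrete, here is a minimal sketch of the mux-flash-boot sequence; switch_mux and pdu_set are illustrative stand-ins for whatever CLI or API actually drives the mux and the PDU, and the image path and block device are placeholders:

```python
# Sketch of the image-flashing flow described above: mux the SD card to the
# host, write the image, mux it back, then power-cycle via the PDU.
import subprocess

def switch_mux(target: str) -> None:
    assert target in ("host", "board")
    ...  # stand-in: call the mux's CLI or toggle its USB control channel

def pdu_set(outlet: int, state: str) -> None:
    ...  # stand-in: talk to the PDU (SNMP, HTTP, vendor CLI, ...)

def flash_and_boot(image: str, sd_device: str, outlet: int) -> None:
    pdu_set(outlet, "off")   # never write while the board is powered
    switch_mux("host")       # the SD card now appears on the host
    subprocess.run(
        ["dd", f"if={image}", f"of={sd_device}", "bs=4M", "conv=fsync"],
        check=True,
    )
    switch_mux("board")      # hand the card back to the board
    pdu_set(outlet, "on")    # cold boot from the freshly written image

if __name__ == "__main__":
    flash_and_boot("kuiper.img", "/dev/sdX", outlet=3)  # placeholder paths
```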
Some problems that can appear with the mux are related to the voltage level. So the SD card interface on the board may use a different speed or voltage level compared with this small device. In that case, you just need to go into the hardware and do some tricks there to switch to another speed or another level. Yeah, the voltage, yeah. In most cases this can be done directly from the boot files, the device tree, yes.

Is this off the shelf, or is it a custom design? Well, this kind of device can be found on the internet, and actually there are multiple models, yes. But no, it's not the one some of you may know; this one is something developed internally, and there is documentation for it, our documentation.

What is the name of the board? Of the board? It's the USB SD card mux.

Sorry, we're out of time. Thank you.