Hello everyone, and thank you for joining this presentation. My name is Stefan. I'm currently working at Analog Devices, and today I'll present an automated way of testing a custom Linux distribution across multiple hardware platforms.

The presentation is focused on system-level testing, mostly showing its importance for the integration between hardware and software: to ensure that those components work together, a robust testing infrastructure is required. Key attributes like automation, integration, and rigorous testing are must-haves for obtaining that quality. A reliable testing infrastructure not only optimizes and decreases the overall testing time, but also reduces the probability of bugs being found in production.

Our custom Linux distribution is called Analog Devices Kuiper Linux. It's a free, open-source Linux distribution and comes with a lot of prebuilt pieces: different applications, Linux drivers, libraries, and a lot of examples. Kuiper supports multiple hardware platforms: AMD and Intel FPGAs, Raspberry Pi, and recently NXP. It also incorporates a Linux package repository, which is included by default in the distribution image, so it makes the whole update process much easier. For more details about Kuiper, my colleague Andrea will have a talk at 6 p.m., in this same track on continuous testing, about how we optimized the whole release.

The CI flow for individual components uses the classical tools: Jenkins pipelines, GitHub Actions, and so on. The CI builds the software components across multiple operating systems, and the resulting binaries, which can be Linux packages, installers, or other kinds of binaries or files, are saved directly in GitHub repositories, in the package repository, or on internal servers. This approach not only improves efficiency, but also ensures traceability, version control, and artifact accessibility, so artifacts can be reached in multiple ways from multiple other machines. Of course, that at some point required scaling the system to more and more repositories.

The testing process for hardware closely mirrors the one for software. The workflow begins by writing specific boot files to the hardware. Most of the boards support SD cards, and in a few slides I will present which solution helps with that. The tests are executed in parallel on multiple platforms, reducing the overall testing time and increasing efficiency; a small sketch of that idea follows below.
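As a minimal sketch of what "executing in parallel on multiple platforms" can look like, assuming a simple board list and per-board pytest markers (all names here are illustrative, not the actual harness code):

```python
# Minimal sketch: run the same test suite against several boards at once.
# Board names and the marker-based filtering are illustrative placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

BOARDS = ["zcu102", "zc706", "rpi4"]  # hypothetical board identifiers

def run_tests_on_board(board: str) -> tuple[str, int]:
    # Each board gets its own pytest invocation, filtered by a tag/marker,
    # so only tests compatible with that hardware are selected.
    result = subprocess.run(
        ["pytest", "tests/", "-m", board],
        capture_output=True, text=True,
    )
    return board, result.returncode

if __name__ == "__main__":
    # Threads are enough here: the real work happens on the boards,
    # not on this host.
    with ThreadPoolExecutor(max_workers=len(BOARDS)) as pool:
        for board, rc in pool.map(run_tests_on_board, BOARDS):
            print(f"{board}: {'PASS' if rc == 0 else 'FAIL'}")
```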
Most of the popular testing frameworks are focused on the application level, or on just one specific hardware platform. So that's the main reason we created our own testing framework, called the test harness, the hardware test harness. This hardware test harness is designed to unify testing across a wide range of platforms while staying consistent overall.

So, as you can see, builds are triggered by pull requests and pushes in GitHub; we also added a manual trigger and a cron trigger in Jenkins. There is a main server, and a JFrog Artifactory server that just keeps the artifacts, acting as a buffer. Once a build passes, the resulting binaries are sent to the main Jenkins server, to the main Jenkins job, and from there they are distributed and executed on multiple agents. Each agent has hardware platforms attached, usually different ones. Most of the tests are written in Python, and they have tags that say which hardware they are compatible with and where they can run.

As I said, for the intermediate server we use Artifactory from JFrog. The main reason is to have it as a kind of buffer, because we are using multiple repositories, to eliminate the race condition of having pull requests or pushes at the same time. Another reason is to be able to have multiple instances of the test harness: the Artifactory server acts as a buffer, and every instance just downloads the files from there for testing. We also use it to organize the files by timestamps and by versions, and to increase the scalability of the whole framework.

The Jenkins agents also need to be physical machines, and here is why: the hardware is wired to them, so they cannot be in the cloud, or VMs, or anything else. They handle all the connections (USB, UART, SSH), sending scripts to the hardware, overwriting the files on the SD card where SD cards are used, and so on. The build servers in our case also need to be physical machines; it's not enough to have a cloud one or a VM. One reason is that a lot of computational resources are involved, and another is, for example, the special licenses for different tools like Xilinx, Intel Quartus, MATLAB, and so on. It's pretty hard to have a VM with all those tools preinstalled and licensed.

So a good framework should be adaptable to different hardware types and testing scenarios, and modular enough to accommodate all the changes. By changes I'm referring to the number of repositories that can be added to be tested on this hardware setup, the number of devices under test and other hardware boards, or the number of automated tests, as they get modified or expanded. Many times a test manager is necessary, to queue tests, to distribute and execute them, and to collect the results (a rough sketch of that role follows below). I'll go first through some implementation details and come back to this test manager afterwards.
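As a rough illustration of what such a test manager has to do, queue jobs, hold a lock on a board while a job runs, and collect results, here is a minimal single-process sketch; the real harness distributes this across Jenkins agents, and all names are illustrative:

```python
# Toy test manager: queue jobs per board, lock each board while in use,
# collect results. Illustrative only; the real harness spreads this work
# across Jenkins agents rather than running it in one process.
import queue
import threading
from collections import defaultdict

results = []
results_lock = threading.Lock()
board_locks = defaultdict(threading.Lock)  # one lock per physical board
jobs: "queue.Queue[tuple[str, str]]" = queue.Queue()

def run_job(board: str, test: str) -> str:
    return f"{test} on {board}: PASS"  # placeholder for the real execution

def worker() -> None:
    while True:
        try:
            board, test = jobs.get_nowait()
        except queue.Empty:
            return
        with board_locks[board]:       # a board runs one job at a time
            outcome = run_job(board, test)
        with results_lock:
            results.append(outcome)
        jobs.task_done()

for board in ("zcu102", "rpi4"):
    for test in ("test_boot", "test_dma"):
        jobs.put((board, test))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("\n".join(results))
```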
Some of the tools that we use: the main one is Jenkins. There are some advantages: it can be hosted on its own, without depending on any cloud or on other sub-tools or dependencies. It also integrates easily with GitHub, with Artifactory, and with other additional tools, and there is a very big online community that helps with issues. Some other Jenkins features, like shared libraries, the dynamic scripting language, and resource locking, proved to be very useful in our use case.

The next one is an in-house developed tool called Nebula. This is actually a collection of Python scripts that manage the hardware connections, as I said: UART, Ethernet, JTAG, USB, and so on. By using PDUs (power distribution units) and USB SD card muxes, we managed to get full remote control of all the hardware; both are reachable over the network. We use them to ensure that the hardware does not remain hanging in an unusable state. More than that, both devices, the PDU and the SD card mux, are controlled through Python, so they got integrated into Nebula. So now we have a single set of commands used to do everything with the hardware.

NetBox is another tool; probably some of you know about it. It was initially designed for modeling and documenting network racks; that's why those pictures of network racks are in the background. But it fits our use case very well, because in the end we managed to put our hardware into those network racks. This became necessary once we scaled up and started to add more and more hardware, so that it would not just be lying around on desks and shelves and things like that. More than that, we use NetBox to generate the Nebula config files: for each piece of hardware we have all the information regarding connections, the device under test, and what tags or attributes that hardware has (a sketch of that generation step follows at the end of this section). The information in NetBox is touched only when a new setup or device under test needs to be added, modified, or rearranged, so regular runs don't imply manual modifications.

As I said, Jenkins shared libraries are a very good way to centralize the Groovy scripts in one single place. It can be a repo, and we can have tests on that repo. In our case it contains the definitions of common functions and pipeline steps that can be shared across multiple Jenkins servers and agents. We use it to update the agents, to update the tools, to manage the hardware, and so on. This approach ensures an efficient process of updating and maintaining the same functionality across all the machines in the test harness.

By combining the diagrams of the component CI with continuous testing, we obtain something like this. Behind it there are about 100 CI pipelines and over 10 build servers.
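Going back to the NetBox step for a moment: here is a hedged sketch of how such config generation could look. It uses the real pynetbox client, but the URL, token, role slug, custom-field names, and output schema are all assumptions; the actual Nebula config format may differ.

```python
# Sketch: pull device-under-test records from NetBox and emit a per-board
# config file. URL, token, role slug, and the output schema are assumptions.
import pynetbox
import yaml

nb = pynetbox.api("https://netbox.example.com", token="0123456789abcdef")

config = {}
for device in nb.dcim.devices.filter(role="device-under-test"):
    config[device.name] = {
        # primary_ip can be unset in NetBox, so guard against None
        "ip": str(device.primary_ip.address).split("/")[0]
              if device.primary_ip else None,
        # custom fields could hold the UART port, PDU outlet, SD-mux id...
        "uart": device.custom_fields.get("uart_port"),
        "pdu_outlet": device.custom_fields.get("pdu_outlet"),
    }

with open("nebula-config.yaml", "w") as fh:
    yaml.safe_dump(config, fh, sort_keys=True)
```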
We are using Azure, GitHub Actions, Jenkins, Docker, and so on. For most of the repositories, besides the build, we also feed the hardware test results from the board farm back to GitHub. Some components, like libraries and other things that can be tested individually, directly on the hardware on different platforms, have their results sent straight back to GitHub. Others, which have one or multiple dependencies, save the output of the build on internal servers, even if that internal server is the package manager or JFrog Artifactory, just because they need to be combined with other components before being tested.

So Linux packages are created automatically for multiple Linux distributions. And in that package repository we have two environments, testing and production, with an easy switch from one to the other.

Now let's see how the results look. First of all, there is the main Jenkins output; the standard one was not enough, of course, so we switched to something a bit better, which was Blue Ocean. Blue Ocean is just a Jenkins plugin for viewing the results of different pipelines. This helped us identify exactly the stages and the hardware where a problem was found, but it was still not enough. Then we switched to JUnit reports with some graphs, and in the end we moved to Logstash for processing results, Elasticsearch for storing them in a database, and Kibana for generating all the graphs and so on. Even so, none of those was good enough, because developers still needed to go into Jenkins, or into a database of results, and look for their pull request results to know whether it was OK to merge or not, even with the results shown in graphs and dashboards.

So the final step was to somehow close the loop. This was one of the most important features of the test harness, and the main challenge here was to ensure that private data is not exposed in the public repositories: any kind of Jenkins link, internal site, IP, and so on. The solution here was gists: posting a gist so that, besides the build status, the hardware test results are also available. With this system in place, we were finally able to mark the CI as required to pass in GitHub, so that we ensure the stability of the branches.

As I said, we can recover hardware setups with that PDU and SD card mux. This mattered because it was a common issue that the boot files produced by the CI were not good enough; in that case, hardware setups were hanging and required manual intervention. Now, with this system in place, the framework detects if the board is not booting, whether that board is an FPGA, a Raspberry Pi, or another type. And we have a kind of golden file system, a reliable baseline of boot files that we overwrite every time a boot passes, as sketched below.
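The recovery flow is easier to see as a hedged Python sketch; every helper name here (the mux, PDU, and boot-detection calls) is an illustrative stand-in for the real Nebula tooling, not its actual API:

```python
# Sketch of the recovery loop: if the freshly built boot files don't boot,
# fall back to the known-good "golden" set so the setup never stays hung.
# All helpers marked "stand-in" are placeholders for the real commands.
import shutil
from pathlib import Path

GOLDEN = Path("/srv/golden/zcu102")     # last known-good boot files
STAGING = Path("/srv/staging/zcu102")   # boot files produced by this CI run

def sd_mux_to_host(): ...    # stand-in: expose the SD card to this machine
def sd_mux_to_board(): ...   # stand-in: hand the SD card back to the board
def power_cycle(): ...       # stand-in: toggle the PDU outlet
def board_booted(timeout: int = 120) -> bool:
    ...                      # stand-in: e.g. watch UART for a login prompt

def write_boot_files(src: Path, sd_mount: Path) -> None:
    sd_mux_to_host()
    for f in src.iterdir():
        shutil.copy2(f, sd_mount / f.name)
    sd_mux_to_board()

def boot_with_recovery(sd_mount: Path) -> bool:
    write_boot_files(STAGING, sd_mount)
    power_cycle()
    if board_booted():
        # Boot passed: promote these files to the new golden baseline.
        shutil.copytree(STAGING, GOLDEN, dirs_exist_ok=True)
        return True
    # Boot failed: restore the golden files and power-cycle again.
    write_boot_files(GOLDEN, sd_mount)
    power_cycle()
    return False
```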
So every time a boot passes, we overwrite those golden files. However, there are still some rare scenarios where the hardware is broken or fully disconnected and requires manual intervention. But we reduced manual intervention a lot with this.

So, now that I have been through all the main tools that we used, let's see the overview diagram. On the left side you can see the triggering mechanisms, the Jenkinsfiles, and the Jenkins server; the main Jenkins server manages the testing requests from multiple GitHub repositories. We have multiple test harness instances, as I said, and all the results are collected together with Elasticsearch and Kibana, and most of them are sent back to GitHub. The test harness now supports tests written in different languages: Python, C++, MATLAB, and so on. Hardware boards are locked only while tests are running; otherwise they remain accessible remotely, so people can connect to the boards and do debugging or development. And the results are well structured and presented clearly, as you saw, with gists back in GitHub.

And now let's see how it looks in the real world. This was the prototype a few years ago, in the early stages, with the boards connected to each other, lying around on a desk there. At that point we were working on adding support for multiple platforms, ARM64, everything related to the FPGAs from Intel, and so on; we also tried at some point to use Raspberry Pis as Jenkins agents. And this is how it looks now: we have two racks filled with hardware, almost fully controlled remotely. It supports Kuiper Linux, the distribution that I talked about in the beginning, and we are working on adding Yocto support on more hardware, and on a somewhat better distribution of the hardware across multiple physical locations, so as to have one single setup, or just a few setups, per location, connected to the main Jenkins servers, with the results collected together and sent back to GitHub.

In conclusion, we have managed to implement a complex testing framework that can be triggered from multiple GitHub repos, while still keeping the cron and manual triggering from Jenkins. Hardware setups remain accessible through remote connections, allowing colleagues to do development or debugging on them. It supports multiple platforms and can run tests in different languages. The resources got optimized by using Jenkins agents inside Docker containers, which means that to every physical machine we can connect multiple hardware setups, like three or four. There is a robust recovery mechanism in place, and manual intervention got reduced a lot. And the test results are well structured and piped back to GitHub, ensuring in this way that bugs are found as early as possible.
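To make that last point concrete, posting results back to GitHub, here is a hedged sketch using GitHub's public gist and commit-status REST endpoints; the token, repository, SHA, and report text are all placeholders, and the harness's real payloads may differ:

```python
# Sketch: publish a test summary as a secret gist, then point a commit
# status at it so the pull request shows pass/fail without exposing any
# internal Jenkins links. Token, repo, SHA and report are placeholders.
import requests

TOKEN = "ghp_..."  # placeholder token with gist and repo:status scopes
HEADERS = {"Authorization": f"token {TOKEN}",
           "Accept": "application/vnd.github+json"}

def post_gist(report: str) -> str:
    resp = requests.post(
        "https://api.github.com/gists",
        headers=HEADERS,
        json={"description": "Hardware test results",
              "public": False,
              "files": {"results.md": {"content": report}}},
    )
    resp.raise_for_status()
    return resp.json()["html_url"]

def set_commit_status(repo: str, sha: str, ok: bool, url: str) -> None:
    requests.post(
        f"https://api.github.com/repos/{repo}/statuses/{sha}",
        headers=HEADERS,
        json={"state": "success" if ok else "failure",
              "target_url": url,  # reviewers land on the gist, not Jenkins
              "context": "hardware-tests"},
    ).raise_for_status()

if __name__ == "__main__":
    gist = post_gist("zcu102: 42 passed, 0 failed\nrpi4: 40 passed, 2 failed")
    set_commit_status("example-org/example-repo", "deadbeef" * 5, False, gist)
```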
So that's how we streamlined the software and hardware integration and the testing process. Thank you.

[Audience question, partly inaudible, about pushing buttons on hardware.]

Sure, for pushing buttons on hardware, yeah, there are multiple solutions. So the question was whether we have a solution for pushing buttons. Yes, there are multiple solutions. You can connect wires there and use a Raspberry Pi, as the case may be; but the device that I showed on a previous slide, the power distribution unit, is able to do the hardware reboot, or power off and power on. If you only need power control, it's OK to use that power distribution unit. If you really need to push the buttons and you cannot do it from software, you can use a Raspberry Pi, a microcontroller, or some other hardware, and connect wires to the buttons. Yeah. No, it's not required in our use case.

[Audience question, partly inaudible.]

Yeah, I will repeat the question. So the question was whether there is any solution to flash the images onto disks. Yes: we have that SD card mux, if you are referring to SD cards or anything similar. We have platforms that support SD cards, and that SD card mux works like a mux between the real SD card and USB. So we can switch it so the card is seen by the host, a laptop or machine, as a drive attached to the host, or we can switch it back to acting as the card attached to the hardware. And since we control the switch between the two modes, combined with the power distribution unit, we know exactly when to reboot the board, a hardware reboot, and we can do that. For other boards that do not have an SD card and work just by being programmed directly, we also have solutions over JTAG: there needs to be a host, a laptop or some kind of physical machine, that the hardware is connected to, and through JTAG or other solutions we push files, MicroBlaze applications for example; for Yocto it's the same. And in the worst case, where there is no SD card, the boards usually have a manual switch that needs to be put into programming mode or booting mode, let's say. That manual switch is handled with wires, just wires soldered there, which can then be toggled remotely, with a Raspberry Pi for example.

So, just one small question about the SD wire module: do you plug it directly into the dev board? Yes. Or have you ever tried using an SD card extender? Because I have a bunch of them; they work when I plug them in directly, but I need the SD card extender to reach into some of our modules, and then it doesn't work.

I'll repeat the question. So the question is about the SD card mux, from this picture on this slide: whether we plug it directly into the boards. The answer is yes. As you can see here, it has the shape of an SD card; here is the real SD card, and here the USB from the host.
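To make the earlier flashing answer concrete, here is a minimal sketch of the mux-flash-boot sequence; switch_mux and pdu_set are illustrative stand-ins for whatever CLI or API actually drives the mux and the PDU, and the image path and block device are placeholders:

```python
# Sketch of the image-flashing flow described above: mux the SD card to the
# host, write the image, mux it back, then power-cycle via the PDU.
import subprocess

def switch_mux(target: str) -> None:
    assert target in ("host", "board")
    ...  # stand-in: call the mux's CLI or toggle its USB control channel

def pdu_set(outlet: int, state: str) -> None:
    ...  # stand-in: talk to the PDU (SNMP, HTTP, vendor CLI, ...)

def flash_and_boot(image: str, sd_device: str, outlet: int) -> None:
    pdu_set(outlet, "off")   # never write while the board is powered
    switch_mux("host")       # the SD card now appears on the host
    subprocess.run(
        ["dd", f"if={image}", f"of={sd_device}", "bs=4M", "conv=fsync"],
        check=True,
    )
    switch_mux("board")      # hand the card back to the board
    pdu_set(outlet, "on")    # cold boot from the freshly written image

if __name__ == "__main__":
    flash_and_boot("kuiper.img", "/dev/sdX", outlet=3)  # placeholder paths
```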
Some problems that can appear with the mux are related to the voltage level. So the SD card interface on the board may use a different speed or voltage level compared with this small device. In that case, you just need to go into the hardware and do some tricks there to switch to another speed or another level. Yeah, the voltage, yeah. In most cases this can be done directly from the boot files, the device tree, yes.

Is this off the shelf, or is it a custom design? Well, this kind of device can be found on the internet, and actually there are multiple models, yes. But no, it's not the one some of you may know; this one is something developed internally, and there is documentation for it, our documentation.

What is the name of the board? Of the board? It's the USB SD card mux.

Sorry, we're out of time. Thank you.