WEBVTT 00:00.000 --> 00:25.000 I count as two Italian people. So, yes, good afternoon. I work at CERN, so it is a bit intimidating to present under the electron mass and charge constants up there. 00:25.000 --> 00:41.000 I am a computer scientist, and today I am presenting a project I worked on with a colleague, a student at the University of Bologna, on investigating our software a bit: 00:41.000 --> 00:55.000 visualizing it and leveraging the knowledge that we can extract from it. 00:55.000 --> 01:05.000 I am a computer scientist doing mostly system administration work, but I also helped build the CERN OSPO, and I am responsible for it 01:05.000 --> 01:12.000 when I am not working on Linux systems. I have just a couple of introductory slides. I don't know if you have ever been to CERN. 01:12.000 --> 01:21.000 Raise your hand if you have not. Not that many. So, we are an international particle physics laboratory on the border between Switzerland and France. 01:21.000 --> 01:37.000 The research facility is basically a chain of accelerators, which are underground because it is more practical 01:37.000 --> 01:45.000 to accelerate particles there and provide colliding beams to scientists. 01:45.000 --> 01:52.000 We are an international organization, publicly funded by member states, mostly European. 01:52.000 --> 02:07.000 We have around 3,000 employed members of personnel, but a whole set of different scientists also come over for short or long periods of time to do their experiments, 02:07.000 --> 02:13.000 from many different nationalities. The product of the research looks like this.
02:13.000 --> 02:20.000 I have to show the pictures of all four experiments, because this is recorded and otherwise 02:20.000 --> 02:29.000 we would be in trouble. So basically we provide those colliding beams, and then the particles essentially explode. 02:29.000 --> 02:35.000 I don't know if you are familiar with Einstein's equation, which always has to be mentioned in a CERN presentation. 02:35.000 --> 02:47.000 We routinely see it happen in one direction, mass becoming energy, typically when you burn something, but we never see it happen in the other direction. 02:47.000 --> 02:54.000 Here, at very high levels of energy, we do see it happen: a concentration of energy creates new particles. 02:54.000 --> 03:00.000 This is roughly how the physicists work. I apologize if you are a physicist, because I am not. 03:00.000 --> 03:15.000 One important part of our convention is the idea, written explicitly in the rules, that we need to share the products of the research and make them available. 03:15.000 --> 03:28.000 So there is a long history of sharing computing and free and open source software and hardware: sharing and distributing software in the 1980s, 03:28.000 --> 03:33.000 and the web, invented here in 1989, initially released into the public domain 03:33.000 --> 03:40.000 and then as open source. I will go quickly through the history to get to the point. 03:40.000 --> 03:49.000 It is also important to understand how the landscape of the free software ecosystem and the IT services ecosystem has evolved. 03:49.000 --> 04:02.000 We also wrote the CERN Open Hardware Licence, and around ten years ago there was a big shift to adopting many, many products in a very, let's say, grassroots way.
04:02.000 --> 04:09.000 Since we have so many visiting people, they come, install things, those things become critical for an experiment, and then they leave. 04:09.000 --> 04:15.000 So it is a very diverse landscape, and we tried to convince, and managed to convince, the organization. 04:15.000 --> 04:17.000 [Audience] Is this an actual whiteboard somewhere? 04:17.000 --> 04:23.000 Yes, all the images are taken from our internal repository. 04:23.000 --> 04:35.000 So we managed to convince the organization that we needed a more systematic approach to free and open source software and hardware, and we got an OSPO approved with an internal mandate. 04:35.000 --> 04:40.000 And since we also needed to have security on board, they said yes, but asked us to put into the mandate 04:40.000 --> 04:47.000 the fact that we want to track our open source dependencies. It was not only about security. [Audience] Speak closer to the microphone, sorry. 04:47.000 --> 04:49.000 Yes, thank you. 04:50.000 --> 04:58.000 So basically, one part of the mandate is to help the organization track dependencies on open source components. 04:58.000 --> 05:15.000 And so we are back to the problem that we are trying to solve, why we are here, and the project that Gianluca has been working on, which turned into his master's thesis. We have a very diverse ecosystem of internal services: 05:15.000 --> 05:33.000 a few hundred that are officially declared and that we know of; sometimes, when something breaks, we discover new ones. We have an internal GitLab instance with more than 140,000 projects in very diverse states, 05:33.000 --> 05:44.000 and of course in all the languages. This is called academic freedom, apparently, and it is very nice because it allows quick adoption
05:44.000 --> 06:13.000 and a lot of agility, but it also makes it complicated to keep things under control. There are some areas of the organization that are very tightly controlled: the accelerator software used for running the control systems is obviously not developed the way the data scientists analyzing the data work. So there are also very different levels of maturity. My colleagues running the accelerator have a fully integrated SBOM supply chain: 06:13.000 --> 06:35.000 generation, signing, extraction, centralized tracking. In other places, people do not even know what an SBOM is. So the goal was to end up with a comprehensive view that could be used both for security and for ourselves as the open source programme office, to show the organization what we have, 06:35.000 --> 07:02.000 in order to be able to contribute better to the projects we rely on, and to discover, for instance, that many different projects depend on a single library, and things like that. So we came up with a computer science approach, a knowledge representation approach, divided into three levels. 07:03.000 --> 07:20.000 The first level is just organizing the data and formalizing it as a knowledge graph, which has been mentioned already; then being able to drill down and visualize the dependencies; and then drilling down even to the functional level for analyzing specific projects. 07:20.000 --> 07:38.000 First of all, the first problem was how to actually gather and represent the data, because there is so much of it, and the diversity of maturity levels was really very challenging to cope with.
07:38.000 --> 08:01.000 So, of course, we ended up with a survey: we went and asked people what they run. That was actually a good thing, because we could push it through the management lines and, at the same time, help people set up pipelines to generate SBOMs. Of course, not everyone was ready to do this, 08:01.000 --> 08:25.000 so people could also just upload a generic free-text description and point us to some components. This was considered good enough as a proof-of-concept first pass, and it allowed us to collect answers from most of the services involved; we did not cover 100% in the end, yet. 08:25.000 --> 08:47.000 But okay, when I am back I will have stickers, so maybe people will actually start doing things and I can hand them over. So, we went for a formal representation of the knowledge with ontologies. If you have studied computer science, you have probably come across these: a formal model to represent knowledge, 08:47.000 --> 08:57.000 divided into concepts and relationships, on top of which you can add semantics. 08:57.000 --> 09:11.000 That gives you interoperability between the different systems, and you end up with a knowledge graph that can represent different things that are interconnected. 09:11.000 --> 09:26.000 Once we have a knowledge graph, we can ask it what are called competency questions, which can be different kinds of questions depending on 09:26.000 --> 09:38.000 the use cases: which libraries are the most used, which service relies on what, what is the impact of vulnerabilities, and so on,
09:38.000 --> 09:46.000 from high-level, very generic questions to more detailed ones, once we have the data in. 09:46.000 --> 09:58.000 So then, how to actually do this and organize the data? We started by checking what was available, 09:58.000 --> 10:06.000 and SPDX has an ontology defined; there are already some others, for software licences and other things. 10:06.000 --> 10:14.000 We were missing a CycloneDX ontology, because that was the standard our colleagues in the accelerator sector had adopted, 10:14.000 --> 10:22.000 so we had to write it. We also wrote a CERN-specific ontology to model the CERN services, hierarchies, and organizational units, 10:22.000 --> 10:26.000 essentially representing the organization internally. 10:26.000 --> 10:38.000 And then we glued everything together with an OSPO ontology on top. That basically allowed us to go from the raw data that we extracted, 10:38.000 --> 10:48.000 through machinery to process it, to an actual knowledge graph, and to use classic SPARQL 10:48.000 --> 11:00.000 queries to answer the competency questions. This is the typical OWL statement to import one of the ontologies. 11:00.000 --> 11:10.000 I am probably going too fast, but okay. Then, after we collected the data, the survey responses, there was a big data-wrangling exercise. 11:10.000 --> 11:16.000 This is probably the most boring-looking slide, but it is the one describing most of the work: 11:16.000 --> 11:24.000 converting the SBOMs and generating the missing ones. 11:24.000 --> 11:34.000 Many people were not really ready to upload anything.
They did not want to invest the time and energy, and it is difficult to force people when there are deadlines and other constraints. 11:34.000 --> 11:42.000 So we helped them: we asked what the main components were, and went and generated the SBOMs ourselves. 11:42.000 --> 11:48.000 And then, yes, we converted everything to CSV. 11:48.000 --> 11:54.000 Yes, CSV. We tried to normalize it, and then normalized it a little more, 11:54.000 --> 12:03.000 and were then able to use the CSV to enrich the SBOMs and convert everything into RDF form, 12:03.000 --> 12:12.000 so that we could run some magic tools for graph arrangement and generation. 12:12.000 --> 12:21.000 And then we wrote an ontology query engine into which we could put the SPARQL queries. 12:21.000 --> 12:28.000 I basically tricked the master's student into doing something that was worth three years of work, but he did it, 12:28.000 --> 12:38.000 so it was pretty nice. This is just an example of the queries that you can write in SPARQL; 12:38.000 --> 12:44.000 maybe you are familiar with this, the classic pattern of linking this to that. 12:44.000 --> 12:51.000 And then you can run them in a Python interactive application or other similar tools.
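[Editor's note] The pipeline described here, flattening SBOMs and then answering competency questions over the resulting graph, can be sketched in miniature. This is not CERN's actual tooling: the service names, libraries, and the in-memory triple list are invented stand-ins for the real CycloneDX-to-RDF conversion and SPARQL queries.

```python
import json

# Hypothetical, minimal CycloneDX-style SBOM fragments for two services.
# The "components"/"purl" field names follow the CycloneDX JSON layout;
# the services and libraries themselves are made up for illustration.
SBOMS = {
    "events-service": json.dumps({
        "components": [
            {"purl": "pkg:pypi/flask@2.3.0"},
            {"purl": "pkg:pypi/requests@2.31.0"},
        ]
    }),
    "batch-service": json.dumps({
        "components": [
            {"purl": "pkg:pypi/requests@2.28.0"},
        ]
    }),
}

def sbom_to_triples(service, sbom_json):
    """Flatten one SBOM into (subject, predicate, object) triples,
    standing in for the real SBOM -> RDF conversion step."""
    triples = []
    for comp in json.loads(sbom_json).get("components", []):
        library = comp["purl"].split("@")[0]  # drop the version suffix
        triples.append((service, "dependsOn", library))
    return triples

graph = [t for svc, doc in SBOMS.items() for t in sbom_to_triples(svc, doc)]

def services_depending_on(graph, library):
    """Competency question: which services rely on this library?
    (The real system expresses this as a SPARQL query over the graph.)"""
    return sorted(s for s, p, o in graph if p == "dependsOn" and o == library)

print(services_depending_on(graph, "pkg:pypi/requests"))
# -> ['batch-service', 'events-service']
```

The point of the triple representation is exactly the decoupling mentioned next in the talk: the question is phrased against the graph, not against any particular SBOM format.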
12:51.000 --> 12:59.000 So the largest part of the work was to collect the data, write 12:59.000 --> 13:05.000 the ontologies to build the knowledge graph, and then be able to run the competency questions. But this also allows us to 13:05.000 --> 13:12.000 decouple the two layers a bit and have a set of questions that can then be adapted if we change 13:12.000 --> 13:20.000 the knowledge graph model or things like that. Then we started to look more into the dependencies 13:20.000 --> 13:27.000 of projects, via visualizations. Here we departed a bit from the security angle: 13:27.000 --> 13:35.000 it is still about dependencies, but it is also more about presenting them to the end user. We basically built 13:35.000 --> 13:43.000 a graph of the different dependencies and 13:43.000 --> 13:51.000 vulnerabilities. We used deps.dev, which we will replace, and osv.dev for vulnerabilities, 13:51.000 --> 13:58.000 and then Neo4j, which has a query language called Cypher to 13:58.000 --> 14:04.000 store and query the data, and which we are not super happy with. So we will see what happens, 14:04.000 --> 14:11.000 but basically Neo4j looks like this. You load the graph and you see all the different connections between 14:11.000 --> 14:18.000 the different libraries and components, and you have an example of this nice language that looks like SQL 14:18.000 --> 14:24.000 but is not, and looks like something else but is not.
So it is a bit, not very user-friendly 14:24.000 --> 14:33.000 to use, but in the end you are able to visualize things. As an example, this is the event management software 14:34.000 --> 14:40.000 that a team at CERN wrote. All the images do not look nice, 14:40.000 --> 14:47.000 sorry, I will replace them. So you can try to query for, say, 14:47.000 --> 14:54.000 the vulnerabilities with a high severity score, no more than three hops from the root node, and that kind of thing. 14:54.000 --> 15:01.000 So you can play, have fun, and give it to security so that 15:01.000 --> 15:07.000 they can evaluate the impact of vulnerabilities on the critical services. 15:07.000 --> 15:13.000 This has also been a long-standing request of our users, because typically, 15:13.000 --> 15:17.000 there is a lot of Java, for example, in the control systems, and when there is 15:17.000 --> 15:21.000 a vulnerability, security says: you need to patch. The operators would like to say: okay, I need to patch, 15:21.000 --> 15:27.000 but maybe not right now, because they have to keep the accelerator running and they have a very limited 15:27.000 --> 15:35.000 time window to intervene. So if you can come up with some way of showing that 15:35.000 --> 15:40.000 the impact is very limited, because the dependency is far away or something like that, it is 15:40.000 --> 15:46.000 a bit more helpful for the users and it releases some internal 15:46.000 --> 15:52.000 tension, which is kind of natural as well.
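[Editor's note] The query sketched here, high-severity vulnerabilities within a few hops of a service's root node, can be mimicked without a graph database. The dependency names and CVSS scores below are invented; the Cypher fragment in the comment is only an illustrative equivalent, not a query from the talk.

```python
from collections import deque

# Hypothetical dependency edges for one service, plus invented CVSS scores.
# In Cypher, the same question reads roughly:
#   MATCH (s:Service {name: $root})-[:DEPENDS_ON*1..3]->(lib)
#   WHERE lib.cvss >= $min_score RETURN DISTINCT lib
DEPS = {
    "event-manager": ["web-framework", "db-driver"],
    "web-framework": ["template-lib", "http-lib"],
    "http-lib": ["tls-lib"],
    "db-driver": [],
    "template-lib": [],
    "tls-lib": [],
}
CVSS = {"template-lib": 9.8, "tls-lib": 7.5, "db-driver": 3.1}

def vulnerable_within(deps, cvss, root, max_hops, min_score):
    """Breadth-first search from the root service, reporting vulnerable
    dependencies at most max_hops away with CVSS >= min_score."""
    hops = {root: 0}
    hits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if hops[node] > 0 and cvss.get(node, 0.0) >= min_score:
            hits.append((node, hops[node], cvss[node]))
        for child in deps.get(node, []):
            if child not in hops and hops[node] < max_hops:
                hops[child] = hops[node] + 1
                queue.append(child)
    return sorted(hits, key=lambda h: -h[2])  # most severe first

print(vulnerable_within(DEPS, CVSS, "event-manager", max_hops=3, min_score=7.0))
# -> [('template-lib', 2, 9.8), ('tls-lib', 3, 7.5)]
```

Note that `db-driver` is one hop away but falls below the severity threshold, which is exactly the kind of nuance that helps operators argue about patching urgency.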
But in order to do this, we are 15:52.000 --> 15:57.000 also looking at a more in-depth 15:57.000 --> 16:02.000 investigation into the code and how the code is structured. 16:02.000 --> 16:08.000 This is, of course, more computationally intensive, and it looks 16:08.000 --> 16:13.000 at the graph of the code's functions. 16:13.000 --> 16:18.000 This is called a graph attention network; it is based on this paper, 16:18.000 --> 16:23.000 and we try to build a map of how the code is structured, with functions as nodes 16:23.000 --> 16:31.000 and calls as edges. Then we are able to compute different metrics 16:31.000 --> 16:36.000 and calculate a risk score, let's say, based on a vulnerability 16:36.000 --> 16:42.000 affecting a library and on where the library is actually used in the code. 16:42.000 --> 16:48.000 Here you see that this prepare-context function is affected by 16:48.000 --> 16:52.000 a vulnerability, so you can easily 16:52.000 --> 16:57.000 try to see how to remediate or mitigate it, or show that in the next version 16:57.000 --> 17:02.000 it is fixed, because the library is no longer called or something has changed, 17:02.000 --> 17:06.000 and this kind of thing. But this is, of course, very 17:06.000 --> 17:11.000 computationally intensive, because you have to build the graph per software package, and 17:11.000 --> 17:15.000 every time there is a new version, you have to rebuild it. So this is, yeah. 17:15.000 --> 17:21.000 But it is another proof of concept showing that 17:21.000 --> 17:28.000 you can actually go into more detail on this.
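[Editor's note] The talk's function-level analysis uses a graph attention network over the call graph. As a much simpler stand-in, the sketch below scores risk by how far a vulnerable function is from an entry point in a call graph. All function names, including `prepare_context`, are hypothetical, and the distance-damped formula is this editor's toy proxy, not the method from the paper.

```python
from collections import deque

# Hypothetical call graph: functions as nodes, calls as edges.
# yaml_load stands in for a function living in the vulnerable library.
CALLS = {
    "main": ["prepare_context", "run"],
    "prepare_context": ["parse_config"],
    "parse_config": ["yaml_load"],
    "run": ["render"],
    "render": [],
    "yaml_load": [],
}

def call_distance(calls, entry, target):
    """Shortest number of calls from entry to target, or None if unreachable."""
    dist = {entry: 0}
    queue = deque([entry])
    while queue:
        fn = queue.popleft()
        if fn == target:
            return dist[fn]
        for callee in calls.get(fn, []):
            if callee not in dist:
                dist[callee] = dist[fn] + 1
                queue.append(callee)
    return None

def risk_score(calls, entry, vulnerable_fn, severity):
    """Toy score: severity damped by how far the vulnerable function is
    from the entry point; code that is never reached contributes no risk."""
    d = call_distance(calls, entry, vulnerable_fn)
    return 0.0 if d is None else severity / (1 + d)

print(call_distance(CALLS, "main", "yaml_load"))   # -> 3
print(risk_score(CALLS, "main", "yaml_load", severity=9.8))
```

This captures the argument made to the operators: if the next release drops the `parse_config -> yaml_load` edge, the score falls to zero, giving evidence that patching can wait for the next maintenance window.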
So, 17:28.000 --> 17:34.000 finally, here is how it all looks in the framework diagram. 17:34.000 --> 17:40.000 The blue boxes are the components that we developed, 17:40.000 --> 17:46.000 the green ones are where we store the data and the knowledge graphs, in Neo4j, 17:46.000 --> 17:53.000 the grey boxes, which are actually white, are the data sources, and the yellow ones are the users. 17:53.000 --> 17:58.000 You can see the different levels, and the users can use 17:59.000 --> 18:05.000 the ontology query engine and the graph analysis tools to go more or less into 18:05.000 --> 18:10.000 detail and ask different questions. And we have been quite happy with 18:10.000 --> 18:16.000 the way it has turned out, for us and for our security team, which now 18:16.000 --> 18:22.000 wants to convince us that we should run this for them; but okay, that is another story. 18:22.000 --> 18:29.000 So, basically, to conclude: we have been happy with the way we were able 18:29.000 --> 18:35.000 to produce this framework and to use a knowledge graph for 18:35.000 --> 18:41.000 representing SBOM data and organizational data together, which also links with what was presented before. 18:41.000 --> 18:45.000 And it is very interesting, because there is 18:45.000 --> 18:51.000 a real need, from the actual users' point of view, we think, to be able to demonstrate that this is 18:51.000 --> 18:56.000 actually something that adds value to areas other than security.
18:56.000 --> 19:03.000 Being able to use ontologies as a foundation for the knowledge graph 19:03.000 --> 19:08.000 has been a nice, straightforward way 19:08.000 --> 19:14.000 to demonstrate, in a proof-of-concept fashion, how to build this. And we have been genuinely amazed: 19:14.000 --> 19:20.000 throughout all this, we have been checking many different standards and tools, 19:20.000 --> 19:25.000 and the free and open source tooling ecosystem for the supply chain is amazing. 19:25.000 --> 19:28.000 There is a tool for everything; we could have 19:28.000 --> 19:33.000 spent a year just testing out everything that is out there. It is 19:33.000 --> 19:38.000 very nice. We received support and help from different people, 19:38.000 --> 19:42.000 in forums and online, which was very nice. 19:42.000 --> 19:46.000 So if you maintain some of this tooling, thank you, because we have probably interacted, 19:46.000 --> 19:47.000 even if you do not know it. 19:47.000 --> 19:53.000 Obviously, the next big step is to try to automate the whole pipeline. 19:53.000 --> 19:58.000 Getting to the point of just pushing a button and having the 19:58.000 --> 20:03.000 pipeline refreshed is, I think, 20:03.000 --> 20:09.000 kind of impossible, but helping users at least generate things automatically, 20:09.000 --> 20:12.000 enrich them, and then ship them to a central location,
20:12.000 --> 20:16.000 that is feasible, so we are for sure going to look 20:16.000 --> 20:20.000 into this. There are a couple of technical pieces that we want to replace, 20:20.000 --> 20:24.000 like using ecosyste.ms instead of deps.dev, and looking at Neo4j 20:24.000 --> 20:27.000 and Cypher alternatives. 20:27.000 --> 20:33.000 And also to have an easy way for end users to query the data, 20:33.000 --> 20:39.000 based on the different use cases. 20:39.000 --> 20:43.000 So, yeah, I am a bit ahead of time. 20:43.000 --> 20:48.000 I just want to also say that if you are a university student, 20:48.000 --> 20:51.000 still alive and not bored to death, 20:51.000 --> 20:55.000 I am going to recruit students to continue this work this year. 20:55.000 --> 20:59.000 You can apply to the CERN technical student programme; 20:59.000 --> 21:04.000 there is a link, with a deadline around mid-March. 21:04.000 --> 21:07.000 And you can say you saw this talk, because the application is generic. 21:07.000 --> 21:15.000 But anyway, I am very happy to be here presenting, and thank you very much. 21:15.000 --> 21:22.000 [Moderator] Thank you very much. 21:22.000 --> 21:25.000 Questions? 21:25.000 --> 21:28.000 [Audience] Thanks for the presentation. 21:28.000 --> 21:30.000 What's wrong with Neo4j? 21:30.000 --> 21:33.000 What's wrong with Neo4j? 21:33.000 --> 21:36.000 So, the question is: what's wrong with Neo4j, 21:36.000 --> 21:37.000 and what is the limitation? 21:37.000 --> 21:44.000 I have not been using it myself, but I understand that we were not happy with the query engine; 21:44.000 --> 21:51.000 it is not that good. 21:51.000 --> 21:54.000 [Moderator] More questions? 21:54.000 --> 21:55.000 Thanks.
21:55.000 --> 21:56.000 [Moderator] Thank you again. 21:56.000 --> 21:59.000 Thank you. [Applause]