WEBVTT 00:00.000 --> 00:25.000 I count as two Italian people. So, yes, good afternoon. I work at CERN, so it is a bit intimidating to present under the electron mass and charge constants up there. 00:25.000 --> 00:41.000 I am a computer scientist, and today I am presenting a project I worked on with a colleague, a student at the University of Bologna, on investigating our software a bit: 00:41.000 --> 00:55.000 visualizing it and leveraging the knowledge that we can extract from it. 00:55.000 --> 01:05.000 I am a computer scientist doing mostly system administration work, but I also helped build the CERN OSPO, and I am responsible for it 01:05.000 --> 01:12.000 when I am not working on Linux systems. I have just a couple of introductory slides. I don't know if you have ever been to CERN. 01:12.000 --> 01:21.000 Raise your hand if you have not. Not that many. So, we are an international particle physics laboratory on the border between Switzerland and France. 01:21.000 --> 01:37.000 The research facility is basically a chain of accelerators, which are underground because it is more practical 01:37.000 --> 01:45.000 to accelerate particles there and provide colliding beams to scientists. 01:45.000 --> 01:52.000 We are an international organization, publicly funded by member states, mostly European. 01:52.000 --> 02:07.000 We have around 3,000 employed members of personnel, but a whole set of different scientists also come over for short or long periods of time to do their experiments, 02:07.000 --> 02:13.000 from many different nationalities. The product of the research looks like this.
02:13.000 --> 02:20.000 I have to show the pictures of all four experiments, because this is recorded and otherwise 02:20.000 --> 02:29.000 we would be in trouble. So basically we provide those colliding beams, and then the particles essentially explode. 02:29.000 --> 02:35.000 I don't know if you are familiar with Einstein's equation, which always has to be mentioned in a CERN presentation. 02:35.000 --> 02:47.000 We routinely see it happen in one direction, mass becoming energy, typically when you burn something, but we never see it happen in the other direction. 02:47.000 --> 02:54.000 Here, at very high levels of energy, we do see it happen: a concentration of energy creates new particles. 02:54.000 --> 03:00.000 This is roughly how the physicists work. I apologize if you are a physicist, because I am not. 03:00.000 --> 03:15.000 One important part of our convention is the idea, written explicitly in the rules, that we need to share the products of the research and make them available. 03:15.000 --> 03:28.000 So there is a long history of sharing computing and free and open source software and hardware: sharing and distributing software in the 1980s, 03:28.000 --> 03:33.000 and the web, invented here in 1989, initially released into the public domain 03:33.000 --> 03:40.000 and then as open source. I will go quickly through the history to get to the point. 03:40.000 --> 03:49.000 It is also important to understand how the landscape of the free software ecosystem and the IT services ecosystem has evolved. 03:49.000 --> 04:02.000 We also wrote the CERN Open Hardware Licence, and around ten years ago there was a big shift to adopting many, many products in a very, let's say, grassroots way.
04:02.000 --> 04:09.000 Since we have so many visiting people, they come, install things, those things become critical for an experiment, and then they leave. 04:09.000 --> 04:15.000 So it is a very diverse landscape, and we tried to convince, and managed to convince, the organization. 04:15.000 --> 04:17.000 [Audience] Is this an actual whiteboard somewhere? 04:17.000 --> 04:23.000 Yes, all the images are taken from our internal repository. 04:23.000 --> 04:35.000 So we managed to convince the organization that we needed a more systematic approach to free and open source software and hardware, and we got an OSPO approved with an internal mandate. 04:35.000 --> 04:40.000 And since we also needed to have security on board, they said yes, but asked us to put into the mandate 04:40.000 --> 04:47.000 the fact that we want to track our open source dependencies. It was not only about security. [Audience] Speak closer to the microphone, sorry. 04:47.000 --> 04:49.000 Yes, thank you. 04:50.000 --> 04:58.000 So basically, one part of the mandate is to help the organization track dependencies on open source components. 04:58.000 --> 05:15.000 And so we are back to the problem that we are trying to solve, why we are here, and the project that Gianluca has been working on, which turned into his master's thesis. We have a very diverse ecosystem of internal services: 05:15.000 --> 05:33.000 a few hundred that are officially declared and that we know of; sometimes, when something breaks, we discover new ones. We have an internal GitLab instance with more than 140,000 projects in very diverse states, 05:33.000 --> 05:44.000 and of course in all the languages. This is called academic freedom, apparently, and it is very nice because it allows quick adoption
05:44.000 --> 06:13.000 and a lot of agility, but it also makes it complicated to keep things under control. There are some areas of the organization that are very tightly controlled: the accelerator software used for running the control systems is obviously not developed the way the data scientists analyzing the data work. So there are also very different levels of maturity. My colleagues running the accelerator have a fully integrated SBOM supply chain: 06:13.000 --> 06:35.000 generation, signing, extraction, centralized tracking. In other places, people do not even know what an SBOM is. So the goal was to end up with a comprehensive view that could be used both for security and for ourselves as the open source programme office, to show the organization what we have, 06:35.000 --> 07:02.000 in order to be able to contribute better to the projects we rely on, and to discover, for instance, that many different projects depend on a single library, and things like that. So we came up with a computer science approach, a knowledge representation approach, divided into three levels. 07:03.000 --> 07:20.000 The first level is just organizing the data and formalizing it as a knowledge graph, which has been mentioned already; then being able to drill down and visualize the dependencies; and then drilling down even to the functional level for analyzing specific projects. 07:20.000 --> 07:38.000 First of all, the first problem was how to actually gather and represent the data, because there is so much of it, and the diversity of maturity levels was really very challenging to cope with.
07:38.000 --> 08:01.000 So, of course, we ended up with a survey: we went and asked people what they run. That was actually a good thing, because we could push it through the management lines and, at the same time, help people set up pipelines to generate SBOMs. Of course, not everyone was ready to do this, 08:01.000 --> 08:25.000 so people could also just upload a generic free-text description and point us to some components. This was considered good enough as a proof-of-concept first pass, and it allowed us to collect answers from most of the services involved; we did not cover 100% in the end, yet. 08:25.000 --> 08:47.000 But okay, when I am back I will have stickers, so maybe people will actually start doing things and I can hand them over. So, we went for a formal representation of the knowledge with ontologies. If you have studied computer science, you have probably come across these: a formal model to represent knowledge, 08:47.000 --> 08:57.000 divided into concepts and relationships, on top of which you can add semantics. 08:57.000 --> 09:11.000 That gives you interoperability between the different systems, and you end up with a knowledge graph that can represent different things that are interconnected. 09:11.000 --> 09:26.000 Once we have a knowledge graph, we can ask it what are called competency questions, which can be different kinds of questions depending on 09:26.000 --> 09:38.000 the use cases: which libraries are the most used, which service relies on what, what is the impact of vulnerabilities, and so on,
09:38.000 --> 09:46.000 from high-level, very generic questions to more detailed ones, once we have the data in. 09:46.000 --> 09:58.000 So then, how to actually do this and organize the data? We started by checking what was available, 09:58.000 --> 10:06.000 and SPDX has an ontology defined; there are already some others, for software licences and other things. 10:06.000 --> 10:14.000 We were missing a CycloneDX ontology, because that was the standard our colleagues in the accelerator sector had adopted, 10:14.000 --> 10:22.000 so we had to write it. We also wrote a CERN-specific ontology to model the CERN services, hierarchies, and organizational units, 10:22.000 --> 10:26.000 essentially representing the organization internally. 10:26.000 --> 10:38.000 And then we glued everything together with an OSPO ontology on top. That basically allowed us to go from the raw data that we extracted, 10:38.000 --> 10:48.000 through machinery to process it, to an actual knowledge graph, and to use classic SPARQL 10:48.000 --> 11:00.000 queries to answer the competency questions. This is the typical OWL statement to import one of the ontologies. 11:00.000 --> 11:10.000 I am probably going too fast, but okay. Then, after we collected the data, the survey responses, there was a big data-wrangling exercise. 11:10.000 --> 11:16.000 This is probably the most boring-looking slide, but it is the one describing most of the work: 11:16.000 --> 11:24.000 converting the SBOMs and generating the missing ones. 11:24.000 --> 11:34.000 Many people were not really ready to upload anything.
They did not want to invest the time and energy, and it is difficult to force people when there are deadlines and other constraints. 11:34.000 --> 11:42.000 So we helped them: we asked what the main components were, and went and generated the SBOMs ourselves. 11:42.000 --> 11:48.000 And then, yes, we converted everything to CSV. 11:48.000 --> 11:54.000 Yes, CSV. We tried to normalize it, and then normalized it a little more, 11:54.000 --> 12:03.000 and were then able to use the CSV to enrich the SBOMs and convert everything into RDF form, 12:03.000 --> 12:12.000 so that we could run some magic tools for graph arrangement and generation. 12:12.000 --> 12:21.000 And then we wrote an ontology query engine into which we could put the SPARQL queries. 12:21.000 --> 12:28.000 I basically tricked the master's student into doing something that was worth three years of work, but he did it, 12:28.000 --> 12:38.000 so it was pretty nice. This is just an example of the queries that you can write in SPARQL; 12:38.000 --> 12:44.000 maybe you are familiar with this, the classic pattern of linking this to that. 12:44.000 --> 12:51.000 And then you can run them in a Python interactive application or other similar tools.
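[Editor's note] The pipeline described here, flattening SBOMs and then answering competency questions over the resulting graph, can be sketched in miniature. This is not CERN's actual tooling: the service names, libraries, and the in-memory triple list are invented stand-ins for the real CycloneDX-to-RDF conversion and SPARQL queries.

```python
import json

# Hypothetical, minimal CycloneDX-style SBOM fragments for two services.
# The "components"/"purl" field names follow the CycloneDX JSON layout;
# the services and libraries themselves are made up for illustration.
SBOMS = {
    "events-service": json.dumps({
        "components": [
            {"purl": "pkg:pypi/flask@2.3.0"},
            {"purl": "pkg:pypi/requests@2.31.0"},
        ]
    }),
    "batch-service": json.dumps({
        "components": [
            {"purl": "pkg:pypi/requests@2.28.0"},
        ]
    }),
}

def sbom_to_triples(service, sbom_json):
    """Flatten one SBOM into (subject, predicate, object) triples,
    standing in for the real SBOM -> RDF conversion step."""
    triples = []
    for comp in json.loads(sbom_json).get("components", []):
        library = comp["purl"].split("@")[0]  # drop the version suffix
        triples.append((service, "dependsOn", library))
    return triples

graph = [t for svc, doc in SBOMS.items() for t in sbom_to_triples(svc, doc)]

def services_depending_on(graph, library):
    """Competency question: which services rely on this library?
    (The real system expresses this as a SPARQL query over the graph.)"""
    return sorted(s for s, p, o in graph if p == "dependsOn" and o == library)

print(services_depending_on(graph, "pkg:pypi/requests"))
# -> ['batch-service', 'events-service']
```

The point of the triple representation is exactly the decoupling mentioned next in the talk: the question is phrased against the graph, not against any particular SBOM format.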
12:51.000 --> 12:59.000 So the largest part of the work was to collect the data, write 12:59.000 --> 13:05.000 the ontologies to build the knowledge graph, and then be able to run the competency questions. But this also allows us to 13:05.000 --> 13:12.000 decouple the two layers a bit and have a set of questions that can then be adapted if we change 13:12.000 --> 13:20.000 the knowledge graph model or things like that. Then we started to look more into the dependencies 13:20.000 --> 13:27.000 of projects, via visualizations. Here we departed a bit from the security angle: 13:27.000 --> 13:35.000 it is still about dependencies, but it is also more about presenting them to the end user. We basically built 13:35.000 --> 13:43.000 a graph of the different dependencies and 13:43.000 --> 13:51.000 vulnerabilities. We used deps.dev, which we will replace, and osv.dev for vulnerabilities, 13:51.000 --> 13:58.000 and then Neo4j, which has a query language called Cypher to 13:58.000 --> 14:04.000 store and query the data, and which we are not super happy with. So we will see what happens, 14:04.000 --> 14:11.000 but basically Neo4j looks like this. You load the graph and you see all the different connections between 14:11.000 --> 14:18.000 the different libraries and components, and you have an example of this nice language that looks like SQL 14:18.000 --> 14:24.000 but is not, and looks like something else but is not.
So it is a bit, not very user-friendly 14:24.000 --> 14:33.000 to use, but in the end you are able to visualize things. As an example, this is the event management software 14:34.000 --> 14:40.000 that a team at CERN wrote. All the images do not look nice, 14:40.000 --> 14:47.000 sorry, I will replace them. So you can try to query for, say, 14:47.000 --> 14:54.000 the vulnerabilities with a high severity score, no more than three hops from the root node, and that kind of thing. 14:54.000 --> 15:01.000 So you can play, have fun, and give it to security so that 15:01.000 --> 15:07.000 they can evaluate the impact of vulnerabilities on the critical services. 15:07.000 --> 15:13.000 This has also been a long-standing request of our users, because typically, 15:13.000 --> 15:17.000 there is a lot of Java, for example, in the control systems, and when there is 15:17.000 --> 15:21.000 a vulnerability, security says: you need to patch. The operators would like to say: okay, I need to patch, 15:21.000 --> 15:27.000 but maybe not right now, because they have to keep the accelerator running and they have a very limited 15:27.000 --> 15:35.000 time window to intervene. So if you can come up with some way of showing that 15:35.000 --> 15:40.000 the impact is very limited, because the dependency is far away or something like that, it is 15:40.000 --> 15:46.000 a bit more helpful for the users and it releases some internal 15:46.000 --> 15:52.000 tension, which is kind of natural as well.
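[Editor's note] The query sketched here, high-severity vulnerabilities within a few hops of a service's root node, can be mimicked without a graph database. The dependency names and CVSS scores below are invented; the Cypher fragment in the comment is only an illustrative equivalent, not a query from the talk.

```python
from collections import deque

# Hypothetical dependency edges for one service, plus invented CVSS scores.
# In Cypher, the same question reads roughly:
#   MATCH (s:Service {name: $root})-[:DEPENDS_ON*1..3]->(lib)
#   WHERE lib.cvss >= $min_score RETURN DISTINCT lib
DEPS = {
    "event-manager": ["web-framework", "db-driver"],
    "web-framework": ["template-lib", "http-lib"],
    "http-lib": ["tls-lib"],
    "db-driver": [],
    "template-lib": [],
    "tls-lib": [],
}
CVSS = {"template-lib": 9.8, "tls-lib": 7.5, "db-driver": 3.1}

def vulnerable_within(deps, cvss, root, max_hops, min_score):
    """Breadth-first search from the root service, reporting vulnerable
    dependencies at most max_hops away with CVSS >= min_score."""
    hops = {root: 0}
    hits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if hops[node] > 0 and cvss.get(node, 0.0) >= min_score:
            hits.append((node, hops[node], cvss[node]))
        for child in deps.get(node, []):
            if child not in hops and hops[node] < max_hops:
                hops[child] = hops[node] + 1
                queue.append(child)
    return sorted(hits, key=lambda h: -h[2])  # most severe first

print(vulnerable_within(DEPS, CVSS, "event-manager", max_hops=3, min_score=7.0))
# -> [('template-lib', 2, 9.8), ('tls-lib', 3, 7.5)]
```

Note that `db-driver` is one hop away but falls below the severity threshold, which is exactly the kind of nuance that helps operators argue about patching urgency.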
But in order to do this, we are 15:52.000 --> 15:57.000 also looking at a more in-depth 15:57.000 --> 16:02.000 investigation into the code and how the code is structured. 16:02.000 --> 16:08.000 This is, of course, more computationally intensive, and it looks 16:08.000 --> 16:13.000 at the graph of the code's functions. 16:13.000 --> 16:18.000 This is called a graph attention network; it is based on this paper, 16:18.000 --> 16:23.000 and we try to build a map of how the code is structured, with functions as nodes 16:23.000 --> 16:31.000 and calls as edges. Then we are able to compute different metrics 16:31.000 --> 16:36.000 and calculate a risk score, let's say, based on a vulnerability 16:36.000 --> 16:42.000 affecting a library and on where the library is actually used in the code. 16:42.000 --> 16:48.000 Here you see that this prepare-context function is affected by 16:48.000 --> 16:52.000 a vulnerability, so you can easily 16:52.000 --> 16:57.000 try to see how to remediate or mitigate it, or show that in the next version 16:57.000 --> 17:02.000 it is fixed, because the library is no longer called or something has changed, 17:02.000 --> 17:06.000 and this kind of thing. But this is, of course, very 17:06.000 --> 17:11.000 computationally intensive, because you have to build the graph per software package, and 17:11.000 --> 17:15.000 every time there is a new version, you have to rebuild it. So this is, yeah. 17:15.000 --> 17:21.000 But it is another proof of concept showing that 17:21.000 --> 17:28.000 you can actually go into more detail on this.
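[Editor's note] The talk's function-level analysis uses a graph attention network over the call graph. As a much simpler stand-in, the sketch below scores risk by how far a vulnerable function is from an entry point in a call graph. All function names, including `prepare_context`, are hypothetical, and the distance-damped formula is this editor's toy proxy, not the method from the paper.

```python
from collections import deque

# Hypothetical call graph: functions as nodes, calls as edges.
# yaml_load stands in for a function living in the vulnerable library.
CALLS = {
    "main": ["prepare_context", "run"],
    "prepare_context": ["parse_config"],
    "parse_config": ["yaml_load"],
    "run": ["render"],
    "render": [],
    "yaml_load": [],
}

def call_distance(calls, entry, target):
    """Shortest number of calls from entry to target, or None if unreachable."""
    dist = {entry: 0}
    queue = deque([entry])
    while queue:
        fn = queue.popleft()
        if fn == target:
            return dist[fn]
        for callee in calls.get(fn, []):
            if callee not in dist:
                dist[callee] = dist[fn] + 1
                queue.append(callee)
    return None

def risk_score(calls, entry, vulnerable_fn, severity):
    """Toy score: severity damped by how far the vulnerable function is
    from the entry point; code that is never reached contributes no risk."""
    d = call_distance(calls, entry, vulnerable_fn)
    return 0.0 if d is None else severity / (1 + d)

print(call_distance(CALLS, "main", "yaml_load"))   # -> 3
print(risk_score(CALLS, "main", "yaml_load", severity=9.8))
```

This captures the argument made to the operators: if the next release drops the `parse_config -> yaml_load` edge, the score falls to zero, giving evidence that patching can wait for the next maintenance window.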
So, 17:28.000 --> 17:34.000 finally, here is how it all looks in the framework diagram. 17:34.000 --> 17:40.000 The blue boxes are the components that we developed, 17:40.000 --> 17:46.000 the green ones are where we store the data and the knowledge graphs, in Neo4j, 17:46.000 --> 17:53.000 the grey boxes, which are actually white, are the data sources, and the yellow ones are the users. 17:53.000 --> 17:58.000 You can see the different levels, and the users can use 17:59.000 --> 18:05.000 the ontology query engine and the graph analysis tools to go more or less into 18:05.000 --> 18:10.000 detail and ask different questions. And we have been quite happy with 18:10.000 --> 18:16.000 the way it has turned out, for us and for our security team, which now 18:16.000 --> 18:22.000 wants to convince us that we should run this for them; but okay, that is another story. 18:22.000 --> 18:29.000 So, basically, to conclude: we have been happy with the way we were able 18:29.000 --> 18:35.000 to produce this framework and to use a knowledge graph for 18:35.000 --> 18:41.000 representing SBOM data and organizational data together, which also links with what was presented before. 18:41.000 --> 18:45.000 And it is very interesting, because there is 18:45.000 --> 18:51.000 a real need, from the actual users' point of view, we think, to be able to demonstrate that this is 18:51.000 --> 18:56.000 actually something that adds value to areas other than security.
18:56.000 --> 19:03.000 Being able to use ontologies as a foundation for the knowledge graph 19:03.000 --> 19:08.000 has been a nice, straightforward way 19:08.000 --> 19:14.000 to demonstrate, in a proof-of-concept fashion, how to build this. And we have been genuinely amazed: 19:14.000 --> 19:20.000 throughout all this, we have been checking many different standards and tools, 19:20.000 --> 19:25.000 and the free and open source tooling ecosystem for the supply chain is amazing. 19:25.000 --> 19:28.000 There is a tool for everything; we could have 19:28.000 --> 19:33.000 spent a year just testing out everything that is out there. It is 19:33.000 --> 19:38.000 very nice. We received support and help from different people, 19:38.000 --> 19:42.000 in forums and online, which was very nice. 19:42.000 --> 19:46.000 So if you maintain some of this tooling, thank you, because we have probably interacted, 19:46.000 --> 19:47.000 even if you do not know it. 19:47.000 --> 19:53.000 Obviously, the next big step is to try to automate the whole pipeline. 19:53.000 --> 19:58.000 Getting to the point of just pushing a button and having the 19:58.000 --> 20:03.000 pipeline refreshed is, I think, 20:03.000 --> 20:09.000 kind of impossible, but helping users at least generate things automatically, 20:09.000 --> 20:12.000 enrich them, and then ship them to a central location,
20:12.000 --> 20:16.000 that is feasible, so we are for sure going to look 20:16.000 --> 20:20.000 into this. There are a couple of technical pieces that we want to replace, 20:20.000 --> 20:24.000 like using ecosyste.ms instead of deps.dev, and looking at Neo4j 20:24.000 --> 20:27.000 and Cypher alternatives. 20:27.000 --> 20:33.000 And also to have an easy way for end users to query the data, 20:33.000 --> 20:39.000 based on the different use cases. 20:39.000 --> 20:43.000 So, yeah, I am a bit ahead of time. 20:43.000 --> 20:48.000 I just want to also say that if you are a university student, 20:48.000 --> 20:51.000 still alive and not bored to death, 20:51.000 --> 20:55.000 I am going to recruit students to continue this work this year. 20:55.000 --> 20:59.000 You can apply to the CERN technical student programme; 20:59.000 --> 21:04.000 there is a link, with a deadline around mid-March. 21:04.000 --> 21:07.000 And you can say you saw this talk, because the application is generic. 21:07.000 --> 21:15.000 But anyway, I am very happy to be here presenting, and thank you very much. 21:15.000 --> 21:22.000 [Moderator] Thank you very much. 21:22.000 --> 21:25.000 Questions? 21:25.000 --> 21:28.000 [Audience] Thanks for the presentation. 21:28.000 --> 21:30.000 What's wrong with Neo4j? 21:30.000 --> 21:33.000 What's wrong with Neo4j? 21:33.000 --> 21:36.000 So, the question is: what's wrong with Neo4j, 21:36.000 --> 21:37.000 and what is the limitation? 21:37.000 --> 21:44.000 I have not been using it myself, but I understand that we were not happy with the query engine; 21:44.000 --> 21:51.000 it is not that good. 21:51.000 --> 21:54.000 [Moderator] More questions? 21:54.000 --> 21:55.000 Thanks.
21:55.000 --> 21:56.000 [Moderator] Thank you again. 21:56.000 --> 21:59.000 Thank you. [Applause]