WEBVTT 00:00.000 --> 00:14.400 I had no idea I'd really be doing something like a keynote talk. 00:14.400 --> 00:22.480 I'd like to talk about clarity, licensing, and metadata for Nix packages 00:22.480 --> 00:24.640 and why it's a problem. 00:24.640 --> 00:29.640 I maintain a bunch of tools, some of you may know about them in the space, something called 00:29.640 --> 00:32.640 ScanCode. 00:32.640 --> 00:37.080 I also dabble a bit in standards, like something called package URL, which is used 00:37.080 --> 00:42.920 to identify packages in SBOMs, and standards, we need more of them. I'm a co-founder of 00:42.920 --> 00:49.880 SPDX, and I'm also a core contributor to CycloneDX. 00:49.880 --> 00:57.880 We're starting late, so I'll try to avoid making the room too late; I'll try to finish 00:57.880 --> 00:58.880 early. 00:58.880 --> 01:07.560 The problem is that Nix package metadata, in particular licenses, is a bit of a mess. 01:07.560 --> 01:13.560 It's not your fault, it's not our fault entirely, a bit, but it's also because it's hard, 01:13.560 --> 01:20.560 because upstream is also sometimes pretty brain damaged. 01:20.560 --> 01:29.160 If you look at, say, this license, this page, and you 01:29.160 --> 01:37.000 squint, it says MIT, but it says, you know, permission is not granted, right? 01:37.000 --> 01:44.000 It's some junk, but I have a whole collection of these. I call these FUBAR licenses, 01:44.000 --> 01:47.000 and it's just a small slice of this. 01:47.000 --> 01:53.200 It's easy for us to be fooled by that, because, you know, it says MIT, and in some cases even 01:53.200 --> 01:57.200 GitHub says MIT, and it looks legit. 01:57.200 --> 02:02.440 So that's the problem. It's not entirely our fault: it's a problem of the tools, 02:02.440 --> 02:06.960 it's a problem of upstream also, which is sometimes pretty brain damaged.
02:06.960 --> 02:12.800 In the end, each time you have a license issue, you're putting a bit of 02:12.800 --> 02:19.800 friction in using packages, and licenses are somewhat important: open source exists because 02:19.800 --> 02:23.120 of packages being under an open source license, right? 02:23.120 --> 02:27.520 You remove the open source license, there's no open source anymore. 02:27.520 --> 02:36.520 So it's useful for things like emerging regulations like the CRA, and really 02:36.520 --> 02:42.560 any kind of responsible reuse. Whether it's in a corporation, in an organization, in 02:42.560 --> 02:46.840 an open source project, you want to know about the license. 02:46.840 --> 02:49.040 So what can we do? 02:49.040 --> 02:50.480 What does it take to fix the problem? 02:50.480 --> 02:58.520 So we've started a small project called Nix Clarity, which is supported by a program funded 02:58.520 --> 03:06.480 by the European Union called Fediversity, and the goal is to help fix the mess. 03:06.480 --> 03:11.040 Hopefully also help upstream, because the great thing about Nix is that a lot of the code 03:11.040 --> 03:16.240 is pristine upstream, and ideally, we don't want to fix it only for Nix. 03:16.240 --> 03:21.920 We want to fix it for everyone, so a thousand years from now, open source license clarity 03:21.920 --> 03:26.080 will no longer be a problem. 03:26.080 --> 03:31.040 So the plan is, you know, to use package URL to help standardize the case where 03:31.040 --> 03:37.760 we have vendored code, which may not have been externalized, could be copy-pasted; automate the detection 03:37.760 --> 03:45.600 with open source tools; eventually deploy at or for Nix, that's not up to me to decide, 03:45.600 --> 03:52.160 but ideally, whenever you publish a package on Nix packages, then you get feedback.
03:52.160 --> 03:57.340 If the foundation or community wants to block on funky, weird, missing licenses, proprietary 03:57.340 --> 04:02.420 licenses, more power to you as a group, that's not for me to decide. 04:02.420 --> 04:10.540 So, a quick word about purls. Have any of you heard about package URLs? 04:10.540 --> 04:12.540 Good, so I need to talk a bit about it. 04:12.540 --> 04:21.780 It's a small string; it's a standard to identify your package in an SBOM or elsewhere. 04:21.780 --> 04:27.460 It's useful and you need to know about it because it's been merged into the CVE schema for 04:27.460 --> 04:38.380 vulnerabilities in October, so eventually you can go from a scan of a codebase straight to a 04:38.380 --> 04:41.500 lookup in the vulnerability database without much friction. 04:41.500 --> 04:46.020 So that's really useful for that. 04:46.020 --> 04:50.220 Somebody has to tell me if I'm going late because I can go on about that for hours. It's been 04:50.220 --> 04:58.460 recently, on the 6th of December, standardized as an Ecma standard, and it's on its way to 04:58.460 --> 05:02.980 be an ISO standard, which is really interesting for a small string like that. 05:02.980 --> 05:10.820 But you know, it's important for companies to also have this standard, to ensure these are 05:10.820 --> 05:13.900 not moving and are usable. 05:13.900 --> 05:19.740 We are working, by the way, specifically on supporting package URLs for Nix packages, which 05:19.740 --> 05:25.180 has specific challenges because it's Nix and because we have derivations and a lot of 05:25.180 --> 05:30.420 hashes to deal with. 05:30.420 --> 05:37.860 It's used a bit everywhere: most if not all the tools that scan for origin, license and 05:37.860 --> 05:46.700 vulnerabilities do use package URL, so the standardization came basically after adoption. 05:46.820 --> 05:56.140 Databases, CVEs are using that; GitHub, GitLab, all these companies, most open source foundations.
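To make the "small string" concrete, here is a minimal Python sketch of how a package URL of the general shape pkg:type/namespace/name@version?qualifiers#subpath can be split into its parts. It is an illustration only and not the official purl library: the function name parse_purl is made up for this note, percent-decoding and per-type normalization are left out, and the example strings follow the published purl format for Maven and Debian rather than anything Nix-specific, since Nix support is still being worked on according to the talk.

```python
def parse_purl(purl):
    """Split a package URL into components (simplified: no percent-decoding)."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg" or not rest:
        raise ValueError("a package URL must start with 'pkg:'")
    # Split off the optional '#subpath', then the optional '?qualifiers'.
    rest, _, subpath = rest.partition("#")
    rest, _, qualifier_text = rest.partition("?")
    # The version, when present, follows the last '@'.
    version = ""
    if "@" in rest:
        rest, _, version = rest.rpartition("@")
    # First path segment is the type, last is the name,
    # anything in between is the namespace.
    pkg_type, _, path = rest.strip("/").partition("/")
    namespace, _, name = path.rpartition("/")
    qualifiers = {}
    if qualifier_text:
        qualifiers = dict(pair.split("=", 1) for pair in qualifier_text.split("&"))
    return {
        "type": pkg_type,
        "namespace": namespace,
        "name": name,
        "version": version,
        "qualifiers": qualifiers,
        "subpath": subpath,
    }


if __name__ == "__main__":
    print(parse_purl("pkg:maven/org.apache.logging.log4j/log4j-core@2.17.0"))
    print(parse_purl("pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie"))
```

Because the type, name, and version come out as separate fields, a scanner and a vulnerability database can agree on which package they mean without comparing free-form names.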
05:56.140 --> 06:03.740 So it's hopefully useful. It's not perfect; I like to say it's less bad than other approaches 06:03.740 --> 06:07.380 that have existed before, so it's just a small step forward. 06:07.380 --> 06:11.580 So now what are we going to do for Nix? 06:11.620 --> 06:16.380 Identifying vendored, copied code is one thing; we also want to curate and validate 06:16.380 --> 06:17.820 all the licenses. 06:17.820 --> 06:33.580 You may not know how Nix licenses are actually tracked in Nix itself, but if 06:33.580 --> 06:40.260 you search like that, I had a page up, there's a big, oh, there you go, 06:40.340 --> 06:42.340 it's at the top. 06:42.340 --> 06:47.260 You have a big Nix file that lists each and every possible license that could be 06:47.260 --> 06:48.660 used in a package. 06:48.660 --> 06:55.900 It doesn't scale, because you can have something like the Linux kernel, which may not be 06:55.900 --> 07:01.980 strictly available as a Nix package, but you may have a package that has a combination 07:01.980 --> 07:05.940 of licenses, license exceptions, these kinds of things. 07:05.940 --> 07:09.940 In the current mode, you would have to expand this file forever to list all the possible 07:09.940 --> 07:10.940 combinations. 07:10.940 --> 07:15.940 So that's one problem. 07:15.940 --> 07:19.740 So we want to validate all the licenses, but there's probably going to be work to do 07:19.740 --> 07:27.340 specifically to support license expressions in the Nix language. 07:27.340 --> 07:31.900 And I don't know how to do that. I know how to write Python, I can be dangerous 07:31.900 --> 07:42.420 in C and a bit of C++, but I don't know Nix enough to do something there. 07:42.420 --> 07:47.540 The problem also at scale is that it's very large: we're talking about tens and tens of 07:47.540 --> 07:52.300 terabytes of code.
07:52.300 --> 07:56.860 We want to clarify the licenses, using ScanCode and MatchCode to do things. ScanCode 07:56.860 --> 07:57.860 detects licenses. 07:57.860 --> 08:05.140 It's basically a stupid, dumb diff between many license texts and license mentions 08:05.140 --> 08:11.100 and the code, except it's not stupid when you need to do that billions and billions of 08:11.100 --> 08:16.740 times, so there's a few tricks to make it fast. 08:16.740 --> 08:23.740 It always requires new licenses; especially with AI, there's a lot of very bad innovation 08:23.740 --> 08:29.500 going on. Everybody wants to invent great new licenses, which are almost open source, but 08:29.500 --> 08:30.700 not exactly. 08:30.700 --> 08:38.940 You've seen the cases: nearly every week we find new licenses, typically from 08:38.940 --> 08:44.500 AI-related projects, which are really problematic. 08:44.500 --> 08:49.860 So being able to detect the license is one thing; the other thing is being able to detect 08:49.860 --> 08:53.780 vendored, copied code. That's where MatchCode comes in; it's basically a database of 08:53.780 --> 08:55.780 hashes. 08:55.780 --> 09:02.580 Again, very simple in essence; it can be a bit tricky if you're trying to efficiently find 09:02.580 --> 09:08.620 matches to code that was copied and has been modified. 09:08.620 --> 09:14.700 There's a few tricks for that also. And the problem here, when we're talking about generative 09:14.700 --> 09:19.820 AI: I have a tool and I can prove that, 09:19.820 --> 09:29.340 scientifically, about 20% of the time you can easily prompt LLMs to actually 09:29.340 --> 09:34.900 spit out verbatim copies of the source code that was used to train them. 09:34.900 --> 09:41.420 So AI companies collectively used our source code to train their models.
09:41.420 --> 09:45.820 This is eventually an index that can memorize and spit out everything. Whenever you have 09:45.860 --> 09:51.140 that, it's a great copyright infringement machine: it's fast, and of course there's no 09:51.140 --> 09:56.820 attribution, no notice that comes with it. 09:56.820 --> 10:03.500 We're going to have a lot of bugs to fix. Again, Nix is big, Nix packages is big. We have 10:03.500 --> 10:11.620 some, not secret, it's open source, some machine learning tools to spot incorrect license 10:11.660 --> 10:19.300 detection, so to help fix the bugs, and we're also going to need a way 10:19.300 --> 10:26.340 to avoid being a drain on package maintainers at Nix and eventually upstream. 10:26.340 --> 10:32.860 I don't know any solution other than to do it one at a time with humans in the loop; AI 10:32.860 --> 10:41.340 bots, that's not the way, which means we cannot fix the problems upstream without 10:41.340 --> 10:46.140 massively involving the community. 10:46.140 --> 10:50.620 So we talked about license expressions for Nix. Right now the solution for licenses 10:50.620 --> 10:56.180 in Nix does not scale: if you really want to have correct licenses at the scale of the 10:56.180 --> 11:02.500 current Nix packages, the table is going to have to be 100 times bigger, which is going to 11:02.500 --> 11:09.300 be a challenge even for pull request merging and this kind of thing. 11:09.300 --> 11:17.820 It's SPDX, not SPX, sorry. We need a way to manage and deal with SPDX expressions, which 11:17.820 --> 11:25.380 are combinations, where you just keep the individual licenses and you can say, oh, this is 11:25.380 --> 11:32.820 GPL and BSD, as opposed to, say, here's the BSD, here's the GPL. Today the approach in Nix is 11:32.820 --> 11:40.900 to store GPL, BSD, GPL and BSD, GPL or BSD, GPL and MIT and all these combinations, which, again, cannot 11:40.900 --> 11:41.900 scale.
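The scaling argument above can be sketched in a few lines of Python: if only the individual licenses are stored, any combination can be checked by pulling the identifiers out of an expression instead of adding one table entry per combination. This is a hedged illustration, not the speaker's Python code and not a full SPDX expression parser: the names KNOWN_LICENSES, license_keys and unknown_licenses are hypothetical, the known set is a tiny made-up sample, and only AND, OR and parentheses are handled (WITH exceptions are left out).

```python
import re

# Hypothetical, tiny sample of individually tracked license identifiers.
KNOWN_LICENSES = {"GPL-2.0-only", "BSD-3-Clause", "MIT"}

# Only the two combining operators are handled in this sketch.
OPERATORS = {"AND", "OR"}


def license_keys(expression):
    """Return the individual license identifiers used in an expression."""
    # Parentheses and whitespace are skipped; every other run of
    # identifier characters is either an operator or a license key.
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    return [token for token in tokens if token.upper() not in OPERATORS]


def unknown_licenses(expression):
    """Return the identifiers in an expression that are not tracked yet."""
    return sorted(set(license_keys(expression)) - KNOWN_LICENSES)


if __name__ == "__main__":
    print(license_keys("GPL-2.0-only AND (BSD-3-Clause OR MIT)"))
    print(unknown_licenses("MIT OR Apache-2.0"))
```

With this shape the table grows with the number of distinct licenses, while the number of possible expressions built from them can grow without touching the table.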
11:41.900 --> 11:49.420 I have Python code for that; we need help to bring that to fruition in Nix. 11:49.420 --> 11:55.460 We also need to find a way, when you have packages copied into packages, to make sure we 11:55.460 --> 12:01.740 have correct origins. It's going a bit beyond license: if you have, say, in a Rust 12:01.740 --> 12:10.740 crate, a vendored copy of OpenSSL, and you only know about the top-level Rust package, 12:10.740 --> 12:14.580 and you forgot that you have a vulnerable copy of OpenSSL, that's a problem. 12:14.580 --> 12:19.380 This may go undetected when you actually build Nix packages, because you may not 12:19.420 --> 12:24.500 devendor the OpenSSL copy systematically, and the OpenSSL copy may have 12:24.500 --> 12:29.340 been patched, for all you know, and that's a problem. So it's really interesting and 12:29.340 --> 12:33.700 important, not only for license, but also for security; eventually it's also an issue 12:33.700 --> 12:36.700 for upstream, right? 12:36.700 --> 12:44.220 We want to do that on a slice as big as possible of the Nix packages, and rinse 12:44.260 --> 12:50.780 and repeat. And again, the goal is not for us to own that, but to empower the package 12:50.780 --> 12:55.140 maintainers to have this information, and the Nix community, and the Nix 12:55.140 --> 13:00.180 foundation, to own this stuff, if possible. 13:00.180 --> 13:02.620 And in the end, we reuse that for other ecosystems. 13:02.620 --> 13:10.300 We have a running project that started in this space, with the maintainers of Log4j. 13:10.340 --> 13:14.780 You cannot make a presentation on security issues and open source without talking about 13:14.780 --> 13:22.060 Log4j and Log4Shell, and what we're doing is working together to fix the problem at 13:22.060 --> 13:27.380 Maven, to find hidden copies of Log4j.
13:27.380 --> 13:31.860 We're doing something on Rust also. Debian is in need of love, and suffers a lot from these 13:31.940 --> 13:35.940 licensing issues, when a package is uploaded in a... 13:42.740 --> 13:51.060 They lose track, and the metadata usually drifts, so if we can do good for Nix and 13:51.060 --> 13:55.140 for Debian, that would be awesome. 13:55.140 --> 13:56.740 That's it. 13:56.820 --> 14:03.140 We're, as I said, a public-benefit nonprofit foundation based in Brussels, recently formed. 14:03.140 --> 14:10.420 We've been blessed to receive support from the German government, large US corporations, 14:10.420 --> 14:16.420 and a lot from the European Union through the NGI program, in particular with NLnet. 14:16.500 --> 14:27.540 We live and survive as a charity, and I just want to say that we're trying to help, 14:27.540 --> 14:31.940 so if you have questions... and that's pretty much it. 14:39.780 --> 14:40.740 Any question? 14:44.420 --> 14:45.060 Yes, go ahead. 14:46.580 --> 14:58.740 So, the question is, how do you handle the less common licenses? 15:03.140 --> 15:08.740 So, the detection in ScanCode, as I said, is stupid. 15:09.700 --> 15:16.340 You take all the variations, known variations of MIT, which are bona fide MIT, 15:16.340 --> 15:21.540 the original correct license text that's agreed upon, because there's no real 15:21.540 --> 15:29.860 original version for MIT, and all the known bad variations, and you're trying to match that exactly, 15:29.860 --> 15:34.260 using a diff. When I said using a diff, eventually we use a modified version of diff, 15:34.340 --> 15:42.580 which works on big vectors, but it doesn't matter. If there's a variation, like a one-word difference, 15:42.580 --> 15:49.140 you'll say, there's no exact match.
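The exact-match idea in this answer can be illustrated with a short Python sketch: texts are reduced to lowercase words so that case, punctuation and line breaks do not matter, and a rule matches only if its whole word sequence appears in the scanned text, so a single extra word such as "not" breaks the match. This is a naive illustration under stated assumptions and not ScanCode's implementation: the rule names and the shortened rule texts are made up for the example, and the scan is a plain sliding window rather than the indexed, automaton-based search the speaker goes on to describe.

```python
import re


def words(text):
    """Lowercase word tokens; punctuation, case and line breaks are ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical, heavily shortened rule texts; a real index holds full texts.
RULES = {
    "mit": "Permission is hereby granted, free of charge, to any person",
    "proprietary-fubar": "Permission is hereby not granted, free of charge, to any person",
}


def detect(query_text):
    """Return the names of rules whose exact word sequence occurs in the query."""
    query = words(query_text)
    found = []
    for name, rule_text in RULES.items():
        rule = words(rule_text)
        size = len(rule)
        # Naive sliding window over the query; fine for a sketch, slow at scale.
        if any(query[i:i + size] == rule for i in range(len(query) - size + 1)):
            found.append(name)
    return found


if __name__ == "__main__":
    print(detect("/* Permission is hereby granted,\n * free of charge, to any person */"))
    print(detect("PERMISSION IS HEREBY NOT GRANTED, free of charge, to any person"))
```

Indexing the known bad variations alongside the good ones is what lets a look-alike text be reported under its own name instead of being mistaken for the license it imitates.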
As a matter of fact, we also index these 15:49.860 --> 15:55.380 FUBAR licenses, so we detect them eventually as proprietary licenses, but it's really string matching 15:55.620 --> 16:04.260 at the word level, and the goal is to have as big a database of examples of licenses as possible, 16:04.260 --> 16:12.500 at the moment about 40,000, and the more we have of these examples of bad and good licenses, 16:12.500 --> 16:20.180 the faster we are at detecting, because we build automatons, and we use an algorithm for search 16:20.180 --> 16:27.460 called Aho-Corasick, which is used typically for detecting viruses. And it allows very 16:27.460 --> 16:34.100 efficient, fast, exact detection, detecting the exact sequence of words, ignoring formatting 16:34.100 --> 16:41.780 and all that kind of stuff. So it's really string matching on a large scale. The one thing to understand 16:41.860 --> 16:51.540 is that, say, with an index like Google's, a query, ignoring the new AI mode, was limited to 32 16:51.540 --> 16:59.300 words, and the index is petabytes. In our case, a codebase, a large codebase, could be a 16:59.300 --> 17:09.700 couple of gigabytes, but the codebase is the query. The index is 10, 20, 30 megabytes; the codebase is 17:09.780 --> 17:15.780 the query. It's an interesting and less common problem: how do you deal with gigabyte-size queries 17:15.780 --> 17:22.340 on a small index? I'm going to see, hopefully people can hear me with this microphone, 17:22.980 --> 17:26.180 I apologize. No, that's okay, that's the sign to tell me it's over. 17:27.860 --> 17:31.460 Philip, I'm going to thank you very much. We can take one more question. 17:33.700 --> 17:37.700 I apologize for not introducing you ahead of the talk, we're still figuring out the microphone 17:37.700 --> 17:44.580 situation a bit. So we're going to see if using the mic to ask questions that way 17:44.580 --> 17:47.860 will work, so speakers don't have to repeat the questions.
Are there any further questions? 17:48.900 --> 17:50.500 Please raise your hand if you have a question. 17:52.420 --> 17:56.100 Somebody can think of a question, right? Can be interesting. Okay, so go ahead then. 17:56.100 --> 18:11.060 No, I didn't hear the question, sorry. Sorry, I did lose my phone coming. Can you hear me? 18:13.060 --> 18:21.380 Is it working? Yes. I'll shout, okay. I was wondering, is there one specific place for working 18:21.380 --> 18:25.780 on this project? Because there's a lot of links on the board. Yes, that's a good point. 18:25.780 --> 18:33.460 So there's one specific place, which is a board of issues we track on the AboutCode organization. 18:33.460 --> 18:39.700 I'll make sure I put the link on the FOSDEM talk page. Thank you for that. I forgot about it. 18:41.540 --> 18:45.380 All right, thank you very much. Thank you. 18:51.380 --> 18:53.380 Thank you.