WEBVTT 00:00.000 --> 00:11.280 Hello, I am Dorothy Benamu and I am working at the National Library of France as part 00:11.280 --> 00:18.280 of the web archiving team and this presentation is about how we manage the tension between 00:18.280 --> 00:23.920 hosting large amounts of copyrighted data with limited access and meeting researchers' 00:23.920 --> 00:32.040 needs to explore these data with open source software. 00:32.040 --> 00:37.560 So since the early 2000s, the National Library of France has been collecting the 00:37.560 --> 00:46.360 French web, that is to say that we regularly collect samples of the cultural production 00:46.360 --> 00:52.400 that is online and which is, as you know, very ephemeral, so this allows researchers 00:52.400 --> 01:00.320 to study the political, social, scientific debates that either primarily take place 01:00.320 --> 01:07.360 on the web or at least find significant echoes there and to study as well the major 01:07.360 --> 01:12.800 transformations that the web has brought into every aspect of our lives. 01:12.800 --> 01:22.360 So the BnF is authorized to do this under a law from 2006 and what does it mean to actually 01:22.360 --> 01:29.320 collect the web? So we use software, web crawlers, bots, to do so. 01:29.320 --> 01:34.800 We are using open source tools that we develop together with other institutions and 01:34.800 --> 01:40.920 organizations that are part of the International Internet Preservation Consortium, the 01:40.920 --> 01:50.640 IIPC, and so you might be familiar with the Internet Archive's Wayback Machine, so 01:50.640 --> 01:56.840 Internet Archive is part of the IIPC too and we use the same kind of tool for accessing 01:56.840 --> 02:01.680 and browsing past versions of the web. 02:01.680 --> 02:06.600 The main difference is that we focus on the French web so we have more comprehensive collections 02:06.600 --> 02:11.240 of websites hosted or produced in France.
02:11.240 --> 02:17.240 So what is the scope of the collections of web archives that are available? 02:17.240 --> 02:24.760 So obviously we cannot preserve everything so we try to collect representative samples 02:24.760 --> 02:30.280 regularly of the French web and our harvesting model is a mixed one combining two types 02:30.280 --> 02:32.080 of crawls. 02:32.080 --> 02:38.280 Once a year we run a national domain crawl or a broad crawl; for this crawl there is no selection 02:38.280 --> 02:39.280 process. 02:39.280 --> 02:46.840 We are collaborating with diverse organizations to gather lists of domain names that are hosted 02:46.920 --> 02:55.120 in France and for each of these domains we are collecting around 2,000 URLs. 02:55.120 --> 03:01.520 So last year it represented almost 6 million domain names and complementary to this 03:01.520 --> 03:08.040 we also have thematic or curated crawls so these are thematic selections made by a network 03:08.040 --> 03:15.640 of librarians mostly but also researchers and associations within their fields of expertise. 03:15.720 --> 03:22.880 We have selected websites that are harvested more in-depth and more frequently, with frequencies 03:22.880 --> 03:26.280 ranging from daily to annually. 03:26.280 --> 03:32.880 They cover all disciplines, literature, sciences, history and we also have thematic crawls, for 03:32.880 --> 03:39.560 example for the COVID-19 pandemic or for the European Games or for the 03:39.560 --> 03:40.560 elections. 03:40.640 --> 03:49.120 So to develop this example, for each major French election we collect the electoral debate 03:49.120 --> 03:50.120 around the election.
03:50.120 --> 03:58.680 It means we select websites, blogs, social networks when possible and diverse online content 03:58.680 --> 04:07.160 from political parties, from unions, candidates, associations, scientists, humorists, pollsters 04:07.240 --> 04:14.280 and from any citizens expressing themselves on the internet about the election for example. 04:14.280 --> 04:20.320 So how do we make this data more open to scientific research? 04:20.320 --> 04:28.920 So these are great material for researchers but there are several limitations on those collections. 04:28.920 --> 04:37.920 So the first one is that due to legal restrictions, due to privacy and copyright concerns, 04:37.920 --> 04:42.800 the web archives can only be accessed on-site at the BnF in the research reading rooms 04:42.800 --> 04:49.120 or in a network of 20 regional libraries or institutions that are specifically listed in a 04:49.120 --> 04:50.120 decree. 04:50.120 --> 04:55.760 So this is why they can be considered as closed data and another major challenge is that these 04:56.160 --> 05:03.080 are massive data so we try to provide specific tools and services to allow researchers to explore those 05:03.080 --> 05:10.320 data such as PANDORÆ within the BnF DataLab and finally they are digital artefacts 05:10.320 --> 05:17.400 so they are traces or records of what was on the web at a given period and if researchers 05:17.400 --> 05:22.520 want to use them for their research they have to understand how they were constituted, 05:22.600 --> 05:29.720 the intellectual and technical choices that were made, and have to deal with the biases and 05:29.720 --> 05:35.560 they have to deal with the incompleteness, the multiple versions of the same page, temporal 05:35.560 --> 05:39.880 inconsistencies etc. 05:39.880 --> 05:47.320 So making them more open for research remains a major challenge and this is one of the 05:47.400 --> 06:00.680 issues that PANDORÆ is addressing and I will let Guillaume explain to you why.
06:00.680 --> 06:07.080 All right so hi everyone I'm just going to try to stick that right here if it works. 06:07.080 --> 06:14.760 My name is Guillaume Levrier, I'm a political scientist, I'm a researcher and as such, well, again 06:14.840 --> 06:20.200 we've talked about this earlier today already, for me the practice of doing research is about 06:20.200 --> 06:25.960 building a method that serves an epistemological goal, trying to build something that we try to call 06:25.960 --> 06:32.680 scientific knowledge. The thing is today that label is very rarely enforced; what we call 06:33.480 --> 06:39.480 scientific knowledge in common speech is usually things that you find in articles that are 06:39.480 --> 06:46.760 in peer-reviewed journals but what that means de facto is often left unsaid, as in the reviewers 06:46.760 --> 06:52.120 that are supposed to vet the research usually don't have access to raw data or to the method. 06:52.760 --> 06:59.400 So it becomes more, and again it depends on the discipline, but it becomes more about the 06:59.400 --> 07:06.440 plausibility of a narrative rather than the actual work. Hence the need, again this is why we're here, 07:06.600 --> 07:13.240 for tools that are free, open source, with intelligible source code, whose execution can be decentralized 07:13.240 --> 07:18.040 when it's possible and whose outputs are in the control of the user and that's how we connect 07:18.040 --> 07:24.680 with the previous presentation. We hope that having both the process and the tools available 07:24.680 --> 07:33.320 over a long period of time gives us a better chance of being able to reproduce the work 07:33.880 --> 07:39.800 and hence the very relevant question we had on the hardware, it's true that sometimes the hardware 07:39.800 --> 07:44.680 can be a big issue but it's the best we can do given the means that we have.
07:46.120 --> 07:50.520 But as researchers we're always trying to reach all available data sources for one's research 07:50.520 --> 07:55.080 which collides a little bit with the idea of being reproducible and accessible to most 07:55.800 --> 08:00.520 which is why I started building PANDORÆ which is a software that does 08:01.240 --> 08:06.920 three things or has many processes that can be abstracted into three things: harvesting data, 08:06.920 --> 08:13.080 standardizing data and exploring this data. So basically how it works is that it has a first 08:13.080 --> 08:18.600 process called Flux that is basically connected to different types of APIs that it calls in a way 08:18.600 --> 08:25.960 that is respectful of the API in order not to drown it. Then the data it queries has to be 08:26.520 --> 08:32.120 abstractable as documents to be poured into Zotero which is useful because researchers usually, 08:32.120 --> 08:36.520 even if they're not very good at using computers, they know how to use Zotero and 08:36.520 --> 08:43.400 Zotero is both a web service, it's a desktop software, it's an SQL database behind it, so it provides 08:43.400 --> 08:50.520 a lot of tools and all documents can have attachments as well and notes so it's quite a powerful 08:50.600 --> 08:54.600 open source software and I think everyone is thankful in the scientific community for Zotero 08:55.320 --> 09:01.960 and Types is a series of dataviz systems that are calibrated to explore 09:01.960 --> 09:09.400 corpuses of documents and that rely on D3.js. What does this mean in the context of the 09:09.400 --> 09:16.040 web archives at the National Library of France, so the BnF?
It means that we need to first 09:16.040 --> 09:24.040 identify that when you're on site, when you have access to the archives that again are 09:24.040 --> 09:30.280 protected, you need to check whether there is something in the archive that's interesting to you 09:30.280 --> 09:37.000 and then you do the full circle of querying the data, abstracting it into Zotero, which is tricky 09:37.000 --> 09:43.160 because you cannot get the data out, then visualizing the corpus and then having the opportunity 09:43.160 --> 09:47.320 of looking at each capture of web pages that you want to explore in the web archive browser. 09:48.120 --> 09:54.200 So this is the tension that we had, the whole thing mentioned earlier, it's open source 09:54.200 --> 09:59.000 but it's closed data; of course we're not the first people to have that problem but we have to 09:59.000 --> 10:05.560 negotiate the specific constraints that we have both in terms of law in France and in terms of 10:05.560 --> 10:12.120 technical capacity from that institution to be able to provide access to such data. So quickly this is 10:12.120 --> 10:17.000 what it looks like: it's a form, it's a search engine, you check that you have a number of results 10:17.000 --> 10:21.800 that's interesting to you; here I was looking for final in the 2002 election in France 10:22.600 --> 10:28.680 and once you are in that software, in PANDORÆ, it detects that you are within a network that is 10:28.680 --> 10:36.280 authorized, gives you the opportunity to do the same request on the same data set, tells you that you have, 10:36.280 --> 10:41.400 so this is a different request actually, it's on Dolly the little sheep, it tells you that you have 10:42.280 --> 10:48.760 a number of captures that are relevant to your request over time in the collection that's 10:48.760 --> 10:55.640 relevant and gives you an example of the kind of websites that contain those terms because those 10:55.640 --> 11:02.280 are full text indexed
collections and then the solution that we found is that we send only the 11:02.280 --> 11:08.840 metadata to Zotero so we did not upload the actual content of the fields but we only selected, we 11:08.920 --> 11:15.400 remapped the documents and selected a series of metadata, not all of them, that were relevant to 11:17.160 --> 11:22.680 the corpus and so here you have an example, you have the website title, the host, the date 11:23.480 --> 11:29.800 and we used to pass a JSON object, a stringified JSON object, as the short title because it's 11:29.800 --> 11:34.760 common to all Zotero documents but now we cannot do that anymore, we used to be able to do that, so now 11:34.760 --> 11:41.480 we upload notes instead and then we re-send it back to PANDORÆ where you can see it as a corpus; 11:42.520 --> 11:48.200 here you have in blue the captures and the links towards documents that are different and so when you click 11:48.200 --> 11:57.560 a link, if you are within the BnF, it sends a query and gives you access to the full content so it tells you 11:57.560 --> 12:03.240 how many captures there are over time and you can look forward within the content so it tells you 12:03.240 --> 12:08.200 and it gives you a short excerpt so you can have an idea of how relevant the term you're looking for 12:08.200 --> 12:14.040 is in the context of that page. So the way we found to conceptualize this kind of relationship 12:14.040 --> 12:21.240 that enables us to build open software with, in that instance, closed data is the one-way mirror model 12:21.240 --> 12:28.200 so in this model we consider the web archive as a bit like a suspect in an interrogation room 12:29.160 --> 12:33.480 and they have a one-way mirror on their side so they only see themselves and you can ask them 12:33.480 --> 12:40.280 questions so we built that model or metaphor based on three properties: one is that you can harvest the 12:40.280 --> 12:45.000 data when you're in the room but you cannot have write access to
it so you can talk to the person 12:45.000 --> 12:50.520 but you cannot change what the person knows; you can take parts of each record, so here the 12:50.520 --> 12:55.720 metadata and not the whole thing but only some fields that we determined, but you cannot send requests 12:55.720 --> 13:01.320 when you're away, you have to be in here to ask the questions; and you can come back for a highlight 13:01.320 --> 13:06.840 on a specific piece, so that refers to that part of the exploration when you click on a node and 13:06.840 --> 13:13.880 send your request for information on a specific node, but you cannot take the whole body out so you cannot 13:13.880 --> 13:19.720 ask for the whole corpus to be extracted because there would be a risk that then you would just put 13:19.800 --> 13:26.440 it on a USB key and walk home with it so that's how we built our model. Thank you for your 13:26.440 --> 13:45.080 attention, we'll be taking questions. Yep? Hi, so how much of this is opposed by rights holders on 13:45.160 --> 13:52.200 a continual basis, so are you always fighting the problem that you have to be careful as to what 13:52.200 --> 13:59.880 people can take out of the closed room? Yes, yes, oh yes sorry, the question, I should know that, so the 13:59.880 --> 14:05.960 question is how do we manage the fact that everything is under the rights of the creators, of the 14:05.960 --> 14:12.280 authors, and how much we have to fight that, so there is a context that is still to be I guess proven 14:12.280 --> 14:19.640 by practice but basically the idea is that you're not justified in taking anything out 14:19.640 --> 14:27.640 of there except if you're a researcher and you need to take an excerpt to show in good faith that 14:27.640 --> 14:33.480 the argument actually sends back to something that is empirically in the database so if 14:33.480 --> 14:38.200 a researcher reads your paper and wants to know more or to check that what you're quoting is 14:38.200 -->
14:43.800 actually accurate and you have that number of documents that are in the family of the phenomenon that 14:43.800 --> 14:50.680 you're trying to describe, they can come to the BnF and use hopefully the same tool on the same data 14:50.680 --> 14:56.680 and find the same results so it's extremely coercive in a way, there's no way of taking out the data, 14:56.680 --> 15:01.800 you're just supposed to be able to quote it and that's the liberty I took as a researcher 15:01.800 --> 15:07.480 to just show you those little sentences, also I think this is still on the live web so you might 15:07.560 --> 15:13.560 find it but so yeah it's very narrow and it's a problem for research but for now 15:13.560 --> 15:23.160 it's how it is. I think maybe time for one more question, otherwise let's thank the speakers.
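
The one-way mirror model described in the talk (queries answered only from an authorized network, read-only access, and export limited to a few metadata fields serialized into a note) can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: the function names, the network range and the shape of a capture record are all hypothetical; only the three exported fields (title, host, date) come from the talk.

```python
import ipaddress
import json

# Hypothetical values for illustration only; not the library's real configuration.
AUTHORIZED_NETWORKS = ("192.0.2.0/24",)  # documentation-only address range
EXPORTABLE_FIELDS = ("title", "host", "date")  # metadata fields named in the talk


def is_on_site(client_ip: str) -> bool:
    """Return True if the client address belongs to an authorized network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in AUTHORIZED_NETWORKS)


def project_metadata(capture: dict) -> dict:
    """Keep only the exportable fields; the page content is never copied."""
    return {key: capture[key] for key in EXPORTABLE_FIELDS if key in capture}


def to_reference_item(capture: dict) -> dict:
    """Build a metadata-only item, with the metadata serialized as JSON in a note."""
    metadata = project_metadata(capture)
    return {
        "title": metadata.get("title", ""),
        "date": metadata.get("date", ""),
        "note": json.dumps(metadata, sort_keys=True),
    }


def harvest(captures: list[dict], client_ip: str) -> list[dict]:
    """Read-only harvest: refuse off-site requests and export metadata only."""
    if not is_on_site(client_ip):
        raise PermissionError("queries are only answered from an authorized network")
    return [to_reference_item(capture) for capture in captures]
```

In this sketch, calling `harvest` from an address outside `AUTHORIZED_NETWORKS` raises `PermissionError`, and a capture's `content` field never appears in the returned items, which mirrors the rule that the whole body cannot be taken out of the room.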