WEBVTT

00:00.000 --> 00:12.240
We're back on time, so I'm very pleased to present Tilo Mathes, who will be

00:12.240 --> 00:16.560
speaking about something that people who sit at computers often kind of forget,

00:16.560 --> 00:24.680
which is that biology is actually an experimental science. Over to you.

00:24.760 --> 00:32.680
Right. And I'm co-presenting here with José, so it's a very important part

00:32.680 --> 00:36.520
because we're talking essentially about two open source projects that came together to solve

00:36.520 --> 00:43.480
a problem in science, and it also has to do with provenance and tracking essentially the

00:43.480 --> 00:50.600
scientific progress. So yeah, I really want to start with a small overview of how you get

00:50.600 --> 00:54.200
from primary data to computational analysis, and what is involved there.

00:54.680 --> 01:00.360
And then we'll both talk about our collaboration, how we thought about exchanging data

01:00.360 --> 01:08.520
between two tools that do quite different things, essentially leading to connecting computational

01:08.520 --> 01:15.160
workflows with overall research documentation. So I want to start with an example that's a real

01:15.160 --> 01:23.640
person, Rolizsa. She is working with us on certain things, and she's one example of a researcher who does

01:23.720 --> 01:30.600
a lot of wet lab work. So there's a lot of sample preparation, DNA extraction. And then

01:30.600 --> 01:36.200
she hands it over to computational work. And a lot of the time these researchers are doing

01:36.200 --> 01:42.600
all of these things to some degree, right? So for her, the challenge is just how to

01:42.600 --> 01:50.280
seamlessly connect that primary data, these physical samples, to the experimental context and then

01:50.280 --> 01:56.280
to the analysis workflows that have been run on it afterwards.
So if you look at the tools and

01:56.280 --> 02:01.080
the workflows Rolizsa uses, you see there's a lot going on, right? There's cultivating cells,

02:01.080 --> 02:08.200
sample preparation, a lot of small data collection, like images, handwritten notes, digital notes,

02:08.200 --> 02:13.960
all kinds of stuff. The data lives across different kinds of file stores, local computers, etc.

02:13.960 --> 02:19.000
Right. And then essentially a lot of the data is analyzed. So there are code snippets lying around,

02:19.000 --> 02:26.600
Jupyter notebooks, HPC scripts, etc. Right. So how, essentially, do you get all this

02:26.600 --> 02:34.760
together? And here is where we briefly want to introduce the tools that we're looking at.

02:34.760 --> 02:41.080
So I work at ResearchSpace. We build RSpace. It's an open source research platform for

02:41.080 --> 02:46.440
institutional research data management. So if you look at the core of this very busy image here,

02:46.440 --> 02:50.280
there's an electronic lab notebook and an inventory management system in there,

02:51.000 --> 02:56.200
which help researchers in the active phase of sample preparation and primary data

02:56.200 --> 03:02.040
preparation to document what they did and what came out of it. What differentiates us from

03:02.040 --> 03:07.320
other notebook tools and inventory management tools is that we're very interoperable

03:07.320 --> 03:11.960
with other research tools and research infrastructure, such as institutional file stores,

03:12.280 --> 03:18.840
data management planning tools, etc. And through these integrations, RSpace

03:18.840 --> 03:25.560
can essentially become a hub for recording scientific progress. And with that I'll hand it

03:25.560 --> 03:36.440
over to you. Thank you. So, about Galaxy: we are basically an open source data analysis platform

03:36.520 --> 03:45.160
on the web.
It originally started in the field of bioinformatics, but over time it has been

03:45.880 --> 03:51.640
expanding to other scientific domains such as climate science, ecology,

03:51.640 --> 03:56.840
cheminformatics, imaging, data science, materials science, astronomy, and most recently,

03:56.840 --> 04:05.640
even the humanities. I put a very simple example there, an extremely simple analysis that you

04:05.640 --> 04:14.520
could do on imaging data, which is basically counting the number of points that you have on this

04:14.520 --> 04:27.160
microscope image, where you can see... Okay, great. Where you can see basically stained cells.

04:28.200 --> 04:35.000
But most typically you don't carry out this kind of analysis; you use Galaxy in conjunction

04:35.000 --> 04:41.400
with an HPC computing network and you run computationally heavy and more complex analyses. For

04:41.400 --> 04:47.880
example RNA-seq or, I don't know, protein folding, or all these sorts of analyses that you do in

04:47.880 --> 04:55.080
bioinformatics. You can use the platform via a web browser or via an API, and most recently we also have

04:55.080 --> 05:04.520
an MCP server, so if you want to connect AI to it, that's also possible. So the design of

05:04.520 --> 05:13.240
Galaxy is aimed at researchers, and it focuses on accessibility and transparency. In the design,

05:13.240 --> 05:20.520
let's say the two most important concepts are histories, which are a sequence of

05:20.520 --> 05:26.520
no-code, reproducible transformations that you apply to datasets, and those are carried out

05:26.520 --> 05:31.400
by the so-called Galaxy tools, which are light wrappers around existing bioinformatics tools.

05:33.560 --> 05:39.160
Then the next logical step is workflows, which are recipes that spawn a history from

05:40.360 --> 05:47.080
a set of inputs.
Basically, you have some dynamic control flow as well on top of that, and they

05:47.160 --> 05:54.120
can be created using the workflow editor, which is a GUI where you can basically arrange the boxes,

05:54.120 --> 06:00.440
add boxes, connect them, and in this way control the flow. You can also create workflows from a

06:00.440 --> 06:09.400
history, etc. And then the design is very friendly to the FAIR principles because of some

06:09.400 --> 06:15.640
characteristics. I will mention only some of them, like the append-only design. You also have versioned

06:15.800 --> 06:22.280
tools and versioned workflows. It's possible to publish everything. You have seen some links around

06:22.280 --> 06:27.400
my presentation. This is because I've been publishing the analysis and the workflow. You can just

06:27.400 --> 06:36.680
click there and you will have access to it. You can also export workflows and histories, and the

06:36.680 --> 06:41.560
platform can interoperate with many storage systems and compute resources, and you can even bring your own

06:42.200 --> 06:53.480
for your user. Of course, this history system has built-in provenance, and that's about it.

06:53.480 --> 07:00.200
The last part of what Galaxy is: of course, also a community. We would be nothing without

07:00.200 --> 07:08.200
our users, because the community maintains some critical parts of the infrastructure for Galaxy. One is

07:08.200 --> 07:13.560
the Tool Shed, which is a public repository of tools contributed by the Galaxy community, which has

07:13.560 --> 07:21.000
over 10k tools. Then there is also the Galaxy Training Network, which is a large collection of

07:21.000 --> 07:27.160
tutorials that are also contributed by the community. You just visit this URL and you have access to

07:27.160 --> 07:33.960
a wide range of tutorials in different scientific domains and also in Galaxy development, such as

07:34.840 --> 07:42.920
administering a server or developing tools.
Then the communities themselves aim at specific research

07:42.920 --> 07:49.800
areas. Many of them have their own subdomain, so they can have their own list of tools and so on.

07:51.000 --> 07:57.000
Finally, but most important, are the public Galaxy servers that can be accessed by anyone and are

07:57.080 --> 08:06.440
free to use. I personally maintain the usegalaxy.eu server, but there's also a large US server

08:06.440 --> 08:11.800
and an Australian server and also a French one, and more are in the making. I'm just listing,

08:11.800 --> 08:18.280
let's say, the ones that have existed for a longer time, but we are incorporating, for example,

08:18.280 --> 08:27.160
a Belgian server and so on. So with this, I hand back over to Tilo, although you will hand back

08:27.160 --> 08:33.960
to me. Thanks. Yeah. So essentially, as you've seen, we now have two tools and we want to use the

08:33.960 --> 08:40.280
best of both worlds. So in Galaxy, you have user-friendly access to computational workflows.

08:40.280 --> 08:45.720
RSpace is basically a documentation hub, right? It can also then put your data

08:46.520 --> 08:52.200
elsewhere later, once you're done with it. So we thought about how we can use these two things

08:52.200 --> 09:02.600
to streamline workflows and help researchers like Rolizsa. And so essentially,

09:03.880 --> 09:11.880
from the Galaxy side, we incorporated RSpace as a file source. This means that you can mount

09:11.960 --> 09:19.400
RSpace as a repository in Galaxy. And you can therefore import or export datasets from

09:19.400 --> 09:26.680
RSpace or to RSpace. You can also export whole histories. You can also export workflows

09:26.680 --> 09:33.480
as RO-Crates. And the setup process is quite simple. Basically, you go to your user preferences

09:33.480 --> 09:39.080
and you will have a screen to connect your own instance or whatever RSpace instance you want.
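
NOTE
A minimal sketch of the kind of metadata an exported workflow RO-Crate carries,
assuming the RO-Crate 1.1 layout (a ro-crate-metadata.json file holding a JSON-LD graph).
The function name, file name, license, and RSpace link are hypothetical placeholders,
and this is not the exact metadata that Galaxy writes on export.

```python
import json


def build_ro_crate_metadata(
    workflow_file: str,
    name: str,
    description: str,
    date_published: str,
    source_url: str,
    license_url: str = "https://creativecommons.org/licenses/by/4.0/",
) -> dict:
    """Build a minimal RO-Crate 1.1 metadata document for one workflow file.

    All values are illustrative. A real export would describe many more
    entities (datasets, the invocation, tool versions).
    """
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: marks this file as RO-Crate metadata.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: the crate itself and the files it contains.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_url},
                "hasPart": [{"@id": workflow_file}],
                # Assumed way to link back to the originating lab document.
                "isBasedOn": {"@id": source_url},
            },
            {
                # The workflow file, typed as a computational workflow.
                "@id": workflow_file,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
            },
        ],
    }


if __name__ == "__main__":
    crate = build_ro_crate_metadata(
        workflow_file="workflow.ga",
        name="Tree crown counting",
        description="Workflow exported from a Galaxy history",
        date_published="2026-01-01",
        source_url="https://rspace.example.org/globalId/SD12345",
    )
    print(json.dumps(crate, indent=2))
```
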

09:39.080 --> 09:47.800
And then, basically, after that, you have it available as a repository for importing, exporting,

09:47.800 --> 09:56.840
and browsing files. Right. And then we thought we wanted to make it also more user-friendly

09:56.840 --> 10:04.280
from the RSpace side. So we created a small affordance where researchers can, from a document that

10:04.280 --> 10:09.480
they have in RSpace that has data connected to it, basically create a new history in Galaxy

10:09.480 --> 10:16.360
and send that data over automatically. There's also some metadata being pushed

10:16.360 --> 10:24.200
over automatically, like unique identifiers of the file and the data in RSpace. And then, as soon as

10:24.200 --> 10:29.720
the workflow is invoked on the Galaxy side, the user in RSpace can actually keep track of what's going on

10:29.720 --> 10:35.640
and when it's done. And then when it's done, they can put the data back into their RSpace document.

10:36.360 --> 10:44.120
Right. And yeah. And as José said, these exports can be in the form of

10:44.120 --> 10:52.120
RO-Crates, right? Or BioCompute Objects. And last but not least, if we have a little bit of time,

10:52.200 --> 11:00.200
we have a video of how this looks. Not sure if the sound will work.

11:06.200 --> 11:08.920
Oh, yes. That looks good. Yeah.

11:08.920 --> 11:12.280
The video doesn't string. Let me test this out. We haven't seen the information.

11:12.280 --> 11:16.600
[Video narration] ... seamless data analysis and documentation, while automatically

11:16.600 --> 11:23.400
keeping provenance from primary data to analysis results. Starting in RSpace, researchers

11:23.400 --> 11:29.240
can select data attached to their documents, like this aerial forest photograph, and upload it directly

11:29.240 --> 11:37.800
to Galaxy with one click.
RSpace automatically creates a new Galaxy history with systematic

11:37.800 --> 11:48.920
naming based on the RSpace document that the data was attached to. Additionally, RSpace adds

11:48.920 --> 11:54.120
metadata linking back to the original experimental documentation and data in the annotation of

11:54.120 --> 12:04.520
the transferred files. In Galaxy, researchers can now run any available workflow on their data.

12:04.760 --> 12:18.920
Here, we're using Voronoi segmentation to identify and count individual tree crowns in the forest image.

12:27.320 --> 12:32.600
Back in RSpace, researchers can track the progress of their analysis in Galaxy by inspecting

12:32.680 --> 12:42.520
the workflow status. When complete, the researcher can use the direct links to navigate to the specific

12:42.520 --> 12:57.160
invocation and its results in Galaxy. From the invocation view, researchers that have set up RSpace

12:57.160 --> 13:02.760
as a file source in Galaxy can export complete workflow packages directly back to a

13:02.760 --> 13:05.960
location of their choice in the RSpace Gallery.

13:20.120 --> 13:25.640
Alternatively, they can select specific result files to be transferred back to the RSpace Gallery

13:25.640 --> 13:29.640
using the send data tool.

13:44.760 --> 13:50.360
Finally, results can be efficiently integrated directly into the original RSpace document,

13:50.360 --> 13:56.760
maintaining full provenance from experimental data through computational analysis and making it available

13:56.760 --> 14:00.280
to RSpace's digital ecosystem of integrated tools and services.

14:07.960 --> 14:13.640
This integration is available now with RSpace version 1.13 and Galaxy 254.

14:13.640 --> 14:24.760
That was that. We also want to conclude. We have a couple of links to the Galaxy community

14:24.760 --> 14:31.000
and also to the RSpace community resources. We actually also have an open office hour coming

14:31.000 --> 14:37.560
up next week.
You're all invited to join if you want. With that, any final words from you?

14:38.520 --> 14:42.520
Oh yeah, thanks for your attention.

14:47.160 --> 14:52.440
So unfortunately, we do not have time for questions, but we do have a Matrix space which you can go

14:52.440 --> 14:57.080
to to ask questions, and you should post about the office hours and also the in-of-go-hack

14:57.080 --> 15:03.000
from coming up as well. People who want to leave, please do leave. Let other people in. Thank you very much.
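
NOTE
The systematic history naming and the back-linking annotation described in the video
could look roughly like the sketch below. The naming scheme, the annotation wording,
the identifiers, and the /globalId/ link pattern are assumptions made for illustration.
They are not taken from the RSpace or Galaxy source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RSpaceAttachment:
    """Hypothetical view of one file attached to an RSpace document."""

    document_id: str
    document_name: str
    file_id: str
    file_name: str


def history_name(attachment: RSpaceAttachment) -> str:
    """Derive a Galaxy history name from the originating document."""
    return f"RSpace {attachment.document_id}: {attachment.document_name}"


def file_annotation(attachment: RSpaceAttachment, rspace_url: str) -> str:
    """Build annotation text linking a transferred file back to its source."""
    base = rspace_url.rstrip("/")
    return (
        f"Source document: {base}/globalId/{attachment.document_id}; "
        f"source file: {base}/globalId/{attachment.file_id}"
    )


if __name__ == "__main__":
    photo = RSpaceAttachment(
        document_id="SD12345",
        document_name="Forest survey",
        file_id="GL678",
        file_name="aerial_forest.png",
    )
    print(history_name(photo))
    print(file_annotation(photo, "https://rspace.example.org/"))
```

Keeping both identifiers in the annotation means an exported result can be traced back to the document and to the exact file that was analysed.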