WEBVTT 00:00.000 --> 00:14.080 All right, can everyone hear me well? Perfect. So today we're going to talk about the 00:14.080 --> 00:18.560 nf-core community. It's basically a community on how to make your Nextflow workflows run 00:18.560 --> 00:24.440 correctly, reproducibly and easily, and hopefully make it easy to write your Nextflow workflows. 00:25.400 --> 00:30.040 But first of all, what is Nextflow? Who of you knows what Nextflow is? Can you 00:30.040 --> 00:35.720 raise your hands? All right, perfect. For those of you that don't know Nextflow, Nextflow is 00:35.720 --> 00:40.280 basically a workflow manager. So a language, a runtime and a community, like it says here, 00:40.280 --> 00:45.560 but mainly a workflow language and manager. You can use this to chain all your bioinformatics 00:45.560 --> 00:50.360 tools right after each other in a parallel way, make your pipeline reproducible and be able 00:50.360 --> 00:56.680 to run it anywhere. So reproducible is the first part I talk about here. This is the most important 00:56.680 --> 01:02.280 part. If you write a Nextflow pipeline, you want it to be able to run everywhere and have 01:02.280 --> 01:08.280 it create the same output. It also needs to be portable. So Nextflow pipelines are heavily 01:08.280 --> 01:14.920 based on container environments and Conda. So you can run them almost anywhere without any 01:14.920 --> 01:20.840 actual changes to the software environments you're working in. Let me just do a technical adjustment 01:20.840 --> 01:32.440 here. It's a little bit higher. Okay, let's try that. Is this better? Is it better? 
01:32.440 --> 01:37.720 Okay, and then the last part, scalable. Of course, you want to start writing your Nextflow pipelines 01:37.720 --> 01:42.120 on your local laptop, with maybe five samples, for example, some very small things, and then you 01:42.120 --> 01:46.200 will want to immediately scale this to, like, five million samples in the cloud, for 01:46.200 --> 01:53.640 example, and this is all possible using Nextflow. But of course, such a big and scary thing 01:53.640 --> 02:00.040 is very versatile, and this highlights the need for standardization. Standardization is introduced 02:00.040 --> 02:06.440 by the nf-core community. So nf-core is basically the community of like-minded people 02:06.520 --> 02:12.360 who write Nextflow pipelines and want to make pipelines usable for everyone. So this starts, of course, 02:12.360 --> 02:18.200 with a set of standards. We have a whole list of guidelines. I won't go through all of them right 02:18.200 --> 02:23.800 now, or I will need like two or three more talks about this in the future. But the main parts 02:23.800 --> 02:30.280 are the documentation and the template. We use a common template for all pipelines, and the use of public 02:30.280 --> 02:35.480 containers to make sure that everyone can actually access all the environments used by the 02:35.560 --> 02:42.120 pipeline. As a best practice, this is probably the most important one, and then of course, 02:42.120 --> 02:48.040 there are a lot of different best practices. Of course, we only allow one pipeline per 02:48.040 --> 02:53.320 data type and/or usage. So this means that there isn't any overlap between several pipelines. 02:53.320 --> 02:57.720 There's only one pipeline that can do your actual analysis for you, so you don't have to go looking 02:57.720 --> 03:04.040 at three or four different pipelines to see which one does your analysis best. 
We also use stable 03:04.040 --> 03:10.920 tags to release our pipelines and run CI/CD to actually test our pipelines. And of course, 03:10.920 --> 03:15.160 they are [unclear] already. This is pretty native in Nextflow right about now, but nf-core 03:15.160 --> 03:22.760 really pushed it through. Cooperation is maybe even more important. We all work together on pipelines, 03:22.760 --> 03:27.160 and there's no single pipeline that is owned by one person. All pipelines are owned by the community. 03:27.480 --> 03:35.400 It's of course open source. Otherwise, we wouldn't be here, I think. And then to make the work 03:35.400 --> 03:39.400 easier, we have a lot of components, which are parts of your Nextflow pipeline that you can actually 03:39.400 --> 03:45.240 use and reuse in other pipelines without having to write these components yourself. This really 03:45.240 --> 03:51.800 speeds up the process of writing Nextflow pipelines and keeps them reproducible. So to talk a bit more about 03:51.880 --> 03:58.040 these components, we have about 141 pipelines right now. I think 60 to 80 of them are released. 03:58.040 --> 04:03.560 I forgot the number, but something around that. Others are still in development, but all pipelines 04:03.560 --> 04:10.280 are really being worked on. Then modules: we have 1,700 modules. Modules are basically 04:10.280 --> 04:15.080 wrappers around bioinformatics tools. So for example, samtools view: we have a module for that, 04:15.080 --> 04:19.640 and you can just pull it into your pipeline and start to use it. And around 100 subworkflows, 04:19.640 --> 04:23.880 which are basically chains of modules that you can start using and which do a specific analysis. 04:25.240 --> 04:31.240 We test our pipelines using minimal test data in CI, and also some full tests for some pipelines. 04:32.520 --> 04:37.480 We use the nf-test framework for this. 
I won't go deeper into that, but you can still come 04:37.480 --> 04:42.520 ask me at the end to get some more information about this. We lint all our pipelines to check 04:42.520 --> 04:46.360 for consistency between all the different nf-core pipelines, to make sure that if you read one 04:46.360 --> 04:53.800 pipeline, you can understand all the other pipelines. We use a schema to validate all inputs 04:53.800 --> 04:58.680 of the pipeline. So the parameters and, for example, your sample sheet, where you define each 04:58.680 --> 05:06.040 specific option for each sample. We use the nf-schema plugin to do this. And then tooling, 05:06.040 --> 05:11.560 which Julia will talk about more at the end of this talk, can be used to tie this all together 05:11.560 --> 05:17.160 and to make it even easier for you to write these pipelines. And to use them, of course. 05:17.160 --> 05:21.880 So a little bit of statistics to show you a bit of the scale of the community. 05:21.880 --> 05:26.920 We have around 12,000 Slack users right now, as you can see in the graph here, 05:27.960 --> 05:35.240 and it's still going up pretty well. And a lot of active users also, so you can say 12,000 Slack 05:35.240 --> 05:40.280 users, but we also have a lot of active users among those. We have more than 500 contributors, 05:40.280 --> 05:45.080 where we count a contributor as someone who made at least one commit to any of the 05:45.080 --> 05:51.560 nf-core repositories. There are 175-plus repositories; 141 of them are the pipelines, 05:51.560 --> 05:57.240 and all the others are more infrastructure-related. Then in total, over all the repositories, we have 05:57.240 --> 06:04.200 around 100,000 GitHub commits, which shows you the progress we made since the inception of nf-core 06:04.200 --> 06:11.320 in 2019. Over all the repositories combined again, we have almost 8,000 GitHub stars 06:11.880 --> 06:21.720 and 790,000 GitHub views. 
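The schema-based input validation mentioned in this segment can be illustrated with a small sketch. This is not the nf-schema plugin and not a real nf-core schema: the field names (`sample`, `fastq_1`, `fastq_2`), the patterns, and the helper functions are all hypothetical, chosen only to show the idea of checking every sample sheet row against a declared schema before a run starts.

```python
import re

# Hypothetical schema for one sample sheet row. The field names and patterns
# are illustrative assumptions, not taken from any real nf-core pipeline.
SAMPLE_SCHEMA = {
    "required": ["sample", "fastq_1"],
    "properties": {
        "sample": {"type": str, "pattern": r"\S+"},
        "fastq_1": {"type": str, "pattern": r"\S+\.f(ast)?q\.gz"},
        "fastq_2": {"type": str, "pattern": r"\S+\.f(ast)?q\.gz"},
    },
}


def validate_row(row, schema=SAMPLE_SCHEMA):
    """Return a list of human-readable errors for one sample sheet row."""
    errors = []
    # Every required field must be present.
    for name in schema["required"]:
        if name not in row:
            errors.append(f"missing required field '{name}'")
    # Every provided field must be known, correctly typed, and well formed.
    for name, value in row.items():
        rule = schema["properties"].get(name)
        if rule is None:
            errors.append(f"unknown field '{name}'")
        elif not isinstance(value, rule["type"]):
            errors.append(f"field '{name}' has the wrong type")
        elif not re.fullmatch(rule["pattern"], value):
            errors.append(f"field '{name}' has an invalid value: {value!r}")
    return errors


def validate_sheet(rows):
    """Map row index to its errors, keeping only rows that have problems."""
    report = {}
    for index, row in enumerate(rows):
        errors = validate_row(row)
        if errors:
            report[index] = errors
    return report
```

Calling `validate_sheet` on a list of row dictionaries returns an empty dictionary when every row conforms, so a run could be stopped early whenever the report is non-empty.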
So to go over a couple of our pipelines, the most important ones: 06:23.080 --> 06:28.280 we have rnaseq, which is the basic expression analysis or RNA-seq pipeline. 06:29.240 --> 06:33.960 atacseq, which is the ATAC-seq analysis, speaks for itself, I think. 06:34.520 --> 06:39.560 ampliseq is amplicon sequencing analysis. I don't know what it is, but 06:40.200 --> 06:46.040 there you go. Then sarek, which is closer to my heart, does the variant analysis of short-read 06:46.040 --> 06:53.320 DNA data using a series of different tools, etc. Then fetchngs is used to fetch your 06:53.320 --> 06:58.600 metadata or FASTQs, so you can create sample sheets to run other pipelines. So this is basically 06:58.600 --> 07:04.360 a starting point for a lot of people to start using nf-core pipelines. And then viralrecon is mainly 07:04.360 --> 07:12.120 used to analyze viral samples, and this was heavily used during the COVID crisis, so it's always 07:12.120 --> 07:17.240 a nice thing to have. And maybe a special mention for proteinfold, which will be presented in a 07:17.240 --> 07:24.440 couple of talks from now. So pay attention. Then, nf-core socials: we have our Slack, 07:24.440 --> 07:31.240 GitHub, Mastodon, Bluesky, LinkedIn, and YouTube, where you can follow us, join us, ask us anything 07:31.240 --> 07:35.400 you want, preferably Nextflow- and nf-core-related, but we answer other questions too. 07:37.560 --> 07:43.480 So to recap a little bit: nf-core is a community of people that work together on 07:43.480 --> 07:49.480 Nextflow pipelines. To do this, we need a set of standards and best practices to actually apply 07:49.480 --> 07:55.800 to these pipelines, so everyone knows how to write a pipeline for an array of different subjects. 07:57.800 --> 08:03.880 Then, of course, we need the modules and subworkflows to create our pipelines, and some specific 08:03.880 --> 08:11.000 ways to actually run our pipelines. 
And to do that, we have the nf-core tools, which Julia is going 08:11.320 --> 08:15.320 to present now. So, a big round of applause for Julia, please. 08:25.960 --> 08:36.120 Okay, hello. Okay, thank you. So, I will focus on nf-core tools, and for this, I 08:36.120 --> 08:43.000 wanted to invite you to come along on a small text adventure with me. I promise that this 08:43.000 --> 08:48.200 looked like a nicely formatted terminal, but this is what we have. So, please try not to be too 08:48.200 --> 08:56.280 distracted by this. Okay, so, in nf-core, we offer tools to help both users and developers. 08:56.280 --> 09:00.360 So, let's start this adventure, imagining that you have a bunch of data that you want to 09:00.440 --> 09:05.000 analyze, for example, if you work in a core facility. In this case, you are a user. 09:06.200 --> 09:12.680 And we offer several commands to help you. So, let's imagine that you need to run your pipeline 09:12.680 --> 09:18.440 on the HPC of your institution, which doesn't have an internet connection. This is not a problem, 09:18.440 --> 09:24.840 because you can use the download command, which will download the pipeline and all the containers 09:24.920 --> 09:33.160 required to run this pipeline offline. Talking about the HPC in your institution, we also have, 09:33.160 --> 09:38.200 and here I introduce this concept of special repositories, as I call them for the sake of this 09:38.200 --> 09:43.240 presentation, which are repositories that collect resources that can be used in any 09:43.240 --> 09:51.000 Nextflow pipeline. So, the first one: nf-core configs. Here we store config files, where you can 09:51.960 --> 09:57.720 create a config file for your institution, and then everyone in the institution can use it 09:57.720 --> 10:03.000 for any Nextflow pipeline. 
And this will typically have things like, for example, the 10:03.000 --> 10:09.480 maximum CPUs that you can use, maximum memory, or other parameters that you need to specify. 10:11.880 --> 10:17.880 And once you have your pipeline downloaded and you have your institutional config, what you want to 10:17.880 --> 10:23.640 do is tweak the run a little bit for your specific needs. This means setting the input parameters 10:23.640 --> 10:28.760 of the pipeline. For this, we also help you with the create-params-file command, 10:29.640 --> 10:36.040 which is very useful to store these input parameters, so you can use them later on, and it's 10:36.040 --> 10:43.480 nice for reproducibility. Okay, so then on the other hand, imagine that you have already analyzed 10:43.560 --> 10:49.080 your data, and maybe you found a small bug. Then you become a developer, and you are following 10:49.080 --> 10:58.360 the developer path. For example, maybe you found a bug, or maybe you want to use 10:58.360 --> 11:04.120 a different tool for this analysis. In this case, what you will have to do is create a new 11:04.120 --> 11:09.640 module. And here comes another of these repositories, which is the modules repo, and this one 11:10.520 --> 11:16.120 is where we store the collection of modules and subworkflows that you can use in any pipeline. 11:16.680 --> 11:21.720 Basically, a module is just a wrapper around a tool, so you have your software, 11:21.720 --> 11:30.680 you wrap it in the Nextflow language in an nf-core module, and then these modules 11:30.680 --> 11:36.200 you connect to create subworkflows, and the subworkflows you connect to create pipelines. 
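The layering just described (tools wrapped as modules, modules connected into subworkflows, subworkflows connected into pipelines) can be sketched as plain function composition. This is only an analogy in Python, not Nextflow code: the tool names are hypothetical, each "module" merely records its name instead of running software, and the sketch ignores the parallel execution that Nextflow provides.

```python
from functools import reduce


def module(tool_name):
    """Wrap a pretend tool as a step that appends its name to the run history."""
    def run(history):
        return history + [tool_name]
    return run


def chain(*steps):
    """Connect steps so that the output of one becomes the input of the next."""
    def run(history):
        return reduce(lambda current, step: step(current), steps, history)
    return run


# Hypothetical modules: each wraps exactly one tool.
trim = module("trim_reads")
align = module("align_reads")
sort_alignments = module("sort_alignments")
call_variants = module("call_variants")

# Subworkflows are chains of modules.
preprocess = chain(trim)
align_and_sort = chain(align, sort_alignments)

# A pipeline is a chain of subworkflows and modules.
pipeline = chain(preprocess, align_and_sort, call_variants)

print(pipeline([]))
# prints ['trim_reads', 'align_reads', 'sort_alignments', 'call_variants']
```

Because a subworkflow has the same shape as a module (one input, one output), it can be reused in another pipeline without knowing which modules it contains, which is the reuse property the talk emphasizes.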
11:37.080 --> 11:43.560 So, you are going to create your new module and contribute to the modules repo, and you will 11:43.560 --> 11:50.120 start with the create command, which will provide you a template, which is basically a set of files 11:50.120 --> 11:55.400 that give you a skeleton to help you build this module, and you will find different 11:55.400 --> 12:00.920 TODO comments that will guide you through the process. Once everything is ready, you will have 12:00.920 --> 12:05.800 to lint your code to make sure that you are following the standards, and then it's easier for 12:05.800 --> 12:10.840 everyone to review your PRs, and maybe also contribute back. And finally, of course, you will 12:10.840 --> 12:15.960 have to add tests to your module to make sure that everything works as expected. 12:18.280 --> 12:22.760 So, once you have your module created, or maybe the tool that you wanted to use already 12:22.760 --> 12:27.240 had a module in the repo, you want to use this module in your pipeline, so now you go and 12:27.240 --> 12:32.840 contribute to the pipeline's repo. And here you also have a bunch of commands to help you. For 12:33.800 --> 12:39.400 example, you can install the module, you can remove the module, or if you were already using 12:39.400 --> 12:44.200 the module but there are some changes upstream, you can update it to get these changes. 12:48.120 --> 12:54.440 Okay, then in another hypothetical situation, let's imagine that you tried to find a pipeline that 12:54.440 --> 12:59.720 was suitable for you, but you couldn't find any. So, then you want to create a new pipeline. 12:59.800 --> 13:04.680 The first thing that you will have to do is come and talk to us, and for this we use this repository 13:04.680 --> 13:11.720 called proposals. 
You will open an issue there, and that is where we discuss, and we check that 13:11.720 --> 13:17.640 there's not a similar pipeline in nf-core; in such a case we will enforce collaboration. 13:17.640 --> 13:25.080 But if everything goes well and it's approved, then we will create this repository for your pipeline. 13:25.960 --> 13:31.480 And very similar to the process that we were using for modules, you will start creating 13:32.280 --> 13:37.000 the pipeline from a template that will give you, again, this skeleton for the pipeline, 13:37.000 --> 13:43.560 and has some of the features that we use in nf-core, for example, the CI testing and some 13:43.560 --> 13:49.960 examples. If you are interested in knowing more about this template, we have this recorded 13:49.960 --> 13:57.320 bytesize talk, where we go through all the files, explaining a little bit more. Then you will also have 13:57.320 --> 14:03.240 to lint your code to make sure it follows the standards, and there are a couple of additional 14:03.240 --> 14:09.160 things in the pipelines. The first of all is this RO-Crate file that will be created automatically. 14:09.720 --> 14:15.320 RO-Crate is an open-source standard that is used basically to pack the pipeline with its 14:15.320 --> 14:21.400 metadata and its structure, and this is used for provenance tracking. Then you will also have 14:21.400 --> 14:27.640 to update the Nextflow schema. This is a schema that follows the JSON Schema format, and this is 14:27.640 --> 14:35.400 what we use to define the pipeline parameters, and then we use it to validate the input parameters 14:35.400 --> 14:40.600 on each run of the pipeline with the nf-schema plugin, and also use it for other things, 14:40.600 --> 14:49.000 for example, to generate nice documentation on our website. And finally, when you have your 14:50.040 --> 14:56.440 repository in nf-core, you will regularly receive template updates. 
This means that an automatic 14:56.440 --> 15:02.280 PR is going to be opened in your repo, giving you some updates that we use to deliver 15:02.280 --> 15:07.800 maybe some bug fixes, or most importantly to keep up with the latest Nextflow developments 15:07.800 --> 15:14.920 and Nextflow features. And here I wanted to do a quick sneak peek of how this works, in case 15:14.920 --> 15:19.400 you are interested. So, basically in the pipeline, we have the main branch, where we do the 15:19.400 --> 15:24.520 stable releases of the pipeline, then we have the development branch, and all the pipelines have 15:24.520 --> 15:31.080 this special branch called TEMPLATE. Whenever there is a release of nf-core tools, we push the changes 15:31.080 --> 15:36.680 to the TEMPLATE branch of all the pipelines, and this is what we then use to deliver the updates. 15:38.120 --> 15:43.960 It is also important to mention that all of this that I have been talking about is not only 15:43.960 --> 15:49.560 for nf-core pipelines, but for any Nextflow pipeline. So, if you want to have your 15:49.560 --> 15:54.520 pipeline in a private repo, you can still use the template, you can still install nf-core modules, 15:54.520 --> 15:58.680 or, for example, receive updates, in this case manually with the sync command. 15:59.080 --> 16:07.080 And last but not least, we have this small side quest, which is that, of course, you will want 16:07.080 --> 16:15.800 to add tests to your pipeline, using the nf-test tool. And here comes the last one of these 16:15.800 --> 16:22.120 special repositories, which is the test-datasets repository, and here we store some data files 16:22.120 --> 16:32.040 that you can use for testing. And we also provide some commands to help you look into this 16:32.040 --> 16:40.680 repository and see if you can find a file that you can use. 
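The template update mechanism described earlier in this segment (a tools release pushes changes to the template branch of every pipeline, and an automatic PR then delivers them) can be modelled with a toy simulation. This is a sketch under stated assumptions, not how nf-core tools is implemented: the `PipelineRepo` class, the version strings, and the choice of the development branch as the PR target are hypothetical and only illustrate the flow.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineRepo:
    """Toy model of a pipeline repository (hypothetical, for illustration)."""
    name: str
    # Branch name -> template version that the branch currently contains.
    branches: dict = field(
        default_factory=lambda: {"main": "1.0", "dev": "1.0", "TEMPLATE": "1.0"}
    )
    open_prs: list = field(default_factory=list)


def sync_template(repo, new_version):
    """Push a new template version to TEMPLATE and open a PR if dev lags behind."""
    repo.branches["TEMPLATE"] = new_version
    if repo.branches["dev"] != new_version:
        title = f"{repo.name}: update template to {new_version}"
        if title not in repo.open_prs:  # do not open the same PR twice
            repo.open_prs.append(title)
    return repo.open_prs


def merge_template_pr(repo):
    """Merging the PR brings the development branch up to the template version."""
    repo.branches["dev"] = repo.branches["TEMPLATE"]
    repo.open_prs.clear()


# On each (pretend) tools release, every pipeline repository is synced.
repos = [PipelineRepo("pipeline_a"), PipelineRepo("pipeline_b")]
for repo in repos:
    sync_template(repo, "2.0")
```

In this model the main branch is never touched by the sync; it only changes when maintainers make a stable release, which matches the branch roles described in the talk.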
So, now you've completed the nf-core 16:40.680 --> 16:47.320 adventure, congratulations, and the only thing left for you is to join the community. I just 16:47.320 --> 16:53.960 wanted to make a last summary, a take-home message of all that we provide, which is a pipeline template, 16:53.960 --> 17:01.080 validation of input parameters and files, CI testing, linting, configuration, and finally documentation. 17:01.080 --> 17:05.480 But, most importantly, a group of people ready to help and collaborate with you. 17:06.680 --> 17:11.320 And thank you very much. If you want to catch up with us later, we are going to be having 17:11.400 --> 17:13.320 beer, so just find us. 17:22.280 --> 17:27.960 So, that was perfectly on time. We have a little bit of time for questions, okay, if you'd like. 17:27.960 --> 17:33.320 So, maybe one or two questions, anybody? Yep, please go ahead. 17:33.320 --> 17:39.960 Maybe before the questions: we are hosting a hackathon on March 11th to 13th. It's online. 17:40.920 --> 17:45.320 Everyone can join. Feel free to join if you want to contribute to nf-core or anything 17:45.320 --> 17:46.920 Nextflow-related. Go ahead. 17:56.520 --> 18:02.520 Yeah, so the question was (now it works way better) if we have a process to add 18:02.520 --> 18:09.320 test data to the test-datasets repo. Basically, we have a branch for modules, 18:09.320 --> 18:16.280 and a different branch for every pipeline. So, you will open a PR there and try to be organized 18:16.280 --> 18:21.320 and update the README with all the required metadata and information. And then someone from 18:21.320 --> 18:22.920 the community reviews your PR. 18:39.320 --> 18:42.520 I'm a biostatistician, [unclear] your community, but are 18:42.520 --> 18:48.440 you open to biostatistics, for example? Yes, definitely. Do you want to answer? 
18:48.440 --> 18:57.480 Okay, yes, so basically nf-core started in the genomics field, so we have a bunch of genomics 18:57.480 --> 19:04.600 pipelines, but now we are expanding and we have one about economics, we have earth science people, 19:04.680 --> 19:15.160 we have probably some examples that I forgot. So, definitely, yes, whatever it is, it can be added. 19:15.160 --> 19:45.000 So, the question was that modules are only used for tools, and how can we actually connect with 19:45.160 --> 19:50.440 reference databases, for example. That was the question, right? Yeah, so Nextflow actually 19:51.160 --> 19:56.360 has some built-in functions to access data from reference databases, for example. There's also the 19:56.360 --> 20:01.800 possibility to write your own plugins for Nextflow to make it possible to access this data and 20:01.800 --> 20:07.160 pull it into the tools you actually want to run on it. There aren't any modules to access reference 20:07.160 --> 20:13.160 databases, even though it does happen sometimes. It's a bit of a gray area, but we prefer it if you 20:13.160 --> 20:20.440 can just make Nextflow actually pull the data in for you. I'm going to add something here. So 20:21.720 --> 20:27.880 there are some themes you will spot through the program today, and one of them is 20:27.880 --> 20:36.520 RO-Crate, Nextflow, nf-core: these are becoming standard, not only for this part of the world, 20:36.520 --> 20:42.840 but I think for probably at least half of the world's bioinformatics pipelines and other science pipelines. 20:43.160 --> 20:49.080 So this is only going to get more standardized and more effective, and all of the reference databases 20:49.080 --> 20:54.520 are actually looking at interoperation with the processing so that we can do full-provenance, 20:54.520 --> 21:00.600 reproducible, and reconstructable systems. So yeah, this is just the beginning. 21:01.480 --> 21:02.920 Thank you. 
Thank you. 21:04.600 --> 21:07.640 So, and I think, okay, this is going to be very quick. 21:07.720 --> 21:12.440 [unclear] the pipeline, who owns it, 21:12.440 --> 21:18.600 [unclear]. So the question was, who owns the actual pipelines that you publish 21:18.600 --> 21:25.160 to nf-core, and the short answer is everyone. So it's a community building the actual pipelines, 21:25.160 --> 21:30.760 so any pipeline in the nf-core community is actually owned by the whole community. Of course, 21:30.760 --> 21:35.400 there are always people who are considered to be the lead developers of the pipelines, who have a little 21:35.480 --> 21:40.840 bit more to say about what happens to it, but the main consensus is that everyone owns the pipeline. 21:43.880 --> 21:47.400 So we're going to disconnect your laptop and do the great transfer. 21:47.400 --> 22:07.400 Okay, so I should have a third hand, should I not? So whilst this is going on,