WEBVTT

00:00.000 --> 00:14.080
All right, can everyone hear me well? Perfect. So today we're going to talk about the

00:14.080 --> 00:18.560
nf-core community. It's basically a community about how to make your Nextflow workflows run

00:18.560 --> 00:24.440
correctly, reproducibly, and easily, and hopefully make it easy to write your Nextflow workflows.

00:25.400 --> 00:30.040
But first of all, what is Nextflow? Who of you knows what Nextflow is? Can you please

00:30.040 --> 00:35.720
raise your hands? All right, perfect. For those of you that don't know Nextflow, Nextflow is

00:35.720 --> 00:40.280
basically a workflow manager. So a language, a runtime, and a community, like it says here,

00:40.280 --> 00:45.560
but mainly a workflow language manager. You can use this to chain all your bioinformatics

00:45.560 --> 00:50.360
tools right after each other in a parallel way, make your pipeline reproducible, and be able

00:50.360 --> 00:56.680
to run it anywhere. So reproducible is the first part I talk about here. This is the most important

00:56.680 --> 01:02.280
part. If you write a Nextflow pipeline, you want it to be able to run everywhere and have

01:02.280 --> 01:08.280
it create the same output. It also needs to be portable. So Nextflow pipelines are heavily

01:08.280 --> 01:14.920
based on container environments and Conda. So you can run them almost anywhere without any

01:14.920 --> 01:20.840
actual changes to the software environments you're working in. Let me just do a technical adjustment

01:20.840 --> 01:32.440
here. It's a little bit higher. Okay, let's try that. Is this better? Is it better?

01:32.440 --> 01:37.720
Okay, and then the last part: scalable. Of course, you want to start writing your Nextflow pipelines

01:37.720 --> 01:42.120
on your local laptop, with maybe five samples, for example, some very small things, and then you

01:42.120 --> 01:46.200
will want to immediately scale this to like five million samples in the cloud, for

01:46.200 --> 01:53.640
example, and this is all possible using Nextflow. But of course, such a big and scary thing

01:53.640 --> 02:00.040
is very versatile, and this highlights the need for standardization. Standardization is introduced

02:00.040 --> 02:06.440
by the nf-core community. So nf-core is basically the community of like-minded people

02:06.520 --> 02:12.360
who write Nextflow pipelines and want to make pipelines usable for everyone. So this starts, of course,

02:12.360 --> 02:18.200
with a set of standards. We have a whole list of guidelines. I won't go through all of them right

02:18.200 --> 02:23.800
now, or I will need like two or three more talks about this in the future. But the main parts

02:23.800 --> 02:30.280
are the documentation, the template (we use a common template for all pipelines), and the use of public

02:30.280 --> 02:35.480
containers to make sure that everyone can actually access all the environments during the

02:35.560 --> 02:42.120
pipeline. As the best practice, this is probably the most important one, and then of course,

02:42.120 --> 02:48.040
there are a lot of different best practices. Of course, we only allow one pipeline per

02:48.040 --> 02:53.320
data type and/or usage. So this means that there isn't any overlap between several pipelines.

02:53.320 --> 02:57.720
There's only one pipeline that can do your actual analysis for you, so you don't have to go looking

02:57.720 --> 03:04.040
at three or four different pipelines to see which one does your analysis best. We also use stable

03:04.040 --> 03:10.920
tags to release our pipelines and run CI/CD to actually test our pipelines. And of course,

03:10.920 --> 03:15.160
they are cloud-ready. This is pretty native in Nextflow right now, but nf-core

03:15.160 --> 03:22.760
really pushes it through. Cooperation is maybe even more important. We all work together on pipelines,

03:22.760 --> 03:27.160
and there's no single pipeline that is owned by one person. All pipelines are owned by the community.

03:27.480 --> 03:35.400
It's of course open source. Otherwise, we wouldn't be here, I think. And then to make the work

03:35.400 --> 03:39.400
easier, we have a lot of components, which are parts of your Nextflow pipeline that you can actually

03:39.400 --> 03:45.240
use and reuse in other pipelines without having to write these components yourself. This really

03:45.240 --> 03:51.800
speeds up the process of writing Nextflow pipelines reproducibly. So to talk a bit more about

03:51.880 --> 03:58.040
these components, we have about 141 pipelines right now. I think 60 to 80 of them are released.

03:58.040 --> 04:03.560
I forgot the number, but something around that. Others are still in development, but all pipelines

04:03.560 --> 04:10.280
are really being worked on. Then modules: we have 1,700 modules. Modules are basically

04:10.280 --> 04:15.080
wrappers around bioinformatics tools. So for example, samtools view, you have a module for that,

04:15.080 --> 04:19.640
and you can just pull it into your pipeline and start to use it. And around 100 subworkflows,

04:19.640 --> 04:23.880
which is basically a chain of modules that you can start using, which does a specific analysis.

04:25.240 --> 04:31.240
We test our pipelines using minimal test data in CI, and also some full tests for some pipelines.

04:32.520 --> 04:37.480
We use the nf-test framework for this. I won't go deeper into that, but you can still come

04:37.480 --> 04:42.520
ask me at the end to get some more information about this. We lint all our pipelines to check

04:42.520 --> 04:46.360
for consistency between all the different nf-core pipelines to make sure that if you read one

04:46.360 --> 04:53.800
pipeline, you can understand all the other pipelines. We use a schema to validate all inputs

04:53.800 --> 04:58.680
of the pipeline. So the parameters and, for example, your sample sheet, where you define each

04:58.680 --> 05:06.040
specific option for each sample. We use the nf-schema plugin to do this. And then tooling,

05:06.040 --> 05:11.560
which Julia will talk about more at the end of this talk, can be used to tie this all together

05:11.560 --> 05:17.160
and to make it even easier for you to write these pipelines. And to use them, of course.

05:17.160 --> 05:21.880
So a little bit of statistics to show you a bit of the scale of the community.

05:21.880 --> 05:26.920
We have around 12,000 Slack users right now, as you can see in the graph here,

05:27.960 --> 05:35.240
it's still going up pretty well. And a lot of active users also, so you can say 12,000 Slack

05:35.240 --> 05:40.280
users, but we also have a lot of active users of those. We have more than 500 contributors,

05:40.280 --> 05:45.080
where we count a contributor as someone who made at least one commit to any of the

05:45.080 --> 05:51.560
nf-core repositories. There are 175-plus repositories; 141 of them are the pipelines,

05:51.560 --> 05:57.240
and all the others are more infrastructure-related. Then in total, over all the repositories, we have

05:57.240 --> 06:04.200
around 100,000 GitHub commits, which shows you the progress we made since the inception of

06:04.200 --> 06:11.320
nf-core in 2019. Over all the repositories combined again, we have almost 8,000 GitHub stars

06:11.880 --> 06:21.720
and 790,000 GitHub views. So to go over a couple of our pipelines, the most important ones,

06:23.080 --> 06:28.280
we have rnaseq, which is the basic expression analysis RNA pipeline.

06:29.240 --> 06:33.960
atacseq, which is the ATAC-seq analysis, speaks for itself, I think.

06:34.520 --> 06:39.560
ampliseq is the amplicon sequencing analysis. I don't know what it is, but

06:40.200 --> 06:46.040
there you go. Then sarek, which is closer to my heart, does the variant analysis of short

06:46.040 --> 06:53.320
read DNA data using a series of different tools, etc. Then fetchngs is used to fetch your

06:53.320 --> 06:58.600
metadata or FASTQs, so you can create sample sheets to run other pipelines. So this is basically

06:58.600 --> 07:04.360
a starting point for a lot of people to start using nf-core pipelines. And then viralrecon is mainly

07:04.360 --> 07:12.120
used to analyze viral samples, and this was heavily used during the COVID crisis, so it's always

07:12.120 --> 07:17.240
a nice thing to have. And maybe a special mention for proteinfold, which was already present in a

07:17.240 --> 07:24.440
couple of talks here. So, oh, pay attention. Then, the nf-core socials: we have our Slack,

07:24.440 --> 07:31.240
GitHub, Mastodon, Bluesky, LinkedIn, and YouTube, where you can follow us, join us, ask us anything

07:31.240 --> 07:35.400
you want, preferably Nextflow- and nf-core-related, but we answer all other questions too.

07:37.560 --> 07:43.480
So to recap a little bit, nf-core is a community of people that work together on

07:43.480 --> 07:49.480
Nextflow pipelines. To do this, we need a set of standards and best practices to actually apply

07:49.480 --> 07:55.800
to these pipelines, so everyone knows how to write one pipeline for an array of different subjects.

07:57.800 --> 08:03.880
Then, of course, we need the modules and subworkflows to create our pipelines, and some specific

08:03.880 --> 08:11.000
ways to actually run our pipelines. And to do that, we have the nf-core tools, which Julia is going

08:11.320 --> 08:15.320
to present now. So, a big round of applause for Julia, please.

08:25.960 --> 08:36.120
Okay, hello. Okay, thank you. So, I will focus on nf-core tools, and for this, I

08:36.120 --> 08:43.000
prepared, I wanted to invite you to come along in a small text adventure with me. I promise that this

08:43.000 --> 08:48.200
looked like a nicely formatted terminal, but this is what we have. So, please try not to be too

08:48.200 --> 08:56.280
distracted by this. Okay, so, in nf-core, we offer tools to help both users and developers.

08:56.280 --> 09:00.360
So, let's start this adventure, imagining that you have a bunch of data that you want to

09:00.440 --> 09:05.000
analyze. For example, if you work in a core facility, in this case, you are a user.

09:06.200 --> 09:12.680
And we offer several commands to help you. So, let's imagine that you need to run your pipeline

09:12.680 --> 09:18.440
on the HPC of your institution, which doesn't have an internet connection. This is not a problem,

09:18.440 --> 09:24.840
because you can use the download command that will download the pipeline and all the containers

09:24.920 --> 09:33.160
required to run this pipeline offline. Talking about the HPC in your institution, we also have,

09:33.160 --> 09:38.200
and here I introduce this concept of special repositories, as I call them for the sake of this

09:38.200 --> 09:43.240
presentation, which are repositories that collect resources that can be used in any of the

09:43.240 --> 09:51.000
Nextflow pipelines. So, the first one, nf-core/configs. Here we store config files, where you can

09:51.960 --> 09:57.720
create a config file for your institution, and then everyone in the institution can use this

09:57.720 --> 10:03.000
for any of the Nextflow pipelines. And this will typically have things like, for example, the

10:03.000 --> 10:09.480
maximum CPUs that you can use, maximum memory, or other parameters that you need to specify.

10:11.880 --> 10:17.880
And once you have your pipeline downloaded, you have your institutional config, what you want to

10:17.880 --> 10:23.640
do is tweak the run a little bit for your specific needs. This means setting the input parameters

10:23.640 --> 10:28.760
of the pipeline. For this, we also help you with the create-params-file command,

10:29.640 --> 10:36.040
which is very useful to store these input parameters, and then you can use it later on, and it's

10:36.040 --> 10:43.480
nice for reproducibility. Okay, so then on the other hand, imagine that you have already analyzed

10:43.560 --> 10:49.080
your data, and maybe you found a small bug. Then you become a developer, and you are following

10:49.080 --> 10:58.360
the developer path. For example, maybe you found a bug, or maybe you want to use

10:58.360 --> 11:04.120
a different tool for this analysis. In this case, what you will have to do is to create a new

11:04.120 --> 11:09.640
module. And here comes another of these repositories, which is the modules repo, and this one

11:10.520 --> 11:16.120
is where we store the collection of modules and subworkflows that you can use in any pipeline.

11:16.680 --> 11:21.720
And basically a module is just a wrapper around a tool, so you have your software,

11:21.720 --> 11:30.680
you will wrap it in the Nextflow language in an nf-core module, and then these modules,

11:30.680 --> 11:36.200
you connect to create subworkflows, and the subworkflows you connect to create pipelines.

11:37.080 --> 11:43.560
So, you are going to create your new module, contribute to the modules repo, and you will

11:43.560 --> 11:50.120
start with the create command, which will provide you a template, which is basically a set of files

11:50.120 --> 11:55.400
that give you like a skeleton to help you build this module, and you will find different

11:55.400 --> 12:00.920
TODO comments that will guide you through the process. Once everything is ready, you will have

12:00.920 --> 12:05.800
to lint your code to make sure that you are following the standards, and then it's easier for

12:05.800 --> 12:10.840
everyone to review your PRs, and maybe also contribute back. And finally, of course, you will

12:10.840 --> 12:15.960
have to add tests to your module to make sure that everything works as expected.

12:18.280 --> 12:22.760
So, once you have your module created, or maybe the tool that you wanted to use already

12:22.760 --> 12:27.240
had a module in the repo, you want to use this module in your pipeline, so now you go and

12:27.240 --> 12:32.840
contribute to the pipeline's repo. And here you also have a bunch of commands to help you, for

12:33.800 --> 12:39.400
example, you can install the module, you can remove the module, or if you were already using

12:39.400 --> 12:44.200
the module, but you made some changes upstream, you can update it to get these changes.

12:48.120 --> 12:54.440
Okay, then in another hypothetical situation, let's imagine that you tried to find a pipeline that

12:54.440 --> 12:59.720
was suitable for you, but you couldn't find any. So, then you want to create a new pipeline.

12:59.800 --> 13:04.680
The first thing that you will have to do is come and talk to us, and for this we use this repository

13:04.680 --> 13:11.720
called Proposals. You will open an issue there, and there is where we discuss, and we check that

13:11.720 --> 13:17.640
there's not a similar pipeline in nf-core. In such a case we will enforce collaboration,

13:17.640 --> 13:25.080
but if everything goes well and it's approved, then we will create this repository for your pipeline.

13:25.960 --> 13:31.480
And very similar to the process that we were using for modules, you will start creating

13:32.280 --> 13:37.000
the pipeline from a template that will give you, again, like this skeleton for the pipeline,

13:37.000 --> 13:43.560
and has some of the features that we use in nf-core, for example, the CI testing and some

13:43.560 --> 13:49.960
examples. If you are interested in knowing more about this template, we have this recorded

13:49.960 --> 13:57.320
bytesize talk, where we go through all the files, explaining a little bit more. Then you will also have

13:57.320 --> 14:03.240
to lint your code to make sure it follows the standards, and there are a couple of additional

14:03.240 --> 14:09.160
things in the pipelines. The first one is this RO-Crate file that will be created automatically.

14:09.720 --> 14:15.320
RO-Crate is an open-source standard that is used basically to package the pipeline with its

14:15.320 --> 14:21.400
metadata and its structure, and this is used for provenance tracking. Then you will also have

14:21.400 --> 14:27.640
to update the Nextflow schema. This is a schema that follows the JSON Schema format, and this is

14:27.640 --> 14:35.400
what we use to define the pipeline parameters, and then we use it to validate the input parameters

14:35.400 --> 14:40.600
on each run of the pipeline with the nf-schema plugin, and we also use it for other things,

14:40.600 --> 14:49.000
for example, to generate nice documentation on our website. And finally, when you have your

14:50.040 --> 14:56.440
repository in nf-core, you will regularly receive template updates. This means that an automatic

14:56.440 --> 15:02.280
PR is going to be opened in your repo, giving you some updates that we use to deliver,

15:02.280 --> 15:07.800
maybe some bug fixes, or most importantly to keep up with the latest Nextflow developments

15:07.800 --> 15:14.920
and Nextflow features. And here I wanted to do a quick sneak peek of how this works in case

15:14.920 --> 15:19.400
you are interested. So, basically in the pipeline, we have the main branch where we do the

15:19.400 --> 15:24.520
stable releases of the pipeline, then we have the development branch, and all the pipelines have

15:24.520 --> 15:31.080
this special branch called template. Whenever there is a release of nf-core tools, we push the changes

15:31.080 --> 15:36.680
to the template branch of all the pipelines, and this is what we use then to deliver the updates.

15:38.120 --> 15:43.960
It is also important to mention that all of this that I have been talking about, it's not only

15:43.960 --> 15:49.560
for nf-core pipelines, but for any of the Nextflow pipelines. So, if you want to have your

15:49.560 --> 15:54.520
pipeline in a private repo, you can still use the template, you can still install nf-core modules,

15:54.520 --> 15:58.680
or for example, receive updates in this case manually with a sync command.

15:59.080 --> 16:07.080
And last but not least, we have this small side quest, which is that of course, you will want

16:07.080 --> 16:15.800
to add tests to your pipeline, using the nf-test tool. And here comes the last one of these

16:15.800 --> 16:22.120
special repositories, which is the test-datasets repository, and here we store some data files

16:22.120 --> 16:32.040
that you can use for testing. And we also provide some commands to help you look into this

16:32.040 --> 16:40.680
repository and see if you can find a file that you can use. So, now you've completed the nf-core

16:40.680 --> 16:47.320
adventure, congratulations, and the only thing left for you is to join the community, and I just

16:47.320 --> 16:53.960
wanted to give a last summary, a take-home message of all that we provide, which is a pipeline template,

16:53.960 --> 17:01.080
validation of input parameters and files, CI testing, linting, configuration, and finally documentation.

17:01.080 --> 17:05.480
But, most importantly, a group of people ready to help and collaborate with you.

17:06.680 --> 17:11.320
And thank you very much, if you want to catch up with us later, we are going to be having

17:11.400 --> 17:13.320
beer, so just find us.

17:22.280 --> 17:27.960
So, that was perfectly on time. We have a little bit of time for questions, okay, if you'd like.

17:27.960 --> 17:33.320
So, maybe one or two questions anybody, or just post, yep, please go ahead.

17:33.320 --> 17:39.960
Maybe before the questions, we are hosting a hackathon from March 11th to 13th. It's online.

17:40.920 --> 17:45.320
Everyone can join, feel free to join if you want to contribute to nf-core or anything

17:45.320 --> 17:46.920
Nextflow-related. Go ahead.

17:56.520 --> 18:02.520
Yeah, so the question was, if we have a, now it works way better. If we have a process to add

18:02.520 --> 18:09.320
test data to the test-datasets repo. Basically you will open, so we have a branch for modules,

18:09.320 --> 18:16.280
and a different branch for every pipeline. So, you will open a PR there and try to be organized

18:16.280 --> 18:21.320
and update the README with all the required metadata and information. And then someone from

18:21.320 --> 18:22.920
the community reviews your PR.

18:39.320 --> 18:42.520
I'm a biostatistician, we're sort of free to join your community, but are

18:42.520 --> 18:48.440
you open to biostatistics, by the way, for example? Yes, definitely. Do you want to answer?

18:48.440 --> 18:57.480
Okay, yes, so basically nf-core started in the genomics field, so we have a bunch of genomics

18:57.480 --> 19:04.600
pipelines, but now we are expanding and we have one about economics, we have earth science people,

19:04.680 --> 19:15.160
we have probably some examples that I forgot. So, definitely, yes, whatever it can be added.

19:15.160 --> 19:45.000
So, the question was that modules are only used for tools, and how can we actually connect with

19:45.160 --> 19:50.440
reference databases, for example. That was the question, right? Yeah, so Nextflow actually

19:51.160 --> 19:56.360
has some built-in functions to access data from reference databases, for example. There's also a

19:56.360 --> 20:01.800
possibility to write your own plugins for Nextflow to make it possible to access this data and

20:01.800 --> 20:07.160
pull it into the tools you actually want to run it on. There aren't really any modules to access reference

20:07.160 --> 20:13.160
databases, even though it does happen sometimes. It's a bit of a gray area, but we prefer it if you

20:13.160 --> 20:20.440
can just make Nextflow actually pull the data in for you. I'm going to add something here, so

20:21.720 --> 20:27.880
there are some themes you will spot throughout the program today, and one of them is

20:27.880 --> 20:36.520
RO-Crate, Nextflow, nf-core: these are becoming standard, not only for this part of the world,

20:36.520 --> 20:42.840
and I think for probably at least half of the world's bioinformatics pipelines and other science pipelines.

20:43.160 --> 20:49.080
So this is only going to get more standardized and more effective, and all of the reference databases

20:49.080 --> 20:54.520
are actually looking at interoperation with the processing so that we can do full provenance,

20:54.520 --> 21:00.600
reproducible, and reconstructable systems. So yeah, this is just the beginning.

21:01.480 --> 21:02.920
Thank you. Thank you.

21:04.600 --> 21:07.640
So, and I think, okay, this is going to be very quick.

21:07.720 --> 21:12.440
Just going to point one, it turns out that our ship likes to meet the pipeline, who owns it,

21:12.440 --> 21:18.600
that make this a lot of dirt. So the question was, who owns the actual pipelines that you publish

21:18.600 --> 21:25.160
to nf-core, and the short answer is everyone. So it's a community building the actual pipelines,

21:25.160 --> 21:30.760
so any pipeline in the nf-core community is actually owned by the whole community. Of course,

21:30.760 --> 21:35.400
there are people who are considered to be the lead developers of the pipelines, who have a little

21:35.480 --> 21:40.840
bit more to say about what happens to it, but the main consensus is that everyone owns the pipeline.

21:43.880 --> 21:47.400
So we're going to disconnect your laptop and do the great transfer.

21:47.400 --> 22:07.400
Okay, so I should have a third hand should not. So whilst this is going on,