WEBVTT 00:00.000 --> 00:07.000 Thank you. 00:07.000 --> 00:10.000 Yes, so my name is Felix. 00:10.000 --> 00:12.000 I'm going to talk about Datavzrd, 00:12.000 --> 00:15.000 which is a project of our lab from the last, 00:15.000 --> 00:16.000 I don't know, four, five years. 00:16.000 --> 00:17.000 I would say it started. 00:17.000 --> 00:20.000 And it's about reporting your tabular data. 00:20.000 --> 00:24.000 And yes, all of you are probably involved in bioinformatics 00:24.000 --> 00:26.000 and run some sort of analysis. 00:26.000 --> 00:29.000 You have some output of them. 00:29.000 --> 00:33.000 And these are more likely to be tables or are likely to be tables. 00:33.000 --> 00:34.000 Yeah. 00:34.000 --> 00:37.000 And people tend to use Excel for that, 00:37.000 --> 00:39.000 although it comes with some caveats, 00:39.000 --> 00:41.000 especially non-computational people use it, 00:41.000 --> 00:44.000 and yeah, it's not very reproducible. 00:44.000 --> 00:46.000 It also comes with some other, 00:46.000 --> 00:48.000 like, not so nice features. 00:48.000 --> 00:50.000 If you have, like, a column with genes in it, 00:50.000 --> 00:53.000 it tends to convert them into dates and this sort of stuff. 00:53.000 --> 00:55.000 And yeah. 00:55.000 --> 00:59.000 Also, you can use stuff like pandas or, more lately, Polars 00:59.000 --> 01:01.000 or, within R, the tidyverse. 01:01.000 --> 01:03.000 And yeah, this is great. 01:03.000 --> 01:04.000 We use this ourselves, 01:04.000 --> 01:06.000 but more within, like, the analysis. 01:06.000 --> 01:09.000 And not for, like, visualizing the final results 01:09.000 --> 01:12.000 and communicating the tables. 01:12.000 --> 01:15.000 To, for example, like, doctors in a molecular tumor 01:15.000 --> 01:16.000 board scenario. 01:16.000 --> 01:19.000 And additionally, also a single table 01:19.000 --> 01:21.000 might not even be enough for most of the use cases 01:21.000 --> 01:23.000 we have in bioinformatics. 
01:23.000 --> 01:25.000 For example, you could have, like, an oncoprint 01:25.000 --> 01:28.000 and single variant calls or differentially expressed genes 01:28.000 --> 01:29.000 and an expression matrix. 01:29.000 --> 01:32.000 And this means this is, like, hierarchically, 01:32.000 --> 01:35.000 like, structured data. 01:35.000 --> 01:37.000 So you have, like, one big overview table 01:37.000 --> 01:39.000 and then you have, like, multiple other tables 01:39.000 --> 01:41.000 where each row of the overview table 01:41.000 --> 01:44.000 basically corresponds to, like, one table 01:44.000 --> 01:45.000 with more details in them. 01:45.000 --> 01:47.000 Or you could have something, like, a join 01:47.000 --> 01:50.000 in a database where, like, one row corresponds to another row. 01:51.000 --> 01:54.000 And additionally, one might also want to include, 01:54.000 --> 01:55.000 like, plots in there. 01:55.000 --> 01:58.000 And this could be, like, a singular big plot 01:58.000 --> 02:01.000 or, like, a plot within each cell, yeah, 02:01.000 --> 02:03.000 of a single table. 02:03.000 --> 02:06.000 And to quickly recap the state of the art, 02:06.000 --> 02:08.000 we can have, like, individual TSV files, 02:08.000 --> 02:10.000 Excel files, and individual plots 02:10.000 --> 02:15.000 in, like, I don't know, SVG or PDF or PNG format. 02:15.000 --> 02:18.000 Yeah, these are very easy to publish, just single files. 02:18.000 --> 02:19.000 You can open them. 02:19.000 --> 02:21.000 But they come with limited interactivity, 02:21.000 --> 02:24.000 and also, for the TSV and Excel scenario, 02:24.000 --> 02:27.000 with, like, very limited visualization. 02:27.000 --> 02:28.000 Yeah. 
02:28.000 --> 02:30.000 And, as I mentioned with the oncoprint, for example, 02:30.000 --> 02:32.000 the connections in between, like, different items 02:32.000 --> 02:34.000 in between different tables really get lost 02:34.000 --> 02:36.000 in, like, plain CSV, or, I don't know, 02:36.000 --> 02:38.000 Excel files, yeah. 02:38.000 --> 02:42.000 And, on the other hand, one could do, like, a custom solution, 02:42.000 --> 02:45.000 using, like, stuff like Shiny or Lumen, 02:45.000 --> 02:49.000 like, these frameworks for, yeah, running, like, a web application, 02:49.000 --> 02:53.000 or you could even go ahead and, like, implement your own thing. 02:53.000 --> 02:56.000 And this comes with a big implementation overhead. 02:56.000 --> 02:58.000 You have to sit down, write some code, 02:58.000 --> 03:00.000 and, yeah, it just takes time. 03:00.000 --> 03:01.000 I don't have to tell you. 03:01.000 --> 03:04.000 And, yeah, you need to maintain that server. 03:04.000 --> 03:06.000 You need to make sure it keeps running over time. 03:06.000 --> 03:07.000 Yeah. 03:07.000 --> 03:11.000 And, therefore, the long-term maintenance for that is challenging. 03:11.000 --> 03:13.000 Out of this, we can formulate a problem 03:13.000 --> 03:16.000 where the input is basically a set of tables, 03:16.000 --> 03:18.000 and the relations between these tables, 03:18.000 --> 03:20.000 and a set of rendering definitions. 03:20.000 --> 03:23.000 And the output shall be a portable 03:23.000 --> 03:27.000 and also interactive and visual representation of our data set. 03:27.000 --> 03:30.000 And, yeah, as I already mentioned in the beginning, 03:30.000 --> 03:32.000 for that, we developed Datavzrd. 03:32.000 --> 03:35.000 On the left, you see a configuration file. 03:35.000 --> 03:38.000 Datavzrd is invoked from the command line 03:38.000 --> 03:40.000 with that configuration file. 
03:41.000 --> 03:43.000 And, in that configuration file, 03:43.000 --> 03:45.000 you define your data sets. 03:45.000 --> 03:48.000 Yeah, the configuration file is written in YAML format, 03:48.000 --> 03:51.000 so very easily readable and editable for humans. 03:51.000 --> 03:55.000 And you define these, like, linkages in between different data sets, 03:55.000 --> 04:00.000 and also, per column, define what sort of visualization you have. 04:00.000 --> 04:03.000 And then, when you call Datavzrd from the command line, 04:03.000 --> 04:05.000 it generates an HTML report, 04:05.000 --> 04:08.000 so it's a bit similar to something like MultiQC, 04:09.000 --> 04:12.000 that is self-contained, and is then openable 04:12.000 --> 04:14.000 in the browser on any system. 04:14.000 --> 04:16.000 So, it's very portable. 04:16.000 --> 04:20.000 To show you a bit further how you configure this stuff, 04:20.000 --> 04:23.000 there's a few examples I want to show you here. 04:23.000 --> 04:26.000 You simply, yeah, give your data set, 04:26.000 --> 04:28.000 for example here, there's some movie stuff. 04:28.000 --> 04:30.000 You give it an arbitrary name, like oscars, 04:30.000 --> 04:34.000 and then give the file path where the CSV is located. 04:34.000 --> 04:37.000 We also support, like, JSON or Parquet, 04:37.000 --> 04:39.000 this sort of stuff. 04:39.000 --> 04:44.000 Yeah, and then you basically say what column is linked to another column, 04:44.000 --> 04:47.000 if you want to jump around in your data set. 04:47.000 --> 04:49.000 For example, Datavzrd for that will define, 04:49.000 --> 04:51.000 like, automatic link-outs. 04:51.000 --> 04:53.000 Yeah, so you click on one row, 04:53.000 --> 04:55.000 and then jump to your other data set immediately 04:55.000 --> 04:58.000 with the row that corresponds highlighted. 04:58.000 --> 05:01.000 Yeah, this is about the data sets, 05:01.000 --> 05:05.000 and the same goes for basically visualizing any column. 
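The linking idea described above (a named data set with a file path, plus a column that points at a column in another data set) can be sketched roughly as follows. This is a minimal illustration in Python, not Datavzrd's implementation: the key names in `config`, the `follow_link` helper, and the tiny in-memory tables are all hypothetical, and the actual YAML schema may differ.

```python
import csv
import io

# Hypothetical stand-in for the configuration described in the talk.
# Key names are illustrative assumptions, NOT Datavzrd's actual schema.
config = {
    "datasets": {
        "oscars": {
            "path": "oscars.csv",
            # "column" in this data set points at "dataset/column" elsewhere.
            "links": {"movie details": {"column": "movie", "target": "movies/title"}},
        },
        "movies": {"path": "movies.csv", "links": {}},
    }
}

# Made-up in-memory tables so the sketch runs without files on disk.
TABLES = {
    "oscars": "name,movie\nAlice,Film A\nBob,Film B\n",
    "movies": "title,year\nFilm A,1999\nFilm B,2004\n",
}


def load(name):
    """Parse one of the placeholder tables into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(TABLES[name])))


def follow_link(dataset, row_index, link_name):
    """Return (target data set, index of the corresponding row), or None."""
    link = config["datasets"][dataset]["links"][link_name]
    value = load(dataset)[row_index][link["column"]]
    target_dataset, target_column = link["target"].split("/")
    for i, row in enumerate(load(target_dataset)):
        if row[target_column] == value:
            return target_dataset, i
    return None


# Clicking the second oscars row would jump to the matching movies row.
print(follow_link("oscars", 1, "movie details"))
```

In a generated report the lookup would happen in the browser; the sketch only shows the relation that a link definition encodes.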
05:05.000 --> 05:08.000 Yeah, for example, you type in the name of the column, 05:08.000 --> 05:10.000 then you say, oh, I want a plot. 05:10.000 --> 05:11.000 It shall be a tick plot. 05:11.000 --> 05:14.000 And what you get is what you see on the right here. 05:14.000 --> 05:16.000 And yeah, there are, like, 05:16.000 --> 05:19.000 a lot of pre-made options for this plot. 05:19.000 --> 05:23.000 For example, also bar plots, or, as I showed you, a heat map. 05:23.000 --> 05:26.000 We have a link-out, for example, 05:26.000 --> 05:28.000 if you want to link out to stuff, like, 05:28.000 --> 05:30.000 ClinVar based on the value that's in the cell. 05:30.000 --> 05:32.000 Datavzrd will render these links. 05:32.000 --> 05:35.000 And then what I also want to highlight is that you can do, 05:35.000 --> 05:37.000 like, custom plots using the Vega-Lite library, 05:37.000 --> 05:39.000 simply pass the plot specs, 05:39.000 --> 05:42.000 and then it allows you to do, like, 05:42.000 --> 05:44.000 any plot in a single cell. 05:44.000 --> 05:46.000 And also, like, a full plot view of this, 05:46.000 --> 05:47.000 what we also can do. 05:47.000 --> 05:48.000 And if that's not enough, 05:48.000 --> 05:52.000 we support, like, passing any arbitrary, like, 05:52.000 --> 05:55.000 JavaScript function that manipulates the content of a cell. 05:55.000 --> 05:58.000 So this means you can really do anything with it. 05:59.000 --> 06:00.000 Yeah. 06:00.000 --> 06:04.000 And, once again, coming to the portability of Datavzrd, 06:04.000 --> 06:07.000 so as I mentioned, you invoke it from the CLI, 06:07.000 --> 06:10.000 and then it outputs a single directory. 06:10.000 --> 06:13.000 And this means you can open it on any machine. 06:13.000 --> 06:15.000 No servers needed. 06:15.000 --> 06:18.000 And, yeah, this means also it's very easily, 06:18.000 --> 06:19.000 easily shareable. 
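For the custom plots mentioned above, a Vega-Lite plot spec is a plain JSON document. The sketch below builds a minimal bar-chart spec in Python and serializes it. The `bar_spec` helper, the field names, and the sample values are made up for illustration; how such a spec is embedded in the Datavzrd configuration is not shown here and should be checked against the tool's documentation.

```python
import json


def bar_spec(values, x_field, y_field):
    """Build a minimal Vega-Lite v5 bar-chart spec from inline records."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": values},  # inline data; a real report would supply its own
        "mark": "bar",
        "encoding": {
            "x": {"field": x_field, "type": "nominal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


# Placeholder records, not taken from the talk.
spec = bar_spec(
    [{"genre": "drama", "count": 3}, {"genre": "comedy", "count": 1}],
    "genre",
    "count",
)
print(json.dumps(spec, indent=2))
```

Because the spec is only data, it can be stored next to the table configuration and versioned with the workflow.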
06:19.000 --> 06:22.000 Yeah, you can just zip up the whole directory, 06:22.000 --> 06:24.000 send it around to a doctor, 06:24.000 --> 06:29.000 or, like, I don't know, upload it to GitHub Pages, 06:29.000 --> 06:33.000 which actually means that you also outsource the server stuff, 06:33.000 --> 06:36.000 basically, as it's only static HTML. 06:36.000 --> 06:38.000 And what's also very nice, you can attach it, 06:38.000 --> 06:40.000 like, to a publication. 06:40.000 --> 06:43.000 This means the reviewers, or, like, 06:43.000 --> 06:45.000 any other people who read your manuscript, 06:45.000 --> 06:47.000 can basically explore the data set 06:47.000 --> 06:49.000 the same way you can do. 06:49.000 --> 06:53.000 And this also brings me to the end of my presentation. 06:53.000 --> 06:56.000 I want to thank my PI, Johannes Köster, 06:56.000 --> 06:57.000 and my whole group. 06:57.000 --> 07:00.000 If you want to dig further into Datavzrd and use it, 07:00.000 --> 07:04.000 we have it on Conda and all other sort of stuff. 07:04.000 --> 07:07.000 And also, we have a publication that we did last year 07:07.000 --> 07:09.000 that also explains a lot more. 07:09.000 --> 07:10.000 Thank you. 07:10.000 --> 07:11.000 Thank you. 07:12.000 --> 07:17.000 You've got time for maybe one question. 07:17.000 --> 07:19.000 Yes, for two minutes. 07:19.000 --> 07:20.000 Two minutes. 07:20.000 --> 07:21.000 Yeah, go ahead. 07:21.000 --> 07:22.000 I'd like to go ahead. 07:22.000 --> 07:30.000 I'd like to go ahead. 07:30.000 --> 07:35.000 So the question was, whether you can, like, 07:35.000 --> 07:37.000 apply data transformation 07:38.000 --> 07:41.000 within that YAML config, so yes and no. 07:41.000 --> 07:44.000 So you can do some manipulation, 07:44.000 --> 07:48.000 or basically add columns based on the values of other columns. 
07:48.000 --> 07:51.000 This is possible, but if you, if you want to do, 07:51.000 --> 07:54.000 like, stuff before, then I would advise on 07:54.000 --> 07:57.000 putting this, like, before into the workflow, 07:57.000 --> 08:00.000 this way it also stays a bit more readable to the user 08:00.000 --> 08:02.000 and stays more reproducible. 08:02.000 --> 08:04.000 Another question. 08:04.000 --> 08:06.000 I can talk presenters. 08:06.000 --> 08:08.000 We are starting to assemble the S, 08:08.000 --> 08:09.000 which we're presenting. 08:09.000 --> 08:11.000 Please come up and try to hear. 08:11.000 --> 08:13.000 Is there another question? 08:13.000 --> 08:15.000 I can ask one. 08:15.000 --> 08:16.000 Yes. 08:16.000 --> 08:18.000 Do you have a schema for your YAML? 08:18.000 --> 08:21.000 Actually, yes, we can, we can derive that. 08:21.000 --> 08:23.000 Datavzrd is written in Rust. 08:23.000 --> 08:26.000 And we can simply derive that as we have this 08:26.000 --> 08:28.000 as structs in Rust. 08:28.000 --> 08:29.000 Yes. 08:29.000 --> 08:30.000 Yeah. 08:30.000 --> 08:31.000 Have you actually published it? 08:31.000 --> 08:33.000 Because that's useful rather than parsing the Rust.