WEBVTT 00:00.000 --> 00:11.840 So, now for something somewhat completely different, but two completely familiar things, put 00:11.840 --> 00:15.000 together in a new way, I think. 00:15.000 --> 00:21.320 Thank you very much. 00:21.320 --> 00:26.520 Just a show of hands, for my curiosity: who here has any experience or interest in 00:26.520 --> 00:31.520 cloning or genome editing, like making new constructs and organisms? 00:31.520 --> 00:40.960 Oh, the question was, who here has any experience with cloning plasmids or genome editing 00:40.960 --> 00:43.360 or engineering organisms? Hands? 00:43.360 --> 00:45.360 Okay, perfect. 00:45.360 --> 00:49.280 Then a lot of you are going to learn some new stuff as well, so that's great. 00:49.280 --> 00:55.660 What I'm here to present is Git for genomes. It's about a tool we made specifically for 00:55.660 --> 01:03.140 doing version control on sequences, DNA sequences and protein sequences, in an engineering context. 01:03.140 --> 01:08.140 And where this comes from, and actually the reason I got into synthetic biology, is this notion 01:08.140 --> 01:15.060 that DNA is like code for computers, meaning we can read it, and we can also create it and 01:15.060 --> 01:22.100 design new genetic scripts for cells to execute, to do what we want them to do. 01:22.100 --> 01:25.340 But the thing is, it does not feel at all like software engineering. 01:25.340 --> 01:30.820 So once you're at it for a while, you'll notice it's much harder and it's actually 01:30.820 --> 01:34.940 much more tricky than you would think coming into it. 
01:34.940 --> 01:40.940 And to illustrate it with an example, let's say we have a project where we are tasked with 01:40.940 --> 01:47.140 making beer yeast that produces the hop flavor compounds, so that we can just make beer 01:47.140 --> 01:52.020 with yeast alone and not have to worry about hops, like if you're on Mars or something, 01:52.100 --> 01:55.940 or a space station, where you don't have access to hops. 01:55.940 --> 01:59.900 We are going to explain that in the context of the design-build-test cycle that synthetic 01:59.900 --> 02:06.420 biology people like to use. Starting with the design part, here I would start with 02:06.420 --> 02:10.660 sourcing genetic parts to express these genes, so parts that you put in front of your 02:10.660 --> 02:14.340 genes, or the genes themselves, to help the cell express them. 02:14.340 --> 02:18.980 iGEM is an amazing resource; it's an open-source collection of genetic parts. 02:18.980 --> 02:24.220 They do this massive competition every year, for high schoolers to undergrads to 02:24.220 --> 02:28.220 overgrads, where people come together to make genetically engineered machines. 02:28.220 --> 02:32.300 It's a lot of fun, I recommend checking them out. 02:32.300 --> 02:37.060 Then we would create combinations of all these variable parts, because, I think, we're not 02:37.060 --> 02:38.060 that good at it yet. 02:38.060 --> 02:43.020 So we always test a bunch in parallel, because it's a slow test process, so we might 02:43.020 --> 02:45.780 as well test quite a few at once. 02:45.780 --> 02:49.660 Then the second point is the build, so this is where we take our digital 02:49.660 --> 02:50.660 sequences. 02:50.660 --> 02:54.420 We're going to make them real, so that means ordering synthetic sequences, putting 02:54.420 --> 02:58.300 them together, putting them into your host, and creating these organisms that have been 02:58.300 --> 03:03.900 engineered to execute these genetic programs. 
03:03.900 --> 03:10.980 That part tends to be a little bit expensive, so oftentimes, especially in academia, you try 03:10.980 --> 03:14.300 to build a strategy so that you can reuse as many parts as possible. 03:14.300 --> 03:18.900 So it's like you're using little Lego bricks, and you're trying to use your synthetic 03:18.900 --> 03:23.140 DNA multiple times, so you get some more mileage out of that. 03:23.140 --> 03:25.180 That's where our first issue comes in. 03:25.180 --> 03:31.580 If you're trying to do this design-build-test on a large scale, then when you want 03:31.580 --> 03:36.020 to build an efficient build strategy, you kind of need to know what went into the 03:36.020 --> 03:37.020 design. 03:37.020 --> 03:38.020 You need some context. 03:38.020 --> 03:41.980 You can't just give these build people, or a synthesis company, a list of sequences and 03:41.980 --> 03:45.660 then go "have at it"; that's going to cost you a lot of money. 03:45.660 --> 03:50.460 So there's this loss of context you sometimes see in industry between the design and 03:50.460 --> 03:53.900 the build stage. 03:53.900 --> 03:58.140 Then the third stage is where we would read back our genome, especially 03:58.140 --> 04:04.540 [unclear] or if you're doing an insertion into a genome. The 04:04.540 --> 04:08.420 genome editing process is not perfect, so oftentimes it doesn't work. 04:08.420 --> 04:12.860 So you end up with these assays where you say, our genes got into our bug, or our genes 04:12.860 --> 04:18.460 did not get into our bug, but more often than not, you end up in a third category where 04:18.460 --> 04:22.660 your genes got into your genome, but there are some mutations, 04:22.660 --> 04:26.180 either in the stuff you put in there, or somewhere else completely. 
04:26.180 --> 04:32.860 And that's a point where you really end up in a lot of trouble, in the sense that we don't 04:32.860 --> 04:35.180 yet know well how to deal with that. 04:35.180 --> 04:39.940 So a lot of our software is based on "these genes got into the host" or "they 04:39.940 --> 04:41.300 did not get into the host". 04:41.300 --> 04:45.820 And on the other side, the variant callers, they're great, they tell you the variants, 04:45.820 --> 04:47.740 but you lose the engineering context. 04:47.740 --> 04:51.740 So then you don't know exactly: what is that variant, is it important to me, and what 04:51.740 --> 04:54.900 should I expect? 04:54.900 --> 05:01.380 So that, in total, basically leads to this issue where closing this design-build-test 05:01.380 --> 05:11.220 loop, and then closing it fast, gets really tricky, and the whole 05:11.220 --> 05:13.860 engineering process becomes a lot harder than it should be. 05:13.860 --> 05:19.380 So synthetic biology hasn't really taken off that much yet, and that's in part 05:19.380 --> 05:24.380 because we have some gaps that we can take inspiration from software engineering to 05:24.380 --> 05:27.140 solve. 05:27.140 --> 05:29.580 So that's where our project comes in. 05:29.580 --> 05:35.420 David, my co-worker, is sitting right there, and he's another fellow contributor, 05:35.420 --> 05:44.580 and we built a client called Gen that is heavily inspired by Git, and it is 05:44.580 --> 05:47.700 designed specifically for sequences. 
05:47.700 --> 05:52.900 It's a Rust crate, we have a command-line interface, we have Python bindings, we have a nice 05:52.900 --> 06:01.460 tool as well, and what it is, is a way to organize your sequences in repositories 06:01.460 --> 06:06.340 that are SQLite databases, where we track every change you make, here in the design stage 06:06.340 --> 06:12.740 or in the observation stage, when you sequence your samples. We track those as operations, 06:12.740 --> 06:18.740 basically commits, and then we supply the familiar Git commands to initialize your repository, 06:18.980 --> 06:23.460 synchronize with your remote repository, make branches, basically the workflow that software 06:23.460 --> 06:30.500 engineers are used to. But on top of that, we add a whole bunch of sequence-specific commands 06:30.500 --> 06:32.220 and operations. 06:32.220 --> 06:38.660 So when you look back at our design-build-test cycle, for example, here are some of the functions 06:38.660 --> 06:45.180 I would use to reflect what I did in the lab and what I saw in my data, 06:45.180 --> 06:49.860 and load them into my repository, so I'm building the story of what happened to 06:49.860 --> 06:56.580 this strain, what went into the design, why I did it, and so end up with a more 06:56.580 --> 07:04.300 reviewable PR for your collaborators, for government agencies, for anyone really. 07:04.300 --> 07:10.180 So here, for instance, when we're planning this cloning strategy, we can chunk up 07:10.180 --> 07:17.900 the sequences, we can stitch them back together, and keep track of all that. 07:17.900 --> 07:23.620 Yeah, and then here are some other ones as well, where we have ways of interacting with 07:23.620 --> 07:31.180 these variant call files and loading them up in a way that we can view and study like that. 
07:31.180 --> 07:35.780 Now why don't we just use Git? Why do we need this whole new system? 07:35.780 --> 07:40.100 If we are just working with sequences, why can't we use our existing tools? 07:40.100 --> 07:45.220 And the reason for that is that coordinate frames on genomes are really, really fragile. 07:45.220 --> 07:53.780 So what you end up with is I can be talking about a genome, and I say, like, base 1,255, 07:53.780 --> 07:56.740 and you can think we're talking about the same thing, but if you have a different reference 07:56.740 --> 07:59.220 in mind, then we're screwed. 07:59.220 --> 08:05.660 Now the references also have an issue: there's no one here who is, like, the reference 08:05.660 --> 08:06.660 human, right? 08:06.660 --> 08:10.580 But we're still working with these reference genomes, and part of that is because that 08:10.580 --> 08:18.140 was just the best we had at the time, but it has also slowed down the engineering side, 08:18.140 --> 08:23.860 because we also get these issues of intended and observed variants, 08:23.860 --> 08:28.700 where you think that you're working with a certain sequence, your reference 08:28.700 --> 08:33.340 sequence, and then in the lab you see your real data, which tells you there are actually 08:33.340 --> 08:37.380 a thousand differences compared to your reference. How do you find the relevant ones 08:37.380 --> 08:42.780 from that, and how do you find your own engineering context there? 08:42.780 --> 08:44.780 So how do you solve that? 08:44.780 --> 08:48.500 We solved it a little earlier, so actually I'm going to ask again: how many people here 08:48.580 --> 08:53.580 are familiar with pangenomes and graph genome representations? 
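To make the fragile-coordinates point concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the sequences, the insertion, and the helper name `lift_over` are invented for this example and are not part of Gen. It shows that one upstream insertion is enough for "base 9" to mean different things against two references.

```python
# Toy illustration of coordinate fragility; sequences and names are made up.
reference = "ATGGCCATTGTAATGGGCCGC"

# Hypothetical engineered strain: three bases inserted at position 3.
insert_at = 3
inserted = "TTT"
engineered = reference[:insert_at] + inserted + reference[insert_at:]


def lift_over(pos: int, insert_at: int, insert_len: int) -> int:
    """Translate a 0-based reference coordinate into the engineered strain."""
    return pos + insert_len if pos >= insert_at else pos


pos = 9  # a coordinate two people agreed on, against the reference
print(reference[pos])    # the base one person means
print(engineered[pos])   # the base the other person looks at
print(engineered[lift_over(pos, insert_at, len(inserted))])  # after translating
```

Every coordinate downstream of the edit shifts, so any annotation or variant call expressed as a bare number has to be translated for each new edit, which is the bookkeeping a graph representation avoids.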
08:53.580 --> 09:01.460 Okay, okay, so as a very brief introduction to that, it's a way of representing genome 09:01.460 --> 09:09.220 sequences, or DNA sequences, in a nonlinear fashion. So what you do is, every linear sequence 09:09.220 --> 09:15.540 is a walk in a graph from a start to an end, and when you have a variant, you add 09:15.620 --> 09:20.260 an edge, so it's edges and nodes, and a node carries subsequences. 09:20.260 --> 09:25.820 So in this case, what I'm showing here is, when you have an AT that mutated to a TG, instead 09:25.820 --> 09:31.500 of following your initial reference, you hop onto another node and then hop back, 09:31.500 --> 09:40.500 and that allows you to very efficiently store a ton of variants in one object. Let's see, 09:40.740 --> 09:53.900 and then what's also interesting there, and where I spend a lot of effort, on the engineering 09:53.900 --> 09:59.460 side more than the pangenome side, is that these points where you have forks can 09:59.460 --> 10:03.780 actually represent many things that are very useful for us. 10:03.780 --> 10:09.700 On the one hand, they can represent mixed populations and polyploidy, as in, these are real 10:09.700 --> 10:14.020 things that happen that you don't see in any of your sequences. On the other hand, they can 10:14.020 --> 10:18.660 give you historical variants: in our software, every variant it sees, it adds. We use this 10:18.660 --> 10:24.260 additive data model so that we keep a changelog like that. And then you can also use that to 10:24.260 --> 10:29.460 represent all of the variants you're screening. 
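The AT-to-TG example above can be sketched as a tiny sequence graph. This is a minimal illustration in plain Python, assuming made-up flanking sequences and node names; it is not Gen's data model or API. Nodes carry subsequences, edges say which node may follow which, and every linear sequence is a walk from the start node to the end node.

```python
# Minimal sequence graph: nodes carry subsequences, edges connect them.
# Node names and flanking sequences are invented for illustration.
NODES = {"start": "GGC", "ref": "AT", "alt": "TG", "end": "CCA"}
EDGES = {
    ("start", "ref"), ("ref", "end"),   # the original reference path
    ("start", "alt"), ("alt", "end"),   # the variant: hop off, then hop back
}


def spell(walk):
    """Concatenate node sequences along a walk, checking every step is an edge."""
    for a, b in zip(walk, walk[1:]):
        if (a, b) not in EDGES:
            raise ValueError(f"no edge from {a} to {b}")
    return "".join(NODES[name] for name in walk)


def all_walks(source, sink):
    """Enumerate every walk from source to sink (the graph is acyclic)."""
    if source == sink:
        return [[sink]]
    walks = []
    for a, b in sorted(EDGES):
        if a == source:
            walks.extend([source] + rest for rest in all_walks(b, sink))
    return walks


for walk in all_walks("start", "end"):
    print(walk, spell(walk))
```

Both the reference and the variant share the `start` and `end` nodes, so the flanking sequence is stored once no matter how many variants fork off at that position.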
So let's say I'm testing 10 sequences; that 10:29.460 --> 10:35.860 can be 10 legs of that fork, and then we can work with this entire screening 10:35.940 --> 10:40.500 library at once. That becomes really interesting once you start combining many, many 10:40.500 --> 10:44.980 parts, because that quickly balloons. Oftentimes we find people restricting their experiments 10:44.980 --> 10:52.100 based on what they think they can cover, but this allows them to go much, much higher 10:52.100 --> 10:56.660 without having these huge data sets. And I like to call them Schrödinger's molecules, because these 10:56.660 --> 11:00.580 are molecules that can be anything until you sequence, and then the superposition 11:00.580 --> 11:05.620 collapses. It becomes this useful metaphor for working with sequences where you think 11:05.620 --> 11:12.500 you know what it is, but you don't know yet. So what we're offering here is not just 11:12.500 --> 11:17.620 a version control client, but also tools for working with these graphs, to make it easier to 11:17.620 --> 11:23.060 visualize and conceptualize and do some cutting and pasting, hoping to get more people 11:23.060 --> 11:30.020 into the graph genome space, especially on the engineering side. We have also done a web interface at 11:30.100 --> 11:36.180 general bio, which is also being worked on, so that we have as many ways as possible to get people 11:36.180 --> 11:41.460 on this, and to get people familiar with this and working with this. And the whole idea is that 11:41.460 --> 11:47.140 we want to promote collaborative engineering. So we support common bioinformatics file formats, 11:47.140 --> 11:51.780 and if you want to get your format on here, let us know, and we'll work on that. 
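A small sketch of the screening-library idea, in plain Python with invented part sequences (none of this is Gen's API): the library is stored as a few alternative parts per slot, the number of possible designs is the product of the slot sizes, and "collapsing" a candidate means finding which path through the slots spells the sequence you actually observed.

```python
from math import prod

# Hypothetical combinatorial library: each slot lists interchangeable parts.
SLOTS = [
    ["AAA", "CCC"],        # e.g. two promoter variants
    ["GG", "TT", "GA"],    # e.g. three linker variants
    ["C", "T"],            # e.g. two terminator variants
]


def stored_parts(slots):
    """How many part sequences the graph has to store."""
    return sum(len(parts) for parts in slots)


def possible_designs(slots):
    """How many full-length designs those parts can spell."""
    return prod(len(parts) for parts in slots)


def identify(slots, observed):
    """Return the part index chosen in each slot, or None if nothing matches."""
    def walk(slot, offset):
        if slot == len(slots):
            return [] if offset == len(observed) else None
        for index, part in enumerate(slots[slot]):
            if observed.startswith(part, offset):
                rest = walk(slot + 1, offset + len(part))
                if rest is not None:
                    return [index] + rest
        return None
    return walk(0, 0)


print(stored_parts(SLOTS), possible_designs(SLOTS))
print(identify(SLOTS, "CCCGAT"))
```

With four slots of ten parts each, the same layout would store 40 parts while covering 10,000 designs, which is why the graph form stays small as the library grows.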
11:52.980 --> 11:59.060 We have ways of doing this decentralized distribution via patch files, so you can just email patches 11:59.140 --> 12:03.220 to each other and keep your repositories in sync. We have remote repositories, like you're 12:03.220 --> 12:08.980 used to with Git, that we host ourselves at general.io, or that you can host yourself, 12:08.980 --> 12:14.500 and our Git repo is there; we're Apache 2 licensed, and we're accepting PRs. 12:18.500 --> 12:24.420 So basically our entire goal is going from a world where we're sharing sequences in 12:24.420 --> 12:29.860 Word files, or horrible patents that are illegible, to a world where it 12:29.860 --> 12:35.060 looks more like a Git repo, and we can get as fast as software engineering with engineering biology. 12:35.700 --> 12:40.340 Thank you. 12:43.220 --> 12:49.220 So it's time for questions. Do we have any questions? Yes, at the front. How do you 12:49.220 --> 12:59.540 store your data? So the question was, how do you store your data? It's SQLite. Per repository, 12:59.540 --> 13:06.260 we can make one or more SQLite databases as files. So it's a graph model on SQLite. 13:10.900 --> 13:15.940 So we import those files. One of the things that we did, that wasn't there in the 13:15.940 --> 13:23.300 pangenomics space, is this: when you have your GFA file from that space, and you add more variants with 13:23.300 --> 13:31.540 your changes, everything changes. It's a very fragile model, in the sense that if you break up a 13:31.540 --> 13:36.900 node, then all the edges connected to that node need to be changed. So the way we do it is we 13:38.020 --> 13:44.260 work with a slightly different model of graphs, one that's additive. So we set it up in a way that is 13:44.260 --> 13:49.220 completely additive, and then we put it in SQLite, so it scales well. But we import and 13:49.220 --> 13:54.740 export those files. 
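One way an additive graph model on SQLite can work is sketched below. This is an assumption-laden illustration, not Gen's actual schema: the table layout, column names, and helper functions are all hypothetical. The idea shown is that edges point at a coordinate inside a node, so recording a variant only inserts rows (one node, two edges) and never splits or rewrites an existing node.

```python
import sqlite3


def open_repo():
    """Create an in-memory database with a hypothetical two-table graph schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, sequence TEXT NOT NULL);
        CREATE TABLE edges (
            source_node INTEGER, source_coord INTEGER,
            target_node INTEGER, target_coord INTEGER
        );
    """)
    return db


def add_substitution(db, ref_node, start, end, new_seq):
    """Record ref_node[start:end] -> new_seq by adding rows only."""
    new_node = db.execute(
        "INSERT INTO nodes (sequence) VALUES (?)", (new_seq,)
    ).lastrowid
    db.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", [
        (ref_node, start, new_node, 0),           # leave the reference at `start`
        (new_node, len(new_seq), ref_node, end),  # rejoin the reference at `end`
    ])
    return new_node


def spell_variant(db, ref_node, new_node):
    """Spell the path that detours through new_node and rejoins ref_node."""
    seqs = dict(db.execute("SELECT id, sequence FROM nodes"))
    (leave,) = db.execute(
        "SELECT source_coord FROM edges WHERE source_node = ? AND target_node = ?",
        (ref_node, new_node)).fetchone()
    (rejoin,) = db.execute(
        "SELECT target_coord FROM edges WHERE source_node = ? AND target_node = ?",
        (new_node, ref_node)).fetchone()
    return seqs[ref_node][:leave] + seqs[new_node] + seqs[ref_node][rejoin:]


db = open_repo()
ref = db.execute("INSERT INTO nodes (sequence) VALUES ('GGCATCCA')").lastrowid
alt = add_substitution(db, ref, 3, 5, "TG")   # AT at positions 3..4 becomes TG
print(spell_variant(db, ref, alt))
```

Because the reference row is untouched, earlier edges and annotations that point into it stay valid, and the inserted rows double as a change log.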
13:58.420 --> 14:04.900 Yeah, so could you use this for genome alignments? We don't include an aligner, but we do interface with 14:04.900 --> 14:11.220 those aligners. So you can have your aligner, Cactus is a common one, and then pipe it 14:11.300 --> 14:14.260 into Gen, and then have the results end up in your repo. 14:27.540 --> 14:34.100 Yeah, so the question was, you have these files where you have, 14:34.980 --> 14:43.780 instead of ACTG, you have ACTN, or K, or whatever, ambiguous bases. It's not in our main 14:44.580 --> 14:50.740 branch yet, but we have some scripts where we convert those on the fly into a graph. So you 14:50.740 --> 14:56.740 can take these barcodes, which can be millions of possible sequences, and you have like 20 nodes. 14:56.740 --> 15:00.740 And that's sort of what makes that exponential problem a lot easier. 15:04.180 --> 15:08.740 Yeah, one more, the final one. Do you also plan to have, like, the human genome in a repo 15:08.740 --> 15:14.500 where we can put all our stuff? Yeah, so one of our developers stress-tested everything with, like, 15:14.500 --> 15:21.540 100,000 human genomes at once, so it's built to handle that scale, and it does, and it does it well. 15:21.540 --> 15:27.060 Yeah. Okay, so everybody, please, thank you, Bob.
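The ambiguous-base answer can be illustrated with a short sketch. This is plain Python using the standard IUPAC nucleotide codes; the barcode is invented, and the one-node-per-base layering is an assumption for illustration rather than how Gen's scripts do the conversion. Each position becomes a small set of alternative nodes, so the graph grows with the sum of the alternatives while the number of sequences it spells grows with their product.

```python
from math import prod

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_layers(barcode):
    """One layer per position, each holding the bases allowed at that position."""
    return [IUPAC[base] for base in barcode.upper()]


def node_count(layers):
    """Nodes needed if every allowed base at a position is its own node."""
    return sum(len(layer) for layer in layers)


def sequence_count(layers):
    """Distinct sequences the layered graph can spell."""
    return prod(len(layer) for layer in layers)


barcode = "ACNNNNNNNNNNKT"   # hypothetical barcode with ten N positions
layers = to_layers(barcode)
print(node_count(layers), sequence_count(layers))
```

For this barcode the graph needs a few dozen nodes while covering about two million sequences, which is the exponential-to-linear saving described in the answer.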