WEBVTT
00:00.000 --> 00:10.000 Thank you very much.
00:10.000 --> 00:11.000 So yeah, let's start.
00:11.000 --> 00:18.000 I will talk about a bit of the mathematics, the fundamental mathematics, of erasure coding.
00:18.000 --> 00:20.000 First of all, about me.
00:20.000 --> 00:22.000 I'm not a software developer.
00:22.000 --> 00:24.000 I'm just a mathematics student,
00:24.000 --> 00:26.000 currently studying in Oslo.
00:26.000 --> 00:29.000 I'm a big fan of Austin, so that's why I'm here.
00:30.000 --> 00:33.000 Yeah, first: what is erasure coding?
00:33.000 --> 00:38.000 You've probably heard that erasure coding is a redundancy technique for storage systems,
00:38.000 --> 00:41.000 so essentially for data protection.
00:41.000 --> 00:52.000 And it enables storage efficiency in particular, by distributing data in a smart way over different nodes in a cluster.
00:52.000 --> 01:00.000 So what's essentially done first is that we take some data and split it into k chunks.
01:00.000 --> 01:05.000 And then they are, magically, encoded into n chunks.
01:05.000 --> 01:14.000 And by this, any k of the n encoded chunks are going to be enough to recover all the data, so up to n minus k chunks can be lost.
01:14.000 --> 01:21.000 Yeah, also what's quite important is that it considers transient and permanent failures.
01:21.000 --> 01:28.000 So it's not trying to repair errors on the disks, but just considers them as erasures.
01:28.000 --> 01:32.000 Yeah, it has less redundancy than replication.
01:32.000 --> 01:35.000 So that's the nice thing about erasure coding.
01:35.000 --> 01:39.000 And it's, for example, typically used in cloud storage.
01:39.000 --> 02:00.000 So yeah, first a bit about the history: the fundamental idea behind erasure coding was presented by Reed and Solomon in their paper from 1960, "Polynomial Codes over Certain Finite Fields", which is a very mathematical title.
02:00.000 --> 02:03.000 But it wasn't really applied at that time.
02:03.000 --> 02:09.000 So first there was RAID, the redundant array of independent (or inexpensive) disks.
02:09.000 --> 02:16.000 And erasure coding has now been commonly applied since the early 2000s.
02:16.000 --> 02:23.000 So now we start with the Reed-Solomon code, which is the fundamental basis of erasure coding.
02:23.000 --> 02:30.000 And we consider the simple string "helloworld" and divide it into five chunks,
02:30.000 --> 02:34.000 like this, so that they all have the same size.
02:34.000 --> 02:41.000 And then they are encoded, as I said before, and we get, for example, now eight chunks.
02:41.000 --> 02:51.000 And the important thing is that five of these eight encoded chunks will be enough to recover the whole string, so all of the data.
02:51.000 --> 02:56.000 But of course: what is this encoding, and how do we recover the data?
02:57.000 --> 03:04.000 For that, we now need polynomials, which you hopefully have seen at some point in school.
03:04.000 --> 03:11.000 And yeah, we call these numbers a_0 to a_{k-1} the coefficients.
03:11.000 --> 03:14.000 And a polynomial, for example, looks like this.
03:14.000 --> 03:23.000 So the one in the middle, for example, is the simple polynomial where a_1 is one and the others are all zero.
03:23.000 --> 03:30.000 And above there's one where a_1 is two and a_0 is minus one, and the others are zero.
03:30.000 --> 03:40.000 And the one underneath has a_2 equal to 0.5, and is then a quadratic polynomial.
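A minimal sketch of the chunking step described above, in Python. The example string "helloworld" and the big-endian byte encoding are assumptions for illustration; the talk only fixes the first chunk, "he", and its number 26,725, which this reproduces. These numbers become the polynomial coefficients in the next step.

```python
def chunk_to_int(chunk: bytes) -> int:
    """Interpret a chunk of bytes as one unsigned integer (big-endian)."""
    return int.from_bytes(chunk, "big")

data = b"helloworld"           # assumed example string, 10 letters
k = 5                          # number of data chunks
size = len(data) // k          # bytes per chunk (2 here)
chunks = [data[i * size:(i + 1) * size] for i in range(k)]
numbers = [chunk_to_int(c) for c in chunks]

print(chunks)    # [b'he', b'll', b'ow', b'or', b'ld']
print(numbers)   # [26725, 27756, 28535, 28530, 27748]
```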
03:40.000 --> 03:47.000 Now we need to use a fundamental theorem from mathematics about polynomials.
03:47.000 --> 03:52.000 And for that, we now try to draw polynomials through points.
03:53.000 --> 04:01.000 So first we look at this point A, which is simply the point (2, 2), with x and y coordinates two and two.
04:01.000 --> 04:13.000 And if we look at very simple polynomials like the ones above, we see: okay, the only one that goes directly through that point is the one that's just constantly two.
04:13.000 --> 04:22.000 But if we look at polynomials that are a bit more complex, we see that we have different options to draw them through this point.
04:22.000 --> 04:31.000 And if we add another point, we see: okay, there's again only one unique polynomial that goes directly through these two points, and it looks like this.
04:31.000 --> 04:42.000 And if we continue the whole thing and consider polynomials of a higher degree, so a bit more complex, we again get the same: if we add another point,
04:42.000 --> 04:50.000 there is only one unique polynomial of that form that goes through the three points.
04:50.000 --> 04:54.000 So this is essentially what we observed, summed up.
04:54.000 --> 04:57.000 And this gives us the theorem.
04:57.000 --> 05:01.000 And it looks maybe a bit overwhelming, but it's just what we observed.
05:01.000 --> 05:11.000 So given k coordinates, where the x coordinates are pairwise distinct, there exists a unique polynomial of degree at most k minus one that looks like this
05:12.000 --> 05:17.000 and that attains exactly these points on its graph.
05:17.000 --> 05:32.000 And equivalently, we could also say: if we consider distinct points x_0 through x_{k-1}, each polynomial of degree up to k minus one is uniquely determined by its values at these points.
05:33.000 --> 05:39.000 Yeah, how does that help us? And what does the polynomial really do?
05:39.000 --> 05:50.000 The central idea is now to construct the polynomial, or more specifically the coefficients of the polynomial, from the data that we split up.
05:50.000 --> 05:53.000 So we considered these five chunks.
05:53.000 --> 06:01.000 And now if we look at the first chunk, which is just the letters h and e, we have a decimal number for
06:01.000 --> 06:06.000 these two letters by just some simple encoding; which one doesn't really matter now.
06:06.000 --> 06:11.000 In the example it is 26,725.
06:11.000 --> 06:16.000 Then we take this m_0 to be the first coefficient in the polynomial.
06:16.000 --> 06:21.000 And the same is then done for all the other chunks as well.
06:21.000 --> 06:25.000 But then we have some powers of x that we multiply them with.
06:26.000 --> 06:31.000 So we get this polynomial, which looks very scary.
06:31.000 --> 06:40.000 And if we try to plot it, it also looks a bit horrible, because it attains really, really high values.
06:40.000 --> 06:44.000 But now, why did we need the mathematical result?
06:44.000 --> 06:54.000 Well, we saw from the mathematical result that we can uniquely determine this polynomial by considering five points in total.
06:55.000 --> 06:59.000 So here's the theorem again.
06:59.000 --> 07:07.000 And, for example, what is typically done is that we consider the x values zero up to k minus one.
07:07.000 --> 07:15.000 And these are then, in our example, exactly these points, which can be simply computed.
07:15.000 --> 07:18.000 And here they are plotted.
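A sketch of this encoding step, continuing the chunking example above (the coefficient values are the assumed "helloworld" numbers): the data polynomial is p(x) = m_0 + m_1 x + ... + m_4 x^4, and encoding is just evaluating it.

```python
def poly_eval(coeffs, x):
    """Evaluate p(x) = m0 + m1*x + ... + m4*x^4 via Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# Coefficients m_0 .. m_4 from the chunking sketch above.
coeffs = [26725, 27756, 28535, 28530, 27748]

# Evaluating at x = 0 .. k-1 gives exactly the k points from the slide.
points = [(x, poly_eval(coeffs, x)) for x in range(5)]
print(points[0])             # (0, 26725): p(0) is just the first chunk
print(poly_eval(coeffs, 4))  # 9523717, which already needs 24 bits
```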
07:18.000 --> 07:27.000 But that doesn't introduce any redundancy yet, because we know we need these five points to uniquely determine the polynomial.
07:27.000 --> 07:31.000 So we consider more points, for example eight points.
07:31.000 --> 07:35.000 And these would then be the coordinates.
07:35.000 --> 07:44.000 And then the idea is to just distribute these n coordinates that we now have onto our cluster.
07:44.000 --> 07:48.000 So they're all saved on different nodes.
07:48.000 --> 08:02.000 And from these nodes, we now have that if we consider any five points, so any five coordinates of the polynomial,
08:02.000 --> 08:06.000 we get a uniquely solvable system of linear equations.
08:06.000 --> 08:12.000 And that's why we can uniquely re-determine our polynomial, which was essentially the data.
08:12.000 --> 08:19.000 So, for example, say that nodes one, five, and seven fail.
08:19.000 --> 08:26.000 Then from node zero, we get that the polynomial value at zero should look like this.
08:26.000 --> 08:30.000 And from the y-value, you get such an equation from each node.
08:30.000 --> 08:35.000 And the same can be done for all the other nodes that are still left.
08:35.000 --> 08:40.000 And well, this is a system of linear equations in five variables.
08:40.000 --> 08:50.000 And we have five equations, so we can uniquely recover the coefficients, which are, again, simply the data we had.
08:50.000 --> 08:59.000 But now, of course, we saw that, for example, for p(4), we get very, very big values.
08:59.000 --> 09:07.000 And if we simply consider the letters: two letters make up 16 bits, but this number would, for example, take 24 bits.
09:07.000 --> 09:11.000 So we need more space, so it doesn't really make things better.
09:11.000 --> 09:16.000 And that's why the so-called Galois fields are used.
09:16.000 --> 09:20.000 So this is what is done there.
09:20.000 --> 09:25.000 And that's also what forms these actual Reed-Solomon codes.
09:25.000 --> 09:30.000 And what is done there is that we essentially don't look at all the numbers,
09:31.000 --> 09:40.000 but we restrict to using only the integers 0 up to, in our case for example, 2 to the power of 16 minus 1.
09:40.000 --> 09:47.000 And then not only are the coefficients in that range, but also the values of the function.
09:47.000 --> 09:53.000 So p(4) would then also be some number in that range.
09:54.000 --> 09:58.000 And of course, normal computation doesn't work with this.
09:58.000 --> 10:08.000 So Galois fields — fields in mathematics — also define a specific multiplication and a specific addition.
10:08.000 --> 10:12.000 So they have a different system of calculating.
10:12.000 --> 10:19.000 But it essentially only involves some more modular calculations and some more divisions.
10:19.000 --> 10:27.000 So it creates a bit more complexity when calculating, but it essentially works.
10:27.000 --> 10:40.000 And the nice thing about these Galois fields is that the theorem we had, about uniquely recovering the coefficients of the polynomial,
10:40.000 --> 10:47.000 also still holds over these fields, and that's why it works with the Galois fields.
10:47.000 --> 10:58.000 And then, of course, the numbers that we want to save on the nodes are only 16 bits and not larger.
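A runnable sketch of the whole round trip: encode n = 8 chunks, lose three nodes, and recover the data from the surviving five by solving the linear system. One stated simplification: the talk uses the Galois field GF(2^16), while this sketch computes modulo the prime 65537 (which is 2^16 + 1) — also a field, so the same uniqueness theorem applies, and Python's modular arithmetic suffices.

```python
P = 2**16 + 1   # prime, so integers mod P form a field (stand-in for GF(2^16))

def eval_mod(coeffs, x, p=P):
    """Evaluate the data polynomial at x, with all arithmetic mod p."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def recover(points, p=P):
    """Recover the k coefficients from any k points by Gauss-Jordan
    elimination on the Vandermonde system, mod p."""
    k = len(points)
    # One equation per point: m0*x^0 + m1*x^1 + ... + m_{k-1}*x^{k-1} = y.
    rows = [[pow(x, j, p) for j in range(k)] + [y % p] for x, y in points]
    for col in range(k):
        piv = next(r for r in range(col, k) if rows[r][col] != 0)
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = pow(rows[col][col], p - 2, p)   # modular inverse via Fermat
        rows[col] = [v * inv % p for v in rows[col]]
        for r in range(k):
            if r != col:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

coeffs = [26725, 27756, 28535, 28530, 27748]          # the k = 5 data chunks
nodes = [(x, eval_mod(coeffs, x)) for x in range(8)]  # n = 8 encoded chunks
alive = [nodes[i] for i in (0, 2, 3, 4, 6)]           # nodes 1, 5, 7 failed
assert recover(alive) == coeffs                       # data fully recovered
```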
10:58.000 --> 11:10.000 So, with these Galois fields, we finally get that we can tolerate n minus k node failures, but also that it is storage optimal,
11:10.000 --> 11:23.000 meaning that we can reconstruct the data from any k of the n chunks with the minimum storage overhead, which is then simply n divided by k.
11:23.000 --> 11:35.000 And, for instance, very, very large clusters — for example, storage systems like Google Colossus — have a redundancy value of 1.5,
11:35.000 --> 11:38.000 or Facebook has, for example, a redundancy value of 1.4.
11:38.000 --> 11:53.000 And here, the splitting, for example: Google Colossus splits the data first into 6 chunks and then encodes them into 9 chunks in total.
11:53.000 --> 12:04.000 And another very nice thing about these Reed-Solomon codes is that the values n and k that we looked at can essentially be chosen arbitrarily.
12:05.000 --> 12:23.000 We just have this one condition, that the number of encoded chunks has to be smaller than 2 to the power of the number of bits of a particular chunk, because otherwise we cannot solve the system of linear equations.
12:24.000 --> 12:52.000 Another very useful thing concerns scalable storage. First, we call the collection of these n encoded chunks a stripe — that's the normal terminology — and then we consider each chunk that we had before as a collection of smaller chunks.
12:52.000 --> 13:00.000 So, for example, here we say that the first chunk is made out of four chunks, and the same also holds for the other chunks.
13:00.000 --> 13:13.000 And then we create this polynomial only for, say, the first line of the data that is given by this division of the chunks.
13:13.000 --> 13:23.000 And then the first line in the encoded chunks is just the encoded first line of the unencoded chunks.
13:23.000 --> 13:30.000 And that makes it especially useful when updating or changing the data, or also when recovering.
13:30.000 --> 13:40.000 So the computations are smaller, because if we have very big chunks, the computations also get much more expensive.
13:40.000 --> 13:55.000 And another useful thing is that we can configure the Reed-Solomon codes in the so-called systematic form, which says that the final n encoded chunks
13:55.000 --> 14:05.000 include the k original chunks, so we just have the original data together with the n minus k parity chunks.
14:05.000 --> 14:16.000 So it means that it's also feasible to just read the data, and not only access it in the encoded state.
14:16.000 --> 14:29.000 And yeah, there are also still many more erasure codes that go more into the details of the overhead that is created in repairs and updates.
14:29.000 --> 14:45.000 So for instance, regenerating codes, local reconstruction codes, and also low-density parity codes — but they are all based on essentially this same idea, with adaptations.
14:45.000 --> 15:00.000 So yeah, that's already it. Thank you very much for your attention. Feel free to contact me however you like, or yeah, please ask if you have any questions.
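A sketch of the systematic form just described, in the same toy prime field as the recovery sketch above. One standard construction (assumed here, not taken from the slides) is to pick the unique polynomial that passes through the data values themselves at x = 0 .. k-1, so those evaluations are the original chunks, and the parity chunks are the extra evaluations at x = k .. n-1.

```python
P = 2**16 + 1   # same toy prime field as in the recovery sketch

def lagrange_eval(points, x, p=P):
    """Evaluate, at x, the unique degree < k polynomial through the points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p
    return total

data = [26725, 27756, 28535, 28530, 27748]    # the k = 5 original chunks
through = list(enumerate(data))               # force p(i) = data[i]
parity = [lagrange_eval(through, x) for x in range(5, 8)]
stripe = data + parity   # n = 8 chunks: originals stay readable, 3 parity
```

Any five of these eight chunks still determine the polynomial, exactly as before; the difference is only that the first five evaluations are now the data itself.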
15:00.000 --> 15:15.000 [Audience, partly inaudible] ... I mean the performance.
15:15.000 --> 15:33.000 Yeah, so the question was about the performance, and not so much about the reliability.
15:33.000 --> 15:44.000 I haven't looked at it too closely, but it essentially depends on the selection of chunks you choose.
15:44.000 --> 15:52.000 And then, for example, if the chunks get smaller, the computation overhead per chunk also gets smaller.
15:52.000 --> 16:12.000 So what's essentially done is: when you encode, you just calculate values of the polynomial, but when you decode, you're solving a system of linear equations, and there are very standard mathematical algorithms for that.
16:12.000 --> 16:25.000 So I don't have too many of the details, but yeah, it's many additions and multiplications essentially.
16:25.000 --> 16:33.000 [Audience] I just wanted to mention that this setup is very similar to Shamir's secret sharing, which also works like this:
16:33.000 --> 16:39.000 you have a secret that you want to share with n parties, so that any k of the parties can recover it.
16:39.000 --> 16:42.000 It's essentially the same idea.
16:42.000 --> 17:01.000 [Audience, partly inaudible] ... you can recover it because you're solving the linear system, and that depends on the details I guess. But that's not the only way to do it.
17:01.000 --> 17:08.000 Yeah, so essentially it was noted that this is very similar to secret sharing.
17:08.000 --> 17:16.000 And also that, for the polynomial interpolation part — what we saw with the points on the graph —
17:16.000 --> 17:21.000 another basis could be used instead,
17:21.000 --> 17:24.000 so, for example, the Lagrange polynomials,
17:24.000 --> 17:26.000 if they are known to you.
17:26.000 --> 17:29.000 So yeah, of course, that can all make the performance better.
17:29.000 --> 17:31.000 And you can encode this in very different ways.
17:31.000 --> 17:36.000 But it was just to show what essentially really happens in the erasure coding part,
17:36.000 --> 17:39.000 because it might not have been known to many people
17:39.000 --> 17:42.000 that it uses polynomial interpolation.
17:42.000 --> 17:43.000 Yeah.
17:43.000 --> 17:57.000 [Audience, partly inaudible] You had k of five, and n was eight. Now when I have only a word of, say, four bytes, how do I get up to a k of five?
17:57.000 --> 17:59.000 I mean —
17:59.000 --> 18:19.000 So the question was about choosing the n and the k values when, for example, looking only at four bytes.
18:19.000 --> 18:23.000 Well, for four bytes, this erasure coding is essentially not really great.
18:23.000 --> 18:26.000 So it was just a small example.
18:26.000 --> 18:30.000 But typically the chunks are a bit larger.
18:30.000 --> 18:35.000 And the k can essentially be chosen to be any value.
18:35.000 --> 18:41.000 So it can be anything between one and maybe the number of bits that you have.
18:41.000 --> 18:47.000 So it's really up to choice.
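Tying back to that answer, a tiny demonstration that n and k really are free choices, reusing eval_mod, recover, and P from the recovery sketch above (the four byte values are a made-up example; as said, such tiny inputs are not where erasure coding shines):

```python
from itertools import combinations

def encode(values, n, p=P):
    """Encode k data values into n chunks by evaluating at x = 0 .. n-1."""
    return [(x, eval_mod(values, x, p)) for x in range(n)]

data = [104, 101, 121, 33]    # k = 4 single-byte values ("hey!")
nodes = encode(data, 7)       # n = 7: tolerate any 3 failures

# Every choice of 4 surviving chunks recovers the data.
for survivors in combinations(nodes, 4):
    assert recover(list(survivors)) == data
```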
18:47.000 --> 18:53.000 [Audience] So, in modern storage we've got layers like compression, encryption,
18:53.000 --> 18:55.000 checksumming.
18:55.000 --> 19:00.000 Where would we arrange erasure coding between these layers?
19:00.000 --> 19:02.000 Or what are the trade-offs?
19:02.000 --> 19:05.000 Thank you for that.
19:05.000 --> 19:16.000 So the question was about where erasure coding is placed in comparison to encryption
19:16.000 --> 19:20.000 and the different storage layers.
19:20.000 --> 19:29.000 So, I mean, the erasure coding part is mainly about the recovery of data, about being able to recover.
19:29.000 --> 19:36.000 So if you have encrypted data, I guess you could just take the encrypted data as the input in the beginning
19:36.000 --> 19:41.000 and then apply the encoding by the erasure coding to it.
19:41.000 --> 19:45.000 So I think it's two different approaches.
19:45.000 --> 19:55.000 So, I mean, the encoding part is not really about decrypting or
19:55.000 --> 20:03.000 encrypting it, but it's more about the storing part, and about recovering it easily.
20:04.000 --> 20:26.000 [Inaudible question.] So the question was about what we do if not all of the redundant nodes die, essentially.
20:26.000 --> 20:29.000 But yeah.
20:29.000 --> 20:33.000 So, for example, if in our example only one node had died,
20:33.000 --> 20:39.000 we could have derived seven equations instead of five.
20:39.000 --> 20:44.000 But then the additional two equations don't really give us additional information,
20:44.000 --> 20:48.000 because we just need the five equations to recover the data.
20:48.000 --> 20:55.000 But yeah, when we look, for example, at recovery algorithms or things like that, then questions like this
20:55.000 --> 20:58.000 become more relevant.
20:58.000 --> 20:59.000 Yeah.
20:59.000 --> 21:04.000 [Audience] Does it give us error detection if we have an over-specified matrix?
21:04.000 --> 21:09.000 So, the additional ones, if they don't die out —
21:09.000 --> 21:11.000 do we have error detection there?
21:11.000 --> 21:16.000 Yeah, I think so, but I mean, essentially the erasure codes —
21:16.000 --> 21:26.000 so the question was about error detection if we have additional equations and then see: okay, the equations don't line up.
21:26.000 --> 21:42.000 So, erasure coding — these Reed-Solomon codes — were also constructed for, they come from the telecommunications field, and there they also function as error-correcting codes.
21:43.000 --> 21:53.000 So yes, essentially this could be done, but it doesn't have to be; so yeah, it could be used.
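A sketch of that error-detection idea, reusing eval_mod, recover, and P from the recovery sketch above: recover the polynomial from k of the surviving points, then check it against the extra ones; a silently corrupted value makes the equations stop lining up.

```python
def consistent(survivors, k):
    """True if all surviving points lie on one degree < k polynomial."""
    coeffs = recover(survivors[:k])
    return all(eval_mod(coeffs, x) == y for x, y in survivors[k:])

coeffs = [26725, 27756, 28535, 28530, 27748]
good = [(x, eval_mod(coeffs, x)) for x in range(7)]   # 7 survivors, k = 5
assert consistent(good, k=5)

bad = list(good)
bad[3] = (3, (bad[3][1] + 1) % P)                     # one corrupted value
assert not consistent(bad, k=5)                       # detected
```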
21:53.000 --> 21:54.000 Yeah.
21:54.000 --> 21:56.000 Yes?
21:56.000 --> 22:03.000 [Audience] Correct me if I'm wrong, but from my understanding right now, the y-values of the first k chunks
22:03.000 --> 22:08.000 carry the semantics of the characters we want to store.
22:08.000 --> 22:17.000 But how do I choose the next chunks, up to the n chunks? And when some of the first k die,
22:17.000 --> 22:22.000 but I have enough of these n, I can reconstruct the polynomial.
22:22.000 --> 22:26.000 Of course, I can calculate the coefficients.
22:26.000 --> 22:32.000 But then I would also need, as metadata, which x-values were used for each chunk,
22:32.000 --> 22:37.000 to be able to get the information I want.
22:37.000 --> 22:54.000 So the question was about the x and y values that we save in the chunks. So I don't know if I understood it correctly,
22:54.000 --> 23:05.000 but I mean, we don't really — so we just save the n encoded chunks in total.
23:05.000 --> 23:18.000 [Audience] And do the y-values carry the semantics of the characters we want to store? Is it only the y-values, and that for all n?
23:18.000 --> 23:34.000 Not really. So if we, for example, look at the first coordinate — just some coordinate — this is the function value, and this is determined, as we saw here,
23:34.000 --> 23:51.000 by summing over the encodings of the coefficients multiplied with the powers of x. So it all plays together in each chunk.
23:51.000 --> 24:20.000 [Inaudible question.] So the question was about how much space the encoding by erasure coding takes, and this really depends on the choice of the values n and k.
24:20.000 --> 24:31.000 So we saw, for example, here: the minimum storage redundancy is n divided by k, and that is attained.
24:31.000 --> 24:43.000 So if we choose n to be equal to k, then of course it doesn't give us anything new, but we use the same space.
24:43.000 --> 25:04.000 And it depends on these numbers: if, for example, n is twice the amount of k, then it takes twice the space, but it still gives us a better reliability on the system — except if we are considering just the k value equal to one.
25:04.000 --> 25:05.000 Yes?
25:05.000 --> 25:19.000 [Audience, partly inaudible] I know it's not your topic exactly, but is this also implemented in hardware somewhere?
25:19.000 --> 25:27.000 So the question was about whether this erasure coding is also in some hardware.
25:27.000 --> 25:38.000 I think it is used, but it's not really hardware specific, because it's just about saving these values on hardware.
25:38.000 --> 25:42.000 So I don't know too much about this.
25:42.000 --> 25:57.000 [Audience, partly inaudible] I think most modern SSDs do something like this, which also, I think, gives you some sort of data correction when reading.
25:57.000 --> 26:00.000 And we have instruction sets for that.
26:00.000 --> 26:01.000 Yes.
26:01.000 --> 26:27.000 [Inaudible exchange.]
26:28.000 --> 26:29.000 Yes?
26:29.000 --> 26:37.000 [Audience, partly inaudible] For example... that's a systematic code. Can you give any examples that are not encoded in the systematic form?
26:37.000 --> 26:39.000 I haven't really looked —
26:39.000 --> 26:44.000 yeah, the question was about any examples
26:44.000 --> 26:48.000 where it's not encoded in the systematic form.
26:48.000 --> 26:54.000 I haven't really looked at many other examples, but it's essentially all the same thing,
26:54.000 --> 27:01.000 because you're just changing — if you consider the matrix that's created by the linear system of equations —
27:01.000 --> 27:06.000 it's just, yeah, applying simple changes.
27:06.000 --> 27:08.000 And then it's equivalent.
27:08.000 --> 27:13.000 But I don't know if it's used in that way somewhere.
27:13.000 --> 27:14.000 Yeah.
27:14.000 --> 27:18.000 [Audience] I just have one more comment, and then I'll drop out.
27:18.000 --> 27:24.000 When you lose a disk in a large storage array, and you want to restore it,
27:24.000 --> 27:28.000 and you have a redundancy of, basically, three:
27:28.000 --> 27:38.000 can you pick any three of the other disks to reconstruct the information of the disk that has to be restored?
27:38.000 --> 27:43.000 So, the main advantage — it just struck me from the other side —
27:43.000 --> 27:49.000 the advantage would be that I don't have to read the copies from just one disk.
27:49.000 --> 27:54.000 I can just read one third of the data from one disk, one third of the data from another disk,
27:54.000 --> 27:55.000 and have it,
27:55.000 --> 28:00.000 because I have enough redundancy from three other disks.
28:00.000 --> 28:01.000 Yeah.
28:01.000 --> 28:04.000 Just wanted to make that comment there.
28:04.000 --> 28:06.000 Yeah, exactly.
28:07.000 --> 28:12.000 Thank you.
28:12.000 --> 28:14.000 Okay.
28:14.000 --> 28:15.000 Thank you.
28:15.000 --> 28:16.000 Any more questions?
28:16.000 --> 28:19.000 Comments we can have later in the hallway.
28:19.000 --> 28:20.000 All right.
28:20.000 --> 28:21.000 Thank you.
28:21.000 --> 28:23.000 Thank you.