WEBVTT
00:00.000 --> 00:10.000 Thank you very much.
00:10.000 --> 00:11.000 So yeah, let's start.
00:11.000 --> 00:18.000 I will talk about a bit of the mathematics, the fundamental mathematics, of erasure coding.
00:18.000 --> 00:20.000 First of all, about me.
00:20.000 --> 00:22.000 I'm not a software developer.
00:22.000 --> 00:24.000 I'm just a mathematics student,
00:24.000 --> 00:26.000 currently studying in Oslo.
00:26.000 --> 00:29.000 I'm a big fan of Austin, so that's why I'm here.
00:30.000 --> 00:33.000 Yeah, first: what is erasure coding?
00:33.000 --> 00:38.000 You've probably heard that erasure coding is a redundancy technique for storage systems,
00:38.000 --> 00:41.000 so essentially for data protection.
00:41.000 --> 00:52.000 And it enables storage efficiency in particular, by distributing data in a smart way over different nodes in a cluster.
00:52.000 --> 01:00.000 So what's essentially done first is that we take some data and split it into k chunks.
01:00.000 --> 01:05.000 And then they are, magically, encoded into n chunks.
01:05.000 --> 01:14.000 And by this, any k of the n encoded chunks are going to be enough to recover all the data, so up to n minus k chunks can be lost.
01:14.000 --> 01:21.000 Yeah, also what's quite important is that it considers transient and permanent failures.
01:21.000 --> 01:28.000 So it's not trying to repair errors on the disks, but just considers them as erasures.
01:28.000 --> 01:32.000 Yeah, it has less redundancy than replication.
01:32.000 --> 01:35.000 So that's the nice thing about erasure coding.
01:35.000 --> 01:39.000 And it's, for example, typically used in cloud storage.
01:39.000 --> 02:00.000 So yeah, first a bit about the history: the fundamental idea behind erasure coding was presented by Reed and Solomon in their paper from 1960, "Polynomial Codes over Certain Finite Fields", which is a very mathematical title.
02:00.000 --> 02:03.000 But it wasn't really applied at that time.
02:03.000 --> 02:09.000 So first there was RAID, the redundant array of independent (or inexpensive) disks.
02:09.000 --> 02:16.000 And erasure coding has now been commonly applied since the early 2000s.
02:16.000 --> 02:23.000 So now we start with the Reed-Solomon code, which is the fundamental basis of erasure coding.
02:23.000 --> 02:30.000 And we consider the simple string "helloworld" and divide it into five chunks,
02:30.000 --> 02:34.000 like this, so that they all have the same size.
02:34.000 --> 02:41.000 And then they are encoded, as I said before, and we get, for example, now eight chunks.
02:41.000 --> 02:51.000 And the important thing is that five of these eight encoded chunks will be enough to recover the whole string, so all of the data.
02:51.000 --> 02:56.000 But of course: what is this encoding, and how do we recover the data?
02:57.000 --> 03:04.000 For that, we now need polynomials, which you hopefully have seen at some point in school.
03:04.000 --> 03:11.000 And yeah, we call these numbers a_0 to a_{k-1} the coefficients.
03:11.000 --> 03:14.000 And a polynomial, for example, looks like this.
03:14.000 --> 03:23.000 So the one in the middle, for example, is the simple polynomial where a_1 is one and the others are all zero.
03:23.000 --> 03:30.000 And above there's one where a_1 is two and a_0 is minus one, and the others are zero.
03:30.000 --> 03:40.000 And the one underneath has a_2 equal to 0.5, and is then a quadratic polynomial.
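A minimal sketch of the chunking step described above, in Python. The example string "helloworld" and the big-endian byte encoding are assumptions for illustration; the talk only fixes the first chunk, "he", and its number 26,725, which this reproduces. These numbers become the polynomial coefficients in the next step.

```python
def chunk_to_int(chunk: bytes) -> int:
    """Interpret a chunk of bytes as one unsigned integer (big-endian)."""
    return int.from_bytes(chunk, "big")

data = b"helloworld"           # assumed example string, 10 letters
k = 5                          # number of data chunks
size = len(data) // k          # bytes per chunk (2 here)
chunks = [data[i * size:(i + 1) * size] for i in range(k)]
numbers = [chunk_to_int(c) for c in chunks]

print(chunks)    # [b'he', b'll', b'ow', b'or', b'ld']
print(numbers)   # [26725, 27756, 28535, 28530, 27748]
```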
03:40.000 --> 03:47.000 Now we need to use a fundamental theorem from mathematics about polynomials.
03:47.000 --> 03:52.000 And for that, we now try to draw polynomials through points.
03:53.000 --> 04:01.000 So first we look at this point A, which is simply the point (2, 2), with x and y coordinates two and two.
04:01.000 --> 04:13.000 And if we look at very simple polynomials like the ones above, we see: okay, the only one that goes directly through that point is the one that's just constantly two.
04:13.000 --> 04:22.000 But if we look at polynomials that are a bit more complex, we see that we have different options to draw them through this point.
04:22.000 --> 04:31.000 And if we add another point, we see: okay, there's again only one unique polynomial that goes directly through these two points, and it looks like this.
04:31.000 --> 04:42.000 And if we continue the whole thing and consider polynomials of a higher degree, so a bit more complex, we again get the same: if we add another point,
04:42.000 --> 04:50.000 there is only one unique polynomial of that form that goes through the three points.
04:50.000 --> 04:54.000 So this is essentially what we observed, summed up.
04:54.000 --> 04:57.000 And this gives us the theorem.
04:57.000 --> 05:01.000 And it looks maybe a bit overwhelming, but it's just what we observed.
05:01.000 --> 05:11.000 So given k coordinates, where the x coordinates are pairwise distinct, there exists a unique polynomial of degree at most k minus one that looks like this
05:12.000 --> 05:17.000 and that attains exactly these points on its graph.
05:17.000 --> 05:32.000 And equivalently, we could also say: if we consider distinct points x_0 through x_{k-1}, each polynomial of degree up to k minus one is uniquely determined by its values at these points.
05:33.000 --> 05:39.000 Yeah, how does that help us? And what does the polynomial really do?
05:39.000 --> 05:50.000 The central idea is now to construct the polynomial, or more specifically the coefficients of the polynomial, from the data that we split up.
05:50.000 --> 05:53.000 So we considered these five chunks.
05:53.000 --> 06:01.000 And now if we look at the first chunk, which is just the letters h and e, we have a decimal number for
06:01.000 --> 06:06.000 these two letters by just some simple encoding; which one doesn't really matter now.
06:06.000 --> 06:11.000 In the example it is 26,725.
06:11.000 --> 06:16.000 Then we take this m_0 to be the first coefficient in the polynomial.
06:16.000 --> 06:21.000 And the same is then done for all the other chunks as well.
06:21.000 --> 06:25.000 But then we have some powers of x that we multiply them with.
06:26.000 --> 06:31.000 So we get this polynomial, which looks very scary.
06:31.000 --> 06:40.000 And if we try to plot it, it also looks a bit horrible, because it attains really, really high values.
06:40.000 --> 06:44.000 But now, why did we need the mathematical result?
06:44.000 --> 06:54.000 Well, we saw from the mathematical result that we can uniquely determine this polynomial by considering five points in total.
06:55.000 --> 06:59.000 So here's the theorem again.
06:59.000 --> 07:07.000 And, for example, what is typically done is that we consider the x values zero up to k minus one.
07:07.000 --> 07:15.000 And these are then, in our example, exactly these points, which can be simply computed.
07:15.000 --> 07:18.000 And here they are plotted.
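A sketch of this encoding step, continuing the chunking example above (the coefficient values are the assumed "helloworld" numbers): the data polynomial is p(x) = m_0 + m_1 x + ... + m_4 x^4, and encoding is just evaluating it.

```python
def poly_eval(coeffs, x):
    """Evaluate p(x) = m0 + m1*x + ... + m4*x^4 via Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# Coefficients m_0 .. m_4 from the chunking sketch above.
coeffs = [26725, 27756, 28535, 28530, 27748]

# Evaluating at x = 0 .. k-1 gives exactly the k points from the slide.
points = [(x, poly_eval(coeffs, x)) for x in range(5)]
print(points[0])             # (0, 26725): p(0) is just the first chunk
print(poly_eval(coeffs, 4))  # 9523717, which already needs 24 bits
```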
07:18.000 --> 07:27.000 But that doesn't introduce any redundancy yet, because we know we need these five points to uniquely determine the polynomial.
07:27.000 --> 07:31.000 So we consider more points, for example eight points.
07:31.000 --> 07:35.000 And these would then be the coordinates.
07:35.000 --> 07:44.000 And then the idea is to just distribute these n coordinates that we now have onto our cluster.
07:44.000 --> 07:48.000 So they're all saved on different nodes.
07:48.000 --> 08:02.000 And from these nodes, we now have that if we consider any five points, so any five coordinates of the polynomial,
08:02.000 --> 08:06.000 we get a uniquely solvable system of linear equations.
08:06.000 --> 08:12.000 And that's why we can uniquely re-determine our polynomial, which was essentially the data.
08:12.000 --> 08:19.000 So, for example, say that nodes one, five, and seven fail.
08:19.000 --> 08:26.000 Then from node zero, we get that the polynomial value at zero should look like this.
08:26.000 --> 08:30.000 And from the y-value, you get such an equation from each node.
08:30.000 --> 08:35.000 And the same can be done for all the other nodes that are still left.
08:35.000 --> 08:40.000 And well, this is a system of linear equations in five variables.
08:40.000 --> 08:50.000 And we have five equations, so we can uniquely recover the coefficients, which are, again, simply the data we had.
08:50.000 --> 08:59.000 But now, of course, we saw that, for example, for p(4), we get very, very big values.
08:59.000 --> 09:07.000 And if we simply consider the letters: two letters make up 16 bits, but this number would, for example, take 24 bits.
09:07.000 --> 09:11.000 So we need more space, so it doesn't really make things better.
09:11.000 --> 09:16.000 And that's why the so-called Galois fields are used.
09:16.000 --> 09:20.000 So this is what is done there.
09:20.000 --> 09:25.000 And that's also what forms these actual Reed-Solomon codes.
09:25.000 --> 09:30.000 And what is done there is that we essentially don't look at all the numbers,
09:31.000 --> 09:40.000 but we restrict to using only the integers 0 up to, in our case for example, 2 to the power of 16 minus 1.
09:40.000 --> 09:47.000 And then not only are the coefficients in that range, but also the values of the function.
09:47.000 --> 09:53.000 So p(4) would then also be some number in that range.
09:54.000 --> 09:58.000 And of course, normal computation doesn't work with this.
09:58.000 --> 10:08.000 So Galois fields — fields in mathematics — also define a specific multiplication and a specific addition.
10:08.000 --> 10:12.000 So they have a different system of calculating.
10:12.000 --> 10:19.000 But it essentially only involves some more modular calculations and some more divisions.
10:19.000 --> 10:27.000 So it creates a bit more complexity when calculating, but it essentially works.
10:27.000 --> 10:40.000 And the nice thing about these Galois fields is that the theorem we had, about uniquely recovering the coefficients of the polynomial,
10:40.000 --> 10:47.000 also still holds over these fields, and that's why it works with the Galois fields.
10:47.000 --> 10:58.000 And then, of course, the numbers that we want to save on the nodes are only 16 bits and not larger.
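A runnable sketch of the whole round trip: encode n = 8 chunks, lose three nodes, and recover the data from the surviving five by solving the linear system. One stated simplification: the talk uses the Galois field GF(2^16), while this sketch computes modulo the prime 65537 (which is 2^16 + 1) — also a field, so the same uniqueness theorem applies, and Python's modular arithmetic suffices.

```python
P = 2**16 + 1   # prime, so integers mod P form a field (stand-in for GF(2^16))

def eval_mod(coeffs, x, p=P):
    """Evaluate the data polynomial at x, with all arithmetic mod p."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def recover(points, p=P):
    """Recover the k coefficients from any k points by Gauss-Jordan
    elimination on the Vandermonde system, mod p."""
    k = len(points)
    # One equation per point: m0*x^0 + m1*x^1 + ... + m_{k-1}*x^{k-1} = y.
    rows = [[pow(x, j, p) for j in range(k)] + [y % p] for x, y in points]
    for col in range(k):
        piv = next(r for r in range(col, k) if rows[r][col] != 0)
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = pow(rows[col][col], p - 2, p)   # modular inverse via Fermat
        rows[col] = [v * inv % p for v in rows[col]]
        for r in range(k):
            if r != col:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

coeffs = [26725, 27756, 28535, 28530, 27748]          # the k = 5 data chunks
nodes = [(x, eval_mod(coeffs, x)) for x in range(8)]  # n = 8 encoded chunks
alive = [nodes[i] for i in (0, 2, 3, 4, 6)]           # nodes 1, 5, 7 failed
assert recover(alive) == coeffs                       # data fully recovered
```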
10:58.000 --> 11:10.000 So, with these Galois fields, we finally get that we can tolerate n minus k node failures, but also that it is storage optimal,
11:10.000 --> 11:23.000 meaning that we can reconstruct the data from any k of the n chunks with the minimum storage overhead, which is then simply n divided by k.
11:23.000 --> 11:35.000 And, for instance, very, very large clusters — for example, storage systems like Google Colossus — have a redundancy value of 1.5,
11:35.000 --> 11:38.000 or Facebook has, for example, a redundancy value of 1.4.
11:38.000 --> 11:53.000 And here, the splitting, for example: Google Colossus splits the data first into 6 chunks and then encodes them into 9 chunks in total.
11:53.000 --> 12:04.000 And another very nice thing about these Reed-Solomon codes is that the values n and k that we looked at can essentially be chosen arbitrarily.
12:05.000 --> 12:23.000 We just have this one condition, that the number of encoded chunks has to be smaller than 2 to the power of the number of bits of a particular chunk, because otherwise we cannot solve the system of linear equations.
12:24.000 --> 12:52.000 Another very useful thing concerns scalable storage. First, we call the collection of these n encoded chunks a stripe — that's the normal terminology — and then we consider each chunk that we had before as a collection of smaller chunks.
12:52.000 --> 13:00.000 So, for example, here we say that the first chunk is made out of four chunks, and the same also holds for the other chunks.
13:00.000 --> 13:13.000 And then we create this polynomial only for, say, the first line of the data that is given by this division of the chunks.
13:13.000 --> 13:23.000 And then the first line in the encoded chunks is just the encoded first line of the unencoded chunks.
13:23.000 --> 13:30.000 And that makes it especially useful when updating or changing the data, or also when recovering.
13:30.000 --> 13:40.000 So the computations are smaller, because if we have very big chunks, the computations also get much more expensive.
13:40.000 --> 13:55.000 And another useful thing is that we can configure the Reed-Solomon codes in the so-called systematic form, which says that the final n encoded chunks
13:55.000 --> 14:05.000 include the k original chunks, so we just have the original data together with the n minus k parity chunks.
14:05.000 --> 14:16.000 So it means that it's also feasible to just read the data, and not only access it in the encoded state.
14:16.000 --> 14:29.000 And yeah, there are also still many more erasure codes that go more into the details of the overhead that is created in repairs and updates.
14:29.000 --> 14:45.000 So for instance, regenerating codes, local reconstruction codes, and also low-density parity codes — but they are all based on essentially this same idea, with adaptations.
14:45.000 --> 15:00.000 So yeah, that's already it. Thank you very much for your attention. Feel free to contact me however you like, or yeah, please ask if you have any questions.
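A sketch of the systematic form just described, in the same toy prime field as the recovery sketch above. One standard construction (assumed here, not taken from the slides) is to pick the unique polynomial that passes through the data values themselves at x = 0 .. k-1, so those evaluations are the original chunks, and the parity chunks are the extra evaluations at x = k .. n-1.

```python
P = 2**16 + 1   # same toy prime field as in the recovery sketch

def lagrange_eval(points, x, p=P):
    """Evaluate, at x, the unique degree < k polynomial through the points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p
    return total

data = [26725, 27756, 28535, 28530, 27748]    # the k = 5 original chunks
through = list(enumerate(data))               # force p(i) = data[i]
parity = [lagrange_eval(through, x) for x in range(5, 8)]
stripe = data + parity   # n = 8 chunks: originals stay readable, 3 parity
```

Any five of these eight chunks still determine the polynomial, exactly as before; the difference is only that the first five evaluations are now the data itself.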
15:00.000 --> 15:15.000 [Audience, partly inaudible] ... I mean the performance.
15:15.000 --> 15:33.000 Yeah, so the question was about the performance, and not so much about the reliability.
15:33.000 --> 15:44.000 I haven't looked at it too closely, but it essentially depends on the selection of chunks you choose.
15:44.000 --> 15:52.000 And then, for example, if the chunks get smaller, the computation overhead per chunk also gets smaller.
15:52.000 --> 16:12.000 So what's essentially done is: when you encode, you just calculate values of the polynomial, but when you decode, you're solving a system of linear equations, and there are very standard mathematical algorithms for that.
16:12.000 --> 16:25.000 So I don't have too many of the details, but yeah, it's many additions and multiplications essentially.
16:25.000 --> 16:33.000 [Audience] I just wanted to mention that this setup is very similar to Shamir's secret sharing, which also works like this:
16:33.000 --> 16:39.000 you have a secret that you want to share with n parties, so that any k of the parties can recover it.
16:39.000 --> 16:42.000 It's essentially the same idea.
16:42.000 --> 17:01.000 [Audience, partly inaudible] ... you can recover it because you're solving the linear system, and that depends on the details I guess. But that's not the only way to do it.
17:01.000 --> 17:08.000 Yeah, so essentially it was noted that this is very similar to secret sharing.
17:08.000 --> 17:16.000 And also that, for the polynomial interpolation part — what we saw with the points on the graph —
17:16.000 --> 17:21.000 another basis could be used instead,
17:21.000 --> 17:24.000 so, for example, the Lagrange polynomials,
17:24.000 --> 17:26.000 if they are known to you.
17:26.000 --> 17:29.000 So yeah, of course, that can all make the performance better.
17:29.000 --> 17:31.000 And you can encode this in very different ways.
17:31.000 --> 17:36.000 But it was just to show what essentially really happens in the erasure coding part,
17:36.000 --> 17:39.000 because it might not have been known to many people
17:39.000 --> 17:42.000 that it uses polynomial interpolation.
17:42.000 --> 17:43.000 Yeah.
17:43.000 --> 17:57.000 [Audience, partly inaudible] You had k of five, and n was eight. Now when I have only a word of, say, four bytes, how do I get up to a k of five?
17:57.000 --> 17:59.000 I mean —
17:59.000 --> 18:19.000 So the question was about choosing the n and the k values when, for example, looking only at four bytes.
18:19.000 --> 18:23.000 Well, for four bytes, this erasure coding is essentially not really great.
18:23.000 --> 18:26.000 So it was just a small example.
18:26.000 --> 18:30.000 But typically the chunks are a bit larger.
18:30.000 --> 18:35.000 And the k can essentially be chosen to be any value.
18:35.000 --> 18:41.000 So it can be anything between one and maybe the number of bits that you have.
18:41.000 --> 18:47.000 So it's really up to choice.
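Tying back to that answer, a tiny demonstration that n and k really are free choices, reusing eval_mod, recover, and P from the recovery sketch above (the four byte values are a made-up example; as said, such tiny inputs are not where erasure coding shines):

```python
from itertools import combinations

def encode(values, n, p=P):
    """Encode k data values into n chunks by evaluating at x = 0 .. n-1."""
    return [(x, eval_mod(values, x, p)) for x in range(n)]

data = [104, 101, 121, 33]    # k = 4 single-byte values ("hey!")
nodes = encode(data, 7)       # n = 7: tolerate any 3 failures

# Every choice of 4 surviving chunks recovers the data.
for survivors in combinations(nodes, 4):
    assert recover(list(survivors)) == data
```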
18:47.000 --> 18:53.000 [Audience] So, in modern storage we've got layers like compression, encryption,
18:53.000 --> 18:55.000 checksumming.
18:55.000 --> 19:00.000 Where would we arrange erasure coding between these layers?
19:00.000 --> 19:02.000 Or what are the trade-offs?
19:02.000 --> 19:05.000 Thank you for that.
19:05.000 --> 19:16.000 So the question was about where erasure coding is placed in comparison to encryption
19:16.000 --> 19:20.000 and the different storage layers.
19:20.000 --> 19:29.000 So, I mean, the erasure coding part is mainly about the recovery of data, about being able to recover.
19:29.000 --> 19:36.000 So if you have encrypted data, I guess you could just take the encrypted data as the input in the beginning
19:36.000 --> 19:41.000 and then apply the encoding by the erasure coding to it.
19:41.000 --> 19:45.000 So I think it's two different approaches.
19:45.000 --> 19:55.000 So, I mean, the encoding part is not really about decrypting or
19:55.000 --> 20:03.000 encrypting it, but it's more about the storing part, and about recovering it easily.
20:04.000 --> 20:26.000 [Inaudible question.] So the question was about what we do if not all of the redundant nodes die, essentially.
20:26.000 --> 20:29.000 But yeah.
20:29.000 --> 20:33.000 So, for example, if in our example only one node had died,
20:33.000 --> 20:39.000 we could have derived seven equations instead of five.
20:39.000 --> 20:44.000 But then the additional two equations don't really give us additional information,
20:44.000 --> 20:48.000 because we just need the five equations to recover the data.
20:48.000 --> 20:55.000 But yeah, when we look, for example, at recovery algorithms or things like that, then questions like this
20:55.000 --> 20:58.000 become more relevant.
20:58.000 --> 20:59.000 Yeah.
20:59.000 --> 21:04.000 [Audience] Does it give us error detection if we have an over-specified matrix?
21:04.000 --> 21:09.000 So, the additional ones, if they don't die out —
21:09.000 --> 21:11.000 do we have error detection there?
21:11.000 --> 21:16.000 Yeah, I think so, but I mean, essentially the erasure codes —
21:16.000 --> 21:26.000 so the question was about error detection if we have additional equations and then see: okay, the equations don't line up.
21:26.000 --> 21:42.000 So, erasure coding — these Reed-Solomon codes — were also constructed for, they come from the telecommunications field, and there they also function as error-correcting codes.
21:43.000 --> 21:53.000 So yes, essentially this could be done, but it doesn't have to be; so yeah, it could be used.
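A sketch of that error-detection idea, reusing eval_mod, recover, and P from the recovery sketch above: recover the polynomial from k of the surviving points, then check it against the extra ones; a silently corrupted value makes the equations stop lining up.

```python
def consistent(survivors, k):
    """True if all surviving points lie on one degree < k polynomial."""
    coeffs = recover(survivors[:k])
    return all(eval_mod(coeffs, x) == y for x, y in survivors[k:])

coeffs = [26725, 27756, 28535, 28530, 27748]
good = [(x, eval_mod(coeffs, x)) for x in range(7)]   # 7 survivors, k = 5
assert consistent(good, k=5)

bad = list(good)
bad[3] = (3, (bad[3][1] + 1) % P)                     # one corrupted value
assert not consistent(bad, k=5)                       # detected
```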
21:53.000 --> 21:54.000 Yeah.
21:54.000 --> 21:56.000 Yes?
21:56.000 --> 22:03.000 [Audience] Correct me if I'm wrong, but from my understanding right now, the y-values of the first k chunks
22:03.000 --> 22:08.000 carry the semantics of the characters we want to store.
22:08.000 --> 22:17.000 But how do I choose the next chunks, up to the n chunks? And when some of the first k die,
22:17.000 --> 22:22.000 but I have enough of these n, I can reconstruct the polynomial.
22:22.000 --> 22:26.000 Of course, I can calculate the coefficients.
22:26.000 --> 22:32.000 But then I would also need, as metadata, which x-values were used for each chunk,
22:32.000 --> 22:37.000 to be able to get the information I want.
22:37.000 --> 22:54.000 So the question was about the x and y values that we save in the chunks. So I don't know if I understood it correctly,
22:54.000 --> 23:05.000 but I mean, we don't really — so we just save the n encoded chunks in total.
23:05.000 --> 23:18.000 [Audience] And do the y-values carry the semantics of the characters we want to store? Is it only the y-values, and that for all n?
23:18.000 --> 23:34.000 Not really. So if we, for example, look at the first coordinate — just some coordinate — this is the function value, and this is determined, as we saw here,
23:34.000 --> 23:51.000 by summing over the encodings of the coefficients multiplied with the powers of x. So it all plays together in each chunk.
23:51.000 --> 24:20.000 [Inaudible question.] So the question was about how much space the encoding by erasure coding takes, and this really depends on the choice of the values n and k.
24:20.000 --> 24:31.000 So we saw, for example, here: the minimum storage redundancy is n divided by k, and that is attained.
24:31.000 --> 24:43.000 So if we choose n to be equal to k, then of course it doesn't give us anything new, but we use the same space.
24:43.000 --> 25:04.000 And it depends on these numbers: if, for example, n is twice the amount of k, then it takes twice the space, but it still gives us a better reliability on the system — except if we are considering just the k value equal to one.
25:04.000 --> 25:05.000 Yes?
25:05.000 --> 25:19.000 [Audience, partly inaudible] I know it's not your topic exactly, but is this also implemented in hardware somewhere?
25:19.000 --> 25:27.000 So the question was about whether this erasure coding is also in some hardware.
25:27.000 --> 25:38.000 I think it is used, but it's not really hardware specific, because it's just about saving these values on hardware.
25:38.000 --> 25:42.000 So I don't know too much about this.
25:42.000 --> 25:57.000 [Audience, partly inaudible] I think most modern SSDs do something like this, which also, I think, gives you some sort of data correction when reading.
25:57.000 --> 26:00.000 And we have instruction sets for that.
26:00.000 --> 26:01.000 Yes.
26:01.000 --> 26:27.000 [Inaudible exchange.]
26:28.000 --> 26:29.000 Yes?
26:29.000 --> 26:37.000 [Audience, partly inaudible] For example... that's a systematic code. Can you give any examples that are not encoded in the systematic form?
26:37.000 --> 26:39.000 I haven't really looked —
26:39.000 --> 26:44.000 yeah, the question was about any examples
26:44.000 --> 26:48.000 where it's not encoded in the systematic form.
26:48.000 --> 26:54.000 I haven't really looked at many other examples, but it's essentially all the same thing,
26:54.000 --> 27:01.000 because you're just changing — if you consider the matrix that's created by the linear system of equations —
27:01.000 --> 27:06.000 it's just, yeah, applying simple changes.
27:06.000 --> 27:08.000 And then it's equivalent.
27:08.000 --> 27:13.000 But I don't know if it's used in that way somewhere.
27:13.000 --> 27:14.000 Yeah.
27:14.000 --> 27:18.000 [Audience] I just have one more comment, and then I'll drop out.
27:18.000 --> 27:24.000 When you lose a disk in a large storage array, and you want to restore it,
27:24.000 --> 27:28.000 and you have a redundancy of, basically, three:
27:28.000 --> 27:38.000 can you pick any three of the other disks to reconstruct the information of the disk that has to be restored?
27:38.000 --> 27:43.000 So, the main advantage — it just struck me from the other side —
27:43.000 --> 27:49.000 the advantage would be that I don't have to read the copies from just one disk.
27:49.000 --> 27:54.000 I can just read one third of the data from one disk, one third of the data from another disk,
27:54.000 --> 27:55.000 and have it,
27:55.000 --> 28:00.000 because I have enough redundancy from three other disks.
28:00.000 --> 28:01.000 Yeah.
28:01.000 --> 28:04.000 Just wanted to make that comment there.
28:04.000 --> 28:06.000 Yeah, exactly.
28:07.000 --> 28:12.000 Thank you.
28:12.000 --> 28:14.000 Okay.
28:14.000 --> 28:15.000 Thank you.
28:15.000 --> 28:16.000 Any more questions?
28:16.000 --> 28:19.000 Comments we can have later in the hallway.
28:19.000 --> 28:20.000 All right.
28:20.000 --> 28:21.000 Thank you.
28:21.000 --> 28:23.000 Thank you.