WEBVTT

00:00.000 --> 00:08.000
Hi, my name is Rok, and this is the link here.

00:08.000 --> 00:15.000
We'll try to convince you that if you're using Arrow or Parquet or something in that ecosystem,

00:15.000 --> 00:23.000
you should try Apache, I mean, you should try tensor arrays to store your tensor data or multi-dimensional array data.

00:23.000 --> 00:27.000
Our talk is composed of a short overview of Arrow extension types,

00:27.000 --> 00:31.000
then we'll talk about the fixed-shape tensor and the variable-shape tensor,

00:31.000 --> 00:37.000
a little bit about the integration with NumPy, and then finally DLPack.

00:37.000 --> 00:41.000
So what are Arrow extension types?

00:41.000 --> 00:48.000
Arrow currently provides several data types, but of course, people always want more.

00:48.000 --> 00:56.000
So at some point we decided to add user extension types to enable people to build their own.

00:56.000 --> 01:02.000
However, because some extension types are more often used than others,

01:02.000 --> 01:05.000
we also decided to make a registry of well-known extension types,

01:05.000 --> 01:09.000
or rather canonical types that live in the arrow namespace.

01:09.000 --> 01:16.000
And in the current implementation, we have the Arrow fixed-shape tensor,

01:16.000 --> 01:23.000
the Arrow variable-shape tensor, JSON, Opaque, and Bool8.

01:23.000 --> 01:29.000
These are the currently available specs for extension arrays.

01:29.000 --> 01:33.000
So let's first talk about the fixed-shape tensor.

01:33.000 --> 01:42.000
Fixed-shape tensor arrays are multi-dimensional arrays of a certain type,

01:42.000 --> 01:48.000
and we represent them in Arrow with a fixed-size list,

01:49.000 --> 01:55.000
which means that we have an array where, as you can see here,

01:55.000 --> 02:04.000
each row is a set of values, and there need to be as many values

02:04.000 --> 02:08.000
as you get if you multiply all the shape numbers, right?

02:08.000 --> 02:13.000
So if you have a 2 by 2 tensor, you should have 4 values there.

02:13.000 --> 02:23.000
And besides the data itself, we also carry metadata, which is the shape of these tensors.

02:23.000 --> 02:29.000
And, actually, we also carry dimension names and a permutation,

02:29.000 --> 02:37.000
and the permutation helps you calculate the strides of the tensors out of the shape.

02:38.000 --> 02:40.000
Yeah, basically that's it.

02:40.000 --> 02:45.000
The data itself is stored in this fixed-size list,

02:45.000 --> 02:51.000
in so-called C-contiguous or row-major order.

02:51.000 --> 02:56.000
So that means that you can slice the array, right?

02:56.000 --> 03:02.000
And each of these rows is the C-contiguous data for one tensor.

03:02.000 --> 03:07.000
So every cell in this array is a tensor itself, right?

03:07.000 --> 03:14.000
So here's an example of how the metadata of such a tensor array is serialized.

03:14.000 --> 03:19.000
We have the shape, we have the names of the dimensions, and then the permutation of the dimensions.

03:19.000 --> 03:25.000
So, out of the permutation, again, you can calculate the strides of the tensors.
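NOTE
A minimal sketch of the fixed-shape tensor type just described, assuming a
recent PyArrow release that ships the canonical fixed_shape_tensor type; the
concrete value type, shape, and dimension names are illustrative only.

    import pyarrow as pa

    # Each tensor is 2x2, so the storage is a fixed-size list of
    # 2 * 2 = 4 float32 values per row.
    tensor_type = pa.fixed_shape_tensor(
        pa.float32(),
        [2, 2],
        dim_names=["x", "y"],  # optional; a permutation can also be given
    )

    print(tensor_type.storage_type)  # fixed_size_list<item: float>[4]
    print(tensor_type.shape)         # [2, 2]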
03:26.000 --> 03:31.000
Now, on to the second one, the variable-shape tensor array.

03:31.000 --> 03:34.000
This case is a little bit more complicated.

03:34.000 --> 03:39.000
Every row of the array is a struct,

03:39.000 --> 03:44.000
because not only do we carry the data of the tensors,

03:44.000 --> 03:47.000
we also need to carry their shapes.

03:47.000 --> 03:51.000
So the way that struct arrays are done in Arrow,

03:51.000 --> 03:55.000
it basically means you have two child arrays next to each other.

03:55.000 --> 03:57.000
Correct me if I'm wrong, guys.

03:57.000 --> 03:59.000
Yeah? Yeah, okay.

03:59.000 --> 04:04.000
So that's how we carry them.

04:04.000 --> 04:09.000
And one contains the actual data.

04:11.000 --> 04:12.000
Was I muted?

04:12.000 --> 04:13.000
Yes.

04:13.000 --> 04:14.000
Oh.

04:14.000 --> 04:16.000
There go our promotions.

04:16.000 --> 04:20.000
Hi, stream, nice to meet you.

04:20.000 --> 04:24.000
So yeah, the variable-shape tensor.

04:24.000 --> 04:30.000
So first we have the data as the first child array, and the second child array

04:30.000 --> 04:34.000
carries the shape of each individual tensor.

04:34.000 --> 04:42.000
Besides that, we also carry, of course, the dimension names

04:43.000 --> 04:50.000
and the permutation, again to calculate strides out of the shapes.

04:50.000 --> 04:52.000
And then we have this uniform shape.

04:52.000 --> 05:01.000
This is for the case where only one of the dimensions changes across the data,

05:01.000 --> 05:08.000
so that you don't need to read the whole shape for every row.

05:08.000 --> 05:12.000
And the data is also stored in row-major,

05:12.000 --> 05:14.000
C-contiguous order.

05:14.000 --> 05:19.000
So here's an example of the uniform shape parameter.

05:19.000 --> 05:25.000
You can see that here the convention is that the first

05:25.000 --> 05:28.000
value of the shape will always be 400.

05:28.000 --> 05:33.000
The second one can change; the third one will always be three.

05:33.000 --> 05:37.000
So when the reader or writer works with this,

05:37.000 --> 05:42.000
they don't need to read the first and third parameters.

05:42.000 --> 05:44.000
Yeah, oh yeah.

05:44.000 --> 05:49.000
And here the shape changes from row to row,

05:49.000 --> 05:52.000
but the number of dimensions always stays the same.

05:52.000 --> 05:55.000
That was kind of a design decision that we made.

05:55.000 --> 06:00.000
Yeah, and over to you, Alenka.

06:00.000 --> 06:05.000
Okay, so this is strange because this is for the video.

06:05.000 --> 06:08.000
And I have to...

06:08.000 --> 06:14.000
This canonical extension type, the fixed-shape one, is implemented in Arrow C++,

06:14.000 --> 06:17.000
and it has bindings in Python.

06:17.000 --> 06:20.000
So you can check it out and play with it.

06:20.000 --> 06:24.000
I thought it would be nice to have an example in Python because it's nice to visualize.

06:24.000 --> 06:29.000
Here we import PyArrow and define which type we want.

06:29.000 --> 06:32.000
This is the extension type that's already in PyArrow.

06:32.000 --> 06:40.000
To use it, you have to tell it which data type you need, and you tell it the shape of an individual tensor.

06:40.000 --> 06:44.000
Then you give it the data, from wherever you get it.

06:44.000 --> 06:49.000
And you define the storage type, which is, as we saw, the fixed-size list.

06:49.000 --> 06:57.000
It has to have the same data type, and the length of the list has to match.

06:57.000 --> 07:05.000
Then you create the extension array out of this with this method.

07:05.000 --> 07:11.000
So you go from the storage, which is the list created out of the data,

07:11.000 --> 07:15.000
and then you give it the tensor type you need.

07:15.000 --> 07:16.000
Thank you.

07:16.000 --> 07:20.000
I had to trim it a little bit so it would be visible.
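NOTE
A minimal sketch of the flow just described, assuming PyArrow with the
fixed-shape tensor bindings; the names `values`, `storage`, and `arr` and the
sample data are ours, not from the slides.

    import numpy as np
    import pyarrow as pa

    # The extension type: float32 tensors, each of shape 2x2.
    tensor_type = pa.fixed_shape_tensor(pa.float32(), [2, 2])

    # The storage: a fixed-size list array whose value type matches the
    # tensor type and whose list length (4) matches the shape product.
    values = pa.array(np.arange(16, dtype=np.float32))
    storage = pa.FixedSizeListArray.from_arrays(values, 4)

    # Create the extension array from the storage and the tensor type.
    arr = pa.ExtensionArray.from_storage(tensor_type, storage)

    # Round-trip through NumPy: 4 tensors of shape 2x2 -> ndarray (4, 2, 2).
    ndarray = arr.to_numpy_ndarray()
    back = pa.FixedShapeTensorArray.from_numpy_ndarray(ndarray)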
07:20.000 --> 07:24.000
This is how the object looks in PyArrow if you print it out.

07:24.000 --> 07:32.000
So there's a list, and each element of the list is one element of the array, of the tensor type.

07:32.000 --> 07:39.000
Now if you go to NumPy, you can see it's an ndarray with the shape (4, 2, 2).

07:39.000 --> 07:47.000
And if you go back, the 4 in the shape is the length of the array.

07:47.000 --> 07:51.000
Okay, is that clear?

07:51.000 --> 07:55.000
Yeah, so the first dimension is the length of the array.

07:55.000 --> 08:06.000
And then you have the individual elements, which are the individual rows, the tensors in the PyArrow array, when you go to NumPy.

08:06.000 --> 08:12.000
Yeah, so these are the individual tensors in the PyArrow array.

08:12.000 --> 08:17.000
Okay, and then you can also have a NumPy ndarray and go back.

08:17.000 --> 08:20.000
So you go forward.

08:20.000 --> 08:23.000
Another one, another one.

08:23.000 --> 08:24.000
Yes.

08:24.000 --> 08:31.000
So you could also have a NumPy array and go back to the PyArrow tensor array.

08:31.000 --> 08:34.000
Okay, so you can go forth and back between them.

08:34.000 --> 08:37.000
Okay, so that's it for the example.

08:37.000 --> 08:39.000
I hope that was useful.

08:40.000 --> 08:47.000
I would like to take a minute or so for DLPack. DLPack is a protocol

08:47.000 --> 08:55.000
that enables interchange between Python libraries that have arrays, the array libraries or tensor libraries.

08:55.000 --> 08:58.000
It is device-aware,

08:58.000 --> 09:01.000
so data can live on CPU or GPU,

09:01.000 --> 09:08.000
it's aware of that, and it's meant to be a zero-copy interchange.

09:08.000 --> 09:19.000
We would like to have this nicely implemented in Arrow and PyArrow. For now it's implemented in PyArrow for Arrow arrays,

09:19.000 --> 09:24.000
only as a producer; we would like to have the consumption side as well.

09:24.000 --> 09:34.000
And we would like to connect the extension tensor arrays with these methods to have seamless interchange with other Python libraries that use tensors,

09:34.000 --> 09:38.000
for example, CuPy, NumPy, PyTorch, etc.

09:38.000 --> 09:40.000
Okay, so for now it's implemented for arrays.

09:40.000 --> 09:45.000
You can go from a PyArrow array to any of those.

09:45.000 --> 09:48.000
So consumption, you can...

09:48.000 --> 09:51.000
Sorry, to use it you could go into any of those,

09:51.000 --> 09:57.000
but not go back, and it's not implemented for the extension arrays yet, but that's what we would like to do.

09:57.000 --> 10:02.000
If there's any wish for that, give it a thumbs up, that would be awesome.

10:02.000 --> 10:05.000
Thank you.
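NOTE
A minimal sketch of the DLPack producer path described above, assuming
PyArrow >= 14 (which added __dlpack__ on CPU arrays) and NumPy >= 1.22
(which added numpy.from_dlpack); extension tensor arrays are not covered yet.

    import numpy as np
    import pyarrow as pa

    arr = pa.array([1, 2, 3], type=pa.int64())

    # NumPy consumes the Arrow array through the DLPack protocol,
    # zero-copy; the reverse direction (Arrow as a DLPack consumer)
    # is the part that is still future work.
    np_view = np.from_dlpack(arr)
    print(np_view)  # [1 2 3]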