All right, everyone. Good morning, still the morning. I want to thank Yaron for setting up the stage today and organizing this Open Media devroom. My name is Romain Beauxis. I'm a software engineer; I've been working in the field for about ten years, a little bit more. Today I'm going to talk about how you bridge the gap between web browser APIs and what you do on the server, in the backend. So that's the plan.

So yeah, over the past 20 years the web has become very, very rich in features, and particularly in media. The browser is also becoming the number one piece of software that people have installed and where they do most of their work. It's getting more and more rare to install third-party applications. So we can do a lot of things in browsers, but not everything can be done. We have a lot of APIs for manipulating media, listed here, but there are some interesting challenges when you want to build a really state-of-the-art, browser-specific experience. What I'm going to bring up today is a set of very specific examples of things I've learned in the past six months, since I started in my current position, which I'm going to explain in the next slides.

My background is that I got into media streaming in college. I was at École Centrale, where VLC was at the time; Jean-Baptiste, who will talk after me, was there too. Then I did some open source contributions with Liquidsoap, and I'm trying to contribute to FFmpeg. All of that was my non-professional career, and the other part of it was the web. I did a lot of web work, building websites, rich applications and other things. Recently I started my current role, where I got hired to work on a web application that is actually also very media-rich, and it was very exciting to be able to bridge those two worlds and see how they can interact. That's where I've drawn all the topics of discussion for this talk.

Descript, the product I work for, is an online video editor. It's basically a web application: you can go there, upload your files, compose them, add text, put on some effects, and export all of that as a video. So it's in the same space as Premiere and similar tools, but it's all web. It started originally as text-based editing: you would upload an audio file, edit the text, and the audio would be cut accordingly. Then it evolved to video and grew a lot of features. In recent years they incorporated an AI research team and started shipping their own effects, things like background removal, or eye contact that follows the camera, and those effects are typically done in the backend.
So you're starting to see that we have a very rich application that does a number of things, and some of it has to be done on the server, some of it has to be done in the client. The other thing I want to highlight, specifically about the web, is that a web application functions very differently from a normal application. One of the things to think about is that you have a whole class of execution contexts. You can have different browsers on different OSes: you can be Chrome on iOS, you can be Firefox on Linux, you can be Safari, or even on a Meta Quest 2. You can also have the same project open in different apps or different browsers. And you may want to collaborate: say I'm using Chrome on Linux but I'm sharing the project with my colleague who has Safari on a Mac, and I want both of us to have a good user experience in the browser. That's where the real complexity comes in. So that's the full-stack media editor, and the question is how you make that work in the context of browsers.

I'm going to focus on a small subset of the features, because there are a lot of features: how you ingest media, display it to the user, and export it at the end. So the first thing we're going to look at is the capabilities a browser has for ingesting media. Keep in mind that this is a really user-oriented application. Most of the users are not software developers, they don't know a lot about media, they might upload very large files in very exotic formats, and they expect everything to work very quickly. We have all these APIs that can be used: WebCodecs, the Web Audio API, WebGL for composition. But how do you maintain a consistent user experience across different browsers, platforms and collaborative sessions?

Here is one typical example: codec support is very different across browsers, and we want to support all of it. We really want our users to be able to upload their Matroska files, or their MOV files, or their AAC, Opus, whatever they have; we're going to support it in the browser. Unfortunately, the range of things that a browser supports is extremely unpredictable. Chrome on macOS may support things it doesn't support on Linux. So we're going to receive all these files and have to decide what to do with them. By contrast, backend tools like FFmpeg can process pretty much anything if you configure them correctly. On top of that, the set of codecs that we can decode in hardware through WebCodecs is even more limited than what the browser itself might support; some are supported in WebKit but not elsewhere. So it's just a lot.
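In practice, the only reliable way to know what a given browser build can decode is to ask it at runtime. Here is a minimal sketch of such a probe using the WebCodecs VideoDecoder.isConfigSupported() call; the candidate list, the hardware preference, and the fallback decision are illustrative choices of mine, not something taken from the talk:

```typescript
// Minimal sketch: probe what this particular browser/OS build can decode.
// VideoDecoder.isConfigSupported() is standard WebCodecs; the candidate
// list and the result handling are illustrative assumptions.

const candidates: VideoDecoderConfig[] = [
  { codec: "avc1.42E01E" },     // H.264 Baseline
  { codec: "hev1.1.6.L93.B0" }, // HEVC
  { codec: "vp09.00.10.08" },   // VP9
  { codec: "av01.0.04M.08" },   // AV1
];

async function probeDecoders(): Promise<Record<string, boolean>> {
  const table: Record<string, boolean> = {};
  for (const config of candidates) {
    // Resolves with { supported, config }. Asking for hardware explicitly
    // matters, since hardware support is even narrower than software.
    const { supported } = await VideoDecoder.isConfigSupported({
      ...config,
      hardwareAcceleration: "prefer-hardware",
    });
    table[config.codec] = supported ?? false;
  }
  return table;
}

// Files whose codec probes false get routed to the backend instead.
probeDecoders().then((t) => console.log(t));
```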
But what we really want to do is take advantage of the native decoding APIs when they exist, so that we can at least leverage the parts of the web API that are relevant, and then fall back to the backend when we can, or when we need to.

And it's not just codecs, it's also containers. Remember, there is the way you encode the data, and then there is the way you store it in those big boxes: Matroska, MOV, a lot of them. Same thing, they're not all supported. And worse than that, there is actually no support for demuxing in the browser at all. So let's say you receive a Matroska file, which is not supported, or an MPEG-TS, which is not supported, but it contains H.264, which you can decode natively in the browser. You still need to be able to access those packets of data, send them to the browser to decode, and then display the frames. How you do that is exactly the kind of question we're going to address; it's one example of how you can bridge the gap between browser APIs and backend APIs.

So first of all, think about the WebCodecs audio decoding and video decoding APIs. Those are native to the browser, and they consume ArrayBuffers, which are basically chunks of memory, that you can decode natively. The same exists for video; that's the one shown here, and that's what we're going to use. But what we want is to access that chunk of data stored inside a Matroska file when we don't have native support for the container. We'll see that in a minute. So that's one of the problems: your users start sending a lot of files into your application in very exotic formats. Some of them we could decode natively, but they come in containers that are not natively supported by the web APIs. How do you bridge that gap? That's the example we're going to look at next. The solution, as you can see, is to leverage wasm compilations of technologies that are essentially backend technologies, and lift them into the browser, so you can bridge the gap between what the web offers and what you typically have in the backend, which is FFmpeg and its demuxing capabilities, except this time they run in the client.

So that was a presentation of the kinds of technologies available in the browser. Oh yes, sorry, I forgot: the last thing we need to do, and it's also important, is rendering. You have your videos, you're composing them, adding different layers on top, you want some text; for that we're going to leverage WebGL composition. I might go a little bit faster on that because of time limitations, but that's the last part of the application.
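To make that demux-in-wasm, decode-natively split concrete, here is a hedged sketch. DemuxedPacket and the packet source stand in for a hypothetical wrapper around an FFmpeg wasm build; VideoDecoder and EncodedVideoChunk are the actual WebCodecs APIs the talk refers to:

```typescript
// Sketch of the pattern: FFmpeg (compiled to wasm) extracts compressed
// packets from an unsupported container; the browser decodes them natively.

interface DemuxedPacket {
  data: Uint8Array;    // compressed payload extracted by the wasm demuxer
  timestampUs: number; // presentation timestamp in microseconds
  isKeyframe: boolean;
}

function decodePackets(
  packets: Iterable<DemuxedPacket>,
  onFrame: (frame: VideoFrame) => void,
): Promise<void> {
  const decoder = new VideoDecoder({
    output: (frame) => {
      onFrame(frame); // e.g. draw to a canvas
      frame.close();  // frames hold decoder memory; release promptly
    },
    error: (e) => console.error("decode error", e),
  });

  // The codec string comes from the demuxed stream parameters; avcC-packaged
  // H.264 tracks would also need a `description` in this config.
  decoder.configure({ codec: "avc1.42E01E" });

  for (const pkt of packets) {
    decoder.decode(
      new EncodedVideoChunk({
        type: pkt.isKeyframe ? "key" : "delta",
        timestamp: pkt.timestampUs,
        data: pkt.data,
      }),
    );
  }
  return decoder.flush(); // resolves once every frame has been emitted
}
```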
And this is also challenging, particularly because, remember, we want to add text, a perfect example of something you would think is simple and that actually turns out to be pretty hard, because when you add text, you get things like this. Say we want a collaborative session: I'm working on macOS and I'm adding emoji, and my browser renders them with the local system fonts. But my colleague is on Android, or on Windows, with different emoji fonts, and it starts looking different. How do you handle that too?

So I'm going to run through some examples of how we did this. It's going to have to be a little quick because of the time limit, but I hope it will be insightful; at least I found it interesting when I was working on it.

The first challenge is: how do we seek when the user uploads a file? Let's say I'm uploading a file that comes from a screen recording or a smartphone video. These files can have a wide range of codec parameters, and one of them is a very unpredictable keyframe placement. That's what I mean here: you have keyframes spaced at very different intervals, and you want to be able to seek very fast. The user expectation in the application is: I'm editing my video, I want to jump to 4.8 seconds, then 5.2 seconds, and I want the application to react very quickly. But how do you do that in the best possible way? What you really have to do is know exactly where the keyframes are in your media. If you want to go to 5.2, but the keyframe to the left is at 4.8, you need to know that that is your keyframe, seek right to it, and start decoding immediately from there, to give yourself the best chance of being as fast as possible in your seeking. The browser cannot do that by itself, because the browser doesn't have demuxing capabilities, as I was saying. So how do you do that without the native API? It's a good example.

What you do is lift up a wasm build of a backend library, in this case FFmpeg. You open your file with FFmpeg and you scan all the keyframes inside the file, so you know exactly where each keyframe is, and then you seek with FFmpeg. You move through the packets and you get exactly to the frame you want, and that's the point where you hand over to the browser. So, first step, you extract the timings. Then you seek to that keyframe at 4.8. Second, you read packets until you get to your frame, and then you take over with the native browser API, render it, and eventually display the frame on the screen. That's the last step you can see here: you have this packet available, you copy it back into your JavaScript memory, pass that packet data to an encoded video chunk, and decode it natively.
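The index lookup itself is simple bookkeeping once the wasm demuxer has produced the timestamps. A small sketch, assuming the scanned keyframe timestamps are in microseconds and sorted ascending:

```typescript
// Sketch of the keyframe-index seek: scan once at ingest, then for each
// seek pick the last keyframe at or before the target and decode forward.

function nearestKeyframe(keyframes: number[], targetUs: number): number {
  // Binary search for the last keyframe <= target.
  let lo = 0;
  let hi = keyframes.length - 1;
  let best = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (keyframes[mid] <= targetUs) {
      best = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return keyframes[best];
}

// Example from the talk: keyframes at 0s, 4.8s, 9.6s; the user seeks to 5.2s.
const kf = nearestKeyframe([0, 4_800_000, 9_600_000], 5_200_000);
// kf === 4_800_000: seek the demuxer there, then decode and discard frames
// until the decoder emits the frame at 5.2s.
```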
That way you bridge the gap between what is available in the browser and what is typically available in the backend.

Another example, and I'm probably only going to have time for this one, is when browser processing is not enough and you want to upload to a backend infrastructure. As I was saying, users throw a lot of media at us. Some of it we can decode and give the kind of experience I just described; some of it we cannot, because it's a codec that's not supported, for many possible reasons. In that case, what we want to do is send the file to a server and immediately get back transcoded media that is usable. Typically it will have a high keyframe rate; we will also shrink the resolution, because if you're editing on screen you don't need a 4K video; and that's also when we can apply all the server-side effects: background removal, eye contact, and so on.

So the second piece of technology is this scenario, which adds a lot of complexity to the application. It's a user application that we want to be friendly, so we really want things to work as quickly as possible. The way the scenario works is: you drop your files into the web application, and we start displaying them using the native capabilities of your browser if possible, as we just described for seeking. Meanwhile, we start uploading in the background. When the file is fully uploaded, we are ready to hit the backend server, and we switch immediately to chunks of media that have been transcoded on demand, just for you, in a codec that's supported and in a container that's supported, and then your experience gets better. But that brings a lot of complexity, because it means you have a first, somewhat degraded experience in the application while you are uploading, where we do what we can, and then we switch to what we can fully do once your file is uploaded. So that's the second piece: if you can't really bridge the gap in the browser, you bridge it with a backend service that kicks in as early as it can. That's what we call the media transform server.

That server is implemented as a tiny FFI layer on top of the FFmpeg libraries. It produces MP4 segments, dynamically generated, because there are a lot of parameters that change as you edit your video, but we are also experimenting with pre-segmenting for caching. So that's the second piece of the puzzle.
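As a rough sketch of that degraded-first, switch-when-ready flow; the /upload endpoint, the segment manifest URL, and the playFromLocalFile / playFromSegments helpers are all hypothetical names, not part of the actual product:

```typescript
// Rough sketch of the upload-then-switch flow; every endpoint and helper
// name here is hypothetical.

declare function playFromLocalFile(file: File): Promise<{
  currentTimeUs: number;
  dispose(): void;
}>;
declare function playFromSegments(
  manifest: unknown,
  resumeAtUs: number,
): Promise<void>;

async function openMedia(file: File): Promise<void> {
  // 1. Immediate, possibly degraded experience from the local bytes,
  //    using the in-browser wasm demux + WebCodecs path where it can.
  const localSession = await playFromLocalFile(file);

  // 2. Upload in the background.
  const uploadRes = await fetch("/upload", { method: "POST", body: file });
  const { assetId } = await uploadRes.json();

  // 3. Once the media transform server can serve on-demand MP4 segments
  //    (supported codec, dense keyframes, editing-friendly resolution),
  //    swap the player over and drop the local session.
  const manifest = await fetch(`/assets/${assetId}/segments.json`).then(
    (r) => r.json(),
  );
  await playFromSegments(manifest, localSession.currentTimeUs);
  localSession.dispose();
}
```

The interesting constraint is the handover: the local session keeps the editor responsive until the transcoded segments exist, and the switch resumes at the same timestamp.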
And I have one minute to talk about the last piece, which is more of an open question. Remember, I said we use WebGPU or WebGL to do the final composition of the canvas into a rendered frame, with text and everything. One thing that is still a challenge is how you do that composition in a way where, remember, it is described in the browser but can be rendered in the backend if possible, because in the backend we may get higher quality, and we can do the rendering while the client is doing something else. But the client has been describing their video in the browser, so when you start rendering it on the server you have different fonts, and you also don't have the web APIs. So one of the challenges we're thinking about at the moment is how to translate that, and one of the ideas we have is to use the wgpu stack, a stack natively implemented in Rust that supports the WebGPU API and targets native Vulkan on Linux, but also WebGPU on the web. So the kind of challenge we're thinking about for the future is: can we rewrite the rendering so that the composition is described on this stack, in a way that can be rendered both in the browser and in the backend, consistently, so it looks the same for everyone?

Thank you. There was a lot to present, so I appreciate your attention.

[Q&A]

Q: Which FFI do you use to interface with FFmpeg?

A: There is a Node module called ffi-rs. It's a tiny Rust layer on top of the FFmpeg libraries that provides a TypeScript-safe way to interact with FFmpeg.

Q: [inaudible]

A: Yes, the binding itself is native, but all the rest of the stack is TypeScript, so we plug right into that.

All right, if there are no more questions, we'll yield the floor. Thank you guys again.