Please welcome our next speaker: Aliaksandr Valialkin, CTO at VictoriaMetrics, and he is going to speak about working with file systems in time series databases. Welcome, please.

Thank you for the introduction, let's start.

Let's meet: I'm Aliaksandr Valialkin, and I'm also known by my nickname, valyala, which you can Google to learn more about me. I'm fond of writing high-performance code optimized for low resource usage, and I have created a few time series databases optimized for performance and low resource usage. These databases are, of course, open source: VictoriaMetrics, the database for metrics, and VictoriaLogs, the database for logs. As I said before, these databases are fast and cost-efficient, and no, they are not written in Rust — they are written in Go.

Let's talk about the specifics of time series databases. These databases contain time series data, which is a series of samples, and each sample contains a timestamp and a value. The value can be numeric — in this case the database contains metrics — or the value can be an arbitrary blob, plain text or some structured blob, and in this case such time series are named logs or events. Every time series can have an arbitrary set of key-value labels, and these labels are usually used to simplify filtering and grouping. The number of time series in a typical database ranges from a few thousand to billions.

The typical data ingestion rate for time series databases is up to 10 million samples (rows) per second, so you can see that such databases must be optimized for high data ingestion performance. Typical queries in time series databases usually need to scan tens or hundreds of millions of samples, so obviously you also need high query performance. The total number of samples in a time series database is measured not in millions or billions, but in trillions, and the size of the stored data is measured in terabytes and petabytes — so obviously such sizes don't fit in memory.

How do you achieve high performance for such databases? Database performance is limited by disk IO if the data doesn't fit in memory, because you need to persist this data to disk. So if you want to achieve high performance for the database, you need to reduce the number of disk operations and reduce the amount of data read from and written to disk.

Let's talk about how to achieve a high data ingestion rate. Any ideas? Just reduce the fsync frequency. Why? Let's talk about why.
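For reference, here is a minimal Go sketch of what a write followed by fsync looks like (an illustration only, not code from the talk or from VictoriaMetrics; the file name is made up):

    package main

    import (
    	"log"
    	"os"
    )

    func main() {
    	// Open (or create) a data file for appending.
    	f, err := os.OpenFile("samples.bin", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	// Write() only hands the bytes to the OS page cache;
    	// they are not guaranteed to be on disk yet.
    	if _, err := f.Write([]byte("timestamp=1700000000 value=42\n")); err != nil {
    		log.Fatal(err)
    	}

    	// Sync() issues fsync, forcing the cached data for this file onto
    	// persistent storage. Only after this returns is the data safe
    	// against a power-off.
    	if err := f.Sync(); err != nil {
    		log.Fatal(err)
    	}
    }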
When you store data into files, this data isn't actually stored in the files right away — it is kept in memory, in the OS page cache. The operating system writes it to the files on disk in the background, whenever it decides to, and the order in which this data reaches the disk is undefined: it doesn't match the order in which you wrote the data to the files. So when you unplug your computer from the power, the page cache may be only partially written to disk, which means you easily get data loss or data corruption. That's why you need the fsync system call, which forces all the data written to the page cache for a file to be stored on disk, and that's why all production-grade databases must use fsync, with few exceptions.

Now, why do you need to reduce the fsync frequency? As I said before, because fsync is slow — and it is slow not only on HDDs, but on SSDs too. Why? Because when you write even a single byte to an SSD and sync it, under the hood the SSD writes a much larger block — up to four megabytes or even more — called an erase block. That's just how SSD hardware works: it cannot persist a single byte in isolation. This is why you should expect only thousands of fsync calls per second, even on SSDs. There are exceptions, such as enterprise-grade SSDs, which are more expensive: they have special capacitors which let the drive persist buffered data that hasn't been written yet when the power goes off. Regular SSDs have no such capacitors.

There is also a frequently mentioned technique which bypasses the page cache and writes data directly to disk — direct IO. But it doesn't help much, because in reality direct IO just replaces fsync: you simulate fsync yourself, so it takes the same amount of time and resources. Additionally, direct IO doesn't work well on many popular file systems; it has bugs and various issues.

So how do you reduce the fsync frequency in a database? The first approach is to store data in the database in big batches. When you store data in big batches, the client expects a success response only when the whole batch is persisted, so you need to issue only one fsync per batch. Big batches, a small number of fsyncs — and you achieve higher data ingestion performance.

Another approach, commonly used in databases, is to merge concurrent inserts into a single data block and store it to disk with a single fsync. For example, you can run fsync every 10 milliseconds — 100 fsyncs per second — so every insert can be delayed by up to 10 milliseconds while data from concurrent inserts is collected, and then all of it is stored with a single fsync.
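To make that "merge concurrent inserts, one fsync per interval" idea concrete, here is a simplified Go sketch of my own (type and field names are made up; a real database would also handle shutdown, partial writes and backpressure):

    package main

    import (
    	"log"
    	"os"
    	"sync"
    	"time"
    )

    // commitLog groups concurrent inserts and persists them with a single
    // fsync per flush interval (a simplified group-commit illustration).
    type commitLog struct {
    	mu      sync.Mutex
    	pending [][]byte     // inserts collected since the last flush
    	waiters []chan error // one result channel per pending insert
    	f       *os.File
    }

    // Insert queues data and blocks until the next flush has fsynced it,
    // so a single client is delayed by up to one flush interval.
    func (c *commitLog) Insert(data []byte) error {
    	done := make(chan error, 1)
    	c.mu.Lock()
    	c.pending = append(c.pending, data)
    	c.waiters = append(c.waiters, done)
    	c.mu.Unlock()
    	return <-done
    }

    // run writes the collected inserts and issues one fsync per interval.
    func (c *commitLog) run(interval time.Duration) {
    	for range time.Tick(interval) {
    		c.mu.Lock()
    		pending, waiters := c.pending, c.waiters
    		c.pending, c.waiters = nil, nil
    		c.mu.Unlock()
    		if len(pending) == 0 {
    			continue
    		}
    		var err error
    		for _, data := range pending {
    			if _, werr := c.f.Write(data); werr != nil && err == nil {
    				err = werr
    			}
    		}
    		// A single fsync covers the whole merged batch.
    		if serr := c.f.Sync(); serr != nil && err == nil {
    			err = serr
    		}
    		for _, done := range waiters {
    			done <- err
    		}
    	}
    }

    func main() {
    	f, err := os.OpenFile("data.bin", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	c := &commitLog{f: f}
    	go c.run(10 * time.Millisecond) // at most 100 fsyncs per second

    	if err := c.Insert([]byte("sample\n")); err != nil {
    		log.Fatal(err)
    	}
    }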
The disadvantage of this method is that it is limited to 100 sequential inserts per second from a single client. But of course you can run thousands of concurrent clients and achieve much higher ingestion performance.

Another popular approach is to use a write-ahead log (WAL). How it works: you store the ingested data in the WAL and fsync it, and after that you store the data in the other data structures in the background, without fsync. If a power-off occurs, you replay the WAL on restart and don't lose any data. Many databases use this technique, but it doesn't reduce the number of fsyncs much, because you still need an fsync after every write to the WAL in order to survive a power-off. Databases use various tricks to reduce fsyncs on the WAL: for example, they buffer data in RAM and then periodically write it to the WAL with fsync, or they write data to the WAL without fsync and periodically run fsync on the WAL. Both cases are bad, because they can lead to data loss on power-off — and even worse, they can lead to data corruption if the power-off happens in the middle of an fsync. In that case you don't know which parts of the data are persisted and which are not, and you need to protect against this with checksums.

So there is another approach: just admit that recently ingested data can be lost on power-off. There are practical cases when this is OK — for example, when you work with observability data such as metrics, logs and events. If you admit this, you can buffer recently ingested data in memory and periodically flush it to disk with a single fsync at the end. Data that has been fsynced to disk is safe from loss and corruption on power-off; only the recently ingested data can be lost.

Now let's talk about the querying path in databases. Databases give you the ability to quickly find the needed data by a given key or key prefix, and they use indexes for this. Indexes are data structures which map keys to the locations of the data on disk, and they usually provide O(log n) lookup complexity, where n is the number of entries in the index.

The most frequently used index type in databases is based on B-trees. A B-tree is a tree data structure where every node contains up to thousands of entries; nodes are usually aligned in size — from 4 kilobytes to 64 kilobytes — and stored to disk as a whole. When you add a new entry to a B-tree, it usually requires updating multiple nodes along the path to the leaf where the entry is added. And B-tree entries usually contain not the row data itself, but pointers to the data, which is stored somewhere else, in other files.
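A rough Go sketch of that last point — B-tree entries holding pointers (file offsets) into separate data files rather than the rows themselves. This is an illustration of mine, not the on-disk layout of any particular database, and all names are made up:

    package main

    import "fmt"

    // entry points at row data stored in a separate data file instead of
    // embedding the row itself in the index.
    type entry struct {
    	Key        string
    	DataFileID uint32 // which data file holds the row
    	Offset     uint64 // byte offset of the row inside that file
    	Length     uint32 // row length in bytes
    }

    // node is a single B-tree node. On disk it would be serialized into one
    // aligned block (roughly 4 KiB - 64 KiB) and written as a whole.
    type node struct {
    	IsLeaf   bool
    	Keys     []string // separator keys for inner nodes
    	Children []uint64 // on-disk block numbers of child nodes (inner nodes only)
    	Entries  []entry  // key -> data-pointer entries (leaf nodes only)
    }

    func main() {
    	leaf := node{
    		IsLeaf: true,
    		Entries: []entry{
    			{Key: `cpu_usage{host="a"}`, DataFileID: 7, Offset: 4096, Length: 512},
    		},
    	}
    	fmt.Printf("%+v\n", leaf)
    	// Adding one entry means rewriting this whole block, plus every parent
    	// block along the path to the root - which is why per-entry inserts
    	// into a B-tree cost more disk IO than batched LSM writes.
    }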
Another popular index type is based on LSM trees (log-structured merge trees). This data structure consists of files; every file contains sorted data, and such files are periodically merged into bigger files in the background. This means the files always contain sorted entries, and they contain the values themselves instead of pointers to values stored somewhere else.

Let's look at some practical cases. The first case is a high data ingestion rate. When you use a B-tree, you need to update multiple B-tree nodes per ingested entry, which in the worst case means separate disk writes per each ingested entry; and since B-trees rely on a WAL, you also need additional disk bandwidth for the WAL. When you use LSM trees, you just buffer many entries — thousands of entries — in memory and store them to disk with one fsync, you don't need any WAL, and you save disk IO.

The second case is reading rows in index order, which is needed frequently. In the B-tree case you have to read data from many different locations on disk, because, as I said, a B-tree usually doesn't contain the row data itself — it contains pointers, and the data can be scattered across the disk. In LSM trees, as you remember, the data itself is stored in sorted order, so you just read it sequentially, with a much smaller number of disk operations.

The next practical case is ingesting big data volumes which do not fit in memory. In the B-tree case this significantly increases the number of disk operations: if some B-tree nodes are missing from memory, you need to read them into memory, and if there is not enough free memory for all the B-tree nodes, you need to evict some of them and write them to disk. So both disk read and write operations increase significantly. With LSM trees, the rate of these read and write operations doesn't depend on the size of the database.

Another case is snapshots. In the B-tree case you need complicated schemes based on MVCC and transaction IDs, which take a lot of time and effort. With LSM trees, snapshots are instant for any database size. Why? Because LSM files are immutable, so you just make hard links to all the LSM files and you have a snapshot — instant for any amount of data.

Another case is network-attached storage, such as NFS or object storage. In the B-tree case it doesn't work at all, because you need to update files in place, and object storage and network file systems are bad at updating files. With LSM trees it works perfectly, because the files are immutable: you just write new files and delete already-merged files, and never update them.
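Because the files are immutable, the instant snapshots mentioned above come down to hard-linking the existing files. Here is a minimal Go sketch of the idea (my own illustration; the directory layout and names are hypothetical):

    package main

    import (
    	"log"
    	"os"
    	"path/filepath"
    )

    // snapshot hard-links every immutable LSM part file from dataDir into
    // snapshotDir. No data is copied, so this is near-instant for any data size.
    func snapshot(dataDir, snapshotDir string) error {
    	if err := os.MkdirAll(snapshotDir, 0o755); err != nil {
    		return err
    	}
    	entries, err := os.ReadDir(dataDir)
    	if err != nil {
    		return err
    	}
    	for _, e := range entries {
    		if e.IsDir() {
    			continue
    		}
    		src := filepath.Join(dataDir, e.Name())
    		dst := filepath.Join(snapshotDir, e.Name())
    		// A hard link shares the same on-disk blocks as the original
    		// file; since the files are never modified in place, the
    		// snapshot stays consistent even as background merges create
    		// new files and delete old ones.
    		if err := os.Link(src, dst); err != nil {
    			return err
    		}
    	}
    	return nil
    }

    func main() {
    	if err := snapshot("data/parts", "data/snapshots/2024-01-01"); err != nil {
    		log.Fatal(err)
    	}
    }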
As for disk space usage: a B-tree index usually isn't very compressible, and the row values don't compress well either, because they are not stored in a compression-friendly order, and this means bigger disk space usage. In LSM trees the data compresses much better, because it is sorted and sorted data compresses better, so you get lower disk space usage. And of course you get lower disk bandwidth usage with LSM trees compared to B-trees, which means higher query performance for queries that need to read big amounts of data, because you read smaller amounts of data from disk, unpack them, and then process them.

Let's talk about row-oriented versus column-oriented storage. With row-oriented storage you store rows one after another. It works great for SELECT * style queries where you select all the row fields, but such queries are rarely used in practice, and row-oriented storage has a very big overhead for typical production queries, which select some subset of fields and perform filtering on a subset of fields. It also results in low compression rates. With column-oriented storage you store the data of each column separately, in separate files. This gives you a big overhead for SELECT * queries, but those are not used frequently in production, and it has much smaller overhead for typical production queries, because it reads data only for the columns mentioned in SELECT and WHERE and doesn't need to read anything else. Per-column data also compresses much better than per-row data, because the values of a single column usually contain constants or a small number of unique values. This means faster performance and less disk space.

Let's skip the rest, because I'm short on time, and go to the conclusions. The first conclusion: fsync is slow, and the best way to reduce the fsync rate is to admit that recently ingested data can be lost. This doesn't apply to typical OLTP databases such as MySQL and Postgres, but it applies well to typical time series databases which store metrics, logs and events — so just admit it. LSM trees use less disk space and less disk IO than B-trees, and column-oriented storage uses less disk space and disk IO than row-oriented storage, even if you include the overhead of background merges. And when you combine LSM trees with column-oriented storage, you get the best performance for big databases which do not fit in memory. Popular databases which use LSM trees and column-oriented storage are ClickHouse, of course, DuckDB, and of course VictoriaMetrics and VictoriaLogs. I recommend you investigate the source code of these databases, because they are all open source. And let me say thank you.