Please welcome our next speaker: Aliaksandr Valialkin, CTO at VictoriaMetrics, and he is going to speak about working with file systems in time series databases. Welcome, please.

Thank you for the introduction, let's start.

Let's meet: I'm Aliaksandr Valialkin, and I'm also known by my nickname, valyala, which you can Google to learn more about me. I'm fond of writing high-performance code optimized for low resource usage, and I have created a few time series databases optimized for performance and low resource usage. These databases are, of course, open source: VictoriaMetrics, the database for metrics, and VictoriaLogs, the database for logs. As I said before, these databases are fast and cost-efficient, and no, they are not written in Rust — they are written in Go.

Let's talk about the specifics of time series databases. These databases contain time series data, which is a series of samples, and each sample contains a timestamp and a value. The value can be numeric — in this case the database contains metrics — or the value can be an arbitrary blob, plain text or some structured blob, and in this case such time series are named logs or events. Every time series can have an arbitrary set of key-value labels, and these labels are usually used to simplify filtering and grouping. The number of time series in a typical database ranges from a few thousand to billions.

The typical data ingestion rate for time series databases is up to 10 million samples (rows) per second, so you can see that such databases must be optimized for high data ingestion performance. Typical queries in time series databases usually need to scan tens or hundreds of millions of samples, so obviously you also need high query performance. The total number of samples in a time series database is measured not in millions or billions, but in trillions, and the size of the stored data is measured in terabytes and petabytes — so obviously such sizes don't fit in memory.

How do you achieve high performance for such databases? Database performance is limited by disk IO if the data doesn't fit in memory, because you need to persist this data to disk. So if you want to achieve high performance for the database, you need to reduce the number of disk operations and reduce the amount of data read from and written to disk.

Let's talk about how to achieve a high data ingestion rate. Any ideas? Just reduce the fsync frequency. Why? Let's talk about why.
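For reference, here is a minimal Go sketch of what a write followed by fsync looks like (an illustration only, not code from the talk or from VictoriaMetrics; the file name is made up):

    package main

    import (
    	"log"
    	"os"
    )

    func main() {
    	// Open (or create) a data file for appending.
    	f, err := os.OpenFile("samples.bin", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	// Write() only hands the bytes to the OS page cache;
    	// they are not guaranteed to be on disk yet.
    	if _, err := f.Write([]byte("timestamp=1700000000 value=42\n")); err != nil {
    		log.Fatal(err)
    	}

    	// Sync() issues fsync, forcing the cached data for this file onto
    	// persistent storage. Only after this returns is the data safe
    	// against a power-off.
    	if err := f.Sync(); err != nil {
    		log.Fatal(err)
    	}
    }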
When you store data into files, this data isn't actually stored in the files right away — it is kept in memory, in the OS page cache. The operating system writes it to the files on disk in the background, whenever it decides to, and the order in which this data reaches the disk is undefined: it doesn't match the order in which you wrote the data to the files. So when you unplug your computer from the power, the page cache may be only partially written to disk, which means you easily get data loss or data corruption. That's why you need the fsync system call, which forces all the data written to the page cache for a file to be stored on disk, and that's why all production-grade databases must use fsync, with few exceptions.

Now, why do you need to reduce the fsync frequency? As I said before, because fsync is slow — and it is slow not only on HDDs, but on SSDs too. Why? Because when you write even a single byte to an SSD and sync it, under the hood the SSD writes a much larger block — up to four megabytes or even more — called an erase block. That's just how SSD hardware works: it cannot persist a single byte in isolation. This is why you should expect only thousands of fsync calls per second, even on SSDs. There are exceptions, such as enterprise-grade SSDs, which are more expensive: they have special capacitors which let the drive persist buffered data that hasn't been written yet when the power goes off. Regular SSDs have no such capacitors.

There is also a frequently mentioned technique which bypasses the page cache and writes data directly to disk — direct IO. But it doesn't help much, because in reality direct IO just replaces fsync: you simulate fsync yourself, so it takes the same amount of time and resources. Additionally, direct IO doesn't work well on many popular file systems; it has bugs and various issues.

So how do you reduce the fsync frequency in a database? The first approach is to store data in the database in big batches. When you store data in big batches, the client expects a success response only when the whole batch is persisted, so you need to issue only one fsync per batch. Big batches, a small number of fsyncs — and you achieve higher data ingestion performance.

Another approach, commonly used in databases, is to merge concurrent inserts into a single data block and store it to disk with a single fsync. For example, you can run fsync every 10 milliseconds — 100 fsyncs per second — so every insert can be delayed by up to 10 milliseconds while data from concurrent inserts is collected, and then all of it is stored with a single fsync.
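To make that "merge concurrent inserts, one fsync per interval" idea concrete, here is a simplified Go sketch of my own (type and field names are made up; a real database would also handle shutdown, partial writes and backpressure):

    package main

    import (
    	"log"
    	"os"
    	"sync"
    	"time"
    )

    // commitLog groups concurrent inserts and persists them with a single
    // fsync per flush interval (a simplified group-commit illustration).
    type commitLog struct {
    	mu      sync.Mutex
    	pending [][]byte     // inserts collected since the last flush
    	waiters []chan error // one result channel per pending insert
    	f       *os.File
    }

    // Insert queues data and blocks until the next flush has fsynced it,
    // so a single client is delayed by up to one flush interval.
    func (c *commitLog) Insert(data []byte) error {
    	done := make(chan error, 1)
    	c.mu.Lock()
    	c.pending = append(c.pending, data)
    	c.waiters = append(c.waiters, done)
    	c.mu.Unlock()
    	return <-done
    }

    // run writes the collected inserts and issues one fsync per interval.
    func (c *commitLog) run(interval time.Duration) {
    	for range time.Tick(interval) {
    		c.mu.Lock()
    		pending, waiters := c.pending, c.waiters
    		c.pending, c.waiters = nil, nil
    		c.mu.Unlock()
    		if len(pending) == 0 {
    			continue
    		}
    		var err error
    		for _, data := range pending {
    			if _, werr := c.f.Write(data); werr != nil && err == nil {
    				err = werr
    			}
    		}
    		// A single fsync covers the whole merged batch.
    		if serr := c.f.Sync(); serr != nil && err == nil {
    			err = serr
    		}
    		for _, done := range waiters {
    			done <- err
    		}
    	}
    }

    func main() {
    	f, err := os.OpenFile("data.bin", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	c := &commitLog{f: f}
    	go c.run(10 * time.Millisecond) // at most 100 fsyncs per second

    	if err := c.Insert([]byte("sample\n")); err != nil {
    		log.Fatal(err)
    	}
    }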
The disadvantage of this method is that it is limited to 100 sequential inserts per second from a single client. But of course you can run thousands of concurrent clients and achieve much higher ingestion performance.

Another popular approach is to use a write-ahead log (WAL). How it works: you store the ingested data in the WAL and fsync it, and after that you store the data in the other data structures in the background, without fsync. If a power-off occurs, you replay the WAL on restart and don't lose any data. Many databases use this technique, but it doesn't reduce the number of fsyncs much, because you still need an fsync after every write to the WAL in order to survive a power-off. Databases use various tricks to reduce fsyncs on the WAL: for example, they buffer data in RAM and then periodically write it to the WAL with fsync, or they write data to the WAL without fsync and periodically run fsync on the WAL. Both cases are bad, because they can lead to data loss on power-off — and even worse, they can lead to data corruption if the power-off happens in the middle of an fsync. In that case you don't know which parts of the data are persisted and which are not, and you need to protect against this with checksums.

So there is another approach: just admit that recently ingested data can be lost on power-off. There are practical cases when this is OK — for example, when you work with observability data such as metrics, logs and events. If you admit this, you can buffer recently ingested data in memory and periodically flush it to disk with a single fsync at the end. Data that has been fsynced to disk is safe from loss and corruption on power-off; only the recently ingested data can be lost.

Now let's talk about the querying path in databases. Databases give you the ability to quickly find the needed data by a given key or key prefix, and they use indexes for this. Indexes are data structures which map keys to the locations of the data on disk, and they usually provide O(log n) lookup complexity, where n is the number of entries in the index.

The most frequently used index type in databases is based on B-trees. A B-tree is a tree data structure where every node contains up to thousands of entries; nodes are usually aligned in size — from 4 kilobytes to 64 kilobytes — and stored to disk as a whole. When you add a new entry to a B-tree, it usually requires updating multiple nodes along the path to the leaf where the entry is added. And B-tree entries usually contain not the row data itself, but pointers to the data, which is stored somewhere else, in other files.
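A rough Go sketch of that last point — B-tree entries holding pointers (file offsets) into separate data files rather than the rows themselves. This is an illustration of mine, not the on-disk layout of any particular database, and all names are made up:

    package main

    import "fmt"

    // entry points at row data stored in a separate data file instead of
    // embedding the row itself in the index.
    type entry struct {
    	Key        string
    	DataFileID uint32 // which data file holds the row
    	Offset     uint64 // byte offset of the row inside that file
    	Length     uint32 // row length in bytes
    }

    // node is a single B-tree node. On disk it would be serialized into one
    // aligned block (roughly 4 KiB - 64 KiB) and written as a whole.
    type node struct {
    	IsLeaf   bool
    	Keys     []string // separator keys for inner nodes
    	Children []uint64 // on-disk block numbers of child nodes (inner nodes only)
    	Entries  []entry  // key -> data-pointer entries (leaf nodes only)
    }

    func main() {
    	leaf := node{
    		IsLeaf: true,
    		Entries: []entry{
    			{Key: `cpu_usage{host="a"}`, DataFileID: 7, Offset: 4096, Length: 512},
    		},
    	}
    	fmt.Printf("%+v\n", leaf)
    	// Adding one entry means rewriting this whole block, plus every parent
    	// block along the path to the root - which is why per-entry inserts
    	// into a B-tree cost more disk IO than batched LSM writes.
    }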
Another popular index type is based on LSM trees (log-structured merge trees). This data structure consists of files; every file contains sorted data, and such files are periodically merged into bigger files in the background. This means the files always contain sorted entries, and they contain the values themselves instead of pointers to values stored somewhere else.

Let's look at some practical cases. The first case is a high data ingestion rate. When you use a B-tree, you need to update multiple B-tree nodes per ingested entry, which in the worst case means separate disk writes per each ingested entry; and since B-trees rely on a WAL, you also need additional disk bandwidth for the WAL. When you use LSM trees, you just buffer many entries — thousands of entries — in memory and store them to disk with one fsync, you don't need any WAL, and you save disk IO.

The second case is reading rows in index order, which is needed frequently. In the B-tree case you have to read data from many different locations on disk, because, as I said, a B-tree usually doesn't contain the row data itself — it contains pointers, and the data can be scattered across the disk. In LSM trees, as you remember, the data itself is stored in sorted order, so you just read it sequentially, with a much smaller number of disk operations.

The next practical case is ingesting big data volumes which do not fit in memory. In the B-tree case this significantly increases the number of disk operations: if some B-tree nodes are missing from memory, you need to read them into memory, and if there is not enough free memory for all the B-tree nodes, you need to evict some of them and write them to disk. So both disk read and write operations increase significantly. With LSM trees, the rate of these read and write operations doesn't depend on the size of the database.

Another case is snapshots. In the B-tree case you need complicated schemes based on MVCC and transaction IDs, which take a lot of time and effort. With LSM trees, snapshots are instant for any database size. Why? Because LSM files are immutable, so you just make hard links to all the LSM files and you have a snapshot — instant for any amount of data.

Another case is network-attached storage, such as NFS or object storage. In the B-tree case it doesn't work at all, because you need to update files in place, and object storage and network file systems are bad at updating files. With LSM trees it works perfectly, because the files are immutable: you just write new files and delete already-merged files, and never update them.
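Because the files are immutable, the instant snapshots mentioned above come down to hard-linking the existing files. Here is a minimal Go sketch of the idea (my own illustration; the directory layout and names are hypothetical):

    package main

    import (
    	"log"
    	"os"
    	"path/filepath"
    )

    // snapshot hard-links every immutable LSM part file from dataDir into
    // snapshotDir. No data is copied, so this is near-instant for any data size.
    func snapshot(dataDir, snapshotDir string) error {
    	if err := os.MkdirAll(snapshotDir, 0o755); err != nil {
    		return err
    	}
    	entries, err := os.ReadDir(dataDir)
    	if err != nil {
    		return err
    	}
    	for _, e := range entries {
    		if e.IsDir() {
    			continue
    		}
    		src := filepath.Join(dataDir, e.Name())
    		dst := filepath.Join(snapshotDir, e.Name())
    		// A hard link shares the same on-disk blocks as the original
    		// file; since the files are never modified in place, the
    		// snapshot stays consistent even as background merges create
    		// new files and delete old ones.
    		if err := os.Link(src, dst); err != nil {
    			return err
    		}
    	}
    	return nil
    }

    func main() {
    	if err := snapshot("data/parts", "data/snapshots/2024-01-01"); err != nil {
    		log.Fatal(err)
    	}
    }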
As for disk space usage: a B-tree index usually isn't very compressible, and the row values don't compress well either, because they are not stored in a compression-friendly order, and this means bigger disk space usage. In LSM trees the data compresses much better, because it is sorted and sorted data compresses better, so you get lower disk space usage. And of course you get lower disk bandwidth usage with LSM trees compared to B-trees, which means higher query performance for queries that need to read big amounts of data, because you read smaller amounts of data from disk, unpack them, and then process them.

Let's talk about row-oriented versus column-oriented storage. With row-oriented storage you store rows one after another. It works great for SELECT * style queries where you select all the row fields, but such queries are rarely used in practice, and row-oriented storage has a very big overhead for typical production queries, which select some subset of fields and perform filtering on a subset of fields. It also results in low compression rates. With column-oriented storage you store the data of each column separately, in separate files. This gives you a big overhead for SELECT * queries, but those are not used frequently in production, and it has much smaller overhead for typical production queries, because it reads data only for the columns mentioned in SELECT and WHERE and doesn't need to read anything else. Per-column data also compresses much better than per-row data, because the values of a single column usually contain constants or a small number of unique values. This means faster performance and less disk space.

Let's skip the rest, because I'm short on time, and go to the conclusions. The first conclusion: fsync is slow, and the best way to reduce the fsync rate is to admit that recently ingested data can be lost. This doesn't apply to typical OLTP databases such as MySQL and Postgres, but it applies well to typical time series databases which store metrics, logs and events — so just admit it. LSM trees use less disk space and less disk IO than B-trees, and column-oriented storage uses less disk space and disk IO than row-oriented storage, even if you include the overhead of background merges. And when you combine LSM trees with column-oriented storage, you get the best performance for big databases which do not fit in memory. Popular databases which use LSM trees and column-oriented storage are ClickHouse, of course, DuckDB, and of course VictoriaMetrics and VictoriaLogs. I recommend you investigate the source code of these databases, because they are all open source. And let me say thank you.