Questions this chapter addresses:
- How can we store the data that we're given?
- How can we find it again when we're asked for it?
- How do storage engines optimized for transactional workloads (OLTP) differ from those optimized for analytics (OLAP)?
- How do log-structured storage engines (e.g. LSM-Trees) differ from page-oriented storage engines (e.g. B-Trees)?
- What types of indexes are there, and how do they work?
OLTP (Online Transaction Processing) Overview
A transaction needn't necessarily have ACID (atomicity, consistency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day).
Storage:
Sequential writes are much simpler and more efficient than random writes, so many databases internally use a log (an append-only sequence of records): appending to a data file is the simplest possible write operation.
Retrieval:
OLTP applications are typically user-facing: they look up records by key, result sets are small, and response time matters, so the storage engine uses an index to speed up queries. Disk seek time is often the bottleneck here.
Index:
An index is an additional structure, derived from the primary data, that speeds up queries. Maintaining an index has a cost, especially on writes, because the index has to be updated whenever data is written. Since indexes speed up reads but slow down writes, choosing which indexes to create is a trade-off for the application developer.
Hash Indexes with an append-only log
Storage:
Data is stored on disk as (k, v) pairs, written by appending to the log (append-only).
Retrieval:
An in-memory hash index maps each key to the byte offset of its latest value; the index is updated whenever a record is appended.
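A minimal sketch of this idea in Python (not Bitcask itself; the text format and file name are made up for illustration, and real engines use a binary, length-prefixed format):

```python
import os

class HashIndexLog:
    """Append-only data file plus an in-memory hash index of key -> byte offset."""

    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}                      # key -> offset of the latest value
        open(self.path, "ab").close()        # make sure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:     # append-only write
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset             # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:     # one seek, one read
            f.seek(offset)
            record = f.readline().decode().rstrip("\n")
        return record.split(",", 1)[1]

db = HashIndexLog()
db.put("user:1", "alice")
db.put("user:1", "bob")       # the old value stays in the log; only the index moves
print(db.get("user:1"))       # -> bob
```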
Appending forever would eventually run out of disk space. To deal with this:
- Break the log into segments of a fixed size: when a segment file reaches that size, subsequent writes go to a new segment file
- Periodically run compaction on segment files (throwing away duplicate keys in the log, and keeping only the most recent update for each key); several segments can also be merged together during compaction (see the sketch after this list)
- Segment files are read-only once written; compaction writes its output to a new file. While compaction is running, the old segment files can still serve reads and writes; when it finishes, reads switch over to the new segment file and the old ones are deleted
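A sketch of what compaction does to one segment's records (illustrative only; records are just (key, value) pairs in write order):

```python
def compact(records):
    """Keep only the most recent value for each key, in a single pass over the log."""
    latest = {}
    for key, value in records:        # later records overwrite earlier ones
        latest[key] = value
    return list(latest.items())       # contents of the new, compacted segment

segment = [("a", 1), ("b", 2), ("a", 3)]
print(compact(segment))               # -> [('a', 3), ('b', 2)]
```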
Points worth noting in this scheme:
- File format: use a binary format that first encodes the length of a string in bytes, followed by the raw string.
- Deleting records: append a special deletion record (a tombstone); when segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
- Crash recovery: Bitcask speeds up recovery by storing a snapshot of each segment's hash map on disk, which can be loaded into memory more quickly.
- Partially written records: include checksums, allowing corrupted parts of the log to be detected and ignored.
- Concurrency control: there is a single writer thread; a data file is either append-only or immutable, so it can be read concurrently by multiple threads.
Advantages:
- Appending and segment merging are sequential write operations, which are much faster than random writes
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable
- Merging old segments avoids the problem of data files getting fragmented over time.
Limitations:
- The hash table must fit in memory; if it grows too large and has to be kept on disk, lookups become much slower, and hash collisions also have to be handled
- Range queries are not efficient.
SSTables (Sorted String Tables) and LSM-Trees
How can SSTables address these problems with hash indexes?
Storage:
In a hash index, the (k, v) pairs are stored in write order, and the same key may appear in several segment files. In an SSTable, we require that the sequence of key-value pairs is sorted by key, and that each key appears only once within each merged segment file (compaction already ensures this).
Retrieval:
Because keys are stored in sorted order, one key in the (sparse) in-memory index can serve as the upper/lower bound of a range; you locate the range that would contain the key and then scan just that range (see the sketch after the figure below). A read checks the in-memory memtable first, then the on-disk SSTables from newest to oldest.
![](https://miro.medium.com/max/4000/0*6zJnBf20REcr4Uet.png)
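A sketch of the range-then-scan lookup against a sorted segment, using a sparse index with one entry every few records (the data and block size are made up):

```python
import bisect

# Sorted segment (here just an in-memory list of (key, value) pairs sorted by key)
segment = sorted({"apple": 1, "banana": 2, "cherry": 3, "grape": 4, "melon": 5}.items())
# Sparse index: one (key, position) entry per "block" of two records
sparse_index = [(segment[i][0], i) for i in range(0, len(segment), 2)]

def lookup(key):
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1              # last indexed key <= search key
    if i < 0:
        return None
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else len(segment)
    for k, v in segment[start:end]:                     # scan only the small range
        if k == key:
            return v
    return None

print(lookup("cherry"))   # -> 3
print(lookup("kiwi"))     # -> None
```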
Advantages:
- Merging segments is simple and efficient, even if the files are bigger than the available memory; it works like merging sorted arrays (see the merge sketch after this list)
- You no longer need to keep an index of all the keys in memory: because the keys are sorted, one key in the index can stand for the upper/lower bound of a range, and a lookup only needs to scan that small range
- It is possible to group records into a block and compress it before writing it to disk, which saves disk space and reduces I/O bandwidth
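A sketch of merging sorted segments in a streaming fashion (segments are passed oldest to newest, so the newest value wins; purely illustrative):

```python
import heapq

def merge_segments(*segments):
    """Merge SSTable segments (each a list of (key, value) pairs sorted by key)."""
    # Tag each pair with its segment number so that, for equal keys,
    # entries from newer segments sort later and overwrite older ones.
    tagged = [((k, i, v) for k, v in seg) for i, seg in enumerate(segments)]
    merged = {}
    for key, _, value in heapq.merge(*tagged):   # streams the inputs in sorted order
        merged[key] = value
    return list(merged.items())                  # the output is again sorted by key

old = [("a", 1), ("c", 3)]
new = [("a", 9), ("b", 2)]
print(merge_segments(old, new))                  # -> [('a', 9), ('b', 2), ('c', 3)]
```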
Constructing and maintaining SSTables
Sorted structure on disk: SSTable segments, maintained by an LSM-tree (Log-Structured Merge-Tree)
Sorted structure in memory: red-black trees or AVL trees
Work flow:
- When a write comes in, add it to an in-memory balanced tree structure called the memtable
- When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file
- In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
- From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
The only problem: if the database crashes, the most recent writes (which are in the memtable but not yet on disk) are lost. To avoid this, every write is also appended to a separate log on disk (plain append-only, not sorted); that log is only used to restore the memtable after a crash, and is discarded whenever the memtable is written out to an SSTable.
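A toy sketch of this workflow (a plain dict stands in for the balanced tree and is only sorted at flush time; flushed segments are kept in a Python list rather than real on-disk SSTable files; names and thresholds are made up):

```python
import json, os

class MiniLSM:
    def __init__(self, threshold=2, wal_path="wal.log"):
        self.memtable = {}
        self.threshold = threshold          # flush when the memtable has this many keys
        self.wal_path = wal_path
        self.sstables = []                  # flushed, sorted segments (newest last)

    def put(self, key, value):
        with open(self.wal_path, "a") as wal:          # append to the recovery log first
            wal.write(json.dumps([key, value]) + "\n")
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self._flush()

    def _flush(self):
        self.sstables.append(sorted(self.memtable.items()))  # write out a sorted run
        self.memtable = {}
        os.remove(self.wal_path)            # the log is no longer needed after the flush

    def get(self, key):
        if key in self.memtable:                        # most recent data first
            return self.memtable[key]
        for segment in reversed(self.sstables):         # then newest SSTable to oldest
            for k, v in segment:
                if k == key:
                    return v
        return None

db = MiniLSM()
db.put("a", 1)
db.put("b", 2)                   # reaches the threshold and flushes to an "SSTable"
db.put("a", 3)
print(db.get("a"), db.get("b"))  # -> 3 2
```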
LSM-tree (Log-Structured Merge-Tree)
Writes are fast and reads are slower: writes are sequential appends, while a read may have to check the memtable and several SSTables.
Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.
The basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective.
Because the data is stored in sorted order, the index supports range queries, which keeps lookups reasonably fast even on very large data sets.
Because disk writes are sequential, LSM-trees can also support very high write throughput.
Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
Performance optimizations
- Looking up a key that does not exist is expensive: you have to check the memtable and every segment before concluding the key is absent. Storage engines therefore often use an additional Bloom filter (a sketch follows at the end of this subsection)
Merge and compaction strategies:
- size-tiered: newer and smaller SSTables are successively merged into older and larger SSTables.
- leveled compaction: the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space
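A Bloom filter is a memory-efficient, probabilistic structure that can say with certainty that a key is absent (it may report false positives but never false negatives), so the engine can skip the disk lookups entirely for most missing keys. A minimal sketch (the bit-array size, number of hashes, and salted-SHA-256 hashing scheme are all illustrative choices):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # derive num_hashes bit positions from salted SHA-256 digests of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:1")
print(bf.might_contain("user:1"))   # -> True
print(bf.might_contain("user:2"))   # -> almost certainly False: no segments need checking
```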
B-Trees
Storage:
Key-value pairs are stored sorted by key, which allows efficient key-value lookups and range queries. B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
Most databases can fit into a B-tree that is three or four levels deep, so you don't need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
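A quick back-of-the-envelope check of that 256 TB figure (a sketch; the exact number depends on how levels and page headers are counted):

```python
branching_factor = 500
levels = 4
page_size = 4 * 1024                                   # 4 KB pages
capacity_bytes = branching_factor ** levels * page_size
print(capacity_bytes / 10**12)                         # -> 256.0 (decimal terabytes)
```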
Making B-trees reliable
The basic write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page.
- Crash recovery: in order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk, a write-ahead log (WAL, also known as a redo log), to which every modification is appended before it is applied to the pages of the tree itself.
- Concurrency control: careful concurrency control is required when multiple threads access the B-tree at the same time. This is typically done by protecting the tree's data structures with latches (lightweight locks).
B-tree optimizations
- Copy-on-write: instead of overwriting pages in place, a modified page is written to a different location, and new versions of the parent pages are created pointing at it
- …
Comparing B-Trees and LSM-Trees
B-trees: slower writes, faster reads; each piece of data exists in exactly one place, which makes it easier to offer strong transactional guarantees.
LSM-trees: faster writes, slower reads; the same key may exist in several segments.
Advantages of LSM-trees
Compared with B-trees, LSM-trees can usually sustain higher write throughput:
- LSM-trees typically have lower write amplification
- they write compact SSTables sequentially
LSM-trees also use disk space more efficiently: they can be compressed better, and thus often produce smaller files on disk than B-trees.
Write amplification (one write to the database resulting in multiple writes to the disk over its lifetime):
- A B-tree index must write every piece of data at least twice: once to the write-ahead log and once to the tree page itself
- Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables
Downsides of LSM-trees (writes trade extra space and background work for speed):
- The compaction process can sometimes interfere with the performance of ongoing reads and writes
- Another issue with compaction arises at high write throughput: the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
Other Indexing Structures
Secondary indexes
Storing values within the index
The actual rows can be stored in a heap file, with the index holding a reference into the heap file; this avoids duplicating data when multiple indexes exist. If a new value fits in the old spot, it can overwrite the old one in place and the index does not need to change; if the value has to be moved to a new location in the heap file, the indexes need to be updated to point at the new location.
Clustered index (storing all row data within the index):
the extra hop from the index to the heap file can be too expensive for reads, so it can be desirable to store the indexed row directly within the index. This is known as a clustered index (e.g. the primary key index of a MySQL InnoDB table).
Covering index (e.g. in MySQL, an index on keys other than the primary key):
stores some of a table's columns within the index, allowing some queries to be answered from the index alone.
Multi-column indexes
Concatenated index (e.g. a MySQL composite unique key):
simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). A small sketch follows below.
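A toy illustration of a concatenated key on (last_name, first_name), showing why it helps prefix queries but not queries on the second column alone (the data and names are made up):

```python
import bisect

rows = [
    ("smith", "adam", "row-17"),
    ("smith", "zoe", "row-42"),
    ("jones", "ann", "row-3"),
]
# The index key is simply the tuple (last_name, first_name), kept in sorted order.
index = sorted((last, first, rowid) for last, first, rowid in rows)

def find_by_last_name(last):
    """Efficient: the leading column of the concatenated key is fixed."""
    lo = bisect.bisect_left(index, (last,))
    hi = bisect.bisect_right(index, (last, chr(0x10FFFF)))   # just past all `last` entries
    return [rowid for _, _, rowid in index[lo:hi]]

print(find_by_last_name("smith"))   # -> ['row-17', 'row-42']
# Searching by first_name alone cannot use this sort order and would need a full scan.
```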
Multi-dimensional indexes are a more general way of querying several columns at once, e.g. R-trees for spatial data.
Full-text search and fuzzy indexes
To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance.
In Lucene, the in-memory index over the keys of the term dictionary is a finite state automaton over the characters in the keys, similar to a trie.
Keeping everything in memory
Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it is acceptable for data to be lost on restart. But other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don't need to read from disk. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk [44].
Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes.
Anti-caching approach: when there is not enough memory, evict the least recently used data from memory to disk, and load it back into memory when it is accessed again (similar to what operating systems do with virtual memory, but at the granularity of individual records).
Summary
Log-structured storage engine
Hash index + append-only log
- The log is split into segments, with merging and compaction running in the background; disk space utilization is relatively low
- The index is limited by the amount of available memory
- Range queries are hard to support
Range index + SSTables (LSM-Trees are the evolution of this idea)
- Writes are sorted by key: they go into the memtable (in memory) first, which is flushed to an SSTable (on disk) when it is full
- Sparse range index + a scan within the located range: not limited by memory, and range queries are well supported
- Merging (an efficient merge of sorted runs) and compaction run in the background; disk space utilization is high
- Reads are relatively slow and writes fast (sequential, append-only writes; no in-place updates, deletes take effect at merge time)
- Crash recovery: an append-only log
- Concurrency: both schemes above use a single writer thread, and segment files are immutable, so they can be read by multiple threads
Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
Page-oriented storage engine
B-Trees(update-in-place)
- Sorted by key; pages are updated in place (overwritten)
- Each piece of data exists in only one place, which supports strong transactional guarantees
- Reads are fast and writes slower (random writes, and pages may need to be split)
- Crash recovery: write-ahead log
- Concurrency: latches (lightweight locks)
- Secondary indexes, clustered indexes, non-clustered indexes, ...
- Multi-column indexes
Other
- Full-text indexes
- Fuzzy indexes
- Keeping everything in memory
OLAP (Online Analytic Processing) Overview
Characteristics: scan over a huge number of records, read only a few columns per record, and compute aggregates over them.
Indexes matter much less here; instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk. Column-oriented storage (discussed below) helps achieve this goal.
Data Warehousing
Data from the OLTP systems is collected into the data warehouse via ETL (extract-transform-load) in a form suitable for analytic processing. The data warehouse is usually relational, because SQL is a good fit for analytic queries.
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations
A big advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns.
![](https://panoply.io/uploads/versions/diagram4—x----750-328x—.jpg)
Stars and Snowflakes: Schemas for Analytics
Star schema (also known as dimensional modeling):
at the center is the fact table, which usually has a large number of columns; some are attributes of the event, others are foreign key references to dimension tables.
![](https://i.ytimg.com/vi/KUwOcip7Zzc/maxresdefault.jpg)
Snowflake schema:
a variation of the star schema in which dimensions are further broken down into sub-dimensions (the dimension tables are normalized).
![](https://www.tutorialspoint.com/dwh/images/snowflake.jpg)
Fact table: each row of the fact table represents an event that occurred at a particular time.
Dimension table: the dimensions represent the who, what, where, when, how, and why of the event.
Column-Oriented Storage
The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead.
The column-oriented storage layout relies on each column file containing the rows in the same order.
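A small illustration of the difference between the two layouts, and of why a column store lets a query touch only the columns it needs (toy data):

```python
rows = [
    {"date": "2024-01-01", "product": "apple", "qty": 3},
    {"date": "2024-01-01", "product": "pear",  "qty": 1},
    {"date": "2024-01-02", "product": "apple", "qty": 7},
]

# Row-oriented: all values of one row stored together.
row_store = [(r["date"], r["product"], r["qty"]) for r in rows]

# Column-oriented: all values of one column stored together,
# with every column keeping the rows in the same order.
column_store = {
    "date":    [r["date"] for r in rows],
    "product": [r["product"] for r in rows],
    "qty":     [r["qty"] for r in rows],
}

# A query needing only two columns can ignore the rest entirely.
total_apple_qty = sum(
    q for p, q in zip(column_store["product"], column_store["qty"]) if p == "apple"
)
print(total_apple_qty)   # -> 10
```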
Column Compression
We can further reduce the demands on disk throughput by compressing the data; the values within a single column are often repetitive, which makes them compress very well.
- Bitmap encoding: a column with n distinct values is turned into n separate bitmaps, one per distinct value, with one bit per row (see the sketch below)
- Vectorized processing: operators work on whole chunks of compressed column data at a time, making good use of CPU caches and SIMD instructions
A big bottleneck is the bandwidth for getting data from disk into memory; developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache.
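A sketch of bitmap encoding for a low-cardinality column, and of how predicates turn into bitwise operations (toy data):

```python
product_column = ["apple", "pear", "apple", "apple", "pear"]

# One bitmap per distinct value, one bit per row.
bitmaps = {}
for row_number, value in enumerate(product_column):
    bitmaps.setdefault(value, [0] * len(product_column))[row_number] = 1

print(bitmaps)   # -> {'apple': [1, 0, 1, 1, 0], 'pear': [0, 1, 0, 0, 1]}

# WHERE product IN ('apple', 'pear')  ->  bitwise OR of the two bitmaps
either = [a | b for a, b in zip(bitmaps["apple"], bitmaps["pear"])]
print(either)    # -> [1, 1, 1, 1, 1]
```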
Sort Order in Column Storage
In a column store, the row order doesn't matter in itself; storing rows in insertion order is easiest. But, as in SSTables, we can also choose to impose a sort order and use it as an indexing mechanism. It wouldn't make sense to sort each column independently, because then we would no longer know which entries in different columns belong to the same row. Rather, the data needs to be sorted an entire row at a time, even though it is stored by column.
A second column can determine the sort order of any rows that have the same value in the first column.
Another advantage of sorted order is that it can help with compression of columns
Having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store.
Writing to Column-Oriented Storage
Column orientation, compression, and sorting all make writes more expensive. A good solution is the LSM-tree approach: writes first go to an in-memory structure (a memtable) and are written out to disk in new column-oriented files when enough writes have accumulated. Queries then have to examine both the column data on disk and the recent writes in memory and combine the two, but the query optimizer hides this detail from the user.
Aggregation: Data Cubes and Materialized Views
Materialized view:
an actual copy of the query results, written to disk. When the underlying data changes, the materialized view needs to be updated.
A common special case of a materialized view is known as a data cube or OLAP cube: a grid of aggregates grouped by different dimensions (see the sketch below).
The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data
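A minimal sketch of a two-dimensional data cube built from a toy fact table, pre-aggregating a value over the (date, product) dimensions:

```python
from collections import defaultdict

facts = [                       # one row per event: (date, product, quantity)
    ("2024-01-01", "apple", 3),
    ("2024-01-01", "pear",  1),
    ("2024-01-02", "apple", 7),
    ("2024-01-02", "apple", 2),
]

cube = defaultdict(int)         # cell (date, product) -> pre-aggregated total
for date, product, qty in facts:
    cube[(date, product)] += qty

# Queries along the cube's dimensions never touch the raw fact table again...
print(cube[("2024-01-02", "apple")])                              # -> 9
print(sum(v for (d, _), v in cube.items() if d == "2024-01-01"))  # -> 4
# ...but a question about a dimension not in the cube still needs the raw data.
```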