Questions this chapter addresses:
- How can we store the data that we're given?
- How can we find it again when we're asked for it?
- How do storage engines optimized for transactional workloads (OLTP) differ from those optimized for analytics (OLAP)?
- How do log-structured storage engines (e.g. LSM-Trees) differ from page-oriented storage engines (e.g. B-Trees)?
- What types of indexes are there, and how do they work?
OLTP (Online Transaction Processing) Overview
A transaction needn't necessarily have ACID (atomicity, consistency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day).
Storage:
Sequential writes are much simpler and more efficient than random writes, so many databases internally use a log (an append-only sequence of records): appending to a data file is the simplest possible write operation.
Retrieval:
OLTP applications are typically user-facing: they look up records by key, result sets are small, and response time matters, so the storage engine uses an index to speed up queries. Disk seek time is often the bottleneck here.
Index:
An index is an additional structure, derived from the primary data, that speeds up queries. Maintaining an index has a cost, especially on writes, because the index has to be updated whenever data is written. Since indexes speed up reads but slow down writes, choosing which indexes to create is a trade-off for the application developer.
Hash Indexes with an append-only log
Storage:
Data is stored on disk as (k, v) pairs, written by appending to the log (append-only).
Retrieval:
An in-memory hash index maps each key to the byte offset of its latest value; the index is updated whenever a record is appended.
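A minimal sketch of this idea in Python (not Bitcask itself; the text format and file name are made up for illustration, and real engines use a binary, length-prefixed format):

```python
import os

class HashIndexLog:
    """Append-only data file plus an in-memory hash index of key -> byte offset."""

    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}                      # key -> offset of the latest value
        open(self.path, "ab").close()        # make sure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:     # append-only write
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset             # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:     # one seek, one read
            f.seek(offset)
            record = f.readline().decode().rstrip("\n")
        return record.split(",", 1)[1]

db = HashIndexLog()
db.put("user:1", "alice")
db.put("user:1", "bob")       # the old value stays in the log; only the index moves
print(db.get("user:1"))       # -> bob
```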
Appending forever would eventually run out of disk space. To deal with this:
- Break the log into segments of a fixed size: when a segment file reaches that size, subsequent writes go to a new segment file
- Periodically run compaction on segment files (throwing away duplicate keys in the log, and keeping only the most recent update for each key); several segments can also be merged together during compaction (see the sketch after this list)
- Segment files are read-only once written; compaction writes its output to a new file. While compaction is running, the old segment files can still serve reads and writes; when it finishes, reads switch over to the new segment file and the old ones are deleted
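A sketch of what compaction does to one segment's records (illustrative only; records are just (key, value) pairs in write order):

```python
def compact(records):
    """Keep only the most recent value for each key, in a single pass over the log."""
    latest = {}
    for key, value in records:        # later records overwrite earlier ones
        latest[key] = value
    return list(latest.items())       # contents of the new, compacted segment

segment = [("a", 1), ("b", 2), ("a", 3)]
print(compact(segment))               # -> [('a', 3), ('b', 2)]
```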
Points worth noting in this scheme:
- File format: use a binary format that first encodes the length of a string in bytes, followed by the raw string.
- Deleting records: append a special deletion record (a tombstone); when segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
- Crash recovery: Bitcask speeds up recovery by storing a snapshot of each segment's hash map on disk, which can be loaded into memory more quickly.
- Partially written records: include checksums, allowing corrupted parts of the log to be detected and ignored.
- Concurrency control: there is a single writer thread; a data file is either append-only or immutable, so it can be read concurrently by multiple threads.
Advantages:
- Appending and segment merging are sequential write operations, which are much faster than random writes
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable
- Merging old segments avoids the problem of data files getting fragmented over time.
Limitations:
- The hash table must fit in memory; if it grows too large and has to be kept on disk, lookups become much slower, and hash collisions also have to be handled
- Range queries are not efficient.
SSTables (Sorted String Tables) and LSM-Trees
How can SSTables address these problems with hash indexes?
Storage:
In a hash index, the (k, v) pairs are stored in write order, and the same key may appear in several segment files. In an SSTable, we require that the sequence of key-value pairs is sorted by key, and that each key appears only once within each merged segment file (compaction already ensures this).
Retrieval:
Because keys are stored in sorted order, one key in the (sparse) in-memory index can serve as the upper/lower bound of a range; you locate the range that would contain the key and then scan just that range (see the sketch after the figure below). A read checks the in-memory memtable first, then the on-disk SSTables from newest to oldest.
![](https://miro.medium.com/max/4000/0*6zJnBf20REcr4Uet.png)
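A sketch of the range-then-scan lookup against a sorted segment, using a sparse index with one entry every few records (the data and block size are made up):

```python
import bisect

# Sorted segment (here just an in-memory list of (key, value) pairs sorted by key)
segment = sorted({"apple": 1, "banana": 2, "cherry": 3, "grape": 4, "melon": 5}.items())
# Sparse index: one (key, position) entry per "block" of two records
sparse_index = [(segment[i][0], i) for i in range(0, len(segment), 2)]

def lookup(key):
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1              # last indexed key <= search key
    if i < 0:
        return None
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else len(segment)
    for k, v in segment[start:end]:                     # scan only the small range
        if k == key:
            return v
    return None

print(lookup("cherry"))   # -> 3
print(lookup("kiwi"))     # -> None
```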
Advantages:
- Merging segments is simple and efficient, even if the files are bigger than the available memory; it works like merging sorted arrays (see the merge sketch after this list)
- You no longer need to keep an index of all the keys in memory: because the keys are sorted, one key in the index can stand for the upper/lower bound of a range, and a lookup only needs to scan that small range
- It is possible to group records into a block and compress it before writing it to disk, which saves disk space and reduces I/O bandwidth
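A sketch of merging sorted segments in a streaming fashion (segments are passed oldest to newest, so the newest value wins; purely illustrative):

```python
import heapq

def merge_segments(*segments):
    """Merge SSTable segments (each a list of (key, value) pairs sorted by key)."""
    # Tag each pair with its segment number so that, for equal keys,
    # entries from newer segments sort later and overwrite older ones.
    tagged = [((k, i, v) for k, v in seg) for i, seg in enumerate(segments)]
    merged = {}
    for key, _, value in heapq.merge(*tagged):   # streams the inputs in sorted order
        merged[key] = value
    return list(merged.items())                  # the output is again sorted by key

old = [("a", 1), ("c", 3)]
new = [("a", 9), ("b", 2)]
print(merge_segments(old, new))                  # -> [('a', 9), ('b', 2), ('c', 3)]
```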
Constructing and maintaining SSTables
Sorted structure on disk: SSTable segments, maintained by an LSM-tree (Log-Structured Merge-Tree)
Sorted structure in memory: red-black trees or AVL trees
Work flow:
- When a write comes in, add it to an in-memory balanced tree structure called the memtable
- When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file
- In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
- From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
The only problem: if the database crashes, the most recent writes (which are in the memtable but not yet on disk) are lost. To avoid this, every write is also appended to a separate log on disk (plain append-only, not sorted); that log is only used to restore the memtable after a crash, and is discarded whenever the memtable is written out to an SSTable.
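A toy sketch of this workflow (a plain dict stands in for the balanced tree and is only sorted at flush time; flushed segments are kept in a Python list rather than real on-disk SSTable files; names and thresholds are made up):

```python
import json, os

class MiniLSM:
    def __init__(self, threshold=2, wal_path="wal.log"):
        self.memtable = {}
        self.threshold = threshold          # flush when the memtable has this many keys
        self.wal_path = wal_path
        self.sstables = []                  # flushed, sorted segments (newest last)

    def put(self, key, value):
        with open(self.wal_path, "a") as wal:          # append to the recovery log first
            wal.write(json.dumps([key, value]) + "\n")
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self._flush()

    def _flush(self):
        self.sstables.append(sorted(self.memtable.items()))  # write out a sorted run
        self.memtable = {}
        os.remove(self.wal_path)            # the log is no longer needed after the flush

    def get(self, key):
        if key in self.memtable:                        # most recent data first
            return self.memtable[key]
        for segment in reversed(self.sstables):         # then newest SSTable to oldest
            for k, v in segment:
                if k == key:
                    return v
        return None

db = MiniLSM()
db.put("a", 1)
db.put("b", 2)                   # reaches the threshold and flushes to an "SSTable"
db.put("a", 3)
print(db.get("a"), db.get("b"))  # -> 3 2
```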
LSM-tree (Log-Structured Merge-Tree)
Writes are fast and reads are slower: writes are sequential appends, while a read may have to check the memtable and several SSTables.
Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.
The basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective.
Because the data is stored in sorted order, the index supports range queries, which keeps lookups reasonably fast even on very large data sets.
Because disk writes are sequential, LSM-trees can also support very high write throughput.
Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
Performance optimizations
- Looking up a key that does not exist is expensive: you have to check the memtable and every segment before concluding the key is absent. Storage engines therefore often use an additional Bloom filter (a sketch follows at the end of this subsection)
Merge and compaction strategies:
- size-tiered: newer and smaller SSTables are successively merged into older and larger SSTables.
- leveled compaction: the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space
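A Bloom filter is a memory-efficient, probabilistic structure that can say with certainty that a key is absent (it may report false positives but never false negatives), so the engine can skip the disk lookups entirely for most missing keys. A minimal sketch (the bit-array size, number of hashes, and salted-SHA-256 hashing scheme are all illustrative choices):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # derive num_hashes bit positions from salted SHA-256 digests of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:1")
print(bf.might_contain("user:1"))   # -> True
print(bf.might_contain("user:2"))   # -> almost certainly False: no segments need checking
```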
B-Trees
Storage:
Key-value pairs are stored sorted by key, which allows efficient key-value lookups and range queries. B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
Most databases can fit into a B-tree that is three or four levels deep, so you don't need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
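A quick back-of-the-envelope check of that 256 TB figure (a sketch; the exact number depends on how levels and page headers are counted):

```python
branching_factor = 500
levels = 4
page_size = 4 * 1024                                   # 4 KB pages
capacity_bytes = branching_factor ** levels * page_size
print(capacity_bytes / 10**12)                         # -> 256.0 (decimal terabytes)
```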
Making B-trees reliable
The basic write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page.
- Crash recovery: in order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk, a write-ahead log (WAL, also known as a redo log), to which every modification is appended before it is applied to the pages of the tree itself.
- Concurrency control: careful concurrency control is required when multiple threads access the B-tree at the same time. This is typically done by protecting the tree's data structures with latches (lightweight locks).
B-tree optimizations
- Copy-on-write: instead of overwriting pages in place, a modified page is written to a different location, and new versions of the parent pages are created pointing at it
- …
Comparing B-Trees and LSM-Trees
B-trees: slower writes, faster reads; each piece of data exists in exactly one place, which makes it easier to offer strong transactional guarantees.
LSM-trees: faster writes, slower reads; the same key may exist in several segments.
Advantages of LSM-trees
Compared with B-trees, LSM-trees can usually sustain higher write throughput:
- LSM-trees typically have lower write amplification
- they write compact SSTables sequentially
LSM-trees also use disk space more efficiently: they can be compressed better, and thus often produce smaller files on disk than B-trees.
Write amplification (one write to the database resulting in multiple writes to the disk over its lifetime):
- A B-tree index must write every piece of data at least twice: once to the write-ahead log and once to the tree page itself
- Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables
Downsides of LSM-trees (writes trade extra space and background work for speed):
- The compaction process can sometimes interfere with the performance of ongoing reads and writes
- Another issue with compaction arises at high write throughput: the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
Other Indexing Structures
Secondary indexes
Storing values within the index
The actual rows can be stored in a heap file, with the index holding a reference into the heap file; this avoids duplicating data when multiple indexes exist. If a new value fits in the old spot, it can overwrite the old one in place and the index does not need to change; if the value has to be moved to a new location in the heap file, the indexes need to be updated to point at the new location.
Clustered index (storing all row data within the index):
the extra hop from the index to the heap file can be too expensive for reads, so it can be desirable to store the indexed row directly within the index. This is known as a clustered index (e.g. the primary key index of a MySQL InnoDB table).
Covering index (e.g. in MySQL, an index on keys other than the primary key):
stores some of a table's columns within the index, allowing some queries to be answered from the index alone.
Multi-column indexes
Concatenated index (e.g. a MySQL composite unique key):
simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). A small sketch follows below.
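A toy illustration of a concatenated key on (last_name, first_name), showing why it helps prefix queries but not queries on the second column alone (the data and names are made up):

```python
import bisect

rows = [
    ("smith", "adam", "row-17"),
    ("smith", "zoe", "row-42"),
    ("jones", "ann", "row-3"),
]
# The index key is simply the tuple (last_name, first_name), kept in sorted order.
index = sorted((last, first, rowid) for last, first, rowid in rows)

def find_by_last_name(last):
    """Efficient: the leading column of the concatenated key is fixed."""
    lo = bisect.bisect_left(index, (last,))
    hi = bisect.bisect_right(index, (last, chr(0x10FFFF)))   # just past all `last` entries
    return [rowid for _, _, rowid in index[lo:hi]]

print(find_by_last_name("smith"))   # -> ['row-17', 'row-42']
# Searching by first_name alone cannot use this sort order and would need a full scan.
```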
Multi-dimensional indexes are a more general way of querying several columns at once, e.g. R-trees for spatial data.
Full-text search and fuzzy indexes
To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance.
In Lucene, the in-memory index over the keys of the term dictionary is a finite state automaton over the characters in the keys, similar to a trie.
Keeping everything in memory
Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it is acceptable for data to be lost on restart. But other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don't need to read from disk. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk [44].
Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes.
Anti-caching approach: when there is not enough memory, evict the least recently used data from memory to disk, and load it back into memory when it is accessed again (similar to what operating systems do with virtual memory, but at the granularity of individual records).
Summary
Log-structured storage engine
Hash index + append-only log
- The log is split into segments, with merging and compaction running in the background; disk space utilization is relatively low
- The index is limited by the amount of available memory
- Range queries are hard to support
Range index + SSTables (LSM-Trees are the evolution of this idea)
- Writes are sorted by key: they go into the memtable (in memory) first, which is flushed to an SSTable (on disk) when it is full
- Sparse range index + a scan within the located range: not limited by memory, and range queries are well supported
- Merging (an efficient merge of sorted runs) and compaction run in the background; disk space utilization is high
- Reads are relatively slow and writes fast (sequential, append-only writes; no in-place updates, deletes take effect at merge time)
- Crash recovery: an append-only log
- Concurrency: both schemes above use a single writer thread, and segment files are immutable, so they can be read by multiple threads
Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
Page-oriented storage engine
B-Trees(update-in-place)
- Sorted by key; pages are updated in place (overwritten)
- Each piece of data exists in only one place, which supports strong transactional guarantees
- Reads are fast and writes slower (random writes, and pages may need to be split)
- Crash recovery: write-ahead log
- Concurrency: latches (lightweight locks)
- Secondary indexes, clustered indexes, non-clustered indexes, ...
- Multi-column indexes
Other
- Full-text indexes
- Fuzzy indexes
- Keeping everything in memory
OLAP (Online Analytic Processing) Overview
Characteristics: scan over a huge number of records, read only a few columns per record, and compute aggregates over them.
Indexes matter much less here; instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk. Column-oriented storage (discussed below) helps achieve this goal.
Data Warehousing
Data from the OLTP systems is collected into the data warehouse via ETL (extract-transform-load) in a form suitable for analytic processing. The data warehouse is usually relational, because SQL is a good fit for analytic queries.
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations
A big advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns.
![](https://panoply.io/uploads/versions/diagram4—x----750-328x—.jpg)
Stars and Snowflakes: Schemas for Analytics
Star schema (also known as dimensional modeling):
at the center is the fact table, which usually has a large number of columns; some are attributes of the event, others are foreign key references to dimension tables.
![](https://i.ytimg.com/vi/KUwOcip7Zzc/maxresdefault.jpg)
Snowflake schema:
a variation of the star schema in which dimensions are further broken down into sub-dimensions (the dimension tables are normalized).
![](https://www.tutorialspoint.com/dwh/images/snowflake.jpg)
Fact table: each row of the fact table represents an event that occurred at a particular time.
Dimension table: the dimensions represent the who, what, where, when, how, and why of the event.
Column-Oriented Storage
The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead.
The column-oriented storage layout relies on each column file containing the rows in the same order.
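A small illustration of the difference between the two layouts, and of why a column store lets a query touch only the columns it needs (toy data):

```python
rows = [
    {"date": "2024-01-01", "product": "apple", "qty": 3},
    {"date": "2024-01-01", "product": "pear",  "qty": 1},
    {"date": "2024-01-02", "product": "apple", "qty": 7},
]

# Row-oriented: all values of one row stored together.
row_store = [(r["date"], r["product"], r["qty"]) for r in rows]

# Column-oriented: all values of one column stored together,
# with every column keeping the rows in the same order.
column_store = {
    "date":    [r["date"] for r in rows],
    "product": [r["product"] for r in rows],
    "qty":     [r["qty"] for r in rows],
}

# A query needing only two columns can ignore the rest entirely.
total_apple_qty = sum(
    q for p, q in zip(column_store["product"], column_store["qty"]) if p == "apple"
)
print(total_apple_qty)   # -> 10
```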
Column Compression
We can further reduce the demands on disk throughput by compressing the data; the values within a single column are often repetitive, which makes them compress very well.
- Bitmap encoding: a column with n distinct values is turned into n separate bitmaps, one per distinct value, with one bit per row (see the sketch below)
- Vectorized processing: operators work on whole chunks of compressed column data at a time, making good use of CPU caches and SIMD instructions
A big bottleneck is the bandwidth for getting data from disk into memory; developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache.
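A sketch of bitmap encoding for a low-cardinality column, and of how predicates turn into bitwise operations (toy data):

```python
product_column = ["apple", "pear", "apple", "apple", "pear"]

# One bitmap per distinct value, one bit per row.
bitmaps = {}
for row_number, value in enumerate(product_column):
    bitmaps.setdefault(value, [0] * len(product_column))[row_number] = 1

print(bitmaps)   # -> {'apple': [1, 0, 1, 1, 0], 'pear': [0, 1, 0, 0, 1]}

# WHERE product IN ('apple', 'pear')  ->  bitwise OR of the two bitmaps
either = [a | b for a, b in zip(bitmaps["apple"], bitmaps["pear"])]
print(either)    # -> [1, 1, 1, 1, 1]
```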
Sort Order in Column Storage
In a column store, the row order doesn't matter in itself; storing rows in insertion order is easiest. But, as in SSTables, we can also choose to impose a sort order and use it as an indexing mechanism. It wouldn't make sense to sort each column independently, because then we would no longer know which entries in different columns belong to the same row. Rather, the data needs to be sorted an entire row at a time, even though it is stored by column.
A second column can determine the sort order of any rows that have the same value in the first column.
Another advantage of sorted order is that it can help with compression of columns
Having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store.
Writing to Column-Oriented Storage
Column orientation, compression, and sorting all make writes more expensive. A good solution is the LSM-tree approach: writes first go to an in-memory structure (a memtable) and are written out to disk in new column-oriented files when enough writes have accumulated. Queries then have to examine both the column data on disk and the recent writes in memory and combine the two, but the query optimizer hides this detail from the user.
Aggregation: Data Cubes and Materialized Views
Materialized view:
an actual copy of the query results, written to disk. When the underlying data changes, the materialized view needs to be updated.
A common special case of a materialized view is known as a data cube or OLAP cube: a grid of aggregates grouped by different dimensions (see the sketch below).
The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data
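A minimal sketch of a two-dimensional data cube built from a toy fact table, pre-aggregating a value over the (date, product) dimensions:

```python
from collections import defaultdict

facts = [                       # one row per event: (date, product, quantity)
    ("2024-01-01", "apple", 3),
    ("2024-01-01", "pear",  1),
    ("2024-01-02", "apple", 7),
    ("2024-01-02", "apple", 2),
]

cube = defaultdict(int)         # cell (date, product) -> pre-aggregated total
for date, product, qty in facts:
    cube[(date, product)] += qty

# Queries along the cube's dimensions never touch the raw fact table again...
print(cube[("2024-01-02", "apple")])                              # -> 9
print(sum(v for (d, _), v in cube.items() if d == "2024-01-01"))  # -> 4
# ...but a question about a dimension not in the cube still needs the raw data.
```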