NoSQL Databases技术资料整理汇总

在 stuttgart media 大学的 christof strauch 历时8个月（2010年6月-2011年2月）完成了一篇150页长的nosql相关的论文, 对nosql的各个方面做了探讨

<a href="http://www.christof-strauch.de/nosqldbs.pdf">http://www.christof-strauch.de/nosqldbs.pdf</a>

<a href="http://duanple.blog.163.com/blog/static/709717672011330101333271/">http://duanple.blog.163.com/blog/static/709717672011330101333271/</a>

<a href="http://blog.nosqlfan.com/html/1647.html">http://blog.nosqlfan.com/html/1647.html</a>

<a href="http://www.empiricalreality.com/2010/09/22/2010-nosql-summer-reading-list/">http://www.empiricalreality.com/2010/09/22/2010-nosql-summer-reading-list/</a>

<a href="http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/">http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/</a>

<a href="http://horicky.blogspot.com/2009/11/nosql-patterns.html">http://horicky.blogspot.com/2009/11/nosql-patterns.html</a>

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

google created a full mechanism that included a distributed filesystem, a column-family-oriented data store, a distributed coordination system, and a mapreduce-based parallel algorithm execution environment. graciously enough, google published and presented a series of papers explaining some of the key pieces of its infrastructure. the most important of these publications are as follows:

the creators of the open-source search engine, lucene, were the first to develop an open-source version that replicated some of the features of google’s infrastructure. subsequently, the core lucene developers joined yahoo, where with the help of a host of other contributors, they created a parallel universe that mimicked all the pieces of the google distributed computing stack.

this open-source alternative is hadoop.

a year after the google papers had catalyzed interest in parallel scalable processing and nonrelational distributed data stores, amazon decided to share some of its own success story. in 2007, amazon presented its ideas of a distributed highly available and eventually consistent data store named dynamo.

you can read more about amazon dynamo in a research paper, the details of which are as follows:

相关blog:

<a href="http://www.cnblogs.com/fxjwind/archive/2012/09/27/2705179.html">nosql data modeling techniques</a>

concerning the classification of nosql stores highscalability author todd hoff cites a presentation by stephen yen in his blog post “a yes for a nosql taxonomy” (cf. [hof09c]).

in the presentation “nosql is a horseless carriage” (cf. [yen09]) yen suggests a taxononmy that can be found in table 2.1.

key-value-cache

memcached, repcached, coherence, infinispan, extreme scale, jboss cache, velocity, terracoqa

key-value-store

keyspace, flare, schema free, ramcloud

eventually-consistent key-value-store

dynamo, voldemort, dynomite, subrecord

mo8ondb

dovetaildb

ordered-key-value-store

tokyo tyrant, lightcloud, nmdb, luxio, memcachedb, actord

data-structures server

redis

tuple store

gigaspaces, coord, apache river

object database

zopedb, db4o, shoal

document store

couchdb, mongodb, jackrabbit, xml databases, thrudb, cloudkit, perservere, riak basho, scalaris

wide columnar store

bigtable, hbase, cassandra, hypertable, kai, openneptune, qbase, kdi

<a href="http://www.cnblogs.com/fxjwind/archive/2012/02/24/2367031.html">cap – consistency, availability, partition tolerance</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/08/07/2627059.html">how to beat the cap theorem</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2013/04/13/3017892.html">全序, 分布式一致性的本质</a>

在lamport论文谈了那么多偏序和全序的问题, 全序到底有什么用? 论文里面给出互斥资源访问的例子, 如果觉得还是比较抽象

这里以分布式数据存储为例

对于并发写数据就存在一致性问题, 如何解决分布式数据库的一致性问题?

lamport在上面那篇论文里面其实也给出了答案, 这就是他这篇paper里面第二个贡献, 也是常常为人忽略的

如果将分布式系统的所有节点看作有限状态机, 只要保证每个节点的执行命令序列一致, 就能保证所有节点的状态的一致性

对于分布式数据库, 其实就是在同样的初始状况下, 保证每个数据库节点的数据更新序列一致, 就能简单的保证所有数据库的数据的一致性

所以可以看出, 一致性问题已经转变为排序问题

所以这就是为什么上面的paper来讨论偏序和全序的原因, 因为其实你解决了这个问题就已经解决了数据一致性问题

于是上面的问题转变为, 如何在分布式的环境中, 给所有的写操作全序?

1. 基于master或固定参照系, 比如下面的利用时间戳, 悲观或乐观锁

这些方法确实可以保证全序, 但都存在单点或时钟同步问题

2. 使用paxos算法来保证全序, 尤其在强一致性的场景下

但问题在于, 该算法耗费比较高, 如果对于海量并发写而言, 需要高可用性的方案

当然对于高可用性的方案, 必须要做出一些牺牲, 无法保证全序

那么vector clocks算法就是这样一种方案, 当然只能达到偏序, 因为他的原理就是基于paper中描述的偏序理论

<a href="http://www.cnblogs.com/fxjwind/archive/2012/11/24/2786066.html">nosql数据一致性技术概要</a>

此概念成名于dynamo的设计, 但是该设计不光可以用于最终一致性的方案, 而是一种保证一致性的通用思路

因为在分布式的环境中, 让w达到n是不现实的,在这种情况下怎样保证一致性...

对于m/s架构, 如果master只会同步更新部分复本w, 如果read操作需要读到最新数据, 要不通过master, 要不就至少需要读r个复本, 并保证r+w>n

paxos同样也可以基于这样的设计

n the number of replicas for the data or the piece of data to be read or written. r the number of machines contacted in read operations. w the number of machines that have to be blocked in write operations5. in the interest to provide e. g. the read-your-own-writes consistency model the following relation between the above parameters becomes necessary: r +w > n

几种特殊情况：

w = 1, r = n,对写操作要求高性能高可用

r = 1, w = n , 对读操作要求高性能高可用，比如类似cache之类业务

w = q, r = q where q = n / 2 + 1 一般应用适用，读写性能之间取得平衡。如n=3,w=2,r=2

当然最典型的代表就是amazon dynamo

高可用性的solution, 任意节点都可以写入数据, 必然导致版本的不一致和冲突

所以必须需要一种技术来记录各个版本之间的因果关系或偏序关系, 这就需要vector clocks

并且对于任意节点的更新, 如何在各个复本间同步以达到最终的一致性, 这就需要反熵协议

<a href="http://www.cnblogs.com/fxjwind/archive/2013/04/13/3017914.html">vector clocks, 时间向量</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/06/06/2537963.html">why vector clock are easy or hard?</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2013/04/02/2995679.html">anti-entropy protocols</a>

如上图右下角, m/s比较简单在上面的引用已经描述, 简单但很实用, goolge早期在gfs和bigtable都使用的这种设计

其中最重要的算法是paxos, google的megastore中使用

<a href="http://www.cnblogs.com/fxjwind/archive/2013/04/03/2998396.html">strong consistency, 强一致性技术概述</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2013/04/11/3014620.html">paxos made simple</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2013/04/13/3018514.html">consistent hashing算法及相关技术</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/06/08/2541818.html">data replication 同步技术</a>

a table of a relational model gets serialized as its lines are appended and flushed to disk.

advantages

a. whole datasets can be read and written in a single io operation

b. one has a “[g]ood locality of access (on disk and in cache) of different columns”.

disadvantages

a. operating on columns is expensive as a considerable amount data has to be read.

serializes tables by appending their columns and flushing them to disk.

therefore operations on columns are fast and cheap while operations on rows are costly and can lead to seeks in a lot or all of the columns. a typical application field for this type

of storage layout is analytics where an efficient examination of columns for statistical purposes is important.

其实没有好坏, 只是不同的场景, 如果需要整行读当然row-based好, 如果只需要少量的column, 当然选columnar

做个balance, 就是下面的方案column-families

similar to column-based storage but adds the feature of defining so called locality groups that are groups of columnsexpected to be accessed together by clients.

the columns of such a group may therefore be stored together and physically separated from other columns and column groups.

the idea of locality groups was introduced in google’s bigtable paper.

storage implementation pluggable. e.g. a local mysql db, berkeley db, filesystem or even a in memory hashtable can be used as a storage mechanism.

特有的storage implementation, hbase, couchbase

<a href="http://www.cnblogs.com/fxjwind/archive/2012/06/09/2543357.html">大数据索引技术 - b+ tree vs lsm tree</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html">详解sstable结构和lsmtree索引</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/06/28/2567648.html">nosql databases - couchdb</a>

couchdb has a mvcc model that uses a copy-on-modified approach. any update will cause a private copy being made which in turn cause the index also need to be modified and causing the a private copy of the index as well, all the way up to the root pointer.

notice that the update happens in an append-only mode where the modified data is appended to the file and the old data becomes garbage. periodic garbage collection is done to compact the data. here is how the model is implemented in memory and disks.

whereas key/value stores by design often only provide a lookup by primary key or some id field and lack capabilities to query any further fields, other datastores like the document databases couchdb and mongodb allow for complex queries—at least static ones predefined on the database nodes (as in couchdb).

this is not surprising as in the design of many nosql databases rich dynamic querying features have been omitted in favor of performance and scalability.

on the other hand, also when using nosql databases, there are use-cases requiring at least some querying features for non-primary key attributes.

nosql往往只支持基于主键query, 而无法支持复杂的查询, 比如范围查询, 非主键的查询, 当然也有象couchdb和mangodb可以支持这样的查询.

但大部分比较纯粹的nosql是不支持的, 因为基于key/value的query, 一般都是基于dht(distributed hash table)技术, 只支持exact match.

那么如果用nosql, 又想具有较复杂的querying features, 有如下思路,

companion sql-database is an approach in which searchable attributes are copied to a sql or text database. the querying capabilities of this database are used to retrieve the primary keys of matching datasets by which the nosql database will subsequently be accessed.

如图, 这个想法就是用sql当索引, 比较简单, 因为索引应该会小点, 所以扩展性问题不是那么突出, 但是还是有问题, 而且维护两个系统增加了复杂性

scatter/gather local search can be used if the nosql store allows querying and indexing within database server nodes. if this is the case a query processor can dispatch queries to the database

nodes where the query is executed locally. the results from all database servers are sent back to the query processor postprocessing them to e. g. do some aggregation and returning the results to a client that issued the query.

distributed b+trees are another alternative to implement querying features. the basic idea is to hash the searchable attribute to locate the root node of a distributed b+tree (further information on scalable, distributed b+trees can be found in a paper by microsoft, hp and the university of toronto, cf. [ags08]). the “value” of this root node then contains an id for a child node in the b+tree which can again be looked up. this process is repeated until a leaf node is reached which contains the primary-key or id of a nosql database entry matching search criteria.

prefix hash table (aka distributed trie) is a tree-datastructure where every path from the root-node to the leafs contains the prefix of the key and every node in the trie contains all the data whose key is prefixed by it (for further information cf. a berkley-paper on this datastructure [rrhs04]). besides an illustration ho provides some code-snippets in his blog post that describe how to operate on prefix hash tables / distributed tries and how to use them for querying purposes (cf.[ho09b]).

前缀ht, effciently supporting 1-dimensional range queries over a dht.

<a href="http://www.cnblogs.com/fxjwind/archive/2012/07/07/2580761.html">bigtable: a distributed storage system for structured data</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/09/25/2701292.html">hbase-tdg introduction</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/09/26/2704089.html">hbase-tdg clientapi the basics</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/10/10/2717976.html">hbase-tdg clientapi advanced features</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/10/24/2737267.html">hbase-tdg schema design</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2013/01/09/2852995.html">hbase vs. bigtable comparison</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/08/10/2631822.html">cassandra - a decentralized structured storage system</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/07/02/2573498.html">nosql databases - mongodb</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2012/07/02/2573329.html">comparing mongo db and couch db</a>

<a href="http://www.cnblogs.com/fxjwind/archive/2013/04/28/3049468.html">mongodb schema design</a>

本文章摘自博客园，原文发布日期： 2012-06-13

NoSQL Databases技术资料整理汇总

继续阅读

ERROR 1 (HY000): Can't create/write to file '/tmp/#sql_4188_1.MYI' (Errcode: 28)

艰难安装LDAP,SSL认证

《Linux命令行与Shell脚本编程大全第2版.布卢姆》pdf

MySQL的4种隔离级别？出现问题

数据结构与算法（27）——排序（二）

XX系统实施过程问题总结

无组件上传图片到数据库中，最完整解决方案

【MySQL数据库】数据库索引事务1.索引2.事务

neo4j之cypher使用文档

Dijkstra--简易版（最短路径）

GitHub连夜封杀！这份阿里 10W 字内部 Java 字面试手册到底有多强？

NOSQL安全攻击

mybatis_入门程序Mybatis入门

登录plsql 报错 the account is locked --用户被锁

SequoiaDB巨杉数据库C++驱动概述

hdu7108哈希