I've read several topics, but I'm lost. I'm quite new to this. I want to store a huge sparse matrix and have several ideas, but I can't choose between them. Here are my requirements:

Adjacency matrix of approx. 50 million vertices.

Maximum number of neighbors per vertex: approx. 10,000.

Average number of neighbors per vertex: approx. 200-300.

Fast row query: a vector will be multiplied by this matrix.

O(1) complexity to add an edge.

Most probably, edges will not be deleted.

Enumeration of the vertices adjacent to v: as fast as possible.

Portability: there must be a way to transfer the database from one computer to another.

So, here are my ideas:

1. Huge table of (row, col) pairs. Very simple, but enumerating the neighbors of a vertex will be at least O(log N), where N is the size of the table. I think that is quite slow. Also, it must be indexed. Any RDBMS will be good for that.

2. Enormous number of lists: one list per vertex. Very fast enumeration, but wouldn't it take a lot of resources to store this? Also, I'm not sure which DBMS to use in this case: maybe some NoSQL?

3. Huge table (row | set of cols). A combination of the two above. I'm not sure whether any RDBMS supports arbitrary sets. Do you know of any? Maybe NoSQL will be useful here?

4. Collection of adjacency lists. Any RDBMS will be suitable for that, and the costs in terms of complexity are good, but performance can be killed by the multiple DB requests needed for one vertex.

5. HDF5. I think it will be slow due to I/O.

6. Neo4j. As far as I understand, it stores data in doubly linked lists, so it will be practically the same as №4. Am I right?

Please help me choose, or suggest a better solution.

If my estimates are wrong anywhere, please correct me.


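A hybrid neo4j / hbase approach may work well, in which neo4j optimizes the graph-processing aspects while hbase does the heavy lifting for scalability, e.g. for storing lots of extra attributes.

neo4j contains the nodes and relationships. It may well be scalable enough on its own. From what I found on independent (non-neo4j) websites, there are claims of up to several billion nodes/relationships on a single machine, with traversal performance a couple of orders of magnitude better than an RDBMS.

But if more scalability is needed, you can bring in the hbase big iron to store the extra attributes that are not relationships or node identifiers. Then simply add the hbase rowkey to the neo4j node info so the application can look it up when needed.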

In the end, I implemented solution number one.

I used PostgreSQL with two tables: one for edges, with two columns (start/end), and another for vertices, with a unique serial for the vertex number and some columns for the vertex description.
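For illustration, the two-table layout could look roughly like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere, and the table names, column names, and helper functions are all made up for the example; it is not my actual schema.

```python
import sqlite3

# Stand-in sketch: SQLite (stdlib) instead of PostgreSQL. All table, column,
# and function names here are hypothetical, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- plays the role of a serial
        description TEXT
    );
    CREATE TABLE edges (
        start_id INTEGER NOT NULL REFERENCES vertices(id),
        end_id   INTEGER NOT NULL REFERENCES vertices(id),
        PRIMARY KEY (start_id, end_id)  -- also serves as the index for row queries
    );
""")

def add_vertex(description):
    """Insert a vertex and return its generated number."""
    cur = conn.execute(
        "INSERT INTO vertices (description) VALUES (?)", (description,)
    )
    return cur.lastrowid

def add_edge(start_id, end_id):
    """Insert an edge; inserting the same edge twice is silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO edges (start_id, end_id) VALUES (?, ?)",
        (start_id, end_id),
    )

def neighbors(vertex_id):
    """Enumerate the vertices adjacent to vertex_id using the primary-key index."""
    rows = conn.execute(
        "SELECT end_id FROM edges WHERE start_id = ? ORDER BY end_id",
        (vertex_id,),
    )
    return [row[0] for row in rows]

a, b, c = add_vertex("a"), add_vertex("b"), add_vertex("c")
add_edge(a, b)
add_edge(a, c)
add_edge(a, b)  # duplicate of the first edge, ignored
```

With the composite key on (start, end), listing the neighbors of one vertex is a single index range scan rather than one lookup per edge.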

I've implemented upsert based on pg_advisory_xact_lock. It was a bit slow, but it was enough for me.

Also, it's a pain to delete a vertex in this setup.

To speed up multiplication, I exported the edges table to a file. It can even fit in RAM on an x64 machine.
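One way to use such an export for multiplication is to pack the (start, end) pairs into compressed sparse row (CSR) arrays. The sketch below is a minimal pure-Python version of that idea, assuming vertices are numbered from 0 and the matrix entries are all 0 or 1; the function names are made up, and in practice a numeric library would be much faster than these loops.

```python
from array import array

def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from an iterable of (start, end) pairs."""
    edges = list(edges)
    row_ptr = array("q", [0]) * (num_vertices + 1)
    for start, _ in edges:            # count the out-degree of each row
        row_ptr[start + 1] += 1
    for i in range(num_vertices):     # prefix sum turns counts into offsets
        row_ptr[i + 1] += row_ptr[i]
    col_idx = array("i", [0]) * len(edges)
    next_slot = array("q", row_ptr[:num_vertices])  # next free slot per row
    for start, end in edges:
        col_idx[next_slot[start]] = end
        next_slot[start] += 1
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, v):
    """Return the neighbors of v as one contiguous slice."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]

def vec_times_matrix(x, row_ptr, col_idx):
    """Compute y = x * A for a square 0/1 adjacency matrix A stored in CSR."""
    y = [0.0] * (len(row_ptr) - 1)
    for i, xi in enumerate(x):
        if xi == 0:
            continue                  # a zero entry contributes nothing
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += xi
    return y

# Tiny example: 3 vertices, 4 directed edges.
row_ptr, col_idx = build_csr(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
y = vec_times_matrix([1, 2, 3], row_ptr, col_idx)
```

In this layout the neighbors of a vertex sit next to each other in memory, so enumerating them is one slice, and the whole structure can be written to or read from a file as two flat arrays.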

To be fair, the amount of data was less than I expected. Instead of 50 million vertices with an average of 200-300 edges per vertex, there were only 7 million vertices and 160 million edges in total.
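To put those numbers in perspective, here is a rough back-of-the-envelope size calculation. It assumes 4-byte vertex indices and 8-byte row offsets, and takes 250 edges per vertex as the midpoint of the original 200-300 estimate; those choices and the function names are mine, and real storage will add overhead on top.

```python
def edge_list_bytes(num_edges, index_bytes=4):
    """Size of a flat list of (start, end) pairs."""
    return 2 * num_edges * index_bytes

def csr_bytes(num_vertices, num_edges, index_bytes=4, offset_bytes=8):
    """Size of CSR storage: one offset per row (plus one) and one column index per edge."""
    return (num_vertices + 1) * offset_bytes + num_edges * index_bytes

GIB = 1024 ** 3

# Originally expected size (250 edges per vertex is an assumed midpoint).
planned_vertices = 50_000_000
planned_edges = planned_vertices * 250

# Size that was actually observed.
actual_vertices = 7_000_000
actual_edges = 160_000_000

for label, n, m in [
    ("planned", planned_vertices, planned_edges),
    ("actual", actual_vertices, actual_edges),
]:
    print(
        f"{label}: edge list {edge_list_bytes(m) / GIB:.1f} GiB, "
        f"CSR {csr_bytes(n, m) / GIB:.1f} GiB"
    )
```

Under these assumptions the actual edge list is about 1.3 GB of raw pairs, while the originally planned graph would have needed about 100 GB, which is why the export fits in RAM only at the smaller size.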


