計算機是如何存儲矩陣,如何存儲稀疏鄰接矩陣(How to store sparse adjacency matrix)

如何存儲稀疏鄰接矩陣(How to store sparse adjacency matrix)

我讀了幾個主題，但我迷路了。我對此很陌生。我想存儲巨大的稀疏矩陣并有幾個想法，但可以在它們之間進行選擇。這是我的需求：

鄰接矩陣約。 5000萬個頂點。

每個頂點的最大鄰居數量 - 大約 10 000。

每個頂點的平均鄰居數量 - 約。 200-300。

快速行查詢 - 向量将乘以此矩陣。

O(1)增加邊緣的複雜性。

最有可能的是，邊緣不會被删除。

盡可能快地枚舉與v相鄰的頂點。

可移植性 - 必須有一種方法将基站從一台計算機轉移到另一台計算機。

是以，這是我的想法：

巨大的桌子對(行，col)。非常簡單，但頂點的枚舉将至少為O(log N)，其中N - 表的大小。我覺得它很慢。此外，它必須編入索引。每個RDBMS都會有所幫助。

大量的清單：每個頂點一個清單。枚舉速度非常快，但是存儲它需要多少資源？另外，我不确定在這種情況下使用哪個DBMS：也許是一些NoSql？

巨大的桌子(行|集合)。上面兩個組合。我不确定是否有任何RDBMS支援任意集。你知道任何？也許NoSql在這裡有用嗎？

鄰接清單的集合。任何RDBMS都适用于此，并且複雜性方面的成本很高，但是對于一個頂點，它們可以被多個DB請求殺死。

HDF5 - 我認為由于I / O會很慢。

Neo4j - 據我所知，它将資料存儲在雙連結清單中，是以它實際上與№4相同，我是對的嗎？

請幫助我選擇或提供更好的決定。

如果我在某處估計錯了，請糾正我。

I've read several topics, but I'm lost. I'm quite new to this. I want to store huge sparse matrix and have several idea's but can choose between them. Here's my needs:

Adjacency matrix of approx. 50 million vertices.

Maximum amount of neighbors per one vertex - approx. 10 000.

Average amount of neighbors per one vertex - approx. 200-300.

Fast row query - vector will be multiplied by this matrix.

O(1) complexity to add edge.

Most probably, edges will not be deleted.

Enumeration of the vertices adjacent to v - as fast as possible.

Portability - there must be a way to transfer base from one computer to another.

So, here's my ideas:

Huge table with pairs (row, col). Very simple, but enumeration of vertices will be at least O(log N), where N - size of table. It's quite slow as I think. Also, it must be indexed. Every RDBMS will be good for what.

Enormous amount of lists: one list per vertex. Very fast enumeration, but wouldn't it take much resources to storage this? Also, I'm not sure about which DBMS to use in this case: maybe some NoSql?

Huge table (row | set of cols). Combination of two above. I'm not sure is there any RDBMS to support arbitrary sets. Do you know any? Maybe NoSql will be useful here?

Collection of adjacency lists. Any RDBMS will be suitable for that, and costs in terms of complexity are good, but they can be killed by multiple request to DB for one vertex.

HDF5 - I think it will be slow due to I/O.

Neo4j - As far as I understand, it storages data in double-linked lists, so it will be practically the same as №4, am i right?

Please, help me to choose or offer a better decision.

If I'm wrong with estimates somewhere, please correct me.

原文：https://stackoverflow.com/questions/15003397

更新時間：2020-01-15 14:19

最滿意答案

混合neo4j / hbase方法可以很好地運作，其中neo4j優化了圖形處理方面，而hbase實作了繁重的可擴充性 - 例如，用于存儲大量額外屬性。

neo4j包含節點和關系。它可能具有足夠的可擴充性。我在網上對獨立的非neo4j站點進行的調查在單台機器上聲稱多達數十億個節點/關系，在周遊上比RDBMS具有幾個數量級更好的性能。

但是..如果需要更多的可擴充性，你可以引入hbase big iron來存儲非關系/節點辨別符的額外屬性。然後，隻需将hbase rowkey添加到neo4j節點資訊中，以便在應用程式需要時進行查找。

In the end, I've implemented solution number one.

I used PostgreSQL with two tables: one for edges with two columns - start/end, and another for vertices with unique serial for vertex number and some columns for vertex description.

I've implemented upsert based on pg_advisory_xact_lock. It was a bit slow, but it was enough for me.

Also, it's a pain to delete vertex from this configuration.

To speed up multiplication, I've exported edges table to file. It can even be placed in RAM on x64 machine.

To be fair, the amount of data was less than I expected. Instead of 50 million vertices and average 200-300 edges for 1 vertex there were only 7 million vertices and 160 million edges total.

相關問答

你是什麼意思'删除'？如果您确實想要删除 - 删除，隻需删除已删除節點的行和列，然後重新編号其餘節點。 What do you mean 'delete'? If you really want to delete-delete, just delete the row and column for the node deleted, and renumber the remaining nodes.

在您的問題中的圖算法中，如果給出權重，除了權重之外，通常不會明确地編碼鄰接。相反，圖形被認為是一個完整的圖形(即evey頂點與任何其他頂點相鄰)，但對于非相鄰頂點u ， v在初始圖形中，權重被編碼為“無窮大”，編碼為最大值所使用的資料類型，在計算等中識别的一些負值。使用這種方法，不可行的邊緣将永遠不會被考慮，因為它們比最初問題的任何可行解決方案更昂貴。 In graph algorithms as in your question, if weights are given, usually

...

你仍然可以使用sparse但你将不得不做更多的工作。首先，我們需要将X和Y的标簽轉換為唯一的整數ID。嘗試在組合的X和Y輸入上使用unique ，以便您可以獲得在兩者之間共享的唯一整數ID。具體來說， unique會給你一個輸入的所有唯一條目清單(是以X和Y組合)。我們結合X和Y的原因是因為X有某些令牌可能不會出現在Y ，反之亦然。對組合輸入進行此ID配置設定将確定一緻性。 'stable'标志存在，因為預設情況下unique實際排序所有唯一條目。如果輸入是字元串的單元數組，則按照字典順序

...

你必須建立你的鄰接矩陣，因為如果2與3 ： A(2,3) == 1相鄰，那麼3也必須與2 ： A(3,2) == 1相鄰。您已經建構了鄰接矩陣，使得每個邊隻有一個單向關系。您可以通過将EdgeList的另一列附加到sparse每個輸入來更正此問題： A = sparse([EdgeList(:,1); EdgeList(:,2)], [EdgeList(:,2); EdgeList(:,1)], 1);

或者，您可以轉置A并使用邏輯或( | )将其與A的初始版本組合以強制它對稱。 A = A

...

Cludgy，但嘿... gg1

gg1

...

混合neo4j / hbase方法可以很好地運作，其中neo4j優化了圖形處理方面，而hbase實作了繁重的可擴充性 - 例如，用于存儲大量額外屬性。 neo4j包含節點和關系。它可能具有足夠的可擴充性。我在網上對獨立的非neo4j站點進行的調查在單台機器上聲稱多達數十億個節點/關系，在周遊上比RDBMS具有幾個數量級更好的性能。但是..如果需要更多的可擴充性，你可以引入hbase big iron來存儲非關系/節點辨別符的額外屬性。然後，隻需将hbase rowkey添加到neo4j節點

...

nnz ：稀疏矩陣的非零數 row_size ：矩陣行号 column_size ：矩陣列号它們的空間複雜性有很多種：壓縮稀疏行(CSR)： 2*nnz + row_size記憶體數壓縮稀疏列(CSC)： 2*nnz + column_size記憶體數坐标格式(COO)： 3*nnz記憶體數對于空間複雜性：如果row_size > column_size ，則使用CSC格式，否則，使用CSR格式。對于時間複雜性：對于CSR格式，Row将被O(1)時間索引，Column将被O(log(k)

...

scipy.sparse為你做的。 import numpy as np

from scipy.sparse import dok_matrix

your_data = [((2, 7), 1)]

XDIM, YDIM = 10, 10 # Replace with your values

dct = {}

for (row, col), weight in your_data:

dct[(row, col)] = weight

smat = dok_matrix((XDIM, Y

...

該錯誤來自于類spantree的對象上的使用函數as_adjacency_matrix ，當它需要igraph 。由于您使用的是igraph ，一個簡單的解決方案是使用igraph的函數mst從原始的“距離圖”計算最小生成樹。以下是spantree如何計算最小生成樹： require(vegan)

data(dune)

dis

結果是以下樹( plot(tr, type="t") ) ：隻有igraph函數才能獲得相

...

這是一個簡單的矩陣乘法。 t(m) %*% m

a b c d e

a 2 1 1 0 0

b 1 2 1 0 0

c 1 1 3 1 0

d 0 0 1 1 0

e 0 0 0 0 0

使用這些資料： m = read.table(text = "ID a b c d e

1 1 0 1 0 0

2 0 1 1 0 0

3 0 0 1 1 0

4 1 1 0 0 0", header = T)

m = as.matrix(m[, -1])

這依賴于原始矩陣僅為1和0。如果不是

...