天天看點

二十三、PHP核心探索:翻譯一篇HashTables文章 ☞ PHP中的任何東西都是哈希表

In case you ever heard me talking about PHP internals I certainly mentioned something along the lines of "Everything in PHP is a HashTable" that's not true, but next to a zval the HashTable is one of the most important structures in PHP. While preparing my "PHP Under The Hood" talk for the Dutch PHP Conference there was a question on IRC about extension_loaded() being faster thanfunction_exists(), which might be strange as both of them are simple hash lookups and a hash lookup is said to be O(1). I started to write some slides for it but figured out that I won't have the time to go through it during that presentation, so I'm doing this now, here:

如果你以前看過我寫的PHP核心相關的内容,那麼你一定記得我說過PHP中的任何東西都是哈希表。這不一定是真的,但是除了zval,哈希表确實是PHP中的最重要的結構了。在為Dutch PHP 大會準備"PHP Under The Hood" 演講時,IRC上有個關于extension_loaded()比function_exists()快的問題,這讓人感到疑惑,它們底層都是簡單的哈希查找,而哈希查找的時間複雜度為O(1)。我寫了一些相關的幻燈片,但是那個大會上已經沒有時間了,是以放在了這裡。

You all should know PHP arrays. They allow you to create a list of elements where every element may be identified by a key. This key may either be an integer or a string. Now we need a way to store this association in an effective way in memory. An efficient way to store a collection of data in memory is a "real" array, an array of elements indexed by a sequence of numbers, starting at 0, up to n. As memory is essentially a sequence of numbered storage blocks (this obviously is simplified, there might be segments and offsets, there might be virtual pages, etc.) we can efficiently access an element by knowing the address of the first element, the size of the elements (we assume all have the same size, which is the case here), and the index: The address of the n-th element is start_address + (n * size_of_element). That's basically what C arrays are.

大家都應該知道PHP數組,它可以儲存一系列元素,并且可以用一個key來辨別每個元素,這個key可以是整數或字元串。因而我們需要在記憶體中有一個高效的方案來儲存這種關聯關系。其中一個高效的方案是把記憶體中的這些資料存儲為”真正“的數組,一個從索引0開始到n的連續序列。記憶體從本質上講就是一個連續的存儲塊(這麼說是簡化了的結果,其實這裡有段、偏移以及虛拟頁等概念)。我們可以快速地通路一個元素:隻要我們知道第一個元素的位址,元素的大小(這裡假設每個元素大小一緻)和待查找元素的索引值。第n個元素的位址計算公式為:第一個元素位址 + (n * 元素大小),C語言中的數組就是這麼實作的。

Now we're not dealing with C and C arrays but also want to use string offsets so we need a function to calculate a numeric value, a hash, for each key. An hash function you most likely know is MD5, MD5 is creating a 128 bit numeric value which is often represented using 32 hexadecimal characters. For our purpose 128 bit is a bit much and MD5 is too slow so the PHP developers have chosen the "DJBX33A (Daniel J. Bernstein, Times 33 with Addition)" algorithm. This hash function is relatively fast and gives us an integer of the value, the trouble with this algorithm is that it is more likely to produce conflicts, that means to string values which create the same numeric value.

現在我們處理的不是C語言和它的數組,我們想用字元串偏移量,是以我們需要一個哈希函數,為每個鍵值計算出一個數值。大家最熟悉的哈希函數可能是MD5,它産生一個128比特的數值,且通常用32個十六進制的字元表示。對于我們的目的而言,128比特的數值太大了,MD5計算過程緩慢,因而PHP開發者選擇了"DJBX33A (Daniel J. Bernstein, Times 33 with Addition)"算法。這個哈希函數相對快地從一個字元串計算出一個整數,缺點是沖突的機率增大了,即不同的字元串産生了相同的數值。

Now back to our C array: For being able to safely read any element, to see whether it is used, we need to pre-allocate the entire array with space for all possible elements. Given our hash function returning a system dependent (32 bit or 64 bit) integer this is quite a lot (size of an element multiplied wit the max int value), so PHP does another trick: It throws some digits away. For this a table mask is calculated. The a table mask is a number which is a power of two and then one subtracted and ideally higher than the number elements in the hash table. If one looks at this in binary representation this is a number where all bits are set. Doing a binary AND operation of our hash value and the table this gives us a number which is smaller than our table mask. Let's look at an example: The hash value of "foobar" equals, in decimal digits, to 3127933054. We assume a table mask of 2047 (2¹¹-1).

回到C數組:為了安全地讀到任何一個元素,不管它是否用到,我們需要為所有可能元素預先配置設定空間。假設我們的哈希函數傳回一個系統相關(32位或64位)的整數,數值非常大,這時PHP需要另外一個技巧:抛棄一些比特。這是通過表掩碼實作的,表掩碼是這樣一個數,不小于哈希表的元素個數值且為2的次方減一。如果檢視它的二進制表現形式,即各個位都設定為1的數。把哈希值和表掩碼做二進制與操作,得到一個小于表掩碼的數值。如:"foobar"的十進制格式的哈希值為3127933054,假設表掩碼為2047 (2¹¹-1)。

3127933054 10111010011100000111100001111110

& 2047 00000000000000000000011111111111

= 126 00000000000000000000000001111110

Wonderful - we have an array Index, 126, for this string and can set the value in memory!

好——我們得到這個字元串的數組索引值126,就可以在記憶體中操作了。

If life were that easy. Remember: We used a hashing function which is by far not collision free and then dropped almost two thirds of the binary digits by using the table mask. This makes it rather likely that some collisions appear. These collisions are handled by storing values with the same hash in a linked list. So for accessing the an element one has to

  1. Calculate the hash
  2. Apply the table mask
  3. locate the memory offset
  4. check whether this is the element we need, traverse the linked list.

生活并不如此簡單,我們使用的哈希函數不能避免碰撞,并通過表掩碼抛棄了幾乎三分之二的位,這更增大了碰撞的幾率。解決辦法是把相同哈希值的元素儲存在連結清單中,是以通路一個元素的時候:

  1. 計算哈希值
  2. 應用表掩碼
  3. 找到記憶體位置
  4. 周遊連結清單,檢查是否是查找的元素

Now the question initially asked was why extension_loaded() might be faster than function_exists() and we can tell: For many random reasons and since you have chosen a value which probably conflicts in the first, not in the second. So now the question is how often such collisions happen for this I did a quick analysis: Given the function table of a random PHP build of mine with 1106 functions listed I have 634 unique hash values and 210 hash values calculated from different functions. Worst is the value of 471 which represents 6 functions.