Let's start with a simple program:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int a[1000][1000];

    if (1 == argc)
    {
        for (int i = 0; i < 1000; ++i)
        {
            for (int j = 0; j < 1000; ++j)
            {
                a[i][j] = 0;   /* row-major traversal */
            }
        }
    }
    else
    {
        for (int i = 0; i < 1000; ++i)
        {
            for (int j = 0; j < 1000; ++j)
            {
                a[j][i] = 0;   /* column-major traversal */
            }
        }
    }

    return 0;
}
The program above contains two loop variants: run it with no arguments and it takes the first branch; pass any argument and it takes the second. Which variant is faster? Clearly the first. Why? In C/C++, a two-dimensional array is stored in row-major order, so traversing it row by row exploits spatial locality: consecutive writes land in the same cache line. Let's verify with the time command:
taoge$ time ./a.out
real 0m0.006s
user 0m0.004s
sys 0m0.000s
taoge$ time ./a.out
real 0m0.006s
user 0m0.004s
sys 0m0.000s
taoge$ time ./a.out
real 0m0.006s
user 0m0.004s
sys 0m0.000s
taoge$ time ./a.out 1
real 0m0.009s
user 0m0.004s
sys 0m0.008s
taoge$ time ./a.out 1
real 0m0.010s
user 0m0.004s
sys 0m0.004s
taoge$ time ./a.out 1
real 0m0.010s
user 0m0.004s
sys 0m0.004s
Clearly, the second variant's real time is larger. Let's use perf to find out why:
taoge$ perf stat -e L1-dcache-load-misses ./a.out
Performance counter stats for './a.out':
101,870 L1-dcache-load-misses
0.005415735 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out
Performance counter stats for './a.out':
100,231 L1-dcache-load-misses
0.005486385 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out
Performance counter stats for './a.out':
103,496 L1-dcache-load-misses
0.005329914 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out 1
Performance counter stats for './a.out 1':
1,122,333 L1-dcache-load-misses
0.012910445 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out 1
Performance counter stats for './a.out 1':
1,093,971 L1-dcache-load-misses
0.009197791 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out 1
Performance counter stats for './a.out 1':
1,099,561 L1-dcache-load-misses
0.009234823 seconds time elapsed
The cause is obvious: far too many cache misses. The numbers even fit the theory, at least in order of magnitude: with a 64-byte cache line holding 16 four-byte ints, the row-order pass misses roughly once per 16 writes (1,000,000 / 16 ≈ 62,500, the same order as the ~100,000 measured), while the column-order pass jumps 4000 bytes per write and touches a fresh line almost every time (~1,000,000 misses, matching the ~1.1 million measured).
Connecting theory with practice like this pays off.