Let's start with a simple program:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int a[1000][1000];

    if (1 == argc)
    {
        for (int i = 0; i < 1000; ++i)
        {
            for (int j = 0; j < 1000; ++j)
            {
                a[i][j] = 0;   /* row-major traversal */
            }
        }
    }
    else
    {
        for (int i = 0; i < 1000; ++i)
        {
            for (int j = 0; j < 1000; ++j)
            {
                a[j][i] = 0;   /* column-major traversal */
            }
        }
    }

    return 0;
}
The program above contains two loop variants: run it with no arguments and it takes the first branch; pass any argument and it takes the second. Which variant is faster? Clearly the first. Why? In C/C++, a two-dimensional array is stored in row-major order, so traversing it row by row exploits spatial locality: consecutive writes land in the same cache line. Let's verify with the time command:
taoge$ time ./a.out
real 0m0.006s
user 0m0.004s
sys 0m0.000s
taoge$ time ./a.out
real 0m0.006s
user 0m0.004s
sys 0m0.000s
taoge$ time ./a.out
real 0m0.006s
user 0m0.004s
sys 0m0.000s
taoge$ time ./a.out 1
real 0m0.009s
user 0m0.004s
sys 0m0.008s
taoge$ time ./a.out 1
real 0m0.010s
user 0m0.004s
sys 0m0.004s
taoge$ time ./a.out 1
real 0m0.010s
user 0m0.004s
sys 0m0.004s
Clearly, the second variant's real time is larger. Let's use perf to find out why:
taoge$ perf stat -e L1-dcache-load-misses ./a.out
Performance counter stats for './a.out':
101,870 L1-dcache-load-misses
0.005415735 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out
Performance counter stats for './a.out':
100,231 L1-dcache-load-misses
0.005486385 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out
Performance counter stats for './a.out':
103,496 L1-dcache-load-misses
0.005329914 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out 1
Performance counter stats for './a.out 1':
1,122,333 L1-dcache-load-misses
0.012910445 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out 1
Performance counter stats for './a.out 1':
1,093,971 L1-dcache-load-misses
0.009197791 seconds time elapsed
taoge$ perf stat -e L1-dcache-load-misses ./a.out 1
Performance counter stats for './a.out 1':
1,099,561 L1-dcache-load-misses
0.009234823 seconds time elapsed
The cause is obvious: far too many cache misses. The numbers even fit the theory, at least in order of magnitude: with a 64-byte cache line holding 16 four-byte ints, the row-order pass misses roughly once per 16 writes (1,000,000 / 16 ≈ 62,500, the same order as the ~100,000 measured), while the column-order pass jumps 4000 bytes per write and touches a fresh line almost every time (~1,000,000 misses, matching the ~1.1 million measured).
Connecting theory with practice like this pays off.