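The smallest unit of data exchanged between the CPU cache and main memory is not a byte but a fixed-size block called a cache line. See the Wikipedia article for details: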
<a href="http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure">http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure</a>
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
Igor Ostrovsky has written an excellent article on how the CPU cache affects performance. The following program demonstrates the effect:
cacheline.c
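#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <string.h>

#define BUF_SIZE 8388608   /* 8 MB buffer */
#define LOOPS 16           /* number of passes over the buffer */

/* Align the buffer to a 64-byte cache line boundary. */
char arr[BUF_SIZE] __attribute__((__aligned__((64)),__section__(".data.cacheline_aligned")));

int main(int argc, char **argv)
{
    int step;
    int i = 0;
    int j = 0;
    int iter = 0;

    /* Reject a missing or non-positive step (step 0 would loop forever). */
    if (argc < 2 || (step = atoi(argv[1])) <= 0) {
        fprintf(stderr, "usage: %s <step>\n", argv[0]);
        return 1;
    }

    /* Write one byte every `step` bytes, LOOPS times over. */
    for (i = 0; i < LOOPS; i++) {
        for (j = 0; j < BUF_SIZE; j += step) {
            iter++;
            arr[j] = 3;
        }
    }
    printf("%d\n", iter);
    return 0;
}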
Compile it: gcc -O0 -o cacheline cacheline.c
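Now let's look at how the cache line affects the program's performance. From the definition of a cache line, we can predict that for any step from 1 to 64 the number of cache lines loaded is the same, because every 64-byte line in the buffer is still touched. Once step grows beyond 64, some lines are skipped entirely, so the number of cache lines loaded decreases.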
Here are the results:
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1
134217728
Performance counter stats for './cacheline 1':
2,352,446 L1-dcache-loads-misses # 0.35% of all L1-dcache hits
673,338,076 L1-dcache-load
1,041,209,909 cycles # 0.000 GHz
0.433421077 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 2
67108864
Performance counter stats for './cacheline 2':
2,326,564 L1-dcache-loads-misses # 0.69% of all L1-dcache hits
337,577,957 L1-dcache-load
524,684,462 cycles # 0.000 GHz
0.254773008 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 4
33554432
Performance counter stats for './cacheline 4':
2,309,318 L1-dcache-loads-misses # 1.36% of all L1-dcache hits
169,703,215 L1-dcache-load
255,623,966 cycles # 0.000 GHz
0.154640897 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 64
2097152
Performance counter stats for './cacheline 64':
2,292,510 L1-dcache-loads-misses # 18.64% of all L1-dcache hits
12,299,250 L1-dcache-load
55,040,163 cycles # 0.000 GHz
0.034769960 seconds time elapsed
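From these results we can see that:
i) As step goes from 1 to 64, the L1 cache miss counts are very close, at about 2.3 million in every run.
ii) Execution time does not depend on cache misses alone; other factors matter too, such as the number of CPU cycles spent. The cycle count falls from about 1.04 billion at step 1 to about 55 million at step 64 even though the miss count barely changes, because the loop body runs 64 times fewer iterations.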
Increasing step further:
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 128
1048576
Performance counter stats for './cacheline 128':
1,308,532 L1-dcache-loads-misses # 18.56% of all L1-dcache hits
7,048,673 L1-dcache-load
38,773,055 cycles # 0.000 GHz
0.024586981 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1024
131072
Performance counter stats for './cacheline 1024':
442,176 L1-dcache-loads-misses # 18.21% of all L1-dcache hits
2,427,631 L1-dcache-load
17,618,913 cycles # 0.000 GHz
0.011433279 seconds time elapsed
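The L1 cache miss count drops markedly: from about 2.29 million at step 64 to about 1.31 million at step 128 and about 0.44 million at step 1024.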