An experiment on how the CPU cache line affects performance

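The smallest unit the CPU uses when moving data between the cache and main memory is not a byte but a fixed-size block called a cache line. For details, see the Wikipedia article: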

<a href="http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure">http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure</a>

On Linux, the line size of cpu0's index0 cache can be read from sysfs:

cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size

Igor Ostrovsky has written an excellent article on how the CPU cache affects performance.

cacheline.c


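#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <string.h>

#define BUF_SIZE 8388608   /* 8 MiB buffer */
#define LOOPS 16           /* number of passes over the buffer */

/* Aligned to 64 bytes so that arr[0] sits at the start of a cache line. */
char arr[BUF_SIZE] __attribute__((__aligned__((64)),__section__(".data.cacheline_aligned"))) ;

int main(int argc, char **argv)
{
    int step;
    int i = 0;
    int j = 0;
    int iter = 0;

    /* The step (distance in bytes between two writes) must be a positive integer. */
    if (argc < 2 || (step = atoi(argv[1])) <= 0) {
        fprintf(stderr, "usage: %s step\n", argv[0]);
        return 1;
    }

    for (i = 0; i < LOOPS; i++) {
        /* Write one byte every step bytes across the whole buffer. */
        for (j = 0; j < BUF_SIZE; j += step) {
            iter++;
            arr[j] = 3;
        }
    }

    /* Total number of writes performed. */
    printf("%d\n", iter);
    return 0;
}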

Compile it: gcc -O0 -o cacheline cacheline.c

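Now let's look at how the cache line affects the program's performance. From the definition of a cache line, we can predict that for any step from 1 to 64 the number of cache line loads stays the same, because every 64-byte line of the buffer is still touched at least once per pass. Once step grows beyond 64, whole lines are skipped and the number of cache line loads goes down.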

The measured results:

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1

134217728

Performance counter stats for './cacheline 1':

2,352,446 L1-dcache-loads-misses # 0.35% of all L1-dcache hits

673,338,076 L1-dcache-load

1,041,209,909 cycles # 0.000 GHz

0.433421077 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 2

67108864

Performance counter stats for './cacheline 2':

2,326,564 L1-dcache-loads-misses # 0.69% of all L1-dcache hits

337,577,957 L1-dcache-load

524,684,462 cycles # 0.000 GHz

0.254773008 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 4

33554432

Performance counter stats for './cacheline 4':

2,309,318 L1-dcache-loads-misses # 1.36% of all L1-dcache hits

169,703,215 L1-dcache-load

255,623,966 cycles # 0.000 GHz

0.154640897 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 64

2097152

Performance counter stats for './cacheline 64':

2,292,510 L1-dcache-loads-misses # 18.64% of all L1-dcache hits

12,299,250 L1-dcache-load

55,040,163 cycles # 0.000 GHz

0.034769960 seconds time elapsed

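From these runs we can see that:

i) as step goes from 1 to 64, the L1 cache miss counts are very close to each other (about 2.3 million in every run);

ii) execution time is not determined by cache misses alone; other factors matter as well, such as the number of CPU cycles spent, which falls from about 1.04 billion to about 55 million while the miss count stays flat.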

Now increase step further:

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 128

1048576

Performance counter stats for './cacheline 128':

1,308,532 L1-dcache-loads-misses # 18.56% of all L1-dcache hits

7,048,673 L1-dcache-load

38,773,055 cycles # 0.000 GHz

0.024586981 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1024

131072

Performance counter stats for './cacheline 1024':

442,176 L1-dcache-loads-misses # 18.21% of all L1-dcache hits

2,427,631 L1-dcache-load

17,618,913 cycles # 0.000 GHz

0.011433279 seconds time elapsed

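The L1 cache miss count drops very noticeably: from about 2.29 million at step 64 to about 1.31 million at step 128 and about 0.44 million at step 1024.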
