An experiment on how the CPU cache line affects performance

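The smallest unit of data exchanged between the CPU cache and memory is not a byte but a fixed-size block called a cache line. For details, see the Wikipedia article: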

http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure

cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size 

64
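The same value can also be read from inside a program. The following is a minimal sketch, assuming a Linux system that exposes the sysfs file shown above; the function name read_cache_line_size is hypothetical and not part of the original experiment.

```c
#include <stdio.h>

/* Hypothetical helper: return the integer stored in a sysfs attribute
   such as coherency_line_size, or -1 if the file cannot be read or
   does not start with a number. */
long read_cache_line_size(const char *path)
{
    FILE *f = fopen(path, "r");
    long size = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &size) != 1)
        size = -1;
    fclose(f);
    return size;
}
```

Calling read_cache_line_size("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size") on the machine used here would be expected to return the 64 printed by cat; on systems without that file it returns -1.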

Igor Ostrovsky has written an excellent article on how the CPU cache affects performance.

cacheline.c

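#include <stdio.h>
#include <stdlib.h>  /* atoi */
#include <string.h>

#define BUF_SIZE 8388608  /* 8 MiB */
#define LOOPS 16

/* Align the buffer to a 64-byte cache line boundary. */
char arr[BUF_SIZE] __attribute__((__aligned__((64)),__section__(".data.cacheline_aligned"))) ;

int main(int argc, char **argv)
{
  /* step is the stride, in bytes, between consecutive writes */
  int step = (argc > 1) ? atoi(argv[1]) : 0;
  int i = 0;
  int j = 0;
  int iter = 0;

  /* a missing or non-positive step would crash or loop forever */
  if (step <= 0) {
    fprintf(stderr, "usage: %s step\n", argv[0]);
    return 1;
  }

  for (i = 0; i < LOOPS; i++){
    for (j = 0; j < BUF_SIZE; j += step){
      iter++;
      arr[j] = 3;
    }
  }

  /* print the total number of writes performed */
  printf("%d\n", iter);
  return 0;
}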

Compile it: gcc -O0 -o cacheline cacheline.c

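Now let's look at how the cache line affects the program's performance. From the definition of a cache line, we can predict that for any step from 1 to 64 the number of cache line loads is the same, and that increasing step beyond 64 reduces the number of cache line loads.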
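This prediction can be checked with a small calculation. The sketch below counts how many distinct cache lines one pass of the inner loop touches, assuming the buffer starts on a cache line boundary (as the aligned attribute requests) and that the 8 MiB buffer is too large to stay in L1 between passes. The function name lines_touched is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: count the distinct cache lines touched when
   writing to offsets 0, step, 2*step, ... below buf_size, assuming
   the buffer begins on a cache line boundary. */
size_t lines_touched(size_t buf_size, size_t step, size_t line_size)
{
    size_t count = 0;
    size_t prev_line = (size_t)-1;  /* sentinel: no line visited yet */

    if (step == 0 || line_size == 0)
        return 0;

    for (size_t j = 0; j < buf_size; j += step) {
        size_t line = j / line_size;  /* index of the line holding arr[j] */
        if (line != prev_line) {
            count++;
            prev_line = line;
        }
    }
    return count;
}
```

With BUF_SIZE 8388608 and 64-byte lines this gives 131072 lines per pass for every step from 1 to 64, 65536 for step 128, and 8192 for step 1024. Multiplied by the 16 passes of LOOPS, the idealized miss counts would be 2,097,152, 1,048,576, and 131,072. Real counters also include misses from the rest of the process, so measured values should be somewhat higher.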

The results:

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1

134217728

 Performance counter stats for './cacheline 1':

         2,352,446 L1-dcache-loads-misses    #    0.35% of all L1-dcache hits  

       673,338,076 L1-dcache-load                                              

     1,041,209,909 cycles                    #    0.000 GHz                    

       0.433421077 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 2

67108864

 Performance counter stats for './cacheline 2':

         2,326,564 L1-dcache-loads-misses    #    0.69% of all L1-dcache hits  

       337,577,957 L1-dcache-load                                              

       524,684,462 cycles                    #    0.000 GHz                    

       0.254773008 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 4

33554432

 Performance counter stats for './cacheline 4':

         2,309,318 L1-dcache-loads-misses    #    1.36% of all L1-dcache hits  

       169,703,215 L1-dcache-load                                              

       255,623,966 cycles                    #    0.000 GHz                    

       0.154640897 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 64

2097152

 Performance counter stats for './cacheline 64':

         2,292,510 L1-dcache-loads-misses    #   18.64% of all L1-dcache hits  

        12,299,250 L1-dcache-load                                              

        55,040,163 cycles                    #    0.000 GHz                    

       0.034769960 seconds time elapsed

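We can see that:

  i) As step goes from 1 to 64, the L1 cache miss counts stay very close (about 2.3 million in every run).

  ii) Execution time is not determined by cache misses alone; it also depends on other factors, such as the number of CPU cycles spent. Step 1 takes 0.43 s and step 64 only 0.03 s, even though their miss counts are nearly equal.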
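One way to make point ii concrete is to divide the cycle count by the iteration count that the program prints. The helper below is only an illustrative sketch; cycles_per_iteration is a hypothetical name, and the inputs are the figures reported by perf above.

```c
/* Hypothetical helper: average CPU cycles spent per inner-loop
   iteration, given the perf cycle count and the iteration count
   printed by the test program. */
double cycles_per_iteration(double cycles, double iterations)
{
    return iterations > 0.0 ? cycles / iterations : 0.0;
}
```

Applied to the four runs above, this gives roughly 7.8, 7.8, 7.6, and 26.2 cycles per iteration for steps 1, 2, 4, and 64. For small steps the total time is dominated by the sheer number of iterations, while at step 64 each iteration costs more, plausibly because nearly every write lands on a new cache line.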

Increasing step further:

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 128

1048576

 Performance counter stats for './cacheline 128':

         1,308,532 L1-dcache-loads-misses    #   18.56% of all L1-dcache hits  

         7,048,673 L1-dcache-load                                              

        38,773,055 cycles                    #    0.000 GHz                    

       0.024586981 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1024

131072

 Performance counter stats for './cacheline 1024':

           442,176 L1-dcache-loads-misses    #   18.21% of all L1-dcache hits  

         2,427,631 L1-dcache-load                                              

        17,618,913 cycles                    #    0.000 GHz                    

       0.011433279 seconds time elapsed

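The L1 cache miss count drops sharply: from about 2.29 million at step 64 to 1.31 million at step 128 and 0.44 million at step 1024.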