The implementation of numatop
On 霸爺's recommendation I read through the numatop source code. numatop looks at the system from the NUMA angle, reporting CPU utilization, CPI, memory-access hot spots, RMA, LMA, call stacks, and so on.
Very nice and powerful.
Under the hood numatop works the same way perf does: both observe the various counters through the kernel's perf_event subsystem.
The difference is that numatop aggregates the data per NUMA node.
(See also: the usage of perf, and the numatop homepage.) perf_event works by collecting data from the processor's PMU (Performance Monitoring Unit).
PMU
The Performance Monitoring Unit is a hardware unit provided by the CPU. It exposes a set of event counters and can raise a signal (or POLL_HUP) when a counter overflows.
See the Intel Software Developer's Manual, the chapters from section 18.8 onward.
Running numatop
numatop officially supports kernel 3.8 and above; on the hardware side it supports the Xeon E5/E7 series.
Our development machines run 2.6.32, so the 3.8-only kernel features it uses have to be switched off.
vim intel/wsm.c

static plat_event_config_t s_wsmep_profiling[COUNT_NUM] = {
    { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, 0x53, 0, "cpu_clk_unhalted.core" },
    { PERF_TYPE_RAW, 0x01B7, 0x53, 0x2011, "off_core_response_0" },
    { PERF_TYPE_HARDWARE, 1, 0x53, 0, "cpu_clk_unhalted.ref" },
    { PERF_TYPE_HARDWARE, 2, 0x53, 0, "instr_retired.any" },
    { PERF_TYPE_RAW, INVALID_CODE_UMASK, 0, 0, "off_core_response_1" }
};
Then:
make
./numatop
The main screen
The node view
The thread view within a single process
How does numatop organize the different events on the different CPUs of the different nodes?
How does numatop discover the NUMA layout?
Getting the node list
$ cat /sys/devices/system/node/online
0-1
Getting the CPU list
$ cat /sys/devices/system/node/node0/cpulist
0-3,8-11
$ cat /sys/devices/system/node/node1/cpulist
4-7,12-15
How to get detailed CPU information?
Use the cpuid assembly instruction.
For more on cpuid, see the Intel manual, or http://en.wikipedia.org/wiki/CPUID

__asm volatile(
    "cpuid\n\t"
    : "=a" (*eax),
      "=b" (*ebx),
      "=c" (*ecx),
      "=d" (*edx)
    : "a" (*eax));
This is a piece of inline assembly:
The outputs are bound, in order, to the eax, ebx, ecx and edx registers.
The input is bound to the local variable eax.
cpuid return values - getting the CPU vendor string
When eax=0, cpuid returns the CPU's manufacturer ID string: 12 characters stored, in order, in EBX, EDX and ECX.
"AMDisbetter!" – early engineering samples of AMD K5 processor
"AuthenticAMD" – AMD
"CentaurHauls" – Centaur (Including some VIA CPU)
"CyrixInstead" – Cyrix
"GenuineIntel" – Intel
"TransmetaCPU" – Transmeta
"GenuineTMx86" – Transmeta
"Geode by NSC" – National Semiconductor
"NexGenDriven" – NexGen
"RiseRiseRise" – Rise
"SiS SiS SiS " – SiS
"UMC UMC UMC " – UMC
"VIA VIA VIA " – VIA
"Vortex86 SoC" – Vortex
"KVMKVMKVMKVM" – KVM
"Microsoft Hv" – Microsoft Hyper-V or Windows Virtual PC
"VMwareVMware" – VMware
"XenVMMXenVMM" – Xen HVM
cpuid return values - getting the CPU model and family information
When eax=1, cpuid returns the CPU's stepping, model and family information; there is only one return value, in eax.
The bits of eax are interpreted as follows.
3:0 – Stepping
7:4 – Model
11:8 – Family
13:12 – Processor Type
19:16 – Extended Model
27:20 – Extended Family
For the exact interpretation, consult the corresponding vendor's developer manual.
Getting it via the cpuid tool
There is already a tool of the same name, cpuid, which uses the CPU's cpuid instruction to dump the information for every CPU.
sudo emerge -avt sys-apps/cpuid
Output of cpuid:
CPU 0:
vendor_id = "GenuineIntel"
version information (1/eax):
processor type = primary processor (0)
family = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
model = 0xa (10)
stepping id = 0x9 (9)
extended family = 0x0 (0)
extended model = 0x3 (3)
(simple synth) = Intel Core i3-3000 (Ivy Bridge L1) / i5-3000 (Ivy Bridge E1/N0/L1) / i7-3000 (Ivy Bridge E1) / Mobile Core i3-3000 (Ivy Bridge L1) / i5-3000 (Ivy Bridge L1) / Mobile Core i7-3000 (Ivy Bridge E1/L1) / Xeon E3-1200 v2 (Ivy Bridge E1/N0/L1) / Pentium G1600/G2000/G2100 (Ivy Bridge P0) / Pentium 900/1000/2000/2100 (P0), 22nm
miscellaneous (1/ebx):
process local APIC physical ID = 0x0 (0)
cpu count = 0x10 (16)
CLFLUSH line size = 0x8 (8)
brand index = 0x0 (0)
brand id = 0x00 (0): unknown
feature information (1/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
...
...
...
Very detailed: it also covers the TLB, SYSCALL, caches, and more.
Getting the counters with perf_event_open
See the perf_event_open man page.

// On overflow you can choose SIGIO signal handling, or epoll/select.
// Here we choose SIGIO.
#define _GNU_SOURCE   // for F_SETSIG
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// glibc provides no wrapper for perf_event_open, so call it via syscall(2).
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static long long g_cnt = 0;

// The signal handler, invoked on each counter overflow.
static void perf_event_handler(int signum, siginfo_t *info, void *ucontext) {
    printf("In signal handler, used %lld instructions\n", (++g_cnt) * 1000);
    // Re-arm the counter for one more overflow.
    ioctl(info->si_fd, PERF_EVENT_IOC_REFRESH, 1);
}

int main()
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(struct sigaction));
    sa.sa_sigaction = perf_event_handler;
    sa.sa_flags = SA_SIGINFO;
    // Register the SIGIO handler:
    // when the counter overflows, the kernel sends SIGIO to the process.
    sigaction(SIGIO, &sa, NULL);

    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    // Count retired instructions.
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    // Event is initially disabled.
    pe.disabled = 1;
    pe.sample_type = PERF_SAMPLE_IP;
    // Sampling period: the counter register is set to 1000 and decremented
    // by 1 for each event; when it reaches 0, a signal is triggered.
    pe.sample_period = 1000;
    // Exclude events that happen in the kernel.
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    // Count the current process.
    pid_t pid = 0;
    // Count on all CPUs.
    int cpu = -1;
    // This event is the group leader.
    // Several events can be enabled at once: the leader passes group_fd = -1,
    // the other events pass the fd that the leader's perf_event_open returned.
    // The events of one group are only counted when all of them can be
    // scheduled on the PMU together.
    int group_fd = -1;
    unsigned long flags = 0;
    // Open the event.
    int fd = perf_event_open(&pe, pid, cpu, group_fd, flags);
    // Deliver overflows as a signal.
    fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC);
    fcntl(fd, F_SETSIG, SIGIO);
    fcntl(fd, F_SETOWN, getpid());
    // Reset the counter to 0 and enable it for one overflow.
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);
    // Generate some payload.
    long loopCount = 1000000;
    long c = 0;
    long i = 0;
    for (i = 0; i < loopCount; i++) {
        c += 1;
    }
    // Disable the event.
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    // Read the final counter value.
    long long counter;
    read(fd, &counter, sizeof(long long));
    printf("Used %lld instructions\n", counter);
    close(fd);
}
Appendix: the dot script for the diagram
dot -Tpng "/tmp/numa_cpu_node.gv" > "/tmp/numa_cpu_node.png"
// /tmp/numa_cpu_node.gv
digraph G
{
node [shape=record, style=filled];
group[fillcolor=green,label="{node_group|<node>node_t [64]}"];
node1[fillcolor=orange, label="{node_t|<cpu>perf_cpu_t cpus[64]|int nid}"];
node2[fillcolor=orange, label="{node_t|<cpu>perf_cpu_t cpus[64]|int nid}"];
cpu11[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu12[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu13[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu21[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu22[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu23[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
fd11[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd12[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd13[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd21[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd22[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd23[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
group -> node1;
group -> node2;
node1:cpu -> cpu11;
cpu11:fds -> fd11;
node1:cpu -> cpu12;
cpu12:fds -> fd12;
node1:cpu -> cpu13;
cpu13:fds -> fd13;
node2:cpu -> cpu21;
cpu21:fds -> fd21;
node2:cpu -> cpu22;
cpu22:fds -> fd22;
node2:cpu -> cpu23;
cpu23:fds -> fd23;
}