The implementation of numatop
On 霸爺's recommendation I read through the numatop source code. numatop looks at the system from the NUMA angle, reporting CPU utilization, CPI, memory-access hot spots, RMA, LMA, call stacks, and so on.
Very nice and powerful.
Under the hood numatop works the same way perf does: both observe the various counters through the kernel's perf_event subsystem.
The difference is that numatop aggregates the data per NUMA node.
(See also: the usage of perf, and the numatop homepage.) perf_event works by collecting data from the processor's PMU (Performance Monitoring Unit).
PMU
The Performance Monitoring Unit is a hardware unit provided by the CPU. It exposes a set of event counters and can raise a signal (or POLL_HUP) when a counter overflows.
See the Intel Software Developer's Manual, the chapters from section 18.8 onward.
Running numatop
numatop officially supports kernel 3.8 and above; on the hardware side it supports the Xeon E5/E7 series.
Our development machines run 2.6.32, so the 3.8-only kernel features it uses have to be switched off.
vim intel/wsm.c

static plat_event_config_t s_wsmep_profiling[COUNT_NUM] = {
    { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, 0x53, 0, "cpu_clk_unhalted.core" },
    { PERF_TYPE_RAW, 0x01B7, 0x53, 0x2011, "off_core_response_0" },
    { PERF_TYPE_HARDWARE, 1, 0x53, 0, "cpu_clk_unhalted.ref" },
    { PERF_TYPE_HARDWARE, 2, 0x53, 0, "instr_retired.any" },
    { PERF_TYPE_RAW, INVALID_CODE_UMASK, 0, 0, "off_core_response_1" }
};
Then:
make
./numatop
The main screen
The node view
The thread view within a single process
How does numatop organize the different events on the different CPUs of the different nodes?
How does numatop discover the NUMA layout?
Getting the node list
$ cat /sys/devices/system/node/online
0-1
Getting the CPU list
$ cat /sys/devices/system/node/node0/cpulist
0-3,8-11
$ cat /sys/devices/system/node/node1/cpulist
4-7,12-15
How to get detailed CPU information?
Use the cpuid assembly instruction.
For more on cpuid, see the Intel manual, or http://en.wikipedia.org/wiki/CPUID

__asm volatile(
    "cpuid\n\t"
    : "=a" (*eax),
      "=b" (*ebx),
      "=c" (*ecx),
      "=d" (*edx)
    : "a" (*eax));
This is a piece of inline assembly:
The outputs are bound, in order, to the eax, ebx, ecx and edx registers.
The input is bound to the local variable eax.
cpuid return values - getting the CPU vendor string
When eax=0, cpuid returns the CPU's manufacturer ID string: 12 characters stored, in order, in EBX, EDX and ECX.
"AMDisbetter!" – early engineering samples of AMD K5 processor
"AuthenticAMD" – AMD
"CentaurHauls" – Centaur (Including some VIA CPU)
"CyrixInstead" – Cyrix
"GenuineIntel" – Intel
"TransmetaCPU" – Transmeta
"GenuineTMx86" – Transmeta
"Geode by NSC" – National Semiconductor
"NexGenDriven" – NexGen
"RiseRiseRise" – Rise
"SiS SiS SiS " – SiS
"UMC UMC UMC " – UMC
"VIA VIA VIA " – VIA
"Vortex86 SoC" – Vortex
"KVMKVMKVMKVM" – KVM
"Microsoft Hv" – Microsoft Hyper-V or Windows Virtual PC
"VMwareVMware" – VMware
"XenVMMXenVMM" – Xen HVM
cpuid return values - getting the CPU model and family information
When eax=1, cpuid returns the CPU's stepping, model and family information; there is only one return value, in eax.
The bits of eax are interpreted as follows.
3:0 – Stepping
7:4 – Model
11:8 – Family
13:12 – Processor Type
19:16 – Extended Model
27:20 – Extended Family
For the exact interpretation, consult the corresponding vendor's developer manual.
Getting it via the cpuid tool
There is already a tool of the same name, cpuid, which uses the CPU's cpuid instruction to dump the information for every CPU.
sudo emerge -avt sys-apps/cpuid
Output of cpuid:
CPU 0:
vendor_id = "GenuineIntel"
version information (1/eax):
processor type = primary processor (0)
family = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
model = 0xa (10)
stepping id = 0x9 (9)
extended family = 0x0 (0)
extended model = 0x3 (3)
(simple synth) = Intel Core i3-3000 (Ivy Bridge L1) / i5-3000 (Ivy Bridge E1/N0/L1) / i7-3000 (Ivy Bridge E1) / Mobile Core i3-3000 (Ivy Bridge L1) / i5-3000 (Ivy Bridge L1) / Mobile Core i7-3000 (Ivy Bridge E1/L1) / Xeon E3-1200 v2 (Ivy Bridge E1/N0/L1) / Pentium G1600/G2000/G2100 (Ivy Bridge P0) / Pentium 900/1000/2000/2100 (P0), 22nm
miscellaneous (1/ebx):
process local APIC physical ID = 0x0 (0)
cpu count = 0x10 (16)
CLFLUSH line size = 0x8 (8)
brand index = 0x0 (0)
brand id = 0x00 (0): unknown
feature information (1/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
...
...
...
Very detailed: it also covers the TLB, SYSCALL, caches, and more.
Getting the counters with perf_event_open
See the perf_event_open man page.

// On overflow you can choose SIGIO signal handling, or epoll/select.
// Here we choose SIGIO.
#define _GNU_SOURCE   // for F_SETSIG
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// glibc provides no wrapper for perf_event_open, so call it via syscall(2).
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static long long g_cnt = 0;

// The signal handler, invoked on each counter overflow.
static void perf_event_handler(int signum, siginfo_t *info, void *ucontext) {
    printf("In signal handler, used %lld instructions\n", (++g_cnt) * 1000);
    // Re-arm the counter for one more overflow.
    ioctl(info->si_fd, PERF_EVENT_IOC_REFRESH, 1);
}

int main()
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(struct sigaction));
    sa.sa_sigaction = perf_event_handler;
    sa.sa_flags = SA_SIGINFO;
    // Register the SIGIO handler:
    // when the counter overflows, the kernel sends SIGIO to the process.
    sigaction(SIGIO, &sa, NULL);

    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    // Count retired instructions.
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    // Event is initially disabled.
    pe.disabled = 1;
    pe.sample_type = PERF_SAMPLE_IP;
    // Sampling period: the counter register is set to 1000 and decremented
    // by 1 for each event; when it reaches 0, a signal is triggered.
    pe.sample_period = 1000;
    // Exclude events that happen in the kernel.
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    // Count the current process.
    pid_t pid = 0;
    // Count on all CPUs.
    int cpu = -1;
    // This event is the group leader.
    // Several events can be enabled at once: the leader passes group_fd = -1,
    // the other events pass the fd that the leader's perf_event_open returned.
    // The events of one group are only counted when all of them can be
    // scheduled on the PMU together.
    int group_fd = -1;
    unsigned long flags = 0;
    // Open the event.
    int fd = perf_event_open(&pe, pid, cpu, group_fd, flags);
    // Deliver overflows as a signal.
    fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC);
    fcntl(fd, F_SETSIG, SIGIO);
    fcntl(fd, F_SETOWN, getpid());
    // Reset the counter to 0 and enable it for one overflow.
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);
    // Generate some payload.
    long loopCount = 1000000;
    long c = 0;
    long i = 0;
    for (i = 0; i < loopCount; i++) {
        c += 1;
    }
    // Disable the event.
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    // Read the final counter value.
    long long counter;
    read(fd, &counter, sizeof(long long));
    printf("Used %lld instructions\n", counter);
    close(fd);
}
Appendix: the dot script for the diagram
dot -Tpng "/tmp/numa_cpu_node.gv" > "/tmp/numa_cpu_node.png"
// /tmp/numa_cpu_node.gv
digraph G
{
node [shape=record, style=filled];
group[fillcolor=green,label="{node_group|<node>node_t [64]}"];
node1[fillcolor=orange, label="{node_t|<cpu>perf_cpu_t cpus[64]|int nid}"];
node2[fillcolor=orange, label="{node_t|<cpu>perf_cpu_t cpus[64]|int nid}"];
cpu11[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu12[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu13[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu21[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu22[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
cpu23[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
fd11[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd12[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd13[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd21[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd22[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
fd23[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
group -> node1;
group -> node2;
node1:cpu -> cpu11;
cpu11:fds -> fd11;
node1:cpu -> cpu12;
cpu12:fds -> fd12;
node1:cpu -> cpu13;
cpu13:fds -> fd13;
node2:cpu -> cpu21;
cpu21:fds -> fd21;
node2:cpu -> cpu22;
cpu22:fds -> fd22;
node2:cpu -> cpu23;
cpu23:fds -> fd23;
}