Home page: https://perf.wiki.kernel.org/index.php/Main_Page
Usage: http://www.brendangregg.com/perf.html (an introduction to perf, the system-level performance analysis tool, and how to use it)
Source: https://github.com/torvalds/linux/tree/master/tools/perf/
- statistics/count: increment an integer counter on events
- sample: collect details (eg, instruction pointer or stack) from a subset of events (once every …)
- trace: collect details from every event
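The three modes above can be illustrated without perf itself. The following is a minimal sketch on a made-up event stream; the function names (events, count_events, sample_events, trace_events) and the data are hypothetical and only show how much detail each mode keeps.

```shell
# Synthetic event stream: one line per event, "<event> <instruction pointer>".
# The values are invented for illustration; this is not perf output.
events() {
    printf '%s\n' \
        'cycles 0x10' 'cycles 0x14' 'cycles 0x18' \
        'cycles 0x1c' 'cycles 0x20' 'cycles 0x24'
}

# count: keep only an integer counter, no per-event detail.
count_events() { awk 'END { print NR }'; }

# sample: keep details for one event out of every N (N is the first argument).
sample_events() { awk -v n="$1" 'NR % n == 0 { print $2 }'; }

# trace: keep details for every event.
trace_events() { awk '{ print $2 }'; }

events | count_events        # prints 6
events | sample_events 3     # prints 0x18 and 0x24
events | trace_events        # prints all six instruction pointers
```

Counting keeps the least data and tracing the most, which is why the cost grows in the same order.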
Operations on events:
Counting (stat); live analysis (top);
Sampling (record); text dump (script); built-in analysis (report); assembly-level analysis (annotate)
Custom events (probe)
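1. Performance profiling
perf record samples events and stores the result in perf.data in the current directory, in a binary format.
perf script dumps that data as text.
perf report analyzes perf.data and produces a report.
perf diff compares two data files, perf.data.old and perf.data, and shows the difference for each function.
perf record -a --call-graph dwarf -p 29052
perf report --call-graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > perf.svg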
Flame graph generator source:
https://github.com/brendangregg/FlameGraph
Flame graph background:
https://cacm.acm.org/magazines/2016/6/202665-the-flame-graph/fulltext
https://nicedoc.io/brendangregg/FlameGraph
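In the flame graph pipeline, the step between perf script and flamegraph.pl folds each sampled stack into one line of the form root;...;leaf count. The sketch below shows that idea on a heavily simplified, hypothetical stand-in for perf script output. It is not a replacement for stackcollapse-perf.pl, and the names sample_perf_script and fold_stacks are made up for this example.

```shell
# Simplified stand-in for `perf script` output (hypothetical data): a header
# line, then indented frames listed leaf-first, then a blank line.
sample_perf_script() {
    printf '%s\n' \
        'app 1234 100.0: 1 cycles:' \
        '    401000 foo (/bin/app)' \
        '    401100 main (/bin/app)' \
        '' \
        'app 1234 100.1: 1 cycles:' \
        '    401000 foo (/bin/app)' \
        '    401100 main (/bin/app)' \
        ''
}

# Fold every record into "comm;root;...;leaf" and count identical stacks.
# Frames arrive leaf-first, so each new frame is prepended to the stack.
fold_stacks() {
    awk '
    /^[^ \t]/ { comm = $1; stack = ""; next }
    NF >= 2   { stack = (stack == "" ? $2 : $2 ";" stack); next }
    NF == 0 && comm != "" { count[comm ";" stack]++; comm = "" }
    END {
        if (comm != "") count[comm ";" stack]++
        for (s in count) print s, count[s]
    }' | sort
}

sample_perf_script | fold_stacks   # prints: app;main;foo 2
```

Each folded line becomes one column of the flame graph, and the count determines its width.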
1.1 record
Records event samples to a file. Each sample can carry the following fields:
comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff, srcline, period
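1.2 report
Results can be aggregated along different dimensions: --sort selects the sort keys.
Results can be filtered by different conditions: -c --pid= --tid=
How to read the report output: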
https://zh-blog.logan.tw/2019/10/06/intro-to-perf-events-and-call-graph/
https://www.man7.org/linux/man-pages/man1/perf-report.1.html
The overhead can be shown in two columns as Children and Self when perf collects callchains. The self overhead is simply calculated by adding all period values of the entry - usually a function (symbol). This is the value that perf shows traditionally and sum of all the self overhead values should be 100%. The children overhead is calculated by adding all period values of the child functions so that it can show the total overhead of the higher level functions even if they don’t directly execute much. Children here means functions that are called from another (parent) function.
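Whether callees are included: the default is --children, which sorts entries by their overhead including the functions they call.
To ignore callee overhead and look only at each function's own cost, pass --no-children: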
sudo perf report --no-children
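The difference between the two columns can be worked through on folded stacks. This is a sketch under assumptions: the sample data and the names folded and overhead are hypothetical, and the awk logic only mimics the definition quoted above (Self counts periods where the function is the leaf, Children counts every stack the function appears in).

```shell
# Folded stacks: "root;...;leaf <period>". Hypothetical sample data.
folded() {
    printf '%s\n' \
        'main;bar;foo 60' \
        'main;bar 40'
}

# Print the Children and Self percentage for every function.
overhead() {
    awk '
    {
        n = split($1, frame, ";")
        total += $2
        self[frame[n]] += $2                 # the leaf frame owns the period
        split("", seen)                      # reset per-stack de-duplication
        for (i = 1; i <= n; i++) {
            if (!(frame[i] in seen)) {
                seen[frame[i]] = 1
                children[frame[i]] += $2     # every frame on the stack is credited once
            }
        }
    }
    END {
        for (f in children)
            printf "%s children=%.0f%% self=%.0f%%\n", f, 100 * children[f] / total, 100 * self[f] / total
    }' | sort
}

folded | overhead
# bar children=100% self=40%
# foo children=60% self=60%
# main children=100% self=0%
```

The Self column adds up to 100%, while main shows 100% under Children even though it executes almost nothing itself.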
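caller-based graph vs callee-based graph:
Without --children the default is callee-based, and -G selects caller-based (the perf-report man page notes that the default order becomes caller when --children is in effect). The callee-based graph takes the callee's point of view: for each function it shows which functions called it and the percentage attributed to each. The caller-based graph takes the caller's point of view: it shows which functions each function calls.
Because of a bug, perf report -G may not work correctly (see "Inverted call-graph broken?").
As a workaround, FlameGraph can draw the flipped view. --reverse merges stacks starting from the leaf frame, and --inverted draws the graph top-down as an icicle:
./FlameGraph/flamegraph.pl --inverted --reverse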
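What reversing a stack means can be shown on folded input. This is only a sketch of the idea, not flamegraph.pl's implementation; reverse_stacks and the sample stacks are hypothetical.

```shell
# Reverse the frame order of each folded stack ("a;b;c N" becomes "c;b;a N").
reverse_stacks() {
    awk '{
        n = split($1, frame, ";")
        out = frame[n]
        for (i = n - 1; i >= 1; i--) out = out ";" frame[i]
        print out, $2
    }'
}

printf '%s\n' 'main;bar;foo 60' 'main;baz;foo 40' | reverse_stacks
# foo;bar;main 60
# foo;baz;main 40
```

After the reversal both stacks share foo as their root, so a graph built from them branches from foo into the two paths that reached it.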
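1.3 script
Besides dumping text for your own programs to process, perf also ships some ready-made scripts.
sudo perf script -l  # list the built-in scripts currently available
Source directory: https://github.com/torvalds/linux/blob/master/tools/perf/scripts/
For example, the following generates a flame graph with a built-in script (use -l to check whether your perf version supports it):
perf record -a -g -F 99 sleep 60
perf script report flamegraph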
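2. stat
Counts occurrences of events. Individual events are not recorded, only counters are maintained, so the overhead is lower than that of record.
sudo perf stat -d perf_test
sudo perf list  # list the supported events: block/ext4/syscalls/net/sched/kmem/task
sudo perf stat -e 'syscalls:sys_enter_*' command  # count the system calls made by command
sudo perf stat -e 'sched:*' -p PID  # count the scheduler events of a process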
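3. top
-e: select the performance event
-a: show statistics for all CPUs
-C: show statistics for the specified CPUs
-p: target process PID
-t: target thread TID
-K: hide kernel symbols
-U: hide user-space symbols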
4. pstack
Sample the stacks of a process.
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks,
# at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
perf script
5. off-cpu
http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch \
    -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
sudo perf inject -v -s -i perf.data.raw -o perf.data
sudo perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
    ./stackcollapse.pl | \
    ./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg
6. probe
Probes: This command defines dynamic tracepoint events, by symbol and registers without debuginfo, or by C expressions (C line numbers, C function names, and C local variables) with debuginfo.
How it works: Linux tracing – kprobe, uprobe and tracepoint
Given an address inside the ELF file, an event is configured under /sys/kernel/debug/tracing/events/uprobes; when the program is loaded, the instruction at that address is patched so that the probe's recording code runs.
Usage:
Uprobe-tracer: Uprobe-based Event Tracing
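The event that ends up in the tracing filesystem is a single text line of the form p:GROUP/EVENT PATH:OFFSET. The sketch below only builds such a line; uprobe_def and all four argument values are placeholders invented for illustration, and nothing is written to the kernel.

```shell
# Build a uprobe definition in the "p:GROUP/EVENT PATH:OFFSET" form.
# All four arguments are placeholders supplied by the caller.
uprobe_def() {
    printf 'p:%s/%s %s:%s\n' "$1" "$2" "$3" "$4"
}

uprobe_def probe_demo my_func /path/to/binary 0x1234
# prints: p:probe_demo/my_func /path/to/binary:0x1234

# Registering the probe would mean appending that line, as root, to
# /sys/kernel/debug/tracing/uprobe_events; that step is deliberately not run here.
```

perf probe normally derives the offset from the symbol table or debuginfo and writes this line for you, which is what its verbose output shows.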
6.1 Kernel probes
sudo perf probe --add tcp_sendmsg  # --add is optional; it is the default action
sudo perf probe -l  # list the tracepoints added so far
sudo perf probe -d tcp_sendmsg  # delete the tracepoint
sudo perf probe -V tcp_sendmsg --externs  # list the variables available at tcp_sendmsg
sudo perf probe -L tcp_sendmsg  # list the source lines that can be probed
sudo perf probe 'tcp_sendmsg %ax %dx %cx'  # add a tracepoint and record register values
sudo perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'  # record the variables available at that point
sudo perf probe 'tcp_sendmsg%return $retval'  # tracepoint on return, recording the return value
sudo perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
6.2 User-space probes
6.2.1 C
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
[admin@i32f09086.sqa.eu95 /home/admin]
$sudo perf probe -x binary_path -V tc_new
Available variables at tc_new
        @<tc_new+0>
                size_t  size
[admin@i32f09086.sqa.eu95 /home/admin]
$sudo perf probe -x binary_path --add tc_delete -v
Open Debuginfo file:binary_path
Try to find probe point from debuginfo.
Probe point found: tc_delete+0
Found 1 probe_trace_events.
Opening /sys/kernel/debug/tracing/uprobe_events write=1
Added new event:
Writing event: p:probe_pangu/tc_delete binary_path:0x240f910
Failed to write event: Invalid argument
  Error: Failed to add events. Reason: Invalid argument (Code: -22)
Relevant source code:
https://github.com/torvalds/linux/blob/master/tools/perf/util/probe-file.c
Copying the binary into the current directory resolved the problem.
$sudo perf probe -x binary_path --add "tc_new size"
Added new event:
  probe_pangu:tc_new (on tc_new in binary_path with size)
You can now use it in all perf tools, such as:
        perf record -e probe_pangu:tc_new -aR sleep 1
$sudo perf probe -l
  probe_pangu:tc_new (on tc_new@src/tcmalloc.cc in binarypath with size)
$sudo perf record -e probe_pangu:tc_new -aR sleep 1
6.2.2 C++ support
http://notes.secretsauce.net/notes/2019/12/16_c-probes-with-perf.html
https://stackoverflow.com/questions/20172446/cant-add-perf-probe-for-c-methods
Approach: use the function names as they appear before demangling (the mangled names).
sudo perf probe -x \
    binary_path --funcs \
    --no-demangle --filter '*' >rg_funcs  # list all functions
Then set the probe point using the mangled function name.
6.3 Syntax
sudo perf probe --help
PROBE SYNTAX
Probe points are defined by following syntax.

1) Define event based on function name
   [EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]
2) Define event based on source file with line number
   [EVENT=]SRC:ALN [ARG ...]
3) Define event based on source file with lazy pattern
   [EVENT=]SRC;PTN [ARG ...]

EVENT specifies the name of new event, if omitted, it will be set the name of the probed function. Currently, event group name is set as probe.
FUNC specifies a probed function name, and it may have one of the following options; +OFFS is the offset from function entry address in bytes, :RLN is the relative-line number from function entry line, and %return means that it probes function return. And ;PTN means lazy matching pattern (see LAZY MATCHING). Note that ;PTN must be the end of the probe point definition. In addition, @SRC specifies a source file which has that function.
It is also possible to specify a probe point by the source line number or lazy matching by using SRC:ALN or SRC;PTN syntax, where SRC is the source file path, :ALN is the line number and ;PTN is the lazy matching pattern.
ARG specifies the arguments of this probe point, (see PROBE ARGUMENT).

PROBE ARGUMENT
Each probe argument follows below syntax.

   [NAME=]LOCALVAR|$retval|%REG|@SYMBOL[:TYPE]

NAME specifies the name of this argument (optional). You can use the name of local variable, local data structure member (e.g. var->field, var.field2), local array with fixed index (e.g. array[1], var->array[0], var->pointer[2]), or kprobe-tracer argument format (e.g. $retval, %ax, etc). Note that the name of this argument will be set as the last member name if you specify a local data structure member (e.g. field2 for var->field1.field2.)
$vars special argument is also available for NAME, it is expanded to the local variables which can access at given probe point.
TYPE casts the type of this argument (optional). If omitted, perf probe automatically set the type based on debuginfo. You can specify string type only for the local variable or structure member which is an array of or a pointer to char or unsigned char type.
On x86 systems %REG is always the short form of the register: for example %AX. %RAX or %EAX is not valid.

FILTER PATTERN
The filter pattern is a glob matching pattern(s) to filter variables. In addition, you can use "!" for specifying filter-out rule. You also can give several rules combined with "&" or "|", and fold those rules as one rule by using "(" ")".
e.g. With --filter "foo* | bar*", perf probe -V shows variables which start with "foo" or "bar". With --filter "!foo* & *bar", perf probe -V shows variables which don’t start with "foo" and end with "bar", like "fizzbar". But "foobar" is filtered out.
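The second filter example can be checked by hand with ordinary shell globs. This is a sketch that emulates the rule "!foo* & *bar" with a case statement; keep_var is a hypothetical helper and shell glob matching merely stands in for perf's own pattern matcher.

```shell
# Emulate the rule "!foo* & *bar": keep names that do not start with "foo"
# and do end with "bar".
keep_var() {
    case "$1" in
        foo*) return 1 ;;   # rejected by !foo*
        *bar) return 0 ;;   # accepted by *bar
        *)    return 1 ;;   # does not end with "bar"
    esac
}

for name in fizzbar foobar fizz; do
    if keep_var "$name"; then
        echo "$name kept"
    else
        echo "$name filtered out"
    fi
done
# fizzbar kept
# foobar filtered out
# fizz filtered out
```

Both conditions have to hold because the rules are joined with "&"; with "|" either one would be enough.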
7. Extensions
strace + tcpdump + blktrace + cpu profile + dynamic tracing