
Performance Tuning Tool: perf

December 19, 2020

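Home page: https://perf.wiki.kernel.org/index.php/Main_Page

Usage: http://www.brendangregg.com/perf.html and "系统级性能分析工具perf的介绍与使用" (An Introduction to the System-Level Performance Analysis Tool perf and Its Use)

Source: https://github.com/torvalds/linux/tree/master/tools/perf/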

  • statistics/count: increment an integer counter on events
  • sample: collect details (eg, instruction pointer or stack) from a subset of events (once every …)
  • trace: collect details from every event

perf_events_map.png

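Operations on events:

Counting (stat); real-time analysis (top);

Sampling (record); text analysis (script); built-in analysis (report); assembly-level analysis (annotate)

Custom events (probe)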

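1. Performance profiling

perf record samples events and stores the result in the file perf.data in the current directory, in a binary format.

perf script converts the data to text form.

perf report analyzes perf.data and produces an analysis report.

perf diff compares the data in two files, perf.data.old and perf.data, and shows how each function differs between them.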

perf record --call-graph dwarf -p 29052
perf report --call-graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > perf.svg
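
The before/after comparison with perf diff described above can be sketched as follows. This is a minimal sketch under stated assumptions: `./perf_test` is a placeholder workload, `before.data` and `after.data` are arbitrary file names chosen here (explicit names avoid relying on perf renaming an existing perf.data to perf.data.old), and `DRY_RUN` is a hypothetical switch of this sketch that prints the commands instead of running them.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
# DRY_RUN is a convention of this sketch, not a perf feature.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Profile the same workload twice and compare the two profiles.
profile_and_diff() {
    workload=$1
    # Baseline profile, with call graphs.
    run perf record -g -o before.data -- "$workload"
    # (change the code or the configuration between the two runs)
    run perf record -g -o after.data -- "$workload"
    # Baseline file first, then the file to compare against it.
    run perf diff before.data after.data
}
```

Running `DRY_RUN=1 profile_and_diff ./perf_test` only prints the three perf commands, which is a cheap way to check the invocation before profiling for real.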

Flame graph generator source:

https://github.com/brendangregg/FlameGraph

Flame graphs:

https://cacm.acm.org/magazines/2016/6/202665-the-flame-graph/fulltext

https://nicedoc.io/brendangregg/FlameGraph

1.1 record

Records sampled event data to a file. Each sample can include the following fields:

comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff, srcline, period

1.2 report

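Results can be aggregated along different dimensions: --sort selects the dimensions to aggregate by.

Results can be filtered by different conditions: -c (--comms=), --pid=, --tid=.

Interpreting report output: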

https://zh-blog.logan.tw/2019/10/06/intro-to-perf-events-and-call-graph/

https://www.man7.org/linux/man-pages/man1/perf-report.1.html

       The overhead can be shown in two columns as Children and Self when
       perf collects callchains. The self overhead is simply calculated by
       adding all period values of the entry - usually a function (symbol).
       This is the value that perf shows traditionally and sum of all the
       self overhead values should be 100%.

       The children overhead is calculated by adding all period values of
       the child functions so that it can show the total overhead of the
       higher level functions even if they don’t directly execute much.
       Children here means functions that are called from another (parent)
       function.

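Whether child functions are included: the default is --children, and the results are sorted by overhead including that of child functions.

To ignore the overhead of child functions and look only at each function's own overhead, specify --no-children: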

sudo perf report --no-children
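
To make the Self and Children columns concrete, here is a minimal sketch that recomputes both from folded stacks (one `root;...;leaf COUNT` line per stack, the format assumed to come from stackcollapse-perf.pl). It is an illustration of the definitions quoted above, not how perf itself is implemented; the function name `self_and_children` is made up for this example.

```shell
# Read folded stacks on stdin and print, per function, the share of
# samples where it appears anywhere on the stack (children) and the
# share where it is the leaf frame (self).
self_and_children() {
    awk '
    {
        count = $NF
        stack = $0
        sub(/[ \t]+[0-9]+$/, "", stack)   # strip the trailing sample count
        n = split(stack, frames, ";")
        total += count
        self[frames[n]] += count          # only the leaf frame was executing
        split("", seen)                   # reset the per-stack set
        for (i = 1; i <= n; i++) {
            if (!(frames[i] in seen)) {   # count a recursive function once
                seen[frames[i]] = 1
                children[frames[i]] += count
            }
        }
    }
    END {
        for (f in children)
            printf "%s children=%.1f%% self=%.1f%%\n", f, 100 * children[f] / total, 100 * self[f] / total
    }' | sort
}
```

For the three stacks `main;foo;bar 30`, `main;foo 10` and `main;baz 60`, main gets children=100.0% but self=0.0%, while foo gets children=40.0% and self=10.0%. The self values add up to 100%, as the man page excerpt states, and sorting by children puts main at the top even though it executes almost nothing itself.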

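caller-based graph vs. callee-based graph

According to the perf-report man page, the call-graph order defaults to callee, or to caller when --children is in effect; -G (--inverted) selects caller-based. A callee-based graph takes the callee's point of view: for each function it shows which functions call it and the percentage each accounts for. A caller-based graph takes the caller's point of view: it shows which functions it calls.

Because of a bug, perf report -G may not work correctly; see "Inverted call-graph broken?".

A caller-based graph can still be generated with FlameGraph: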

./FlameGraph/flamegraph.pl --inverted --reverse

1.3 script

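Besides producing text output that you can process with your own programs, perf also ships some ready-made scripts.

sudo perf script -l # list the built-in scripts currently supported

Source directory: https://github.com/torvalds/linux/blob/master/tools/perf/scripts/

For example, the following generates a flame graph with a built-in script (use -l to check whether your perf version supports it):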

perf record -a -g -F 99 sleep 60
perf script report flamegraph

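2. stat

Counts occurrences of events. It does not record the individual events; it only maintains counters, so its overhead is lower than that of record.

sudo perf stat -d perf_test
sudo perf list # list the supported events: block/ext4/syscalls/net/sched/kmem/task
sudo perf stat -e 'syscalls:sys_enter_*'  command # count the system calls a command makes
sudo perf stat -e 'sched:*' -p PID # count the scheduler events of a process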
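
Because stat yields only counters, its output is easy to post-process. The sketch below derives instructions per cycle from `perf stat -x,` CSV output. It assumes the column layout described in the perf-stat man page (counter value first, event name third) and plain event names such as `cycles` and `instructions`; on machines where perf reports the events under other names (for example with a PMU prefix) the patterns would need adjusting. The function name `ipc_from_csv` is made up for this example.

```shell
# Read perf stat CSV lines on stdin and print instructions per cycle.
ipc_from_csv() {
    awk -F, '
        $3 ~ /^instructions/ { insn = $1 }
        $3 ~ /^cycles/       { cyc  = $1 }
        END {
            # "+ 0" forces a numeric comparison, so a value such as
            # "<not counted>" is treated as zero.
            if (cyc + 0 > 0) printf "IPC = %.2f\n", insn / cyc
            else             print "IPC unavailable"
        }'
}

# Intended use (perf stat writes its counters to stderr):
#   perf stat -x, -e cycles,instructions -- perf_test 2>&1 >/dev/null | ipc_from_csv
```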

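3. top

-e: select the performance event
-a: show statistics for all CPUs
-C: show statistics for the given CPUs
-p: select a process by PID
-t: select a thread by TID
-K: hide kernel symbols
-U: hide user-space symbols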

4. pstack

Sample the stacks of a given process:

# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, 
# at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
perf script

5. off-cpu

http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html

sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch \
    -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
sudo perf inject -v -s -i perf.data.raw -o perf.data
# Newer perf versions spell the field-list option -F; -f now means --force.
sudo perf script -F comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
    ./stackcollapse.pl | \
    ./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg
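
The awk stage in the pipeline above is the least obvious part, so here it is isolated as a function that can be fed sample text. The awk program is the one from the pipeline; the wrapper name `offcpu_stage` and the sample input in the usage note are illustrative only.

```shell
# Turn perf script output into "frames, command, milliseconds, blank line"
# records, the input format expected by stackcollapse.pl.
offcpu_stage() {
    awk '
    # Event header line: field 1 is the command name, field 5 the period in ns.
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    # Stack frame line "address symbol (dso)": keep only the symbol.
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    # Blank line ends the stack: emit the command and the time in ms.
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }'
}
```

Given a header such as `myapp 1234/1234 [000] 100.000001: 5000000 sched:sched_switch: ...` followed by two frame lines for schedule and do_nanosleep and a blank line, the function prints schedule, do_nanosleep, myapp and 5 on separate lines. Events shorter than one millisecond round down to zero and are dropped.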

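6. probe

Probes: This command defines dynamic tracepoint events, by symbol and registers without debuginfo, or by C expressions (C line numbers, C function names, and C local variables) with debuginfo.

How it works: "Linux tracing – kprobe, uprobe and tracepoint"

A probe is defined by an address in the ELF file and configured under /sys/kernel/debug/tracing/events/uprobes; when the program is loaded, the code at that address is modified so that the probe's recording code is executed.

Usage:

Uprobe-tracer: Uprobe-based Event Tracing

"ftrace uprobe使用填坑历程" (Pitfalls encountered while using ftrace uprobes)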
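
As a rough illustration of what perf probe hands to the kernel for a user-space probe, the sketch below formats a uprobe definition in the `p:GROUP/EVENT PATH:OFFSET` form used by the uprobe tracer. The helper name `uprobe_def` and the argument values are hypothetical; in practice perf probe computes the offset from the binary.

```shell
# Format a uprobe definition line: p:GROUP/EVENT PATH:OFFSET
uprobe_def() {
    group=$1
    event=$2
    path=$3
    offset=$4
    printf 'p:%s/%s %s:%s\n' "$group" "$event" "$path" "$offset"
}

# Registering and enabling such a probe by hand would look roughly like
# this (requires root; paths assume tracefs is mounted under debugfs):
#   uprobe_def mygroup myevent /abs/path/to/binary 0x1234 \
#       | sudo tee -a /sys/kernel/debug/tracing/uprobe_events
#   echo 1 | sudo tee /sys/kernel/debug/tracing/events/mygroup/myevent/enable
```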

6.1 Kernel probes

sudo perf probe --add tcp_sendmsg # --add is optional; it is the default action
sudo perf probe -l # list the currently defined tracepoints
sudo perf probe -d tcp_sendmsg # delete the tracepoint

sudo perf probe -V tcp_sendmsg --externs # show the variables available at tcp_sendmsg
sudo perf probe -L tcp_sendmsg # show the source lines that can be probed

sudo perf probe 'tcp_sendmsg %ax %dx %cx' # add a tracepoint that records register values
sudo perf probe 'tcp_sendmsg size sk->__sk_common.skc_state' # record the values of variables available at that point
sudo perf probe 'tcp_sendmsg%return $retval' # tracepoint on return, recording the return value

sudo perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a

6.2 User-space probes

6.2.1 C

# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc

# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry

$ sudo perf probe -x binary_path -V tc_new
Available variables at tc_new
        @<tc_new+0>
                size_t  size

$ sudo perf probe -x binary_path --add tc_delete -v
Open Debuginfo file:binary_path
Try to find probe point from debuginfo.
Probe point found: tc_delete+0
Found 1 probe_trace_events.
Opening /sys/kernel/debug/tracing/uprobe_events write=1
Added new event:
Writing event: p:probe_pangu/tc_delete binary_path:0x240f910
Failed to write event: Invalid argument
  Error: Failed to add events. Reason: Invalid argument (Code: -22)

Corresponding source code:

https://github.com/torvalds/linux/blob/master/tools/perf/util/probe-file.c

Copying the binary into the current directory resolved the problem.

$sudo perf probe -x binary_path   --add "tc_new size"
Added new event:
  probe_pangu:tc_new   (on tc_new in binary_path with size)

You can now use it in all perf tools, such as:

  perf record -e probe_pangu:tc_new -aR sleep 1

$sudo perf probe -l
  probe_pangu:tc_new   (on tc_new@src/tcmalloc.cc in binarypath with size)

$sudo perf record -e probe_pangu:tc_new -aR sleep 1

6.2.2 C++ support

http://notes.secretsauce.net/notes/2019/12/16_c-probes-with-perf.html

https://stackoverflow.com/questions/20172446/cant-add-perf-probe-for-c-methods

Approach: use the mangled (non-demangled) function names.

sudo perf probe -x \
binary_path --funcs \
--no-demangle --filter '*' >rg_funcs  # list all functions

Then set the probe point using the mangled function name.
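
Finding the right mangled name in a long function list can be tedious. The sketch below builds the Itanium C++ ABI prefix for a simple qualified name (length-prefixed components after `_ZN`) and filters a symbol list with it. It only covers plain, non-template names; `Foo` and `bar` are hypothetical names, and both helper functions are made up for this example.

```shell
# Build the mangled prefix for a qualified name given as separate parts,
# e.g. "Foo bar" for Foo::bar.
mangled_prefix() {
    out=_ZN
    for part in "$@"; do
        out="${out}${#part}${part}"    # length followed by the identifier
    done
    printf '%s\n' "$out"
}

# Keep the symbols (one per line on stdin) that start with that prefix.
# Const member functions are mangled with _ZNK instead of _ZN.
grep_mangled() {
    prefix=$(mangled_prefix "$@")
    grep -E "^_ZNK?${prefix#_ZN}"
}
```

With the function list saved earlier, `grep_mangled Foo bar < rg_funcs` narrows the list down to the overloads of Foo::bar, and `c++filt` can be used to confirm which mangled name corresponds to which signature.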


6.3 Syntax

sudo perf probe --help

PROBE SYNTAX
       Probe points are defined by following syntax.

           1) Define event based on function name
            [EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]

           2) Define event based on source file with line number
            [EVENT=]SRC:ALN [ARG ...]

           3) Define event based on source file with lazy pattern
            [EVENT=]SRC;PTN [ARG ...]

       EVENT specifies the name of new event, if omitted, it will be set the name of the probed function. Currently, event group name is set as probe. FUNC specifies
       a probed function name, and it may have one of the following options; +OFFS is the offset from function entry address in bytes, :RLN is the relative-line
       number from function entry line, and %return means that it probes function return. And ;PTN means lazy matching pattern (see LAZY MATCHING). Note that ;PTN
       must be the end of the probe point definition. In addition, @SRC specifies a source file which has that function. It is also possible to specify a probe point
       by the source line number or lazy matching by using SRC:ALN or SRC;PTN syntax, where SRC is the source file path, :ALN is the line number and ;PTN is the lazy
       matching pattern. ARG specifies the arguments of this probe point, (see PROBE ARGUMENT).
       
PROBE ARGUMENT
       Each probe argument follows below syntax.

           [NAME=]LOCALVAR|$retval|%REG|@SYMBOL[:TYPE]

       NAME specifies the name of this argument (optional). You can use the name of local variable, local data structure member (e.g. var->field, var.field2), local
       array with fixed index (e.g. array[1], var->array[0], var->pointer[2]), or kprobe-tracer argument format (e.g. $retval, %ax, etc). Note that the name of this
       argument will be set as the last member name if you specify a local data structure member (e.g. field2 for var->field1.field2.) $vars special argument is also
       available for NAME, it is expanded to the local variables which can access at given probe point. TYPE casts the type of this argument (optional). If omitted,
       perf probe automatically set the type based on debuginfo. You can specify string type only for the local variable or structure member which is an array of or a
       pointer to char or unsigned char type.

       On x86 systems %REG is always the short form of the register: for example %AX. %RAX or %EAX is not valid.
       
FILTER PATTERN
           The filter pattern is a glob matching pattern(s) to filter variables.
           In addition, you can use "!" for specifying filter-out rule. You also can give several rules combined with "&" or "|", and fold those rules as one rule by using "(" ")".

       e.g. With --filter "foo* | bar*", perf probe -V shows variables which start with "foo" or "bar". With --filter "!foo* & *bar", perf probe -V shows variables
       which don’t start with "foo" and end with "bar", like "fizzbar". But "foobar" is filtered out.
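
The two filter examples from the man page can be mimicked with shell glob patterns to check one's reading of the rules. This is only an emulation in shell `case` statements, assuming the glob semantics described above; perf uses its own matcher, and the function names are made up for this example.

```shell
# Emulates --filter "foo* | bar*": accept names starting with foo or bar.
filter_foo_or_bar() {
    case $1 in
        foo*|bar*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Emulates --filter "!foo* & *bar": accept names that do not start
# with foo and do end with bar.
filter_not_foo_and_bar() {
    case $1 in
        foo*) return 1 ;;    # rejected by the "!foo*" rule
        *bar) return 0 ;;
        *)    return 1 ;;
    esac
}
```

Here `filter_not_foo_and_bar fizzbar` succeeds while `filter_not_foo_and_bar foobar` fails, matching the man page's statement that "foobar" is filtered out.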

7. Extensions


strace + tcpdump + blktrace + cpu profile + dynamic tracing
