基於 Fprobe 的事件追蹤

概述

Fprobe 事件類似於 kprobe 事件,但僅限於在函式入口和出口處進行探測。這對於許多僅跟蹤某些特定函式的用例來說已經足夠了。

本文件還涵蓋了跟蹤點探針事件 (tprobe),因為它也僅適用於跟蹤點入口。使用者可以跟蹤部分跟蹤點引數,或者跟蹤沒有跟蹤事件的跟蹤點,這些跟蹤點在 tracefs 上沒有公開。

與其他動態事件一樣,fprobe 事件和跟蹤點探針事件透過 tracefs 上的 dynamic_events 介面檔案定義。

fprobe 事件概要

 f[:[GRP1/][EVENT1]] SYM [FETCHARGS]                       : Probe on function entry
 f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS]     : Probe on function exit
 t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS]                : Probe on tracepoint

GRP1           : Group name for fprobe. If omitted, use "fprobes" for it.
GRP2           : Group name for tprobe. If omitted, use "tracepoints" for it.
EVENT1         : Event name for fprobe. If omitted, the event name is
                 "SYM__entry" or "SYM__exit".
EVENT2         : Event name for tprobe. If omitted, the event name is
                 the same as "TRACEPOINT", but if the "TRACEPOINT" starts
                 with a digit character, "_TRACEPOINT" is used.
MAXACTIVE      : Maximum number of instances of the specified function that
                 can be probed simultaneously, or 0 for the default value
                 as defined in Documentation/trace/fprobe.rst

FETCHARGS      : Arguments. Each probe can have up to 128 args.
 ARG           : Fetch "ARG" function argument using BTF (only for function
                 entry or tracepoint.) (\*1)
 @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
 @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
 $stackN       : Fetch Nth entry of stack (N >= 0)
 $stack        : Fetch stack address.
 $argN         : Fetch the Nth function argument. (N >= 1) (\*2)
 $retval       : Fetch return value.(\*3)
 $comm         : Fetch current task comm.
 +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
 \IMM          : Store an immediate value to the argument.
 NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
 FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
                 (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                 (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                 and bitfield are supported.

 (\*1) This is available only when BTF is enabled.
 (\*2) only for the probe on function entry (offs == 0). Note, this argument access
       is best effort, because depending on the argument type, it may be passed on
       the stack. But this only support the arguments via registers.
 (\*3) only for return probe. Note that this is also best effort. Depending on the
       return value type, it might be passed via a pair of registers. But this only
       accesses one register.
 (\*4) this is useful for fetching a field of data structures.
 (\*5) "u" means user-space dereference.

有關 TYPE 的詳細資訊,請參閱 kprobetrace 文件

出口處的函式引數

可以在出口探針處使用 $arg<N> fetcharg 訪問函式引數。這對於一次記錄函式引數和返回值,以及跟蹤結構欄位的差異(用於除錯函式是否正確更新給定的資料結構)很有用。請參閱下面的 示例,瞭解其工作原理。

BTF 引數

BTF(BPF 型別格式)引數允許使用者透過名稱而不是 $argN 來跟蹤函式和跟蹤點引數。如果核心配置了 CONFIG_BPF_SYSCALL 和 CONFIG_DEBUG_INFO_BTF,則此功能可用。如果使用者僅指定 BTF 引數,則事件的引數名稱也會自動設定為給定名稱。

# echo 'f:myprobe vfs_read count pos' >> dynamic_events
# cat dynamic_events
f:fprobes/myprobe vfs_read count=count pos=pos

它還從 BTF 資訊中選擇提取型別。例如,在上面的示例中,count 是 unsigned long,pos 是一個指標。因此,兩者都轉換為 64 位 unsigned long,但只有 pos 具有 “%Lx” 列印格式,如下所示

# cat events/fprobes/myprobe/format
name: myprobe
ID: 1313
format:
       field:unsigned short common_type;       offset:0;       size:2; signed:0;
       field:unsigned char common_flags;       offset:2;       size:1; signed:0;
       field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
       field:int common_pid;   offset:4;       size:4; signed:1;

       field:unsigned long __probe_ip; offset:8;       size:8; signed:0;
       field:u64 count;        offset:16;      size:8; signed:0;
       field:u64 pos;  offset:24;      size:8; signed:0;

print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos

如果使用者不確定引數的名稱,$arg* 將會很有用。$arg* 會擴充套件到函式或跟蹤點的所有函式引數。

# echo 'f:myprobe vfs_read $arg*' >> dynamic_events
# cat dynamic_events
f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos

BTF 也會影響 $retval。如果使用者未設定任何型別,則 retval 型別會自動從 BTF 中選取。如果函式返回 void,則 $retval 將被拒絕。

您可以使用箭頭運算子 ->(對於指標型別)和點運算子 .(對於資料結構型別)訪問資料結構的欄位。

# echo 't sched_switch preempt prev_pid=prev->pid next_pid=next->pid' >> dynamic_events

欄位訪問運算子 ->. 可以組合使用,以訪問更深層的成員和成員指向的其他結構成員。例如 foo->bar.baz->qux 如果存在非命名聯合成員,您可以像 C 程式碼一樣直接訪問它。例如

struct {
       union {
       int a;
       int b;
       };
} *foo;

要訪問 ab,在這種情況下請使用 foo->afoo->b

此資料欄位訪問可用於透過 $retval 返回的值,例如 $retval->name

對於這些 BTF 引數和欄位,:string:ustring 會改變行為。如果這些用於 BTF 引數或欄位,它會檢查引數或資料欄位的 BTF 型別是否為 char *char []。如果不是,則拒絕應用字串型別。此外,透過 BTF 支援,您不需要記憶體解引用運算子 (+0(PTR)) 來訪問 PTR 指向的字串。它會根據 BTF 型別自動新增記憶體解引用運算子。例如:

# echo 't sched_switch prev->comm:string' >> dynamic_events
# echo 'f getname_flags%return $retval->name:string' >> dynamic_events

prev->comm 是資料結構中的嵌入式字元陣列,$retval->name 是資料結構中的字元指標。但在兩種情況下,您都可以使用 :string 型別來獲取字串。

使用示例

這是一個在 vfs_read() 函式入口和出口處新增 fprobe 事件的示例,帶有 BTF 引數。

 # echo 'f vfs_read $arg*' >> dynamic_events
 # echo 'f vfs_read%return $retval' >> dynamic_events
 # cat dynamic_events
f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos
f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
 # echo 1 > events/fprobes/enable
 # head -n 20 trace | tail
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
              sh-70      [000] ...1.   335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
              sh-70      [000] ...1.   335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
              sh-70      [000] ...1.   335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
              sh-70      [000] ...1.   336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
              sh-70      [000] .....   336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1

您可以看到所有函式引數和返回值都被記錄為 signed int。

此外,這裡還有一個關於 sched_switch 跟蹤點的跟蹤點事件的示例。為了比較結果,這也啟用了 sched_switch traceevent。

 # echo 't sched_switch $arg*' >> dynamic_events
 # echo 1 > events/sched/sched_switch/enable
 # echo 1 > events/tracepoints/sched_switch/enable
 # echo > trace
 # head -n 20 trace | tail
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
              sh-70      [000] d..2.  3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
              sh-70      [000] d..3.  3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1
          <idle>-0       [000] d..2.  3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
          <idle>-0       [000] d..3.  3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
     rcu_preempt-16      [000] d..2.  3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
     rcu_preempt-16      [000] d..3.  3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026
          <idle>-0       [000] d..2.  3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
          <idle>-0       [000] d..3.  3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0

正如您所看到的,sched_switch 跟蹤事件顯示了熟的引數,另一方面,sched_switch 跟蹤點探針事件顯示了原始的引數。這意味著您可以訪問 prevnext 引數指向的 task 結構中的任何欄位值。

例如,通常不跟蹤 task_struct::start_time,但透過此 traceprobe 事件,您可以跟蹤該欄位,如下所示。

 # echo 't sched_switch comm=next->comm:string next->start_time' > dynamic_events
 # head -n 20 trace | tail
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
              sh-70      [000] d..3.  5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
     rcu_preempt-16      [000] d..3.  5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526
              sh-70      [000] d..3.  5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
          <idle>-0       [000] d..3.  5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
     rcu_preempt-16      [000] d..3.  5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
          <idle>-0       [000] d..3.  5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
     kworker/0:1-14      [000] d..3.  5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
          <idle>-0       [000] d..3.  5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000

返回探針允許我們訪問某些函式的結果,這些函式返回錯誤程式碼,其結果透過函式引數傳遞,例如結構初始化函式。

例如,vfs_open() 會將檔案結構連結到 inode 並更新模式。您可以使用返回探針跟蹤這些更改。

# echo 'f vfs_open mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
# echo 'f vfs_open%%return mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
# echo 1 > events/fprobes/enable
# cat trace
             sh-131     [006] ...1.  1945.714346: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x2 inode=0x0
             sh-131     [006] ...1.  1945.714358: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4d801e inode=0xffff888008470168
            cat-143     [007] ...1.  1945.717949: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
            cat-143     [007] ...1.  1945.717956: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4a801d inode=0xffff888005f78d28
            cat-143     [007] ...1.  1945.720616: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
            cat-143     [007] ...1.  1945.728263: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0xa800d inode=0xffff888004ada8d8

您可以看到 file::f_modefile::f_inodevfs_open() 中被更新。