| Age | Commit message (Collapse) | Author | Files | Lines |
|
The '-o' option exists for the SVG creation but not for `perf
timechart record`. Add to better allow testing.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Save samples with deferred callchains in a separate list and deliver
them after merging the user callchains. If users don't want to merge
they can set tool->merge_deferred_callchains to false to prevent the
behavior.
With previous result, now perf script will show the merged callchains.
$ perf script
...
pwd 2312 121.163435: 249113 cpu/cycles/P:
ffffffff845b78d8 __build_id_parse.isra.0+0x218 ([kernel.kallsyms])
ffffffff83bb5bf6 perf_event_mmap+0x2e6 ([kernel.kallsyms])
ffffffff83c31959 mprotect_fixup+0x1e9 ([kernel.kallsyms])
ffffffff83c31dc5 do_mprotect_pkey+0x2b5 ([kernel.kallsyms])
ffffffff83c3206f __x64_sys_mprotect+0x1f ([kernel.kallsyms])
ffffffff845e6692 do_syscall_64+0x62 ([kernel.kallsyms])
ffffffff8360012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
7f18fe337fa7 mprotect+0x7 (/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
7f18fe330e0f _dl_sysdep_start+0x7f (/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
7f18fe331448 _dl_start_user+0x0 (/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
...
The old output can be get using --no-merge-callchain option.
Also perf report can get the user callchain entry at the end.
$ perf report --no-children --stdio -q -S __build_id_parse.isra.0
# symbol: __build_id_parse.isra.0
8.40% pwd [kernel.kallsyms]
|
---__build_id_parse.isra.0
perf_event_mmap
mprotect_fixup
do_mprotect_pkey
__x64_sys_mprotect
do_syscall_64
entry_SYSCALL_64_after_hwframe
mprotect
_dl_sysdep_start
_dl_start_user
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Add a new callchain record mode option for deferred callchains. For now
it only works with FP (frame-pointer) mode.
And add the missing feature detection logic to clear the flag on old
kernels.
$ perf record --call-graph fp,defer -vv true
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CALLCHAIN|PERIOD
read_format ID|LOST
disabled 1
inherit 1
mmap 1
comm 1
freq 1
enable_on_exec 1
task 1
sample_id_all 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
defer_callchain 1
defer_output 1
------------------------------------------------------------
sys_perf_event_open: pid 162755 cpu 0 group_fd -1 flags 0x8
sys_perf_event_open failed, error -22
switching off deferred callchain support
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
FEAT_SPE_EFT and FEAT_SPE_FDS etc have new user facing format attributes
so document them. Also document existing 'event_filter' bits that were
missing from the doc and the fact that latency values are stored in the
weight field.
Reviewed-by: Leo Yan <leo.yan@arm.com>
Tested-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
The NO_AUXTRACE build option was used when the __get_cpuid feature
test failed or if it was provided on the command line. The option no
longer avoids a dependency on a library and so having the option is
just adding complexity to the code base. Remove the option
CONFIG_AUXTRACE from Build files and HAVE_AUXTRACE_SUPPORT by assuming
it is always defined.
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Perf c2c report currently specified the code address and source:line
information in the cacheline browser, while it is lack of annotation
support like perf report to directly show the disassembly code for
the particular symbol shared that same cacheline. This patches add
a key 'a' binding to the cacheline browser which reuse the annotation
browser to show the disassembly view for easier analysis of cacheline
contentions.
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Reviewed-by: Jiebin Sun <jiebin.sun@intel.com>
Reviewed-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Zhiguo Zhou <zhiguo.zhou@intel.com>
Reviewed-by: Wangyang Guo <wangyang.guo@intel.com>
Tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Add example commands for building perf with Clang.
Since recent Android NDK releases use Clang as the default compiler, a
separate Android specific document is no longer needed; point to the
general build documentation instead.
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20251006-perf_build_android_ndk-v3-9-4305590795b2@arm.com
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Cc: linux-riscv@lists.infradead.org
Cc: llvm@lists.linux.dev
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Advertise when perf is built with the HAVE_LIBLLVM_SUPPORT option.
Committer testing:
$ perf -vv | grep LLVM
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
$
And the form to use in scripts, notably the tools/perf/tests/shell/
'perf test' ones:
$ perf check feature libllvm
libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
$ perf check -q feature libllvm && echo LLVM is present
LLVM is present
$ perf check -q feature liballvm && echo ALLVM is present
$
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Collin Funk <collin.funk1@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Dr. David Alan Gilbert <linux@treblig.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Provide ratio-to-prev term which allows the user to
set the event sample period of two events corresponding
to a desired ratio.
If using on an Intel x86 platform with Auto Counter Reload support, also
set corresponding event's config2 attribute with a bitmask which
counters to reset and which counters to sample if the desired ratio is
met or exceeded.
On other platforms, only the sample period is affected by the
ratio-to-prev term.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
- "the the"
- "in in"
- "a a"
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Markus Heidelberg <m.heidelberg@cab.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Add two mmap() workloads: one that eagerly populates a region and
another that demand faults it in.
The intent is to probe the memory subsytem performance incurred
by mmap().
$ perf bench mem mmap -s 4gb -p 4kb -l 10 -f populate
# Running 'mem/mmap' benchmark:
# function 'populate' (Eagerly populated map())
# Copying 4gb bytes ...
1.811691 GB/sec
$ perf bench mem mmap -s 4gb -p 2mb -l 10 -f populate
# Running 'mem/mmap' benchmark:
# function 'populate' (Eagerly populated mmap())
# Copying 4gb bytes ...
12.272017 GB/sec
$ perf bench mem mmap -s 4gb -p 1gb -l 10 -f populate
# Running 'mem/mmap' benchmark:
# function 'populate' (Eagerly populated mmap())
# Copying 4gb bytes ...
17.085927 GB/sec
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
There can be a significant gap in memset/memcpy performance depending
on the size of the region being operated on.
With chunk-size=4kb:
$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
$ perf bench mem memset -p 4kb -k 4kb -s 4gb -l 10 -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4gb bytes ...
13.011655 GB/sec
With chunk-size=1gb:
$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
$ perf bench mem memset -p 4kb -k 1gb -s 4gb -l 10 -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4gb bytes ...
21.936355 GB/sec
So, allow the user to specify the chunk-size.
The default value is identical to the total size of the region, which
preserves current behaviour.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Page sizes that can be selected: 4KB, 2MB, 1GB.
Both the reservation and node from which hugepages are allocated
from are expected to be addressed by the user.
An example of page-size selection:
$ perf bench mem memset -s 4gb -p 2mb
# Running 'mem/memset' benchmark:
# function 'default' (Default memset() provided by glibc)
# Copying 4gb bytes ...
14.919194 GB/sec
# function 'x86-64-unrolled' (unrolled memset() in arch/x86/lib/memset_64.S)
# Copying 4gb bytes ...
11.514503 GB/sec
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4gb bytes ...
12.600568 GB/sec
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Update the perf.data file format description on header section
HEADER_BPF_PROG_INFO.
The information is taken from process_bpf_prog_info() and
write_bpf_prog_info() from file util/header.c.
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The --max-summary option is to limit the number of output lines for
syscall summary stats. The max applies to each entries like thread and
cgroups. For total summary, it will just print up to the given number.
For example,
$ sudo perf trace -as --max-summary 3 sleep 0.1
ThreadPoolServi (1011651), 114 events, 14.8%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 38 0 95.589 0.000 2.515 11.153 28.98%
futex 9 0 0.040 0.002 0.004 0.014 28.63%
read 10 0 0.037 0.003 0.004 0.005 4.67%
sleep (1050529), 250 events, 32.4%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
clock_nanosleep 1 0 100.156 100.156 100.156 100.156 0.00%
execve 4 3 1.020 0.005 0.255 0.989 95.93%
openat 36 17 0.416 0.003 0.012 0.029 10.58%
...
And this is for per-cgroup summary using BPF.
$ sudo perf trace -as --max-summary 3 --summary-mode=cgroup --bpf-summary sleep 0.1
cgroup /user.slice/user-657345.slice/user@657345.service/session.slice/org.gnome.Shell@x11.service, 12 events
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
recvmsg 8 7 0.016 0.001 0.002 0.006 39.73%
ppoll 1 0 0.014 0.014 0.014 0.014 0.00%
write 2 0 0.010 0.002 0.005 0.008 61.02%
cgroup /user.slice/user-657345.slice/session-4.scope, 73 events
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 8 0 13.461 0.010 1.683 12.235 89.66%
ioctl 20 0 0.204 0.001 0.010 0.113 54.01%
writev 11 0 0.164 0.004 0.015 0.042 20.34%
Reviewed-by: Howard Chu <howardchu95@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The function parse_events__sort_events_and_fix_groups is needed to fix
uncore events like:
```
$ perf stat -e '{data_read,data_write}' ...
```
so that the multiple uncore PMUs have a group each of data_read and
data_write events.
The same function will perform architecture sorting and group fixing,
in particular for Intel topdown/perf-metric events. Grouping multiple
perf metric events together causes perf_event_open to fail as the
group can only support one. This means command lines like:
```
$ perf stat -e 'slots,slots' ...
```
fail as the slots events are forced into a group together to try to
satisfy the perf-metric event constraints.
As the user may know better than
parse_events__sort_events_and_fix_groups add a 'X' modifier to skip
its regrouping behavior. This allows the following to succeed rather
than fail on the second slots event being opened:
```
$ perf stat -e 'slots,slots:X' -a sleep 1
Performance counter stats for 'system wide':
6,834,154,071 cpu_core/slots/ (50.13%)
5,548,629,453 cpu_core/slots/X (49.87%)
1.002634606 seconds time elapsed
```
Closes: https://lore.kernel.org/lkml/20250822082233.1850417-1-dapeng1.mi@linux.intel.com/
Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reported-by: Xudong Hao <xudong.hao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Falcon <thomas.falcon@intel.com>
Cc: Yoshihiro Furudera <fj5100bi@fujitsu.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The instructions group is now generated by default so update the doc to
reflect this. Also explain the period/downsampling mechanism in more
detail.
Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Leo Yan <leo.yan@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <Ben.Gainey@arm.com>
Cc: George Wort <George.Wort@arm.com>
Cc: Graham Woodward <Graham.Woodward@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Williams <Michael.Williams@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Until now, the --code-with-type option is available only on stdio.
But it was an artifical limitation because of an implemention issue.
Implement the same logic in annotation_line__write() for stdio2/TUI
and remove the limitation and update the man page.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250816031635.25318-8-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Support for build IDs in mmap2 perf events has been present since
Linux v5.12:
https://lore.kernel.org/lkml/20210219194619.1780437-1-acme@kernel.org/
Build ID mmap events don't avoid the need to inject build IDs for DSO
touched by samples as the build ID cache is populated by perf
record. They can avoid some cases of symbol mis-resolution caused by
the file system changing from when a sample occurred and when the DSO
is sought.
Unlike the --buildid-mmap option, this chnage doesn't disable the
build ID cache but it does disable the processing of samples looking
for DSOs to inject build IDs for. To disable the build ID cache the -B
(--no-buildid) option should be used.
Making this option the default was raised on the list in:
https://lore.kernel.org/linux-perf-users/CAP-5=fXP7jN_QrGUcd55_QH5J-Y-FCaJ6=NaHVtyx0oyNh8_-Q@mail.gmail.com/
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250724163302.596743-9-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
On hybrid systems, events like msr/tsc/ will aggregate counts across
all CPUs. Often metrics only want a value like msr/tsc/ for the cores
on which the metric is being computed. Listing each CPU with terms
cpu=0,cpu=1.. is laborious and would need to be encoded for all
variations of a CPU model.
Allow the cpumask from a PMU to be an argument to the cpu term. For
example in the following the cpumask of the cstate_pkg PMU selects the
CPUs to count msr/tsc/ counter upon:
```
$ cat /sys/bus/event_source/devices/cstate_pkg/cpumask
0
$ perf stat -A -e 'msr/tsc,cpu=cstate_pkg/' -a sleep 0.1
Performance counter stats for 'system wide':
CPU0 252,621,253 msr/tsc,cpu=cstate_pkg/
0.101184092 seconds time elapsed
```
As the cpu term is now also allowed to be a string, allow it to encode
a range of CPUs (a list can't be supported as ',' is already a special
token).
The "event qualifiers" section of the `perf list` man page is updated
to detail the additional behavior. The man page formatting is tidied
up in this section, as it was incorrectly appearing within the
"parameterized events" section.
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250719030517.1990983-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
This change adds support for new funcgraph tracer options funcgraph-args,
funcgraph-retval, funcgraph-retval-hex and funcgraph-retaddr.
The new added options are:
- args : Show function arguments.
- retval : Show function return value.
- retval-hex : Show function return value in hexadecimal format.
- retaddr : Show function return address.
# ./perf ftrace -G vfs_write --graph-opts retval,retaddr
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
5) | mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */
5) 0.188 us | local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */
5) | rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */
5) | _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */
5) 0.123 us | preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */
5) 0.128 us | local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */
5) 0.086 us | do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */
5) 0.845 us | } /* _raw_spin_lock_irqsave ret=0x292 */
5) | _raw_spin_unlock_irqrestore() { /* <-rt_mutex_slowunlock+0x191/0x200 */
5) 0.097 us | local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cff1f */
5) 0.086 us | do_raw_spin_unlock(); /* <-_raw_spin_unlock_irqrestore+0x23/0x60 ret=0x1 */
5) 0.104 us | preempt_count_sub(); /* <-_raw_spin_unlock_irqrestore+0x35/0x60 ret=0x0 */
5) 0.726 us | } /* _raw_spin_unlock_irqrestore ret=0x80000000 */
5) 1.881 us | } /* rt_mutex_slowunlock ret=0x0 */
5) 2.931 us | } /* mutex_unlock ret=0x0 */
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250613114048.132336-1-changbin.du@huawei.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
In addition to the function latency, it can measure events latencies.
Some kernel tracepoints are paired and it's menningful to measure how
long it takes between the two events. The latency is tracked for the
same thread.
Currently it only uses BPF to do the work but it can be lifted later.
Instead of having separate a BPF program for each tracepoint, it only
uses generic 'event_begin' and 'event_end' programs to attach to any
(raw) tracepoints.
$ sudo perf ftrace latency -a -b --hide-empty \
-e i915_request_wait_begin,i915_request_wait_end -- sleep 1
# DURATION | COUNT | GRAPH |
256 - 512 us | 4 | ###### |
2 - 4 ms | 2 | ### |
4 - 8 ms | 12 | ################### |
8 - 16 ms | 10 | ################ |
# statistics (in usec)
total time: 194915
avg time: 6961
max time: 12855
min time: 373
count: 28
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250714052143.342851-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Follow up:
lore.kernel.org/CAP-5=fVDF4-qYL1Lm7efgiHk7X=_nw_nEFMBZFMcsnOOJgX4Kg@mail.gmail.com/
The patch adds unit aggregation during evsel merge the aggregated uncore
counters. Change the name of the column to `ctrs` and `counters` for
json mode.
Tested on a 2-socket machine with SNC3, uncore_imc_[0-11] and
cpumask="0,120"
Before:
perf stat -e clockticks -I 1000 --per-socket
# time socket cpus counts unit events
1.001085024 S0 1 9615386315 clockticks
1.001085024 S1 1 9614287448 clockticks
perf stat -e clockticks -I 1000 --per-node
# time node cpus counts unit events
1.001029867 N0 1 3205726984 clockticks
1.001029867 N1 1 3205444421 clockticks
1.001029867 N2 1 3205234018 clockticks
1.001029867 N3 1 3205224660 clockticks
1.001029867 N4 1 3205207213 clockticks
1.001029867 N5 1 3205528246 clockticks
After:
perf stat -e clockticks -I 1000 --per-socket
# time socket ctrs counts unit events
1.001026071 S0 12 9619677996 clockticks
1.001026071 S1 12 9618612614 clockticks
perf stat -e clockticks -I 1000 --per-node
# time node ctrs counts unit events
1.001027449 N0 4 3207251859 clockticks
1.001027449 N1 4 3207315930 clockticks
1.001027449 N2 4 3206981828 clockticks
1.001027449 N3 4 3206566126 clockticks
1.001027449 N4 4 3206032609 clockticks
1.001027449 N5 4 3205651355 clockticks
Tested with JSON output linter:
perf test "perf stat JSON output linter"
94: perf stat JSON output linter : Ok
Suggested-by: Ian Rogers <irogers@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Link: https://lore.kernel.org/r/20250627201818.479421-1-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Remove all occurrence of libcrypto in the build system.
Signed-off-by: Yuzhuo Jing <yuzhuo@google.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250625202311.23244-5-ebiggers@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
To get the fixes in libbpf and perf tools.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
This creates a config option that detects libbpf's ability to display
character arrays as strings, which was just added to the BPF tree
(https://git.kernel.org/bpf/bpf-next/c/87c9c79a02b4).
To test this change, I built perf (from later in this patch set) with:
- static libbpf (default, using source from kernel tree)
- dynamic libbpf (LIBBPF_DYNAMIC=1 LIBBPF_INCLUDE=/usr/local/include)
For both the static and dynamic versions, I used headers with and without
the ".emit_strings" option.
I verified that of the four resulting binaries, the two with
".emit_strings" would successfully record BPF_METADATA events, and the two
without wouldn't. All four binaries would successfully display
BPF_METADATA events, because the relevant bit of libbpf code is only used
during "perf record".
Signed-off-by: Blake Jones <blakejones@google.com>
Link: https://lore.kernel.org/r/20250612194939.162730-2-blakejones@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Update the documentation of the new fields with examples and caveats.
Also update the related documentation for AMD IBS.
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250610005742.2173050-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The --map-dump option was removed in 5e6da6be3082 ("perf trace: Migrate
BPF augmentation to use a skeleton"), this patch removes its remaining
documentation.
Fixes: 5e6da6be3082 ("perf trace: Migrate BPF augmentation to use a skeleton")
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Link: https://lore.kernel.org/r/20250601173252.717780-1-howardchu95@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Delaying kernel operations can be dangerous and the kernel may kill
(non-sleepable) BPF programs running for long in the future.
Limit the max delay to 10ms and update the document about it.
$ sudo ./perf lock con -abl -J 100000us@cgroup_mutex true
lock delay is too long: 100000us (> 10ms)
Usage: perf lock contention [<options>]
-J, --inject-delay <TIME@FUNC>
Inject delays to specific locks
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250515181042.555189-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Unlike perf-report which uses sample period for overhead calculation,
perf-mem overhead is calculated using sample weight. Describe perf-mem
overhead calculation method in it's man page.
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20250523222157.1259998-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The original PERF_RECORD_COMPRESS is not 8-byte aligned, which can cause
asan runtime error:
# Build with asan
$ make -C tools/perf O=/tmp/perf DEBUG=1 EXTRA_CFLAGS="-O0 -g -fno-omit-frame-pointer -fsanitize=undefined"
# Test success with many asan runtime errors:
$ /tmp/perf/perf test "Zstd perf.data compression/decompression" -vv
83: Zstd perf.data compression/decompression:
...
util/session.c:1959:13: runtime error: member access within misaligned address 0x7f69e3f99653 for type 'union perf_event', which requires 13 byte alignment
0x7f69e3f99653: note: pointer points here
d0 3a 50 69 44 00 00 00 00 00 08 00 bb 07 00 00 00 00 00 00 44 00 00 00 00 00 00 00 ff 07 00 00
^
util/session.c:2163:22: runtime error: member access within misaligned address 0x7f69e3f99653 for type 'union perf_event', which requires 8 byte alignment
0x7f69e3f99653: note: pointer points here
d0 3a 50 69 44 00 00 00 00 00 08 00 bb 07 00 00 00 00 00 00 44 00 00 00 00 00 00 00 ff 07 00 00
^
...
Since there is no way to align compressed data in zstd compression, this
patch add a new event type `PERF_RECORD_COMPRESSED2`, which adds a field
`data_size` to specify the actual compressed data size.
The `header.size` contains the total record size, including the padding
at the end to make it 8-byte aligned.
Tested with `Zstd perf.data compression/decompression`
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ben Gainey <ben.gainey@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303183646.327510-1-ctshao@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Add a new summary mode to collect stats for each cgroup.
$ sudo ./perf trace -as --bpf-summary --summary-mode=cgroup -- sleep 1
Summary of events:
cgroup /user.slice/user-657345.slice/user@657345.service/session.slice/org.gnome.Shell@x11.service, 535 events
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
ppoll 15 0 373.600 0.004 24.907 197.491 55.26%
poll 15 0 1.325 0.001 0.088 0.369 38.76%
close 66 0 0.567 0.007 0.009 0.026 3.55%
write 150 0 0.471 0.001 0.003 0.010 3.29%
recvmsg 94 83 0.290 0.000 0.003 0.037 16.39%
ioctl 26 0 0.237 0.001 0.009 0.096 50.13%
timerfd_create 66 0 0.236 0.003 0.004 0.024 8.92%
timerfd_settime 70 0 0.160 0.001 0.002 0.012 7.66%
writev 10 0 0.118 0.001 0.012 0.019 18.17%
read 9 0 0.021 0.001 0.002 0.004 14.07%
getpid 14 0 0.019 0.000 0.001 0.004 20.28%
cgroup /system.slice/polkit.service, 94 events
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
ppoll 22 0 19.811 0.000 0.900 9.273 63.88%
write 30 0 0.040 0.001 0.001 0.003 12.09%
recvmsg 12 0 0.018 0.001 0.002 0.006 28.15%
read 18 0 0.013 0.000 0.001 0.003 21.99%
poll 12 0 0.006 0.000 0.001 0.001 4.48%
cgroup /user.slice/user-657345.slice/user@657345.service/app.slice/app-org.gnome.Terminal.slice/gnome-terminal-server.service, 21 events
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
ppoll 4 0 17.476 0.003 4.369 13.298 69.65%
recvmsg 15 12 0.068 0.002 0.005 0.014 26.53%
writev 1 0 0.033 0.033 0.033 0.033 0.00%
poll 1 0 0.005 0.005 0.005 0.005 0.00%
...
It works only for --bpf-summary for now.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250501225337.928470-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Sometimes we need to analyze the data in process level but current sort
keys only work on thread level. Let's add 'tgid' sort key for that as
'pid' is already taken for thread.
This will look mostly the same, but it only uses tgid instead of tid.
Here's an example of a process with two threads (thloop).
$ perf record -- perf test -w thloop
$ perf report --stdio -s tgid,pid -H
...
#
# Overhead Tgid:Command / Pid:Command
# ........... ..........................
#
100.00% 2018407:perf
50.34% 2018407:perf
49.66% 2018409:perf
Suggested-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250509210421.197245-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The -C option allows the CPUs for a list of events to be specified but
its not possible to set the CPU for a single event. Add a term to
allow this. The term isn't a general CPU list due to ',' already being
a special character in event parsing instead multiple cpu= terms may
be provided and they will be merged/unioned together.
An example of mixing different types of events counted on different CPUs:
```
$ perf stat -A -C 0,4-5,8 -e "instructions/cpu=0/,l1d-misses/cpu=4,cpu=5/,inst_retired.any/cpu=8/,cycles" -a sleep 0.1
Performance counter stats for 'system wide':
CPU0 6,979,225 instructions/cpu=0/ # 0.89 insn per cycle
CPU4 75,138 cpu/l1d-misses/
CPU5 1,418,939 cpu/l1d-misses/
CPU8 797,553 cpu/inst_retired.any,cpu=8/
CPU0 7,845,302 cycles
CPU4 6,546,859 cycles
CPU5 185,915,438 cycles
CPU8 2,065,668 cycles
0.112449242 seconds time elapsed
```
Committer testing:
root@number:~# grep -m1 "model name" /proc/cpuinfo
model name : AMD Ryzen 9 9950X3D 16-Core Processor
root@number:~# perf stat -A -e "instructions/cpu=0/,instructions,l1d-misses/cpu=4,cpu=5/,cycles" -a sleep 0.1
Performance counter stats for 'system wide':
CPU0 2,398,351 instructions/cpu=0/ # 0.44 insn per cycle
CPU0 2,398,152 instructions # 0.44 insn per cycle
CPU1 1,265,634 instructions # 0.49 insn per cycle
CPU2 606,087 instructions # 0.50 insn per cycle
CPU3 4,025,752 instructions # 0.52 insn per cycle
CPU4 4,236,810 instructions # 0.53 insn per cycle
CPU5 3,984,832 instructions # 0.66 insn per cycle
CPU6 434,132 instructions # 0.44 insn per cycle
CPU7 65,752 instructions # 0.41 insn per cycle
CPU8 459,083 instructions # 0.48 insn per cycle
CPU9 6,464,161 instructions # 1.31 insn per cycle
<SNIP>
root@number:~# perf stat -e "instructions/cpu=0/,instructions,l1d-misses/cpu=4,cpu=5/,cycles" -a sleep 0.
Performance counter stats for 'system wide':
144,822 instructions/cpu=0/ # 0.03 insn per cycle
4,666,114 instructions # 0.93 insn per cycle
2,583 l1d-misses
4,993,633 cycles
0.000868512 seconds time elapsed
root@number:~#
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.co |