28 Nov, 2020

6 commits

  • Some gcc versions generate broken DWARF that lacks the DW_AT_declaration
    attribute in the subprogram DIE of a function prototype.
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97060)

    So, in addition to the DW_AT_declaration check, also check whether the
    subprogram DIE has DW_AT_inline or an actual entry pc.
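
    For context, a minimal libdw-based sketch of that kind of check (an
    illustration of the idea only, not the actual perf probe code) could
    look like this:

    #include <elfutils/libdw.h>
    #include <dwarf.h>
    #include <stdbool.h>

    /* Treat a subprogram DIE as a real function definition only when it is
     * not marked DW_AT_declaration and it has either DW_AT_inline or a
     * resolvable entry pc. */
    static bool die_is_func_definition(Dwarf_Die *die)
    {
            Dwarf_Addr entry;

            if (dwarf_hasattr(die, DW_AT_declaration))
                    return false;           /* explicit prototype */

            /* Broken DWARF may omit DW_AT_declaration, so also require
             * DW_AT_inline or an entry address. */
            return dwarf_hasattr(die, DW_AT_inline) ||
                   dwarf_entrypc(die, &entry) == 0;
    }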

    Committer testing:

    # cat /etc/fedora-release
    Fedora release 33 (Thirty Three)
    #

    Before:

    # perf test vfs_getname
    78: Use vfs_getname probe to get syscall args filenames : FAILED!
    79: Check open filename arg using perf trace + vfs_getname : FAILED!
    81: Add vfs_getname probe to get syscall args filenames : FAILED!
    #

    After:

    # perf test vfs_getname
    78: Use vfs_getname probe to get syscall args filenames : Ok
    79: Check open filename arg using perf trace + vfs_getname : Ok
    81: Add vfs_getname probe to get syscall args filenames : Ok
    #

    Reported-by: Thomas Richter
    Signed-off-by: Masami Hiramatsu
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Sumanth Korikkar
    Link: http://lore.kernel.org/lkml/160645613571.2824037.7441351537890235895.stgit@devnote2
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Fix die_entrypc() to return an error correctly if the DIE has no
    DW_AT_ranges attribute. Since dwarf_ranges() will treat that case as
    empty ranges and return 0, we have to check for the attribute ourselves.
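
    For illustration, a minimal libdw sketch of such a check (an assumption
    of how it could be done, not the verbatim perf code):

    #include <elfutils/libdw.h>
    #include <dwarf.h>
    #include <errno.h>

    /* Return the entry address of a function DIE, falling back to its first
     * address range, but only when DW_AT_ranges is really present, because
     * dwarf_ranges() also returns 0 for a DIE without that attribute. */
    static int entrypc_sketch(Dwarf_Die *die, Dwarf_Addr *addr)
    {
            Dwarf_Attribute attr;
            Dwarf_Addr base, start, end;

            if (dwarf_entrypc(die, addr) == 0)
                    return 0;

            if (!dwarf_attr(die, DW_AT_ranges, &attr))
                    return -ENOENT;         /* no ranges: report an error */

            if (dwarf_ranges(die, 0, &base, &start, &end) <= 0)
                    return -ENOENT;

            *addr = start;
            return 0;
    }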

    Fixes: 91e2f539eeda ("perf probe: Fix to show function entry line as probe-able")
    Signed-off-by: Masami Hiramatsu
    Cc: Sumanth Korikkar
    Cc: Thomas Richter
    Link: http://lore.kernel.org/lkml/160645612634.2824037.5284932731175079426.stgit@devnote2
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Currently perf stat shows some metrics (like IPC) for defined events.
    But when no aggregation mode is used (-A option), it shows incorrect
    values since it uses a value from a different CPU.

    Before:

    $ perf stat -aA -e cycles,instructions sleep 1

    Performance counter stats for 'system wide':

    CPU0 116,057,380 cycles
    CPU1 86,084,722 cycles
    CPU2 99,423,125 cycles
    CPU3 98,272,994 cycles
    CPU0 53,369,217 instructions # 0.46 insn per cycle
    CPU1 33,378,058 instructions # 0.29 insn per cycle
    CPU2 58,150,086 instructions # 0.50 insn per cycle
    CPU3 40,029,703 instructions # 0.34 insn per cycle

    1.001816971 seconds time elapsed

    So the IPC for CPU1 should be 0.38 (= 33,378,058 / 86,084,722)
    but it was 0.29 (= 33,378,058 / 116,057,380) and so on.

    After:

    $ perf stat -aA -e cycles,instructions sleep 1

    Performance counter stats for 'system wide':

    CPU0 109,621,384 cycles
    CPU1 159,026,454 cycles
    CPU2 99,460,366 cycles
    CPU3 124,144,142 cycles
    CPU0 44,396,706 instructions # 0.41 insn per cycle
    CPU1 120,195,425 instructions # 0.76 insn per cycle
    CPU2 44,763,978 instructions # 0.45 insn per cycle
    CPU3 69,049,079 instructions # 0.56 insn per cycle

    1.001910444 seconds time elapsed

    Fixes: 44d49a600259 ("perf stat: Support metrics in --per-core/socket mode")
    Reported-by: Sam Xi
    Signed-off-by: Namhyung Kim
    Reviewed-by: Andi Kleen
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Ian Rogers
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20201127041404.390276-1-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • It didn't check the tool->cgroup_events bit, which is set when the
    --all-cgroups option is given. Without it, samples will not have cgroup
    info, so there is no reason to synthesize the events.
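
    A minimal sketch of the guard being described, reusing the field name
    from the text above (hypothetical helper names, not the exact perf
    source):

    #include <stdbool.h>

    struct tool_sketch {
            bool cgroup_events;     /* set when --all-cgroups is passed */
    };

    /* Synthesize PERF_RECORD_CGROUP events only when the user asked for
     * cgroup info; otherwise samples carry nothing to match against. */
    static int synthesize_cgroups_sketch(struct tool_sketch *tool)
    {
            if (!tool->cgroup_events)
                    return 0;

            /* ... walk the cgroup filesystem and emit one event per cgroup ... */
            return 0;
    }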

    We can check the PERF_RECORD_CGROUP records after running perf record
    *WITHOUT* the --all-cgroups option:

    Before:

    $ perf report -D | grep CGROUP
    0 0 0x8430 [0x38]: PERF_RECORD_CGROUP cgroup: 1 /
    CGROUP events: 1
    CGROUP events: 0
    CGROUP events: 0

    After:

    $ perf report -D | grep CGROUP
    CGROUP events: 0
    CGROUP events: 0
    CGROUP events: 0

    Committer testing:

    Before:

    # perf record -a sleep 1
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 2.208 MB perf.data (10003 samples) ]
    # perf report -D | grep "CGROUP events"
    CGROUP events: 146
    CGROUP events: 0
    CGROUP events: 0
    #

    After:

    # perf record -a sleep 1
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 2.208 MB perf.data (10448 samples) ]
    # perf report -D | grep "CGROUP events"
    CGROUP events: 0
    CGROUP events: 0
    CGROUP events: 0
    #

    With all-cgroups:

    # perf record --all-cgroups -a sleep 1
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 2.374 MB perf.data (11526 samples) ]
    # perf report -D | grep "CGROUP events"
    CGROUP events: 146
    CGROUP events: 0
    CGROUP events: 0
    #

    Fixes: 8fb4b67939e16 ("perf record: Add --all-cgroups option")
    Signed-off-by: Namhyung Kim
    Tested-by: Arnaldo Carvalho de Melo
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Ian Rogers
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20201127054356.405481-1-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • An appropriate return value should be set on the failed path.

    Fixes: 2a09a84c720b436a ("perf diff: Support hot streams comparison")
    Reported-by: Hulk Robot
    Signed-off-by: Zhen Lei
    Acked-by: Jiri Olsa
    Acked-by: Namhyung Kim
    Cc: Alexander Shishkin
    Cc: Jin Yao
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Link: http://lore.kernel.org/lkml/20201124103652.438-1-thunder.leizhen@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Zhen Lei
     
  • To pick the changes in:

    7a078d2d18801bba ("libbpf, hashmap: Fix undefined behavior in hash_bits")

    That doesn't entail any changes in tools/perf.

    This addresses this perf build warning:

    Warning: Kernel ABI header at 'tools/perf/util/hashmap.h' differs from latest version at 'tools/lib/bpf/hashmap.h'
    diff -u tools/perf/util/hashmap.h tools/lib/bpf/hashmap.h

    Not a kernel ABI, it's just that this uses the mechanism in place for
    checking kernel ABI file drift.

    Cc: Adrian Hunter
    Cc: Daniel Borkmann
    Cc: Ian Rogers
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

17 Nov, 2020

2 commits

  • This fix is for a failure that occurred in the DWARF unwind perf test.

    Stack unwinders may probe memory when looking for frames.

    Memory sanitizer will poison and track uninitialized memory on the
    stack, and on the heap if the value is copied to the heap.

    This can lead to false memory sanitizer failures for the use of an
    uninitialized value.

    Avoid this problem by removing the poison on the copied stack.
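
    A generic sketch of that technique (assuming a buffer that received a
    memcpy of live stack memory; the actual perf change may differ in its
    details):

    #include <string.h>

    #ifndef __has_feature
    # define __has_feature(x) 0             /* non-clang compilers */
    #endif
    #if __has_feature(memory_sanitizer)
    # include <sanitizer/msan_interface.h>
    #endif

    /* Copy live stack bytes into a sample buffer and clear MSan's poison:
     * the copied bytes are intentionally "uninitialized" data that the
     * DWARF unwinder will probe later. */
    static void copy_stack_sample(void *buf, const void *sp, size_t size)
    {
            memcpy(buf, sp, size);
    #if __has_feature(memory_sanitizer)
            __msan_unpoison(buf, size);
    #endif
    }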

    The full msan failure with track origins looks like:

    ==2168==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x559ceb10755b in handle_cfi elfutils/libdwfl/frame_unwind.c:648:8
    #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #23 0x559cea95fbce in main tools/perf/perf.c:539:3

    Uninitialized value was stored to memory at
    #0 0x559ceb106acf in __libdwfl_frame_reg_set elfutils/libdwfl/frame_unwind.c:77:22
    #1 0x559ceb106acf in handle_cfi elfutils/libdwfl/frame_unwind.c:627:13
    #2 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #3 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #4 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #5 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #6 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #7 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #8 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #9 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #10 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #11 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #12 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #13 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #14 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #15 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #16 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #17 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #18 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #19 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #20 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #21 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #22 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #23 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #24 0x559cea95fbce in main tools/perf/perf.c:539:3

    Uninitialized value was stored to memory at
    #0 0x559ceb106a54 in handle_cfi elfutils/libdwfl/frame_unwind.c:613:9
    #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #23 0x559cea95fbce in main tools/perf/perf.c:539:3

    Uninitialized value was stored to memory at
    #0 0x559ceaff8800 in memory_read tools/perf/util/unwind-libdw.c:156:10
    #1 0x559ceb10f053 in expr_eval elfutils/libdwfl/frame_unwind.c:501:13
    #2 0x559ceb1060cc in handle_cfi elfutils/libdwfl/frame_unwind.c:603:18
    #3 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #4 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #5 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #6 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #7 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #8 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #9 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #10 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #11 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    #12 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #13 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #14 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #15 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #16 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #17 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #18 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #19 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #20 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #21 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #22 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #23 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #24 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #25 0x559cea95fbce in main tools/perf/perf.c:539:3

    Uninitialized value was stored to memory at
    #0 0x559cea9027d9 in __msan_memcpy llvm/llvm-project/compiler-rt/lib/msan/msan_interceptors.cpp:1558:3
    #1 0x559cea9d2185 in sample_ustack tools/perf/arch/x86/tests/dwarf-unwind.c:41:2
    #2 0x559cea9d202c in test__arch_unwind_sample tools/perf/arch/x86/tests/dwarf-unwind.c:72:9
    #3 0x559ceabc9cbd in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:106:6
    #4 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #5 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #6 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #7 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #8 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #9 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    #10 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    #11 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    #12 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    #13 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    #14 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    #15 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    #16 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    #17 0x559cea95fbce in main tools/perf/perf.c:539:3

    Uninitialized value was created by an allocation of 'bf' in the stack frame of function 'perf_event__synthesize_mmap_events'
    #0 0x559ceafc5f60 in perf_event__synthesize_mmap_events tools/perf/util/synthetic-events.c:445

    SUMMARY: MemorySanitizer: use-of-uninitialized-value elfutils/libdwfl/frame_unwind.c:648:8 in handle_cfi
    Signed-off-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: clang-built-linux@googlegroups.com
    Cc: Jiri Olsa
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Sandeep Dasgupta
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20201113182053.754625-1-irogers@google.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Ian Rogers
     
  • "perf inject" can create corrupt files when synthesizing sample events from AUX
    data. This happens when in the input file, the first event (for the AUX data)
    has a different sample_type from the second event (generally dummy).

    Specifically, they differ in the bits that indicate the standard fields
    appended to perf records in the mmap buffer. "perf inject" deletes the first
    event and moves up the second event to first position.

    The problem is with the synthetic PERF_RECORD_MMAP (etc.) events created
    by "perf record".

    Since these are synthetic versions of events which are normally produced
    by the kernel, they have to have the standard fields appended as
    described by sample_type.

    "perf record" fills these in with zeroes, including the IDENTIFIER
    field; perf readers interpret records with zero IDENTIFIER using the
    descriptor for the first event in the file.

    Since "perf inject" changes the first event, these synthetic records are
    then processed with the wrong value of sample_type, and the perf reader
    reads bad data, reports on incorrect length records etc.

    Mismatching sample_types are seen with "perf record -e cs_etm//", where the AUX
    event has TID|TIME|CPU|IDENTIFIER and the dummy event has TID|TIME|IDENTIFIER.

    Perhaps they could be the same, but it isn't normally a problem if they aren't
    - perf has no problems reading the file.

    The sample_types have to agree on the position of IDENTIFIER, because
    that's how perf finds the right event descriptor in the first place, but
    they don't normally have to agree on other fields, and perf doesn't
    check that they do.

    The problem is specific to the way "perf inject" reorganizes the events
    and the way synthetic MMAP events are recorded with a zero identifier. A
    simple solution is to stop "perf inject" deleting the tracing event.

    Committer testing:

    Removed the now unused 'evsel' variable, updated the comment about the
    evsel removal not being performed anymore, and applied the patch manually
    as it failed with this warning:

    warning: Patch sent with format=flowed; space at the end of lines might be lost.

    Testing it with:

    $ perf bench internals inject-build-id
    # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 8.543 msec (+- 0.130 msec)
    Average time per event: 0.838 usec (+- 0.013 usec)
    Average memory usage: 12717 KB (+- 9 KB)
    Average build-id-all injection took: 5.710 msec (+- 0.058 msec)
    Average time per event: 0.560 usec (+- 0.006 usec)
    Average memory usage: 12079 KB (+- 7 KB)
    $

    Signed-off-by: Al Grant
    Acked-by: Adrian Hunter
    Acked-by: Namhyung Kim
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    LPU-Reference: b9cf5611-daae-2390-3439-6617f8f0a34b@foss.arm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Al Grant
     

13 Nov, 2020

5 commits

  • Since commit 943b69ac1884 ("perf parse-events: Set exclude_guest=1
    for user-space counting"), 'exclude_guest=1' is set for user-space
    counting and the branch sample's modifier has been altered: the sample
    event name has changed from "branches:u:" to "branches:uH:", which
    indicates "user-space and host counting".

    But the cs-etm testing's regular expression cannot match the updated
    branch sample event and leads to test failure.

    This patch updates the branch sample pattern, using the more flexible
    expression '.*' to match the branch sample's modifiers, so that the
    test works as expected.

    Fixes: 943b69ac1884 ("perf parse-events: Set exclude_guest=1 for user-space counting")
    Signed-off-by: Leo Yan
    Reviewed-by: Mathieu Poirier
    Cc: Alexander Shishkin
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: Mark Rutland
    Cc: Mike Leach
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Suzuki Poulouse
    Cc: coresight ml
    Cc: stable@kernel.org
    Link: http://lore.kernel.org/lkml/20201110063417.14467-2-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • Fix a typo: s/devce_name/device_name.

    Fixes: fe0aed19b266 ("perf test: Introduce script for Arm CoreSight testing")
    Signed-off-by: Leo Yan
    Reviewed-by: Mathieu Poirier
    Cc: Alexander Shishkin
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: Mark Rutland
    Cc: Mike Leach
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Suzuki Poulouse
    Cc: coresight ml
    Cc: stable@kernel.org
    Link: http://lore.kernel.org/lkml/20201110063417.14467-1-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • To bring in the changes made in these csets:

    4d6ffa27b8e5116c ("x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S")
    6dcc5627f6aec4cb ("x86/asm: Change all ENTRY+ENDPROC to SYM_FUNC_*")

    I needed to define SYM_FUNC_START_LOCAL() as SYM_L_GLOBAL since
    mem{cpy,set}_{orig,erms} are used by 'perf bench'.

    This silences these perf tools build warnings:

    Warning: Kernel ABI header at 'tools/arch/x86/lib/memcpy_64.S' differs from latest version at 'arch/x86/lib/memcpy_64.S'
    diff -u tools/arch/x86/lib/memcpy_64.S arch/x86/lib/memcpy_64.S
    Warning: Kernel ABI header at 'tools/arch/x86/lib/memset_64.S' differs from latest version at 'arch/x86/lib/memset_64.S'
    diff -u tools/arch/x86/lib/memset_64.S arch/x86/lib/memset_64.S

    Cc: Adrian Hunter
    Cc: Borislav Petkov
    Cc: Fangrui Song
    Cc: Ian Rogers
    Cc: Jiri Olsa
    Cc: Jiri Slaby
    Cc: Namhyung Kim
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • When executing the command "perf lock report", it hits a failure and
    outputs the following log:

    perf: builtin-lock.c:623: report_lock_release_event: Assertion `!(seq->read_count < 0)' failed.
    Aborted

    This is an imbalance issue. The locking sequence structure
    "lock_seq_stat" contains the reader counter, which is used to check
    whether the locking sequence is balanced between acquiring and releasing.

    If the tool wrongly frees "lock_seq_stat" when "read_count" isn't zero,
    "read_count" will be reset to zero when a new structure is allocated the
    next time; this causes wrong reader counting and finally results in the
    imbalance issue.

    To fix this issue, if "read_count" is not zero (meaning there are still
    readers in the locking sequence), go to the "end" tag to skip freeing
    the "lock_seq_stat" structure.

    Fixes: e4cef1f65061 ("perf lock: Fix state machine to recognize lock sequence")
    Signed-off-by: Leo Yan
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201104094229.17509-2-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • The tracepoint "lock:lock_acquire" contains field "flags" but not
    "flag". Current code wrongly retrieves value from field "flag" and it
    always gets zero for the value, thus "perf lock" doesn't report the
    correct result.

    This patch replaces the field name "flag" with "flags", so can read out
    the correct flags for locking.

    Fixes: e4cef1f65061 ("perf lock: Fix state machine to recognize lock sequence")
    Signed-off-by: Leo Yan
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201104094229.17509-1-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     

03 Nov, 2020

12 commits

  • A byte-swap function for PERF_RECORD_CGROUP events was missing; add one.
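
    For illustration, a sketch of the kind of byte swap involved (layout
    simplified; PERF_RECORD_CGROUP carries a fixed-width cgroup id followed
    by a path string, and only the fixed-width field needs swapping):

    #include <byteswap.h>
    #include <stdint.h>

    struct cgroup_event_sketch {
            uint64_t id;            /* 64-bit cgroup id */
            char     path[256];     /* placeholder size */
    };

    /* When a perf.data file was recorded on a machine with the opposite
     * endianness, fixed-width fields must be byte-swapped on read. */
    static void cgroup_event_swap(struct cgroup_event_sketch *event)
    {
            event->id = bswap_64(event->id);
    }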

    Fixes: ba78c1c5461c ("perf tools: Basic support for CGROUP event")
    Signed-off-by: Namhyung Kim
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Ian Rogers
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20201102140228.303657-1-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • We are missing a swap for the ino_generation field.

    Fixes: 5c5e854bc760 ("perf tools: Add attr->mmap2 support")
    Signed-off-by: Jiri Olsa
    Acked-by: Namhyung Kim
    Link: https://lore.kernel.org/r/20201101233103.3537427-2-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • We display garbage for undefined build_id objects, because we don't
    initialize the output buffer.

    Signed-off-by: Jiri Olsa
    Acked-by: Namhyung Kim
    Link: https://lore.kernel.org/r/20201101233103.3537427-1-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • Building perf with gcc-9.1.1 generates the following warning:

    CC ui/browsers/hists.o
    ui/browsers/hists.c: In function 'perf_evsel__hists_browse':
    ui/browsers/hists.c:3078:61: error: '%d' directive output may be \
    truncated writing between 1 and 11 bytes into a region of size \
    between 2 and 12 [-Werror=format-truncation=]

    3078 | "Max event group index to sort is %d (index from 0 to %d)",
    | ^~
    ui/browsers/hists.c:3078:7: note: directive argument in the range [-2147483648, 8]
    3078 | "Max event group index to sort is %d (index from 0 to %d)",
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    In file included from /usr/include/stdio.h:937,
    from ui/browsers/hists.c:5:

    IOW, the string in line 3078 might be too long for buf[] of 64 bytes.

    Fix this by increasing the size of buf[] to 128.

    Fixes: dbddf1747441 ("perf report/top TUI: Support hotkeys to let user select any event for sorting")
    Signed-off-by: Song Liu
    Acked-by: Jiri Olsa
    Cc: Jin Yao
    Cc: stable@vger.kernel.org # v5.7+
    Link: http://lore.kernel.org/lkml/20201030235431.534417-1-songliubraving@fb.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Song Liu
     
  • To avoid this:

    util/scripting-engines/trace-event-python.c: In function 'python_start_script':
    util/scripting-engines/trace-event-python.c:1595:2: error: 'visibility' attribute ignored [-Werror=attributes]
    1595 | PyMODINIT_FUNC (*initfunc)(void);
    | ^~~~~~~~~~~~~~

    That started breaking when building with PYTHON=python3 and these gcc
    versions (I haven't checked with the clang ones, maybe it breaks there
    as well):

    # export PERF_TARBALL=http://192.168.86.5/perf/perf-5.9.0.tar.xz
    # dm fedora:33 fedora:rawhide
    1 107.80 fedora:33 : Ok gcc (GCC) 10.2.1 20201005 (Red Hat 10.2.1-5), clang version 11.0.0 (Fedora 11.0.0-1.fc33)
    2 92.47 fedora:rawhide : Ok gcc (GCC) 10.2.1 20201016 (Red Hat 10.2.1-6), clang version 11.0.0 (Fedora 11.0.0-1.fc34)
    #

    Avoid that by ditching that 'initfunc' function pointer with its:

    #define Py_EXPORTED_SYMBOL __attribute__ ((visibility ("default")))
    #define PyMODINIT_FUNC Py_EXPORTED_SYMBOL PyObject*

    And just call PyImport_AppendInittab() at the end of the ifdef python3
    block with the functions that were being attributed to that initfunc.
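
    For reference, the general Python 3 embedding pattern looks like the
    sketch below (the "example" module name is made up; perf registers its
    own scripting modules at this point):

    #include <Python.h>

    static PyObject *PyInit_example(void)
    {
            static struct PyModuleDef def = {
                    PyModuleDef_HEAD_INIT, "example", NULL, -1, NULL,
            };
            return PyModule_Create(&def);
    }

    static void start_embedded_python(void)
    {
            /* Must run before Py_Initialize(): registers the module so that
             * "import example" works inside the embedded interpreter, with
             * no PyMODINIT_FUNC-typed function pointer needed. */
            PyImport_AppendInittab("example", PyInit_example);
            Py_Initialize();
    }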

    Cc: Adrian Hunter
    Cc: Ian Rogers
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • The GCC specific __attribute__((optimize)) attribute does not do what is
    commonly expected and its use in production code is explicitly
    recommended against by the GCC people.

    Unlike what is often expected, it doesn't add to the optimization flags,
    but fully replaces them, losing any and all optimization flags provided
    on the compiler command line.

    The only guaranteed means of inhibiting tail-calls is to place a volatile
    asm with side-effects after the call, such that the tail-call simply
    cannot be done.
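
    A minimal sketch of that technique (generic C, not tied to any particular
    call site):

    void callee(void);

    void caller(void)
    {
            callee();
            /*
             * The empty volatile asm is a side effect that must stay ordered
             * after the call, so the compiler can no longer turn the call
             * into a tail-call.
             */
            asm volatile("" ::: "memory");
    }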

    Given the original commit wasn't specific on which calls were the problem, this
    removal might re-introduce the problem, which can then be re-analyzed and cured
    properly.

    Signed-off-by: Peter Zijlstra
    Acked-by: Ard Biesheuvel
    Acked-by: Miguel Ojeda
    Cc: Alexei Starovoitov
    Cc: Arnd Bergmann
    Cc: Arvind Sankar
    Cc: Daniel Borkmann
    Cc: Geert Uytterhoeven
    Cc: Ian Rogers
    Cc: Josh Poimboeuf
    Cc: Kees Kook
    Cc: Martin Liška
    Cc: Nick Desaulniers
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Link: http://lore.kernel.org/lkml/20201028081123.GT2628@hirez.programming.kicks-ass.net
    Signed-off-by: Arnaldo Carvalho de Melo

    Peter Zijlstra
     
  • Ian reports an issue that the metric DRAM_BW_Use often remains 0.

    The metric expression for DRAM_BW_Use on CLX/SKX:

    "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time"

    The counts of uncore_imc/cas_count_read/ and uncore_imc/cas_count_write/
    are scaled up by 64, that is, to turn a count of cache lines into bytes;
    the count is then divided by 1000000000 to give GB.

    However, the counts of uncore_imc/cas_count_read/ and
    uncore_imc/cas_count_write/ have already been scaled.

    The scale values are from sysfs, such as
    /sys/devices/uncore_imc_0/events/cas_count_read.scale.
    It's 6.103515625e-5 (64 / 1024.0 / 1024.0).

    So if we use the original metric expression, the result is not correct.

    But the difficulty is, for SKL client, the counts are not scaled.

    The metric expression for DRAM_BW_Use on SKL:

    "64 * ( arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@ ) / 1000000 / duration_time / 1000"

    root@kbl-ppc:~# perf stat -M DRAM_BW_Use -a -- sleep 1

    Performance counter stats for 'system wide':

    190 arb/event=0x84,umask=0x1/ # 1.86 DRAM_BW_Use
    29,093,178 arb/event=0x81,umask=0x1/
    1,000,703,287 ns duration_time

    1.000703287 seconds time elapsed

    The result is expected.

    So the easy way is just change the metric expression for CLX/SKX.
    This patch changes the metric expression to:

    "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time"

    1048576 = 1024 * 1024.

    Before (tested on CLX):

    root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

    Performance counter stats for 'system wide':

    765.35 MiB uncore_imc/cas_count_read/ # 0.00 DRAM_BW_Use
    5.42 MiB uncore_imc/cas_count_write/
    1001515088 ns duration_time

    1.001515088 seconds time elapsed

    After:

    root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

    Performance counter stats for 'system wide':

    767.95 MiB uncore_imc/cas_count_read/ # 0.80 DRAM_BW_Use
    5.02 MiB uncore_imc/cas_count_write/
    1001900010 ns duration_time

    1.001900010 seconds time elapsed
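
    As a sanity check against the "After" output above (values rounded):

    ( 767.95 MiB + 5.02 MiB ) * 1048576 = 810,517,791 bytes ~= 0.81 GB
    0.81 GB / 1.001900010 s ~= 0.81 GB/s

    which matches the reported 0.80 DRAM_BW_Use within rounding.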

    Fixes: 038d3b53c284 ("perf vendor events intel: Update CascadelakeX events to v1.08")
    Fixes: b5ff7f2799a4 ("perf vendor events: Update SkylakeX events to v1.21")
    Signed-off-by: Jin Yao
    Acked-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Peter Zijlstra
    Link: http://lore.kernel.org/lkml/20201023005334.7869-1-yao.jin@linux.intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jin Yao
     
  • # ./perf trace -e sched:sched_switch -G test -a sleep 1
    perf: Segmentation fault
    Obtained 11 stack frames.
    ./perf(sighandler_dump_stack+0x43) [0x55cfdc636db3]
    /lib/x86_64-linux-gnu/libc.so.6(+0x3efcf) [0x7fd23eecafcf]
    ./perf(parse_cgroups+0x36) [0x55cfdc673f36]
    ./perf(+0x3186ed) [0x55cfdc70d6ed]
    ./perf(parse_options_subcommand+0x629) [0x55cfdc70e999]
    ./perf(cmd_trace+0x9c2) [0x55cfdc5ad6d2]
    ./perf(+0x1e8ae0) [0x55cfdc5ddae0]
    ./perf(+0x1e8ded) [0x55cfdc5ddded]
    ./perf(main+0x370) [0x55cfdc556f00]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x7fd23eeadb96]
    ./perf(_start+0x29) [0x55cfdc557389]
    Segmentation fault
    #

    It happens because "struct trace" in option->value is passed to the
    parse_cgroups function instead of "struct evlist".

    Fixes: 9ea42ba4411ac ("perf trace: Support setting cgroups as targets")
    Signed-off-by: Stanislav Ivanichkin
    Tested-by: Arnaldo Carvalho de Melo
    Acked-by: Namhyung Kim
    Cc: Dmitry Monakhov
    Link: http://lore.kernel.org/lkml/20201027094357.94881-1-sivanichkin@yandex-team.ru
    Signed-off-by: Arnaldo Carvalho de Melo

    Stanislav Ivanichkin
     
  • The addr in PERF_RECORD_KSYMBOL events for non-jited bpf progs points to
    the bpf interpreter, i.e. within the kernel text section. When processing the
    unregister event, this causes unexpected removal of vmlinux_map,
    crashing perf later in cleanup:

    # perf record -- timeout --signal=INT 2s /usr/share/bcc/tools/execsnoop
    PCOMM PID PPID RET ARGS
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.208 MB perf.data (5155 samples) ]
    perf: tools/include/linux/refcount.h:131: refcount_sub_and_test: Assertion `!(new > val)' failed.
    Aborted (core dumped)

    # perf script -D|grep KSYM
    0 0xa40 [0x48]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b530 len 0 type 1 flags 0x0 name bpf_prog_f958f6eb72ef5af6
    0 0xab0 [0x48]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b530 len 0 type 1 flags 0x0 name bpf_prog_8c42dee26e8cd4c2
    0 0xb20 [0x48]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b530 len 0 type 1 flags 0x0 name bpf_prog_f958f6eb72ef5af6
    108563691893 0x33d98 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3b0 len 0 type 1 flags 0x0 name bpf_prog_bc5697a410556fc2_syscall__execve
    108568518458 0x34098 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3f0 len 0 type 1 flags 0x0 name bpf_prog_45e2203c2928704d_do_ret_sys_execve
    109301967895 0x34830 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3b0 len 0 type 1 flags 0x1 name bpf_prog_bc5697a410556fc2_syscall__execve
    109302007356 0x348b0 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffa9b6b3f0 len 0 type 1 flags 0x1 name bpf_prog_45e2203c2928704d_do_ret_sys_execve
    perf: tools/include/linux/refcount.h:131: refcount_sub_and_test: Assertion `!(new > val)' failed.

    Here the addresses match the bpf interpreter:

    # grep -e ffffffffa9b6b530 -e ffffffffa9b6b3b0 -e ffffffffa9b6b3f0 /proc/kallsyms
    ffffffffa9b6b3b0 t __bpf_prog_run224
    ffffffffa9b6b3f0 t __bpf_prog_run192
    ffffffffa9b6b530 t __bpf_prog_run32

    Fix by not allowing vmlinux_map to be removed by PERF_RECORD_KSYMBOL
    unregister event.

    Signed-off-by: Tommi Rantala
    Acked-by: Jiri Olsa
    Tested-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201016114718.54332-1-tommi.t.rantala@nokia.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Tommi Rantala
     
  • To pick the changes from:

    ecb8ac8b1f146915 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")

    That addresses these perf build warnings:

    Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
    diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
    Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
    diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl

    Cc: Adrian Hunter
    Cc: Ian Rogers
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Minchan Kim
    Cc: Namhyung Kim
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • To pick the changes in:

    85367030a6c7ef33 ("libbpf: Centralize poisoning and poison reallocarray()")
    7d9c71e10baa3496 ("libbpf: Extract generic string hashing function for reuse")

    Those don't entail any changes in tools/perf.

    This addresses this perf build warning:

    Warning: Kernel ABI header at 'tools/perf/util/hashmap.h' differs from latest version at 'tools/lib/bpf/hashmap.h'
    diff -u tools/perf/util/hashmap.h tools/lib/bpf/hashmap.h

    Not a kernel ABI, it's just that this uses the mechanism in place for
    checking kernel ABI file drift.

    Cc: Adrian Hunter
    Cc: Alexei Starovoitov
    Cc: Andrii Nakryiko
    Cc: Ian Rogers
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • To avoid breaking the build by mixing files compiled with things coming
    from distro specific compiler options for perl with the rest of perf,
    i.e. to avoid this:

    `.gnu.debuglto_.debug_macro' referenced in section `.gnu.debuglto_.debug_macro' of /tmp/build/perf/util/scripting-engines/perf-in.o: defined in discarded section `.gnu.debuglto_.debug_macro[wm4.stdcpredef.h.19.8dc41bed5d9037ff9622e015fb5f0ce3]' of /tmp/build/perf/util/scripting-engines/perf-in.o

    Noticed on Fedora 33.

    Signed-off-by: Justin M. Forbes
    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1593431
    Cc: Jiri Olsa
    Link: https://src.fedoraproject.org/rpms/kernel-tools/c/589a32b62f0c12516ab7b34e3dd30d450145bfa4?branch=master
    Signed-off-by: Arnaldo Carvalho de Melo

    Justin M. Forbes
     

18 Oct, 2020

1 commit

  • …x/kernel/git/acme/linux

    Pull perf tools updates from Arnaldo Carvalho de Melo:

    - cgroup improvements for 'perf stat', allowing for compact
    specification of events and cgroups in the command line.

    - Support per thread topdown metrics in 'perf stat'.

    - Support sample-read topdown metric group in 'perf record'

    - Show the start of latency, in addition to its end, in 'perf sched
    latency'.

    - Add min, max to 'perf script' futex-contention output, in addition to
    avg.

    - Allow usage of 'perf_event_attr->exclusive' attribute via the new
    ':e' event modifier.

    - Add 'snapshot' command to 'perf record --control', using it with
    Intel PT.

    - Support FIFO file names as alternative options to 'perf record
    --control'.

    - Introduce branch history "streams", to compare 'perf record' runs
    with 'perf diff' based on branch records and report hot streams.

    - Support PE executable symbol tables using libbfd, to profile, for
    instance, wine binaries.

    - Add filter support for option 'perf ftrace -F/--funcs'.

    - Allow configuring the 'disassembler_style' 'perf annotate' knob via
    'perf config'

    - Update CascadelakeX and SkylakeX JSON vendor events files.

    - Add support for parsing perchip/percore JSON vendor events.

    - Add power9 hv_24x7 core level metric events.

    - Add L2 prefetch, ITLB instruction fetch hits JSON events for AMD
    zen1.

    - Enable Family 19h users by matching Zen2 AMD vendor events.

    - Use debuginfod in 'perf probe' when required debug files not found
    locally.

    - Display negative tid in non-sample events in 'perf script'.

    - Make GTK2 support opt-in

    - Add build test with GTK+

    - Add missing -lzstd to the fast path feature detection

    - Add scripts to auto generate 'mmap', 'mremap' string<->id tables for
    use in 'perf trace'.

    - Show python test script in verbose mode.

    - Fix uncore metric expressions

    - Msan uninitialized use fixes.

    - Use condition variables in 'perf bench numa'

    - Autodetect python3 binary in systems without python2.

    - Support md5 build ids in addition to sha1.

    - Add build id 'perf test' regression test.

    - Fix printable strings in python3 scripts.

    - Fix off by ones in 'perf trace' in arches using libaudit.

    - Fix JSON event code for events referencing std arch events.

    - Introduce 'perf test' shell script for Arm CoreSight testing.

    - Add rdtsc() for Arm64 for used in the PERF_RECORD_TIME_CONV metadata
    event and in 'perf test tsc'.

    - 'perf c2c' improvements: Add "RMT Load Hit" metric, "Total Stores",
    fixes and documentation update.

    - Fix usage of reloc_sym in 'perf probe' when using both kallsyms and
    debuginfo files.

    - Do not print 'Metric Groups:' unnecessarily in 'perf list'

    - Refcounting fixes in the event parsing code.

    - Add expand cgroup event 'perf test' entry.

    - Fix out of bounds CPU map access when handling armv8_pmu events in
    'perf stat'.

    - Add build-id injection 'perf bench' benchmark.

    - Enter namespace when reading build-id in 'perf inject'.

    - Do not load map/dso when injecting build-id speeding up the 'perf
    inject' process.

    - Add --buildid-all option to avoid processing all samples, just the
    mmap metadata events.

    - Add feature test to check if libbfd has buildid support

    - Add 'perf test' entry for PE binary format support.

    - Fix typos in power8 PMU vendor events JSON files.

    - Hide libtraceevent non API functions.

    * tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (113 commits)
    perf c2c: Update documentation for metrics reorganization
    perf c2c: Add metrics "RMT Load Hit"
    perf c2c: Correct LLC load hit metrics
    perf c2c: Change header for LLC local hit
    perf c2c: Use more explicit headers for HITM
    perf c2c: Change header from "LLC Load Hitm" to "Load Hitm"
    perf c2c: Organize metrics based on memory hierarchy
    perf c2c: Display "Total Stores" as a standalone metrics
    perf c2c: Display the total numbers continuously
    perf bench: Use condition variables in numa.
    perf jevents: Fix event code for events referencing std arch events
    perf diff: Support hot streams comparison
    perf streams: Report hot streams
    perf streams: Calculate the sum of total streams hits
    perf streams: Link stream pair
    perf streams: Compare two streams
    perf streams: Get the evsel_streams by evsel_idx
    perf streams: Introduce branch history "streams"
    perf intel-pt: Improve PT documentation slightly
    perf tools: Add support for exclusive groups/events
    ...

    Linus Torvalds
     

16 Oct, 2020

1 commit

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing to limit
    stack traversal in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multi-path TCP (MPTCP) support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     

15 Oct, 2020

13 commits

  • The output format for metrics has been reorganized; update the
    documentation to reflect these changes.

    Signed-off-by: Leo Yan
    Cc: Al Grant
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Don Zickus
    Cc: Ian Rogers
    Cc: James Clark
    Cc: Jiri Olsa
    Cc: Joe Mario
    Cc: Kan Liang
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lore.kernel.org/lkml/20201015144548.18482-10-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • The metrics "LLC Ld Miss" and "Load Dram" overlap with each other for
    accouting items:

    "LLC Ld Miss" = "lcl_dram" + "rmt_dram" + "rmt_hit" + "rmt_hitm"
    "Load Dram" = "lcl_dram" + "rmt_dram"

    Furthermore, the metrics "LLC Ld Miss" is not directive to show
    statistics due to it contains summary value and cannot give out
    breakdown details.

    For this reason, add a new metrics "RMT Load Hit" which is used to
    present the remote cache hit; it contains two items:

    "RMT Load Hit" = remote hit ("rmt_hit") + remote hitm ("rmt_hitm")

    As a result, the metric "LLC Ld Miss" is perfectly divided into the two
    metrics "RMT Load Hit" and "Load Dram". It's not necessary to keep the
    metric "LLC Ld Miss", so remove it.

    Before:

    # ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- LLC --- Load Dram ----
    # Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm Ld Miss Lcl Rmt
    # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ....... ........ ........
    #
    0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 548 2615 66 169 481 0 0 0
    1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 187 361 27 11 78 0 0 0
    2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 131 0 10 263 1 0 0 0

    After:

    # ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
    # Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
    # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
    #
    0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 548 2615 66 169 481 0 0 0 0
    1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 187 361 27 11 78 0 0 0 0
    2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 131 0 10 263 1 0 0 0 0

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Signed-off-by: Arnaldo Carvalho de Melo
    Link: https://lore.kernel.org/r/20201014050921.5591-9-leo.yan@linaro.org

    Leo Yan
     
  • "rmt_hit" is accounted into two metrics: one is accounted into the
    metrics "LLC Ld Miss" (see the function llc_miss() for calculation
    "llcmiss"); and it's accounted into metrics "LLC Load Hit". Thus,
    for the literal meaning, it is contradictory that "rmt_hit" is
    accounted for both "LLC Ld Miss" (LLC miss) and "LLC Load Hit"
    (LLC hit).

    Thus this is easily to introduce confusion: "LLC Load Hit" gives
    impression that all items belong to it are LLC hit; in fact "rmt_hit"
    is LLC miss and remote cache hit.

    To give out clear semantics for metric "LLC Load Hit", "rmt_hit" is
    moved out from it and changes "LLC Load Hit" to contain two items:

    LLC Load Hit = LLC's hit ("ld_llchit") + LLC's hitm ("lcl_hitm")

    For output alignment, adjust the header for "LLC Load Hit".

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201014050921.5591-8-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • Replace the header string "Lcl" with "LclHit", which more explicitly
    expresses that the event type is an LLC local hit.

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201014050921.5591-7-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • Local and remote HITM use the headers 'Lcl' and 'Rmt' respectively; if
    we want to extend the tool to display these two dimensions under any one
    metric, users cannot understand the semantics based only on the header
    string 'Lcl' or 'Rmt'.

    To express the meaning of the HITM items explicitly, this patch changes
    the header strings to "LclHitm" and "RmtHitm"; these are more readable
    and allow metrics to be extended to use the HITM items.

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201014050921.5591-6-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • The metrics "LLC Load Hitm" contains two items: one is "local Hitm" and
    another is "remote Hitm".

    "local Hitm" means: L3 HIT and was serviced by another processor core
    with a cross core snoop where modified copies were found; it's no doubt
    that "local Hitm" belongs to LLC access.

    But for "remote Hitm", based on the code in util/mem-events, it's the
    event for remote cache HIT and was serviced by another processor core
    with modified copies. Thus the remote Hitm is a remote cache's hit and
    actually it's LLC load miss.

    Now the display format gives users the impression that "local Hitm" and
    "remote Hitm" both belong to the LLC load, but as described above this
    is not the case.

    This patch changes the header from "LLC Load Hitm" to "Load Hitm", which
    avoids giving the wrong impression that all Hitm events belong to the LLC.

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201014050921.5591-5-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • The metrics are not organized based on the memory hierarchy, e.g. the
    tool doesn't order the metrics by memory node, from the close nodes
    (e.g. L1/L2 cache) to the far nodes (e.g. L3 cache and DRAM).

    To output metrics in a more friendly form, this patch refines the
    metrics order based on the memory hierarchy:

    "Core Load Hit" => "LLC Load Hit" => "LLC Ld Miss" => "Load Dram"

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201014050921.5591-4-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • The total stores are displayed under the metric "Store Reference"; to
    output the same format as total records and all loads, extract the total
    stores number as a standalone metric "Total Stores".

    After this patch, the tool shows the summary numbers ("Total records",
    "Total loads", "Total Stores") in a unified form.

    Before:

    # ----------- Cacheline ---------- Tot ----- LLC Load Hitm ----- Total Total ---- Store Reference ---- --- Load Dram ---- LLC ----- Core Load Hit ----- -- LLC Load Hit --
    # Index Address Node PA cnt Hitm Total Lcl Rmt records Loads Total L1Hit L1Miss Lcl Rmt Ld Miss FB L1 L2 Llc Rmt
    # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ........ ........
    #
    0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 0 0 0 548 2615 66 169 0
    1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 0 0 0 187 361 27 11 0
    2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 0 0 0 131 0 10 263 0

    After:

    # ----------- Cacheline ---------- Tot ----- LLC Load Hitm ----- Total Total Total ---- Stores ---- --- Load Dram ---- LLC ----- Core Load Hit ----- -- LLC Load Hit --
    # Index Address Node PA cnt Hitm Total Lcl Rmt records Loads Stores L1Hit L1Miss Lcl Rmt Ld Miss FB L1 L2 Llc Rmt
    # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ........ ........
    #
    0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 0 0 0 548 2615 66 169 0
    1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 0 0 0 187 361 27 11 0
    2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 0 0 0 131 0 10 263 0

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201014050921.5591-3-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • When viewing the statistics in "breakdown" mode, it is useful to show
    the summary numbers for the total records, all stores and all loads
    first, so that the subsequent columns can break them down into more
    detailed items.

    To achieve this, this patch displays the summary numbers for
    records/stores/loads contiguously and places them before the breakdown
    items, allowing users to easily read the summarized statistics.

    Signed-off-by: Leo Yan
    Tested-by: Joe Mario
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201014050921.5591-2-leo.yan@linaro.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Leo Yan
     
  • The existing approach to synchronization between threads in the numa
    benchmark relies on unbalanced mutexes.

    This synchronization causes thread sanitizer to warn of locks being
    taken twice on a thread without an unlock, as well as unlocks with no
    corresponding locks.

    This change replaces the synchronization with more regular condition
    variables.

    While this fixes one class of thread sanitizer warnings, there still
    remain warnings of data races due to threads reading and writing shared
    memory without any atomics.
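
    The pattern is roughly the following (a generic pthread sketch under
    assumed names, not the actual bench/numa.c code): every thread locks
    and unlocks the mutex itself, and the condition variable does the
    cross-thread signalling, so no lock/unlock pair ever spans two threads:

        #include <pthread.h>

        /* Hypothetical synchronization point shared by all worker threads;
         * nr_threads is set and the mutex/cond are initialized before use. */
        struct sync_point {
                pthread_mutex_t mutex;
                pthread_cond_t  cond;
                unsigned int    nr_waiting;
                unsigned int    nr_threads;
                unsigned int    generation;
        };

        static void sync_point_wait(struct sync_point *sp)
        {
                unsigned int gen;

                pthread_mutex_lock(&sp->mutex);
                gen = sp->generation;

                if (++sp->nr_waiting == sp->nr_threads) {
                        /* Last thread in: reset and release everyone. */
                        sp->nr_waiting = 0;
                        sp->generation++;
                        pthread_cond_broadcast(&sp->cond);
                } else {
                        /* Re-check the predicate to cope with spurious wakeups. */
                        while (gen == sp->generation)
                                pthread_cond_wait(&sp->cond, &sp->mutex);
                }
                pthread_mutex_unlock(&sp->mutex);
        }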

    Committer testing:

    Basic run on a non-NUMA machine.

    # perf bench numa

    # List of available benchmarks for collection 'numa':

    mem: Benchmark for NUMA workloads
    all: Run all NUMA benchmarks

    # perf bench numa all
    # Running numa/mem benchmark...

    # Running main, "perf bench numa numa-mem"
    #
    # Running test on: Linux five 5.8.12-200.fc32.x86_64 #1 SMP Mon Sep 28 12:17:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
    #

    # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk"
    20.076 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.073 secs average thread-runtime
    0.190 % difference between max/avg runtime
    241.828 GB data processed, per thread
    241.828 GB data processed, total
    0.083 nsecs/byte/thread runtime
    12.045 GB/sec/thread speed
    12.045 GB/sec total speed

    # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk --thp -1"
    20.045 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.014 secs average thread-runtime
    0.111 % difference between max/avg runtime
    234.304 GB data processed, per thread
    234.304 GB data processed, total
    0.086 nsecs/byte/thread runtime
    11.689 GB/sec/thread speed
    11.689 GB/sec total speed

    # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"

    Test not applicable, system has only 1 nodes.

    # Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
    20.138 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.121 secs average thread-runtime
    0.342 % difference between max/avg runtime
    135.961 GB data processed, per thread
    271.922 GB data processed, total
    0.148 nsecs/byte/thread runtime
    6.752 GB/sec/thread speed
    13.503 GB/sec total speed

    # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp 1 --no-data_rand_walk"

    Test not applicable, system has only 1 nodes.

    # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk"

    Test not applicable, system has only 1 nodes.

    # Running 1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp 1"
    0.747 secs latency to NUMA-converge
    0.747 secs slowest (max) thread-runtime
    0.000 secs fastest (min) thread-runtime
    0.714 secs average thread-runtime
    50.000 % difference between max/avg runtime
    3.228 GB data processed, per thread
    9.683 GB data processed, total
    0.231 nsecs/byte/thread runtime
    4.321 GB/sec/thread speed
    12.964 GB/sec total speed

    # Running 1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
    1.127 secs latency to NUMA-converge
    1.127 secs slowest (max) thread-runtime
    1.000 secs fastest (min) thread-runtime
    1.089 secs average thread-runtime
    5.624 % difference between max/avg runtime
    3.765 GB data processed, per thread
    15.062 GB data processed, total
    0.299 nsecs/byte/thread runtime
    3.342 GB/sec/thread speed
    13.368 GB/sec total speed

    # Running 1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
    1.003 secs latency to NUMA-converge
    1.003 secs slowest (max) thread-runtime
    0.000 secs fastest (min) thread-runtime
    0.889 secs average thread-runtime
    50.000 % difference between max/avg runtime
    2.141 GB data processed, per thread
    12.847 GB data processed, total
    0.469 nsecs/byte/thread runtime
    2.134 GB/sec/thread speed
    12.805 GB/sec total speed

    # Running 2x3-convergence, "perf bench numa mem -p 2 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
    1.814 secs latency to NUMA-converge
    1.814 secs slowest (max) thread-runtime
    1.000 secs fastest (min) thread-runtime
    1.716 secs average thread-runtime
    22.440 % difference between max/avg runtime
    3.747 GB data processed, per thread
    22.483 GB data processed, total
    0.484 nsecs/byte/thread runtime
    2.065 GB/sec/thread speed
    12.393 GB/sec total speed

    # Running 3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
    2.065 secs latency to NUMA-converge
    2.065 secs slowest (max) thread-runtime
    1.000 secs fastest (min) thread-runtime
    1.947 secs average thread-runtime
    25.788 % difference between max/avg runtime
    2.855 GB data processed, per thread
    25.694 GB data processed, total
    0.723 nsecs/byte/thread runtime
    1.382 GB/sec/thread speed
    12.442 GB/sec total speed

    # Running 4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
    1.912 secs latency to NUMA-converge
    1.912 secs slowest (max) thread-runtime
    1.000 secs fastest (min) thread-runtime
    1.775 secs average thread-runtime
    23.852 % difference between max/avg runtime
    1.479 GB data processed, per thread
    23.668 GB data processed, total
    1.293 nsecs/byte/thread runtime
    0.774 GB/sec/thread speed
    12.378 GB/sec total speed

    # Running 4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
    1.783 secs latency to NUMA-converge
    1.783 secs slowest (max) thread-runtime
    1.000 secs fastest (min) thread-runtime
    1.633 secs average thread-runtime
    21.960 % difference between max/avg runtime
    1.345 GB data processed, per thread
    21.517 GB data processed, total
    1.326 nsecs/byte/thread runtime
    0.754 GB/sec/thread speed
    12.067 GB/sec total speed

    # Running 4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
    5.396 secs latency to NUMA-converge
    5.396 secs slowest (max) thread-runtime
    4.000 secs fastest (min) thread-runtime
    4.928 secs average thread-runtime
    12.937 % difference between max/avg runtime
    2.721 GB data processed, per thread
    65.306 GB data processed, total
    1.983 nsecs/byte/thread runtime
    0.504 GB/sec/thread speed
    12.102 GB/sec total speed

    # Running 4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp 1"
    3.121 secs latency to NUMA-converge
    3.121 secs slowest (max) thread-runtime
    2.000 secs fastest (min) thread-runtime
    2.836 secs average thread-runtime
    17.962 % difference between max/avg runtime
    1.194 GB data processed, per thread
    38.192 GB data processed, total
    2.615 nsecs/byte/thread runtime
    0.382 GB/sec/thread speed
    12.236 GB/sec total speed

    # Running 8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
    4.302 secs latency to NUMA-converge
    4.302 secs slowest (max) thread-runtime
    3.000 secs fastest (min) thread-runtime
    4.045 secs average thread-runtime
    15.133 % difference between max/avg runtime
    1.631 GB data processed, per thread
    52.178 GB data processed, total
    2.638 nsecs/byte/thread runtime
    0.379 GB/sec/thread speed
    12.128 GB/sec total speed

    # Running 8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
    4.418 secs latency to NUMA-converge
    4.418 secs slowest (max) thread-runtime
    3.000 secs fastest (min) thread-runtime
    4.104 secs average thread-runtime
    16.045 % difference between max/avg runtime
    1.664 GB data processed, per thread
    53.254 GB data processed, total
    2.655 nsecs/byte/thread runtime
    0.377 GB/sec/thread speed
    12.055 GB/sec total speed

    # Running 3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
    0.973 secs latency to NUMA-converge
    0.973 secs slowest (max) thread-runtime
    0.000 secs fastest (min) thread-runtime
    0.955 secs average thread-runtime
    50.000 % difference between max/avg runtime
    4.124 GB data processed, per thread
    12.372 GB data processed, total
    0.236 nsecs/byte/thread runtime
    4.238 GB/sec/thread speed
    12.715 GB/sec total speed

    # Running 4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
    0.820 secs latency to NUMA-converge
    0.820 secs slowest (max) thread-runtime
    0.000 secs fastest (min) thread-runtime
    0.808 secs average thread-runtime
    50.000 % difference between max/avg runtime
    2.555 GB data processed, per thread
    10.220 GB data processed, total
    0.321 nsecs/byte/thread runtime
    3.117 GB/sec/thread speed
    12.468 GB/sec total speed

    # Running 8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
    0.667 secs latency to NUMA-converge
    0.667 secs slowest (max) thread-runtime
    0.000 secs fastest (min) thread-runtime
    0.607 secs average thread-runtime
    50.000 % difference between max/avg runtime
    1.009 GB data processed, per thread
    8.069 GB data processed, total
    0.661 nsecs/byte/thread runtime
    1.512 GB/sec/thread speed
    12.095 GB/sec total speed

    # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp 1"
    1.546 secs latency to NUMA-converge
    1.546 secs slowest (max) thread-runtime
    1.000 secs fastest (min) thread-runtime
    1.485 secs average thread-runtime
    17.664 % difference between max/avg runtime
    1.162 GB data processed, per thread
    18.594 GB data processed, total
    1.331 nsecs/byte/thread runtime
    0.752 GB/sec/thread speed
    12.025 GB/sec total speed

    # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp 1"
    0.812 secs latency to NUMA-converge
    0.812 secs slowest (max) thread-runtime
    0.000 secs fastest (min) thread-runtime
    0.739 secs average thread-runtime
    50.000 % difference between max/avg runtime
    0.309 GB data processed, per thread
    9.874 GB data processed, total
    2.630 nsecs/byte/thread runtime
    0.380 GB/sec/thread speed
    12.166 GB/sec total speed

    # Running 2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
    20.044 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.020 secs average thread-runtime
    0.109 % difference between max/avg runtime
    125.750 GB data processed, per thread
    251.501 GB data processed, total
    0.159 nsecs/byte/thread runtime
    6.274 GB/sec/thread speed
    12.548 GB/sec total speed

    # Running 3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
    20.148 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.090 secs average thread-runtime
    0.367 % difference between max/avg runtime
    85.267 GB data processed, per thread
    255.800 GB data processed, total
    0.236 nsecs/byte/thread runtime
    4.232 GB/sec/thread speed
    12.696 GB/sec total speed

    # Running 4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
    20.169 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.100 secs average thread-runtime
    0.419 % difference between max/avg runtime
    63.144 GB data processed, per thread
    252.576 GB data processed, total
    0.319 nsecs/byte/thread runtime
    3.131 GB/sec/thread speed
    12.523 GB/sec total speed

    # Running 8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1"
    20.175 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.107 secs average thread-runtime
    0.433 % difference between max/avg runtime
    31.267 GB data processed, per thread
    250.133 GB data processed, total
    0.645 nsecs/byte/thread runtime
    1.550 GB/sec/thread speed
    12.398 GB/sec total speed

    # Running 8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
    20.216 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.113 secs average thread-runtime
    0.535 % difference between max/avg runtime
    30.998 GB data processed, per thread
    247.981 GB data processed, total
    0.652 nsecs/byte/thread runtime
    1.533 GB/sec/thread speed
    12.266 GB/sec total speed

    # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp 1"
    20.234 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.174 secs average thread-runtime
    0.577 % difference between max/avg runtime
    15.377 GB data processed, per thread
    246.039 GB data processed, total
    1.316 nsecs/byte/thread runtime
    0.760 GB/sec/thread speed
    12.160 GB/sec total speed

    # Running 1x4-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp 1"
    20.040 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.028 secs average thread-runtime
    0.099 % difference between max/avg runtime
    66.832 GB data processed, per thread
    267.328 GB data processed, total
    0.300 nsecs/byte/thread runtime
    3.335 GB/sec/thread speed
    13.340 GB/sec total speed

    # Running 1x8-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1"
    20.064 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.034 secs average thread-runtime
    0.160 % difference between max/avg runtime
    32.911 GB data processed, per thread
    263.286 GB data processed, total
    0.610 nsecs/byte/thread runtime
    1.640 GB/sec/thread speed
    13.122 GB/sec total speed

    # Running 1x16-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp 1"
    20.092 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.052 secs average thread-runtime
    0.230 % difference between max/avg runtime
    16.131 GB data processed, per thread
    258.088 GB data processed, total
    1.246 nsecs/byte/thread runtime
    0.803 GB/sec/thread speed
    12.845 GB/sec total speed

    # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp 1"
    20.099 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.063 secs average thread-runtime
    0.247 % difference between max/avg runtime
    7.962 GB data processed, per thread
    254.773 GB data processed, total
    2.525 nsecs/byte/thread runtime
    0.396 GB/sec/thread speed
    12.676 GB/sec total speed

    # Running 2x3-bw-process, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp 1"
    20.150 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.120 secs average thread-runtime
    0.372 % difference between max/avg runtime
    44.827 GB data processed, per thread
    268.960 GB data processed, total
    0.450 nsecs/byte/thread runtime
    2.225 GB/sec/thread speed
    13.348 GB/sec total speed

    # Running 4x4-bw-process, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp 1"
    20.258 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.168 secs average thread-runtime
    0.636 % difference between max/avg runtime
    17.079 GB data processed, per thread
    273.263 GB data processed, total
    1.186 nsecs/byte/thread runtime
    0.843 GB/sec/thread speed
    13.489 GB/sec total speed

    # Running 4x6-bw-process, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp 1"
    20.559 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.382 secs average thread-runtime
    1.359 % difference between max/avg runtime
    10.758 GB data processed, per thread
    258.201 GB data processed, total
    1.911 nsecs/byte/thread runtime
    0.523 GB/sec/thread speed
    12.559 GB/sec total speed

    # Running 4x8-bw-process, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1"
    20.744 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.516 secs average thread-runtime
    1.792 % difference between max/avg runtime
    8.069 GB data processed, per thread
    258.201 GB data processed, total
    2.571 nsecs/byte/thread runtime
    0.389 GB/sec/thread speed
    12.447 GB/sec total speed

    # Running 4x8-bw-process-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
    20.855 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.561 secs average thread-runtime
    2.050 % difference between max/avg runtime
    8.069 GB data processed, per thread
    258.201 GB data processed, total
    2.585 nsecs/byte/thread runtime
    0.387 GB/sec/thread speed
    12.381 GB/sec total speed

    # Running 3x3-bw-process, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp 1"
    20.134 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.077 secs average thread-runtime
    0.333 % difference between max/avg runtime
    28.091 GB data processed, per thread
    252.822 GB data processed, total
    0.717 nsecs/byte/thread runtime
    1.395 GB/sec/thread speed
    12.557 GB/sec total speed

    # Running 5x5-bw-process, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
    20.588 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.375 secs average thread-runtime
    1.427 % difference between max/avg runtime
    10.177 GB data processed, per thread
    254.436 GB data processed, total
    2.023 nsecs/byte/thread runtime
    0.494 GB/sec/thread speed
    12.359 GB/sec total speed

    # Running 2x16-bw-process, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp 1"
    20.657 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.429 secs average thread-runtime
    1.589 % difference between max/avg runtime
    8.170 GB data processed, per thread
    261.429 GB data processed, total
    2.528 nsecs/byte/thread runtime
    0.395 GB/sec/thread speed
    12.656 GB/sec total speed

    # Running 1x32-bw-process, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp 1"
    22.981 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    21.996 secs average thread-runtime
    6.486 % difference between max/avg runtime
    8.863 GB data processed, per thread
    283.606 GB data processed, total
    2.593 nsecs/byte/thread runtime
    0.386 GB/sec/thread speed
    12.341 GB/sec total speed

    # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1"
    20.047 secs slowest (max) thread-runtime
    19.000 secs fastest (min) thread-runtime
    20.026 secs average thread-runtime
    2.611 % difference between max/avg runtime
    8.441 GB data processed, per thread
    270.111 GB data processed, total
    2.375 nsecs/byte/thread runtime
    0.421 GB/sec/thread speed
    13.474 GB/sec total speed

    # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1 --thp -1"
    20.088 secs slowest (max) thread-runtime
    19.000 secs fastest (min) thread-runtime
    20.025 secs average thread-runtime
    2.709 % difference between max/avg runtime
    8.411 GB data processed, per thread
    269.142 GB data processed, total
    2.388 nsecs/byte/thread runtime
    0.419 GB/sec/thread speed
    13.398 GB/sec total speed

    # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1"
    20.293 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.175 secs average thread-runtime
    0.721 % difference between max/avg runtime
    7.918 GB data processed, per thread
    253.374 GB data processed, total
    2.563 nsecs/byte/thread runtime
    0.390 GB/sec/thread speed
    12.486 GB/sec total speed

    # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1 --thp -1"
    20.411 secs slowest (max) thread-runtime
    20.000 secs fastest (min) thread-runtime
    20.226 secs average thread-runtime
    1.006 % difference between max/avg runtime
    7.931 GB data processed, per thread
    253.778 GB data processed, total
    2.574 nsecs/byte/thread runtime
    0.389 GB/sec/thread speed
    12.434 GB/sec total speed

    #

    Signed-off-by: Ian Rogers
    Acked-by: Jiri Olsa
    Tested-by: Arnaldo Carvalho de Melo
    Link: https://lore.kernel.org/r/20201012161611.366482-1-irogers@google.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Ian Rogers
     
  • Pull objtool updates from Ingo Molnar:
    "Most of the changes are cleanups and reorganization to make the
    objtool code more arch-agnostic. This is in preparation for non-x86
    support.

    Other changes:

    - KASAN fixes

    - Handle unreachable trap after call to noreturn functions better

    - Ignore unreachable fake jumps

    - Misc smaller fixes & cleanups"

    * tag 'objtool-core-2020-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf build: Allow nested externs to enable BUILD_BUG() usage
    objtool: Allow nested externs to enable BUILD_BUG()
    objtool: Permit __kasan_check_{read,write} under UACCESS
    objtool: Ignore unreachable trap after call to noreturn functions
    objtool: Handle calling non-function symbols in other sections
    objtool: Ignore unreachable fake jumps
    objtool: Remove useless tests before save_reg()
    objtool: Decode unwind hint register depending on architecture
    objtool: Make unwind hint definitions available to other architectures
    objtool: Only include valid definitions depending on source file type
    objtool: Rename frame.h -> objtool.h
    objtool: Refactor jump table code to support other architectures
    objtool: Make relocation in alternative handling arch dependent
    objtool: Abstract alternative special case handling
    objtool: Move macros describing structures to arch-dependent code
    objtool: Make sync-check consider the target architecture
    objtool: Group headers to check in a single list
    objtool: Define 'struct orc_entry' only when needed
    objtool: Skip ORC entry creation for non-text sections
    objtool: Move ORC logic out of check()
    ...

    Linus Torvalds
     
  • The event code for events referencing std arch events is incorrectly
    evaluated in json_events().

    The issue is that je.event is evaluated properly in try_fixup(), but is
    later NULLified by the real_event() call, as "event" may be NULL.

    Fix this by setting "event" to the same value as je.event in
    try_fixup().

    Also remove support for overwriting the event code for events using std
    arch events, as it is not used.
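
    Schematically, the bug is the familiar pattern of a later call clobbering
    an already-resolved value when its input is NULL. A self-contained
    illustration with hypothetical stand-ins (not the actual jevents.c code):

        #include <stdio.h>

        /* Stand-in for real_event(): passes "event" through, so a NULL
         * "event" wipes out whatever was computed earlier. */
        static const char *pass_through(const char *event)
        {
                return event;
        }

        int main(void)
        {
                const char *event    = NULL;            /* JSON entry has no explicit event string  */
                const char *je_event = "event=0x21";    /* already fixed up from the std arch event */

                /* Buggy order: the fixed-up value is lost because event == NULL. */
                je_event = pass_through(event);
                printf("buggy: %s\n", je_event ? je_event : "(null)");

                /* Fixed order: propagate the fix-up into "event" first, as
                 * try_fixup() now does, so the later call preserves it. */
                je_event = "event=0x21";
                event = je_event;
                je_event = pass_through(event);
                printf("fixed: %s\n", je_event);
                return 0;
        }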

    Signed-off-by: John Garry
    Reviewed-By: Kajol Jain
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/1602170368-11892-1-git-send-email-john.garry@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    John Garry
     
  • This patch enables perf-diff with the "--stream" option.

    "--stream": Enable hot streams comparison

    Now let's see an example.

    perf record -b ... Generate perf.data.old with branch data
    perf record -b ... Generate perf.data with branch data
    perf diff --stream

    [ Matched hot streams ]

    hot chain pair 1:
    cycles: 1, hits: 27.77% cycles: 1, hits: 9.24%
    --------------------------- --------------------------
    main div.c:39 main div.c:39
    main div.c:44 main div.c:44

    hot chain pair 2:
    cycles: 34, hits: 20.06% cycles: 27, hits: 16.98%
    --------------------------- --------------------------
    __random_r random_r.c:360 __random_r random_r.c:360
    __random_r random_r.c:388 __random_r random_r.c:388
    __random_r random_r.c:388 __random_r random_r.c:388
    __random_r random_r.c:380 __random_r random_r.c:380
    __random_r random_r.c:357 __random_r random_r.c:357
    __random random.c:293 __random random.c:293
    __random random.c:293 __random random.c:293
    __random random.c:291 __random random.c:291
    __random random.c:291 __random random.c:291
    __random random.c:291 __random random.c:291
    __random random.c:288 __random random.c:288
    rand rand.c:27 rand rand.c:27
    rand rand.c:26 rand rand.c:26
    rand@plt rand@plt
    rand@plt rand@plt
    compute_flag div.c:25 compute_flag div.c:25
    compute_flag div.c:22 compute_flag div.c:22
    main div.c:40 main div.c:40
    main div.c:40 main div.c:40
    main div.c:39 main div.c:39

    hot chain pair 3:
    cycles: 9, hits: 4.48% cycles: 6, hits: 4.51%
    --------------------------- --------------------------
    __random_r random_r.c:360 __random_r random_r.c:360
    __random_r random_r.c:388 __random_r random_r.c:388
    __random_r random_r.c:388 __random_r random_r.c:388
    __random_r random_r.c:380 __random_r random_r.c:380

    [ Hot streams in old perf data only ]

    hot chain 1:
    cycles: 18, hits: 6.75%
    --------------------------
    __random_r random_r.c:360
    __random_r random_r.c:388
    __random_r random_r.c:388
    __random_r random_r.c:380
    __random_r random_r.c:357
    __random random.c:293
    __random random.c:293
    __random random.c:291
    __random random.c:291
    __random random.c:291
    __random random.c:288
    rand rand.c:27
    rand rand.c:26
    rand@plt
    rand@plt
    compute_flag div.c:25
    compute_flag div.c:22
    main div.c:40

    hot chain 2:
    cycles: 29, hits: 2.78%
    --------------------------
    compute_flag div.c:22
    main div.c:40
    main div.c:40
    main div.c:39

    [ Hot streams in new perf data only ]

    hot chain 1:
    cycles: 4, hits: 4.54%
    --------------------------
    main div.c:42
    compute_flag div.c:28

    hot chain 2:
    cycles: 5, hits: 3.51%
    --------------------------
    main div.c:39
    main div.c:44
    main div.c:42
    compute_flag div.c:28

    Signed-off-by: Jin Yao
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/20201009022845.13141-8-yao.jin@linux.intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jin Yao