Eric Lee / smarc-fsl-linux-kernel

13 Oct, 2020

1 commit

0bf02a0d8 perf bench: Add build-id injection benchmark ... Browse Code »

Sometimes I can see that 'perf record' piped with 'perf inject' take a
long time processing build-ids.

So introduce a inject-build-id benchmark to the internals benchmark
suite to measure its overhead regularly.

It runs the 'perf inject' command internally and feeds the given number
of synthesized events (MMAP2 + SAMPLE basically).

Usage: perf bench internals inject-build-id

-i, --iterations Number of iterations used to compute average (default: 100)
-m, --nr-mmaps Number of mmap events for each iteration (default: 100)
-n, --nr-samples Number of sample events per mmap event (default: 100)
-v, --verbose be more verbose (show iteration count, DSO name, etc)

By default, it measures average processing time of 100 MMAP2 events
and 10000 SAMPLE events. Below is a result on my laptop.

$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 25.789 msec (+- 0.202 msec)
Average time per event: 2.528 usec (+- 0.020 usec)
Average memory usage: 8411 KB (+- 7 KB)

Committer testing:

$ perf bench
Usage:
perf bench [] []

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks

$ perf bench internals

# List of available benchmarks for collection 'internals':

synthesize: Benchmark perf event synthesis
kallsyms-parse: Benchmark kallsyms parsing
inject-build-id: Benchmark build-id injection

$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.202 msec (+- 0.059 msec)
Average time per event: 1.392 usec (+- 0.006 usec)
Average memory usage: 12650 KB (+- 10 KB)
Average build-id-all injection took: 12.831 msec (+- 0.071 msec)
Average time per event: 1.258 usec (+- 0.007 usec)
Average memory usage: 11895 KB (+- 10 KB)
$

$ perf stat -r5 perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.380 msec (+- 0.056 msec)
Average time per event: 1.410 usec (+- 0.006 usec)
Average memory usage: 12608 KB (+- 11 KB)
Average build-id-all injection took: 11.889 msec (+- 0.064 msec)
Average time per event: 1.166 usec (+- 0.006 usec)
Average memory usage: 11838 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.246 msec (+- 0.065 msec)
Average time per event: 1.397 usec (+- 0.006 usec)
Average memory usage: 12744 KB (+- 10 KB)
Average build-id-all injection took: 12.019 msec (+- 0.066 msec)
Average time per event: 1.178 usec (+- 0.006 usec)
Average memory usage: 11963 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.321 msec (+- 0.067 msec)
Average time per event: 1.404 usec (+- 0.007 usec)
Average memory usage: 12690 KB (+- 10 KB)
Average build-id-all injection took: 11.909 msec (+- 0.041 msec)
Average time per event: 1.168 usec (+- 0.004 usec)
Average memory usage: 11938 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.287 msec (+- 0.059 msec)
Average time per event: 1.401 usec (+- 0.006 usec)
Average memory usage: 12864 KB (+- 10 KB)
Average build-id-all injection took: 11.862 msec (+- 0.058 msec)
Average time per event: 1.163 usec (+- 0.006 usec)
Average memory usage: 12103 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.402 msec (+- 0.053 msec)
Average time per event: 1.412 usec (+- 0.005 usec)
Average memory usage: 12876 KB (+- 10 KB)
Average build-id-all injection took: 11.826 msec (+- 0.061 msec)
Average time per event: 1.159 usec (+- 0.006 usec)
Average memory usage: 12111 KB (+- 10 KB)

Performance counter stats for 'perf bench internals inject-build-id' (5 runs):

4,267.48 msec task-clock:u # 1.502 CPUs utilized ( +- 0.14% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
102,092 page-faults:u # 0.024 M/sec ( +- 0.08% )
3,894,589,578 cycles:u # 0.913 GHz ( +- 0.19% ) (83.49%)
140,078,421 stalled-cycles-frontend:u # 3.60% frontend cycles idle ( +- 0.77% ) (83.34%)
948,581,189 stalled-cycles-backend:u # 24.36% backend cycles idle ( +- 0.46% ) (83.25%)
5,835,587,719 instructions:u # 1.50 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.21% ) (83.24%)
1,267,423,636 branches:u # 296.996 M/sec ( +- 0.22% ) (83.12%)
17,484,290 branch-misses:u # 1.38% of all branches ( +- 0.12% ) (83.55%)

2.84176 +- 0.00222 seconds time elapsed ( +- 0.08% )

$

Acked-by: Jiri Olsa
Tested-by: Arnaldo Carvalho de Melo
Signed-off-by: Namhyung Kim
Link: https://lore.kernel.org/r/20201012070214.2074921-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Namhyung Kim
2020-10-13 21:59:42 +0800

31 Jul, 2020

1 commit

7c43b0c1d perf bench: Add benchmark of find_next_bit ... Browse Code »

for_each_set_bit, or similar functions like for_each_cpu, may be hot
within the kernel. If many bits were set then one could imagine on Intel
a "bt" instruction with every bit may be faster than the function call
and word length find_next_bit logic. Add a benchmark to measure this.

This benchmark on AMD rome and Intel skylakex shows "bt" is not a good
option except for very small bitmaps.

Committer testing:

# perf bench
Usage:
perf bench [] []

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks

# perf bench mem

# List of available benchmarks for collection 'mem':

memcpy: Benchmark for memcpy() functions
memset: Benchmark for memset() functions
find_bit: Benchmark for find_bit() functions
all: Run all memory access benchmarks

# perf bench mem find_bit
# Running 'mem/find_bit' benchmark:
100000 operations 1 bits set of 1 bits
Average for_each_set_bit took: 730.200 usec (+- 6.468 usec)
Average test_bit loop took: 366.200 usec (+- 4.652 usec)
100000 operations 1 bits set of 2 bits
Average for_each_set_bit took: 781.000 usec (+- 24.247 usec)
Average test_bit loop took: 550.200 usec (+- 4.152 usec)
100000 operations 2 bits set of 2 bits
Average for_each_set_bit took: 1113.400 usec (+- 112.340 usec)
Average test_bit loop took: 1098.500 usec (+- 182.834 usec)
100000 operations 1 bits set of 4 bits
Average for_each_set_bit took: 843.800 usec (+- 8.772 usec)
Average test_bit loop took: 948.800 usec (+- 10.278 usec)
100000 operations 2 bits set of 4 bits
Average for_each_set_bit took: 1185.800 usec (+- 114.345 usec)
Average test_bit loop took: 1473.200 usec (+- 175.498 usec)
100000 operations 4 bits set of 4 bits
Average for_each_set_bit took: 1769.667 usec (+- 233.177 usec)
Average test_bit loop took: 1864.933 usec (+- 187.470 usec)
100000 operations 1 bits set of 8 bits
Average for_each_set_bit took: 898.000 usec (+- 21.755 usec)
Average test_bit loop took: 1768.400 usec (+- 23.672 usec)
100000 operations 2 bits set of 8 bits
Average for_each_set_bit took: 1244.900 usec (+- 116.396 usec)
Average test_bit loop took: 2201.800 usec (+- 145.398 usec)
100000 operations 4 bits set of 8 bits
Average for_each_set_bit took: 1822.533 usec (+- 231.554 usec)
Average test_bit loop took: 2569.467 usec (+- 168.453 usec)
100000 operations 8 bits set of 8 bits
Average for_each_set_bit took: 2845.100 usec (+- 441.365 usec)
Average test_bit loop took: 3023.300 usec (+- 219.575 usec)
100000 operations 1 bits set of 16 bits
Average for_each_set_bit took: 923.400 usec (+- 17.560 usec)
Average test_bit loop took: 3240.000 usec (+- 16.492 usec)
100000 operations 2 bits set of 16 bits
Average for_each_set_bit took: 1264.300 usec (+- 114.034 usec)
Average test_bit loop took: 3714.400 usec (+- 158.898 usec)
100000 operations 4 bits set of 16 bits
Average for_each_set_bit took: 1817.867 usec (+- 222.199 usec)
Average test_bit loop took: 4015.333 usec (+- 154.162 usec)
100000 operations 8 bits set of 16 bits
Average for_each_set_bit took: 2826.350 usec (+- 433.457 usec)
Average test_bit loop took: 4460.350 usec (+- 210.762 usec)
100000 operations 16 bits set of 16 bits
Average for_each_set_bit took: 4615.600 usec (+- 809.350 usec)
Average test_bit loop took: 5129.960 usec (+- 320.821 usec)
100000 operations 1 bits set of 32 bits
Average for_each_set_bit took: 904.400 usec (+- 14.250 usec)
Average test_bit loop took: 6194.000 usec (+- 29.254 usec)
100000 operations 2 bits set of 32 bits
Average for_each_set_bit took: 1252.700 usec (+- 116.432 usec)
Average test_bit loop took: 6652.400 usec (+- 154.352 usec)
100000 operations 4 bits set of 32 bits
Average for_each_set_bit took: 1824.200 usec (+- 229.133 usec)
Average test_bit loop took: 6961.733 usec (+- 154.682 usec)
100000 operations 8 bits set of 32 bits
Average for_each_set_bit took: 2823.950 usec (+- 432.296 usec)
Average test_bit loop took: 7351.900 usec (+- 193.626 usec)
100000 operations 16 bits set of 32 bits
Average for_each_set_bit took: 4552.560 usec (+- 785.141 usec)
Average test_bit loop took: 7998.360 usec (+- 305.629 usec)
100000 operations 32 bits set of 32 bits
Average for_each_set_bit took: 7557.067 usec (+- 1407.702 usec)
Average test_bit loop took: 9072.400 usec (+- 513.209 usec)
100000 operations 1 bits set of 64 bits
Average for_each_set_bit took: 896.800 usec (+- 14.389 usec)
Average test_bit loop took: 11927.200 usec (+- 68.862 usec)
100000 operations 2 bits set of 64 bits
Average for_each_set_bit took: 1230.400 usec (+- 111.731 usec)
Average test_bit loop took: 12478.600 usec (+- 189.382 usec)
100000 operations 4 bits set of 64 bits
Average for_each_set_bit took: 1844.733 usec (+- 244.826 usec)
Average test_bit loop took: 12911.467 usec (+- 206.246 usec)
100000 operations 8 bits set of 64 bits
Average for_each_set_bit took: 2779.300 usec (+- 413.612 usec)
Average test_bit loop took: 13372.650 usec (+- 239.623 usec)
100000 operations 16 bits set of 64 bits
Average for_each_set_bit took: 4423.920 usec (+- 748.240 usec)
Average test_bit loop took: 13995.800 usec (+- 318.427 usec)
100000 operations 32 bits set of 64 bits
Average for_each_set_bit took: 7580.600 usec (+- 1462.407 usec)
Average test_bit loop took: 15063.067 usec (+- 516.477 usec)
100000 operations 64 bits set of 64 bits
Average for_each_set_bit took: 13391.514 usec (+- 2765.371 usec)
Average test_bit loop took: 16974.914 usec (+- 916.936 usec)
100000 operations 1 bits set of 128 bits
Average for_each_set_bit took: 1153.800 usec (+- 124.245 usec)
Average test_bit loop took: 26959.000 usec (+- 714.047 usec)
100000 operations 2 bits set of 128 bits
Average for_each_set_bit took: 1445.200 usec (+- 113.587 usec)
Average test_bit loop took: 25798.800 usec (+- 512.908 usec)
100000 operations 4 bits set of 128 bits
Average for_each_set_bit took: 1990.933 usec (+- 219.362 usec)
Average test_bit loop took: 25589.400 usec (+- 348.288 usec)
100000 operations 8 bits set of 128 bits
Average for_each_set_bit took: 2963.000 usec (+- 419.487 usec)
Average test_bit loop took: 25690.050 usec (+- 262.025 usec)
100000 operations 16 bits set of 128 bits
Average for_each_set_bit took: 4585.200 usec (+- 741.734 usec)
Average test_bit loop took: 26125.040 usec (+- 274.127 usec)
100000 operations 32 bits set of 128 bits
Average for_each_set_bit took: 7626.200 usec (+- 1404.950 usec)
Average test_bit loop took: 27038.867 usec (+- 442.554 usec)
100000 operations 64 bits set of 128 bits
Average for_each_set_bit took: 13343.371 usec (+- 2686.460 usec)
Average test_bit loop took: 28936.543 usec (+- 883.257 usec)
100000 operations 128 bits set of 128 bits
Average for_each_set_bit took: 23442.950 usec (+- 4880.541 usec)
Average test_bit loop took: 32484.125 usec (+- 1691.931 usec)
100000 operations 1 bits set of 256 bits
Average for_each_set_bit took: 1183.000 usec (+- 32.073 usec)
Average test_bit loop took: 50114.600 usec (+- 198.880 usec)
100000 operations 2 bits set of 256 bits
Average for_each_set_bit took: 1550.000 usec (+- 124.550 usec)
Average test_bit loop took: 50334.200 usec (+- 128.425 usec)
100000 operations 4 bits set of 256 bits
Average for_each_set_bit took: 2164.333 usec (+- 246.359 usec)
Average test_bit loop took: 49959.867 usec (+- 188.035 usec)
100000 operations 8 bits set of 256 bits
Average for_each_set_bit took: 3211.200 usec (+- 454.829 usec)
Average test_bit loop took: 50140.850 usec (+- 176.046 usec)
100000 operations 16 bits set of 256 bits
Average for_each_set_bit took: 5181.640 usec (+- 882.726 usec)
Average test_bit loop took: 51003.160 usec (+- 419.601 usec)
100000 operations 32 bits set of 256 bits
Average for_each_set_bit took: 8369.333 usec (+- 1513.150 usec)
Average test_bit loop took: 52096.700 usec (+- 573.022 usec)
100000 operations 64 bits set of 256 bits
Average for_each_set_bit took: 13866.857 usec (+- 2649.393 usec)
Average test_bit loop took: 53989.600 usec (+- 938.808 usec)
100000 operations 128 bits set of 256 bits
Average for_each_set_bit took: 23588.350 usec (+- 4724.222 usec)
Average test_bit loop took: 57300.625 usec (+- 1625.962 usec)
100000 operations 256 bits set of 256 bits
Average for_each_set_bit took: 42752.200 usec (+- 9202.084 usec)
Average test_bit loop took: 64426.933 usec (+- 3402.326 usec)
100000 operations 1 bits set of 512 bits
Average for_each_set_bit took: 1632.000 usec (+- 229.954 usec)
Average test_bit loop took: 98090.000 usec (+- 1120.435 usec)
100000 operations 2 bits set of 512 bits
Average for_each_set_bit took: 1937.700 usec (+- 148.902 usec)
Average test_bit loop took: 100364.100 usec (+- 1433.219 usec)
100000 operations 4 bits set of 512 bits
Average for_each_set_bit took: 2528.000 usec (+- 243.654 usec)
Average test_bit loop took: 99932.067 usec (+- 955.868 usec)
100000 operations 8 bits set of 512 bits
Average for_each_set_bit took: 3734.100 usec (+- 512.359 usec)
Average test_bit loop took: 98944.750 usec (+- 812.070 usec)
100000 operations 16 bits set of 512 bits
Average for_each_set_bit took: 5551.400 usec (+- 846.605 usec)
Average test_bit loop took: 98691.600 usec (+- 654.753 usec)
100000 operations 32 bits set of 512 bits
Average for_each_set_bit took: 8594.500 usec (+- 1446.072 usec)
Average test_bit loop took: 99176.867 usec (+- 579.990 usec)
100000 operations 64 bits set of 512 bits
Average for_each_set_bit took: 13840.743 usec (+- 2527.055 usec)
Average test_bit loop took: 100758.743 usec (+- 833.865 usec)
100000 operations 128 bits set of 512 bits
Average for_each_set_bit took: 23185.925 usec (+- 4532.910 usec)
Average test_bit loop took: 103786.700 usec (+- 1475.276 usec)
100000 operations 256 bits set of 512 bits
Average for_each_set_bit took: 40322.400 usec (+- 8341.802 usec)
Average test_bit loop took: 109433.378 usec (+- 2742.615 usec)
100000 operations 512 bits set of 512 bits
Average for_each_set_bit took: 71804.540 usec (+- 15436.546 usec)
Average test_bit loop took: 120255.440 usec (+- 5252.777 usec)
100000 operations 1 bits set of 1024 bits
Average for_each_set_bit took: 1859.600 usec (+- 27.969 usec)
Average test_bit loop took: 187676.000 usec (+- 1337.770 usec)
100000 operations 2 bits set of 1024 bits
Average for_each_set_bit took: 2273.600 usec (+- 139.420 usec)
Average test_bit loop took: 188176.000 usec (+- 684.357 usec)
100000 operations 4 bits set of 1024 bits
Average for_each_set_bit took: 2940.400 usec (+- 268.213 usec)
Average test_bit loop took: 189172.600 usec (+- 593.295 usec)
100000 operations 8 bits set of 1024 bits
Average for_each_set_bit took: 4224.200 usec (+- 547.933 usec)
Average test_bit loop took: 190257.250 usec (+- 621.021 usec)
100000 operations 16 bits set of 1024 bits
Average for_each_set_bit took: 6090.560 usec (+- 877.975 usec)
Average test_bit loop took: 190143.880 usec (+- 503.753 usec)
100000 operations 32 bits set of 1024 bits
Average for_each_set_bit took: 9178.800 usec (+- 1475.136 usec)
Average test_bit loop took: 190757.100 usec (+- 494.757 usec)
100000 operations 64 bits set of 1024 bits
Average for_each_set_bit took: 14441.457 usec (+- 2545.497 usec)
Average test_bit loop took: 192299.486 usec (+- 795.251 usec)
100000 operations 128 bits set of 1024 bits
Average for_each_set_bit took: 23623.825 usec (+- 4481.182 usec)
Average test_bit loop took: 194885.550 usec (+- 1300.817 usec)
100000 operations 256 bits set of 1024 bits
Average for_each_set_bit took: 40194.956 usec (+- 8109.056 usec)
Average test_bit loop took: 200259.311 usec (+- 2566.085 usec)
100000 operations 512 bits set of 1024 bits
Average for_each_set_bit took: 70983.560 usec (+- 15074.982 usec)
Average test_bit loop took: 210527.460 usec (+- 4968.980 usec)
100000 operations 1024 bits set of 1024 bits
Average for_each_set_bit took: 136530.345 usec (+- 31584.400 usec)
Average test_bit loop took: 233329.691 usec (+- 10814.036 usec)
100000 operations 1 bits set of 2048 bits
Average for_each_set_bit took: 3077.600 usec (+- 76.376 usec)
Average test_bit loop took: 402154.400 usec (+- 518.571 usec)
100000 operations 2 bits set of 2048 bits
Average for_each_set_bit took: 3508.600 usec (+- 148.350 usec)
Average test_bit loop took: 403814.500 usec (+- 1133.027 usec)
100000 operations 4 bits set of 2048 bits
Average for_each_set_bit took: 4219.333 usec (+- 285.844 usec)
Average test_bit loop took: 404312.533 usec (+- 985.751 usec)
100000 operations 8 bits set of 2048 bits
Average for_each_set_bit took: 5670.550 usec (+- 615.238 usec)
Average test_bit loop took: 405321.800 usec (+- 1038.487 usec)
100000 operations 16 bits set of 2048 bits
Average for_each_set_bit took: 7785.080 usec (+- 992.522 usec)
Average test_bit loop took: 406746.160 usec (+- 1015.478 usec)
100000 operations 32 bits set of 2048 bits
Average for_each_set_bit took: 11163.800 usec (+- 1627.320 usec)
Average test_bit loop took: 406124.267 usec (+- 898.785 usec)
100000 operations 64 bits set of 2048 bits
Average for_each_set_bit took: 16964.629 usec (+- 2806.130 usec)
Average test_bit loop took: 406618.514 usec (+- 798.356 usec)
100000 operations 128 bits set of 2048 bits
Average for_each_set_bit took: 27219.625 usec (+- 4988.458 usec)
Average test_bit loop took: 410149.325 usec (+- 1705.641 usec)
100000 operations 256 bits set of 2048 bits
Average for_each_set_bit took: 45138.578 usec (+- 8831.021 usec)
Average test_bit loop took: 415462.467 usec (+- 2725.418 usec)
100000 operations 512 bits set of 2048 bits
Average for_each_set_bit took: 77450.540 usec (+- 15962.238 usec)
Average test_bit loop took: 426089.180 usec (+- 5171.788 usec)
100000 operations 1024 bits set of 2048 bits
Average for_each_set_bit took: 138023.636 usec (+- 29826.959 usec)
Average test_bit loop took: 446346.636 usec (+- 9904.417 usec)
100000 operations 2048 bits set of 2048 bits
Average for_each_set_bit took: 251072.600 usec (+- 55947.692 usec)
Average test_bit loop took: 484855.983 usec (+- 18970.431 usec)
#

Signed-off-by: Ian Rogers
Tested-by: Arnaldo Carvalho de Melo
Cc: Alexander Shishkin
Cc: Andi Kleen
Cc: Jiri Olsa
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Link: http://lore.kernel.org/lkml/20200729220034.1337168-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo

Ian Rogers
2020-07-31 20:32:11 +0800

28 Jul, 2020

1 commit

c2a082030 perf bench: Add basic syscall benchmark ... Browse Code »

The usefulness of having a standard way of testing syscall performance
has come up from time to time[0]. Furthermore, some of our testing
machinery (such as 'mmtests') already makes use of a simplified version
of the microbenchmark. This patch mainly takes the same idea to measure
syscall throughput compatible with 'perf-bench' via getppid(2), yet
without any of the additional template stuff from Ingo's version (based
on numa.c). The code is identical to what mmtests uses.

[0] https://lore.kernel.org/lkml/20160201074156.GA27156@gmail.com/

Committer notes:

Add mising stdlib.h and unistd.h to get the prototypes for exit() and
getppid().

Committer testing:

$ perf bench
Usage:
perf bench [] []

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks

$
$ perf bench syscall

# List of available benchmarks for collection 'syscall':

basic: Benchmark for basic getppid(2) calls
all: Run all syscall benchmarks

$ perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10000000 getppid() calls
Total time: 3.679 [sec]

0.367957 usecs/op
2717708 ops/sec
$ perf bench syscall all
# Running syscall/basic benchmark...
# Executed 10000000 getppid() calls
Total time: 3.644 [sec]

0.364456 usecs/op
2743815 ops/sec

$

Signed-off-by: Davidlohr Bueso
Acked-by: Josh Poimboeuf
Acked-by: Mel Gorman
Tested-by: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Namhyung Kim
Link: http://lore.kernel.org/lkml/20190308181747.l36zqz2avtivrr3c@linux-r8p5
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2020-07-28 19:50:48 +0800

28 May, 2020

1 commit

ba35fe935 tools feature: Rename HAVE_EVENTFD to HAVE_EVENTFD_SUPPORT ... Browse Code »

To be consistent with other such auto-detected features.

Cc: Adrian Hunter
Cc: Alexander Shishkin
Cc: Anand K Mistry
Cc: Jiri Olsa
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: Peter Zijlstra
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2020-05-28 21:03:26 +0800

06 May, 2020

1 commit

51876bd45 perf bench: Add kallsyms parsing ... Browse Code »

Add a benchmark for kallsyms parsing. Example output:

Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 103.971 ms (+- 0.121 ms)

Committer testing:

Test Machine: AMD Ryzen 5 3600X 6-Core Processor

[root@five ~]# perf bench internals kallsyms-parse
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 79.692 ms (+- 0.101 ms)
[root@five ~]# perf stat -r5 perf bench internals kallsyms-parse
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 80.563 ms (+- 0.079 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 81.046 ms (+- 0.155 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 80.874 ms (+- 0.104 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 81.173 ms (+- 0.133 ms)
# Running 'internals/kallsyms-parse' benchmark:
Average kallsyms__parse took: 81.169 ms (+- 0.074 ms)

Performance counter stats for 'perf bench internals kallsyms-parse' (5 runs):

8,093.54 msec task-clock # 0.999 CPUs utilized ( +- 0.14% )
3,165 context-switches # 0.391 K/sec ( +- 0.18% )
10 cpu-migrations # 0.001 K/sec ( +- 23.13% )
744 page-faults # 0.092 K/sec ( +- 0.21% )
34,551,564,954 cycles # 4.269 GHz ( +- 0.05% ) (83.33%)
1,160,584,308 stalled-cycles-frontend # 3.36% frontend cycles idle ( +- 1.60% ) (83.33%)
14,974,323,985 stalled-cycles-backend # 43.34% backend cycles idle ( +- 0.24% ) (83.33%)
58,712,905,705 instructions # 1.70 insn per cycle
# 0.26 stalled cycles per insn ( +- 0.01% ) (83.34%)
14,136,433,778 branches # 1746.632 M/sec ( +- 0.01% ) (83.33%)
141,943,217 branch-misses # 1.00% of all branches ( +- 0.04% ) (83.33%)

8.1040 +- 0.0115 seconds time elapsed ( +- 0.14% )

[root@five ~]#

Signed-off-by: Ian Rogers
Tested-by: Arnaldo Carvalho de Melo
Cc: Alexander Shishkin
Cc: Jiri Olsa
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Link: http://lore.kernel.org/lkml/20200501221315.54715-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo

Ian Rogers
2020-05-06 03:35:32 +0800

16 Apr, 2020

1 commit

2a4b51666 perf bench: Add event synthesis benchmark ... Browse Code »

Event synthesis may occur at the start or end (tail) of a perf command.
In system-wide mode it can scan every process in /proc, which may add
seconds of latency before event recording. Add a new benchmark that
times how long event synthesis takes with and without data synthesis.

An example execution looks like:

$ perf bench internals synthesize
# Running 'internals/synthesize' benchmark:
Average synthesis took: 168.253800 usec
Average data synthesis took: 208.104700 usec

Signed-off-by: Ian Rogers
Acked-by: Jiri Olsa
Tested-by: Arnaldo Carvalho de Melo
Cc: Alexander Shishkin
Cc: Andrey Zhizhikin
Cc: Kan Liang
Cc: Kefeng Wang
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Petr Mladek
Cc: Stephane Eranian
Cc: Thomas Gleixner
Link: http://lore.kernel.org/lkml/20200402154357.107873-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo

Ian Rogers
2020-04-16 23:19:12 +0800

30 Aug, 2019

1 commit

0ac25fd0a perf tools: Remove perf.h from source files not needing it ... Browse Code »

With the movement of lots of stuff out of perf.h to other headers we
ended up not needing it in lots of places, remove it from those places.

Cc: Adrian Hunter
Cc: Jiri Olsa
Cc: Namhyung Kim
Link: https://lkml.kernel.org/n/tip-c718m0sxxwp73lp9d8vpihb4@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2019-08-30 04:38:32 +0800

09 Jul, 2019

1 commit

7f7c536f2 tools lib: Adopt zalloc()/zfree() from tools/perf ... Browse Code »

Eroding a bit more the tools/perf/util/util.h hodpodge header.

Cc: Adrian Hunter
Cc: Jiri Olsa
Cc: Namhyung Kim
Link: https://lkml.kernel.org/n/tip-natazosyn9rwjka25tvcnyi0@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2019-07-09 21:13:26 +0800

22 Nov, 2018

2 commits

231457ec7 perf bench: Add epoll_ctl(2) benchmark ... Browse Code »

Benchmark the various operations allowed for epoll_ctl(2). The idea is
to concurrently stress a single epoll instance doing add/mod/del
operations.

Committer testing:

# perf bench epoll ctl
# Running 'epoll/ctl' benchmark:
Run summary [PID 20344]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs.

[thread 0] fdmap: 0x21a46b0 ... 0x21a47ac [ add: 1680960 ops; mod: 1680960 ops; del: 1680960 ops ]
[thread 1] fdmap: 0x21a4960 ... 0x21a4a5c [ add: 1685440 ops; mod: 1685440 ops; del: 1685440 ops ]
[thread 2] fdmap: 0x21a4c10 ... 0x21a4d0c [ add: 1674368 ops; mod: 1674368 ops; del: 1674368 ops ]
[thread 3] fdmap: 0x21a4ec0 ... 0x21a4fbc [ add: 1677568 ops; mod: 1677568 ops; del: 1677568 ops ]

Averaged 1679584 ADD operations (+- 0.14%)
Averaged 1679584 MOD operations (+- 0.14%)
Averaged 1679584 DEL operations (+- 0.14%)
#

Lets measure those calls with 'perf trace' to get a glympse at what this
benchmark is doing in terms of syscalls:

# perf trace -m32768 -s perf bench epoll ctl
# Running 'epoll/ctl' benchmark:
Run summary [PID 20405]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs.

[thread 0] fdmap: 0x21764e0 ... 0x21765dc [ add: 1100480 ops; mod: 1100480 ops; del: 1100480 ops ]
[thread 1] fdmap: 0x2176790 ... 0x217688c [ add: 1250176 ops; mod: 1250176 ops; del: 1250176 ops ]
[thread 2] fdmap: 0x2176a40 ... 0x2176b3c [ add: 1022464 ops; mod: 1022464 ops; del: 1022464 ops ]
[thread 3] fdmap: 0x2176cf0 ... 0x2176dec [ add: 705472 ops; mod: 705472 ops; del: 705472 ops ]

Averaged 1019648 ADD operations (+- 11.27%)
Averaged 1019648 MOD operations (+- 11.27%)
Averaged 1019648 DEL operations (+- 11.27%)

Summary of events:

epoll-ctl (20405), 1264 events, 0.0%

syscall calls total

---------------
eventfd2
clone
mprotect
openat
mmap
futex
sched_setaffinity
read
fstat
close
stat
access
open
getdents
write
munmap
brk
rt_sigprocmask
rt_sigaction
prlimit64
prctl
epoll_create
lseek
sched_getaffinity
arch_prctl
set_tid_address
getpid
set_robust_list
execve min avg max stddev (msec) (msec) (msec) (msec) (%) -------- --------- --------- --------- --------- ------ 256 9.514 0.001 0.037 5.243 68.00% 4 1.245 0.204 0.311 0.531 24.13% 66 0.345 0.002 0.005 0.021 7.43% 45 0.313 0.004 0.007 0.073 21.93% 88 0.302 0.002 0.003 0.013 5.02% 4 0.160 0.002 0.040 0.140 83.43% 4 0.124 0.005 0.031 0.070 49.39% 44 0.103 0.001 0.002 0.013 15.54% 40 0.052 0.001 0.001 0.003 5.43% 39 0.039 0.001 0.001 0.001 1.48% 9 0.034 0.003 0.004 0.006 7.30% 3 0.023 0.007 0.008 0.008 4.25% 2 0.021 0.008 0.011 0.013 22.60% 4 0.019 0.001 0.005 0.009 37.15% 2 0.013 0.004 0.007 0.009 38.48% 1 0.010 0.010 0.010 0.010 0.00% 3 0.006 0.001 0.002 0.003 26.34% 2 0.004 0.001 0.002 0.003 43.95% 3 0.004 0.001 0.001 0.002 16.07% 3 0.004 0.001 0.001 0.001 5.39% 1 0.003 0.003 0.003 0.003 0.00% 1 0.003 0.003 0.003 0.003 0.00% 2 0.002 0.001 0.001 0.001 11.42% 1 0.002 0.002 0.002 0.002 0.00% 1 0.002 0.002 0.002 0.002 0.00% 1 0.001 0.001 0.001 0.001 0.00% 1 0.001 0.001 0.001 0.001 0.00% 1 0.001 0.001 0.001 0.001 0.00% 1 0.000 0.000 0.000 0.000 0.00%

epoll-ctl (20406), 1245480 events, 14.6%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 619511 1034.927 0.001 0.002 6.691 0.67%
nanosleep 3226 616.114 0.006 0.191 10.376 7.57%
futex 2 11.336 0.002 5.668 11.334 99.97%
set_robust_list 1 0.001 0.001 0.001 0.001 0.00%
clone 1 0.000 0.000 0.000 0.000 0.00%

epoll-ctl (20407), 1243151 events, 14.5%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 618350 1042.181 0.001 0.002 2.512 0.40%
nanosleep 3220 366.261 0.012 0.114 18.162 9.59%
futex 4 5.463 0.001 1.366 5.427 99.12%
set_robust_list 1 0.002 0.002 0.002 0.002 0.00%

epoll-ctl (20408), 1801690 events, 21.1%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 896174 1540.581 0.001 0.002 6.987 0.74%
nanosleep 4667 783.393 0.006 0.168 10.419 7.10%
futex 2 4.682 0.002 2.341 4.681 99.93%
set_robust_list 1 0.002 0.002 0.002 0.002 0.00%
clone 1 0.000 0.000 0.000 0.000 0.00%

epoll-ctl (20409), 4254890 events, 49.8%

syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
epoll_ctl 2116416 3768.097 0.001 0.002 9.956 0.41%
nanosleep 11023 1141.778 0.006 0.104 9.447 4.95%
futex 3 0.037 0.002 0.012 0.029 70.50%
set_robust_list 1 0.008 0.008 0.008 0.008 0.00%
madvise 1 0.005 0.005 0.005 0.005 0.00%
clone 1 0.000 0.000 0.000 0.000 0.00%
#

Committer notes:

Fix build on fedora:24-x-ARC-uClibc, debian:experimental-x-mips,
debian:experimental-x-mipsel, ubuntu:16.04-x-arm and ubuntu:16.04-x-powerpc

CC /tmp/build/perf/bench/epoll-ctl.o
bench/epoll-ctl.c: In function 'init_fdmaps':
bench/epoll-ctl.c:214:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (i = 0; i < nfds; i+=inc) {
^
bench/epoll-ctl.c: In function 'bench_epoll_ctl':
bench/epoll-ctl.c:377:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (i = 0; i < nthreads; i++) {
^
bench/epoll-ctl.c:388:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
for (i = 0; i < nthreads; i++) {
^
cc1: all warnings being treated as errors

Signed-off-by: Davidlohr Bueso
Tested-by: Arnaldo Carvalho de Melo
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Link: http://lkml.kernel.org/r/20181106152226.20883-3-dave@stgolabs.net
[ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ]
[ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ]
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2018-11-22 09:39:55 +0800
121dd9ea0 perf bench: Add epoll parallel epoll_wait benchmark ... Browse Code »

This program benchmarks concurrent epoll_wait(2) for file descriptors
that are monitored with with EPOLLIN along various semantics, by a
single epoll instance. Such conditions can be found when using
single/combined or multiple queuing when load balancing.

Each thread has a number of private, nonblocking file descriptors,
referred to as fdmap. A writer thread will constantly be writing to the
fdmaps of all threads, minimizing each threads's chances of epoll_wait
not finding any ready read events and blocking as this is not what we
want to stress. Full details in the start of the C file.

Committer testing:

# perf bench
Usage:
perf bench [] []

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
all: All benchmarks

# perf bench epoll

# List of available benchmarks for collection 'epoll':

wait: Benchmark epoll concurrent epoll_waits
all: Run all futex benchmarks

# perf bench epoll wait
# Running 'epoll/wait' benchmark:
Run summary [PID 19295]: 3 threads monitoring on 64 file-descriptors for 8 secs.

[thread 0] fdmap: 0xdaa650 ... 0xdaa74c [ 328241 ops/sec ]
[thread 1] fdmap: 0xdaa900 ... 0xdaa9fc [ 351695 ops/sec ]
[thread 2] fdmap: 0xdaabb0 ... 0xdaacac [ 381423 ops/sec ]

Averaged 353786 operations/sec (+- 4.35%), total secs = 8
#

Committer notes:

Fix the build on debian:experimental-x-mips, debian:experimental-x-mipsel
and others:

CC /tmp/build/perf/bench/epoll-wait.o
bench/epoll-wait.c: In function 'writerfn':
bench/epoll-wait.c:399:12: error: format '%ld' expects argument of type 'long int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
printinfo("exiting writer-thread (total full-loops: %ld)\n", iter);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~
bench/epoll-wait.c:86:31: note: in definition of macro 'printinfo'
do { if (__verbose) { printf(fmt, ## arg); fflush(stdout); } } while (0)
^~~
cc1: all warnings being treated as errors

Signed-off-by: Davidlohr Bueso
Tested-by: Arnaldo Carvalho de Melo
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Jason Baron
Link: http://lkml.kernel.org/r/20181106152226.20883-2-dave@stgolabs.net
Link: http://lkml.kernel.org/r/20181106182349.thdkpvshkna5vd7o@linux-r8p5>
[ Applied above fixup as per Davidlohr's request ]
[ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ]
[ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ]
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2018-11-22 09:38:47 +0800

02 Nov, 2017

1 commit

b24413180 License cleanup: add SPDX GPL-2.0 license identifier to files with no license ... Browse Code »

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2017-11-02 18:10:55 +0800

27 Mar, 2017

1 commit

b0ad8ea66 perf tools: Remove unused 'prefix' from builtin functions ... Browse Code »

We got it from the git sources but never used it for anything, with the
place where this would be somehow used remaining:

static int run_builtin(struct cmd_struct *p, int argc, const char **argv)
{
prefix = NULL;
if (p->option & RUN_SETUP)
prefix = NULL; /* setup_perf_directory(); */

Ditch it.

Cc: Adrian Hunter
Cc: David Ahern
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Wang Nan
Link: http://lkml.kernel.org/n/tip-uw5swz05vol0qpr32c5lpvus@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2017-03-27 22:58:09 +0800

18 Dec, 2015

1 commit

4b6ab94ea perf subcmd: Create subcmd library ... Browse Code »

Move the subcommand-related files from perf to a new library named
libsubcmd.a.

Since we're moving files anyway, go ahead and rename 'exec_cmd.*' to
'exec-cmd.*' to be consistent with the naming of all the other files.

Signed-off-by: Josh Poimboeuf
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/c0a838d4c878ab17fee50998811612b2281355c1.1450193761.git.jpoimboe@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo

Josh Poimboeuf
2015-12-18 01:27:14 +0800

20 Oct, 2015

3 commits

aa254af25 perf bench: Run benchmarks, don't test them ... Browse Code »

So right now we output this text:

memcpy: Benchmark for memcpy() functions
memset: Benchmark for memset() functions
all: Test all memory access benchmarks

But the right verb to use with benchmarks is to 'run' them, not 'test'
them.

So change this (and all similar texts) to:

memcpy: Benchmark for memcpy() functions
memset: Benchmark for memset() functions
all: Run all memory access benchmarks

Signed-off-by: Ingo Molnar
Cc: David Ahern
Cc: Hitoshi Mitake
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1445241870-24854-15-git-send-email-mingo@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Ingo Molnar
2015-10-20 03:10:25 +0800
13b1fdce8 perf bench mem: Improve user visible strings ... Browse Code »

- fix various typos in user visible output strings
- make the output consistent (wrt. capitalization and spelling)
- offer the list of routines to benchmark on '-r help'.

Signed-off-by: Ingo Molnar
Cc: David Ahern
Cc: Hitoshi Mitake
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1445241870-24854-11-git-send-email-mingo@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Ingo Molnar
2015-10-20 03:07:18 +0800
7a46a8fd1 perf bench: List output formatting options on 'perf bench -h' ... Browse Code »

So 'perf bench -h' is not very helpful when printing the help line
about the output formatting options:

-f, --format
Specify format style

There are two output format styles, 'default' and 'simple', so improve
the help text to:

-f, --format
Specify the output formatting style

Signed-off-by: Ingo Molnar
Cc: David Ahern
Cc: Hitoshi Mitake
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1445241870-24854-7-git-send-email-mingo@kernel.org
[ Removed leftovers from the mem-functions.c rename ]
Signed-off-by: Arnaldo Carvalho de Melo

Ingo Molnar
2015-10-20 03:03:53 +0800

21 Jul, 2015

1 commit

d2f3f5d2e perf bench futex: Add lock_pi stresser ... Browse Code »

Allows a way of measuring low level kernel implementation of FUTEX_LOCK_PI and
FUTEX_UNLOCK_PI.

The program comes in two flavors:

(i) single futex (default), all threads contend on the same uaddr. For the
sake of the benchmark, we call into kernel space even when the lock is
uncontended. The kernel will set it to TID, any waters that come in and
contend for the pi futex will be handled respectively by the kernel.

(ii) -M option for multiple futexes, each thread deals with its own futex. This
is a trivial scenario and only measures kernel handling of 0->TID transition.

Signed-off-by: Davidlohr Bueso
Cc: Mel Gorman
Link: http://lkml.kernel.org/r/1436259353.12255.78.camel@stgolabs.net
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2015-07-21 04:49:51 +0800

09 May, 2015

1 commit

d65817b4e perf bench futex: Support parallel waker threads ... Browse Code »

The futex-wake benchmark only measures wakeups done within a single
process. While this has value in its own, it does not really generate
any hb->lock contention.

A new benchmark 'wake-parallel' is added, by extending the futex-wake
code such that we can measure parallel waker threads. The program output
shows the avg per-thread latency in order to complete its share of
wakeups:

Run summary [PID 13474]: blocking on 512 threads (at [private] futex 0xa88668), 8 threads waking up 64 at a time.

[Run 1]: Avg per-thread latency (waking 64/512 threads) in 0.6230 ms (+-15.31%)
[Run 2]: Avg per-thread latency (waking 64/512 threads) in 0.5175 ms (+-29.95%)
[Run 3]: Avg per-thread latency (waking 64/512 threads) in 0.7578 ms (+-18.03%)
[Run 4]: Avg per-thread latency (waking 64/512 threads) in 0.8944 ms (+-12.54%)
[Run 5]: Avg per-thread latency (waking 64/512 threads) in 1.1204 ms (+-23.85%)
Avg per-thread latency (waking 64/512 threads) in 0.7826 ms (+-9.91%)

Naturally, different combinations of numbers of blocking and waker
threads will exhibit different information.

Signed-off-by: Davidlohr Bueso
Tested-by: Arnaldo Carvalho de Melo
Cc: Davidlohr Bueso
Link: http://lkml.kernel.org/r/1431110280-20231-1-git-send-email-dave@stgolabs.net
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2015-05-09 03:23:50 +0800

20 Jun, 2014

1 commit

b6f0629a9 perf bench: Add --repeat option ... Browse Code »

There are a number of benchmarks that do single runs and as a result
does not really help users gain a general idea of how the workload
performs. So the user must either manually do multiple runs or just use
single bogus results.

This option will enable users to specify the amount of runs (arbitrarily
defaulted to 10, to use the existing benchmarks default) through the
'--repeat' option. Add it to perf-bench instead of implementing it
always in each specific benchmark.

Signed-off-by: Davidlohr Bueso
Cc: Aswin Chandramouleeswaran
Cc: Hitoshi Mitake
Cc: Jiri Olsa
Link: http://lkml.kernel.org/r/1402942467-10671-2-git-send-email-davidlohr@hp.com
[ Kept the existing default of 10, changing it to something else should
be done on separate patch ]
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2014-06-20 03:13:15 +0800

01 Apr, 2014

1 commit

8c292f117 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf changes from Ingo Molnar:
"Main changes:

Kernel side changes:

- Add SNB/IVB/HSW client uncore memory controller support (Stephane
Eranian)

- Fix various x86/P4 PMU driver bugs (Don Zickus)

Tooling, user visible changes:

- Add several futex 'perf bench' microbenchmarks (Davidlohr Bueso)

- Speed up thread map generation (Don Zickus)

- Introduce 'perf kvm --list-cmds' command line option for use by
scripts (Ramkumar Ramachandra)

- Print the evsel name in the annotate stdio output, prep to fix
support outputting annotation for multiple events, not just for the
first one (Arnaldo Carvalho de Melo)

- Allow setting preferred callchain method in .perfconfig (Jiri Olsa)

- Show in what binaries/modules 'perf probe's are set (Masami
Hiramatsu)

- Support distro-style debuginfo for uprobe in 'perf probe' (Masami
Hiramatsu)

Tooling, internal changes and fixes:

- Use tid in mmap/mmap2 events to find maps (Don Zickus)

- Record the reason for filtering an address_location (Namhyung Kim)

- Apply all filters to an addr_location (Namhyung Kim)

- Merge al->filtered with hist_entry->filtered in report/hists
(Namhyung Kim)

- Fix memory leak when synthesizing thread records (Namhyung Kim)

- Use ui__has_annotation() in 'report' (Namhyung Kim)

- hists browser refactorings to reuse code accross UIs (Namhyung Kim)

- Add support for the new DWARF unwinder library in elfutils (Jiri
Olsa)

- Fix build race in the generation of bison files (Jiri Olsa)

- Further streamline the feature detection display, trimming it a bit
to show just the libraries detected, using VF=1 gets a more verbose
output, showing the less interesting feature checks as well (Jiri
Olsa).

- Check compatible symtab type before loading dso (Namhyung Kim)

- Check return value of filename__read_debuglink() (Stephane Eranian)

- Move some hashing and fs related code from tools/perf/util/ to
tools/lib/ so that it can be used by more tools/ living utilities
(Borislav Petkov)

- Prepare DWARF unwinding code for using an elfutils alternative
unwinding library (Jiri Olsa)

- Fix DWARF unwind max_stack processing (Jiri Olsa)

- Add dwarf unwind 'perf test' entry (Jiri Olsa)

- 'perf probe' improvements including memory leak fixes, sharing the
intlist class with other tools, uprobes/kprobes code sharing and
use of ref_reloc_sym (Masami Hiramatsu)

- Shorten sample symbol resolving by adding cpumode to struct
addr_location (Arnaldo Carvalho de Melo)

- Fix synthesizing mmaps for threads (Don Zickus)

- Fix invalid output on event group stdio report (Namhyung Kim)

- Fixup header alignment in 'perf sched latency' output (Ramkumar
Ramachandra)

- Fix off-by-one error in 'perf timechart record' argv handling
(Ramkumar Ramachandra)

Tooling, cleanups:

- Remove unused thread__find_map function (Jiri Olsa)

- Remove unused simple_strtoul() function (Ramkumar Ramachandra)

Tooling, documentation updates:

- Update function names in debug messages (Ramkumar Ramachandra)

- Update some code references in design.txt (Ramkumar Ramachandra)

- Clarify load-latency information in the 'perf mem' docs (Andi
Kleen)

- Clarify x86 register naming in 'perf probe' docs (Andi Kleen)"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (96 commits)
perf tools: Remove unused simple_strtoul() function
perf tools: Update some code references in design.txt
perf evsel: Update function names in debug messages
perf tools: Remove thread__find_map function
perf annotate: Print the evsel name in the stdio output
perf report: Use ui__has_annotation()
perf tools: Fix memory leak when synthesizing thread records
perf tools: Use tid in mmap/mmap2 events to find maps
perf report: Merge al->filtered with hist_entry->filtered
perf symbols: Apply all filters to an addr_location
perf symbols: Record the reason for filtering an address_location
perf sched: Fixup header alignment in 'latency' output
perf timechart: Fix off-by-one error in 'record' argv handling
perf machine: Factor machine__find_thread to take tid argument
perf tools: Speed up thread map generation
perf kvm: introduce --list-cmds for use by scripts
perf ui hists: Pass evsel to hpp->header/width functions explicitly
perf symbols: Introduce thread__find_cpumode_addr_location
perf session: Change header.misc dump from decimal to hex
perf ui/tui: Reuse generic __hpp__fmt() code
...

Linus Torvalds
2014-04-01 02:13:25 +0800

15 Mar, 2014

1 commit

6eeefccdc perf bench: Fix NULL pointer dereference in "perf bench all" ... Browse Code »

The for_each_bench() macro must check that the "benchmarks" field of a
collection is not NULL before dereferencing it because the "all"
collection in particular has a NULL "benchmarks" field (signifying that
it has no benchmarks to iterate over).

This fixes this NULL pointer dereference when running "perf bench all":

[root@ssdandy ~]# perf bench all

# Running mem/memset benchmark...
# Copying 1MB Bytes ...

2.453675 GB/Sec
12.056327 GB/Sec (with prefault)

Segmentation fault (core dumped)
[root@ssdandy ~]#

Signed-off-by: Patrick Palka
Cc: Ingo Molnar
Cc: Paul Mackerras
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1394664051-6037-1-git-send-email-patrick@parcs.ath.cx
Signed-off-by: Arnaldo Carvalho de Melo

Patrick Palka
2014-03-15 00:45:54 +0800

14 Mar, 2014

3 commits

0fb298cf9 perf bench: Add futex-requeue microbenchmark ... Browse Code »

Block a bunch of threads on a futex and requeue them on another, N at a
time.

This program is particularly useful to measure the latency of nthread
requeues without waking up any tasks -- thus mimicking a regular
futex_wait.

An example run:

$ perf bench futex requeue -r 100 -t 64
Run summary [PID 151011]: Requeuing 64 threads (from 0x7d15c4 to 0x7d15c8), 1 at a time.

[Run 1]: Requeued 64 of 64 threads in 0.0400 ms
[Run 2]: Requeued 64 of 64 threads in 0.0390 ms
[Run 3]: Requeued 64 of 64 threads in 0.0400 ms
...
[Run 100]: Requeued 64 of 64 threads in 0.0390 ms
Requeued 64 of 64 threads in 0.0399 ms (+-0.37%)

Signed-off-by: Davidlohr Bueso
Acked-by: Darren Hart
Cc: Aswin Chandramouleeswaran
Cc: Darren Hart
Cc: Ingo Molnar
Cc: Jason Low
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Cc: Waiman Long
Link: http://lkml.kernel.org/r/1387081917-9102-4-git-send-email-davidlohr@hp.com
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2014-03-14 22:20:44 +0800
27db78307 perf bench: Add futex-wake microbenchmark ... Browse Code »

Block a bunch of threads on a futex and wake them up, N at a time.

This program is particularly useful to measure the latency of nthread
wakeups in non-error situations: all waiters are queued and all wake
calls wakeup one or more tasks.

An example run:

$ perf bench futex wake -t 512 -r 100
Run summary [PID 27823]: blocking on 512 threads (at futex 0x7e10d4), waking up 1 at a time.

[Run 1]: Wokeup 512 of 512 threads in 6.0080 ms
[Run 2]: Wokeup 512 of 512 threads in 5.2280 ms
[Run 3]: Wokeup 512 of 512 threads in 4.8300 ms
...
[Run 100]: Wokeup 512 of 512 threads in 5.0100 ms
Wokeup 512 of 512 threads in 5.0109 ms (+-2.25%)

Signed-off-by: Davidlohr Bueso
Acked-by: Darren Hart
Cc: Aswin Chandramouleeswaran
Cc: Darren Hart
Cc: Ingo Molnar
Cc: Jason Low
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Cc: Waiman Long
Link: http://lkml.kernel.org/r/1387081917-9102-3-git-send-email-davidlohr@hp.com
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2014-03-14 22:20:43 +0800
a04397114 perf bench: Add futex-hash microbenchmark ... Browse Code »

Introduce futexes to perf-bench and add a program that stresses and
measures the kernel's implementation of the hash table.

This is a multi-threaded program that simply measures the amount of
failed futex wait calls - we only want to deal with the hashing
overhead, so a negative return of futex_wait_setup() is enough to do the
trick.

An example run:

$ perf bench futex hash -t 32
Run summary [PID 10989]: 32 threads, each operating on 1024 [private] futexes for 10 secs.

[thread 0] futexes: 0x19d9b10 ... 0x19dab0c [ 418713 ops/sec ]
[thread 1] futexes: 0x19daca0 ... 0x19dbc9c [ 469913 ops/sec ]
[thread 2] futexes: 0x19dbe30 ... 0x19dce2c [ 479744 ops/sec ]
...
[thread 31] futexes: 0x19fbb80 ... 0x19fcb7c [ 464179 ops/sec ]

Averaged 454310 operations/sec (+- 0.84%), total secs = 10

Signed-off-by: Davidlohr Bueso
Acked-by: Darren Hart
Cc: Aswin Chandramouleeswaran
Cc: Darren Hart
Cc: Ingo Molnar
Cc: Jason Low
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Cc: Waiman Long
Link: http://lkml.kernel.org/r/1387081917-9102-2-git-send-email-davidlohr@hp.com
Signed-off-by: Arnaldo Carvalho de Melo

Davidlohr Bueso
2014-03-14 22:20:43 +0800

23 Oct, 2013

1 commit

4157922a9 perf bench: Change the procps visible command-name of invididual benchmark tests plus cleanups ... Browse Code »

Before this patch, looking at 'perf bench sched pipe' behavior over
'top' only told us that something related to perf is running:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19934 mingo 20 0 54836 1296 952 R 18.6 0.0 0:00.56 perf
19935 mingo 20 0 54836 384 36 S 18.6 0.0 0:00.56 perf

After the patch it's clearly visible what's going on:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19744 mingo 20 0 125m 3536 2644 R 68.2 0.0 0:01.12 sched-pipe
19745 mingo 20 0 125m 1172 276 R 68.2 0.0 0:01.12 sched-pipe

The benchmark-subsystem name is concatenated with the individual
testcase name.

Unfortunately 'perf top' does not show the reconfigured name, possibly
because it caches ->comm[] values and does not recognize changes to
them?

Also clean up a few bits in builtin-bench.c while at it and reorganize
the code and the output strings to be consistent.

Use iterators to access the various arrays. Rename 'suites' concept to
'benchmark collection' and the 'bench_suite' to 'benchmark/bench'. The
many repetitions of 'suite' made the code harder to read and understand.

The new output is:

comet:~/tip/tools/perf> ./perf bench
Usage:
perf bench [] []

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
all: All benchmarks

comet:~/tip/tools/perf> ./perf bench sched

# List of available benchmarks for collection 'sched':

messaging: Benchmark for scheduling and IPC
pipe: Benchmark for pipe() between two processes
all: Test all scheduler benchmarks

comet:~/tip/tools/perf> ./perf bench mem

# List of available benchmarks for collection 'mem':

memcpy: Benchmark for memcpy()
memset: Benchmark for memset() tests
all: Test all memory benchmarks

comet:~/tip/tools/perf> ./perf bench numa

# List of available benchmarks for collection 'numa':

mem: Benchmark for NUMA workloads
all: Test all NUMA benchmarks

Individual benchmark modules were not touched.

Signed-off-by: Ingo Molnar
Cc: David Ahern
Cc: Hitoshi Mitake
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/20131023123756.GA17871@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo

Ingo Molnar
2013-10-23 20:57:34 +0800

09 Oct, 2013

1 commit

89fe808ae tools/perf: Standardize feature support define names to: HAVE_{FEATURE}_SUPPORT ... Browse Code »

Standardize all the feature flags based on the HAVE_{FEATURE}_SUPPORT naming convention:

HAVE_ARCH_X86_64_SUPPORT
HAVE_BACKTRACE_SUPPORT
HAVE_CPLUS_DEMANGLE_SUPPORT
HAVE_DWARF_SUPPORT
HAVE_ELF_GETPHDRNUM_SUPPORT
HAVE_GTK2_SUPPORT
HAVE_GTK_INFO_BAR_SUPPORT
HAVE_LIBAUDIT_SUPPORT
HAVE_LIBELF_MMAP_SUPPORT
HAVE_LIBELF_SUPPORT
HAVE_LIBNUMA_SUPPORT
HAVE_LIBUNWIND_SUPPORT
HAVE_ON_EXIT_SUPPORT
HAVE_PERF_REGS_SUPPORT
HAVE_SLANG_SUPPORT
HAVE_STRLCPY_SUPPORT

Cc: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra
Cc: Namhyung Kim
Cc: David Ahern
Cc: Jiri Olsa
Link: http://lkml.kernel.org/n/tip-u3zvqejddfZhtrbYbfhi3spa@git.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2013-10-09 14:48:28 +0800

30 Jan, 2013

2 commits

79d824e31 perf tools: Make numa benchmark optional ... Browse Code »

Commit "perf: Add 'perf bench numa mem'..." added a NUMA performance
benchmark to perf. Make this optional and test for required
dependencies.

Signed-off-by: Peter Hurley
Acked-by: Ingo Molnar
Cc: Ingo Molnar
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1359337882-21821-1-git-send-email-peter@hurleysoftware.com
Signed-off-by: Arnaldo Carvalho de Melo

Peter Hurley
2013-01-30 21:36:21 +0800
1c13f3c90 perf: Add 'perf bench numa mem' NUMA performance measurement suite ... Browse Code »

Add a suite of NUMA performance benchmarks.

The goal was simulate the behavior and access patterns of real NUMA
workloads, via a wide range of parameters, so this tool goes well
beyond simple bzero() measurements that most NUMA micro-benchmarks use:

- It processes the data and creates a chain of data dependencies,
like a real workload would. Neither the compiler, nor the
kernel (via KSM and other optimizations) nor the CPU can
eliminate parts of the workload.

- It randomizes the initial state and also randomizes the target
addresses of the processing - it's not a simple forward scan
of addresses.

- It provides flexible options to set process, thread and memory
relationship information: -G sets "global" memory shared between
all test processes, -P sets "process" memory shared by all
threads of a process and -T sets "thread" private memory.

- There's a NUMA convergence monitoring and convergence latency
measurement option via -c and -m.

- Micro-sleeps and synchronization can be injected to provoke lock
contention and scheduling, via the -u and -S options. This simulates
IO and contention.

- The -x option instructs the workload to 'perturb' itself artificially
every N seconds, by moving to the first and last CPU of the system
periodically. This way the stability of convergence equilibrium and
the number of steps taken for the scheduler to reach equilibrium again
can be measured.

- The amount of work can be specified via the -l loop count, and/or
via a -s seconds-timeout value.

- CPU and node memory binding options, to test hard binding scenarios.
THP can be turned on and off via madvise() calls.

- Live reporting of convergence progress in an 'at glance' output format.
Printing of convergence and deconvergence events.

The 'perf bench numa mem -a' option will start an array of about 30
individual tests that will each output such measurements:

# Running 5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
5x5-bw-thread, 20.276, secs, runtime-max/thread
5x5-bw-thread, 20.004, secs, runtime-min/thread
5x5-bw-thread, 20.155, secs, runtime-avg/thread
5x5-bw-thread, 0.671, %, spread-runtime/thread
5x5-bw-thread, 21.153, GB, data/thread
5x5-bw-thread, 528.818, GB, data-total
5x5-bw-thread, 0.959, nsecs, runtime/byte/thread
5x5-bw-thread, 1.043, GB/sec, thread-speed
5x5-bw-thread, 26.081, GB/sec, total-speed

See the help text and the code for more details.

Cc: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Cc: Mike Galbraith
Cc: Steven Rostedt
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Hugh Dickins
Signed-off-by: Ingo Molnar

Ingo Molnar
2013-01-30 21:35:36 +0800

25 Jan, 2013

1 commit

9b494ea2f perf bench: Flush stdout before starting bench suite ... Browse Code »

perf bench prints header message for bench suite before starting the
benchmark. However if the stdout is redirected to a file and bench
suite forks child processes this (and possibly other debugging
messages too) will be repeated multiple times.

$ perf bench sched messaging
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

Total time: 0.100 [sec]

$ perf bench sched messaging > result.txt
$ wc -l result.txt
391

In this file, there were so many "Running sched/messaging benchmark..."
lines. This was because stdout is converted to fully-buffered due to
the redirection and inherited child processes. Other lines are printed
after reaping all those tasks.

So fix it by flushing stdout before starting bench suites.

Signed-off-by: Namhyung Kim
Acked-by: Hitoshi Mitake
Cc: Hitoshi Mitake
Cc: Ingo Molnar
Cc: Paul Mackerras
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1357637966-8216-1-git-send-email-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Namhyung Kim
2013-01-25 03:40:19 +0800

11 Sep, 2012

1 commit

1d037ca16 perf tools: Use __maybe_used for unused variables ... Browse Code »

perf defines both __used and __unused variables to use for marking
unused variables. The variable __used is defined to
__attribute__((__unused__)), which contradicts the kernel definition to
__attribute__((__used__)) for new gcc versions. On Android, __used is
also defined in system headers and this leads to warnings like: warning:
'__used__' attribute ignored

__unused is not defined in the kernel and is not a standard definition.
If __unused is included everywhere instead of __used, this leads to
conflicts with glibc headers, since glibc has a variables with this name
in its headers.

The best approach is to use __maybe_unused, the definition used in the
kernel for __attribute__((unused)). In this way there is only one
definition in perf sources (instead of 2 definitions that point to the
same thing: __used and __unused) and it works on both Linux and Android.
This patch simply replaces all instances of __used and __unused with
__maybe_unused.

Signed-off-by: Irina Tirdea
Acked-by: Pekka Enberg
Cc: David Ahern
Cc: Ingo Molnar
Cc: Namhyung Kim
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Steven Rostedt
Link: http://lkml.kernel.org/r/1347315303-29906-7-git-send-email-irina.tirdea@intel.com
[ committer note: fixed up conflict with a116e05 in builtin-sched.c ]
Signed-off-by: Arnaldo Carvalho de Melo

Irina Tirdea
2012-09-11 23:19:15 +0800

28 Jun, 2012

1 commit

08942f6d5 perf bench: Documentation update ... Browse Code »

The current perf-bench documentation has a couple of typos and even
lacks entire description of mem subsystem. Fix it.

Reported-by: Ingo Molnar
Signed-off-by: Namhyung Kim
Acked-by: Hitoshi Mitake
Cc: Hitoshi Mitake
Cc: Ingo Molnar
Cc: Paul Mackerras
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1340172486-17805-1-git-send-email-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Namhyung Kim
2012-06-28 00:17:48 +0800

25 Jan, 2012

1 commit

be3de80dc perf bench: Also allow measuring memset() ... Browse Code »

This simply clones the respective memcpy() implementation.

Cc: Ingo Molnar
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Stephane Eranian
Link: http://lkml.kernel.org/r/4F16D743020000780006D735@nat28.tlf.novell.com
Signed-off-by: Jan Beulich
Signed-off-by: Arnaldo Carvalho de Melo

Jan Beulich
2012-01-25 06:25:32 +0800

18 May, 2010

1 commit

edb7c60e2 perf options: Type check all the remaining OPT_ variants ... Browse Code »

OPT_SET_INT was renamed to OPT_SET_UINT since the only use in these
tools is to set something that has an enum type, that is builtin
compatible with unsigned int.

Several string constifications were done to make OPT_STRING require a
const char * type.

Cc: Frédéric Weisbecker
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Tom Zanussi
LKML-Reference:
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2010-05-18 03:22:41 +0800

14 Dec, 2009

1 commit

2044279d1 perf bench: Add "all" pseudo subsystem and "all" pseudo suite ... Browse Code »

This patch adds a new "all" pseudo subsystem and an "all" pseudo
suite. These are for testing all subsystem and its all suite, or
all suite of one subsystem.

(This patch also contains a few trivial comment fixes for
bench/* and output style fixes. I judged that there are no
necessity to make them into individual patch.)

Example of use:

| % ./perf bench sched all # Test all suites of sched subsystem
| # Running sched/messaging benchmark...
| # 20 sender and receiver processes per group
| # 10 groups == 400 processes run
|
| Total time: 0.414 [sec]
|
| # Running sched/pipe benchmark...
| # Extecuted 1000000 pipe operations between two tasks
|
| Total time: 10.999 [sec]
|
| 10.999317 usecs/op
| 90914 ops/sec
|
| % ./perf bench all # Test all suites of all subsystems
| # Running sched/messaging benchmark...
| # 20 sender and receiver processes per group
| # 10 groups == 400 processes run
|
| Total time: 0.420 [sec]
|
| # Running sched/pipe benchmark...
| # Extecuted 1000000 pipe operations between two tasks
|
| Total time: 11.741 [sec]
|
| 11.741346 usecs/op
| 85169 ops/sec
|
| # Running mem/memcpy benchmark...
| # Copying 1MB Bytes from 0x7ff33e920010 to 0x7ff3401ae010 ...
|
| 808.407437 MB/Sec

Signed-off-by: Hitoshi Mitake
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Frederic Weisbecker
LKML-Reference:
Signed-off-by: Ingo Molnar

Hitoshi Mitake
2009-12-14 15:51:19 +0800

19 Nov, 2009

1 commit

827f3b497 perf bench: Add memcpy() benchmark ... Browse Code »

'perf bench mem memcpy' is a benchmark suite for measuring memcpy()
performance.

Example on a Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz:

| % perf bench mem memcpy -l 1GB
| # Running mem/memcpy benchmark...
| # Copying 1MB Bytes from 0xb7d98008 to 0xb7e99008 ...
|
| 726.216412 MB/Sec

Signed-off-by: Hitoshi Mitake
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Frederic Weisbecker
LKML-Reference:
[ v2: updated changelog, clarified history of builtin-bench.c ]
Signed-off-by: Ingo Molnar

Hitoshi Mitake
2009-11-19 13:21:48 +0800

11 Nov, 2009

1 commit

79e295d4b perf bench: Improve builtin-bench.c for more friendly output ... Browse Code »

This patch makes output of perf bench more friendly.
Current style of putput, keeping user wait
and printing everything suddenly when we finish,
may confuse users.

So I improved it:

| % perf bench sched messaging
| # Running sched/messaging benchmark...
Cc: Peter Zijlstra
Cc: Paul Mackerras
LKML-Reference:
Signed-off-by: Ingo Molnar

Hitoshi Mitake
2009-11-11 02:56:44 +0800

10 Nov, 2009

1 commit

386d7e9e5 perf bench: Modify builtin-bench.c for processing common options ... Browse Code »

This patch modifies builtin-bench.c for processing common
options. The first option added is "--format".
Users of perf bench will be able to specify output style by
--format.

Usage example:

% ./perf bench sched messaging # with no style specify
(20 sender and receiver processes per group)
(10 groups == 400 processes run)

Total time:1.431 sec

% ./perf bench --format=simple sched messaging # specified
simple 1.431

Signed-off-by: Hitoshi Mitake
Cc: Peter Zijlstra
Cc: Paul Mackerras
LKML-Reference:
Signed-off-by: Ingo Molnar

Hitoshi Mitake
2009-11-10 11:53:48 +0800

08 Nov, 2009

1 commit

629cc3566 perf bench: Add builtin-bench.c: General framework for benchmark suites ... Browse Code »

This patch adds builtin-bench.c
builtin-bench.c is a general framework for benchmark suites.

Signed-off-by: Hitoshi Mitake
Cc: Rusty Russell
Cc: Peter Zijlstra
Cc: Mike Galbraith
Cc: Arnaldo Carvalho de Melo
Cc: fweisbec@gmail.com
Cc: Jiri Kosina
LKML-Reference:
Signed-off-by: Ingo Molnar

Hitoshi Mitake
2009-11-08 17:19:18 +0800