19 May, 2010

1 commit

  • It is hard to read very large numbers, so provide an option to perf stat
    to separate thousands using a separator. The patch leverages the locale
    support of stdio. You need to set your LC_NUMERIC appropriately, for
    instance LC_NUMERIC=en_US.UTF8. You need to pass -B to activate this
    feature. This way existing scripts parsing the output do not need to be
    changed. Here is an example.

    $ perf stat noploop 2
    noploop for 2 seconds

    Performance counter stats for 'noploop 2':

    1998.347031 task-clock-msecs # 0.998 CPUs
    61 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    118 page-faults # 0.000 M/sec
    4,138,410,900 cycles # 2070.917 M/sec (scaled from 70.01%)
    2,062,650,268 instructions # 0.498 IPC (scaled from 70.01%)
    2,057,653,466 branches # 1029.678 M/sec (scaled from 70.01%)
    40,267 branch-misses # 0.002 % (scaled from 30.04%)
    2,055,961,348 cache-references # 1028.831 M/sec (scaled from 30.03%)
    53,725 cache-misses # 0.027 M/sec (scaled from 30.02%)

    2.001393933 seconds time elapsed

    $ perf stat -B noploop 2
    noploop for 2 seconds

    Performance counter stats for 'noploop 2':

    1998.297883 task-clock-msecs # 0.998 CPUs
    59 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    119 page-faults # 0.000 M/sec
    4,131,380,160 cycles # 2067.450 M/sec (scaled from 70.01%)
    2,059,096,507 instructions # 0.498 IPC (scaled from 70.01%)
    2,054,681,303 branches # 1028.216 M/sec (scaled from 70.01%)
    25,650 branch-misses # 0.001 % (scaled from 30.05%)
    2,056,283,014 cache-references # 1029.017 M/sec (scaled from 30.03%)
    47,097 cache-misses # 0.024 M/sec (scaled from 30.02%)

    2.001391016 seconds time elapsed

    Cc: David S. Miller
    Cc: Frédéric Weisbecker
    Cc: Ingo Molnar
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Tom Zanussi
    LKML-Reference:
    Signed-off-by: Stephane Eranian
    Signed-off-by: Arnaldo Carvalho de Melo

    Stephane Eranian
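    The grouping comes from the apostrophe (') printf flag, a POSIX
    extension that inserts the thousands separator of the current
    LC_NUMERIC locale. A minimal sketch, assuming a glibc-style printf;
    format_count and its big_num argument are hypothetical names that
    stand in for the -B switch, not perf's actual code:

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format an event count, optionally grouping
 * thousands. The ' flag (POSIX, not ISO C) uses the thousands separator
 * of the current LC_NUMERIC locale; in the "C" locale that separator is
 * empty, so no grouping happens there.
 */
static int format_count(char *buf, size_t len, unsigned long long count,
                        bool big_num)
{
    assert(buf != NULL && len > 0);
    if (big_num)
        return snprintf(buf, len, "%'llu", count);
    return snprintf(buf, len, "%llu", count);
}
```

    A caller would run setlocale(LC_NUMERIC, "") once at startup so that
    the LC_NUMERIC environment variable takes effect. In the "C" locale
    the separator is empty, so the flag has no visible effect there.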
     

14 May, 2010

1 commit

  • By default, event inheritance across fork and pthread_create was on, but
    the -i option of stat and record, which enabled inheritance, led users to
    believe it was off by default.

    This patch fixes this logic by inverting the meaning of the -i option. By
    default inheritance is on whether you attach to a process (-p), a thread (-t)
    or start a process. If you pass -i, you turn inheritance off. Turning off
    inheritance when you don't need it also helps limit perf's resource usage.

    The patch also fixes perf stat -t xxxx and perf record -t xxxx which did not
    start the counters.

    Acked-by: Frederic Weisbecker
    Cc: David S. Miller
    Cc: Frédéric Weisbecker
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Stephane Eranian
    Signed-off-by: Arnaldo Carvalho de Melo

    Stephane Eranian
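    Inheritance corresponds to the inherit bit of struct perf_event_attr
    from <linux/perf_event.h>. A minimal sketch of how an inverted -i
    flag could map onto it; no_inherit and setup_attr are hypothetical
    names, and the sketch only fills in the attribute without opening
    any event:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical: set when the user passes -i on the command line. */
static bool no_inherit;

/* Fill in an attribute for a hardware event. Children created by fork()
 * or pthread_create() are counted too, unless -i was given. */
static void setup_attr(struct perf_event_attr *attr, __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = config;
    attr->inherit = !no_inherit;   /* on by default, off with -i */
}
```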
     

14 Apr, 2010

1 commit

  • Parsing an option from the command line with OPT_BOOLEAN on a
    bool data type would not work on a big-endian machine due to the
    manner in which the boolean was being cast into an int and
    incremented. For example, running 'perf probe --list' on a
    PowerPC machine would fail to properly set the list_events bool
    and would therefore print out the usage information and
    terminate.

    This patch makes OPT_BOOLEAN work as expected with a bool
    datatype. For cases where the original OPT_BOOLEAN was
    intentionally being used to increment an int each time it was
    passed in on the command line, this patch introduces OPT_INCR
    with the old behaviour of OPT_BOOLEAN (the verbose variable is
    currently the only such example of this).

    I have reviewed every use of OPT_BOOLEAN to verify that a true
    C99 bool was passed. Where integers were used, I verified that
    they were only being used for boolean logic and changed them to
    bools to ensure that they would not be mistakenly used as ints.
    The major exception was the verbose variable which now uses
    OPT_INCR instead of OPT_BOOLEAN.

    Signed-off-by: Ian Munsie
    Acked-by: David S. Miller
    Cc: # NOTE: won't apply to .3[34].x cleanly, please backport
    Cc: Git development list
    Cc: Ian Munsie
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: KOSAKI Motohiro
    Cc: Hitoshi Mitake
    Cc: Rusty Russell
    Cc: Frederic Weisbecker
    Cc: Eric B Munson
    Cc: Valdis.Kletnieks@vt.edu
    Cc: WANG Cong
    Cc: Thiago Farina
    Cc: Masami Hiramatsu
    Cc: Xiao Guangrong
    Cc: Jaswinder Singh Rajput
    Cc: Arjan van de Ven
    Cc: OGAWA Hirofumi
    Cc: Mike Galbraith
    Cc: Tom Zanussi
    Cc: Anton Blanchard
    Cc: John Kacur
    Cc: Li Zefan
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ian Munsie
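    Treating a one-byte bool as an int and incrementing it touches
    sizeof(int) bytes. On a big-endian machine the byte that changes is
    the low-order one, which is the last of those bytes rather than the
    bool itself, so the bool stays false (and the access is undefined
    behaviour on any architecture). A simplified model of the fix, with
    one option type that stores a bool and another that increments an
    int; the struct option layout and parse_flags below are hypothetical
    and much simpler than perf's real parse-options code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified option types modelled on the commit text. */
enum opt_type { OPT_BOOLEAN, OPT_INCR };

struct option {
    enum opt_type type;
    char short_name;
    void *value;   /* bool * for OPT_BOOLEAN, int * for OPT_INCR */
};

/* Apply each single-letter flag in argv[1..argc-1], e.g. "-v" or "-l".
 * Returns 0 on success, -1 on an unknown or malformed flag. */
static int parse_flags(int argc, const char **argv,
                       const struct option *opts, size_t nopts)
{
    for (int i = 1; i < argc; i++) {
        const char *arg = argv[i];
        size_t j;

        if (arg[0] != '-' || arg[1] == '\0' || arg[2] != '\0')
            return -1;
        for (j = 0; j < nopts; j++) {
            if (opts[j].short_name != arg[1])
                continue;
            if (opts[j].type == OPT_BOOLEAN)
                *(bool *)opts[j].value = true;   /* writes exactly one bool */
            else
                (*(int *)opts[j].value)++;       /* counts repetitions */
            break;
        }
        if (j == nopts)
            return -1;
    }
    return 0;
}
```

    With this split, "-v -v" can raise a verbosity level to 2 while a
    flag like --list only ever stores true into its bool.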
     

23 Mar, 2010

1 commit

  • Before:

    [acme@doppio linux-2.6-tip]$ perf stat -a sleep 1s

    Performance counter stats for 'sleep 1s':

    task-clock-msecs
    context-switches
    CPU-migrations
    page-faults
    cycles
    instructions
    branches
    branch-misses
    cache-references
    cache-misses

    1.016998463 seconds time elapsed

    [acme@doppio linux-2.6-tip]$

    Now:

    [acme@doppio linux-2.6-tip]$ perf stat -a sleep 1s
    No permission to collect system-wide stats.
    Consider tweaking /proc/sys/kernel/perf_event_paranoid.
    [acme@doppio linux-2.6-tip]$

    Reported-by: Ingo Molnar
    Signed-off-by: Arnaldo Carvalho de Melo
    Cc: Frédéric Weisbecker
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arnaldo Carvalho de Melo
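    A sketch of the kind of check behind the new message, assuming the
    failure shows up as EACCES or EPERM from perf_event_open();
    open_error_message is a hypothetical helper, and the exact errno
    values perf tests are not stated in the commit:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: map a perf_event_open() failure to a user-facing hint.
 * Returns NULL when there is no specific hint for this errno. */
static const char *open_error_message(int err, bool system_wide)
{
    if (err == EACCES || err == EPERM) {
        return system_wide
            ? "No permission to collect system-wide stats.\n"
              "Consider tweaking /proc/sys/kernel/perf_event_paranoid.\n"
            : "No permission to collect stats.\n"
              "Consider tweaking /proc/sys/kernel/perf_event_paranoid.\n";
    }
    return NULL;
}
```

    The caller would print the message to stderr and exit, instead of
    printing an empty table as in the "Before" output.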
     

18 Mar, 2010

2 commits

  • The --pid (or -p) parameter of perf currently means thread-wide
    collection. For example, if a process whose id is 8888 has 10
    threads, 'perf top -p 8888' collects statistics for the main
    thread only. That's misleading: users are used to attaching to a
    whole process when debugging it with gdb. To follow that usage
    style, the patch changes --pid to mean process-wide collection and
    adds --tid (-t) to mean thread-wide collection.

    Usage example is:

    # perf top -p 8888
    # perf record -p 8888 -f sleep 10
    # perf stat -p 8888 -f sleep 10

    The above commands collect the statistics of all threads of
    process 8888.

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Arnaldo Carvalho de Melo
    Cc: Avi Kivity
    Cc: Peter Zijlstra
    Cc: Sheng Yang
    Cc: Joerg Roedel
    Cc: Jes Sorensen
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Cc: zhiteng.huang@intel.com
    Cc: Zachary Amsden
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Zhang, Yanmin
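    Process-wide collection needs the list of threads in the target
    process. On Linux they appear as entries under /proc/<pid>/task. A
    minimal sketch, assuming that interface, that gathers the thread IDs
    so one event could then be opened per thread; list_threads is a
    hypothetical helper, not perf's code:

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: store up to max thread IDs of process pid in tids.
 * Returns the number of threads found, or -1 if /proc/<pid>/task
 * cannot be opened (e.g. the process does not exist). */
static int list_threads(pid_t pid, pid_t *tids, int max)
{
    char path[64];
    DIR *dir;
    struct dirent *ent;
    int n = 0;

    snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
    dir = opendir(path);
    if (!dir)
        return -1;
    while (n < max && (ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')   /* skip "." and ".." */
            continue;
        tids[n++] = (pid_t)atoi(ent->d_name);
    }
    closedir(dir);
    return n;
}
```

    Note that the main thread's ID equals the process ID, so it is
    always part of the list.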
     
  • 'perf stat' doesn't enable the counters when collecting statistics
    for an existing process (-p) or system-wide. Fix the issue.

    Also change the condition for running a subcommand: if one is
    given, perf always forks and execs it. For example:

    # perf stat -a sleep 10

    This command collects statistics for precisely 10 seconds, and the
    user can still stop it earlier with CTRL+C. Without this change,
    CTRL+C was the only way to stop the collection, so its duration
    could not be controlled precisely.

    Another issue is that 'perf stat -a' consumes 100% of a single
    logical CPU, which has a bad impact on the running workload.

    Fix it by adding a sleep(1) call in the while(!done) loop in
    run_perf_stat().

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Arnaldo Carvalho de Melo
    Cc: Avi Kivity
    Cc: Peter Zijlstra
    Cc: Sheng Yang
    Cc: Marcelo Tosatti
    Cc: Joerg Roedel
    Cc: Jes Sorensen
    Cc: Gleb Natapov
    Cc: Zachary Amsden
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Zhang, Yanmin
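    The CPU-usage fix amounts to sleeping between checks of a flag that a
    signal handler sets. A minimal sketch under that reading; done,
    stop_handler, install_stop_handler and wait_until_done are
    hypothetical names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set asynchronously by the signal handler. */
static volatile sig_atomic_t done;

static void stop_handler(int sig)
{
    (void)sig;
    done = 1;
}

/* Install stop_handler for sig (e.g. SIGINT for Ctrl+C). */
static int install_stop_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = stop_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Wait for a stop request without spinning: sleep() returns early when
 * a signal is delivered, so the flag is re-checked promptly; otherwise
 * the loop wakes up only once per second instead of using a full CPU. */
static void wait_until_done(void)
{
    while (!done)
        sleep(1);
}
```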
     

11 Mar, 2010

1 commit

  • At present, the perf subcommands that do system-wide monitoring
    (perf stat, perf record and perf top) don't work properly unless
    the online cpus are numbered 0, 1, ..., N-1. These tools ask
    for the number of online cpus with sysconf(_SC_NPROCESSORS_ONLN)
    and then try to create events for cpus 0, 1, ..., N-1.

    This creates problems for systems where the online cpus are
    numbered sparsely. For example, a POWER6 system in
    single-threaded mode (i.e. only running 1 hardware thread per
    core) will have only even-numbered cpus online.

    This fixes the problem by reading the /sys/devices/system/cpu/online
    file to find out which cpus are online. The code that does that is in
    tools/perf/util/cpumap.[ch], and consists of a read_cpu_map()
    function that sets up a cpumap[] array and returns the number of
    online cpus. If /sys/devices/system/cpu/online can't be read or
    can't be parsed successfully, it falls back to using sysconf to
    ask how many cpus are online and sets up an identity map in cpumap[].

    The perf record, perf stat and perf top code then calls
    read_cpu_map() in the system-wide monitoring case (instead of
    sysconf) and uses cpumap[] to get the cpu numbers to pass to
    perf_event_open.

    Signed-off-by: Paul Mackerras
    Cc: Anton Blanchard
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mackerras
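    The online file holds a list of CPU ranges such as "0,2,4-7". A
    minimal sketch of parsing that format into an explicit array, with
    the sysconf() fallback to an identity map that the commit describes;
    parse_cpu_list and cpu_map_from_sysfs are hypothetical names, not the
    actual read_cpu_map() implementation:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Parse a CPU list such as "0,2,4-7\n" into cpus[]. Returns the number
 * of CPUs stored, or -1 if the text is malformed or exceeds max. */
static int parse_cpu_list(const char *s, int *cpus, int max)
{
    int n = 0;

    while (*s && *s != '\n') {
        char *end;
        long first = strtol(s, &end, 10), last;

        if (end == s || first < 0)
            return -1;
        last = first;
        if (*end == '-') {                 /* a range like 4-7 */
            s = end + 1;
            last = strtol(s, &end, 10);
            if (end == s || last < first)
                return -1;
        }
        for (long c = first; c <= last; c++) {
            if (n == max)
                return -1;
            cpus[n++] = (int)c;
        }
        s = end;
        if (*s == ',')
            s++;
        else if (*s && *s != '\n')
            return -1;
    }
    return n;
}

/* Read the online map, falling back to an identity map 0..N-1 when the
 * sysfs file is missing or cannot be parsed. */
static int cpu_map_from_sysfs(int *cpus, int max)
{
    char buf[4096];
    FILE *f = fopen("/sys/devices/system/cpu/online", "r");
    int n = -1;

    if (f) {
        if (fgets(buf, sizeof(buf), f))
            n = parse_cpu_list(buf, cpus, max);
        fclose(f);
    }
    if (n <= 0) {
        long online = sysconf(_SC_NPROCESSORS_ONLN);

        n = 0;
        while (n < online && n < max) {
            cpus[n] = n;
            n++;
        }
    }
    return n;
}
```

    On the single-threaded POWER6 example above, the file would list only
    even CPU numbers, and each entry of cpus[] would be passed on to
    perf_event_open unchanged.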
     

13 Jan, 2010

1 commit


15 Nov, 2009

1 commit

  • The ratio between the number of events and the elapsed time makes
    sense only if the task-clock event is counted. Otherwise it is
    simply a confusing

    # 0.000 M/sec

    This patch outputs the ratio only if the task-clock event is counted.
    Some before-and-after examples:

    Before:

    [lucas@skywalker linux.trees.git]$ sudo perf stat -e branch-misses -a -- sleep 1

    Performance counter stats for 'sleep 1':

    1367818 branch-misses # 0.000 M/sec

    1.001494325 seconds time elapsed

    After (without task-clock):

    [lucas@skywalker perf]$ sudo ./perf stat -e branch-misses -a -- sleep 1

    Performance counter stats for 'sleep 1':

    1135044 branch-misses

    1.001370775 seconds time elapsed

    After (with task-clock):

    [lucas@skywalker perf]$ sudo ./perf stat -e branch-misses -e task-clock -a -- sleep 1

    Performance counter stats for 'sleep 1':

    1070111 branch-misses # 0.534 M/sec
    2002.730893 task-clock-msecs # 1.999 CPUs

    1.001640292 seconds time elapsed

    Signed-off-by: Lucas De Marchi
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Lucas De Marchi
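    A minimal sketch of that rule, with task-clock given in milliseconds
    as perf stat prints it; format_line is a hypothetical helper, and a
    zero task-clock value stands for "not counted". With the figures
    above, 1070111 events over 2002.730893 ms gives 0.534 M/sec:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: format one output line. task_clock_msecs is 0 when the
 * task-clock event was not counted, in which case no rate is printed. */
static int format_line(char *buf, size_t len, unsigned long long count,
                       const char *name, double task_clock_msecs)
{
    if (task_clock_msecs > 0.0) {
        /* events per millisecond / 1000 = millions of events per second */
        double mps = (double)count / task_clock_msecs / 1000.0;

        return snprintf(buf, len, "%llu %s # %.3f M/sec", count, name, mps);
    }
    return snprintf(buf, len, "%llu %s", count, name);
}
```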
     

19 Oct, 2009

4 commits

  • Count branches first, cache-misses second. The reason is that
    on x86 branches are not counted by all counters on all CPUs.

    Before:

    Performance counter stats for 'ls':

    0.756653 task-clock-msecs # 0.802 CPUs
    0 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    250 page-faults # 0.330 M/sec
    2375725 cycles # 3139.781 M/sec
    1628129 instructions # 0.685 IPC
    19643 cache-references # 25.960 M/sec
    4608 cache-misses # 6.090 M/sec
    342532 branches # 452.694 M/sec
    branch-misses

    0.000943356 seconds time elapsed

    After:

    Performance counter stats for 'ls':

    1.056734 task-clock-msecs # 0.859 CPUs
    0 context-switches # 0.000 M/sec
    0 CPU-migrations # 0.000 M/sec
    259 page-faults # 0.245 M/sec
    3345932 cycles # 3166.295 M/sec
    3074090 instructions # 0.919 IPC
    616928 branches # 583.806 M/sec
    39279 branch-misses # 6.367 %
    21312 cache-references # 20.168 M/sec
    3661 cache-misses # 3.464 M/sec

    0.001230551 seconds time elapsed

    (also prettify the printout of branch misses, in case it's
    getting scaled.)

    Cc: Tim Blechmann
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar
    ---
    tools/perf/builtin-stat.c | 2 ++
    1 files changed, 2 insertions(+), 0 deletions(-)

    diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
    index c373683..95a55ea 100644
    --- a/tools/perf/builtin-stat.c
    +++ b/tools/perf/builtin-stat.c
    @@ -59,6 +59,8 @@ static struct perf_event_attr default_attrs[] = {
    { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_INSTRUCTIONS },
    { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_REFERENCES},
    { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_MISSES },
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS},
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_MISSES },

    };
    ---
    tools/perf/builtin-stat.c | 20 ++++++++++----------
    1 files changed, 10 insertions(+), 10 deletions(-)

    diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
    index 95a55ea..90e0a26 100644
    --- a/tools/perf/builtin-stat.c
    +++ b/tools/perf/builtin-stat.c
    @@ -50,17 +50,17 @@

    static struct perf_event_attr default_attrs[] = {

    - { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK },
    - { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CONTEXT_SWITCHES},
    - { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CPU_MIGRATIONS },
    - { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_PAGE_FAULTS },
    -
    - { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CPU_CYCLES },
    - { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_INSTRUCTIONS },
    - { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_REFERENCES},
    - { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_MISSES },
    - { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS},
    - { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_MISSES },
    + { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK },
    + { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CONTEXT_SWITCHES },
    + { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CPU_MIGRATIONS },
    + { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_PAGE_FAULTS },
    +
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CPU_CYCLES },
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_INSTRUCTIONS },
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_REFERENCES },
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_MISSES },
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_MISSES },

    };

    Ingo Molnar
     
  • Clean up the array definition to be vertically aligned.

    No functional effects.

    Cc: Tim Blechmann
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar
    ---
    tools/perf/builtin-stat.c | 2 ++
    1 files changed, 2 insertions(+), 0 deletions(-)

    diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
    index c373683..95a55ea 100644
    --- a/tools/perf/builtin-stat.c
    +++ b/tools/perf/builtin-stat.c
    @@ -59,6 +59,8 @@ static struct perf_event_attr default_attrs[] = {
    { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_INSTRUCTIONS },
    { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_REFERENCES},
    { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CACHE_MISSES },
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS},
    + { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_MISSES },

    };

    Ingo Molnar
     
  • Adds performance event information about branches
    and branch misses to the default output of perf stat.

    Signed-off-by: Tim Blechmann
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tim Blechmann
     
  • When we count both branches and branch-misses it is useful to
    print out the percentage of branch-misses:

    # perf stat -e branches -e branch-misses /bin/true

    Performance counter stats for '/bin/true':

    401684 branches # 0.000 M/sec
    23301 branch-misses # 5.801 %

    Signed-off-by: Anton Blanchard
    Cc: paulus@samba.org
    Cc: a.p.zijlstra@chello.nl
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Anton Blanchard
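    The percentage is simply branch-misses divided by branches, times
    100; with the numbers above, 23301 / 401684 * 100 is about 5.801. A
    minimal sketch with a guard for a zero branch count; branch_miss_pct
    is a hypothetical helper:

```c
#include <assert.h>

/* Hypothetical: percentage of branches that were mispredicted.
 * Returns 0.0 when no branches were counted, avoiding a division by zero. */
static double branch_miss_pct(unsigned long long misses,
                              unsigned long long branches)
{
    if (branches == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)branches;
}
```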
     

05 Oct, 2009

1 commit


22 Sep, 2009

1 commit

  • Before:

    0 sched:sched_switch # nan M/sec

    After:

    0 sched:sched_switch # 0.000 M/sec

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out of its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring and analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in all, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names (in an ABI-compatible fashion).

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks go to Stephane Eranian, who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility are not affected: this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Sep, 2009

1 commit


04 Sep, 2009

4 commits

  • Use the more advanced single-pass variance algorithm outlined
    on the Wikipedia page. This is numerically more stable for
    larger sample sets.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
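    The single-pass method usually described for this is Welford's online
    algorithm, which updates a running mean and a running sum of squared
    deviations (M2) per sample instead of subtracting two large sums.
    Assuming that is the algorithm meant, a minimal sketch; the struct
    and function names are assumptions, not necessarily perf's:

```c
#include <assert.h>
#include <stddef.h>

/* Running statistics using Welford's single-pass update. */
struct stats {
    size_t n;
    double mean;
    double M2;   /* sum of squared deviations from the current mean */
};

static void update_stats(struct stats *s, double x)
{
    double delta = x - s->mean;

    s->n++;
    s->mean += delta / (double)s->n;
    s->M2 += delta * (x - s->mean);   /* uses the updated mean */
}

/* Sample variance; needs at least two samples. */
static double variance_stats(const struct stats *s)
{
    assert(s->n > 1);
    return s->M2 / (double)(s->n - 1);
}
```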
     
  • When we're computing the mean by sampling the distribution,
    then the std dev of the mean is related to the std dev of the
    sample set by:

    stddev_mean = stddev / sqrt(N)

    Which is exactly what we want.

    This results in the error on the mean decreasing with
    increasing number of samples.

    Also fix the scaled == -1, aka not counted case.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
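    The relation follows from the variance of an average of N samples,
    assuming the runs are independent and each has variance sigma^2:

```latex
\operatorname{Var}(\bar{n})
  = \operatorname{Var}\Big(\frac{1}{N}\sum_{i=1}^{N} n_i\Big)
  = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(n_i)
  = \frac{\sigma^2}{N},
\qquad
\sigma_{\bar{n}} = \frac{\sigma}{\sqrt{N}}
```

    So, under that assumption, quadrupling the number of runs (-r)
    halves the reported error on the mean.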
     
  • Since we don't need all the individual samples to calculate the
    error, remove both the limit and the storage overhead associated
    with keeping them.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The current noise computation does:

    \Sum abs(n_i - avg(n)) * N^-1.5

    Which is (afaik) not a regular noise function, and needs the
    complete sample set available to post-process.

    Change this to use a regular stddev computation, which can be
    done by keeping two sums:

    stddev = sqrt( 1/N (\Sum n_i^2) - avg(n)^2 )

    For which we only need to keep \Sum n_i and \Sum n_i^2.

    Signed-off-by: Peter Zijlstra
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
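    Keeping only the count, the sum and the sum of squares is enough to
    recover the mean and the variance at the end. A minimal sketch of the
    two-sums formula above (population form, 1/N); the names are
    assumptions. This form can lose precision when the values are large
    and close together, which is what the later single-pass update
    addresses:

```c
#include <assert.h>
#include <stddef.h>

struct sums {
    size_t n;
    double sum;      /* \Sum n_i   */
    double sum_sq;   /* \Sum n_i^2 */
};

static void add_sample(struct sums *s, double x)
{
    s->n++;
    s->sum += x;
    s->sum_sq += x * x;
}

static double mean_of(const struct sums *s)
{
    return s->sum / (double)s->n;
}

/* 1/N (\Sum n_i^2) - avg(n)^2; the stddev is sqrt() of this value
 * (sqrt() is in <math.h>, usually linked with -lm). */
static double variance_of(const struct sums *s)
{
    double avg = mean_of(s);
    double var = s->sum_sq / (double)s->n - avg * avg;

    return var > 0.0 ? var : 0.0;   /* clamp tiny negative rounding error */
}
```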
     

17 Aug, 2009

1 commit

  • Librarize trace_event() helper so that perf trace can use it
    too. Also clean up the debug.h includes a bit.

    It's not good to have it included in perf.h: that makes it
    inflexible with respect to other headers it may need (headers
    that may themselves depend on perf.h, creating a recursive
    header dependency).

    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

12 Aug, 2009

1 commit

  • Factor multiple definitions of high-level dso helpers out into
    the symbol source file.

    As a side effect, the verbose and eprintf debugging helpers are
    now exported generally, from a new file dedicated to debugging.

    Signed-off-by: Frederic Weisbecker
    Cc: Arnaldo Carvalho de Melo
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Brice Goglin

    Frederic Weisbecker
     

09 Aug, 2009

1 commit


23 Jul, 2009

1 commit

  • perf stat and perf record currently look for all options on the command
    line. This can lead to some confusion:

    # perf stat ls -l
    Error: unknown switch `l'

    We can work around this by adding '--' before the command, but the
    git option parsing code can also stop at the first non-option
    argument, so make it do that:

    # perf stat ls -l
    Performance counter stats for 'ls -l':
    ....

    Signed-off-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    LKML-Reference:

    Anton Blanchard
     

02 Jul, 2009

1 commit

  • Building builtin-stat.c reports the following errors:

    cc1: warnings being treated as errors
    builtin-stat.c: In function ‘run_perf_stat’:
    builtin-stat.c:242: error: ignoring return value of ‘read’, declared with attribute warn_unused_result
    builtin-stat.c:255: error: ignoring return value of ‘read’, declared with attribute warn_unused_result
    make: *** [builtin-stat.o] Error 1

    This patch handles the possible pipe read failures.

    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Anton Blanchard
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
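    The warning fires because read() on the synchronization pipe can fail
    or return early. A minimal sketch of checking it; wait_for_go is a
    hypothetical helper that blocks until one byte arrives:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Block until one byte is read from fd. Returns 0 on success and -1 on
 * error or end-of-file (the writer closed its end without writing). */
static int wait_for_go(int fd)
{
    char buf;
    ssize_t ret;

    do {
        ret = read(fd, &buf, 1);
    } while (ret < 0 && errno == EINTR);   /* retry if interrupted */
    return ret == 1 ? 0 : -1;
}
```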
     

01 Jul, 2009

2 commits


30 Jun, 2009

3 commits

  • This provides a way to mark a counter to be enabled on the next
    exec. This is useful for measuring the total activity of a
    program without including overhead from the process that
    launches it.

    This also changes the perf stat command to use this new
    facility.

    Signed-off-by: Paul Mackerras
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mackerras
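    In struct perf_event_attr (from <linux/perf_event.h>) this is the
    enable_on_exec bit, normally combined with disabled so that the event
    starts stopped and is switched on when the monitored task calls exec.
    A minimal sketch of filling in such an attribute; init_exec_counter
    is a hypothetical helper and the event is not actually opened:

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical: prepare an instructions counter that stays off until
 * the target process calls exec(), so the launcher's own work is not
 * counted. */
static void init_exec_counter(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_INSTRUCTIONS;
    attr->disabled = 1;         /* created in the stopped state */
    attr->enable_on_exec = 1;   /* the kernel enables it at exec() */
}
```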
     
  • Vince Weaver reported that 'perf stat' adds measurement overhead to
    the count of retired instructions, which can inflate the reported
    count by roughly 6000 instructions.

    At present, perf stat creates its counters on the perf process. Thus
    the counters count the fork and various other activity in both the
    parent and child, such as the resolver overhead for resolving PLT
    entries for any libc functions that haven't been called before, such
    as execvp.

    This reduces the overhead by creating the counters on the child process
    after the fork, using a couple of pipes to synchronize so that the
    child process waits until the parent has created the counters before
    doing the exec. To eliminate the PLT resolution overhead on calling
    execvp, this does a dummy execvp first which will always fail.

    With this, the overhead of executing a program goes down from over
    4800 instructions to about 90 instructions on powerpc (32-bit).
    This was measured with a statically-linked program written in
    assembler which only does the 3 instructions needed to call _exit(0).

    Before:

    $ perf stat -e 0:1:u ./three

    Performance counter stats for './three':

    4858 instructions

    0.001274523 seconds time elapsed

    After:

    $ perf stat -e 0:1:u ./three

    Performance counter stats for './three':

    92 instructions

    0.000468153 seconds time elapsed

    Reported-by: Vince Weaver
    Signed-off-by: Paul Mackerras
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mackerras
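    The fork/pipe handshake can be sketched as follows. This is a minimal
    illustration, not perf's code: run_measured is hypothetical, and the
    create_counters callback (here the placeholder no_counters) stands in
    for opening events on the child's PID. The child reports that it
    exists, waits for the go signal, then execs; the dummy execvp() of an
    empty name always fails but, with lazy binding, resolves execvp's PLT
    entry before the measured exec:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical placeholder: a real tool would open its counters on pid. */
static int no_counters(pid_t pid)
{
    (void)pid;
    return 0;
}

/* Run argv[0] as a child, creating counters on the child only.
 * Returns the child's exit status, or -1 on failure. */
static int run_measured(char *const argv[], int (*create_counters)(pid_t))
{
    int ready[2], go[2], status, ok;
    char c = 0;
    pid_t pid;

    if (pipe(ready) < 0)
        return -1;
    if (pipe(go) < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(ready[0]); close(ready[1]);
        close(go[0]); close(go[1]);
        return -1;
    }
    if (pid == 0) {                          /* child */
        close(ready[0]);
        close(go[1]);
        if (write(ready[1], &c, 1) != 1)     /* tell the parent we exist */
            _exit(127);
        if (read(go[0], &c, 1) != 1)         /* wait for the counters */
            _exit(127);
        execvp("", argv);                    /* dummy call, always fails */
        execvp(argv[0], argv);
        _exit(127);                          /* the real exec failed */
    }
    close(ready[1]);                         /* parent */
    close(go[0]);
    ok = read(ready[0], &c, 1) == 1 &&
         create_counters(pid) == 0 &&
         write(go[1], &c, 1) == 1;
    close(go[1]);        /* on failure the child reads EOF and exits */
    close(ready[0]);
    if (waitpid(pid, &status, 0) < 0 || !ok || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```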
     
  • Peter expressed a strong preference for percentage based
    display of scaled values - so revert to that from the
    recently introduced multiplication-factor unit.

    Reported-by: Peter Zijlstra
    Cc: Jaswinder Singh Rajput
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Jun, 2009

2 commits

  • Set attrs and nr_counters if no event is selected and !null_run.

    How much of attrs gets set should depend on the number of default
    counters, so memcpy only sizeof(default_attrs) bytes.

    Also set nr_counters to ARRAY_SIZE(default_attrs) instead of a
    hardcoded value.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jaswinder Singh Rajput
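    A minimal sketch of sizing both the copy and the counter count from
    the array itself. ARRAY_SIZE is the common C idiom; struct attr,
    MAX_COUNTERS and the placeholder entries are simplified, hypothetical
    stand-ins for perf's struct perf_event_attr arrays:

```c
#include <assert.h>
#include <string.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
#define MAX_COUNTERS 256

/* Simplified stand-in for struct perf_event_attr. */
struct attr {
    int type;
    unsigned long long config;
};

static const struct attr default_attrs[] = {
    { 1, 1 },   /* placeholder values */
    { 0, 0 },
    { 0, 1 },
};

static struct attr attrs[MAX_COUNTERS];
static int nr_counters;

/* Fall back to the default events only when the user selected none and
 * this is not a null run; copy exactly sizeof(default_attrs) bytes. */
static void setup_default_events(int null_run)
{
    if (nr_counters != 0 || null_run)
        return;
    memcpy(attrs, default_attrs, sizeof(default_attrs));
    nr_counters = (int)ARRAY_SIZE(default_attrs);
}
```

    Adding an entry to default_attrs then needs no other change, which is
    the point of dropping the hardcoded value.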
     
  • Increase the size of the event name buffer to handle longer names
    such as 'L1-d$-prefetch-misses'.

    Changed scaled counters from percentage to a multiplicative
    factor because the latter is more expressive.

    Also aligned the scaling factor, otherwise sometimes it looks
    like:

    384 iTLB-load-misses (4.74x scaled)
    452029 branch-loads (8.00x scaled)
    5892 branch-load-misses (20.39x scaled)
    972315 iTLB-loads (3.24x scaled)

    Before:
    150708 L1-d$-stores (scaled from 23.57%)
    428804 L1-d$-prefetches (scaled from 23.47%)
    314446 L1-d$-prefetch-misses (scaled from 23.42%)
    252626137 L1-i$-loads (scaled from 23.24%)
    5297550 dTLB-load-misses (scaled from 23.96%)
    106992392 branch-loads (scaled from 23.67%)
    5239561 branch-load-misses (scaled from 23.43%)

    After:
    1731713 L1-d$-loads ( 14.25x scaled)
    44241 L1-d$-prefetches ( 3.88x scaled)
    21076 L1-d$-prefetch-misses ( 3.40x scaled)
    5789421 L1-i$-loads ( 3.78x scaled)
    29645 dTLB-load-misses ( 2.95x scaled)
    461474 branch-loads ( 6.52x scaled)
    7493 branch-load-misses ( 26.57x scaled)

    Reported-by: Ingo Molnar
    Signed-off-by: Jaswinder Singh Rajput
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jaswinder Singh Rajput
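    When counters are multiplexed, each raw count is extrapolated from the
    fraction of the enabled time during which it was actually running.
    The percentage and the multiplication factor are reciprocal views of
    that fraction: "scaled from 23.57%" corresponds to about 4.24x. A
    minimal sketch; scale_count is a hypothetical helper:

```c
#include <assert.h>

/* Hypothetical: scale a raw count by enabled/running time.
 * *pct receives running/enabled as a percentage ("scaled from X%"),
 * *factor receives enabled/running ("N.NNx scaled").
 * Returns the raw count when the event ran the whole time, 0 if never. */
static unsigned long long scale_count(unsigned long long raw,
                                      unsigned long long enabled,
                                      unsigned long long running,
                                      double *pct, double *factor)
{
    if (running == 0) {
        *pct = 0.0;
        *factor = 0.0;
        return 0;
    }
    *pct = 100.0 * (double)running / (double)enabled;
    *factor = (double)enabled / (double)running;
    if (running == enabled)
        return raw;
    /* round to nearest when extrapolating */
    return (unsigned long long)((double)raw * (double)enabled /
                                (double)running + 0.5);
}
```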
     

27 Jun, 2009

2 commits

  • In multi-run (-r/--repeat) printouts, print out the noise of
    the wall-clock average as well.

    Also, fix a bug in printing out scaled counters: if it was not
    scaled then we should not update the average with -1.

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Allow a no-counters run. This can be useful to measure just
    elapsed wall-clock time - or to assess the raw overhead of perf
    stat itself, without running any counters.

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

24 Jun, 2009

2 commits

  • Remove dead code and do some code alignment.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jaswinder Singh Rajput
     
  • Error messages for verbose mode (-v) should go to stderr; otherwise
    they are lost in cases like:

    $ ./perf stat -v > /dev/null

    For example, the bus-cycles event is not available on AMD, so the
    output now looks like:

    $ ./perf stat -v -e bus-cycles ls > /dev/null
    Error: counter 0, sys_perf_counter_open() syscall returned with -1 (Invalid argument)

    Performance counter stats for 'ls':

    bus-cycles

    0.006765877 seconds time elapsed.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jaswinder Singh Rajput
     

20 Jun, 2009

1 commit

  • On 64-bit powerpc, __u64 is defined to be unsigned long rather than
    unsigned long long. This causes compiler warnings every time we
    print a __u64 value with %Lx.

    Rather than changing __u64, we define our own u64 to be unsigned long
    long on all architectures, and similarly s64 as signed long long.
    For consistency we also define u32, s32, u16, s16, u8 and s8. These
    definitions are put in a new header, types.h, because these definitions
    are needed in util/string.h and util/symbol.h.

    The main change here is the mechanical change of __[us]{64,32,16,8}
    to remove the "__". The other changes are:

    * Create types.h
    * Include types.h in perf.h, util/string.h and util/symbol.h
    * Add types.h to the LIB_H definition in Makefile
    * Added (u64) casts in process_overflow_event() and print_sym_table()
    to kill two remaining warnings.

    Signed-off-by: Paul Mackerras
    Acked-by: Peter Zijlstra
    Cc: benh@kernel.crashing.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mackerras
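    A minimal sketch of what such a types.h could contain, following the
    description above: u64 is unsigned long long on every architecture,
    so a u64 printed with %llx matches its format on 64-bit powerpc too.
    The header guard name is an assumption:

```c
#ifndef TYPES_H_SKETCH
#define TYPES_H_SKETCH

#include <assert.h>

/* Fixed definitions, independent of how __u64 is defined per arch. */
typedef unsigned long long u64;
typedef signed long long   s64;
typedef unsigned int       u32;
typedef signed int         s32;
typedef unsigned short     u16;
typedef signed short       s16;
typedef unsigned char      u8;
typedef signed char        s8;

/* Compile-time check that u64 really is 64 bits wide. */
_Static_assert(sizeof(u64) == 8, "u64 must be 64 bits wide");

#endif /* TYPES_H_SKETCH */
```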
     

13 Jun, 2009

2 commits

  • If -vv (very verbose) is specified, print out raw data
    in the following format:

    $ perf stat -vv -r 3 ./loop_1b_instructions

    [ perf stat: executing run #1 ... ]
    [ perf stat: executing run #2 ... ]
    [ perf stat: executing run #3 ... ]

    debug: runtime[0]: 235871872
    debug: walltime[0]: 236646752
    debug: runtime_cycles[0]: 755150182
    debug: counter/0[0]: 235871872
    debug: counter/1[0]: 235871872
    debug: counter/2[0]: 235871872
    debug: scaled[0]: 0
    debug: counter/0[1]: 2
    debug: counter/1[1]: 235870662
    debug: counter/2[1]: 235870662
    debug: scaled[1]: 0
    debug: counter/0[2]: 1
    debug: counter/1[2]: 235870437
    debug: counter/2[2]: 235870437
    debug: scaled[2]: 0
    debug: counter/0[3]: 140
    debug: counter/1[3]: 235870298
    debug: counter/2[3]: 235870298
    debug: scaled[3]: 0
    debug: counter/0[4]: 755150182
    debug: counter/1[4]: 235870145
    debug: counter/2[4]: 235870145
    debug: scaled[4]: 0
    debug: counter/0[5]: 1001411258
    debug: counter/1[5]: 235868838
    debug: counter/2[5]: 235868838
    debug: scaled[5]: 0
    debug: counter/0[6]: 27897
    debug: counter/1[6]: 235868560
    debug: counter/2[6]: 235868560
    debug: scaled[6]: 0
    debug: counter/0[7]: 2910
    debug: counter/1[7]: 235868151
    debug: counter/2[7]: 235868151
    debug: scaled[7]: 0
    debug: runtime[0]: 235980257
    debug: walltime[0]: 236770942
    debug: runtime_cycles[0]: 755114546
    debug: counter/0[0]: 235980257
    debug: counter/1[0]: 235980257
    debug: counter/2[0]: 235980257
    debug: scaled[0]: 0
    debug: counter/0[1]: 3
    debug: counter/1[1]: 235980049
    debug: counter/2[1]: 235980049
    debug: scaled[1]: 0
    debug: counter/0[2]: 1
    debug: counter/1[2]: 235979907
    debug: counter/2[2]: 235979907
    debug: scaled[2]: 0
    debug: counter/0[3]: 135
    debug: counter/1[3]: 235979780
    debug: counter/2[3]: 235979780
    debug: scaled[3]: 0
    debug: counter/0[4]: 755114546
    debug: counter/1[4]: 235979652
    debug: counter/2[4]: 235979652
    debug: scaled[4]: 0
    debug: counter/0[5]: 1001439771
    debug: counter/1[5]: 235979304
    debug: counter/2[5]: 235979304
    debug: scaled[5]: 0
    debug: counter/0[6]: 23723
    debug: counter/1[6]: 235979050
    debug: counter/2[6]: 235979050
    debug: scaled[6]: 0
    debug: counter/0[7]: 2213
    debug: counter/1[7]: 235978820
    debug: counter/2[7]: 235978820
    debug: scaled[7]: 0
    debug: runtime[0]: 235888002
    debug: walltime[0]: 236700533
    debug: runtime_cycles[0]: 754881504
    debug: counter/0[0]: 235888002
    debug: counter/1[0]: 235888002
    debug: counter/2[0]: 235888002
    debug: scaled[0]: 0
    debug: counter/0[1]: 2
    debug: counter/1[1]: 235887793
    debug: counter/2[1]: 235887793
    debug: scaled[1]: 0
    debug: counter/0[2]: 1
    debug: counter/1[2]: 235887645
    debug: counter/2[2]: 235887645
    debug: scaled[2]: 0
    debug: counter/0[3]: 135
    debug: counter/1[3]: 235887499
    debug: counter/2[3]: 235887499
    debug: scaled[3]: 0
    debug: counter/0[4]: 754881504
    debug: counter/1[4]: 235887368
    debug: counter/2[4]: 235887368
    debug: scaled[4]: 0
    debug: counter/0[5]: 1001401731
    debug: counter/1[5]: 235887024
    debug: counter/2[5]: 235887024
    debug: scaled[5]: 0
    debug: counter/0[6]: 24212
    debug: counter/1[6]: 235886786
    debug: counter/2[6]: 235886786
    debug: scaled[6]: 0
    debug: counter/0[7]: 1824
    debug: counter/1[7]: 235886560
    debug: counter/2[7]: 235886560
    debug: scaled[7]: 0

    Performance counter stats for '/home/mingo/loop_1b_instructions' (3 runs):

    235.913377 task-clock-msecs # 0.997 CPUs ( +- 0.011% )
    2 context-switches # 0.000 M/sec ( +- 0.000% )
    1 CPU-migrations # 0.000 M/sec ( +- 0.000% )
    136 page-faults # 0.001 M/sec ( +- 0.730% )
    755048744 cycles # 3200.534 M/sec ( +- 0.009% )
    1001417586 instructions # 1.326 IPC ( +- 0.001% )
    25277 cache-references # 0.107 M/sec ( +- 3.988% )
    2315 cache-misses # 0.010 M/sec ( +- 9.845% )

    0.236706075 seconds time elapsed.

    This allows the summary stats to be validated against the raw per-run values.
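    The raw values make it possible to redo the averaging by hand. The
    sketch below assumes that counter/0 holds each event's raw count and
    that the debug indices [0]..[7] follow the order of the summary lines
    (task-clock in ns, context-switches, CPU-migrations, page-faults,
    cycles, instructions, cache-references, cache-misses). It recomputes
    the per-event averages with truncating integer division. The table
    raw and the helper run_avg() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_RUNS     3
#define NR_COUNTERS 8

/*
 * counter/0 values copied from the -vv output above, one row per run.
 * The column-to-event mapping is inferred from the summary order.
 */
static const uint64_t raw[NR_RUNS][NR_COUNTERS] = {
	{ 235871872, 2, 1, 140, 755150182, 1001411258, 27897, 2910 },
	{ 235980257, 3, 1, 135, 755114546, 1001439771, 23723, 2213 },
	{ 235888002, 2, 1, 135, 754881504, 1001401731, 24212, 1824 },
};

/*
 * Average of one counter across all runs. The integer division
 * truncates, which is assumed to match how the summary rounds.
 */
static uint64_t run_avg(size_t counter)
{
	uint64_t sum = 0;

	for (size_t run = 0; run < NR_RUNS; run++)
		sum += raw[run][counter];
	return sum / NR_RUNS;
}
```

    Under these assumptions run_avg(4) yields 755048744 and run_avg(0)
    yields 235913377 ns, matching the 755048744 cycles and the
    235.913377 task-clock-msecs in the summary; the other six counters
    match as well. The ( +- x% ) noise column is not recomputed, since its
    exact formula is not stated here.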

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Add the --repeat feature to perf stat, which repeats a given
    command up to 100 times, collects the stats and calculates an
    average and a standard deviation.

    For example, the following one-liner 'perf stat' command runs hackbench
    5 times and prints a table of all metrics, with their averages and
    noise levels (in percent):

    aldebaran:~/linux/linux/tools/perf> ./perf stat --repeat 5 ~/hackbench 10
    Time: 0.117
    Time: 0.108
    Time: 0.089
    Time: 0.088
    Time: 0.100

    Performance counter stats for '/home/mingo/hackbench 10' (5 runs):

    1243.989586 task-clock-msecs # 10.460 CPUs ( +- 4.720% )
    47706 context-switches # 0.038 M/sec ( +- 19.706% )
    387 CPU-migrations # 0.000 M/sec ( +- 3.608% )
    17793 page-faults # 0.014 M/sec ( +- 0.354% )
    3770941606 cycles # 3031.329 M/sec ( +- 4.621% )
    1566372416 instructions # 0.415 IPC ( +- 2.703% )
    16783421 cache-references # 13.492 M/sec ( +- 5.202% )
    7128590 cache-misses # 5.730 M/sec ( +- 7.420% )

    0.118924455 seconds time elapsed.

    The goal of this feature is to make it possible to rely on these
    statistics, and to find out how many times a command has to be repeated
    for the noise to drop to an acceptable level.

    (The -v option can be used to see a line printed out as each run progresses.)

    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar