12 Nov, 2013

1 commit

  • Pull perf updates from Ingo Molnar:
    "As a first remark I'd like to note that the way to build perf tooling
    has been simplified and sped up, in the future it should be enough for
    you to build perf via:

    cd tools/perf/
    make install

    (ie without the -j option.) The build system will figure out the
    number of CPUs and will do a parallel build+install.

    The various build system inefficiencies and breakages Linus reported
    against the v3.12 pull request should now be resolved - please
    (re-)report any remaining annoyances or bugs.

    Main changes on the perf kernel side:

    * Performance optimizations:
    . perf ring-buffer code optimizations, by Peter Zijlstra
    . perf ring-buffer code optimizations, by Oleg Nesterov
    . x86 NMI call-stack processing optimizations, by Peter Zijlstra
    . perf context-switch optimizations, by Peter Zijlstra
    . perf sampling speedups, by Peter Zijlstra
    . x86 Intel PEBS processing speedups, by Peter Zijlstra

    * Enhanced hardware support:
    . for Intel Ivy Bridge-EP uncore PMUs, by Zheng Yan
    . for Haswell transactions, by Andi Kleen, Peter Zijlstra

    * Core perf events code enhancements and fixes by Oleg Nesterov:
    . for uprobes, if fork() is called with pending ret-probes
    . for uprobes platform support code

    * New ABI details by Andi Kleen:
    . Report x86 Haswell TSX transaction abort cost as weight

    Main changes on the perf tooling side (some of these tooling changes
    utilize the above kernel side changes):

    * 'perf report/top' enhancements:

    . Convert callchain children list to rbtree, greatly reducing the
    time taken for callchain processing, from Namhyung Kim.

    . Add new COMM infrastructure, further improving histogram
    processing, from Frédéric Weisbecker, one fix from Namhyung Kim.

    . Add /proc/kcore based live-annotation improvements, including
    build-id cache support, multi map 'call' instruction navigation
    fixes, kcore address validation, objdump workarounds. From
    Adrian Hunter.

    . Show progress on histogram collapsing, that can take a long
    time, from Namhyung Kim.

    . Add --max-stack option to limit callchain stack scan in 'top'
    and 'report', improving callchain processing when reducing the
    stack depth is an option, from Waiman Long.

    . Add new option --ignore-vmlinux for perf top, from Willy
    Tarreau.

    * 'perf trace' enhancements:

    . 'perf trace' can now use a 'perf probe' dynamic tracepoint
    to hook into the userspace -> kernel pathname copy so that it
    can map fds to pathnames without reading /proc/pid/fd/ symlinks.
    From Arnaldo Carvalho de Melo.

    . Show VFS path associated with fd in live sessions, using a
    'vfs_getname' 'perf probe' created dynamic tracepoint or by
    looking at /proc/pid/fd, from Arnaldo Carvalho de Melo.

    . Add 'trace' beautifiers for lots of syscall arguments, from
    Arnaldo Carvalho de Melo.

    . Implement more compact 'trace' output by suppressing zeroed
    args, from Arnaldo Carvalho de Melo.

    . Show thread COMM by default in 'trace', from Arnaldo Carvalho de
    Melo.

    . Add option to show full timestamp in 'trace', from David Ahern.

    . Add 'record' command in 'trace', to record raw_syscalls:*, from
    David Ahern.

    . Add summary option to dump syscall statistics in 'trace', from
    David Ahern.

    . Improve error messages in 'trace', providing hints about system
    configuration steps needed for using it, from Ramkumar
    Ramachandra.

    . 'perf trace' now emits hints as to why tracing is not possible,
    helping the user to setup the system to allow tracing in the
    desired permission granularity, telling if the problem is due to
    debugfs not being mounted or with not enough permission for
    !root, /proc/sys/kernel/perf_event_paranoid value, etc. From
    Arnaldo Carvalho de Melo.

    * 'perf record' enhancements:

    . Check maximum frequency rate for record/top, emitting better
    error messages, from Jiri Olsa.

    . 'perf record' code cleanups, from David Ahern.

    . Improve write_output error message in 'perf record', from Adrian
    Hunter.

    . Allow specifying B/K/M/G unit to the --mmap-pages arguments,
    from Jiri Olsa.

    . Fix command line callchain attribute tests to handle the new
    -g/--call-chain semantics, from Arnaldo Carvalho de Melo.

    * 'perf kvm' enhancements:

    . Disable live kvm command if timerfd is not supported, from David
    Ahern.

    . Fix detection of non-core features, from David Ahern.

    * 'perf list' enhancements:

    . Add usage to 'perf list', from David Ahern.

    . Show error in 'perf list' if tracepoints not available, from
    Pekka Enberg.

    * 'perf probe' enhancements:

    . Support "$vars" meta argument syntax for local variables,
    allowing asking for all possible variables at a given probe
    point to be collected when it hits, from Masami Hiramatsu.

    * 'perf sched' enhancements:

    . Address the root cause of that 'perf sched' stack initialization
    build slowdown, by programmatically setting a big array after
    moving the global variable back to the stack. Fix from Adrian
    Hunter.

    * 'perf script' enhancements:

    . Set up output options for in-stream attributes, from Adrian
    Hunter.

    . Print addr by default for BTS in 'perf script', from Adrian
    Hunter.

    * 'perf stat' enhancements:

    . Improved messages when doing profiling in all or a subset of
    CPUs using a workload as the session delimiter, as in:

    'perf stat --cpu 0,2 sleep 10s'

    from Arnaldo Carvalho de Melo.

    . Add units to nanosec-based counters in 'perf stat', from David
    Ahern.

    . Remove bogus info when using 'perf stat' -e cycles/instructions,
    from Ramkumar Ramachandra.

    * 'perf lock' enhancements:

    . 'perf lock' fixes and cleanups, from Davidlohr Bueso.

    * 'perf test' enhancements:

    . Fixup PERF_SAMPLE_TRANSACTION handling in sample synthesizing
    and 'perf test', from Adrian Hunter.

    . Clarify the "sample parsing" test entry, from Arnaldo Carvalho
    de Melo.

    . Consider PERF_SAMPLE_TRANSACTION in the "sample parsing" test,
    from Arnaldo Carvalho de Melo.

    . Memory leak fixes in 'perf test', from Felipe Pena.

    * 'perf bench' enhancements:

    . Change the procps visible command-name of individual benchmark
    tests plus cleanups, from Ingo Molnar.

    * Generic perf tooling infrastructure/plumbing changes:

    . Separating data file properties from session, code
    reorganization from Jiri Olsa.

    . Fix version when building out of tree, as when using one of
    these:

    $ make help | grep perf
    perf-tar-src-pkg - Build perf-3.12.0.tar source tarball
    perf-targz-src-pkg - Build perf-3.12.0.tar.gz source tarball
    perf-tarbz2-src-pkg - Build perf-3.12.0.tar.bz2 source tarball
    perf-tarxz-src-pkg - Build perf-3.12.0.tar.xz source tarball
    $

    from David Ahern.

    . Enhance option parse error message, showing just the help lines
    of the options affected, from Namhyung Kim.

    . libtraceevent updates from upstream trace-cmd repo, from Steven
    Rostedt.

    . Always use perf_evsel__set_sample_bit to set sample_type, from
    Adrian Hunter.

    . Memory and mmap leak fixes from Chenggang Qin.

    . Assorted build fixes from David Ahern and Jiri Olsa.

    . Speed up and prettify the build system, from Ingo Molnar.

    . Implement addr2line directly using libbfd, from Roberto Vitillo.

    . Separate the GTK support into a separate libperf-gtk.so DSO, that
    is only loaded when --gtk is specified, from Namhyung Kim.

    . perf bash completion fixes and improvements from Ramkumar
    Ramachandra.

    . Support for Openembedded/Yocto -dbg packages, from Ricardo
    Ribalda Delgado.

    And lots and lots of other fixes and code reorganizations that did not
    make it into the list, see the shortlog, diffstat and the Git log for
    details!"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (300 commits)
    uprobes: Fix the memory out of bound overwrite in copy_insn()
    uprobes: Fix the wrong usage of current->utask in uprobe_copy_process()
    perf tools: Remove unneeded include
    perf record: Remove post_processing_offset variable
    perf record: Remove advance_output function
    perf record: Refactor feature handling into a separate function
    perf trace: Don't relookup fields by name in each sample
    perf tools: Fix version when building out of tree
    perf evsel: Ditch evsel->handler.data field
    uprobes: Export write_opcode() as uprobe_write_opcode()
    uprobes: Introduce arch_uprobe->ixol
    uprobes: Kill module_init() and module_exit()
    uprobes: Move function declarations out of arch
    perf/x86/intel: Add Ivy Bridge-EP uncore IRP box support
    perf/x86/intel/uncore: Add filter support for IvyBridge-EP QPI boxes
    perf: Factor out strncpy() in perf_event_mmap_event()
    tools/perf: Add required memory barriers
    perf: Fix arch_perf_out_copy_user default
    perf: Update a stale comment
    perf: Optimize perf_output_begin() -- address calculation
    ...

    Linus Torvalds
     

10 Nov, 2013

2 commits

  • 1. copy_insn() doesn't look very nice, all calculations are
    confusing and it is not immediately clear why we read
    the 2nd page first.

    2. The usage of inode->i_size is wrong on 32-bit machines.

    3. "Instruction at end of binary" logic is simply wrong, it
    doesn't handle the case when uprobe->offset > inode->i_size.

    In this case "bytes" overflows, and __copy_insn() writes to
    the memory outside of uprobe->arch.insn.

    Yes, uprobe_register() checks i_size_read(), but this file
    can be truncated after that. All i_size checks are racy, we
    do this only to catch the obvious mistakes.

    Change copy_insn() to call __copy_insn() in a loop, simplify
    and fix the bytes/nbytes calculations.

    Note: we do not care if we read extra bytes after inode->i_size
    if we got the valid page. This is fine because the task gets the
    same page after page-fault, and arch_uprobe_analyze_insn() can't
    know how many bytes were actually read anyway.

    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
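
A minimal userspace sketch of the looped copy described in the commit above, assuming a tiny 16-byte "page" and a flat byte array standing in for the page cache. The names sketch_copy_insn(), sketch_copy_from_page() and SKETCH_PAGE_SIZE are hypothetical and this is not the kernel's code; it only illustrates how clamping each chunk to the current page keeps the destination writes within nbytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny "page" so that crossing a page boundary is easy to exercise. */
#define SKETCH_PAGE_SIZE ((size_t)16)

/*
 * Hypothetical stand-in for __copy_insn(): copies len bytes that all live
 * inside a single page of the backing store. Returns 0 on success, or -1
 * when the page is not part of the store (e.g. the file was truncated).
 */
static int sketch_copy_from_page(const unsigned char *store, size_t npages,
                                 unsigned char *dst, size_t len, size_t offset)
{
    size_t page = offset / SKETCH_PAGE_SIZE;
    size_t in_page = offset % SKETCH_PAGE_SIZE;

    assert(in_page + len <= SKETCH_PAGE_SIZE);
    if (page >= npages)
        return -1;
    memcpy(dst, store + offset, len);
    return 0;
}

/*
 * Copy nbytes starting at offset, one page-bounded chunk per iteration.
 * Each chunk is clamped to what is left in the current page, so the
 * destination is never written past nbytes, whatever the offset is.
 */
static int sketch_copy_insn(const unsigned char *store, size_t npages,
                            unsigned char *insn, size_t nbytes, size_t offset)
{
    while (nbytes > 0) {
        size_t room = SKETCH_PAGE_SIZE - (offset % SKETCH_PAGE_SIZE);
        size_t len = nbytes < room ? nbytes : room;

        if (sketch_copy_from_page(store, npages, insn, len, offset) != 0)
            return -1;
        insn += len;
        offset += len;
        nbytes -= len;
    }
    return 0;
}
```

With a two-page store, a request that starts beyond the last page fails cleanly instead of producing a wrapped-around length.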
     
  • Commit aa59c53fd459 "uprobes: Change uprobe_copy_process() to dup
    xol_area" has a stupid typo, we need to setup t->utask->vaddr but
    the code wrongly uses current->utask.

    Even with this bug dup_xol_work() works "in practice", but only
    because get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE) likely
    returns the same address every time.

    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     

07 Nov, 2013

4 commits

  • Pull driver core / sysfs patches from Greg KH:
    "Here's the big driver core / sysfs update for 3.13-rc1.

    There's lots of dev_groups updates for different subsystems, as they
    all get slowly migrated over to the safe versions of the attribute
    groups (removing userspace races with the creation of the sysfs
    files.) Also in here are some kobject updates, devres expansions, and
    the first round of Tejun's sysfs reworking to enable it to be used by
    other subsystems as a backend for an in-kernel filesystem.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
    sysfs: rename sysfs_assoc_lock and explain what it's about
    sysfs: use generic_file_llseek() for sysfs_file_operations
    sysfs: return correct error code on unimplemented mmap()
    mdio_bus: convert bus code to use dev_groups
    device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
    sysfs: separate out dup filename warning into a separate function
    sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
    sysfs: remove unused sysfs_get_dentry() prototype
    sysfs: honor bin_attr.attr.ignore_lockdep
    sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
    devres: restore zeroing behavior of devres_alloc()
    sysfs: fix sysfs_write_file for bin file
    input: gameport: convert bus code to use dev_groups
    input: serio: remove bus usage of dev_attrs
    input: serio: use DEVICE_ATTR_RO()
    i2o: convert bus code to use dev_groups
    memstick: convert bus code to use dev_groups
    tifm: convert bus code to use dev_groups
    virtio: convert bus code to use dev_groups
    ipack: convert bus code to use dev_groups
    ...

    Linus Torvalds
     
  • set_swbp() and set_orig_insn() are __weak, but this is pointless
    because write_opcode() is static.

    Export write_opcode() as uprobe_write_opcode() for the upcoming
    arm port, this way it can actually override set_swbp() and use
    __opcode_to_mem_arm(bpinsn) instead of UPROBE_SWBP_INSN.

    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • Currently xol_get_insn_slot() assumes that we should simply copy
    arch_uprobe->insn[] which is (ignoring arch_uprobe_analyze_insn)
    just the copy of the original insn.

    This is not true for arm which needs to create another insn to
    execute it out-of-line.

    So this patch simply adds the new member, ->ixol into the union.
    This doesn't make any difference for x86 and powerpc, but arm
    can divorce insn/ixol and initialize the correct xol insn in
    arch_uprobe_analyze_insn().

    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • Turn module_init() into __initcall() and kill module_exit().

    This code can't be compiled as a module so these module_*()
    calls only add confusion, especially if arch-dependent
    code needs its own initialization hooks.

    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     

06 Nov, 2013

8 commits

  • While this is really minor, strncpy() does unnecessary
    zero-padding up to the end of tmp[16], and it is called every time
    we are going to use the string literal.

    Turn these strncpy()'s into the single strlcpy() under the new
    label, saves 72 bytes.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131017182417.GA17753@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
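
The zero-padding difference mentioned in the commit above can be seen in a small userspace sketch. sketch_strlcpy() is a hypothetical stand-in written for this example (it is not the kernel's strlcpy() source), and the "[heap]" literal is only an illustrative input:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for strlcpy(): copies at most size - 1 bytes,
 * always NUL-terminates when size > 0, and never pads the remainder.
 * Returns strlen(src).
 */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    assert(dst != NULL);
    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Counts how many bytes of buf[0..size) no longer hold the fill value. */
static size_t sketch_bytes_touched(const char *buf, size_t size, char fill)
{
    size_t touched = 0;

    for (size_t i = 0; i < size; i++)
        if (buf[i] != fill)
            touched++;
    return touched;
}
```

Pre-filling two 16-byte buffers with a marker byte and copying the same short literal into each shows strncpy() rewriting the whole buffer, while the strlcpy()-style copy only writes the string and its terminator.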
     
  • The arch_perf_output_copy_user() default of
    __copy_from_user_inatomic() returns bytes not copied, while all other
    argument functions given to DEFINE_OUTPUT_COPY() return bytes copied.

    Since copy_from_user_nmi() is the odd duck out by returning bytes
    copied where all other *copy_{to,from}* functions return bytes not
    copied, change it over and amend DEFINE_OUTPUT_COPY() to expect bytes
    not copied.

    Oddly enough DEFINE_OUTPUT_COPY() already returned bytes not copied
    while expecting its worker functions to return bytes copied.

    Signed-off-by: Peter Zijlstra
    Acked-by: will.deacon@arm.com
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20131030201622.GR16117@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
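
To make the "bytes not copied" convention from the commit above concrete, here is a small userspace sketch. Both functions and the `readable` parameter (which models a fault part-way through the source) are hypothetical; they are not the kernel helpers or the DEFINE_OUTPUT_COPY() macro itself:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical worker following the "bytes NOT copied" convention.
 * Only the first `readable` bytes of src can be read; the rest "faults".
 */
static size_t sketch_copy_not_copied(void *dst, const void *src,
                                     size_t n, size_t readable)
{
    size_t done = n < readable ? n : readable;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, done);
    return n - done;            /* 0 means complete success */
}

/*
 * Hypothetical caller: it expects "bytes not copied" from the worker,
 * derives how far the output advanced, and passes the remainder on.
 */
static size_t sketch_output_copy(void *dst, const void *src, size_t n,
                                 size_t readable, size_t *written)
{
    size_t left = sketch_copy_not_copied(dst, src, n, readable);

    *written = n - left;
    return left;
}
```

A worker that instead returned bytes copied would make such a caller compute the wrong `written` value, which is the mismatch the commit removes.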
     
  • Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Frederic Weisbecker
    Cc: Mathieu Desnoyers
    Cc: Michael Ellerman
    Cc: Michael Neuling
    Cc: "Paul E. McKenney"
    Cc: james.hogan@imgtec.com
    Cc: Vince Weaver
    Cc: Victor Kaplansky
    Cc: Oleg Nesterov
    Cc: Anton Blanchard
    Link: http://lkml.kernel.org/n/tip-9s5mze78gmlz19agt39i8rii@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Rewrite the handle address calculation code to be clearer.

    Saves 8 bytes on x86_64-defconfig.

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Frederic Weisbecker
    Cc: Mathieu Desnoyers
    Cc: Michael Ellerman
    Cc: Michael Neuling
    Cc: "Paul E. McKenney"
    Cc: james.hogan@imgtec.com
    Cc: Vince Weaver
    Cc: Victor Kaplansky
    Cc: Oleg Nesterov
    Cc: Anton Blanchard
    Link: http://lkml.kernel.org/n/tip-3trb2n2henb9m27tncef3ag7@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Avoid touching the lost_event and sample_data cachelines twice. It's
    not like we end up doing less work, but it might help to keep all
    accesses to these cachelines in one place.

    Due to code shuffle, this loses 4 bytes on x86_64-defconfig.

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Frederic Weisbecker
    Cc: Mathieu Desnoyers
    Cc: Michael Ellerman
    Cc: Michael Neuling
    Cc: "Paul E. McKenney"
    Cc: james.hogan@imgtec.com
    Cc: Vince Weaver
    Cc: Victor Kaplansky
    Cc: Oleg Nesterov
    Cc: Anton Blanchard
    Link: http://lkml.kernel.org/n/tip-zfxnc58qxj0eawdoj31hhupv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There's no point in re-doing the memory-barrier when we fail the
    cmpxchg(). Also placing it after the space reservation loop makes it
    clearer it only separates the userpage->tail read from the data
    stores.

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Frederic Weisbecker
    Cc: Mathieu Desnoyers
    Cc: Michael Ellerman
    Cc: Michael Neuling
    Cc: "Paul E. McKenney"
    Cc: james.hogan@imgtec.com
    Cc: Vince Weaver
    Cc: Victor Kaplansky
    Cc: Oleg Nesterov
    Cc: Anton Blanchard
    Link: http://lkml.kernel.org/n/tip-c19u6egfldyx86tpyc3zgkw9@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add unlikely() annotations to 'slow' paths:

    When having a sampling event but no output buffer; you have bigger
    issues -- also the bail is still faster than actually doing the work.

    When having a sampling event but a control page only buffer, you have
    bigger issues -- again the bail is still faster than actually doing
    work.

    Optimize for the case where you're not losing events -- again, not
    doing the work is still faster but make sure that when you have to
    actually do work it's as fast as possible.

    The typical watermark is 1/2 the buffer size, so most events will not
    take this path.

    Shrinks perf_output_begin() by 16 bytes on x86_64-defconfig.

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Frederic Weisbecker
    Cc: Mathieu Desnoyers
    Cc: Michael Ellerman
    Cc: Michael Neuling
    Cc: "Paul E. McKenney"
    Cc: james.hogan@imgtec.com
    Cc: Vince Weaver
    Cc: Victor Kaplansky
    Cc: Oleg Nesterov
    Cc: Anton Blanchard
    Link: http://lkml.kernel.org/n/tip-wlg3jew3qnutm8opd0hyeuwn@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
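
A rough sketch of the kind of annotation the commit above describes, assuming a GCC or Clang compiler that provides __builtin_expect(). struct sketch_buf, sketch_output() and the simplified watermark check are hypothetical and far simpler than perf_output_begin():

```c
#include <assert.h>
#include <stddef.h>

/* Branch hint in the style of the kernel's unlikely() (GCC/Clang builtin). */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

struct sketch_buf {
    size_t size;       /* data area size; 0 models a control-page-only buffer */
    size_t watermark;  /* wake the reader once this many bytes are used */
    size_t used;
    size_t lost;       /* events dropped for lack of space */
    size_t wakeups;
};

/* Returns 0 when len bytes were accounted for, -1 when the event is dropped. */
static int sketch_output(struct sketch_buf *buf, size_t len)
{
    size_t before;

    if (sketch_unlikely(buf == NULL))       /* no output buffer at all */
        return -1;
    if (sketch_unlikely(buf->size == 0))    /* control page only */
        return -1;

    assert(buf->used <= buf->size);
    if (sketch_unlikely(len > buf->size - buf->used)) {
        buf->lost++;                        /* losing events is the rare case */
        return -1;
    }

    before = buf->used;
    buf->used += len;
    /* Most events do not cross the watermark, so hint that as well. */
    if (sketch_unlikely(before < buf->watermark && buf->used >= buf->watermark))
        buf->wakeups++;
    return 0;
}
```

The hints do not change behavior; they only tell the compiler which side of each branch to lay out as the fall-through path.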
     
  • By using CIRC_SPACE() we can obviate the need for perf_output_space().

    Shrinks the size of perf_output_begin() by 17 bytes on
    x86_64-defconfig.

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Frederic Weisbecker
    Cc: Mathieu Desnoyers
    Cc: Michael Ellerman
    Cc: Michael Neuling
    Cc: "Paul E. McKenney"
    Cc: james.hogan@imgtec.com
    Cc: Vince Weaver
    Cc: Victor Kaplansky
    Cc: Oleg Nesterov
    Cc: Anton Blanchard
    Link: http://lkml.kernel.org/n/tip-vtb0xb0llebmsdlfn1v5vtfj@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
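
For readers unfamiliar with CIRC_SPACE(), the following userspace sketch reproduces the usual power-of-two circular buffer arithmetic. The SKETCH_ macro names and sketch_space() are local to this example, written after the well-known definitions in the kernel's circ_buf.h; how perf_output_begin() feeds its own head and tail values into the macro is not shown here:

```c
#include <assert.h>

/* Number of used slots; size must be a power of two. */
#define SKETCH_CIRC_CNT(head, tail, size) (((head) - (tail)) & ((size) - 1))

/* Number of free slots; one slot always stays empty to tell full from empty. */
#define SKETCH_CIRC_SPACE(head, tail, size) \
    SKETCH_CIRC_CNT((tail), ((head) + 1), (size))

/* Thin wrapper so the arithmetic can be exercised with unsigned values. */
static unsigned long sketch_space(unsigned long head, unsigned long tail,
                                  unsigned long size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return SKETCH_CIRC_SPACE(head, tail, size);
}
```

Because the subtraction is done on unsigned values and masked, the result stays correct after head has wrapped around past tail.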
     

04 Nov, 2013

1 commit


30 Oct, 2013

6 commits

  • uprobe_copy_process() does nothing if the child shares ->mm with
    the forking process, but there is a special case: CLONE_VFORK.
    In this case it would be more correct to do dup_utask() but avoid
    dup_xol(). This is not that important, the child should not unwind
    its stack too much, this can corrupt the parent's stack, but at
    least we need this to allow to ret-probe __vfork() itself.

    Note: in theory, it would be better to check task_pt_regs(p)->sp
    instead of CLONE_VFORK, we need to dup_utask() if and only if the
    child can return from the function called by the parent. But this
    needs the arch-dependent helper, and I think that nobody actually
    does clone(same_stack, CLONE_VM).

    Reported-by: Martin Cermak
    Reported-by: David Smith
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • This finally fixes the serious bug in uretprobes: a forked child
    crashes if the parent called fork() with the pending ret probe.

    Trivial test-case:

    # perf probe -x /lib/libc.so.6 __fork%return
    # perf record -e probe_libc:__fork perl -le 'fork || print "OK"'

    (the child doesn't print "OK", it is killed by SIGSEGV)

    If the child returns from the probed function it actually returns
    to trampoline_vaddr, because it got the copy of parent's stack
    mangled by prepare_uretprobe() when the parent entered this func.

    It crashes because a) this address is not mapped and b) until the
    previous change it doesn't have the proper ->return_instances info.

    This means that uprobe_copy_process() has to create xol_area which
    has the trampoline slot, and its vaddr should be equal to parent's
    xol_area->vaddr.

    Unfortunately, uprobe_copy_process() can not simply do
    __create_xol_area(child, xol_area->vaddr). This could actually work
    but perf_event_mmap() doesn't expect the usage of foreign ->mm. So
    we offload this to task_work_run(), and pass the argument via not
    yet used utask->vaddr.

    We know that this vaddr is fine for install_special_mapping(), the
    necessary hole was recently "created" by dup_mmap() which skips the
    parent's VM_DONTCOPY area, and nobody else could use the new mm.

    Unfortunately, this also means that we can not handle the errors
    properly, we obviously can not abort the already completed fork().
    So we simply print the warning if GFP_KERNEL allocation (the only
    possible reason) fails.

    Reported-by: Martin Cermak
    Reported-by: David Smith
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • uprobe_copy_process() assumes that the new child doesn't need
    ->utask, it should be allocated on demand.

    But this is not true if the forking task has the pending ret-
    probes, the child should report them as well and thus it needs
    the copy of parent's ->return_instances chain. Otherwise the
    child crashes when it returns from the probed function.

    Alternatively we could cleanup the child's stack, but this needs
    per-arch changes and this is not what we want. At least systemtap
    expects a .return in the child too.

    Note: this change alone doesn't fix the problem, see the next
    change.

    Reported-by: Martin Cermak
    Reported-by: David Smith
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Currently xol_add_vma() uses get_unmapped_area() for area->vaddr,
    but the next patches need to use the fixed address. So this patch
    adds the new "vaddr" argument to __create_xol_area() which should
    be used as area->vaddr if it is nonzero.

    xol_add_vma() doesn't bother to verify that the predefined addr is
    not used, insert_vm_struct() should fail if find_vma_links() detects
    the overlap with the existing vma.

    Also, __create_xol_area() doesn't need __GFP_ZERO to allocate area.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • No functional changes, preparation.

    Extract the code which actually allocates/installs the new area
    into the new helper, __create_xol_area().

    While at it remove the unnecessary "ret = ENOMEM" and "ret = 0"
    in xol_add_vma(), they both have no effect.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Preparation for the next patches.

    Move the callsite of uprobe_copy_process() in copy_process() down
    to the successful return. We do not care if copy_process() fails,
    uprobe_free_utask() won't be called in this case so the wrong
    ->utask != NULL doesn't matter.

    OTOH, with this change we know that copy_process() can't fail when
    uprobe_copy_process() is called, the new task should either return
    to user-mode or call do_exit(). This way uprobe_copy_process() can:

    1. setup p->utask != NULL if necessary

    2. setup uprobes_state.xol_area

    3. use task_work_add(p)

    Also, move the definition of uprobe_copy_process() down so that it
    can see get_utask().

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     

29 Oct, 2013

7 commits

  • Currently we only optimize the context switch between two
    contexts that have the same parent; this forgoes the
    optimization between parent and child context, even though these
    contexts could be equivalent too.

    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Adrian Hunter
    Cc: Shishkin, Alexander
    Link: http://lkml.kernel.org/r/20131007164257.GH3081@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
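
One way to picture the extended equivalence test from the commit above is the sketch below. struct sketch_ctx and its fields (parent_ctx, parent_gen, generation, pin_count) are assumptions made for this example, as is the rule that a pinned context never matches; the real perf_event_context carries much more state:

```c
#include <assert.h>
#include <stddef.h>

struct sketch_ctx {
    struct sketch_ctx *parent_ctx;  /* context this one was cloned from */
    unsigned long parent_gen;       /* parent's generation at clone time */
    unsigned long generation;       /* bumped when the event set changes */
    int pin_count;                  /* assumed: pinned contexts never match */
};

/* Returns 1 when the two contexts can be treated as equivalent. */
static int sketch_context_equiv(const struct sketch_ctx *c1,
                                const struct sketch_ctx *c2)
{
    assert(c1 != NULL && c2 != NULL);

    if (c1->pin_count || c2->pin_count)
        return 0;

    /* c1 is the parent of c2 and has not changed since the clone. */
    if (c1 == c2->parent_ctx && c1->generation == c2->parent_gen)
        return 1;

    /* c2 is the parent of c1 and has not changed since the clone. */
    if (c1->parent_ctx == c2 && c1->parent_gen == c2->generation)
        return 1;

    /* Siblings: both were cloned from the same parent generation. */
    if (c1->parent_ctx && c1->parent_ctx == c2->parent_ctx &&
        c1->parent_gen == c2->parent_gen)
        return 1;

    return 0;
}
```

The first two checks are the parent/child cases the commit adds; the last one is the same-parent case that was already optimized.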
     
  • Oleg complained about the excessive 0-ing in perf_event_mmap_event(),
    so try and be smarter about it while keeping it fairly fool proof and
    avoid leaking random bits out to userspace.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-8jirlm99m6if2z13wd6rbyu6@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • perf_event_mmap_event() does kzalloc(PATH_MAX + sizeof(u64)) to
    ensure we can align the size later. However this means that we
    actually allocate PAGE_SIZE * 2 buffer, seems too much.

    Change this code to allocate PATH_MAX==PAGE_SIZE bytes, but tell
    d_path() to not use the last sizeof(u64) bytes.

    Note: it is not clear why we need __GFP_ZERO, see the next patch.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131016201004.GC23214@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
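
The sizing argument in the commit above (reserve sizeof(u64) bytes so the later alignment always fits) can be checked with a small userspace sketch. sketch_store_name() and SKETCH_ALIGN() are hypothetical helpers; a plain memcpy() replaces d_path(), and the buffer capacity is a parameter rather than PATH_MAX:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define SKETCH_ALIGN(x, a) \
    (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

/*
 * Store name in buf (capacity cap) while keeping the last sizeof(uint64_t)
 * bytes off limits for the name itself, so that rounding the length up to
 * a u64 multiple and zero-padding can never run past the buffer.
 * Returns the padded size, or 0 if the name does not fit.
 */
static size_t sketch_store_name(char *buf, size_t cap, const char *name)
{
    size_t len = strlen(name) + 1;      /* include the NUL terminator */
    size_t padded;

    assert(buf != NULL && cap >= sizeof(uint64_t));
    if (len > cap - sizeof(uint64_t))
        return 0;

    memcpy(buf, name, len);
    padded = SKETCH_ALIGN(len, sizeof(uint64_t));
    memset(buf + len, 0, padded - len); /* only the padding is zeroed */
    return padded;
}
```

Since the padding adds at most sizeof(uint64_t) - 1 bytes, a name limited to cap - sizeof(uint64_t) bytes always leaves enough room.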
     
  • 1. perf_event_mmap(vma) is never called with a gate_vma-like arg,
    remove the "if (!vma->vm_mm)" code.

    2. arch_vma_name() can use the cached value of mmap_event->vma.

    3. Change the code to not call arch_vma_name() twice.

    4. Purely cosmetic, but since we use "goto got_name" all the time
    remove "else" from "[stack]" branch just for symmetry.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131016200945.GB23214@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • There's nothing atomic about atomic_set vs atomic_read; so remove the
    atomic_t usage.

    Also, make running_sample_length static as it really is (and should
    be) local to this translation unit.

    Signed-off-by: Peter Zijlstra
    Cc: eranian@google.com
    Cc: Don Zickus
    Cc: jmario@redhat.com
    Cc: acme@infradead.org
    Cc: dave.hansen@linux.intel.com
    Link: http://lkml.kernel.org/n/tip-vw9lg588x1ic248whybjon0c@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The PPC64 people noticed a missing memory barrier and crufty old
    comments in the perf ring buffer code. So update all the comments and
    add the missing barrier.

    When the architecture implements local_t using atomic_long_t there
    will be double barriers issued; but short of introducing more
    conditional barrier primitives this is the best we can do.

    Reported-by: Victor Kaplansky
    Tested-by: Victor Kaplansky
    Signed-off-by: Peter Zijlstra
    Cc: Mathieu Desnoyers
    Cc: michael@ellerman.id.au
    Cc: Paul McKenney
    Cc: Michael Neuling
    Cc: Frederic Weisbecker
    Cc: anton@samba.org
    Cc: benh@kernel.crashing.org
    Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Conflicts:
    tools/perf/builtin-record.c
    tools/perf/builtin-top.c
    tools/perf/util/hist.h

    Ingo Molnar
     

20 Oct, 2013

1 commit


18 Oct, 2013

1 commit

  • For now, we disable the extended MMAP record support (MMAP2).

    We have identified cases where it would not report the correct mapping
    information: clone(CLONE_VM) but with separate pids. We will revisit
    the support once we find a solution for this case.

    The patch changes the kernel to return EINVAL if attr->mmap2 is set. The
    patch also modifies the perf tool to use regular PERF_RECORD_MMAP for
    synthetic events and it also prevents the tool from requesting
    attr->mmap2 mode because the kernel would reject it.

    The support will be revisited once the kernel interface is updated.

    In V2, we reduce the patch to the strict minimum.

    In V3, we avoid calling perf_event_open() with mmap2 set because we know
    it will fail and require fallback retry.

    Signed-off-by: Stephane Eranian
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20131017173215.GA8820@quad
    Signed-off-by: Arnaldo Carvalho de Melo

    Stephane Eranian
     

04 Oct, 2013

3 commits

  • Add a generic qualifier for transaction events, as a new sample
    type that returns a flag word. This is particularly useful
    for qualifying aborts: to distinguish aborts which happen
    due to asynchronous events (like conflicts caused by another
    CPU) versus instructions that lead to an abort.

    The tuning strategies are very different for those cases,
    so it's important to distinguish them easily and early.

    Since it's inconvenient and inflexible to filter for this
    in the kernel we report all the events out and allow
    some post processing in user space.

    The flags are based on the Intel TSX events, but should be fairly
    generic and mostly applicable to other HTM architectures too. In addition
    to various flag words there's also reserved space to report a
    program-supplied abort code. For TSX this is used to distinguish specific
    classes of aborts, like a lock busy abort when doing lock elision.

    Flags:

    Elision and generic transactions (ELISION vs TRANSACTION)
    (HLE vs RTM on TSX; IBM etc. would likely only use TRANSACTION)
    Aborts caused by current thread vs aborts caused by others (SYNC vs ASYNC)
    Retryable transaction (RETRY)
    Conflicts with other threads (CONFLICT)
    Transaction write capacity overflow (CAPACITY WRITE)
    Transaction read capacity overflow (CAPACITY READ)

    Transactions implicitly aborted can also return an abort code.
    This can be used to signal specific events to the profiler. A common
    case is abort on lock busy in an RTM eliding library (code 0xff).
    To handle this case we include the TSX abort code.

    Common example aborts in TSX would be:

    - Data conflict with another thread on memory read.
    Flags: TRANSACTION|ASYNC|CONFLICT
    - executing a WRMSR in a transaction. Flags: TRANSACTION|SYNC
    - HLE transaction in user space is too large
    Flags: ELISION|SYNC|CAPACITY-WRITE

    The only flag that is somewhat TSX specific is ELISION.

    This adds the perf core glue needed for reporting the new flag word out.

    v2: Add MEM/MISC
    v3: Move transaction to the end
    v4: Separate capacity-read/write and remove misc
    v5: Remove _SAMPLE. Move abort flags to 32bit. Rename
    transaction to txn
    Signed-off-by: Andi Kleen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1379688044-14173-2-git-send-email-andi@firstfloor.org
    Signed-off-by: Ingo Molnar

    Andi Kleen
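
A post-processing tool could decode such a flag word along the following lines. The SKETCH_TXN_* bit positions are assumptions chosen for this example, with the abort code placed in the upper 32 bits as the v5 changelog note suggests; consult the kernel's perf_event.h for the real definitions. The helper functions are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag layout, for illustration only. */
#define SKETCH_TXN_ELISION        (1ULL << 0)
#define SKETCH_TXN_TRANSACTION    (1ULL << 1)
#define SKETCH_TXN_SYNC           (1ULL << 2)
#define SKETCH_TXN_ASYNC          (1ULL << 3)
#define SKETCH_TXN_RETRY          (1ULL << 4)
#define SKETCH_TXN_CONFLICT       (1ULL << 5)
#define SKETCH_TXN_CAPACITY_WRITE (1ULL << 6)
#define SKETCH_TXN_CAPACITY_READ  (1ULL << 7)
#define SKETCH_TXN_ABORT_SHIFT    32
#define SKETCH_TXN_ABORT_MASK     (0xffffffffULL << SKETCH_TXN_ABORT_SHIFT)

/* Combine qualifier flags with a program-supplied abort code. */
static uint64_t sketch_txn_make(uint64_t flags, uint32_t abort_code)
{
    assert((flags & SKETCH_TXN_ABORT_MASK) == 0);
    return flags | ((uint64_t)abort_code << SKETCH_TXN_ABORT_SHIFT);
}

/* Extract the abort code from the upper half of the word. */
static uint32_t sketch_txn_abort_code(uint64_t txn)
{
    return (uint32_t)((txn & SKETCH_TXN_ABORT_MASK) >> SKETCH_TXN_ABORT_SHIFT);
}

/* True for aborts caused by another CPU touching the same data. */
static int sketch_txn_is_async_conflict(uint64_t txn)
{
    uint64_t want = SKETCH_TXN_ASYNC | SKETCH_TXN_CONFLICT;

    return (txn & want) == want;
}
```

Separating asynchronous conflicts from synchronous aborts this way is the distinction the commit message calls out as important for tuning.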
     
  • /proc/sys/kernel/perf_event_max_sample_rate will accept
    negative values as well as 0.

    Negative values are unreasonable, and 0 causes a
    divide by zero exception in perf_proc_update_handler.

    This patch enforces a lower limit of 1.

    Signed-off-by: Knut Petersen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
    Signed-off-by: Ingo Molnar

    Knut Petersen
     
  • While auditing the list_entry usage due to a trinity bug I found that
    perf_pmu_migrate_context violates the rules for
    perf_event::event_entry.

    The problem is that perf_event::event_entry is a RCU list element, and
    hence we must wait for a full RCU grace period before re-using the
    element after deletion.

    Therefore the usage in perf_pmu_migrate_context() which re-uses the
    entry immediately is broken. For now introduce another list_head into
    perf_event for this specific usage.

    This doesn't actually fix the trinity report because that never goes
    through this code.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

30 Sep, 2013

1 commit


27 Sep, 2013

1 commit


20 Sep, 2013

1 commit

  • Solve the problems around the broken definition of perf_event_mmap_page::
    cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
    fixed by:

    860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

    The problem with the fix (merged in v3.12-rc1 and not yet released
    officially), noticed by Vince Weaver, is that the new behavior is
    not detectable by new user-space, and that due to the reuse of the
    field names it's easy to mis-compile a binary if old headers are used
    on a new kernel or new headers are used on an old kernel.

    To solve all that, make this change explicit, detectable and self-contained,
    by iterating the ABI the following way:

    - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
    confuse old user-space binaries. RDPMC will be marked as unavailable
    to old binaries but that's within the ABI, this is a capability bit.

    - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
    libraries can reliably detect that bit 0 is deprecated and perma-zero
    without having to check the kernel version.

    - Use bits 2, 3, 4 for the newly defined, correct functionality:

    cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
    cap_user_time : 1, /* The time_* fields are used */
    cap_user_time_zero : 1, /* The time_zero field is used */

    - Rename all the bitfield names in perf_event.h to be different from the
    old names, to make sure it's not possible to mis-compile it
    accidentally with old assumptions.

    The 'size' field can then be used in the future to add new fields and it
    will act as a natural ABI version indicator as well.
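
    A user-space library could use the new bits roughly as in the sketch
    below. It works on the raw 64-bit capabilities word and assumes bit 0
    is the least significant bit of that word; the CAP_* macros, struct
    user_caps and decode_caps() are names invented for this example, and
    treating the old layout as "nothing available" is a conservative choice
    of the sketch, not something the ABI mandates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described above (macro names are local to this sketch). */
#define CAP_BIT0               (1ULL << 0) /* always 0 on new kernels */
#define CAP_BIT0_IS_DEPRECATED (1ULL << 1) /* always 1 on new kernels */
#define CAP_USER_RDPMC         (1ULL << 2)
#define CAP_USER_TIME          (1ULL << 3)
#define CAP_USER_TIME_ZERO     (1ULL << 4)

struct user_caps {
    bool rdpmc;
    bool time;
    bool time_zero;
};

/*
 * Decode the capabilities word. Bits 2..4 are only trusted when bit 1
 * says that bit 0 is deprecated, i.e. when the kernel uses the new layout.
 */
static struct user_caps decode_caps(uint64_t capabilities)
{
    struct user_caps caps = { false, false, false };

    if (!(capabilities & CAP_BIT0_IS_DEPRECATED))
        return caps; /* old layout: bit 0 is ambiguous, report nothing */

    caps.rdpmc     = (capabilities & CAP_USER_RDPMC) != 0;
    caps.time      = (capabilities & CAP_USER_TIME) != 0;
    caps.time_zero = (capabilities & CAP_USER_TIME_ZERO) != 0;
    return caps;
}
```

    The point of the check on bit 1 is that no kernel version test is
    needed: the page itself says which layout it uses.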

    Also adjust tools/perf/ userspace for the new definitions, noticed by
    Adrian Hunter.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Also-Fixed-by: Adrian Hunter
    Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Sep, 2013

1 commit

  • Currently utask->depth is simply the number of allocated/pending
    return_instance's in uprobe_task->return_instances list.

    handle_trampoline() should decrement this counter every time we
    handle/free an instance, but due to a typo it does this only if
    ->chained == T. This means that in the likely case this counter
    is never decremented and the probed task can't report more than
    MAX_URETPROBE_DEPTH events.
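
    The intended accounting can be modelled in a few lines. This is a
    simplified user-space sketch, not the kernel code: struct ret_instance,
    struct task_model and both functions are hypothetical, and the list is
    treated as a plain stack with the newest instance at the head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a pending return instance. */
struct ret_instance {
    bool chained;
    struct ret_instance *next;
};

/* Simplified stand-in for the per-task state. */
struct task_model {
    struct ret_instance *head; /* newest instance first */
    int depth;                 /* number of pending instances */
};

/* Queue a new instance and account for it. */
static void push_instance(struct task_model *t, struct ret_instance *ri,
                          bool chained)
{
    ri->chained = chained;
    ri->next = t->head;
    t->head = ri;
    t->depth++;
}

/*
 * Consume instances up to and including the first non-chained one.
 * The counter is decremented for every consumed instance, regardless of
 * the chained flag. Returns the number of instances handled.
 */
static int handle_trampoline_model(struct task_model *t)
{
    int handled = 0;

    while (t->head) {
        struct ret_instance *ri = t->head;
        bool chained = ri->chained;

        t->head = ri->next;
        t->depth--; /* unconditional: every freed instance counts */
        handled++;

        if (!chained)
            break;
    }
    return handled;
}
```

    If the decrement were placed after the chained check instead, the common
    non-chained path would leave depth permanently incremented, which is the
    behaviour described above.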

    Reported-by: Mikhail Kulemin
    Reported-by: Hemant Kumar Shaw
    Signed-off-by: Oleg Nesterov
    Acked-by: Anton Arapov
    Cc: masami.hiramatsu.pt@hitachi.com
    Cc: srikar@linux.vnet.ibm.com
    Cc: systemtap@sourceware.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

11 Sep, 2013

1 commit

  • The ino_generation field was added in the PERF_RECORD_MMAP2 record in
    the 13d7a24 cset but no space for it was allocated, corrupting the
    PERF_FORMAT_{TIME,CPU,TID,etc} area (sample_type/sample_id_all), fix it.
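
    To see why a missing 8 bytes shifts everything that follows, consider
    this rough size model of the record. The struct and helper names are
    made up for the sketch, and the assumption that the filename is padded
    to a multiple of 8 bytes is stated here explicitly; the sample_id
    trailer is passed in as an opaque length.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the fixed-size part of the record (local to this sketch). */
struct mmap2_fixed_model {
    uint32_t type; /* header */
    uint16_t misc;
    uint16_t size;
    uint32_t pid;
    uint32_t tid;
    uint64_t addr;
    uint64_t len;
    uint64_t pgoff;
    uint32_t maj;
    uint32_t min;
    uint64_t ino;
    uint64_t ino_generation;
};

/* Round n up to the next multiple of 8. */
static size_t align8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/*
 * Bytes needed for one record. With count_ino_generation == 0 the result
 * models the buggy reservation, which comes up 8 bytes short, so the
 * trailing sample_id fields overlap the end of the payload.
 */
static size_t mmap2_size_model(size_t filename_len, size_t sample_id_len,
                               int count_ino_generation)
{
    size_t fixed = sizeof(struct mmap2_fixed_model);

    if (!count_ino_generation)
        fixed -= sizeof(uint64_t);

    /* filename is NUL terminated and assumed padded to 8 bytes */
    return fixed + align8(filename_len + 1) + sample_id_len;
}
```

    A reader that trusts the reserved size would then pick up pieces of the
    filename where it expects the pid, tid, time and cpu values.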

    Detected with one of the regression tests done by 'perf test':

    [root@sandy ~]# perf test -v 7
    7: Validate PERF_RECORD_* events & perf_sample fields :
    --- start ---
    61315294449606 0 PERF_RECORD_SAMPLE
    61315294453161 0 PERF_RECORD_SAMPLE
    61315294454441 0 PERF_RECORD_SAMPLE
    61315294455709 0 PERF_RECORD_SAMPLE
    61315295600899 0 PERF_RECORD_COMM: sleep:6500
    27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep
    MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500
    MMAP2 with unexpected cpu, expected 0, got 342521613
    MMAP2 with unexpected pid, expected 6500, got 1701606191
    MMAP2 with unexpected tid, expected 6500, got 28773
    27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342561333
    MMAP2 with unexpected pid, expected 6500, got 1932408369
    MMAP2 with unexpected tid, expected 6500, got 111
    27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso]
    MMAP2 with unexpected cpu, expected 0, got 342600095
    MMAP2 with unexpected pid, expected 6500, got 1935963739
    MMAP2 with unexpected tid, expected 6500, got 23919
    27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342882834
    MMAP2 with unexpected pid, expected 6500, got 909192754
    MMAP2 with unexpected tid, expected 6500, got 7303982
    61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500)
    ---- end ----
    Validate PERF_RECORD_* events & perf_sample fields: FAILED!
    [root@sandy ~]#

    After this patch:

    [root@sandy ~]# perf test 7
    7: Validate PERF_RECORD_* events & perf_sample fields : Ok
    [root@sandy ~]#

    Acked-by: Peter Zijlstra
    Acked-by: Stephane Eranian
    Cc: Adrian Hunter
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

04 Sep, 2013

1 commit

  • …rnel.org/pub/scm/linux/kernel/git/tip/tip

    Pull perf changes from Ingo Molnar:
    "As a first remark I'd like to point out that the obsolete '-f'
    (--force) option, which has not done anything for several releases,
    has been removed from 'perf record' and related utilities. Everyone
    please update muscle memory accordingly! :-)

    Main changes on the perf kernel side:

    - Performance optimizations:
    . for trace events, by Steve Rostedt.
    . for time values, by Peter Zijlstra

    - New hardware support:
    . for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
    . for Intel SNB-EP uncore PMUs, by Zheng Yan

    - Enhanced hardware support:
    . for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan

    - Core perf events code enhancements and fixes:
    . for full-nohz feature handling, by Frederic Weisbecker
    . for group events, by Jiri Olsa
    . for call chains, by Frederic Weisbecker
    . for event stream parsing, by Adrian Hunter

    - New ABI details:
    . Add attr->mmap2 attribute, by Stephane Eranian
    . Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
    . Export u64 time_zero on the mmap header page to allow TSC
    calculation, by Adrian Hunter
    . Add dummy software event, by Adrian Hunter.
    . Add a new PERF_SAMPLE_IDENTIFIER to make samples always
    parseable, by Adrian Hunter.
    . Make Power7 events available via sysfs, by Runzhen Wang.

    - Code cleanups and refactorings:
    . for nohz-full, by Frederic Weisbecker
    . for group events, by Jiri Olsa

    - Documentation updates:
    . for perf_event_type, by Peter Zijlstra

    Main changes on the perf tooling side (some of these tooling changes
    utilize the above kernel side changes):

    - Lots of 'perf trace' enhancements:

    . Make 'perf trace' command line arguments consistent with
    'perf record', by David Ahern.

    . Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.

    . Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.

    . Support ! in -e expressions, to filter a list of syscalls,
    by Arnaldo Carvalho de Melo.

    . Arg formatting improvements to allow masking arguments in
    syscalls such as futex and open, where some arguments are
    ignored and thus should not be printed depending on other args,
    by Arnaldo Carvalho de Melo.

    . Beautify open, openat, open_by_handle_at, lseek and futex
    syscalls, by Arnaldo Carvalho de Melo.

    . Add option to analyze events in a file versus live, so that
    one can do:

    [root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
    [ perf record: Woken up 0 times to write data ]
    [ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
    [root@zoo ~]# perf trace -i perf.data -e futex --duration 1
    17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
    113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
    133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
    [root@zoo ~]#

    By David Ahern.

    . Honor target pid / tid options when analyzing a file, by David Ahern.

    . Introduce better formatting of syscall arguments, including so
    far beautifiers for mmap, madvise, syscall return values,
    by Arnaldo Carvalho de Melo.

    . Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.

    - 'perf report/top' enhancements:

    . Do annotation using /proc/kcore and /proc/kallsyms when
    available, removing the forced need for a vmlinux file for kernel
    assembly annotation. This also improves this use case because
    vmlinux has just the initial kernel image, not what is actually
    in use after various code patchings by things like alternatives.
    By Adrian Hunter.

    . Add --ignore-callees=<regex> option to collapse undesired parts
    of call graphs, by Greg Price.

    . Simplify symbol filtering by doing it at machine class level,
    by Adrian Hunter.

    . Add support for callchains in the gtk UI, by Namhyung Kim.

    . Add --objdump option to 'perf top', by Sukadev Bhattiprolu.

    - 'perf kvm' enhancements:

    . Add option to print only events that exceed a specified time
    duration, by David Ahern.

    . Improve stack trace printing, by David Ahern.

    . Update documentation of the live command, by David Ahern

    . Add perf kvm stat live mode that combines aspects of 'perf kvm
    stat' record and report, by David Ahern.

    . Add option to analyze specific VM in perf kvm stat report, by
    David Ahern.

    . Do not require /lib/modules/* on a guest, by Jason Wessel.

    - 'perf script' enhancements:

    . Fix symbol offset computation for some dsos, by David Ahern.

    . Fix named threads support, by David Ahern.

    . Don't install scripting files when perl/python support
    is disabled, by Arnaldo Carvalho de Melo.

    - 'perf test' enhancements:

    . Add various improvements and fixes to the "vmlinux matches
    kallsyms" 'perf test' entry, related to the /proc/kcore
    annotation feature. By Adrian Hunter.

    . Add sample parsing test, by Adrian Hunter.

    . Add test for reading object code, by Adrian Hunter.

    . Add attr record group sampling test, by Jiri Olsa.

    . Misc testing infrastructure improvements and other details,
    by Jiri Olsa.

    - 'perf list' enhancements:

    . Skip unsupported hardware events, by Namhyung Kim.

    . List pmu events, by Andi Kleen.

    - 'perf diff' enhancements:

    . Add support for more than two files comparison, by Jiri Olsa.

    - 'perf sched' enhancements:

    . Various improvements, including removing reliance on some
    scheduler tracepoints that provide the same information as the
    PERF_RECORD_{FORK,EXIT} events. By David Ahern.

    . Remove odd build stall by moving a large struct initialization
    from a local variable to a global one, by Namhyung Kim.

    - 'perf stat' enhancements:

    . Add --initial-delay option to skip measuring for a defined
    startup phase, by Andi Kleen.

    - Generic perf tooling infrastructure/plumbing changes:

    . Tidy up sample parsing validation, by Adrian Hunter.

    . Fix up jobserver setup in libtraceevent Makefile,
    by Arnaldo Carvalho de Melo.

    . Debug improvements, by Adrian Hunter.

    . Fix correlation of samples coming after PERF_RECORD_EXIT event,
    by David Ahern.

    . Improve robustness of the topology parsing code,
    by Stephane Eranian.

    . Add group leader sampling, that allows just one event in a group
    to sample while the other events have just their values read,
    by Jiri Olsa.

    . Add support for a new modifier "D", which requests that the
    event, or group of events, be pinned to the PMU.
    By Michael Ellerman.

    . Support callchain sorting based on addresses, by Andi Kleen

    . Prep work for multi perf data file storage, by Jiri Olsa.

    . libtraceevent cleanups, by Namhyung Kim.

    And lots and lots of other fixes and code reorganizations that did not
    make it into the list, see the shortlog, diffstat and the Git log for
    details!"

    [ Also merge a leftover from the 3.11 cycle ]

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Prevent race in unthrottling code

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits)
    perf trace: Tell arg formatters the arg index
    perf trace: Add beautifier for open's flags arg
    perf trace: Add beautifier for lseek's whence arg
    perf tools: Fix symbol offset computation for some dsos
    perf list: Skip unsupported events
    perf tests: Add 'keep tracking' test
    perf tools: Add support for PERF_COUNT_SW_DUMMY
    perf: Add a dummy software event to keep tracking
    perf trace: Add beautifier for futex 'operation' parm
    perf trace: Allow syscall arg formatters to mask args
    perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()
    perf: Export struct perf_branch_entry to userspace
    perf: Add attr->mmap2 attribute to an event
    perf/x86: Add Silvermont (22nm Atom) support
    perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
    perf trace: Handle missing HUGEPAGE defines
    perf trace: Honor target pid / tid options when analyzing a file
    perf trace: Add option to analyze events in a file versus live
    perf evlist: Add tracepoint lookup by name
    perf tests: Add a sample parsing test
    ...

    Linus Torvalds