09 Aug, 2014

1 commit

  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36%  cat      [kernel.kallsyms]  [k] copy_user_generic_string
    13.31%  cat      [kernel.kallsyms]  [k] memset
    11.48%  cat      [kernel.kallsyms]  [k] do_mpage_readpage
     4.23%  cat      [kernel.kallsyms]  [k] get_page_from_freelist
     2.38%  cat      [kernel.kallsyms]  [k] put_page
     2.32%  cat      [kernel.kallsyms]  [k] __mem_cgroup_commit_charge
     2.18%  kswapd0  [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common
     1.92%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
     1.86%  cat      [kernel.kallsyms]  [k] __radix_tree_lookup
     1.62%  cat      [kernel.kallsyms]  [k] __pagevec_lru_add_fn

    After:

    15.67%  cat      [kernel.kallsyms]  [k] copy_user_generic_string
    13.48%  cat      [kernel.kallsyms]  [k] memset
    11.42%  cat      [kernel.kallsyms]  [k] do_mpage_readpage
     3.98%  cat      [kernel.kallsyms]  [k] get_page_from_freelist
     2.46%  cat      [kernel.kallsyms]  [k] put_page
     2.13%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
     1.88%  cat      [kernel.kallsyms]  [k] __radix_tree_lookup
     1.67%  cat      [kernel.kallsyms]  [k] __pagevec_lru_add_fn
     1.39%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
     1.30%  cat      [kernel.kallsyms]  [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

     text   data  bss    dec   hex  filename
    37970   9892  400  48262  bc86  mm/memcontrol.o.old
    35239   9892  400  45531  b1db  mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

28 Jul, 2014

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Pull perf fixes from Thomas Gleixner:
    "A bunch of fixes for perf and kprobes:
    - revert a commit that caused a perf group regression
    - silence dmesg spam
    - fix kprobe probing errors on ia64 and ppc64
    - filter kprobe faults from userspace
    - lockdep fix for perf exit path
    - prevent perf #GP in KVM guest
    - correct perf event and filters"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
    kprobes/x86: Don't try to resolve kprobe faults from userspace
    perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
    perf/x86/intel: Protect LBR and extra_regs against KVM lying
    perf: Fix lockdep warning on process exit
    perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
    perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
    perf: Revert ("perf: Always destroy groups on exit")

    Linus Torvalds
     

16 Jul, 2014

3 commits

  • Commit 78d683e838a6 ("mm, fs: Add vm_ops->name as an alternative
    to arch_vma_name") added another way to get the mmap name.

    The vdso vma mapping has already switched to this, so we no
    longer get the vdso name via the arch_vma_name function. Add
    this method to the perf mmap event name retrieval code.

    Caught this via perf test:

    $ sudo ./perf test -v 7
    7: Validate PERF_RECORD_* events & perf_sample fields :
    --- start ---

    SNIP

    PERF_RECORD_MMAP for [vdso] missing!
    test child finished with 255
    ---- end ----
    Validate PERF_RECORD_* events & perf_sample fields: FAILED!

    Signed-off-by: Jiri Olsa
    Acked-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1405353439-14211-1-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Sasha Levin reported:

    > While fuzzing with trinity inside a KVM tools guest running the latest -next
    > kernel I've stumbled on the following spew:
    >
    > ======================================================
    > [ INFO: possible circular locking dependency detected ]
    > 3.15.0-next-20140613-sasha-00026-g6dd125d-dirty #654 Not tainted
    > -------------------------------------------------------
    > trinity-c578/9725 is trying to acquire lock:
    > (&(&pool->lock)->rlock){-.-...}, at: __queue_work (kernel/workqueue.c:1346)
    >
    > but task is already holding lock:
    > (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
    >
    > which lock already depends on the new lock.

    > 1 lock held by trinity-c578/9725:
    > #0: (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
    >
    > Call Trace:
    > dump_stack (lib/dump_stack.c:52)
    > print_circular_bug (kernel/locking/lockdep.c:1216)
    > __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
    > lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    > _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
    > __queue_work (kernel/workqueue.c:1346)
    > queue_work_on (kernel/workqueue.c:1424)
    > free_object (lib/debugobjects.c:209)
    > __debug_check_no_obj_freed (lib/debugobjects.c:715)
    > debug_check_no_obj_freed (lib/debugobjects.c:727)
    > kmem_cache_free (mm/slub.c:2683 mm/slub.c:2711)
    > free_task (kernel/fork.c:221)
    > __put_task_struct (kernel/fork.c:250)
    > put_ctx (include/linux/sched.h:1855 kernel/events/core.c:898)
    > perf_event_exit_task (kernel/events/core.c:907 kernel/events/core.c:7478 kernel/events/core.c:7533)
    > do_exit (kernel/exit.c:766)
    > do_group_exit (kernel/exit.c:884)
    > get_signal_to_deliver (kernel/signal.c:2347)
    > do_signal (arch/x86/kernel/signal.c:698)
    > do_notify_resume (arch/x86/kernel/signal.c:751)
    > int_signal (arch/x86/kernel/entry_64.S:600)

    Urgh.. so the only way I can make that happen is through:

    perf_event_exit_task_context()
      raw_spin_lock(&child_ctx->lock);
      unclone_ctx(child_ctx)
        put_ctx(ctx->parent_ctx);
      raw_spin_unlock_irqrestore(&child_ctx->lock);

    And we can avoid this by doing the change below.

    I can't immediately see how this changed recently, but given
    that you say it's easy to reproduce, let's fix this.

    Reported-by: Sasha Levin
    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Dave Jones
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140623141242.GB19860@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Vince reported that commit 15a2d4de0eab5 ("perf: Always destroy groups
    on exit") causes a regression with grouped events. In particular his
    read_group_attached.c test fails.

    https://github.com/deater/perf_event_tests/blob/master/tests/bugs/read_group_attached.c

    Because of the context switch optimization in
    perf_event_context_sched_out() the 'original' event may end up in the
    child process and when that exits the change in the patch in question
    destroys the actual grouping.

    Therefore revert that change and only destroy inherited groups.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-zedy3uktcp753q8fw8dagx7a@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Jul, 2014

1 commit

  • …it/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code
    where running:

    # perf probe -x /lib/libc.so.6 syscall
    # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
    # perf record -e probe_libc:syscall whatever

    kills the uprobe. Along the way he found some other minor bugs and
    clean ups that he fixed up making it a total of 4 patches.

    Doing unrelated work, I found that the reading of the ftrace trace
    file disables all function tracer callbacks. This was fine when
    ftrace was the only user, but now that it's used by perf and kprobes,
    this is a bug where reading trace can disable kprobes and perf. A
    very unexpected side effect and should be fixed"

    * tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Remove ftrace_stop/start() from reading the trace file
    tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
    tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
    uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
    tracing/uprobes: Revert "Support mix of ftrace and perf"

    Linus Torvalds
     

02 Jul, 2014

1 commit

  • The context check in perf_event_context_sched_out() allows a
    non-cloned context to be part of the optimized schedule-out
    switch.

    This could move a non-cloned context into another workload's
    child. Once this child exits, the context is closed and leaves
    all original (parent) events in a closed state.

    Any new cloned event will inherit the closed state and not
    measure anything, probably causing other odd bugs as well.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Frederic Weisbecker
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1403598026-2310-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

01 Jul, 2014

1 commit

  • Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure
    that nobody tries to play with the dead uprobe/consumer. This helps
    to catch the bugs like the one fixed by the previous patch.

    In the longer term we should fix this poorly designed interface.
    uprobe_register() should return "struct uprobe *" which should be
    passed to apply/unregister. Plus other semantic changes, see the
    changelog in commit 41ccba029e94.

    Link: http://lkml.kernel.org/p/20140627170140.GA18322@redhat.com

    Acked-by: Namhyung Kim
    Acked-by: Srikar Dronamraju
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt

    Oleg Nesterov
     

13 Jun, 2014

1 commit

  • Pull more perf updates from Ingo Molnar:
    "A second round of perf updates:

    - wide reaching kprobes sanitization and robustization, with the hope
    of fixing all 'probe this function crashes the kernel' bugs, by
    Masami Hiramatsu.

    - uprobes updates from Oleg Nesterov: tmpfs support, corner case
    fixes and robustization work.

    - perf tooling updates and fixes from Jiri Olsa, Namhyung Kim, Arnaldo
    et al:
    * Add support to accumulate hist periods (Namhyung Kim)
    * various fixes, refactorings and enhancements"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    perf: Differentiate exec() and non-exec() comm events
    perf: Fix perf_event_comm() vs. exec() assumption
    uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
    perf/documentation: Add description for conditional branch filter
    perf/x86: Add conditional branch filtering support
    perf/tool: Add conditional branch filter 'cond' to perf record
    perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
    uprobes: Teach copy_insn() to support tmpfs
    uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
    perf/x86: Use common PMU interrupt disabled code
    perf/ARM: Use common PMU interrupt disabled code
    perf: Disable sampled events if no PMU interrupt
    perf: Fix use after free in perf_remove_from_context()
    perf tools: Fix 'make help' message error
    perf record: Fix poll return value propagation
    perf tools: Move elide bool into perf_hpp_fmt struct
    perf tools: Remove elide setup for SORT_MODE__MEMORY mode
    perf tools: Fix "==" into "=" in ui_browser__warning assignment
    perf tools: Allow overriding sysfs and proc finding with env var
    perf tools: Consider header files outside perf directory in tags target
    ...

    Linus Torvalds
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

09 Jun, 2014

2 commits

  • This reverts commit 3090ffb5a2515990182f3f55b0688a7817325488.

    Re-enable the mmap2 interface as we will have a user soon.

    Since things have changed since perf disabled mmap2, small tweaks
    to the revert had to be done:

    o commit 9d4ecc88 forced (n!=8) to become (n
    Link: http://lkml.kernel.org/r/1401461382-209586-1-git-send-email-dzickus@redhat.com
    Signed-off-by: Jiri Olsa

    Don Zickus
     
  • The mmap2 interface was missing the protection and flags bits
    needed to accurately determine whether an mmap'd memory area was
    shared or private and whether it was readable or not.

    Signed-off-by: Peter Zijlstra
    [tweaked patch to compile and wrote changelog]
    Signed-off-by: Don Zickus
    Link: http://lkml.kernel.org/r/1400526833-141779-2-git-send-email-dzickus@redhat.com
    Signed-off-by: Jiri Olsa

    Peter Zijlstra
     

06 Jun, 2014

4 commits

  • perf tools like 'perf report' can aggregate samples by comm strings,
    which generally works. However, there are other potential use-cases.
    For example, to pair up 'calls' with 'returns' accurately (from branch
    events like Intel BTS) it is necessary to identify whether the process
    has exec'd. Although a comm event is generated when an 'exec' happens
    it is also generated whenever the comm string is changed on a whim
    (e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event
    to differentiate one case from the other.

    In order to determine whether the kernel supports the new flag, a
    selection bit named 'exec' is added to struct perf_event_attr. The
    bit does nothing but will cause perf_event_open() to fail if the bit
    is set on kernels that do not have it defined.

    Signed-off-by: Adrian Hunter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com
    Cc: Paul Mackerras
    Cc: Dave Jones
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Alexander Viro
    Cc: Linus Torvalds
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Adrian Hunter
     
  • Conflicts:
    arch/x86/kernel/traps.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • perf_event_comm() assumes that set_task_comm() is only called on
    exec(), and in particular that it's only called on current.

    Neither is true, as Dave reported a WARN triggered by
    set_task_comm() being called on !current.

    Separate the exec() hook from the comm hook.

    Reported-by: Dave Jones
    Signed-off-by: Peter Zijlstra
    Cc: Alexander Viro
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140521153219.GH5226@laptop.programming.kicks-ass.net
    [ Build fix. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull ARM updates from Russell King:

    - Major clean-up of the L2 cache support code. The existing mess was
    becoming rather unmaintainable through all the additions that others
    have done over time. This turns it into a much nicer structure, and
    implements a few performance improvements as well.

    - Clean up some of the CP15 control register tweaks for alignment
    support, moving some code and data into alignment.c

    - DMA properties for ARM, from Santosh and reviewed by DT people. This
    adds DT properties to specify bus translations we can't discover
    automatically, and to indicate whether devices are coherent.

    - Hibernation support for ARM

    - Make ftrace work with read-only text in modules

    - add suspend support for PJ4B CPUs

    - rework interrupt masking for undefined instruction handling, which
    allows us to enable interrupts earlier in the handling of these
    exceptions.

    - support for big endian page tables

    - fix stacktrace support to exclude stacktrace functions from the
    trace, and add save_stack_trace_regs() implementation so that kprobes
    can record stack traces.

    - Add support for the Cortex-A17 CPU.

    - Remove last vestiges of ARM710 support.

    - Removal of ARM "meminfo" structure, finally converting us solely to
    memblock to handle the early memory initialisation.

    * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (142 commits)
    ARM: ensure C page table setup code follows assembly code (part II)
    ARM: ensure C page table setup code follows assembly code
    ARM: consolidate last remaining open-coded alignment trap enable
    ARM: remove global cr_no_alignment
    ARM: remove CPU_CP15 conditional from alignment.c
    ARM: remove unused adjust_cr() function
    ARM: move "noalign" command line option to alignment.c
    ARM: provide common method to clear bits in CPU control register
    ARM: 8025/1: Get rid of meminfo
    ARM: 8060/1: mm: allow sub-architectures to override PCI I/O memory type
    ARM: 8066/1: correction for ARM patch 8031/2
    ARM: 8049/1: ftrace/add save_stack_trace_regs() implementation
    ARM: 8065/1: remove last use of CONFIG_CPU_ARM710
    ARM: 8062/1: Modify ldrt fixup handler to re-execute the userspace instruction
    ARM: 8047/1: rwsem: use asm-generic rwsem implementation
    ARM: l2c: trial at enabling some Cortex-A9 optimisations
    ARM: l2c: add warnings for stuff modifying aux_ctrl register values
    ARM: l2c: print a warning with L2C-310 caches if the cache size is modified
    ARM: l2c: remove old .set_debug method
    ARM: l2c: kill L2X0_AUX_CTRL_MASK before anyone else makes use of this
    ...

    Linus Torvalds
     

05 Jun, 2014

5 commits

  • tmpfs is widely used but as Denys reports shmem_aops doesn't have
    ->readpage() and thus you can't probe a binary on this filesystem.

    As Hugh suggested, we can use shmem_read_mapping_page() in this
    case; we just need to check shmem_mapping() when ->readpage == NULL.

    Reported-by: Denys Vlasenko
    Suggested-by: Hugh Dickins
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140519184136.GB6750@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • copy_insn() fails with -EIO if ->readpage == NULL, but this error
    is not propagated unless uprobe_register() path finds ->mm which
    already mmaps this file. In this case (say) "perf record" does not
    actually install the probe, but the user can't know about this.

    Move this check into uprobe_register() so that this problem can be
    detected earlier and reported to user.

    Note: this is still not perfect,

    - copy_insn() and arch_uprobe_analyze_insn() should be called
    by uprobe_register() but this is not simple, we need vm_file
    for read_mapping_page() (although perhaps we can pass NULL),
    and we need ->mm for is_64bit_mm() (although this logic is
    broken anyway).

    - uprobe_register() should be called by create_trace_uprobe(),
    not by probe_event_enable(), so that an error can be detected
    at "perf probe -x" time. This also needs more changes in the
    core uprobe code, uprobe register/unregister interface was
    poorly designed from the very beginning.

    Reported-by: Denys Vlasenko
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140519184054.GA6750@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add common code to generate -ENOTSUPP at event creation time if an
    architecture attempts to create a sampled event and
    PERF_PMU_NO_INTERRUPT is set.

    This adds a new pmu->capabilities flag. Initially we only support
    PERF_PMU_NO_INTERRUPT (to indicate a PMU has no support for generating
    hardware interrupts) but there are other capabilities that can be
    added later.

    Signed-off-by: Vince Weaver
    Acked-by: Will Deacon
    [peterz: rename to PERF_PMU_CAP_* and moved the pmu::capabilities word into a hole]
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1405161708060.11099@vincent-weaver-1.umelst.maine.edu
    Signed-off-by: Ingo Molnar

    Signed-off-by: Ingo Molnar

    Vince Weaver
     
  • While that mutex should guard the elements, it doesn't guard against the
    use-after-free that's from list_for_each_entry_rcu().
    __perf_event_exit_task() can actually free the event.

    And because list addition/deletion is guarded by both ctx->mutex and
    ctx->lock, holding ctx->mutex is sufficient for reading the list, so we
    don't actually need the rcu list iteration.

    Fixes: 3a497f48637e ("perf: Simplify perf_event_exit_task_context()")
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Signed-off-by: Peter Zijlstra
    Cc: Dave Jones
    Cc: acme@ghostprotocols.net
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140529170024.GA2315@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • These bits from Oleg are fully cooked, ship them to Linus.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

04 Jun, 2014

1 commit

  • …git/tip/tip into next

    Pull perf updates from Ingo Molnar:
    "The tooling changes maintained by Jiri Olsa until Arnaldo is on
    vacation:

    User visible changes:
    - Add -F option for specifying output fields (Namhyung Kim)
    - Propagate exit status of a command line workload for record command
    (Namhyung Kim)
    - Use tid for finding thread (Namhyung Kim)
    - Clarify the output of perf sched map plus small sched command
    fixes (Dongsheng Yang)
    - Wire up perf_regs and unwind support for ARM64 (Jean Pihet)
    - Factor hists statistics counts processing which in turn also fixes
    several bugs in TUI report command (Namhyung Kim)
    - Add --percentage option to control absolute/relative percentage
    output (Namhyung Kim)
    - Add --list-cmds to 'kmem', 'mem', 'lock' and 'sched', for use by
    completion scripts (Ramkumar Ramachandra)

    Development/infrastructure changes and fixes:
    - Android related fixes for pager and map dso resolving (Michael
    Lentine)
    - Add libdw DWARF post unwind support for ARM (Jean Pihet)
    - Consolidate types.h for ARM and ARM64 (Jean Pihet)
    - Fix possible null pointer dereference in session.c (Masanari Iida)
    - Cleanup, remove unused variables in map_switch_event() (Dongsheng
    Yang)
    - Remove nr_state_machine_bugs in perf latency (Dongsheng Yang)
    - Remove usage of trace_sched_wakeup(.success) (Peter Zijlstra)
    - Cleanups for perf.h header (Jiri Olsa)
    - Consolidate types.h and export.h within tools (Borislav Petkov)
    - Move u64_swap union to its single user's header, evsel.h (Borislav
    Petkov)
    - Fix for s390 to properly parse tracepoints plus test code
    (Alexander Yarygin)
    - Handle EINTR error for readn/writen (Namhyung Kim)
    - Add a test case for hists filtering (Namhyung Kim)
    - Share map_groups among threads of the same group (Arnaldo Carvalho
    de Melo, Jiri Olsa)
    - Making some code (cpu node map and report parse callchain callback)
    global to be usable by upcomming changes (Don Zickus)
    - Fix pmu object compilation error (Jiri Olsa)

    Kernel side changes:
    - intrusive uprobes fixes from Oleg Nesterov. Since the interface is
    admin-only, and the bug only affects user-space ("any probed
    jmp/call can kill the application"), we queued these fixes via the
    development tree, as a special exception.
    - more fuzzer motivated race fixes and related refactoring and
    robustization.
    - allow PMU drivers to be built as modules. (No actual module yet,
    because the x86 Intel uncore module wasn't ready in time for this)"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    perf tools: Add automatic remapping of Android libraries
    perf tools: Add cat as fallback pager
    perf tests: Add a testcase for histogram output sorting
    perf tests: Factor out print_hists_*()
    perf tools: Introduce reset_output_field()
    perf tools: Get rid of obsolete hist_entry__sort_list
    perf hists: Reset width of output fields with header length
    perf tools: Skip elided sort entries
    perf top: Add --fields option to specify output fields
    perf report/tui: Fix a bug when --fields/sort is given
    perf tools: Add ->sort() member to struct sort_entry
    perf report: Add -F option to specify output fields
    perf tools: Call perf_hpp__init() before setting up GUI browsers
    perf tools: Consolidate management of default sort orders
    perf tools: Allow hpp fields to be sort keys
    perf ui: Get rid of callback from __hpp__fmt()
    perf tools: Consolidate output field handling to hpp format routines
    perf tools: Use hpp formats to sort final output
    perf tools: Support event grouping in hpp ->sort()
    perf tools: Use hpp formats to sort hist entries
    ...

    Linus Torvalds
     

26 May, 2014

1 commit

  • After an instruction write into the xol area, on the ARM v7
    architecture the code needs to flush the dcache and icache to
    sync them up for the given set of addresses. Having just the
    'flush_dcache_page(page)' call is not enough - it is possible
    to have a stale instruction sitting in the icache for the given
    xol area slot address.

    Introduce a weak arch_uprobe_copy_ixol function that by default
    calls the uprobes copy_to_page function and then the
    flush_dcache_page function, and define a new one on ARM that
    handles the xol slot copy in an ARM-specific way.

    The flush_uprobe_xol_access function shares its implementation
    with the flush_ptrace_access function and takes care of writing
    the instruction to the user-land address space on the given
    variety of different cache types on ARM CPUs. Because
    flush_uprobe_xol_access does not have a vma around,
    flush_ptrace_access was split into two parts: one that
    retrieves the set of conditions from the vma, and a common one
    that receives those conditions as flags.

    Note that the ARM cache flush functions need the kernel address
    through which the instruction write happened, so instead of
    using the uprobes copy_to_page function, the code was changed
    to explicitly map the page and do the memcpy.

    Note that arch_uprobe_copy_ixol, in a similar way to the
    copy_to_user_page function, has preempt_disable/preempt_enable.

    Signed-off-by: Victor Kamensky
    Acked-by: Oleg Nesterov
    Reviewed-by: David A. Long
    Signed-off-by: Russell King

    Victor Kamensky
     

19 May, 2014

4 commits

  • ... in 3a497f48637 ("perf: Simplify perf_event_exit_task_context()")

    Signed-off-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1399720259-28275-1-git-send-email-bp@alien8.de
    Signed-off-by: Thomas Gleixner

    Borislav Petkov
     
  • Alexander noticed that we use RCU iteration on rb->event_list
    but do not use list_{add,del}_rcu() to add/remove entries to
    that list, nor do we observe proper grace periods when re-using
    the entries.

    Merge ring_buffer_detach() into ring_buffer_attach() such that
    attaching to the NULL buffer is detaching.

    Furthermore, ensure that between any 'detach' and 'attach' of the same
    event we observe the required grace period, but only when strictly
    required. In effect this means that only ioctl(.request =
    PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
    normal initial attach and final detach will not be delayed.

    This patch should, I think, do the right thing under all
    circumstances, the 'normal' cases all should never see the extra grace
    period, but the two cases:

    1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
    ring_buffer set, will now observe the required grace period between
    removing itself from the old and attaching itself to the new buffer.

    This case is 'simple' in that both buffers are present in
    perf_event_set_output(); one could think an unconditional
    synchronize_rcu() would be sufficient; however...

    2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:
    ring_buffer_attach(.rb = NULL)
    followed by the ioctl() doing:
    ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.

    In order to make case 2 work we use Paul's latest
    cond_synchronize_rcu() call.
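
    The cond_synchronize_rcu() semantics can be modelled with a toy
    grace-period counter (this is a simulation, not the kernel
    implementation): a snapshot taken at detach time lets the later
    attach skip the full wait whenever a grace period has already
    elapsed in between.

    ```c
    #include <assert.h>

    /* Toy model: a global counter stands in for RCU grace periods. */
    static unsigned long gp_count;
    static int full_waits;

    /* analogous to get_state_synchronize_rcu() */
    static unsigned long get_state(void) { return gp_count; }

    /* analogous to synchronize_rcu(): a full grace-period wait */
    static void synchronize(void) { gp_count++; full_waits++; }

    /* analogous to cond_synchronize_rcu(): wait only if no grace
     * period has elapsed since the snapshot was taken */
    static void cond_synchronize(unsigned long snap)
    {
        if (gp_count == snap)
            synchronize();
    }

    int main(void)
    {
        unsigned long snap = get_state();  /* taken at detach time */
        cond_synchronize(snap);            /* immediate reattach: must wait */
        assert(full_waits == 1);

        snap = get_state();
        synchronize();                     /* unrelated grace period elapses */
        cond_synchronize(snap);            /* later reattach: no extra wait */
        assert(full_waits == 2);
        return 0;
    }
    ```

    This is why only the SET_OUTPUT ioctl path ever pays for a grace
    period: the normal attach/detach sequence finds one already elapsed.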

    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Cc: Andi Kleen
    Cc: "Paul E. McKenney"
    Cc: Ingo Molnar
    Cc: Frederic Weisbecker
    Cc: Mike Galbraith
    Reported-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • The perf cpu offline callback takes down all cpu context
    events and releases swhash->swevent_hlist.

    This could race with a task context software event being
    scheduled on this cpu via perf_swevent_add while the cpu hotplug
    code has already cleaned up the event's data.

    The race happens in the gap between the cpu notifier code
    and the cpu actually being taken down. Note that only cpu
    ctx events are terminated in the perf cpu hotplug code.

    It's easily reproduced with:
    $ perf record -e faults perf bench sched pipe

    while putting one of the cpus offline:
    # echo 0 > /sys/devices/system/cpu/cpu1/online

    Console emits following warning:
    WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
    Modules linked in:
    CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G W 3.14.0+ #256
    Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
    0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
    0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
    ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
    Call Trace:
    [] dump_stack+0x4f/0x7c
    [] warn_slowpath_common+0x8c/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] perf_swevent_add+0x18d/0x1a0
    [] event_sched_in.isra.75+0x9e/0x1f0
    [] group_sched_in+0x6a/0x1f0
    [] ? sched_clock_local+0x25/0xa0
    [] ctx_sched_in+0x1f6/0x450
    [] perf_event_sched_in+0x6b/0xa0
    [] perf_event_context_sched_in+0x7b/0xc0
    [] __perf_event_task_sched_in+0x43e/0x460
    [] ? put_lock_stats.isra.18+0xe/0x30
    [] finish_task_switch+0xb8/0x100
    [] __schedule+0x30e/0xad0
    [] ? pipe_read+0x3e2/0x560
    [] ? preempt_schedule_irq+0x3e/0x70
    [] ? preempt_schedule_irq+0x3e/0x70
    [] preempt_schedule_irq+0x44/0x70
    [] retint_kernel+0x20/0x30
    [] ? lockdep_sys_exit+0x1a/0x90
    [] lockdep_sys_exit_thunk+0x35/0x67
    [] ? sysret_check+0x5/0x56

    Fix this by tracking the cpu hotplug state and emitting
    the WARN only if the current cpu is initialized properly.
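
    The shape of the fix can be sketched with a toy per-cpu flag (the
    names swhash_online and swevent_add_warns are illustrative, not the
    kernel's): the warning fires only when the hlist is missing on a
    cpu that is fully up, so the teardown window stays quiet.

    ```c
    #include <assert.h>

    #define NR_CPUS 4

    /* Toy per-cpu flag tracking hotplug state, set when the cpu's
     * swevent hash is initialized and cleared during teardown. */
    static int swhash_online[NR_CPUS];

    /* WARN only if the hlist is missing on a properly initialized cpu;
     * during the offline window a missing hlist is expected. */
    static int swevent_add_warns(int cpu, int hlist_missing)
    {
        return hlist_missing && swhash_online[cpu];
    }

    int main(void)
    {
        swhash_online[1] = 1;             /* cpu 1 fully online */
        assert(swevent_add_warns(1, 1));  /* genuine bug: warn */

        swhash_online[1] = 0;             /* hotplug teardown started */
        assert(!swevent_add_warns(1, 1)); /* race window: no warning */
        return 0;
    }
    ```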

    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: stable@vger.kernel.org
    Reported-by: Fengguang Wu
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Thomas Gleixner

    Jiri Olsa
     
  • Vince reported that using a large sample_period (one with bit 63 set)
    results in wreckage since, while the sample_period is fundamentally
    unsigned (negative periods don't make sense), the way we implement
    things very much relies on signed logic.

    So limit sample_period to 63 bits to avoid tripping over this.
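
    A minimal sketch of the validity check the fix implies (the helper
    name is illustrative): reject any period with bit 63 set, since the
    signed arithmetic used internally would see it as negative.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Illustrative check: a valid sample_period must fit in 63 bits. */
    static int sample_period_valid(uint64_t period)
    {
        return !(period & (1ULL << 63));
    }

    int main(void)
    {
        assert(sample_period_valid(1000000));
        assert(!sample_period_valid(1ULL << 63));
        /* why bit 63 trips the signed logic: */
        assert((int64_t)(1ULL << 63) < 0);
        return 0;
    }
    ```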

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

14 May, 2014

3 commits

  • If the probed insn triggers a trap, ->si_addr = regs->ip is technically
    correct, but this is not what the signal handler wants; we need to pass
    the address of the probed insn, not the address of the xol slot.

    Add the new arch-agnostic helper, uprobe_get_trap_addr(), and change
    fill_trap_info() and math_error() to use it. !CONFIG_UPROBES case in
    uprobes.h uses a macro to avoid include hell and ensure that it can be
    compiled even if an architecture doesn't define instruction_pointer().

    Test-case:

    #include <stdio.h>
    #include <signal.h>
    #include <unistd.h>

    extern void probe_div(void);

    void sigh(int sig, siginfo_t *info, void *c)
    {
        int passed = (info->si_addr == probe_div);
        printf(passed ? "PASS\n" : "FAIL\n");
        _exit(!passed);
    }

    int main(void)
    {
        struct sigaction sa = {
            .sa_sigaction = sigh,
            .sa_flags = SA_SIGINFO,
        };

        sigaction(SIGFPE, &sa, NULL);

        asm (
            "xor %ecx,%ecx\n"
            ".globl probe_div; probe_div:\n"
            "idiv %ecx\n"
        );

        return 0;
    }

    It fails if probe_div() is probed.

    Note: show_unhandled_signals users should probably use this helper too,
    but we need to cleanup them first.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Masami Hiramatsu

    Oleg Nesterov
     
  • Hugh says:

    The one I noticed was that it forgets all about memcg (because
    it was copied from KSM, and there the replacement page has already
    been charged to a memcg). See how mm/memory.c do_anonymous_page()
    does a mem_cgroup_charge_anon().

    Hopefully not a big problem, uprobes is a system-wide thing and only
    root can insert the probes. But I agree, should be fixed anyway.

    Add mem_cgroup_{un,}charge_anon() into uprobe_write_opcode(). To simplify
    the error handling (and avoid the new "uncharge" label) the patch also
    moves anon_vma_prepare() up before we alloc/charge the new page.

    While at it, fix the comment about ->mmap_sem; it is held for write.
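
    The error-handling simplification can be sketched with stand-in
    functions (this toy is not the kernel code; charge/uncharge and the
    stub bodies are illustrative): running the fallible prepare step
    before anything is charged means the early error return never needs
    an uncharge label.

    ```c
    #include <assert.h>

    static int charged;

    /* stand-in for anon_vma_prepare(), which can fail */
    static int anon_vma_prepare_stub(int fail) { return fail ? -1 : 0; }

    /* stand-ins for mem_cgroup_charge_anon()/uncharge */
    static void charge(void) { charged++; }

    static int write_opcode(int prepare_fails)
    {
        /* fallible step moved up, before we alloc/charge the new
         * page, so this error path has nothing to uncharge */
        if (anon_vma_prepare_stub(prepare_fails))
            return -1;

        charge();
        /* ... copy the opcode and replace the page ... */
        return 0;
    }

    int main(void)
    {
        assert(write_opcode(1) == -1 && charged == 0);
        assert(write_opcode(0) == 0 && charged == 1);
        return 0;
    }
    ```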

    Suggested-by: Hugh Dickins
    Signed-off-by: Oleg Nesterov

    Oleg Nesterov
     
  • Unlike the more usual refcounting, what css_tryget() provides is the
    distinction between online and offline csses rather than protection
    against upping a refcnt which has already reached zero. cgroup is
    planning to provide an actual tryget which fails if the refcnt has
    already reached zero. Let's rename the existing trygets so that they
    clearly indicate that they are about onliness.

    I thought about keeping the existing names as-are and introducing new
    names for the planned actual tryget; however, given that each
    controller participates in the synchronization of the online state, it
    seems worthwhile to make it explicit that these functions are about
    on/offline state.

    Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
    to css_tryget_online_from_dir(). This is pure rename.

    v2: cgroup_freezer grew new usages of css_tryget(). Update
    accordingly.
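
    The distinction being named can be modelled with a toy struct (this
    is not the cgroup API; css, tryget and tryget_online here are
    illustrative): a plain tryget fails only once the refcount hits
    zero, while the onliness variant also fails as soon as the css goes
    offline, even with live references outstanding.

    ```c
    #include <assert.h>

    /* Toy css with a refcount and an online flag. */
    struct css { int refcnt; int online; };

    /* plain tryget: fails only if the refcnt already reached zero */
    static int tryget(struct css *c)
    {
        if (c->refcnt == 0)
            return 0;
        c->refcnt++;
        return 1;
    }

    /* onliness tryget: additionally fails once the css is offline */
    static int tryget_online(struct css *c)
    {
        return c->online ? tryget(c) : 0;
    }

    int main(void)
    {
        struct css c = { .refcnt = 1, .online = 1 };

        assert(tryget_online(&c));   /* online and live: succeeds */
        c.online = 0;                /* goes offline, refs remain */
        assert(!tryget_online(&c));  /* onliness check fails */
        assert(tryget(&c));          /* plain tryget still succeeds */
        return 0;
    }
    ```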

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

07 May, 2014

6 commits

  • Instead of jumping through hoops to make sure to find (and exit) each
    event, do it the simple straightforward way.

    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-tij931199thfkys8vbnokdpf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Primarily make perf_event_release_kernel() into put_event(), this will
    allow kernel space to create per-task inherited events, and is safer
    in general.

    Also, document the free_event() assumptions.

    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-rk9pvr6e1d0559lxstltbztc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Document and validate the locking assumption of event_sched_in().

    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-sybq1publ9xt5no77cwvi0eo@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit 38b435b16c36 ("perf: Fix tear-down of inherited group events")
    states that we need to destroy groups for inherited events, but it
    doesn't make any sense to not also destroy groups for normal events.

    And while it usually makes no difference (the normal events won't
    leak, and its very likely all the group events will die in quick
    succession) it does make the code more consistent and closes a
    potential hole for trouble.

    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-426egt8zmsm12d2q8k2xz4tt@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make sure all events in a group have the same inherit state. It was
    possible for group leaders to have inherit set while sibling events
    would not have inherit set.

    In this case we'd still inherit the siblings, leading to some
    non-fatal weirdness.

    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-r32tt8yldvic3jlcghd3g35u@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar