19 Jul, 2013

1 commit

  • Pull driver core patches from Greg KH:
    "Here are some driver core patches for 3.11-rc2. They aren't really
    bugfixes, but a bunch of new helper macros for drivers to properly
    create attribute groups, which drivers and subsystems need to fix up a
    ton of race issues with incorrectly creating sysfs files (binary and
    normal) after userspace has been told that the device is present.

    Also here is the ability to create binary files as attribute groups,
    to solve that race condition, which was impossible to do before this,
    so that's my fault the drivers were broken.

    The majority of the .c changes are indenting and moving code around a
    bit. It affects no existing code, but allows the large backlog of 70+
    patches that I already have created to start flowing into the
    different subtrees, instead of having to live in my driver-core tree,
    causing merge nightmares in linux-next for the next few months.

    These were finalized too late for the -rc1 merge window, which is why
    they didn't make that pull request; testing and review from others
    didn't happen until a few weeks ago, and then there's the whole
    distraction of the past few days, which prevented these from getting
    to you sooner. Sorry about that.

    Oh, and there's a bugfix for the documentation build warning in here
    as well. All of these have been in linux-next this week, with no
    reported problems"

    * tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    driver-core: fix new kernel-doc warning in base/platform.c
    sysfs: use file mode defines from stat.h
    sysfs: add more helper macro's for (bin_)attribute(_groups)
    driver core: add default groups to struct class
    driver core: Introduce device_create_groups
    sysfs: prevent warning when only using binary attributes
    sysfs: add support for binary attributes in groups
    driver core: device.h: add RW and RO attribute macros
    sysfs.h: add BIN_ATTR macro
    sysfs.h: add ATTRIBUTE_GROUPS() macro
    sysfs.h: add __ATTR_RW() macro

    Linus Torvalds
     

17 Jul, 2013

1 commit


15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. The fix in commit
    5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a
    good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Jul, 2013

3 commits

  • Jiri managed to trigger this warning:

    [] ======================================================
    [] [ INFO: possible circular locking dependency detected ]
    [] 3.10.0+ #228 Tainted: G W
    [] -------------------------------------------------------
    [] p/6613 is trying to acquire lock:
    [] (rcu_node_0){..-...}, at: [] rcu_read_unlock_special+0xa7/0x250
    []
    [] but task is already holding lock:
    [] (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0xd9/0x2c0
    []
    [] which lock already depends on the new lock.
    []
    [] the existing dependency chain (in reverse order) is:
    []
    [] -> #4 (&ctx->lock){-.-...}:
    [] -> #3 (&rq->lock){-.-.-.}:
    [] -> #2 (&p->pi_lock){-.-.-.}:
    [] -> #1 (&rnp->nocb_gp_wq[1]){......}:
    [] -> #0 (rcu_node_0){..-...}:

    Paul was quick to explain that due to preemptible RCU we cannot call
    rcu_read_unlock() while holding scheduler (or nested) locks when part
    of the read side critical section was preemptible.

    Therefore solve it by making the entire RCU read side non-preemptible.

    Also pull out the retry from under the non-preempt to play nice with RT.

    Reported-by: Jiri Olsa
    Helped-out-by: Paul E. McKenney
    Cc:
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The '!ctx->is_active' check has a valid scenario, so
    there's no need for the warning.

    The reason is that there's a time window between the
    'ctx->is_active' check in the perf_event_enable() function
    and the __perf_event_enable() function having:

    - IRQs on
    - ctx->lock unlocked

    where the task could be killed and 'ctx' deactivated by
    perf_event_exit_task(), ending up with the warning below.

    So remove the WARN_ON_ONCE() check and add comments to
    explain it all.

    This addresses the following warning reported by Vince Weaver:

    [ 324.983534] ------------[ cut here ]------------
    [ 324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190()
    [ 324.984420] Modules linked in:
    [ 324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246
    [ 324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
    [ 324.984420] 0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00
    [ 324.984420] ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860
    [ 324.984420] 0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a
    [ 324.984420] Call Trace:
    [ 324.984420] [] dump_stack+0x19/0x1b
    [ 324.984420] [] warn_slowpath_common+0x70/0xa0
    [ 324.984420] [] warn_slowpath_null+0x1a/0x20
    [ 324.984420] [] __perf_event_enable+0x187/0x190
    [ 324.984420] [] remote_function+0x40/0x50
    [ 324.984420] [] generic_smp_call_function_single_interrupt+0xbe/0x130
    [ 324.984420] [] smp_call_function_single_interrupt+0x27/0x40
    [ 324.984420] [] call_function_single_interrupt+0x6f/0x80
    [ 324.984420] [] ? _raw_spin_unlock_irqrestore+0x41/0x70
    [ 324.984420] [] perf_event_exit_task+0x14d/0x210
    [ 324.984420] [] ? switch_task_namespaces+0x24/0x60
    [ 324.984420] [] do_exit+0x2b6/0xa40
    [ 324.984420] [] ? _raw_spin_unlock_irq+0x2c/0x30
    [ 324.984420] [] do_group_exit+0x49/0xc0
    [ 324.984420] [] get_signal_to_deliver+0x254/0x620
    [ 324.984420] [] do_signal+0x57/0x5a0
    [ 324.984420] [] ? __do_page_fault+0x2a4/0x4e0
    [ 324.984420] [] ? retint_restore_args+0xe/0xe
    [ 324.984420] [] ? retint_signal+0x11/0x84
    [ 324.984420] [] do_notify_resume+0x65/0x80
    [ 324.984420] [] retint_signal+0x46/0x84
    [ 324.984420] ---[ end trace 442ec2f04db3771a ]---

    Reported-by: Vince Weaver
    Signed-off-by: Jiri Olsa
    Suggested-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Currently when the child context for inherited events is
    created, it's based on the pmu object of the first event
    of the parent context.

    This is wrong for the following scenario:

    - HW context having HW and SW event
    - HW event got removed (closed)
    - SW event stays in HW context as the only event
    and its pmu is used to clone the child context

    The issue starts when the cpu context object is touched
    based on the pmu context object (__get_cpu_context). In
    this case the HW context ends up working with the SW cpu
    context, producing the WARN below.

    Fix this by using the parent context's pmu object when
    cloning the child context.

    Addresses the following warning reported by Vince Weaver:

    [ 2716.472065] ------------[ cut here ]------------
    [ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
    [ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
    [ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
    [ 2716.476035] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
    [ 2716.476035] 0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
    [ 2716.476035] ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
    [ 2716.476035] ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
    [ 2716.476035] Call Trace:
    [ 2716.476035] [] ? warn_slowpath_common+0x5b/0x70
    [ 2716.476035] [] ? task_ctx_sched_out+0x3c/0x5f
    [ 2716.476035] [] ? perf_event_exit_task+0xbf/0x194
    [ 2716.476035] [] ? do_exit+0x3e7/0x90c
    [ 2716.476035] [] ? __do_fault+0x359/0x394
    [ 2716.476035] [] ? do_group_exit+0x66/0x98
    [ 2716.476035] [] ? get_signal_to_deliver+0x479/0x4ad
    [ 2716.476035] [] ? __perf_event_task_sched_out+0x230/0x2d1
    [ 2716.476035] [] ? do_signal+0x3c/0x432
    [ 2716.476035] [] ? ctx_sched_in+0x43/0x141
    [ 2716.476035] [] ? perf_event_context_sched_in+0x7a/0x90
    [ 2716.476035] [] ? __perf_event_task_sched_in+0x31/0x118
    [ 2716.476035] [] ? mmdrop+0xd/0x1c
    [ 2716.476035] [] ? finish_task_switch+0x7d/0xa6
    [ 2716.476035] [] ? do_notify_resume+0x20/0x5d
    [ 2716.476035] [] ? retint_signal+0x3d/0x78
    [ 2716.476035] ---[ end trace 827178d8a5966c3d ]---

    Reported-by: Vince Weaver
    Signed-off-by: Jiri Olsa
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

05 Jul, 2013

1 commit

  • This patch fixes a serious bug in:

    14c63f17b1fd ("perf: Drop sample rate when sampling is too slow")

    There was a misunderstanding of the do_div() macro's API: it
    returns the remainder of the division, which is not what the
    function expected, leading to the interrupt latency watchdog
    being disabled.

    This patch also removes a duplicate assignment in
    perf_sample_event_took().
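
    For reference, do_div(n, base) divides the 64-bit lvalue n in place,
    leaving the quotient in n, while the macro itself evaluates to the
    32-bit remainder. An illustrative snippet (not part of the patch):

        u64 n = 1000003;
        u32 rem;

        rem = do_div(n, 1000);   /* n == 1000 (quotient), rem == 3 */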

    Signed-off-by: Stephane Eranian
    Cc: peterz@infradead.org
    Cc: dave.hansen@linux.intel.com
    Cc: ak@linux.intel.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/20130704223010.GA30625@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

23 Jun, 2013

1 commit

  • This patch keeps track of how long perf's NMI handler is taking,
    and also calculates how many samples perf can take a second. If
    the sample length times the expected max number of samples
    exceeds a configurable threshold, it drops the sample rate.

    This way, we don't have a runaway sampling process eating up the
    CPU.
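
    A rough sketch of the throttling idea (all names, constants and the
    averaging below are illustrative, not the actual kernel code):

        /* illustrative names and constants, not the kernel's actual ones */
        static u64 avg_sample_len_ns;           /* running average */
        static int sample_rate = 100000;        /* max samples per second */
        static const int max_cpu_percent = 25;  /* CPU budget for NMIs */

        /* called from the NMI path with the duration of one sample */
        static void sample_event_took(u64 sample_len_ns)
        {
                /* crude running average: 7/8 old + 1/8 new */
                avg_sample_len_ns =
                        (avg_sample_len_ns * 7 + sample_len_ns) / 8;

                /* expected NMI time per second exceeds the budget: back off */
                if (avg_sample_len_ns * sample_rate >
                    (u64)NSEC_PER_SEC * max_cpu_percent / 100)
                        sample_rate /= 2;
        }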

    This patch can tend to drop the sample rate down to a level where
    perf doesn't work very well. *BUT* the alternative is that my
    system hangs because it spends all of its time handling NMIs.

    I'll take a busted performance tool over an entire system that's
    busted and undebuggable any day.

    BTW, my suspicion is that there's still an underlying bug here.
    Using the HPET instead of the TSC is definitely a contributing
    factor, but I suspect there are some other things going on.
    But, I can't go dig down on a bug like that with my machine
    hanging all the time.

    Signed-off-by: Dave Hansen
    Acked-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: acme@ghostprotocols.net
    Cc: Dave Hansen
    [ Prettified it a bit. ]
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

20 Jun, 2013

8 commits

  • This patch simply moves all per-cpu variables into the new
    single per-cpu "struct bp_cpuinfo".

    To me this looks more logical and clean, but it can also
    simplify further potential changes. In particular, I do not
    think this memory should be per-cpu; it is never used "locally".
    After this change it is trivial to turn it into, say,
    bootmem[nr_cpu_ids].
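
    A sketch of the consolidated per-cpu structure (the exact fields are
    an approximation, not necessarily the final layout):

        struct bp_cpuinfo {
                /* number of CPU-pinned breakpoints on this CPU */
                unsigned int    cpu_pinned;
                /* tsk_pinned[n]: number of tasks having n+1 breakpoints */
                unsigned int    *tsk_pinned;
                /* number of non-pinned (flexible) cpu/task breakpoints */
                unsigned int    flexible;
        };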

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155020.GA6350@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • 1. register_wide_hw_breakpoint() can use unregister_ on failure,
    no need to duplicate the code.

    2. "struct perf_event **pevent" adds an unnecessary level of
    indirection and complication; use per_cpu(*cpu_events, cpu).

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155018.GA6347@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the trivial helper which simply returns cpumask_of() or
    cpu_possible_mask depending on bp->cpu.

    Change fetch_bp_busy_slots() and toggle_bp_slot() to always do
    for_each_cpu(cpumask_of_bp) to simplify the code and avoid the
    code duplication.
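
    A sketch of such a helper (an approximation of what the patch adds):

        static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
        {
                if (bp->cpu >= 0)
                        return cpumask_of(bp->cpu);
                return cpu_possible_mask;
        }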

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155015.GA6340@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Change toggle_bp_slot() to make "weight" negative if !enable.
    This way we can always use "+ weight" without an additional
    "if (enable)" check, and toggle_bp_task_slot() no longer needs
    this argument.
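
    The resulting pattern, roughly (the counter name below is a
    placeholder, not the actual per-cpu variable):

        if (!enable)
                weight = -weight;

        /* a single accounting path now serves both install and removal */
        pinned_slots[type] += weight;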

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155013.GA6337@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The enable/disable logic in toggle_bp_slot() is not symmetrical
    and imho very confusing. "old_count" in toggle_bp_task_slot() is
    actually new_count because this bp was already removed from the
    list.

    Change toggle_bp_slot() to always call list_add/list_del after
    toggle_bp_task_slot(). This way old_idx is task_bp_pinned() and
    this entry should be decremented, new_idx is +/-weight and we
    need to increment this element. The code/logic looks obvious.

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155011.GA6330@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Merge in two hw_breakpoint fixes, before applying another 5.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • fetch_bp_busy_slots() and toggle_bp_slot() use
    for_each_online_cpu(), which is obviously wrong wrt cpu_up() or
    cpu_down(): we can over/under account the per-cpu numbers.

    For example:

    # echo 0 >> /sys/devices/system/cpu/cpu1/online
    # perf record -e mem:0x10 -p 1 &
    # echo 1 >> /sys/devices/system/cpu/cpu1/online
    # perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10 -C1 -a &
    # taskset -p 0x2 1

    triggers the same WARN_ONCE("Can't find any breakpoint slot") in
    arch_install_hw_breakpoint().

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Cc:
    Link: http://lkml.kernel.org/r/20130620155009.GA6327@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint
    slot") in arch_install_hw_breakpoint() but the problem is not
    arch-specific.

    The problem is that task_bp_pinned(cpu) checks "cpu == iter->cpu"
    but doesn't account for the "all cpus" events with iter->cpu < 0.

    This means that, say, register_user_hw_breakpoint(tsk) can
    happily create an arbitrary number (> HBP_NUM) of breakpoints
    which cannot be activated. toggle_bp_task_slot() is equally
    wrong for the same reason, and nr_task_bp_pinned[] can have
    negative entries.
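
    The fix then amounts to also counting the cpu-wide events, along
    these lines (a sketch, not the exact hunk):

        /* count this bp if it targets 'cpu' or all CPUs (iter->cpu < 0) */
        if (iter->cpu < 0 || cpu == iter->cpu)
                count += hw_breakpoint_weight(iter);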

    Simple test:

    # perl -e 'sleep 1 while 1' &
    # perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl`

    Before this patch this triggers the same problem/WARN_ON(),
    after the patch it correctly fails with -ENOSPC.

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Cc:
    Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

19 Jun, 2013

4 commits

  • This allows us to use pdev->name for registering a PMU device.
    IMO the name is not supposed to be changed anyway.

    Signed-off-by: Mischa Jonker
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1370339148-5566-1-git-send-email-mjonker@synopsys.com
    Signed-off-by: Ingo Molnar

    Mischa Jonker
     
  • Commit 2b923c8 ("perf/x86: Check branch sampling priv level in generic code")
    was missing the check for the hypervisor (HV) priv level, so add it back.

    With this patch, we get the following correct behavior:

    # echo 2 >/proc/sys/kernel/perf_event_paranoid

    $ perf record -j any,k noploop 1
    Error:
    You may not have permission to collect stats.
    Consider tweaking /proc/sys/kernel/perf_event_paranoid:
    -1 - Not paranoid at all
    0 - Disallow raw tracepoint access for unpriv
    1 - Disallow cpu events for unpriv
    2 - Disallow kernel profiling for unpriv

    $ perf record -j any,hv noploop 1
    Error:
    You may not have permission to collect stats.
    Consider tweaking /proc/sys/kernel/perf_event_paranoid:
    -1 - Not paranoid at all
    0 - Disallow raw tracepoint access for unpriv
    1 - Disallow cpu events for unpriv
    2 - Disallow kernel profiling for unpriv
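
    Conceptually, the generic permission check now rejects both kernel
    and HV branch sampling for unprivileged users; an illustrative
    sketch, not the exact code:

        if ((attr->branch_sample_type &
             (PERF_SAMPLE_BRANCH_KERNEL | PERF_SAMPLE_BRANCH_HV)) &&
            perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
                return -EACCES;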

    Signed-off-by: Stephane Eranian
    Acked-by: Petr Matousek
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130606090204.GA3725@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • Merge in the latest fixes, to avoid conflicts with ongoing work.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Vince's fuzzer once again found holes. This time it spotted a leak in
    the locked page accounting.

    When an event had redirected output and its close() was the last
    reference to the buffer, we didn't have a vm context to undo the accounting.

    Change the code to destroy the buffer on the last munmap() and detach
    all redirected events at that time. This provides us the right context
    to undo the vm accounting.

    Reported-and-tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
    Cc:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 May, 2013

5 commits

  • Vince reported a problem found by his perf-specific trinity
    fuzzer.

    Al noticed 2 problems with perf's mmap():

    - it has issues against fork() since we use vma->vm_mm for accounting.
    - it has an rb refcount leak on double mmap().

    We fix the issues against fork() by using VM_DONTCOPY; I don't
    think there's code out there that uses this; we didn't hear
    about weird accounting problems/crashes. If we do need this to
    work, the previously proposed VM_PINNED could make this work.

    Aside from the rb reference leak spotted by Al, Vince's example
    prog was indeed doing a double mmap() through the use of
    perf_event_set_output().

    This exposes another problem: since we now have 2 events with
    one buffer, the accounting gets screwy because we account per
    event. Fix this by making the buffer responsible for its own
    accounting.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Al Viro
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/20130528085548.GA12193@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch moves the check added by commit 7cc23cd to the generic code:

    perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL

    The check is now implemented in generic code instead of x86 specific
    code. That way we do not have to repeat the test in each arch
    supporting branch sampling.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/20130521105337.GA2879@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • This patch adds /sys/device/xxx/perf_event_mux_interval_ms to adjust
    the multiplexing interval per PMU. The unit is milliseconds. The value
    has to be >= 1.

    In the 4th version, we renamed the sysfs file to be more consistent
    with the other /proc/sys/kernel entries for perf_events.

    In the 5th version, we handle the reprogramming of the hrtimer using
    hrtimer_forward_now(). That way, we sync up to new timer value quickly
    (suggested by Jiri Olsa).

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/1364991694-5876-3-git-send-email-eranian@google.com
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • The current scheme of using the timer tick was fine for per-thread
    events. However, it was causing bias issues in system-wide mode
    (including for uncore PMUs). Event groups would not get their fair
    share of runtime on the PMU. With tickless kernels, if a core is idle
    there is no timer tick, and thus no event rotation (multiplexing).
    However, there are events (especially uncore events) which do count
    even though cores are asleep.

    This patch changes the timer source for multiplexing. It introduces a
    per-PMU per-cpu hrtimer. The advantage is that even when a core goes
    idle, it will come back to service the hrtimer, thus multiplexing on
    system-wide events works much better.

    The per-PMU implementation (suggested by PeterZ) enables adjusting the
    multiplexing interval per PMU. The preferred interval is stashed into
    the struct pmu. If not set, it will be forced to the default interval
    value.

    In order to minimize the impact of the hrtimer, it is turned on and
    off on demand. When the PMU on a CPU is overcommitted, the hrtimer is
    activated. It is stopped when the PMU is not overcommitted.

    In order for this to work properly, we had to change the order of
    initialization in start_kernel() such that hrtimer_init() is run
    before perf_event_init().

    The default interval in milliseconds is set to a timer tick just like
    with the old code. We will provide a sysctl to tune this in another
    patch.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • The hw breakpoint pmu 'add' function is missing the
    period_left update needed for SW events.

    The perf HW breakpoint events use the SW events framework
    to process the overflow, so period_left needs to be properly
    initialized in the PMU 'add' method.
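
    The shape of the fix, roughly (a sketch assuming the SW-event helper
    perf_swevent_set_period() is what seeds period_left; simplified, not
    the exact hunk):

        static int hw_breakpoint_add(struct perf_event *bp, int flags)
        {
                if (!(flags & PERF_EF_START))
                        bp->hw.state = PERF_HES_STOPPED;

                if (is_sampling_event(bp)) {
                        bp->hw.last_period = bp->hw.sample_period;
                        perf_swevent_set_period(bp);
                }

                return arch_install_hw_breakpoint(bp);
        }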

    Signed-off-by: Jiri Olsa
    Reviewed-by: Peter Zijlstra
    Cc: H. Peter Anvin
    Cc: Oleg Nesterov
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Vince Weaver
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1367421944-19082-5-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

07 May, 2013

2 commits

  • Add a perf_event_aux() function to send out all types of
    auxiliary events - mmap, task and comm events. For each type
    there are match and output functions defined and used as
    callbacks during perf_event_aux processing.

    This way we can centralize the pmu/context iteration and
    event matching logic. Also, since a lot of the code was
    duplicated, this patch reduces the .text size by about 2kB
    on my setup:

    snipped output from 'objdump -x kernel/events/core.o'

    before:
    Idx Name Size
    0 .text 0000d313

    after:
    Idx Name Size
    0 .text 0000cad3

    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Namhyung Kim
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/1367857638-27631-3-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • The perf_event_task_ctx() function needs to be called with
    preemption disabled, since it checks the currently scheduled
    cpu against the event's cpu.

    We disable preemption for the task-related perf event context
    if there's one defined, leaving it up to chance which cpu
    it gets scheduled on.

    Signed-off-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Namhyung Kim
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/1367857638-27631-2-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

06 May, 2013

2 commits

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like your input on
    its design/implementation if you dislike some aspect we missed.
    Without flaming us to a crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc fixes plus a small hw-enablement patch for Intel IB model 58
    uncore events"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
    perf/x86/intel/lbr: Fix LBR filter
    perf/x86: Blacklist all MEM_*_RETIRED events for Ivy Bridge
    perf: Fix vmalloc ring buffer pages handling
    perf/x86/intel: Fix unintended variable name reuse
    perf/x86/intel: Add support for IvyBridge model 58 Uncore
    perf/x86/intel: Fix typo in perf_event_intel_uncore.c
    x86: Eliminate irq_mis_count counted in arch_irq_stat

    Linus Torvalds
     

02 May, 2013

1 commit


01 May, 2013

1 commit

  • If we allocate the perf ring buffer with the size of a single (user)
    page, we get memory corruption when releasing it in the
    rb_free_work function (with the CONFIG_PERF_USE_VMALLOC option).

    For a single page sized ring buffer the page_order is -1 (because
    nr_pages is 0). This needs to be recognized in the rb_free_work
    function to release the proper number of pages.

    Add a data_page_nr() function that returns the number of allocated
    data pages, and adjust the rest of the code to use it.

    Reported-by: Jan Stancek
    Original-patch-by: Peter Zijlstra
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/20130319143509.GA1128@krava.brq.redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

30 Apr, 2013

2 commits

  • Pull perf updates from Ingo Molnar:
    "Features:

    - Add "uretprobes" - an optimization to uprobes, like kretprobes are
    an optimization to kprobes. "perf probe -x file sym%return" now
    works like kretprobes. By Oleg Nesterov.

    - Introduce per core aggregation in 'perf stat', from Stephane
    Eranian.

    - Add memory profiling via PEBS, from Stephane Eranian.

    - Event group view for 'annotate' in --stdio, --tui and --gtk, from
    Namhyung Kim.

    - Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.

    - Add Ivy Bridge-EP uncore support, by Zheng Yan

    - IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.

    - Add perf test entries for checking breakpoint overflow signal
    handler issues, from Jiri Olsa.

    - Add perf test entry for checking number of EXIT events, from
    Namhyung Kim.

    - Add perf test entries for checking --cpu in record and stat, from
    Jiri Olsa.

    - Introduce perf stat --repeat forever, from Frederik Deweerdt.

    - Add --no-demangle to report/top, from Namhyung Kim.

    - PowerPC fixes plus a couple of cleanups/optimizations in uprobes
    and trace_uprobes, by Oleg Nesterov.

    Various fixes and refactorings:

    - Fix dependency of the python binding wrt libtraceevent, from
    Naohiro Aota.

    - Simplify some perf_evlist methods and to allow 'stat' to share code
    with 'record' and 'trace', by Arnaldo Carvalho de Melo.

    - Remove dead code related to libtraceevent integration, from
    Namhyung Kim.

    - Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
    sched lat' back working, by Arnaldo Carvalho de Melo

    - We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
    de Melo.

    - Kill a bunch of die() calls, from Namhyung Kim.

    - Fix build on non-glibc systems due to libio.h absence, from Cody P
    Schafer.

    - Remove some perf_session and tracing dead code, from David Ahern.

    - Honor parallel jobs, fix from Borislav Petkov

    - Introduce tools/lib/lk library, initially just removing duplication
    among tools/perf and tools/vm, from Borislav Petkov.

    ... and many more I missed to list, see the shortlog and git log for
    more details."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
    perf/x86/intel/P4: Robistify P4 PMU types
    perf/x86/amd: Fix AMD NB and L2I "uncore" support
    perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
    perf/x86: Check all MSRs before passing hw check
    perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
    perf/x86/intel: Add Ivy Bridge-EP uncore support
    perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
    perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
    uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
    uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
    uprobes/tracing: Change create_trace_uprobe() to support uretprobes
    uprobes/tracing: Make seq_printf() code uretprobe-friendly
    uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
    uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
    uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
    uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
    uprobes/tracing: Generalize struct uprobe_trace_entry_head
    uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
    uprobes/tracing: Kill the pointless seq_print_ip_sym() call
    uprobes/tracing: Kill the pointless task_pt_regs() calls
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controllers, which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit, including long-standing issues like the racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
    the cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of the broken behaviors in cgroup core
    and also .use_hierarchy switch in memcg (will be routed through -mm),
    which can be used to make very unusual hierarchy where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     

23 Apr, 2013

2 commits

  • Provide a new helper that lets full dynticks CPUs avoid
    stopping their tick in case there are events in the local
    rotation list.

    This way we make sure that perf_event_task_tick() is serviced
    on demand.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Stephane Eranian
    Cc: Jiri Olsa

    Frederic Weisbecker
     
  • Kick the current CPU's tick by sending it a self IPI when
    an event is queued on the rotation list and it is the first
    element inserted. This makes sure that perf_event_task_tick()
    works on full dynticks CPUs.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Stephane Eranian
    Cc: Jiri Olsa

    Frederic Weisbecker
     

21 Apr, 2013

2 commits

  • The following RCU splat indicates lack of RCU protection:

    [ 953.267649] ===============================
    [ 953.267652] [ INFO: suspicious RCU usage. ]
    [ 953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted
    [ 953.267661] -------------------------------
    [ 953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
    [ 953.267669]
    [ 953.267669] other info that might help us debug this:
    [ 953.267669]
    [ 953.267675]
    [ 953.267675] rcu_scheduler_active = 1, debug_locks = 0
    [ 953.267680] 1 lock held by glxgears/1289:
    [ 953.267683] #0: (&sig->cred_guard_mutex){+.+.+.}, at: [] .prepare_bprm_creds+0x34/0xa0
    [ 953.267700]
    [ 953.267700] stack backtrace:
    [ 953.267704] Call Trace:
    [ 953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable)
    [ 953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180
    [ 953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690
    [ 953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0
    [ 953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220
    [ 953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0
    ...

    This commit therefore adds the required RCU read-side critical
    section to perf_event_comm().
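
    The pattern is the usual RCU read-side bracket around the offending
    dereference (illustrative only, not the exact hunk):

        rcu_read_lock();
        /* ... the dereferences flagged by the splat above ... */
        rcu_read_unlock();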

    Reported-by: Adam Jackson
    Signed-off-by: Paul E. McKenney
    Cc: a.p.zijlstra@chello.nl
    Cc: paulus@samba.org
    Cc: acme@ghostprotocols.net
    Link: http://lkml.kernel.org/r/20130419190124.GA8638@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar
    Tested-by: Gustavo Luiz Duarte

    Paul E. McKenney
     
  • Conflicts:
    arch/x86/kernel/cpu/perf_event_intel.c

    Merge in the latest fixes before applying new patches, resolve the conflict.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

15 Apr, 2013

1 commit

  • Trinity discovered that we fail to check all 64 bits of
    attr.config passed by user space, resulting in out-of-bounds
    access of the perf_swevent_enabled array in
    sw_perf_event_destroy().

    Introduced in commit b0a873ebb ("perf: Register PMU
    implementations").

    Signed-off-by: Tommi Rantala
    Cc: Peter Zijlstra
    Cc: davej@redhat.com
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com
    Signed-off-by: Ingo Molnar

    Tommi Rantala
     

13 Apr, 2013

2 commits

  • Enclose return probes implementation.

    Signed-off-by: Anton Arapov
    Acked-by: Srikar Dronamraju
    Signed-off-by: Oleg Nesterov

    Anton Arapov
     
  • Unlike kretprobes, we can't trust userspace, so we must have
    protection from user-space attacks. User space has an "unlimited"
    stack, and this patch limits return probe nesting as a
    simple remedy for it.

    Note that this implementation leaks return_instance on siglongjmp
    until exit()/exec().

    The intention is to have a KISS, bare-minimum solution for the
    initial implementation in order to not complicate the uretprobes
    code.

    In the future we may come up with a more sophisticated solution
    that removes this depth limitation. It is not an easy task and
    lies beyond this patchset.

    Signed-off-by: Anton Arapov
    Acked-by: Srikar Dronamraju
    Signed-off-by: Oleg Nesterov

    Anton Arapov