07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds

03 Nov, 2011

1 commit

  • This reverts commit 144060fee07e9c22e179d00819c83c86fbcbf82c.

    It causes a resume regression for Andi on his Acer Aspire 1830T post
    3.1. The screen just stays black after wakeup.

    Also, it really looks like the wrong way to suspend and resume perf
    events: I think they should be done as part of the CPU suspend and
    resume, rather than as a notifier that does smp_call_function().

    Reported-by: Andi Kleen
    Acked-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Linus Torvalds

01 Nov, 2011

2 commits

  • Some kernel components pin user space memory (infiniband and perf) (by
    increasing the page count) and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockalled process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    Once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needed to pin the RDMA memory.

    This patch introduces a separate counter for pinned pages and
    accounts them separately.

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
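
The double accounting described in the commit above can be modelled with a toy example. This is only a sketch in Python, not kernel code: `MmCounters`, `total_vm`, `pinned_vm` and both helper functions are illustrative names assumed here (only `locked_vm` is named in the message), and the page counts are made up.

```python
from dataclasses import dataclass


@dataclass
class MmCounters:
    """Toy stand-in for per-process accounting; not the kernel's mm_struct."""
    total_vm: int = 0   # pages in the address space (illustrative field)
    locked_vm: int = 0  # pages accounted as mlocked
    pinned_vm: int = 0  # pages accounted as pinned (field name assumed here)


def account_shared(mm, mlocked, pinned):
    """Behaviour as described before the patch: pins are charged to locked_vm too."""
    mm.locked_vm += mlocked + pinned


def account_separate(mm, mlocked, pinned):
    """Behaviour after the patch: pinned pages go to their own counter."""
    mm.locked_vm += mlocked
    mm.pinned_vm += pinned


# A process that mlocks all 1000 of its pages, 200 of which are also pinned.
old = MmCounters(total_vm=1000)
account_shared(old, mlocked=1000, pinned=200)

new = MmCounters(total_vm=1000)
account_separate(new, mlocked=1000, pinned=200)

print(old.locked_vm, old.total_vm)   # locked_vm exceeds the process size
print(new.locked_vm, new.pinned_vm)  # each kind of page counted once
```

With a shared counter the 200 pinned pages are charged twice, which is how `locked_vm` can exceed the virtual size of the process.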
  • These files were getting via an implicit non-obvious
    path, but we want to crush those out of existence since they cost
    time during compiles of processing thousands of lines of headers
    for no reason. Give them the lightweight header that just contains
    the EXPORT_SYMBOL infrastructure.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker

26 Sep, 2011

1 commit

31 Aug, 2011

1 commit

  • We detected a serious issue with PERF_SAMPLE_READ and
    timing information when events were being multiplexed.

    Samples would have time_running > time_enabled. That
    was easy to reproduce with a libpfm4 example (ran 3
    times to cause multiplexing on Core 2):

    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
    syst_smpl: WARNING: time_running > time_enabled
    63277537998 uops_retired:freq=1 , scaled

    The bug was not present in kernel up to (and including) 3.0. It turns
    out the bug was introduced by the following commit:

    commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa

    events: Move lockless timer calculation into helper function

    The parameters of the function got reversed yet the call sites
    were not updated to reflect the change. That led to time_running
    and time_enabled being swapped. That had no effect when there was
    no multiplexing because in that case time_running = time_enabled
    but it would show up in any other scenario.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
    Signed-off-by: Ingo Molnar

    Eric B Munson
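
The effect of the reversed parameters in the commit above can be illustrated with a small model. It is a sketch only: `timer_values` and `is_consistent` are hypothetical names, and the two figures are the ENA/RUN values from the sample output, on the assumption that they were simply exchanged.

```python
def timer_values(time_enabled, time_running):
    """Hypothetical helper: package the two timers in declaration order."""
    return {"time_enabled": time_enabled, "time_running": time_running}


def is_consistent(values):
    """An event cannot have been running for longer than it was enabled."""
    return values["time_running"] <= values["time_enabled"]


# Assumed true values: the ENA/RUN figures from the sample, exchanged back.
enabled_ns = 60014875184
running_ns = 40144625315

good = timer_values(enabled_ns, running_ns)
# A call site that still passes its arguments in the old order.
bad = timer_values(running_ns, enabled_ns)
# Without multiplexing both timers are equal, so the swap is invisible.
unmultiplexed = timer_values(enabled_ns, enabled_ns)

print(is_consistent(good), is_consistent(bad), is_consistent(unmultiplexed))
```

The last case shows why the regression went unnoticed until events were multiplexed.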

29 Aug, 2011

1 commit

  • The current cgroup context switch code was incorrect leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any
    cgroup switches in case the outgoing and incoming tasks are
    in the same cgroup.

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts for
    the non cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
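
The penalty figures quoted in the commit above follow directly from the ctxsw/s numbers. A minimal check, where `penalty_percent` is a helper name assumed for this sketch:

```python
def penalty_percent(baseline, measured):
    """Relative drop in context switches per second, in percent."""
    return (baseline - measured) / baseline * 100.0


# Before the patch: pong alone vs. pong with a cgroup perf stat running.
before = penalty_percent(10684.51, 6674.61)
# After the patch: the same two measurements.
after = penalty_percent(10775.30, 10687.80)

print(f"before: {before:.1f}%  after: {after:.1f}%")
```

The first value rounds to the 37% quoted in the message, and the second stays below the 2% bound.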

14 Aug, 2011

2 commits

  • Currently, an event's 'pmu' field is set after pmu::event_init() is
    called. This means that pmu::event_init() must figure out which struct
    pmu the event was initialised from. This makes it difficult to
    consolidate common event initialisation code for similar PMUs, and
    very difficult to implement drivers for PMUs which can have multiple
    instances (e.g. a USB controller PMU, a GPU PMU, etc).

    This patch sets the 'pmu' field before initialising the event, allowing
    event init code to identify the struct pmu instance easily. In the
    event of failure to initialise an event, the event is destroyed via
    kfree() without calling perf_event::destroy(), so this shouldn't
    result in bad behaviour even if the destroy field was set before
    failure to initialise was noted.

    Signed-off-by: Mark Rutland
    Reviewed-by: Will Deacon
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1313062280-19123-1-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
  • Francis reports that s2r gets him spurious NMIs, this is because the
    suspend code leaves the boot cpu up and running.

    Cure this by adding a suspend notifier. The problem is that hotplug
    and suspend are completely un-serialized and the PM notifiers run
    before the suspend cpu unplug of all but the boot cpu.

    This leaves a window where the user can initiate another hotplug
    operation (either remove or add a cpu) resulting in either one too
    many or one too few hotplug ops. Thus we cannot use the hotplug code
    for the suspend case.

    There's another reason to not use the hotplug code, which is that the
    hotplug code totally destroys the perf state, we can do better for
    suspend and simply remove all counters from the PMU so that we can
    re-instate them on resume.

    Reported-by: Francis Moreau
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-1cvevybkgmv4s6v5y37t4847@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra

22 Jul, 2011

1 commit

    PMU type ids can be allocated dynamically, so the perf_event_attr::type
    check done when copying the attribute from userspace to the kernel is
    not valid.

    Signed-off-by: Lin Ming
    Cc: Robert Richter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309421396-17438-4-git-send-email-ming.m.lin@intel.com
    Signed-off-by: Ingo Molnar

    Lin Ming

01 Jul, 2011

8 commits

  • KVM needs one-shot samples, since a PMC programmed to -X will fire after X
    events and then again after 2^40 events (i.e. variable period).

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-4-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
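
The variable period mentioned in the commit above comes from counter wraparound. The sketch below assumes a 40-bit up-counter, as implied by the 2^40 figure; `COUNTER_BITS`, `MASK` and `events_until_overflow` are names made up for illustration.

```python
COUNTER_BITS = 40                 # width implied by the 2^40 figure above
MASK = (1 << COUNTER_BITS) - 1


def events_until_overflow(counter_value):
    """Events an up-counter needs before it wraps past its top bit."""
    return (1 << COUNTER_BITS) - (counter_value & MASK)


x = 1000
programmed = (-x) & MASK          # what "programmed to -X" means in 40 bits

first_period = events_until_overflow(programmed)   # fires after X events
second_period = events_until_overflow(0)           # then wraps from zero

print(first_period, second_period)
```

Unless the counter is reprogrammed after the first overflow, the second period is the full counter range rather than X, which is why a one-shot sample is the natural fit.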
  • The perf_event overflow handler does not receive any caller-derived
    argument, so many callers need to resort to looking up the perf_event
    in their local data structure. This is ugly and doesn't scale if a
    single callback services many perf_events.

    Fix by adding a context parameter to perf_event_create_kernel_counter()
    (and derived hardware breakpoints APIs) and storing it in the perf_event.
    The field can be accessed from the callback as event->overflow_handler_context.
    All callers are updated.

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
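
The idea in the commit above, one callback serving many events through a stored context, can be sketched as follows. This is a Python model, not the kernel API: `Event`, `create_counter` and `fire_overflow` are hypothetical stand-ins, and only the `overflow_handler_context` field name comes from the message.

```python
class Event:
    """Toy event that remembers the context supplied at creation time."""

    def __init__(self, overflow_handler, context):
        self.overflow_handler = overflow_handler
        self.overflow_handler_context = context


def create_counter(overflow_handler, context):
    """Hypothetical stand-in for a creation call that takes a context."""
    return Event(overflow_handler, context)


def fire_overflow(event):
    """Simulate an overflow: the handler only receives the event."""
    event.overflow_handler(event)


def on_overflow(event):
    # No lookup table needed: the owner travels with the event.
    owner = event.overflow_handler_context
    owner["overflows"] += 1


vcpu0 = {"name": "vcpu0", "overflows": 0}
vcpu1 = {"name": "vcpu1", "overflows": 0}
ev0 = create_counter(on_overflow, vcpu0)
ev1 = create_counter(on_overflow, vcpu1)

fire_overflow(ev0)
fire_overflow(ev0)
fire_overflow(ev1)
print(vcpu0["overflows"], vcpu1["overflows"])
```

A single handler serves both events without searching any caller-side data structure.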
  • Since only samples call perf_output_sample() it's much saner (and more
    correct) to put the sample logic in there than in the
    perf_output_begin()/perf_output_end() pair.

    Saves a useless argument, reduces conditionals and shrinks
    struct perf_output_handle, win!

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
  • The nmi parameter indicated if we could do wakeups from the current
    context, if not, we would set some state and self-IPI and let the
    resulting interrupt do the wakeup.

    For the various event classes:

    - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
    the PMI-tail (ARM etc.)
    - tracepoint: nmi=0; since tracepoint could be from NMI context.
    - software: nmi=[0,1]; some, like the schedule thing cannot
    perform wakeups, and hence need 0.

    As one can see, there is very little nmi=1 usage, and the down-side of
    not using it is that on some platforms some software events can have a
    jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

    The up-side however is that we can remove the nmi parameter and save a
    bunch of conditionals in fast paths.

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: Will Deacon
    Cc: Deng-Cheng Zhu
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Frederic Weisbecker
    Cc: Jason Wessel
    Cc: Don Zickus
    Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
  • The event tracing infrastructure exposes two timers which should be updated
    each time the value of the counter is updated. Currently, these counters are
    only updated when userspace calls read() on the fd associated with an event.
    This means that counters which are read via the mmap'd page exclusively never
    have their timers updated. This patch ensures that the timers are updated
    each time the values in the mmap'd page are updated.

    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308932786-5111-1-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
  • Take the timer calculation from perf_output_read and move it to a helper
    function for any place that needs timer values but cannot take the ctx->lock.

    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308861279-15216-2-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
  • Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308861279-15216-1-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
  • Since 2.6.36 (specifically commit d57e34fdd60b ("perf: Simplify the
    ring-buffer logic: make perf_buffer_alloc() do everything needed")),
    the perf_buffer_init_code() has been mis-setting the buffer watermark
    if perf_event_attr.wakeup_events has a non-zero value.

    This is because perf_event_attr.wakeup_events is a union with

    This commit re-enables the check for perf_event_attr.watermark being
    set before continuing with setting a non-default watermark.

    This bug is most noticeable when you are trying to use PERF_IOC_REFRESH
    with a value larger than one and perf_event_attr.wakeup_events is set to
    one. In this case the buffer watermark will be set to 1 and you will
    get extraneous POLL_IN overflows rather than POLL_HUP as expected.

    [ avoid using attr.wakeup_events when attr.watermark is set ]

    Signed-off-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1106011506390.5384@cl320.eecs.utk.edu
    Signed-off-by: Ingo Molnar

    Vince Weaver
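
The aliasing behind the commit above can be demonstrated with a small `ctypes` model. It is a sketch under assumptions: `Attr` is a cut-down stand-in for `perf_event_attr`, the second union member name `wakeup_watermark` is taken from the perf_event_attr ABI rather than from the message, and `buffer_watermark` with its `honour_flag` switch is a hypothetical helper, not the kernel function.

```python
import ctypes


class _Wakeup(ctypes.Union):
    # Both members occupy the same 32 bits.
    _fields_ = [("wakeup_events", ctypes.c_uint32),
                ("wakeup_watermark", ctypes.c_uint32)]


class Attr(ctypes.Structure):
    """Cut-down stand-in for perf_event_attr (illustrative only)."""
    _anonymous_ = ("wakeup",)
    _fields_ = [("watermark", ctypes.c_uint32),   # flag: use wakeup_watermark
                ("wakeup", _Wakeup)]


def buffer_watermark(attr, default, honour_flag):
    """Pick the buffer watermark, with or without checking attr.watermark."""
    if honour_flag and not attr.watermark:
        return default
    return attr.wakeup_watermark or default


attr = Attr()
attr.watermark = 0        # the user did not ask for a byte watermark
attr.wakeup_events = 1    # the user asked for a wakeup every event

print(buffer_watermark(attr, 4096, honour_flag=False))  # flag ignored
print(buffer_watermark(attr, 4096, honour_flag=True))   # flag checked first
```

Without the flag check, the event count of 1 is read back as a watermark of 1, matching the symptom described in the message.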

09 Jun, 2011

1 commit

  • And create the internal perf events header.

    v2: Keep an internal inlined perf_output_copy()

    Signed-off-by: Frederic Weisbecker
    Acked-by: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Stephane Eranian
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1305827704-5607-1-git-send-email-fweisbec@gmail.com
    [ v3: use clearer 'ring_buffer' and 'rb' naming ]
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker

07 Jun, 2011

1 commit

  • A lost Quilt refresh of 2c29ef0fef8 (perf: Simplify and fix
    __perf_install_in_context()) is causing grief and lockups,
    reported by Jiri Olsa.

    When installing an event in a task context, there's a number of

    - there might not be an existing task context, in which case
    we should install the now current context;

    - there might already be a context, not the current one, in
    which case we should de-schedule the old and install the new;

    these cases were dealt with in the lost refresh, however there is one
    further case that was found in testing:

    - there might already be a context, the current one, in which
    case we should still de-schedule, and should take care
    to re-install it (note that task_ctx_sched_out() clears

    Reported-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1307399008.2497.971.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra

04 Jun, 2011

2 commits

31 May, 2011

1 commit

  • Ben changed the cgroup API in commit f780bdb7c1c (cgroups: add
    per-thread subsystem callbacks) in an incompatible way, but
    forgot to convert the perf cgroup bits.

    Avoid compile warnings and runtime splats and convert perf too ;-)

    Acked-by: Ben Blum
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1306767651.1200.2990.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra

29 May, 2011

9 commits

  • Since perf_install_in_context() will now install a context when we
    add the first event, we can de-schedule the context when the last
    event is removed.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192142.090431763@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
  • In order to always call list_del_event() on the correct cpu if the
    event is part of an active context and avoid having to do two IPIs,
    change the close() semantics slightly.

    The current perf_event_disable() call would disable a whole group if
    the event that's being closed is the group leader, whereas the new
    code keeps the group siblings enabled.

    People should not rely on this behaviour and I don't think they do,
    but in case we find they do, the fix is easy and we have to take the
    double IPI cost.

    Signed-off-by: Peter Zijlstra
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20110409192142.038377551@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
  • This was scattered out - refactor it into a single function.
    No change in functionality.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.979862055@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
  • Instead of tracking if a context is active or not, track which events
    of the context are active. By making it a bitmask of
    EVENT_PINNED|EVENT_FLEXIBLE we can simplify some of the scheduling
    routines since it can avoid adding events that are already active.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.930282378@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
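
Tracking which event classes are active as a bitmask, as the commit above describes, might look like this. The numeric flag values and the `sched_in` helper are assumptions made for the sketch; the flag names come from the message and `EVENT_ALL` from the related commits.

```python
EVENT_PINNED = 0x1     # illustrative values, not taken from the kernel source
EVENT_FLEXIBLE = 0x2
EVENT_ALL = EVENT_PINNED | EVENT_FLEXIBLE


def sched_in(is_active, event_type):
    """Return (new_is_active, classes_to_add), skipping already active ones."""
    to_add = event_type & ~is_active
    return is_active | event_type, to_add


# Pinned events are already scheduled; now everything is requested.
is_active, to_add = sched_in(EVENT_PINNED, EVENT_ALL)
print(is_active == EVENT_ALL, to_add == EVENT_FLEXIBLE)
```

Masking with `~is_active` is what lets a scheduling routine avoid adding events that are already active.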
  • Currently __perf_install_in_context() will try and schedule in the
    event irrespective of our event scheduling rules, that is, we try to
    schedule CPU-pinned, TASK-pinned, CPU-flexible, TASK-flexible, but
    when creating a new event we simply try and schedule it on top of
    whatever is already on the PMU, this can lead to errors for pinned events.

    Therefore, simplify things and simply schedule everything out, add the
    event to the corresponding context and schedule everything back in.

    This also nicely handles the case where with
    __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI can come right in the middle
    of schedule, before we managed to call perf_event_task_sched_in().

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.870894224@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
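
The approach in the commit above (schedule everything out, add the event, schedule everything back in by the usual rules) can be modelled loosely as below. This is a toy: the fixed `capacity` limit, the tuple representation of events and the `install` helper are all assumptions for illustration, and real PMU scheduling has many more constraints.

```python
# Scheduling order named in the commit message.
ORDER = ["cpu-pinned", "task-pinned", "cpu-flexible", "task-flexible"]


def install(active, new_event, capacity):
    """Reschedule from scratch: out, add the new event, back in by priority."""
    events = list(active) + [new_event]              # everything out, then add
    events.sort(key=lambda ev: ORDER.index(ev[1]))   # stable sort by class
    return events[:capacity]                         # back in until PMU is full


active = [("a", "cpu-flexible"), ("b", "task-flexible")]
scheduled = install(active, ("c", "task-pinned"), capacity=2)
print(scheduled)
```

Layering the new event on top of whatever is already scheduled would leave the pinned event `c` without a counter; rescheduling from scratch gives it priority over the flexible ones.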
  • Make task_ctx_sched_*() imply EVENT_ALL, since anything less will not
    actually have scheduled the task in/out at all.

    Since there's no site that schedules all of a task in (due to the
    interleave with flexible cpuctx) we can remove this function.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.817893268@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
  • Currently we only hold one ctx->lock at a time, which results in us
    flipping back and forth between cpuctx->ctx.lock and task_ctx->lock.

    Avoid this and gain large atomic regions by holding both locks. We
    nest the task lock inside the cpu lock, since with task scheduling we
    might have to change task ctx while holding the cpu ctx lock.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.769881865@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
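
The lock nesting described in the commit above, with the task lock taken inside the cpu lock, can be sketched with ordinary Python locks. The names `cpu_ctx_lock`, `task_ctx_lock` and `ctx_lock` are hypothetical, and user-space threading locks only approximate kernel spinlocks.

```python
import threading
from contextlib import contextmanager

cpu_ctx_lock = threading.Lock()    # stand-in for cpuctx->ctx.lock
task_ctx_lock = threading.Lock()   # stand-in for task_ctx->lock


@contextmanager
def ctx_lock():
    """Hold both locks for one region, always cpu first and task second."""
    with cpu_ctx_lock:
        with task_ctx_lock:
            yield


with ctx_lock():
    # Both contexts can be modified atomically here.
    held_inside = cpu_ctx_lock.locked() and task_ctx_lock.locked()

held_after = cpu_ctx_lock.locked() or task_ctx_lock.locked()
print(held_inside, held_after)
```

Always acquiring in the same order avoids deadlocks between paths that need both locks.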
  • Small cleanup to how we refcount in find_get_context(), this also
    allows us to use put_ctx() to free things instead of using kfree().

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.719340481@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
  • Oleg noted that ctx_sched_out() disables the PMU even though it might
    not actually do something, avoid needless PMU-disabling.

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.665385503@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra

28 May, 2011

1 commit

  • Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
    explicitly push the wakeup (including signals) when requested.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra

20 May, 2011

1 commit

  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits)
    Revert "rcu: Decrease memory-barrier usage based on semi-formal proof"
    net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
    batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu
    batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree()
    batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu
    net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu()
    net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu()
    net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu()
    perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu()
    perf,rcu: convert call_rcu(free_ctx) to kfree_rcu()
    net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(net_generic_release) to kfree_rcu()
    net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu()
    net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu()
    security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu()
    net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu()
    net,rcu: convert call_rcu(xps_map_release) to kfree_rcu()
    net,rcu: convert call_rcu(rps_map_release) to kfree_rcu()

    Linus Torvalds

04 May, 2011

1 commit

03 May, 2011

2 commits

  • As part of the events subsystem unification, relocate hw_breakpoint.c
    into its new destination.

    Cc: Frederic Weisbecker
    Signed-off-by: Borislav Petkov

    Borislav Petkov
  • mv kernel/perf_event.c -> kernel/events/core.c. From there, all further
    sensible splitting can happen. The idea is that due to perf_event.c
    becoming pretty sizable and with the advent of the marriage with ftrace,
    splitting functionality into its logical parts should help speed up
    the unification and to manage the complexity of the subsystem.

    Signed-off-by: Borislav Petkov

    Borislav Petkov