10 Jan, 2012

1 commit

  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignement out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
    Kconfig: acpi: Fix typo in comment.
    misc latin1 to utf8 conversions
    devres: Fix a typo in devm_kfree comment
    btrfs: free-space-cache.c: remove extra semicolon.
    fat: Spelling s/obsolate/obsolete/g
    SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
    tools/power turbostat: update fields in manpage
    mac80211: drop spelling fix
    types.h: fix comment spelling for 'architectures'
    typo fixes: aera -> area, exntension -> extension
    devices.txt: Fix typo of 'VMware'.
    sis900: Fix enum typo 'sis900_rx_bufer_status'
    decompress_bunzip2: remove invalid vi modeline
    treewide: Fix comment and string typo 'bufer'
    hyper-v: Update MAINTAINERS
    treewide: Fix typos in various parts of the kernel, and fix some comments.
    clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
    gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
    leds: Kconfig: Fix typo 'D2NET_V2'
    sound: Kconfig: drop unknown symbol ARCH_CLPS7500
    ...

    Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
    kconfig additions, close to removed commented-out old ones)

    Linus Torvalds
     

07 Jan, 2012

2 commits

  • * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (106 commits)
    perf kvm: Fix copy & paste error in description
    perf script: Kill script_spec__delete
    perf top: Fix a memory leak
    perf stat: Introduce get_ratio_color() helper
    perf session: Remove impossible condition check
    perf tools: Fix feature-bits rework fallout, remove unused variable
    perf script: Add generic perl handler to process events
    perf tools: Use for_each_set_bit() to iterate over feature flags
    perf tools: Unify handling of features when writing feature section
    perf report: Accept fifos as input file
    perf tools: Moving code in some files
    perf tools: Fix out-of-bound access to struct perf_session
    perf tools: Continue processing header on unknown features
    perf tools: Improve macros for struct feature_ops
    perf: builtin-record: Document and check that mmap_pages must be a power of two.
    perf: builtin-record: Provide advice if mmap'ing fails with EPERM.
    perf tools: Fix truncated annotation
    perf script: look up thread using tid instead of pid
    perf tools: Look up thread names for system wide profiling
    perf tools: Fix comm for processes with named threads
    ...

    Linus Torvalds
     
  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    cpu: Export cpu_up()
    rcu: Apply ACCESS_ONCE() to rcu_boost() return value
    Revert "rcu: Permit rt_mutex_unlock() with irqs disabled"
    docs: Additional LWN links to RCU API
    rcu: Augment rcu_batch_end tracing for idle and callback state
    rcu: Add rcutorture tests for srcu_read_lock_raw()
    rcu: Make rcutorture test for hotpluggability before offlining CPUs
    driver-core/cpu: Expose hotpluggability to the rest of the kernel
    rcu: Remove redundant rcu_cpu_stall_suppress declaration
    rcu: Adaptive dyntick-idle preparation
    rcu: Keep invoking callbacks if CPU otherwise idle
    rcu: Irq nesting is always 0 on rcu_enter_idle_common
    rcu: Don't check irq nesting from rcu idle entry/exit
    rcu: Permit dyntick-idle with callbacks pending
    rcu: Document same-context read-side constraints
    rcu: Identify dyntick-idle CPUs on first force_quiescent_state() pass
    rcu: Remove dynticks false positives and RCU failures
    rcu: Reduce latency of rcu_prepare_for_idle()
    rcu: Eliminate RCU_FAST_NO_HZ grace-period hang
    rcu: Avoid needlessly IPIing CPUs at GP end
    ...

    Linus Torvalds
     

14 Dec, 2011

1 commit

  • Commit 10c6db11 ("perf: Fix loss of notification with multi-event")
    seems to unconditionally dereference event->rb in the wakeup handler.
    This is wrong: there might not be a buffer attached.

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111213152651.GP20297@mudshark.cambridge.arm.com
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Will Deacon
     

13 Dec, 2011

1 commit

  • Now that subsys->can_attach() and attach() take @tset instead of
    @task, they can handle per-task operations. Convert
    ->can_attach_task() and ->attach_task() users to use ->can_attach()
    and ->attach() instead. Most conversions are straightforward.
    Noteworthy changes are:

    * In cgroup_freezer, remove unnecessary NULL assignments to unused
    methods. It's useless and very prone to get out of sync, which
    already happened.

    * In cpuset, PF_THREAD_BOUND test is checked for each task. This
    doesn't make any practical difference but is conceptually cleaner.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: James Morris
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     

07 Dec, 2011

1 commit

  • perf_event_sched_in() shouldn't try to schedule task events if there
    are none; otherwise the task's ctx->is_active will be set and will not be
    cleared during sched_out. This will prevent newly added events from
    being scheduled into the task context.

    Fixes a boo-boo in commit 1d5f003f5a9 ("perf: Do not set task_ctx
    pointer in cpuctx if there are no events in the context").

    Signed-off-by: Gleb Natapov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111122140821.GF2557@redhat.com
    Signed-off-by: Ingo Molnar

    Gleb Natapov
     

06 Dec, 2011

5 commits

  • jump_label patching is a very expensive operation that involves pausing all
    cpus. The patching of the perf_sched_events jump_label is easily controllable
    from userspace by an unprivileged user.

    When the user runs a loop like this:

    "while true; do perf stat -e cycles true; done"

    ... the performance of my test application that just increments a counter
    for one second drops by 4%.

    This is on a 16 cpu box with my test application using only one of
    them. An impact on a real server doing real work will be worse.

    Performance of the KVM PMU drops nearly 50% due to jump_label for "perf
    record", since the KVM PMU implementation creates and destroys perf
    events frequently.

    This patch introduces a way to rate limit jump_label patching and uses
    it to fix the above problem.

    I believe that as jump_label use spreads, the problem will become more
    common, and thus solving it in generic code is appropriate. Fixing it in
    the perf code instead would mean moving the jump_label accounting logic
    into perf code, with all the ifdefs needed for a JUMP_LABEL=n kernel. With
    this patch all the details are nicely hidden inside the jump_label code.
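
    For illustration, a minimal userspace model of the rate-limiting idea (the
    names below are invented; this is not the kernel interface): the expensive
    patch-out step is deferred by a timeout, so a tight create/destroy loop
    only touches a counter.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    struct deferred_key {
        int refcount;          /* how many users currently need the key */
        int patched_in;        /* expensive global state, models code patching */
        time_t zero_since;     /* when refcount last dropped to zero */
        int timeout_sec;       /* rate limit before actually unpatching */
    };

    static void expensive_patch(struct deferred_key *k, int on)
    {
        printf("patching code %s (pauses all cpus)\n", on ? "in" : "out");
        k->patched_in = on;
    }

    static void key_inc(struct deferred_key *k)
    {
        if (k->refcount++ == 0 && !k->patched_in)
            expensive_patch(k, 1);
    }

    static void key_dec_deferred(struct deferred_key *k)
    {
        if (--k->refcount == 0)
            k->zero_since = time(NULL);   /* note the time, don't unpatch yet */
    }

    static void key_timer_tick(struct deferred_key *k)
    {
        /* called periodically; unpatch only after the grace period */
        if (k->refcount == 0 && k->patched_in &&
            time(NULL) - k->zero_since >= k->timeout_sec)
            expensive_patch(k, 0);
    }

    int main(void)
    {
        struct deferred_key k = { .timeout_sec = 1 };

        /* a tight create/destroy loop (the perf stat loop above) patches once */
        for (int i = 0; i < 1000; i++) {
            key_inc(&k);
            key_dec_deferred(&k);
        }
        key_timer_tick(&k);   /* too early: stays patched in */
        sleep(2);
        key_timer_tick(&k);   /* grace period over: patched out once */
        return 0;
    }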

    Signed-off-by: Gleb Natapov
    Acked-by: Jason Baron
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111127155909.GO2557@redhat.com
    Signed-off-by: Ingo Molnar

    Gleb Natapov
     
  • Deng-Cheng Zhu reported that sibling events that were created disabled
    with enable_on_exec would never get enabled. Iterate all events
    instead of the group lists.
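
    A rough userspace sketch of the affected setup, using perf_event_open(2)
    (error handling trimmed): both the group leader and a sibling are created
    disabled with enable_on_exec set, and both are expected to be enabled when
    the monitored program exec()s.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        int leader, sibling;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.disabled = 1;          /* start disabled ... */
        attr.enable_on_exec = 1;    /* ... and enable automatically on exec() */

        leader = perf_event_open(&attr, 0, -1, -1, 0);

        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        /* the sibling carries the same flags; the bug was that only group
           leaders were considered by the enable_on_exec pass */
        sibling = perf_event_open(&attr, 0, -1, leader, 0);

        if (leader < 0 || sibling < 0) {
            perror("perf_event_open");
            return 1;
        }
        /* both events should start counting here (a real tool would read the
           counts from a parent process afterwards) */
        execlp("true", "true", (char *)NULL);
        return 1;
    }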

    Reported-by: Deng-Cheng Zhu
    Tested-by: Deng-Cheng Zhu
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322048382.14799.41.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-yv4o74vh90suyghccgykbnry@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Gleb writes:

    > Currently pmu is disabled and re-enabled on each timer interrupt even
    > when no rotation or frequency adjustment is needed. On Intel CPU this
    > results in two writes into PERF_GLOBAL_CTRL MSR per tick. On bare metal
    > it does not cause significant slowdown, but when running perf in a virtual
    > machine it leads to 20% slowdown on my machine.

    Cure this by keeping a perf_event_context::nr_freq counter that counts the
    number of active events that require frequency adjustments and use this in a
    similar fashion to the already existing nr_events != nr_active test in
    perf_rotate_context().

    By being able to exclude both rotation and frequency adjustments a priori for
    the common case we can avoid the otherwise superfluous PMU disable.
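
    A simplified model of that guard (not the actual kernel code), shown below:
    the tick only disables and re-enables the PMU when there is rotation or
    frequency work to do.

    #include <stdio.h>

    struct ctx_model {
        int nr_events;   /* events attached to the context */
        int nr_active;   /* events currently programmed on the PMU */
        int nr_freq;     /* active events needing frequency adjustment */
    };

    static int timer_tick(struct ctx_model *ctx)
    {
        int rotate = ctx->nr_events != ctx->nr_active;  /* multiplexing needed? */

        if (!rotate && ctx->nr_freq == 0)
            return 0;    /* common case: no PMU disable, no MSR writes */

        /* pmu_disable(); adjust frequencies and/or rotate; pmu_enable(); */
        return 1;
    }

    int main(void)
    {
        struct ctx_model quiet = { .nr_events = 2, .nr_active = 2, .nr_freq = 0 };
        struct ctx_model busy  = { .nr_events = 6, .nr_active = 4, .nr_freq = 1 };

        printf("fixed period, no multiplexing: PMU touched = %d\n",
               timer_tick(&quiet));
        printf("freq events and multiplexing:  PMU touched = %d\n",
               timer_tick(&busy));
        return 0;
    }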

    Suggested-by: Gleb Natapov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-515yhoatehd3gza7we9fapaa@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: Add these cherry-picked commits so that future changes
    on perf/core don't conflict.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Dec, 2011

1 commit

  • When you do:
    $ perf record -e cycles,cycles,cycles noploop 10

    You expect about 10,000 samples for each event, i.e., 10s at
    1000 samples/sec. However, this is not what happens: you
    get far fewer samples, maybe 3700 samples/event:

    $ perf report -D | tail -15
    Aggregated stats:
    TOTAL events: 10998
    MMAP events: 66
    COMM events: 2
    SAMPLE events: 10930
    cycles stats:
    TOTAL events: 3644
    SAMPLE events: 3644
    cycles stats:
    TOTAL events: 3642
    SAMPLE events: 3642
    cycles stats:
    TOTAL events: 3644
    SAMPLE events: 3644

    On an Intel Nehalem or even AMD64, there are 4 counters capable
    of measuring cycles, so there is plenty of space to measure those
    events without multiplexing (even with the NMI watchdog active).
    And even with multiplexing, we'd expect roughly the same number
    of samples per event.

    The root of the problem was that when the event that caused the buffer
    to become full was not the first event passed on the cmdline, the user
    notification would get lost. The notification was sent to the file
    descriptor of the overflowed event but the perf tool was not polling
    on it. The perf tool aggregates all samples into a single buffer,
    i.e., the buffer of the first event. Consequently, it assumes
    notifications for any event will come via that descriptor.

    The seemingly straightforward solution of moving the waitq into the
    ringbuffer object doesn't work because of lifetime issues. One could
    perf_event_set_output() on a fd that you're also blocking on and cause
    the old rb object to be freed while its waitq would still be
    referenced by the blocked thread -> FAIL.

    Therefore link all events to the ringbuffer and broadcast the wakeup
    from the ringbuffer object to all possible events that could be waited
    upon. This is rather ugly, and we're open to better solutions but it
    works for now.
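
    The setup in question looks roughly like the userspace sketch below
    (perf_event_open(2) plus PERF_EVENT_IOC_SET_OUTPUT; error handling omitted,
    numbers arbitrary): every event redirects its output into the first event's
    buffer and only the first fd is polled.

    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/perf_event.h>

    #define NEVENTS 3

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        long psz = sysconf(_SC_PAGESIZE);
        int fd[NEVENTS];
        void *buf;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        attr.sample_type = PERF_SAMPLE_IP;
        attr.wakeup_events = 1;

        for (int i = 0; i < NEVENTS; i++)
            fd[i] = perf_event_open(&attr, 0, -1, -1, 0);

        /* all samples are aggregated into the first event's ring buffer
           (1 header page + 8 data pages) */
        buf = mmap(NULL, (1 + 8) * psz, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd[0], 0);
        for (int i = 1; i < NEVENTS; i++)
            ioctl(fd[i], PERF_EVENT_IOC_SET_OUTPUT, fd[0]);

        /* like the perf tool, only the first fd is polled; before the fix an
           overflow of fd[1] or fd[2] woke up only its own descriptor, so this
           poll() could sleep while the shared buffer filled up */
        struct pollfd pfd = { .fd = fd[0], .events = POLLIN };
        poll(&pfd, 1, 1000);

        (void)buf;
        return 0;
    }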

    Reported-by: Stephane Eranian
    Finished-by: Stephane Eranian
    Reviewed-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Nov, 2011

3 commits

  • This patch solves the following problem:

    Currently some samples may be lost due to throttling. The number of samples
    is restricted by sysctl_perf_event_sample_rate/HZ. A trace event is
    divided into several samples according to the event's period. I'm not sure
    we should generate more than one sample for each trace event; I think the
    better way is to use SAMPLE_PERIOD.

    E.g.: I want to trace when a process sleeps. I created a process which
    sleeps for 1ms and for 4ms. perf got 100 events in both cases.

    swapper 0 [000] 1141.371830: sched_stat_sleep: comm=foo pid=1801 delay=1386750 [ns]
    swapper 0 [000] 1141.369444: sched_stat_sleep: comm=foo pid=1801 delay=4499585 [ns]

    In the first case the kernel wants to send 4499585 events and
    in the second case it wants to send 1386750 events.
    perf report shows that the process sleeps for an equal amount of time in
    both places. That's a bug.

    With this patch the kernel generates one event for each "sleep" and the
    time slice is saved in the "period" field. Perf knows how to handle it.
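
    For illustration, a userspace sketch of what this enables (the tracepoint
    id path may differ on other systems; error handling is minimal): open the
    sched_stat_sleep tracepoint with sample_period = 1 and request
    PERF_SAMPLE_PERIOD, so each sleep yields one sample whose period field
    carries the sleep time.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* assumes debugfs is mounted here */
    #define TP_ID "/sys/kernel/debug/tracing/events/sched/sched_stat_sleep/id"

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        long long tp_id;
        FILE *f = fopen(TP_ID, "r");

        if (!f || fscanf(f, "%lld", &tp_id) != 1) {
            perror(TP_ID);
            return 1;
        }
        fclose(f);

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.config = tp_id;
        attr.sample_period = 1;                      /* one sample per event */
        attr.sample_type = PERF_SAMPLE_TIME | PERF_SAMPLE_RAW |
                           PERF_SAMPLE_PERIOD;       /* period = sleep time   */

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }
        /* mmap a ring buffer and parse the records here; each sched_stat_sleep
           event produces exactly one sample with the delay in its period field */
        close(fd);
        return 0;
    }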

    Signed-off-by: Andrew Vagin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1320670457-2633428-3-git-send-email-avagin@openvz.org
    Signed-off-by: Ingo Molnar

    Andrew Vagin
     
  • Split the callchain code from the perf events core into
    a new kernel/events/callchain.c file.

    This simplifies the big core.c a bit.

    Signed-off-by: Borislav Petkov
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    [keep ctx recursion handling inline and use internal headers]
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1318778104-17152-1-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     
  • Do not set task_ctx pointer during sched_in if there are no
    events associated with the context. Otherwise, if the total number of
    events in the system drops to zero during task execution,
    perf_event_context_sched_out() will not be called and cpuctx->task_ctx
    will be left with a stale value.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111023171033.GI17571@redhat.com
    Signed-off-by: Ingo Molnar

    Gleb Natapov
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <module.h>
    net: inet_timewait_sock doesnt need <module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

04 Nov, 2011

1 commit

  • The legacy x86 nmi watchdog code was removed with the implementation
    of the perf based nmi watchdog. This broke Oprofile's nmi timer
    mode. To run nmi timer mode we relied on a continuous ticking nmi
    source which the nmi watchdog provided. The nmi tick was no longer
    available and the current watchdog cannot be used anymore since it runs
    with very long periods in the range of seconds. This patch
    reimplements the nmi timer mode using a perf counter nmi source.

    V2:
    * removing pr_info()
    * fix undefined reference to `__udivdi3' for 32 bit build
    * fix section mismatch of .cpuinit.data:nmi_timer_cpu_nb
    * removed nmi timer setup in arch/x86
    * implemented function stubs for op_nmi_init/exit()
    * made code more readable in oprofile_init()

    V3:
    * fix architectural initialization in oprofile_init()
    * fix CONFIG_OPROFILE_NMI_TIMER dependencies

    Acked-by: Peter Zijlstra
    Signed-off-by: Robert Richter

    Robert Richter
     

03 Nov, 2011

1 commit

  • This reverts commit 144060fee07e9c22e179d00819c83c86fbcbf82c.

    It causes a resume regression for Andi on his Acer Aspire 1830T post
    3.1. The screen just stays black after wakeup.

    Also, it really looks like the wrong way to suspend and resume perf
    events: I think they should be done as part of the CPU suspend and
    resume, rather than as a notifier that does smp_call_function().

    Reported-by: Andi Kleen
    Acked-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Nov, 2011

2 commits

  • Some kernel components pin user space memory (infiniband and perf) (by
    increasing the page count) and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockall()ed process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    Once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needed to pin the RDMA
    memory.

    This patch introduces a separate counter for pinned pages and
    accounts them separately.
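
    A small userspace illustration of the distinction, assuming the VmLck and
    VmPin lines exported in /proc/<pid>/status and an RLIMIT_MEMLOCK large
    enough for the mlock() to succeed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    static void show(const char *when)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
            return;
        printf("%s:\n", when);
        while (fgets(line, sizeof(line), f))
            if (!strncmp(line, "VmLck", 5) || !strncmp(line, "VmPin", 5))
                printf("  %s", line);
        fclose(f);
    }

    int main(void)
    {
        size_t len = 1 << 20;                 /* 1 MiB */
        void *p = malloc(len);

        show("before mlock");
        /* mlock() bumps VmLck only; pinning (e.g. by infiniband or perf) is
           what the separate counter, shown as VmPin, accounts for */
        if (p && mlock(p, len) == 0)
            show("after mlock");
        return 0;
    }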

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • These files were getting <module.h> via an implicit non-obvious
    path, but we want to crush those out of existence since they cost
    time during compiles of processing thousands of lines of headers
    for no reason. Give them the lightweight header that just contains
    the EXPORT_SYMBOL infrastructure.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

31 Aug, 2011

1 commit

  • We detected a serious issue with PERF_SAMPLE_READ and
    timing information when events were being multiplexed.

    Samples would have time_running > time_enabled. That
    was easy to reproduce with a libpfm4 example (ran 3
    times to cause multiplexing on Core 2):

    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
    syst_smpl: WARNING: time_running > time_enabled
    63277537998 uops_retired:freq=1 , scaled

    The bug was not present in kernels up to (and including) 3.0. It turns
    out the bug was introduced by the following commit:

    commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa

    events: Move lockless timer calculation into helper function

    The parameters of the function got reversed yet the call sites
    were not updated to reflect the change. That led to time_running
    and time_enabled being swapped. That had no effect when there was
    no multiplexing because in that case time_running = time_enabled
    but it would show up in any other scenario.
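
    The invariant can be checked from userspace with a sketch like the one
    below (perf_event_open(2) with PERF_FORMAT_TOTAL_TIME_ENABLED and
    PERF_FORMAT_TOTAL_TIME_RUNNING; error handling trimmed):

    #include <inttypes.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t v[3];   /* value, time_enabled, time_running */

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                           PERF_FORMAT_TOTAL_TIME_RUNNING;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0)
            return 1;

        for (volatile long i = 0; i < 50000000; i++)
            ;                                   /* burn some cycles */

        if (read(fd, v, sizeof(v)) == sizeof(v)) {
            printf("count=%" PRIu64 " enabled=%" PRIu64 " running=%" PRIu64 "\n",
                   v[0], v[1], v[2]);
            if (v[2] > v[1])                    /* the symptom of the bug */
                fprintf(stderr, "WARNING: time_running > time_enabled\n");
        }
        close(fd);
        return 0;
    }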

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
    Signed-off-by: Ingo Molnar

    Eric B Munson
     

29 Aug, 2011

1 commit

  • The current cgroup context switch code was incorrect, leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any
    cgroup switches in case the outgoing and incoming tasks are
    in the same cgroup.

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ /pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts for
    the non-cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

14 Aug, 2011

2 commits

  • Currently, an event's 'pmu' field is set after pmu::event_init() is
    called. This means that pmu::event_init() must figure out which struct
    pmu the event was initialised from. This makes it difficult to
    consolidate common event initialisation code for similar PMUs, and
    very difficult to implement drivers for PMUs which can have multiple
    instances (e.g. a USB controller PMU, a GPU PMU, etc).

    This patch sets the 'pmu' field before initialising the event, allowing
    event init code to identify the struct pmu instance easily. In the
    event of failure to initialise an event, the event is destroyed via
    kfree() without calling perf_event::destroy(), so this shouldn't
    result in bad behaviour even if the destroy field was set before
    failure to initialise was noted.
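
    A toy userspace model of the ordering (not the kernel code): the pmu field
    is filled in before event_init() runs, so shared init code can tell which
    PMU instance the event belongs to.

    #include <stdio.h>
    #include <stdlib.h>

    struct pmu;

    struct event {
        struct pmu *pmu;
        int config;
    };

    struct pmu {
        const char *name;
        int (*event_init)(struct event *ev);
    };

    static int my_pmu_event_init(struct event *ev)
    {
        /* with the new ordering ev->pmu is already valid here, even in init
           code shared between several PMU instances */
        printf("initialising event 0x%x for %s\n", ev->config, ev->pmu->name);
        return 0;
    }

    static struct event *event_alloc(struct pmu *pmu, int config)
    {
        struct event *ev = calloc(1, sizeof(*ev));

        if (!ev)
            return NULL;
        ev->config = config;
        ev->pmu = pmu;                    /* set *before* calling event_init() */
        if (pmu->event_init(ev)) {
            free(ev);                     /* freed without calling ev->destroy() */
            return NULL;
        }
        return ev;
    }

    int main(void)
    {
        /* "uncore0" is a made-up instance name for the sake of the example */
        struct pmu pmu = { .name = "uncore0", .event_init = my_pmu_event_init };
        struct event *ev = event_alloc(&pmu, 0x3c);

        free(ev);
        return 0;
    }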

    Signed-off-by: Mark Rutland
    Reviewed-by: Will Deacon
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1313062280-19123-1-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     
  • Francis reports that s2r gets him spurious NMIs, this is because the
    suspend code leaves the boot cpu up and running.

    Cure this by adding a suspend notifier. The problem is that hotplug
    and suspend are completely un-serialized and the PM notifiers run
    before the suspend cpu unplug of all but the boot cpu.

    This leaves a window where the user can initiate another hotplug
    operation (either remove or add a cpu) resulting in either one too
    many or one too few hotplug ops. Thus we cannot use the hotplug code
    for the suspend case.

    There's another reason to not use the hotplug code, which is that the
    hotplug code totally destroys the perf state, we can do better for
    suspend and simply remove all counters from the PMU so that we can
    re-instate them on resume.

    Reported-by: Francis Moreau
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-1cvevybkgmv4s6v5y37t4847@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Jul, 2011

1 commit

  • A PMU type id can be allocated dynamically, so the perf_event_attr::type check
    made when copying the attribute from userspace to the kernel is not valid.
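
    A sketch of how a dynamically allocated type id is obtained and used from
    userspace, assuming the PMU exports it via
    /sys/bus/event_source/devices/<pmu>/type (the config value is arbitrary and
    PMU-specific):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(int argc, char **argv)
    {
        const char *pmu = argc > 1 ? argv[1] : "cpu";   /* PMU name, e.g. "cpu" */
        struct perf_event_attr attr;
        char path[256];
        int type, fd;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", pmu);
        f = fopen(path, "r");
        if (!f || fscanf(f, "%d", &type) != 1) {
            perror(path);
            return 1;
        }
        fclose(f);

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;     /* may well be beyond the static PERF_TYPE_* range */
        attr.config = 0;      /* raw, PMU-specific encoding */

        fd = perf_event_open(&attr, 0, -1, -1, 0);
        printf("pmu %s: type=%d fd=%d\n", pmu, type, fd);
        return fd < 0;
    }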

    Signed-off-by: Lin Ming
    Cc: Robert Richter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309421396-17438-4-git-send-email-ming.m.lin@intel.com
    Signed-off-by: Ingo Molnar

    Lin Ming
     

01 Jul, 2011

8 commits

  • KVM needs one-shot samples, since a PMC programmed to -X will fire after X
    events and then again after 2^40 events (i.e. variable period).
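
    From userspace, the closest analogue of a one-shot sample is the
    PERF_EVENT_IOC_REFRESH pattern; a rough sketch (error handling trimmed,
    period arbitrary):

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static volatile sig_atomic_t overflows;

    static void on_sigio(int sig) { (void)sig; overflows++; }

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.sample_period = 10000000;     /* the PMC is programmed to -X */
        attr.disabled = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0)
            return 1;

        signal(SIGIO, on_sigio);
        fcntl(fd, F_SETOWN, getpid());
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);

        /* arm for exactly one overflow notification; the event disables
           itself again after it fires */
        ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);

        for (volatile long i = 0; i < 100000000; i++)
            ;                              /* retire plenty of instructions */

        printf("overflow notifications: %d\n", (int)overflows);   /* expect 1 */
        close(fd);
        return 0;
    }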

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-4-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
     
  • The perf_event overflow handler does not receive any caller-derived
    argument, so many callers need to resort to looking up the perf_event
    in their local data structure. This is ugly and doesn't scale if a
    single callback services many perf_events.

    Fix by adding a context parameter to perf_event_create_kernel_counter()
    (and derived hardware breakpoints APIs) and storing it in the perf_event.
    The field can be accessed from the callback as event->overflow_handler_context.
    All callers are updated.

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
     
  • Since only samples call perf_output_sample() it's much saner (and more
    correct) to put the sample logic in there than in the
    perf_output_begin()/perf_output_end() pair.

    Saves a useless argument, reduces conditionals and shrinks
    struct perf_output_handle, win!

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nmi parameter indicated if we could do wakeups from the current
    context; if not, we would set some state and self-IPI and let the
    resulting interrupt do the wakeup.

    For the various event classes:

    - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
    the PMI-tail (ARM etc.)
    - tracepoint: nmi=0; since tracepoint could be from NMI context.
    - software: nmi=[0,1]; some, like the schedule thing cannot
    perform wakeups, and hence need 0.

    As one can see, there is very little nmi=1 usage, and the down-side of
    not using it is that on some platforms some software events can have a
    jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

    The up-side however is that we can remove the nmi parameter and save a
    bunch of conditionals in fast paths.

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: Will Deacon
    Cc: Deng-Cheng Zhu
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Frederic Weisbecker
    Cc: Jason Wessel
    Cc: Don Zickus
    Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The event tracing infrastructure exposes two timers which should be updated
    each time the value of the counter is updated. Currently, these counters are
    only updated when userspace calls read() on the fd associated with an event.
    This means that counters which are read via the mmap'd page exclusively never
    have their timers updated. This patch ensures that the timers are updated
    each time the values in the mmap'd page are updated.
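
    A minimal self-monitoring sketch that reads those timers from the mmap'd
    page (header page only, no data pages; error handling trimmed):

    #include <inttypes.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0)
            return 1;

        /* mapping just the header page is enough for self-monitoring reads */
        struct perf_event_mmap_page *pc =
            mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
        if (pc == MAP_FAILED)
            return 1;

        for (volatile long i = 0; i < 50000000; i++)
            ;

        /* the fields the patch keeps up to date for mmap-only users, instead
           of refreshing them only on read() */
        printf("time_enabled=%" PRIu64 " time_running=%" PRIu64 "\n",
               (uint64_t)pc->time_enabled, (uint64_t)pc->time_running);
        return 0;
    }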

    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308932786-5111-1-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
     
  • Take the timer calculation from perf_output_read and move it to a helper
    function for any place that needs timer values but cannot take the ctx->lock.

    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308861279-15216-2-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
     
  • Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308861279-15216-1-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
     
  • Since 2.6.36 (specifically commit d57e34fdd60b, "perf: Simplify the
    ring-buffer logic: make perf_buffer_alloc() do everything needed"),
    the perf_buffer_init() code has been mis-setting the buffer watermark
    if perf_event_attr.wakeup_events has a non-zero value.

    This is because perf_event_attr.wakeup_events is a union with
    perf_event_attr.wakeup_watermark.

    This commit re-enables the check for perf_event_attr.watermark being
    set before continuing with setting a non-default watermark.

    This bug is most noticeable when you are trying to use PERF_IOC_REFRESH
    with a value larger than one and perf_event_attr.wakeup_events is set to
    one. In this case the buffer watermark will be set to 1 and you will
    get extraneous POLL_IN overflows rather than POLL_HUP as expected.

    [ avoid using attr.wakeup_events when attr.watermark is set ]
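
    A sketch of the two union members in perf_event_attr and how each wakeup
    mode is meant to be selected (values arbitrary):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        attr.sample_type = PERF_SAMPLE_IP;

        /* wakeup_events and wakeup_watermark share storage in the attr */
        attr.watermark = 0;
        attr.wakeup_events = 1;            /* wake up after every N samples */

        /* the byte-based mode would instead be selected like this:
         *   attr.watermark = 1;
         *   attr.wakeup_watermark = 32 * 1024;
         * the bug was treating wakeup_events as a watermark even when
         * attr.watermark was not set
         */

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        printf("fd=%d\n", fd);
        if (fd >= 0)
            close(fd);
        return 0;
    }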

    Signed-off-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1106011506390.5384@cl320.eecs.utk.edu
    Signed-off-by: Ingo Molnar

    Vince Weaver
     

09 Jun, 2011

1 commit

  • And create the internal perf events header.

    v2: Keep an internal inlined perf_output_copy()

    Signed-off-by: Frederic Weisbecker
    Acked-by: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Stephane Eranian
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1305827704-5607-1-git-send-email-fweisbec@gmail.com
    [ v3: use clearer 'ring_buffer' and 'rb' naming ]
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker