04 Oct, 2013

1 commit

  • While auditing the list_entry usage due to a trinity bug, I found that
    perf_pmu_migrate_context() violates the rules for
    perf_event::event_entry.

    The problem is that perf_event::event_entry is an RCU list element, and
    hence we must wait for a full RCU grace period before re-using the
    element after deletion.

    Therefore the usage in perf_pmu_migrate_context() which re-uses the
    entry immediately is broken. For now introduce another list_head into
    perf_event for this specific usage.

    This doesn't actually fix the trinity report because that never goes
    through this code.
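
    Below is a minimal sketch of the shape of such a fix, assuming the new
    list head is called migrate_entry (only perf_event::event_entry is named
    in the text above; every other name here is illustrative):

    /* Sketch only; all names other than event_entry are assumptions. */
    struct perf_event {
        /* ... */
        struct list_head event_entry;   /* RCU list: needs a grace period  */
        struct list_head migrate_entry; /* plain list: reusable right away */
        /* ... */
    };

    static void migrate_events_sketch(struct list_head *events,
                                      struct list_head *staging)
    {
        struct perf_event *event, *tmp;

        list_for_each_entry_safe(event, tmp, events, event_entry) {
            remove_event_from_ctx(event);   /* hypothetical helper that
                                               RCU-deletes event_entry   */
            /* event_entry must not be reused before a grace period, so
               park the event on the private list head instead.          */
            list_add(&event->migrate_entry, staging);
        }
    }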

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Sep, 2013

1 commit

  • Solve the problems around the broken definition of the
    perf_event_mmap_page::cap_usr_time and cap_usr_rdpmc fields, which used
    to overlap; this was partially fixed by:

    860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

    The problem with the fix (merged in v3.12-rc1 and not yet released
    officially), noticed by Vince Weaver, is that the new behavior is
    not detectable by new user-space, and that due to the reuse of the
    field names it's easy to mis-compile a binary if old headers are used
    on a new kernel or new headers are used on an old kernel.

    To solve all that make this change explicit, detectable and self-contained,
    by iterating the ABI the following way:

    - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
    confuse old user-space binaries. RDPMC will be marked as unavailable
    to old binaries, but that's within the ABI; this is a capability bit.

    - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
    libraries can reliably detect that bit 0 is deprecated and perma-zero
    without having to check the kernel version.

    - Use bits 2, 3, 4 for the newly defined, correct functionality:

    cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
    cap_user_time : 1, /* The time_* fields are used */
    cap_user_time_zero : 1, /* The time_zero field is used */

    - Rename all the bitfield names in perf_event.h to be different from the
    old names, to make sure it's not possible to mis-compile it
    accidentally with old assumptions.

    The 'size' field can then be used in the future to add new fields and it
    will act as a natural ABI version indicator as well.

    Also adjust tools/perf/ userspace for the new definitions, noticed by
    Adrian Hunter.
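
    A hedged user-space sketch of how a new binary can detect the iterated
    ABI described above (the capability field names follow the commit text;
    the surrounding code is illustrative):

    #include <stdio.h>
    #include <linux/perf_event.h>

    /* 'pc' points at the event's mmap()'ed user page. */
    static void report_caps(struct perf_event_mmap_page *pc)
    {
        if (!pc->cap_bit0_is_deprecated) {
            /* Old kernel: bit 0 still has the ambiguous, overlapping meaning. */
            printf("legacy capability layout, be conservative\n");
            return;
        }
        /* New kernel: bit 0 is perma-zero, the new bits are authoritative. */
        printf("rdpmc=%d time=%d time_zero=%d\n",
               (int)pc->cap_user_rdpmc,
               (int)pc->cap_user_time,
               (int)pc->cap_user_time_zero);
    }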

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Also-Fixed-by: Adrian Hunter
    Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Sep, 2013

1 commit

  • Currently utask->depth is simply the number of allocated/pending
    return_instance's in uprobe_task->return_instances list.

    handle_trampoline() should decrement this counter every time we
    handle/free an instance, but due to a typo it does this only if
    ->chained == T. This means that in the likely case this counter
    is never decremented and the probed task can't report more than
    MAX_URETPROBE_DEPTH events.
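
    A minimal, abridged sketch of the corrected accounting (the loop shape
    and helper name are assumptions; only 'depth', 'chained' and the
    return_instances list come from the text above):

    static void flush_return_instances_sketch(struct uprobe_task *utask)
    {
        struct return_instance *ri = utask->return_instances;

        while (ri) {
            struct return_instance *next = ri->next;
            bool chained = ri->chained;

            kfree(ri);
            utask->depth--;     /* decrement for every freed instance ...  */
            ri = next;

            if (!chained)       /* ... not only when ->chained is true     */
                break;
        }
        utask->return_instances = ri;
    }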

    Reported-by: Mikhail Kulemin
    Reported-by: Hemant Kumar Shaw
    Signed-off-by: Oleg Nesterov
    Acked-by: Anton Arapov
    Cc: masami.hiramatsu.pt@hitachi.com
    Cc: srikar@linux.vnet.ibm.com
    Cc: systemtap@sourceware.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

11 Sep, 2013

1 commit

  • The ino_generation field was added to the PERF_RECORD_MMAP2 record in
    the 13d7a24 cset, but no space for it was allocated, corrupting the
    PERF_SAMPLE_{TIME,CPU,TID,etc} area (sample_type/sample_id_all); fix it.

    Detected with one of the regression tests done by 'perf test':

    [root@sandy ~]# perf test -v 7
    7: Validate PERF_RECORD_* events & perf_sample fields :
    --- start ---
    61315294449606 0 PERF_RECORD_SAMPLE
    61315294453161 0 PERF_RECORD_SAMPLE
    61315294454441 0 PERF_RECORD_SAMPLE
    61315294455709 0 PERF_RECORD_SAMPLE
    61315295600899 0 PERF_RECORD_COMM: sleep:6500
    27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep
    MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500
    MMAP2 with unexpected cpu, expected 0, got 342521613
    MMAP2 with unexpected pid, expected 6500, got 1701606191
    MMAP2 with unexpected tid, expected 6500, got 28773
    27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342561333
    MMAP2 with unexpected pid, expected 6500, got 1932408369
    MMAP2 with unexpected tid, expected 6500, got 111
    27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso]
    MMAP2 with unexpected cpu, expected 0, got 342600095
    MMAP2 with unexpected pid, expected 6500, got 1935963739
    MMAP2 with unexpected tid, expected 6500, got 23919
    27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342882834
    MMAP2 with unexpected pid, expected 6500, got 909192754
    MMAP2 with unexpected tid, expected 6500, got 7303982
    61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500)
    ---- end ----
    Validate PERF_RECORD_* events & perf_sample fields: FAILED!
    [root@sandy ~]#

    After this patch:

    [root@sandy ~]# perf test 7
    7: Validate PERF_RECORD_* events & perf_sample fields : Ok
    [root@sandy ~]#

    Acked-by: Peter Zijlstra
    Acked-by: Stephane Eranian
    Cc: Adrian Hunter
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

04 Sep, 2013

2 commits

  • …rnel.org/pub/scm/linux/kernel/git/tip/tip

    Pull perf changes from Ingo Molnar:
    "As a first remark I'd like to point out that the obsolete '-f'
    (--force) option, which has not done anything for several releases,
    has been removed from 'perf record' and related utilities. Everyone
    please update muscle memory accordingly! :-)

    Main changes on the perf kernel side:

    - Performance optimizations:
    . for trace events, by Steve Rostedt.
    . for time values, by Peter Zijlstra

    - New hardware support:
    . for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
    . for Intel SNB-EP uncore PMUs, by Zheng Yan

    - Enhanced hardware support:
    . for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan

    - Core perf events code enhancements and fixes:
    . for full-nohz feature handling, by Frederic Weisbecker
    . for group events, by Jiri Olsa
    . for call chains, by Frederic Weisbecker
    . for event stream parsing, by Adrian Hunter

    - New ABI details:
    . Add attr->mmap2 attribute, by Stephane Eranian
    . Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
    . Export u64 time_zero on the mmap header page to allow TSC
    calculation, by Adrian Hunter
    . Add dummy software event, by Adrian Hunter.
    . Add a new PERF_SAMPLE_IDENTIFIER to make samples always
    parseable, by Adrian Hunter.
    . Make Power7 events available via sysfs, by Runzhen Wang.

    - Code cleanups and refactorings:
    . for nohz-full, by Frederic Weisbecker
    . for group events, by Jiri Olsa

    - Documentation updates:
    . for perf_event_type, by Peter Zijlstra

    Main changes on the perf tooling side (some of these tooling changes
    utilize the above kernel side changes):

    - Lots of 'perf trace' enhancements:

    . Make 'perf trace' command line arguments consistent with
    'perf record', by David Ahern.

    . Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.

    . Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.

    . Support ! in -e expressions, to filter a list of syscalls,
    by Arnaldo Carvalho de Melo.

    . Arg formatting improvements to allow masking arguments in
    syscalls such as futex and open, where some arguments are
    ignored and thus should not be printed depending on other args,
    by Arnaldo Carvalho de Melo.

    . Beautify the futex, open, openat, open_by_handle_at and lseek
    syscalls, by Arnaldo Carvalho de Melo.

    . Add option to analyze events in a file versus live, so that
    one can do:

    [root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
    [ perf record: Woken up 0 times to write data ]
    [ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
    [root@zoo ~]# perf trace -i perf.data -e futex --duration 1
    17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
    113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
    133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
    [root@zoo ~]#

    By David Ahern.

    . Honor target pid / tid options when analyzing a file, by David Ahern.

    . Introduce better formatting of syscall arguments, including so
    far beautifiers for mmap, madvise, syscall return values,
    by Arnaldo Carvalho de Melo.

    . Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.

    - 'perf report/top' enhancements:

    . Do annotation using /proc/kcore and /proc/kallsyms when
    available, removing the forced need for a vmlinux file for kernel
    assembly annotation. This also improves this use case because
    vmlinux has just the initial kernel image, not what is actually
    in use after various code patchings by things like alternatives.
    By Adrian Hunter.

    . Add --ignore-callees=<regex> option to collapse undesired parts
    of call graphs, by Greg Price.

    . Simplify symbol filtering by doing it at machine class level,
    by Adrian Hunter.

    . Add support for callchains in the gtk UI, by Namhyung Kim.

    . Add --objdump option to 'perf top', by Sukadev Bhattiprolu.

    - 'perf kvm' enhancements:

    . Add option to print only events that exceed a specified time
    duration, by David Ahern.

    . Improve stack trace printing, by David Ahern.

    . Update documentation of the live command, by David Ahern

    . Add perf kvm stat live mode that combines aspects of 'perf kvm
    stat' record and report, by David Ahern.

    . Add option to analyze specific VM in perf kvm stat report, by
    David Ahern.

    . Do not require /lib/modules/* on a guest, by Jason Wessel.

    - 'perf script' enhancements:

    . Fix symbol offset computation for some dsos, by David Ahern.

    . Fix named threads support, by David Ahern.

    . Don't install scripting files when perl/python support
    is disabled, by Arnaldo Carvalho de Melo.

    - 'perf test' enhancements:

    . Add various improvements and fixes to the "vmlinux matches
    kallsyms" 'perf test' entry, related to the /proc/kcore
    annotation feature. By Adrian Hunter.

    . Add sample parsing test, by Adrian Hunter.

    . Add test for reading object code, by Adrian Hunter.

    . Add attr record group sampling test, by Jiri Olsa.

    . Misc testing infrastructure improvements and other details,
    by Jiri Olsa.

    - 'perf list' enhancements:

    . Skip unsupported hardware events, by Namhyung Kim.

    . List pmu events, by Andi Kleen.

    - 'perf diff' enhancements:

    . Add support for more than two files comparison, by Jiri Olsa.

    - 'perf sched' enhancements:

    . Various improvements, including removing reliance on some
    scheduler tracepoints that provide the same information as the
    PERF_RECORD_{FORK,EXIT} events. By David Ahern.

    . Remove odd build stall by moving a large struct initialization
    from a local variable to a global one, by Namhyung Kim.

    - 'perf stat' enhancements:

    . Add --initial-delay option to skip measuring for a defined
    startup phase, by Andi Kleen.

    - Generic perf tooling infrastructure/plumbing changes:

    . Tidy up sample parsing validation, by Adrian Hunter.

    . Fix up jobserver setup in libtraceevent Makefile.
    by Arnaldo Carvalho de Melo.

    . Debug improvements, by Adrian Hunter.

    . Fix correlation of samples coming after PERF_RECORD_EXIT event,
    by David Ahern.

    . Improve robustness of the topology parsing code,
    by Stephane Eranian.

    . Add group leader sampling, that allows just one event in a group
    to sample while the other events have just their values read,
    by Jiri Olsa.

    . Add support for a new modifier "D", which requests that the
    event, or group of events, be pinned to the PMU.
    By Michael Ellerman.

    . Support callchain sorting based on addresses, by Andi Kleen

    . Prep work for multi perf data file storage, by Jiri Olsa.

    . libtraceevent cleanups, by Namhyung Kim.

    And lots and lots of other fixes and code reorganizations that did not
    make it into the list, see the shortlog, diffstat and the Git log for
    details!"

    [ Also merge a leftover from the 3.11 cycle ]

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Prevent race in unthrottling code

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits)
    perf trace: Tell arg formatters the arg index
    perf trace: Add beautifier for open's flags arg
    perf trace: Add beautifier for lseek's whence arg
    perf tools: Fix symbol offset computation for some dsos
    perf list: Skip unsupported events
    perf tests: Add 'keep tracking' test
    perf tools: Add support for PERF_COUNT_SW_DUMMY
    perf: Add a dummy software event to keep tracking
    perf trace: Add beautifier for futex 'operation' parm
    perf trace: Allow syscall arg formatters to mask args
    perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()
    perf: Export struct perf_branch_entry to userspace
    perf: Add attr->mmap2 attribute to an event
    perf/x86: Add Silvermont (22nm Atom) support
    perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
    perf trace: Handle missing HUGEPAGE defines
    perf trace: Honor target pid / tid options when analyzing a file
    perf trace: Add option to analyze events in a file versus live
    perf evlist: Add tracepoint lookup by name
    perf tests: Add a sample parsing test
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which constitutes the bulk of the patches, and the css destruction
    path is completely decoupled from the cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains preparatory patches for such a change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

02 Sep, 2013

2 commits

  • Adds a new PERF_RECORD_MMAP2 record type which is in essence
    an expanded version of PERF_RECORD_MMAP.

    Used to request mmap records with more information about
    the mapping, including device major, minor and the inode
    number and generation for mappings associated with files
    or shared memory segments. Works for code and data
    (with attr->mmap_data set).

    Existing PERF_RECORD_MMAP record is unmodified by this patch.
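
    For reference, a hedged sketch of the payload layout this adds, as
    described above (treat the exact field order as approximate; it is
    followed by the usual sample_id trailer when sample_id_all is set):

    #include <linux/perf_event.h>   /* struct perf_event_header, __u32/__u64 */

    struct mmap2_event_sketch {
        struct perf_event_header header;    /* PERF_RECORD_MMAP2         */
        __u32   pid, tid;
        __u64   addr;                       /* start of the mapping      */
        __u64   len;                        /* length of the mapping     */
        __u64   pgoff;                      /* file offset               */
        __u32   maj;                        /* device major number       */
        __u32   min;                        /* device minor number       */
        __u64   ino;                        /* inode number              */
        __u64   ino_generation;             /* inode generation          */
        char    filename[];                 /* NUL-terminated path       */
    };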

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1377079825-19057-2-git-send-email-eranian@google.com
    [ Added Al to the Cc:. Are the ino, maj/min exports of vma->vm_file OK? ]
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • The current throttling code triggers the WARN below via the following
    workload (only hit on an AMD machine with 48 CPUs):

    # while [ 1 ]; do perf record perf bench sched messaging; done

    WARNING: at arch/x86/kernel/cpu/perf_event.c:1054 x86_pmu_start+0xc6/0x100()
    SNIP
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x61/0x80
    [] warn_slowpath_null+0x1a/0x20
    [] x86_pmu_start+0xc6/0x100
    [] perf_adjust_freq_unthr_context.part.75+0x182/0x1a0
    [] perf_event_task_tick+0xc8/0xf0
    [] scheduler_tick+0xd1/0x140
    [] update_process_times+0x66/0x80
    [] tick_sched_handle.isra.15+0x25/0x60
    [] tick_sched_timer+0x41/0x60
    [] __run_hrtimer+0x74/0x1d0
    [] ? tick_sched_handle.isra.15+0x60/0x60
    [] hrtimer_interrupt+0xf7/0x240
    [] smp_apic_timer_interrupt+0x69/0x9c
    [] apic_timer_interrupt+0x6d/0x80
    [] ? __perf_event_task_sched_in+0x184/0x1a0
    [] ? kfree_skbmem+0x37/0x90
    [] ? __slab_free+0x1ac/0x30f
    [] ? kfree+0xfd/0x130
    [] kmem_cache_free+0x1b2/0x1d0
    [] kfree_skbmem+0x37/0x90
    [] consume_skb+0x34/0x80
    [] unix_stream_recvmsg+0x4e7/0x820
    [] sock_aio_read.part.7+0x116/0x130
    [] ? __perf_sw_event+0x19c/0x1e0
    [] sock_aio_read+0x21/0x30
    [] do_sync_read+0x80/0xb0
    [] vfs_read+0x145/0x170
    [] SyS_read+0x49/0xa0
    [] ? __audit_syscall_exit+0x1f6/0x2a0
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 622b7e226c4a766a ]---

    The reason is a race in perf_event_task_tick() throttling code.
    The race flow (simplified code):

    - perf_throttled_count is a per-cpu variable and is the
    CPU throttling flag, here starting at 0

    - perf_throttled_seq is the sequence/domain for the allowed
    count of interrupts within the tick; it gets increased
    each tick

    on a single CPU (CPU-bound event):

    ... workload

    perf_event_task_tick:
    |
    | T0 inc(perf_throttled_seq)
    | T1 needs_unthr = xchg(perf_throttled_count, 0) == 0
    tick gets interrupted:

    ... event gets throttled under new seq ...

    T2 last NMI comes, event is throttled - inc(perf_throttled_count)

    back to tick:
    | perf_adjust_freq_unthr_context:
    |
    | T3 unthrottling is skipped for the event (needs_unthr == 0)
    | T4 event is stopped and started via freq adjustment
    |
    tick ends

    ... workload
    ... no sample is hit for event ...

    perf_event_task_tick:
    |
    | T5 needs_unthr = xchg(perf_throttled_count, 0) != 0 (from T2)
    | T6 unthrottling is done on event (interrupts == MAX_INTERRUPTS)
    | event is already started (from T4) -> WARN

    Fix this by not checking needs_unthr again, and thus checking all
    events for unthrottling.
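
    A hedged sketch of the resulting shape of the tick-side loop (heavily
    abridged; the point is that the per-event MAX_INTERRUPTS test is no
    longer gated on the racy needs_unthr snapshot):

    /* Inside perf_adjust_freq_unthr_context(), sketched: */
    list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
        struct hw_perf_event *hwc = &event->hw;

        if (hwc->interrupts == MAX_INTERRUPTS) {
            hwc->interrupts = 0;            /* unthrottle unconditionally */
            perf_log_throttle(event, 1);
            event->pmu->start(event, 0);
        }

        /* ... frequency adjustment continues as before ... */
    }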

    Signed-off-by: Jiri Olsa
    Reported-by: Jan Stancek
    Suggested-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Andi Kleen
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1377355554-8934-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

30 Aug, 2013

1 commit

  • The event stream is not always parsable because the format of a sample
    is dependent on the sample_type of the selected event. When there is
    more than one selected event and the sample_types are not the same then
    parsing becomes problematic. A sample can be matched to its selected
    event using the ID that is allocated when the event is opened.
    Unfortunately, to get the ID from the sample means first parsing it.

    This patch adds a new sample format bit PERF_SAMPLE_IDENTIFIER that puts
    the ID at a fixed position so that the ID can be retrieved without
    parsing the sample. For sample events, that is the first position
    immediately after the header. For non-sample events, that is the last
    position.

    In this respect parsing samples requires that the sample_type and ID
    values are recorded. For example, perf tools records struct
    perf_event_attr and the IDs within the perf.data file. Those must be
    read first before it is possible to parse samples found later in the
    perf.data file.
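
    A user-space sketch of requesting the new bit so every sample carries
    the ID at a known offset (error handling omitted):

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int open_identified_counter(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_IDENTIFIER;

        /* pid = 0 (self), cpu = -1 (any), group_fd = -1, flags = 0 */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }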

    Signed-off-by: Adrian Hunter
    Tested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Mike Galbraith
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/r/1377591794-30553-6-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Adrian Hunter
     

27 Aug, 2013

1 commit

  • cgroup_css_from_dir() will grow another user. In preparation, make
    the following changes.

    * All css functions are prefixed with just "css_", rename it to
    css_from_dir().

    * Take dentry * instead of file * as dentry is what ultimately
    identifies a cgroup and file may not always be available. Note that
    the function now checkes whether @dentry->d_inode is NULL as the
    caller now may specify a negative dentry.

    * Make it take cgroup_subsys * instead of integer subsys_id. This
    simplifies the function and allows specifying no subsystem for
    cgroup->dummy_css.

    * Make return section a bit less verbose.

    This patch doesn't introduce any behavior changes.
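
    Sketch of the signature change implied by the points above (declarations
    only; the function bodies are not reproduced here):

    /* Before this change: */
    struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);

    /* After (sketch): a negative dentry (NULL ->d_inode) is rejected, and
     * a NULL @ss yields the cgroup's dummy_css. */
    struct cgroup_subsys_state *css_from_dir(struct dentry *dentry,
                                             struct cgroup_subsys *ss);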

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar

    Tejun Heo
     

16 Aug, 2013

3 commits

  • We should not be calling calc_timer_values() for events that do not actually
    have an mmap()'ed userpage.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130802191630.GT27162@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Freq events may not always be affine to a particular CPU. As such,
    account_event_cpu() may crash if we account per cpu a freq event
    that has event->cpu == -1.

    To solve this, let's account freq events globally. In practice
    this doesn't change the picture much because perf tools create
    per-task perf events with one event per CPU by default. Profiling a
    single CPU is usually a corner case, so there is not much point in
    optimizing things that way.
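
    A hedged sketch of the accounting move (the global counter name and the
    surrounding function are assumptions based on the description above):

    static atomic_t nr_freq_events __read_mostly;   /* global, not per cpu */

    static void account_event_sketch(struct perf_event *event)
    {
        if (event->attr.freq)
            atomic_inc(&nr_freq_events);    /* safe even if event->cpu == -1 */

        if (event->cpu != -1)
            account_event_cpu(event, event->cpu);   /* per-cpu bits only */
    }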

    Reported-by: Jiri Olsa
    Suggested-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Tested-by: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1375460996-16329-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • When we fail to allocate the callchain buffers, we roll back the refcount
    we did and return from get_callchain_buffers().

    However we take the refcount and allocate under the callchain lock
    but the rollback is done outside the lock.

    As a result, while we roll back, some concurrent callchain user may
    call get_callchain_buffers(), see the non-zero refcount and give up
    because the buffers are NULL without itself retrying the allocation.

    The consequences aren't that bad, but that behaviour looks weird enough
    that it's better to give the following callchain users a chance to retry
    where we failed.
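
    A hedged sketch of the intended locking shape, with the rollback done
    while still holding the mutex so concurrent callers either see a zero
    count or fully allocated buffers (names taken from the description,
    not from verified code):

    int get_callchain_buffers_sketch(void)
    {
        int err = 0;
        int count;

        mutex_lock(&callchain_mutex);

        count = atomic_inc_return(&nr_callchain_events);
        if (count == 1)
            err = alloc_callchain_buffers();
        else if (!callchain_cpus_entries)
            err = -ENOMEM;      /* a previous allocation failed */

        if (err)
            atomic_dec(&nr_callchain_events);   /* roll back under the lock */

        mutex_unlock(&callchain_mutex);
        return err;
    }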

    Reported-by: Jiri Olsa
    Signed-off-by: Frederic Weisbecker
    Acked-by: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1375460996-16329-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

13 Aug, 2013

1 commit

  • cgroup->subsys[] will become RCU protected and thus all cgroup_css()
    usages should either be under RCU read lock or cgroup_mutex. This
    patch updates cgroup_css_from_dir() which returns the matching
    cgroup_subsys_state given a directory file and subsys_id so that it
    requires RCU read lock and updates its sole user
    perf_cgroup_connect().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar

    Tejun Heo
     

09 Aug, 2013

3 commits

  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, subsystems only care about their part in the
    cgroup hierarchy - i.e. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

08 Aug, 2013

2 commits

  • It's possible some of the counters in the group could be
    disabled when the sampling member of the event group is reading
    the rest via PERF_SAMPLE_READ sample type processing. Disabled
    counters could then produce wrong numbers.

    Fixing that by reading only enabled counters for PERF_SAMPLE_READ
    sample type processing.

    Signed-off-by: Jiri Olsa
    Acked-by: Namhyung Kim
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-wwkjb0bbcuslnz0klrmqi26r@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • The only way to get the event ID is by reading the event fd,
    followed by parsing the ID value out of the returned data.

    While this is ok for the current read format used by the perf tool,
    it is not ok when we use the PERF_FORMAT_GROUP format.

    With this format the data are returned for the whole group
    and there's no way to find out which ID belongs to our fd
    (if we are not the group leader event).

    Add a simple ioctl that returns the event's primary ID for a given fd.
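
    A user-space sketch of the new ioctl; PERF_EVENT_IOC_ID fills in a u64
    with the event's ID:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    /* Works for any group member, even when PERF_FORMAT_GROUP is in use. */
    static int event_id_of_fd(int fd, uint64_t *id)
    {
        return ioctl(fd, PERF_EVENT_IOC_ID, id);    /* 0 on success */
    }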

    Signed-off-by: Jiri Olsa
    Acked-by: Namhyung Kim
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v1bn5cto707jn0bon34afqr1@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     

31 Jul, 2013

7 commits

  • Currently the full dynticks subsystem keeps the
    tick alive as long as there are perf events running.

    This prevents the tick from being stopped as long as features
    such as the lockup detectors are running. As a temporary fix,
    the lockup detector is disabled by default when full dynticks
    is built, but this is not a long term viable solution.

    To fix this, only keep the tick alive when an event configured
    with a frequency rather than a period is running on the CPU,
    or when an event throttles on the CPU.

    These are the only purposes of the perf tick, especially now that
    the rotation of flexible events is handled from a separate hrtimer.
    The tick can be shut down the rest of the time.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-8-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This is going to be used by the full dynticks subsystem
    as finer-grained information to know when to keep and
    when to stop the tick.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-7-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • When an event is migrated, move the event per-cpu
    accounting accordingly so that branch stack and cgroup
    events work correctly on the new CPU.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This way we can use the per-cpu handling separately.
    This is going to be used to fix the event migration
    code accounting.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Gather all the event accounting code to a single place,
    once all the prerequisites are completed. This simplifies
    the refcounting.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • In case of allocation failure, get_callchain_buffers() keeps the
    refcount incremented for the current event.

    As a result, when get_callchain_buffers() returns an error,
    we must clean up what it did by cancelling its last refcount
    with a call to put_callchain_buffers().

    This is a hack in order to be able to call free_event()
    after that failure.

    The original purpose of that was to simplify the failure
    path. But this error handling is actually counter-intuitive,
    ugly and not very easy to follow, because one expects
    the resources used to perform a service to be cleaned up
    by the callee in case of failure, not by the caller.

    So let's clean this up by cancelling the refcount from
    get_callchain_buffers() in case of failure, and correctly free
    the event accordingly in perf_event_alloc().
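
    A hedged sketch of the resulting caller side (heavily abridged; the
    point is that the failure path in perf_event_alloc() no longer needs a
    compensating put_callchain_buffers(), and the error label here is
    illustrative):

    /* Inside perf_event_alloc(), sketched: */
    if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
        /* On failure the callee drops its own refcount. */
        err = get_callchain_buffers();
        if (err)
            goto err_free;      /* just undo what was set up before this */
    }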

    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • On callchain buffers allocation failure, free_event() is
    called and all the accounting performed in perf_event_alloc()
    for that event is cancelled.

    But if the event has branch stack sampling, it is also unaccounted
    from the branch stack sampling events refcount.

    This is a bug, because that accounting is only performed after the
    callchain buffer allocation. As a result, the branch stack sampling
    events refcount can become negative.

    To fix this, move the branch stack event accounting before the
    callchain buffer allocation.

    Reported-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

23 Jul, 2013

1 commit

  • Due to a discussion with Adrian I had a good look at the perf_event_type record
    layout and found the documentation to be somewhat unclear.

    Cc: Adrian Hunter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130716150907.GL23818@dyad.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Jul, 2013

2 commits

  • Merge in a v3.11-rc1-ish branch to go from v3.10 based development
    to a v3.11 based one.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Pull driver core patches from Greg KH:
    "Here are some driver core patches for 3.11-rc2. They aren't really
    bugfixes, but a bunch of new helper macros for drivers to properly
    create attribute groups, which drivers and subsystems need to fix up a
    ton of race issues with incorrectly creating sysfs files (binary and
    normal) after userspace has been told that the device is present.

    Also here is the ability to create binary files as attribute groups,
    to solve that race condition, which was impossible to do before this,
    so that's my fault the drivers were broken.

    The majority of the .c changes is indenting and moving code around a
    bit. It affects no existing code, but allows the large backlog of 70+
    patches that I already have created to start flowing into the
    different subtrees, instead of having to live in my driver-core tree,
    causing merge nightmares in linux-next for the next few months.

    These were finalized too late for the -rc1 merge window, which is why
    they didn't make that pull request; testing and review from
    others didn't happen until a few weeks ago, and then there's the whole
    distraction of the past few days, which prevented these from getting
    to you sooner, sorry about that.

    Oh, and there's a bugfix for the documentation build warning in here
    as well. All of these have been in linux-next this week, with no
    reported problems"

    * tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    driver-core: fix new kernel-doc warning in base/platform.c
    sysfs: use file mode defines from stat.h
    sysfs: add more helper macro's for (bin_)attribute(_groups)
    driver core: add default groups to struct class
    driver core: Introduce device_create_groups
    sysfs: prevent warning when only using binary attributes
    sysfs: add support for binary attributes in groups
    driver core: device.h: add RW and RO attribute macros
    sysfs.h: add BIN_ATTR macro
    sysfs.h: add ATTRIBUTE_GROUPS() macro
    sysfs.h: add __ATTR_RW() macro

    Linus Torvalds
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Jul, 2013

4 commits

  • It gives the following benefits:

    - only one function pointer is passed along the way

    - the 'match' function is called within the output function
    and could be inlined by the compiler

    Suggested-by: Peter Zijlstra
    Signed-off-by: Jiri Olsa
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373388991-9711-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Jiri managed to trigger this warning:

    [] ======================================================
    [] [ INFO: possible circular locking dependency detected ]
    [] 3.10.0+ #228 Tainted: G W
    [] -------------------------------------------------------
    [] p/6613 is trying to acquire lock:
    [] (rcu_node_0){..-...}, at: [] rcu_read_unlock_special+0xa7/0x250
    []
    [] but task is already holding lock:
    [] (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0xd9/0x2c0
    []
    [] which lock already depends on the new lock.
    []
    [] the existing dependency chain (in reverse order) is:
    []
    [] -> #4 (&ctx->lock){-.-...}:
    [] -> #3 (&rq->lock){-.-.-.}:
    [] -> #2 (&p->pi_lock){-.-.-.}:
    [] -> #1 (&rnp->nocb_gp_wq[1]){......}:
    [] -> #0 (rcu_node_0){..-...}:

    Paul was quick to explain that due to preemptible RCU we cannot call
    rcu_read_unlock() while holding scheduler (or nested) locks when part
    of the read side critical section was preemptible.

    Therefore solve it by making the entire RCU read side non-preemptible.

    Also pull out the retry from under the non-preempt to play nice with RT.
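
    A hedged, abridged sketch of the resulting shape of
    perf_lock_task_context() (only the locking order matters here; details
    are omitted):

    retry:
        /* The whole RCU read side must be non-preemptible, because we end
         * up calling rcu_read_unlock() while holding ctx->lock, which
         * nests under scheduler locks. */
        preempt_disable();
        rcu_read_lock();
        ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
        if (ctx) {
            raw_spin_lock_irqsave(&ctx->lock, *flags);
            if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
                raw_spin_unlock_irqrestore(&ctx->lock, *flags);
                rcu_read_unlock();
                preempt_enable();       /* retry with preemption enabled */
                goto retry;
            }
            /* ... */
        }
        rcu_read_unlock();
        preempt_enable();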

    Reported-by: Jiri Olsa
    Helped-out-by: Paul E. McKenney
    Cc:
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The '!ctx->is_active' check has a valid scenario, so
    there's no need for the warning.

    The reason is that there's a time window between the
    'ctx->is_active' check in the perf_event_enable() function
    and the __perf_event_enable() function having:

    - IRQs on
    - ctx->lock unlocked

    where the task could be killed and 'ctx' deactivated by
    perf_event_exit_task(), ending up with the warning below.

    So remove the WARN_ON_ONCE() check and add comments to
    explain it all.

    This addresses the following warning reported by Vince Weaver:

    [ 324.983534] ------------[ cut here ]------------
    [ 324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190()
    [ 324.984420] Modules linked in:
    [ 324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246
    [ 324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
    [ 324.984420] 0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00
    [ 324.984420] ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860
    [ 324.984420] 0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a
    [ 324.984420] Call Trace:
    [ 324.984420] [] dump_stack+0x19/0x1b
    [ 324.984420] [] warn_slowpath_common+0x70/0xa0
    [ 324.984420] [] warn_slowpath_null+0x1a/0x20
    [ 324.984420] [] __perf_event_enable+0x187/0x190
    [ 324.984420] [] remote_function+0x40/0x50
    [ 324.984420] [] generic_smp_call_function_single_interrupt+0xbe/0x130
    [ 324.984420] [] smp_call_function_single_interrupt+0x27/0x40
    [ 324.984420] [] call_function_single_interrupt+0x6f/0x80
    [ 324.984420] [] ? _raw_spin_unlock_irqrestore+0x41/0x70
    [ 324.984420] [] perf_event_exit_task+0x14d/0x210
    [ 324.984420] [] ? switch_task_namespaces+0x24/0x60
    [ 324.984420] [] do_exit+0x2b6/0xa40
    [ 324.984420] [] ? _raw_spin_unlock_irq+0x2c/0x30
    [ 324.984420] [] do_group_exit+0x49/0xc0
    [ 324.984420] [] get_signal_to_deliver+0x254/0x620
    [ 324.984420] [] do_signal+0x57/0x5a0
    [ 324.984420] [] ? __do_page_fault+0x2a4/0x4e0
    [ 324.984420] [] ? retint_restore_args+0xe/0xe
    [ 324.984420] [] ? retint_signal+0x11/0x84
    [ 324.984420] [] do_notify_resume+0x65/0x80
    [ 324.984420] [] retint_signal+0x46/0x84
    [ 324.984420] ---[ end trace 442ec2f04db3771a ]---

    Reported-by: Vince Weaver
    Signed-off-by: Jiri Olsa
    Suggested-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Currently when the child context for inherited events is
    created, it's based on the pmu object of the first event
    of the parent context.

    This is wrong for the following scenario:

    - HW context having HW and SW event
    - HW event got removed (closed)
    - SW event stays in HW context as the only event
    and its pmu is used to clone the child context

    The issue starts when the cpu context object is touched
    based on the pmu context object (__get_cpu_context). In
    this case the HW context will work with the SW cpu context,
    ending up with the WARN below.

    Fix this by using the parent context's pmu object to clone
    the child context from.
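
    A hedged, abridged sketch of the inheritance path after the fix
    (alloc_perf_context() is the context allocator; surrounding code is
    omitted):

    child_ctx = child->perf_event_ctxp[ctxn];
    if (!child_ctx) {
        /* Was: alloc_perf_context(event->pmu, child) -- wrong when the
         * only remaining event is a SW event sitting in a HW context. */
        child_ctx = alloc_perf_context(parent_ctx->pmu, child);
        if (!child_ctx)
            return -ENOMEM;
        child->perf_event_ctxp[ctxn] = child_ctx;
    }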

    Addresses the following warning reported by Vince Weaver:

    [ 2716.472065] ------------[ cut here ]------------
    [ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
    [ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
    [ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
    [ 2716.476035] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
    [ 2716.476035] 0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
    [ 2716.476035] ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
    [ 2716.476035] ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
    [ 2716.476035] Call Trace:
    [ 2716.476035] [] ? warn_slowpath_common+0x5b/0x70
    [ 2716.476035] [] ? task_ctx_sched_out+0x3c/0x5f
    [ 2716.476035] [] ? perf_event_exit_task+0xbf/0x194
    [ 2716.476035] [] ? do_exit+0x3e7/0x90c
    [ 2716.476035] [] ? __do_fault+0x359/0x394
    [ 2716.476035] [] ? do_group_exit+0x66/0x98
    [ 2716.476035] [] ? get_signal_to_deliver+0x479/0x4ad
    [ 2716.476035] [] ? __perf_event_task_sched_out+0x230/0x2d1
    [ 2716.476035] [] ? do_signal+0x3c/0x432
    [ 2716.476035] [] ? ctx_sched_in+0x43/0x141
    [ 2716.476035] [] ? perf_event_context_sched_in+0x7a/0x90
    [ 2716.476035] [] ? __perf_event_task_sched_in+0x31/0x118
    [ 2716.476035] [] ? mmdrop+0xd/0x1c
    [ 2716.476035] [] ? finish_task_switch+0x7d/0xa6
    [ 2716.476035] [] ? do_notify_resume+0x20/0x5d
    [ 2716.476035] [] ? retint_signal+0x3d/0x78
    [ 2716.476035] ---[ end trace 827178d8a5966c3d ]---

    Reported-by: Vince Weaver
    Signed-off-by: Jiri Olsa
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

05 Jul, 2013

1 commit

  • This patch fixes a serious bug in:

    14c63f17b1fd perf: Drop sample rate when sampling is too slow

    There was a misunderstanding of the API of the do_div()
    macro. It returns the remainder of the division, and this
    was not what the function expected, leading to disabling the
    interrupt latency watchdog.

    This patch also removes a duplicate assignment in
    perf_sample_event_took().
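
    For clarity, the do_div() contract the fix relies on: the macro divides
    its 64-bit first argument in place and returns the remainder, not the
    quotient. A small hedged illustration:

    #include <linux/types.h>
    #include <asm/div64.h>      /* do_div() */

    static u64 avg_sample_ns_sketch(u64 total_ns, u32 nr_samples)
    {
        u64 avg = total_ns;
        u32 rem = do_div(avg, nr_samples);  /* avg is now the quotient */

        (void)rem;                          /* remainder, ignored here */
        return avg;
    }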

    Signed-off-by: Stephane Eranian
    Cc: peterz@infradead.org
    Cc: dave.hansen@linux.intel.com
    Cc: ak@linux.intel.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/20130704223010.GA30625@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

23 Jun, 2013

1 commit

  • This patch keeps track of how long perf's NMI handler is taking,
    and also calculates how many samples perf can take a second. If
    the sample length times the expected max number of samples
    exceeds a configurable threshold, it drops the sample rate.

    This way, we don't have a runaway sampling process eating up the
    CPU.

    This patch can tend to drop the sample rate down to a level where
    perf doesn't work very well. *BUT* the alternative is that my
    system hangs because it spends all of its time handling NMIs.

    I'll take a busted performance tool over an entire system that's
    busted and undebuggable any day.

    BTW, my suspicion is that there's still an underlying bug here.
    Using the HPET instead of the TSC is definitely a contributing
    factor, but I suspect there are some other things going on.
    But, I can't go dig down on a bug like that with my machine
    hanging all the time.

    Signed-off-by: Dave Hansen
    Acked-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: acme@ghostprotocols.net
    Cc: Dave Hansen
    [ Prettified it a bit. ]
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

20 Jun, 2013

3 commits

  • This patch simply moves all per-cpu variables into the new
    single per-cpu "struct bp_cpuinfo".

    To me this looks more logical and clean, but this can also
    simplify further potential changes. In particular, I do not
    think this memory should be per-cpu; it is never used "locally".
    After this change it is trivial to turn it into, say,
    bootmem[nr_cpu_ids].
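
    For orientation, a sketch of the consolidated structure as described
    (field meanings follow the breakpoint slot bookkeeping this commit
    gathers together; treat the details as approximate):

    struct bp_cpuinfo {
        /* Number of pinned CPU-bound breakpoints on this cpu */
        unsigned int    cpu_pinned;
        /* Histogram of pinned task-bound breakpoints, per slot count */
        unsigned int    *tsk_pinned;
        /* Number of non-pinned (flexible) breakpoints on this cpu */
        unsigned int    flexible;
    };

    static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);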

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155020.GA6350@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • 1. register_wide_hw_breakpoint() can use unregister_ on failure,
    no need to duplicate the code.

    2. "struct perf_event **pevent" adds an unnecessary level of
    indirection and complication; use per_cpu(*cpu_events, cpu).

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155018.GA6347@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the trivial helper which simply returns cpumask_of() or
    cpu_possible_mask depending on bp->cpu.

    Change fetch_bp_busy_slots() and toggle_bp_slot() to always do
    for_each_cpu(cpumask_of_bp) to simplify the code and avoid the
    code duplication.
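
    A sketch of the trivial helper described above:

    /* Per-CPU breakpoints iterate one CPU; task-bound ones iterate all. */
    static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
    {
        if (bp->cpu >= 0)
            return cpumask_of(bp->cpu);
        return cpu_possible_mask;
    }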

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155015.GA6340@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov