04 Oct, 2013

1 commit

  • While auditing the list_entry usage due to a trinity bug, I found that
    perf_pmu_migrate_context() violates the rules for
    perf_event::event_entry.

    The problem is that perf_event::event_entry is an RCU list element, and
    hence we must wait for a full RCU grace period before re-using the
    element after deletion.

    Therefore the usage in perf_pmu_migrate_context() which re-uses the
    entry immediately is broken. For now introduce another list_head into
    perf_event for this specific usage.

    This doesn't actually fix the trinity report because that never goes
    through this code.
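
    Below is a minimal sketch of the shape of such a fix, assuming the new
    list head is called migrate_entry (only perf_event::event_entry is named
    in the text above; every other name here is illustrative):

    /* Sketch only; all names other than event_entry are assumptions. */
    struct perf_event {
        /* ... */
        struct list_head event_entry;   /* RCU list: needs a grace period  */
        struct list_head migrate_entry; /* plain list: reusable right away */
        /* ... */
    };

    static void migrate_events_sketch(struct list_head *events,
                                      struct list_head *staging)
    {
        struct perf_event *event, *tmp;

        list_for_each_entry_safe(event, tmp, events, event_entry) {
            remove_event_from_ctx(event);   /* hypothetical helper that
                                               RCU-deletes event_entry   */
            /* event_entry must not be reused before a grace period, so
               park the event on the private list head instead.          */
            list_add(&event->migrate_entry, staging);
        }
    }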

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Sep, 2013

1 commit

  • Solve the problems around the broken definition of the
    perf_event_mmap_page::cap_usr_time and cap_usr_rdpmc fields, which used
    to overlap; this was partially fixed by:

    860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

    The problem with the fix (merged in v3.12-rc1 and not yet released
    officially), noticed by Vince Weaver, is that the new behavior is
    not detectable by new user-space, and that due to the reuse of the
    field names it's easy to mis-compile a binary if old headers are used
    on a new kernel or new headers are used on an old kernel.

    To solve all that make this change explicit, detectable and self-contained,
    by iterating the ABI the following way:

    - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
    confuse old user-space binaries. RDPMC will be marked as unavailable
    to old binaries, but that's within the ABI; this is a capability bit.

    - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
    libraries can reliably detect that bit 0 is deprecated and perma-zero
    without having to check the kernel version.

    - Use bits 2, 3, 4 for the newly defined, correct functionality:

    cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
    cap_user_time : 1, /* The time_* fields are used */
    cap_user_time_zero : 1, /* The time_zero field is used */

    - Rename all the bitfield names in perf_event.h to be different from the
    old names, to make sure it's not possible to mis-compile it
    accidentally with old assumptions.

    The 'size' field can then be used in the future to add new fields and it
    will act as a natural ABI version indicator as well.

    Also adjust tools/perf/ userspace for the new definitions, noticed by
    Adrian Hunter.
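
    A hedged user-space sketch of how a new binary can detect the iterated
    ABI described above (the capability field names follow the commit text;
    the surrounding code is illustrative):

    #include <stdio.h>
    #include <linux/perf_event.h>

    /* 'pc' points at the event's mmap()'ed user page. */
    static void report_caps(struct perf_event_mmap_page *pc)
    {
        if (!pc->cap_bit0_is_deprecated) {
            /* Old kernel: bit 0 still has the ambiguous, overlapping meaning. */
            printf("legacy capability layout, be conservative\n");
            return;
        }
        /* New kernel: bit 0 is perma-zero, the new bits are authoritative. */
        printf("rdpmc=%d time=%d time_zero=%d\n",
               (int)pc->cap_user_rdpmc,
               (int)pc->cap_user_time,
               (int)pc->cap_user_time_zero);
    }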

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Also-Fixed-by: Adrian Hunter
    Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Sep, 2013

1 commit

  • Currently utask->depth is simply the number of allocated/pending
    return_instance's in uprobe_task->return_instances list.

    handle_trampoline() should decrement this counter every time we
    handle/free an instance, but due to a typo it does this only if
    ->chained == T. This means that in the likely case this counter
    is never decremented and the probed task can't report more than
    MAX_URETPROBE_DEPTH events.
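
    A minimal, abridged sketch of the corrected accounting (the loop shape
    and helper name are assumptions; only 'depth', 'chained' and the
    return_instances list come from the text above):

    static void flush_return_instances_sketch(struct uprobe_task *utask)
    {
        struct return_instance *ri = utask->return_instances;

        while (ri) {
            struct return_instance *next = ri->next;
            bool chained = ri->chained;

            kfree(ri);
            utask->depth--;     /* decrement for every freed instance ...  */
            ri = next;

            if (!chained)       /* ... not only when ->chained is true     */
                break;
        }
        utask->return_instances = ri;
    }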

    Reported-by: Mikhail Kulemin
    Reported-by: Hemant Kumar Shaw
    Signed-off-by: Oleg Nesterov
    Acked-by: Anton Arapov
    Cc: masami.hiramatsu.pt@hitachi.com
    Cc: srikar@linux.vnet.ibm.com
    Cc: systemtap@sourceware.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

11 Sep, 2013

1 commit

  • The ino_generation field was added to the PERF_RECORD_MMAP2 record in
    the 13d7a24 cset, but no space for it was allocated, corrupting the
    PERF_SAMPLE_{TIME,CPU,TID,etc} area (sample_type/sample_id_all); fix it.

    Detected with one of the regression tests done by 'perf test':

    [root@sandy ~]# perf test -v 7
    7: Validate PERF_RECORD_* events & perf_sample fields :
    --- start ---
    61315294449606 0 PERF_RECORD_SAMPLE
    61315294453161 0 PERF_RECORD_SAMPLE
    61315294454441 0 PERF_RECORD_SAMPLE
    61315294455709 0 PERF_RECORD_SAMPLE
    61315295600899 0 PERF_RECORD_COMM: sleep:6500
    27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep
    MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500
    MMAP2 with unexpected cpu, expected 0, got 342521613
    MMAP2 with unexpected pid, expected 6500, got 1701606191
    MMAP2 with unexpected tid, expected 6500, got 28773
    27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342561333
    MMAP2 with unexpected pid, expected 6500, got 1932408369
    MMAP2 with unexpected tid, expected 6500, got 111
    27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso]
    MMAP2 with unexpected cpu, expected 0, got 342600095
    MMAP2 with unexpected pid, expected 6500, got 1935963739
    MMAP2 with unexpected tid, expected 6500, got 23919
    27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342882834
    MMAP2 with unexpected pid, expected 6500, got 909192754
    MMAP2 with unexpected tid, expected 6500, got 7303982
    61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500)
    ---- end ----
    Validate PERF_RECORD_* events & perf_sample fields: FAILED!
    [root@sandy ~]#

    After this patch:

    [root@sandy ~]# perf test 7
    7: Validate PERF_RECORD_* events & perf_sample fields : Ok
    [root@sandy ~]#

    Acked-by: Peter Zijlstra
    Acked-by: Stephane Eranian
    Cc: Adrian Hunter
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

04 Sep, 2013

2 commits

  • …rnel.org/pub/scm/linux/kernel/git/tip/tip

    Pull perf changes from Ingo Molnar:
    "As a first remark I'd like to point out that the obsolete '-f'
    (--force) option, which has not done anything for several releases,
    has been removed from 'perf record' and related utilities. Everyone
    please update muscle memory accordingly! :-)

    Main changes on the perf kernel side:

    - Performance optimizations:
    . for trace events, by Steve Rostedt.
    . for time values, by Peter Zijlstra

    - New hardware support:
    . for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
    . for Intel SNB-EP uncore PMUs, by Zheng Yan

    - Enhanced hardware support:
    . for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan

    - Core perf events code enhancements and fixes:
    . for full-nohz feature handling, by Frederic Weisbecker
    . for group events, by Jiri Olsa
    . for call chains, by Frederic Weisbecker
    . for event stream parsing, by Adrian Hunter

    - New ABI details:
    . Add attr->mmap2 attribute, by Stephane Eranian
    . Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
    . Export u64 time_zero on the mmap header page to allow TSC
    calculation, by Adrian Hunter
    . Add dummy software event, by Adrian Hunter.
    . Add a new PERF_SAMPLE_IDENTIFIER to make samples always
    parseable, by Adrian Hunter.
    . Make Power7 events available via sysfs, by Runzhen Wang.

    - Code cleanups and refactorings:
    . for nohz-full, by Frederic Weisbecker
    . for group events, by Jiri Olsa

    - Documentation updates:
    . for perf_event_type, by Peter Zijlstra

    Main changes on the perf tooling side (some of these tooling changes
    utilize the above kernel side changes):

    - Lots of 'perf trace' enhancements:

    . Make 'perf trace' command line arguments consistent with
    'perf record', by David Ahern.

    . Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.

    . Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.

    . Support ! in -e expressions, to filter a list of syscalls,
    by Arnaldo Carvalho de Melo.

    . Arg formatting improvements to allow masking arguments in
    syscalls such as futex and open, where some arguments are
    ignored and thus should not be printed depending on other args,
    by Arnaldo Carvalho de Melo.

    . Beautify the futex, open, openat, open_by_handle_at and lseek
    syscalls, by Arnaldo Carvalho de Melo.

    . Add option to analyze events in a file versus live, so that
    one can do:

    [root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
    [ perf record: Woken up 0 times to write data ]
    [ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
    [root@zoo ~]# perf trace -i perf.data -e futex --duration 1
    17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
    113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
    133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
    [root@zoo ~]#

    By David Ahern.

    . Honor target pid / tid options when analyzing a file, by David Ahern.

    . Introduce better formatting of syscall arguments, including so
    far beautifiers for mmap, madvise, syscall return values,
    by Arnaldo Carvalho de Melo.

    . Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.

    - 'perf report/top' enhancements:

    . Do annotation using /proc/kcore and /proc/kallsyms when
    available, removing the forced need for a vmlinux file for kernel
    assembly annotation. This also improves this use case because
    vmlinux has just the initial kernel image, not what is actually
    in use after various code patchings by things like alternatives.
    By Adrian Hunter.

    . Add --ignore-callees=<regex> option to collapse undesired parts
    of call graphs, by Greg Price.

    . Simplify symbol filtering by doing it at machine class level,
    by Adrian Hunter.

    . Add support for callchains in the gtk UI, by Namhyung Kim.

    . Add --objdump option to 'perf top', by Sukadev Bhattiprolu.

    - 'perf kvm' enhancements:

    . Add option to print only events that exceed a specified time
    duration, by David Ahern.

    . Improve stack trace printing, by David Ahern.

    . Update documentation of the live command, by David Ahern

    . Add perf kvm stat live mode that combines aspects of 'perf kvm
    stat' record and report, by David Ahern.

    . Add option to analyze specific VM in perf kvm stat report, by
    David Ahern.

    . Do not require /lib/modules/* on a guest, by Jason Wessel.

    - 'perf script' enhancements:

    . Fix symbol offset computation for some dsos, by David Ahern.

    . Fix named threads support, by David Ahern.

    . Don't install scripting files when perl/python support
    is disabled, by Arnaldo Carvalho de Melo.

    - 'perf test' enhancements:

    . Add various improvements and fixes to the "vmlinux matches
    kallsyms" 'perf test' entry, related to the /proc/kcore
    annotation feature. By Adrian Hunter.

    . Add sample parsing test, by Adrian Hunter.

    . Add test for reading object code, by Adrian Hunter.

    . Add attr record group sampling test, by Jiri Olsa.

    . Misc testing infrastructure improvements and other details,
    by Jiri Olsa.

    - 'perf list' enhancements:

    . Skip unsupported hardware events, by Namhyung Kim.

    . List pmu events, by Andi Kleen.

    - 'perf diff' enhancements:

    . Add support for more than two files comparison, by Jiri Olsa.

    - 'perf sched' enhancements:

    . Various improvements, including removing reliance on some
    scheduler tracepoints that provide the same information as the
    PERF_RECORD_{FORK,EXIT} events. By David Ahern.

    . Remove odd build stall by moving a large struct initialization
    from a local variable to a global one, by Namhyung Kim.

    - 'perf stat' enhancements:

    . Add --initial-delay option to skip measuring for a defined
    startup phase, by Andi Kleen.

    - Generic perf tooling infrastructure/plumbing changes:

    . Tidy up sample parsing validation, by Adrian Hunter.

    . Fix up jobserver setup in libtraceevent Makefile.
    by Arnaldo Carvalho de Melo.

    . Debug improvements, by Adrian Hunter.

    . Fix correlation of samples coming after PERF_RECORD_EXIT event,
    by David Ahern.

    . Improve robustness of the topology parsing code,
    by Stephane Eranian.

    . Add group leader sampling, that allows just one event in a group
    to sample while the other events have just their values read,
    by Jiri Olsa.

    . Add support for a new modifier "D", which requests that the
    event, or group of events, be pinned to the PMU.
    By Michael Ellerman.

    . Support callchain sorting based on addresses, by Andi Kleen

    . Prep work for multi perf data file storage, by Jiri Olsa.

    . libtraceevent cleanups, by Namhyung Kim.

    And lots and lots of other fixes and code reorganizations that did not
    make it into the list, see the shortlog, diffstat and the Git log for
    details!"

    [ Also merge a leftover from the 3.11 cycle ]

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Prevent race in unthrottling code

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits)
    perf trace: Tell arg formatters the arg index
    perf trace: Add beautifier for open's flags arg
    perf trace: Add beautifier for lseek's whence arg
    perf tools: Fix symbol offset computation for some dsos
    perf list: Skip unsupported events
    perf tests: Add 'keep tracking' test
    perf tools: Add support for PERF_COUNT_SW_DUMMY
    perf: Add a dummy software event to keep tracking
    perf trace: Add beautifier for futex 'operation' parm
    perf trace: Allow syscall arg formatters to mask args
    perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()
    perf: Export struct perf_branch_entry to userspace
    perf: Add attr->mmap2 attribute to an event
    perf/x86: Add Silvermont (22nm Atom) support
    perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
    perf trace: Handle missing HUGEPAGE defines
    perf trace: Honor target pid / tid options when analyzing a file
    perf trace: Add option to analyze events in a file versus live
    perf evlist: Add tracepoint lookup by name
    perf tests: Add a sample parsing test
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which constitutes the bulk of the patches, and the css destruction
    path is completely decoupled from the cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains preparatory patches for such a change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

02 Sep, 2013

2 commits

  • Adds a new PERF_RECORD_MMAP2 record type which is in essence
    an expanded version of PERF_RECORD_MMAP.

    Used to request mmap records with more information about
    the mapping, including device major, minor and the inode
    number and generation for mappings associated with files
    or shared memory segments. Works for code and data
    (with attr->mmap_data set).

    Existing PERF_RECORD_MMAP record is unmodified by this patch.
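
    For reference, a hedged sketch of the payload layout this adds, as
    described above (treat the exact field order as approximate; it is
    followed by the usual sample_id trailer when sample_id_all is set):

    #include <linux/perf_event.h>   /* struct perf_event_header, __u32/__u64 */

    struct mmap2_event_sketch {
        struct perf_event_header header;    /* PERF_RECORD_MMAP2         */
        __u32   pid, tid;
        __u64   addr;                       /* start of the mapping      */
        __u64   len;                        /* length of the mapping     */
        __u64   pgoff;                      /* file offset               */
        __u32   maj;                        /* device major number       */
        __u32   min;                        /* device minor number       */
        __u64   ino;                        /* inode number              */
        __u64   ino_generation;             /* inode generation          */
        char    filename[];                 /* NUL-terminated path       */
    };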

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1377079825-19057-2-git-send-email-eranian@google.com
    [ Added Al to the Cc:. Are the ino, maj/min exports of vma->vm_file OK? ]
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • The current throttling code triggers the WARN below via the following
    workload (only hit on an AMD machine with 48 CPUs):

    # while [ 1 ]; do perf record perf bench sched messaging; done

    WARNING: at arch/x86/kernel/cpu/perf_event.c:1054 x86_pmu_start+0xc6/0x100()
    SNIP
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x61/0x80
    [] warn_slowpath_null+0x1a/0x20
    [] x86_pmu_start+0xc6/0x100
    [] perf_adjust_freq_unthr_context.part.75+0x182/0x1a0
    [] perf_event_task_tick+0xc8/0xf0
    [] scheduler_tick+0xd1/0x140
    [] update_process_times+0x66/0x80
    [] tick_sched_handle.isra.15+0x25/0x60
    [] tick_sched_timer+0x41/0x60
    [] __run_hrtimer+0x74/0x1d0
    [] ? tick_sched_handle.isra.15+0x60/0x60
    [] hrtimer_interrupt+0xf7/0x240
    [] smp_apic_timer_interrupt+0x69/0x9c
    [] apic_timer_interrupt+0x6d/0x80
    [] ? __perf_event_task_sched_in+0x184/0x1a0
    [] ? kfree_skbmem+0x37/0x90
    [] ? __slab_free+0x1ac/0x30f
    [] ? kfree+0xfd/0x130
    [] kmem_cache_free+0x1b2/0x1d0
    [] kfree_skbmem+0x37/0x90
    [] consume_skb+0x34/0x80
    [] unix_stream_recvmsg+0x4e7/0x820
    [] sock_aio_read.part.7+0x116/0x130
    [] ? __perf_sw_event+0x19c/0x1e0
    [] sock_aio_read+0x21/0x30
    [] do_sync_read+0x80/0xb0
    [] vfs_read+0x145/0x170
    [] SyS_read+0x49/0xa0
    [] ? __audit_syscall_exit+0x1f6/0x2a0
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 622b7e226c4a766a ]---

    The reason is a race in perf_event_task_tick() throttling code.
    The race flow (simplified code):

    - perf_throttled_count is a per-cpu variable and is the
    CPU throttling flag, here starting at 0

    - perf_throttled_seq is the sequence/domain for the allowed
    count of interrupts within the tick; it gets increased
    each tick

    on a single CPU (CPU-bound event):

    ... workload

    perf_event_task_tick:
    |
    | T0 inc(perf_throttled_seq)
    | T1 needs_unthr = xchg(perf_throttled_count, 0) == 0
    tick gets interrupted:

    ... event gets throttled under new seq ...

    T2 last NMI comes, event is throttled - inc(perf_throttled_count)

    back to tick:
    | perf_adjust_freq_unthr_context:
    |
    | T3 unthrottling is skipped for the event (needs_unthr == 0)
    | T4 event is stopped and started via freq adjustment
    |
    tick ends

    ... workload
    ... no sample is hit for event ...

    perf_event_task_tick:
    |
    | T5 needs_unthr = xchg(perf_throttled_count, 0) != 0 (from T2)
    | T6 unthrottling is done on event (interrupts == MAX_INTERRUPTS)
    | event is already started (from T4) -> WARN

    Fix this by not checking needs_unthr again, and thus checking all
    events for unthrottling.
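
    A hedged sketch of the resulting shape of the tick-side loop (heavily
    abridged; the point is that the per-event MAX_INTERRUPTS test is no
    longer gated on the racy needs_unthr snapshot):

    /* Inside perf_adjust_freq_unthr_context(), sketched: */
    list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
        struct hw_perf_event *hwc = &event->hw;

        if (hwc->interrupts == MAX_INTERRUPTS) {
            hwc->interrupts = 0;            /* unthrottle unconditionally */
            perf_log_throttle(event, 1);
            event->pmu->start(event, 0);
        }

        /* ... frequency adjustment continues as before ... */
    }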

    Signed-off-by: Jiri Olsa
    Reported-by: Jan Stancek
    Suggested-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Andi Kleen
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1377355554-8934-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

30 Aug, 2013

1 commit

  • The event stream is not always parsable because the format of a sample
    is dependent on the sample_type of the selected event. When there is
    more than one selected event and the sample_types are not the same then
    parsing becomes problematic. A sample can be matched to its selected
    event using the ID that is allocated when the event is opened.
    Unfortunately, to get the ID from the sample means first parsing it.

    This patch adds a new sample format bit PERF_SAMPLE_IDENTIFIER that puts
    the ID at a fixed position so that the ID can be retrieved without
    parsing the sample. For sample events, that is the first position
    immediately after the header. For non-sample events, that is the last
    position.

    In this respect parsing samples requires that the sample_type and ID
    values are recorded. For example, perf tools records struct
    perf_event_attr and the IDs within the perf.data file. Those must be
    read first before it is possible to parse samples found later in the
    perf.data file.
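
    A user-space sketch of requesting the new bit so every sample carries
    the ID at a known offset (error handling omitted):

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int open_identified_counter(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_IDENTIFIER;

        /* pid = 0 (self), cpu = -1 (any), group_fd = -1, flags = 0 */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }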

    Signed-off-by: Adrian Hunter
    Tested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Mike Galbraith
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/r/1377591794-30553-6-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Adrian Hunter
     

27 Aug, 2013

1 commit

  • cgroup_css_from_dir() will grow another user. In preparation, make
    the following changes.

    * All css functions are prefixed with just "css_", rename it to
    css_from_dir().

    * Take dentry * instead of file * as dentry is what ultimately
    identifies a cgroup and file may not always be available. Note that
    the function now checkes whether @dentry->d_inode is NULL as the
    caller now may specify a negative dentry.

    * Make it take cgroup_subsys * instead of integer subsys_id. This
    simplifies the function and allows specifying no subsystem for
    cgroup->dummy_css.

    * Make return section a bit less verbose.

    This patch doesn't introduce any behavior changes.
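
    Sketch of the signature change implied by the points above (declarations
    only; the function bodies are not reproduced here):

    /* Before this change: */
    struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);

    /* After (sketch): a negative dentry (NULL ->d_inode) is rejected, and
     * a NULL @ss yields the cgroup's dummy_css. */
    struct cgroup_subsys_state *css_from_dir(struct dentry *dentry,
                                             struct cgroup_subsys *ss);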

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar

    Tejun Heo
     

16 Aug, 2013

3 commits

  • We should not be calling calc_timer_values() for events that do not actually
    have an mmap()'ed userpage.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130802191630.GT27162@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Freq events may not always be affine to a particular CPU. As such,
    account_event_cpu() may crash if we account per cpu a freq event
    that has event->cpu == -1.

    To solve this, let's account freq events globally. In practice
    this doesn't change the picture much because perf tools create
    per-task perf events with one event per CPU by default. Profiling a
    single CPU is usually a corner case, so there is not much point in
    optimizing things that way.
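
    A hedged sketch of the accounting move (the global counter name and the
    surrounding function are assumptions based on the description above):

    static atomic_t nr_freq_events __read_mostly;   /* global, not per cpu */

    static void account_event_sketch(struct perf_event *event)
    {
        if (event->attr.freq)
            atomic_inc(&nr_freq_events);    /* safe even if event->cpu == -1 */

        if (event->cpu != -1)
            account_event_cpu(event, event->cpu);   /* per-cpu bits only */
    }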

    Reported-by: Jiri Olsa
    Suggested-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Tested-by: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1375460996-16329-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • When we fail to allocate the callchain buffers, we roll back the refcount
    we did and return from get_callchain_buffers().

    However we take the refcount and allocate under the callchain lock
    but the rollback is done outside the lock.

    As a result, while we roll back, some concurrent callchain user may
    call get_callchain_buffers(), see the non-zero refcount and give up
    because the buffers are NULL without itself retrying the allocation.

    The consequences aren't that bad, but that behaviour looks weird enough
    that it's better to give the following callchain users a chance to retry
    where we failed.
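
    A hedged sketch of the intended locking shape, with the rollback done
    while still holding the mutex so concurrent callers either see a zero
    count or fully allocated buffers (names taken from the description,
    not from verified code):

    int get_callchain_buffers_sketch(void)
    {
        int err = 0;
        int count;

        mutex_lock(&callchain_mutex);

        count = atomic_inc_return(&nr_callchain_events);
        if (count == 1)
            err = alloc_callchain_buffers();
        else if (!callchain_cpus_entries)
            err = -ENOMEM;      /* a previous allocation failed */

        if (err)
            atomic_dec(&nr_callchain_events);   /* roll back under the lock */

        mutex_unlock(&callchain_mutex);
        return err;
    }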

    Reported-by: Jiri Olsa
    Signed-off-by: Frederic Weisbecker
    Acked-by: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1375460996-16329-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

13 Aug, 2013

1 commit

  • cgroup->subsys[] will become RCU protected and thus all cgroup_css()
    usages should either be under RCU read lock or cgroup_mutex. This
    patch updates cgroup_css_from_dir() which returns the matching
    cgroup_subsys_state given a directory file and subsys_id so that it
    requires RCU read lock and updates its sole user
    perf_cgroup_connect().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar

    Tejun Heo
     

09 Aug, 2013

3 commits

  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, subsystems only care about their part in the
    cgroup hierarchy - i.e. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

08 Aug, 2013

2 commits

  • It's possible some of the counters in the group could be
    disabled when the sampling member of the event group is reading
    the rest via PERF_SAMPLE_READ sample type processing. Disabled
    counters could then produce wrong numbers.

    Fixing that by reading only enabled counters for PERF_SAMPLE_READ
    sample type processing.

    Signed-off-by: Jiri Olsa
    Acked-by: Namhyung Kim
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-wwkjb0bbcuslnz0klrmqi26r@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • The only way to get the event ID is by reading the event fd,
    followed by parsing the ID value out of the returned data.

    While this is ok for the current read format used by the perf tool,
    it is not ok when we use the PERF_FORMAT_GROUP format.

    With this format the data are returned for the whole group
    and there's no way to find out which ID belongs to our fd
    (if we are not the group leader event).

    Add a simple ioctl that returns the event's primary ID for a given fd.
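
    A user-space sketch of the new ioctl; PERF_EVENT_IOC_ID fills in a u64
    with the event's ID:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    /* Works for any group member, even when PERF_FORMAT_GROUP is in use. */
    static int event_id_of_fd(int fd, uint64_t *id)
    {
        return ioctl(fd, PERF_EVENT_IOC_ID, id);    /* 0 on success */
    }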

    Signed-off-by: Jiri Olsa
    Acked-by: Namhyung Kim
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-v1bn5cto707jn0bon34afqr1@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     

31 Jul, 2013

7 commits

  • Currently the full dynticks subsystem keeps the
    tick alive as long as there are perf events running.

    This prevents the tick from being stopped as long as features
    such as the lockup detectors are running. As a temporary fix,
    the lockup detector is disabled by default when full dynticks
    is built, but this is not a long term viable solution.

    To fix this, only keep the tick alive when an event configured
    with a frequency rather than a period is running on the CPU,
    or when an event throttles on the CPU.

    These are the only purposes of the perf tick, especially now that
    the rotation of flexible events is handled from a separate hrtimer.
    The tick can be shut down the rest of the time.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-8-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This is going to be used by the full dynticks subsystem
    as finer-grained information to know when to keep and
    when to stop the tick.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-7-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • When an event is migrated, move the event per-cpu
    accounting accordingly so that branch stack and cgroup
    events work correctly on the new CPU.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This way we can use the per-cpu handling separately.
    This is going to be used to fix the event migration
    code accounting.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Gather all the event accounting code to a single place,
    once all the prerequisites are completed. This simplifies
    the refcounting.

    Original-patch-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • In case of allocation failure, get_callchain_buffers() keeps the
    refcount incremented for the current event.

    As a result, when get_callchain_buffers() returns an error,
    we must clean up what it did by cancelling its last refcount
    with a call to put_callchain_buffers().

    This is a hack in order to be able to call free_event()
    after that failure.

    The original purpose of that was to simplify the failure
    path. But this error handling is actually counter-intuitive,
    ugly and not very easy to follow, because one expects
    the resources used to perform a service to be cleaned up
    by the callee in case of failure, not by the caller.

    So let's clean this up by cancelling the refcount from
    get_callchain_buffers() in case of failure, and correctly free
    the event accordingly in perf_event_alloc().
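
    A hedged sketch of the resulting caller side (heavily abridged; the
    point is that the failure path in perf_event_alloc() no longer needs a
    compensating put_callchain_buffers(), and the error label here is
    illustrative):

    /* Inside perf_event_alloc(), sketched: */
    if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
        /* On failure the callee drops its own refcount. */
        err = get_callchain_buffers();
        if (err)
            goto err_free;      /* just undo what was set up before this */
    }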

    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • On callchain buffers allocation failure, free_event() is
    called and all the accounting performed in perf_event_alloc()
    for that event is cancelled.

    But if the event has branch stack sampling, it is also unaccounted
    from the branch stack sampling events refcount.

    This is a bug, because that accounting is only performed after the
    callchain buffer allocation. As a result, the branch stack sampling
    events refcount can become negative.

    To fix this, move the branch stack event accounting before the
    callchain buffer allocation.

    Reported-by: Peter Zijlstra
    Signed-off-by: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1374539466-4799-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

23 Jul, 2013

1 commit

  • Due to a discussion with Adrian I had a good look at the perf_event_type record
    layout and found the documentation to be somewhat unclear.

    Cc: Adrian Hunter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130716150907.GL23818@dyad.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Jul, 2013

2 commits

  • Merge in a v3.11-rc1-ish branch to go from v3.10 based development
    to a v3.11 based one.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Pull driver core patches from Greg KH:
    "Here are some driver core patches for 3.11-rc2. They aren't really
    bugfixes, but a bunch of new helper macros for drivers to properly
    create attribute groups, which drivers and subsystems need to fix up a
    ton of race issues with incorrectly creating sysfs files (binary and
    normal) after userspace has been told that the device is present.

    Also here is the ability to create binary files as attribute groups,
    to solve that race condition, which was impossible to do before this,
    so that's my fault the drivers were broken.

    The majority of the .c changes is indenting and moving code around a
    bit. It affects no existing code, but allows the large backlog of 70+
    patches that I already have created to start flowing into the
    different subtrees, instead of having to live in my driver-core tree,
    causing merge nightmares in linux-next for the next few months.

    These were finalized too late for the -rc1 merge window, which is why
    they didn't make that pull request; testing and review from
    others didn't happen until a few weeks ago, and then there's the whole
    distraction of the past few days, which prevented these from getting
    to you sooner, sorry about that.

    Oh, and there's a bugfix for the documentation build warning in here
    as well. All of these have been in linux-next this week, with no
    reported problems"

    * tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    driver-core: fix new kernel-doc warning in base/platform.c
    sysfs: use file mode defines from stat.h
    sysfs: add more helper macro's for (bin_)attribute(_groups)
    driver core: add default groups to struct class
    driver core: Introduce device_create_groups
    sysfs: prevent warning when only using binary attributes
    sysfs: add support for binary attributes in groups
    driver core: device.h: add RW and RO attribute macros
    sysfs.h: add BIN_ATTR macro
    sysfs.h: add ATTRIBUTE_GROUPS() macro
    sysfs.h: add __ATTR_RW() macro

    Linus Torvalds
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Jul, 2013

4 commits

  • It gives the following benefits:

    - only one function pointer is passed along the way

    - the 'match' function is called within the output function
    and could be inlined by the compiler

    Suggested-by: Peter Zijlstra
    Signed-off-by: Jiri Olsa
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373388991-9711-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Jiri managed to trigger this warning:

    [] ======================================================
    [] [ INFO: possible circular locking dependency detected ]
    [] 3.10.0+ #228 Tainted: G W
    [] -------------------------------------------------------
    [] p/6613 is trying to acquire lock:
    [] (rcu_node_0){..-...}, at: [] rcu_read_unlock_special+0xa7/0x250
    []
    [] but task is already holding lock:
    [] (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0xd9/0x2c0
    []
    [] which lock already depends on the new lock.
    []
    [] the existing dependency chain (in reverse order) is:
    []
    [] -> #4 (&ctx->lock){-.-...}:
    [] -> #3 (&rq->lock){-.-.-.}:
    [] -> #2 (&p->pi_lock){-.-.-.}:
    [] -> #1 (&rnp->nocb_gp_wq[1]){......}:
    [] -> #0 (rcu_node_0){..-...}:

    Paul was quick to explain that due to preemptible RCU we cannot call
    rcu_read_unlock() while holding scheduler (or nested) locks when part
    of the read side critical section was preemptible.

    Therefore solve it by making the entire RCU read side non-preemptible.

    Also pull out the retry from under the non-preempt to play nice with RT.
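
    A hedged, abridged sketch of the resulting shape of
    perf_lock_task_context() (only the locking order matters here; details
    are omitted):

    retry:
        /* The whole RCU read side must be non-preemptible, because we end
         * up calling rcu_read_unlock() while holding ctx->lock, which
         * nests under scheduler locks. */
        preempt_disable();
        rcu_read_lock();
        ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
        if (ctx) {
            raw_spin_lock_irqsave(&ctx->lock, *flags);
            if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
                raw_spin_unlock_irqrestore(&ctx->lock, *flags);
                rcu_read_unlock();
                preempt_enable();       /* retry with preemption enabled */
                goto retry;
            }
            /* ... */
        }
        rcu_read_unlock();
        preempt_enable();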

    Reported-by: Jiri Olsa
    Helped-out-by: Paul E. McKenney
    Cc:
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The '!ctx->is_active' check has a valid scenario, so
    there's no need for the warning.

    The reason is that there's a time window between the
    'ctx->is_active' check in the perf_event_enable() function
    and the __perf_event_enable() function having:

    - IRQs on
    - ctx->lock unlocked

    where the task could be killed and 'ctx' deactivated by
    perf_event_exit_task(), ending up with the warning below.

    So remove the WARN_ON_ONCE() check and add comments to
    explain it all.

    This addresses the following warning reported by Vince Weaver:

    [ 324.983534] ------------[ cut here ]------------
    [ 324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190()
    [ 324.984420] Modules linked in:
    [ 324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246
    [ 324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
    [ 324.984420] 0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00
    [ 324.984420] ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860
    [ 324.984420] 0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a
    [ 324.984420] Call Trace:
    [ 324.984420] [] dump_stack+0x19/0x1b
    [ 324.984420] [] warn_slowpath_common+0x70/0xa0
    [ 324.984420] [] warn_slowpath_null+0x1a/0x20
    [ 324.984420] [] __perf_event_enable+0x187/0x190
    [ 324.984420] [] remote_function+0x40/0x50
    [ 324.984420] [] generic_smp_call_function_single_interrupt+0xbe/0x130
    [ 324.984420] [] smp_call_function_single_interrupt+0x27/0x40
    [ 324.984420] [] call_function_single_interrupt+0x6f/0x80
    [ 324.984420] [] ? _raw_spin_unlock_irqrestore+0x41/0x70
    [ 324.984420] [] perf_event_exit_task+0x14d/0x210
    [ 324.984420] [] ? switch_task_namespaces+0x24/0x60
    [ 324.984420] [] do_exit+0x2b6/0xa40
    [ 324.984420] [] ? _raw_spin_unlock_irq+0x2c/0x30
    [ 324.984420] [] do_group_exit+0x49/0xc0
    [ 324.984420] [] get_signal_to_deliver+0x254/0x620
    [ 324.984420] [] do_signal+0x57/0x5a0
    [ 324.984420] [] ? __do_page_fault+0x2a4/0x4e0
    [ 324.984420] [] ? retint_restore_args+0xe/0xe
    [ 324.984420] [] ? retint_signal+0x11/0x84
    [ 324.984420] [] do_notify_resume+0x65/0x80
    [ 324.984420] [] retint_signal+0x46/0x84
    [ 324.984420] ---[ end trace 442ec2f04db3771a ]---

    Reported-by: Vince Weaver
    Signed-off-by: Jiri Olsa
    Suggested-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Currently when the child context for inherited events is
    created, it's based on the pmu object of the first event
    of the parent context.

    This is wrong for the following scenario:

    - HW context having HW and SW event
    - HW event got removed (closed)
    - SW event stays in HW context as the only event
    and its pmu is used to clone the child context

    The issue starts when the cpu context object is touched
    based on the pmu context object (__get_cpu_context). In
    this case the HW context will work with the SW cpu context,
    ending up with the WARN below.

    Fix this by using the parent context's pmu object to clone
    the child context from.
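
    A hedged, abridged sketch of the inheritance path after the fix
    (alloc_perf_context() is the context allocator; surrounding code is
    omitted):

    child_ctx = child->perf_event_ctxp[ctxn];
    if (!child_ctx) {
        /* Was: alloc_perf_context(event->pmu, child) -- wrong when the
         * only remaining event is a SW event sitting in a HW context. */
        child_ctx = alloc_perf_context(parent_ctx->pmu, child);
        if (!child_ctx)
            return -ENOMEM;
        child->perf_event_ctxp[ctxn] = child_ctx;
    }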

    Addresses the following warning reported by Vince Weaver:

    [ 2716.472065] ------------[ cut here ]------------
    [ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
    [ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
    [ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
    [ 2716.476035] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
    [ 2716.476035] 0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
    [ 2716.476035] ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
    [ 2716.476035] ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
    [ 2716.476035] Call Trace:
    [ 2716.476035] [] ? warn_slowpath_common+0x5b/0x70
    [ 2716.476035] [] ? task_ctx_sched_out+0x3c/0x5f
    [ 2716.476035] [] ? perf_event_exit_task+0xbf/0x194
    [ 2716.476035] [] ? do_exit+0x3e7/0x90c
    [ 2716.476035] [] ? __do_fault+0x359/0x394
    [ 2716.476035] [] ? do_group_exit+0x66/0x98
    [ 2716.476035] [] ? get_signal_to_deliver+0x479/0x4ad
    [ 2716.476035] [] ? __perf_event_task_sched_out+0x230/0x2d1
    [ 2716.476035] [] ? do_signal+0x3c/0x432
    [ 2716.476035] [] ? ctx_sched_in+0x43/0x141
    [ 2716.476035] [] ? perf_event_context_sched_in+0x7a/0x90
    [ 2716.476035] [] ? __perf_event_task_sched_in+0x31/0x118
    [ 2716.476035] [] ? mmdrop+0xd/0x1c
    [ 2716.476035] [] ? finish_task_switch+0x7d/0xa6
    [ 2716.476035] [] ? do_notify_resume+0x20/0x5d
    [ 2716.476035] [] ? retint_signal+0x3d/0x78
    [ 2716.476035] ---[ end trace 827178d8a5966c3d ]---

    Reported-by: Vince Weaver
    Signed-off-by: Jiri Olsa
    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

05 Jul, 2013

1 commit

  • This patch fixes a serious bug in:

    14c63f17b1fd perf: Drop sample rate when sampling is too slow

    There was a misunderstanding of the API of the do_div()
    macro. It returns the remainder of the division, and this
    was not what the function expected, leading to disabling the
    interrupt latency watchdog.

    This patch also removes a duplicate assignment in
    perf_sample_event_took().
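
    For clarity, the do_div() contract the fix relies on: the macro divides
    its 64-bit first argument in place and returns the remainder, not the
    quotient. A small hedged illustration:

    #include <linux/types.h>
    #include <asm/div64.h>      /* do_div() */

    static u64 avg_sample_ns_sketch(u64 total_ns, u32 nr_samples)
    {
        u64 avg = total_ns;
        u32 rem = do_div(avg, nr_samples);  /* avg is now the quotient */

        (void)rem;                          /* remainder, ignored here */
        return avg;
    }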

    Signed-off-by: Stephane Eranian
    Cc: peterz@infradead.org
    Cc: dave.hansen@linux.intel.com
    Cc: ak@linux.intel.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/20130704223010.GA30625@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

23 Jun, 2013

1 commit

  • This patch keeps track of how long perf's NMI handler is taking,
    and also calculates how many samples perf can take a second. If
    the sample length times the expected max number of samples
    exceeds a configurable threshold, it drops the sample rate.

    This way, we don't have a runaway sampling process eating up the
    CPU.

    This patch can tend to drop the sample rate down to a level where
    perf doesn't work very well. *BUT* the alternative is that my
    system hangs because it spends all of its time handling NMIs.

    I'll take a busted performance tool over an entire system that's
    busted and undebuggable any day.

    BTW, my suspicion is that there's still an underlying bug here.
    Using the HPET instead of the TSC is definitely a contributing
    factor, but I suspect there are some other things going on.
    But, I can't go dig down on a bug like that with my machine
    hanging all the time.

    Signed-off-by: Dave Hansen
    Acked-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: acme@ghostprotocols.net
    Cc: Dave Hansen
    [ Prettified it a bit. ]
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

20 Jun, 2013

3 commits

  • This patch simply moves all per-cpu variables into the new
    single per-cpu "struct bp_cpuinfo".

    To me this looks more logical and clean, but this can also
    simplify further potential changes. In particular, I do not
    think this memory should be per-cpu; it is never used "locally".
    After this change it is trivial to turn it into, say,
    bootmem[nr_cpu_ids].
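
    For orientation, a sketch of the consolidated structure as described
    (field meanings follow the breakpoint slot bookkeeping this commit
    gathers together; treat the details as approximate):

    struct bp_cpuinfo {
        /* Number of pinned CPU-bound breakpoints on this cpu */
        unsigned int    cpu_pinned;
        /* Histogram of pinned task-bound breakpoints, per slot count */
        unsigned int    *tsk_pinned;
        /* Number of non-pinned (flexible) breakpoints on this cpu */
        unsigned int    flexible;
    };

    static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);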

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155020.GA6350@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • 1. register_wide_hw_breakpoint() can use unregister_ on failure,
    no need to duplicate the code.

    2. "struct perf_event **pevent" adds an unnecessary level of
    indirection and complication; use per_cpu(*cpu_events, cpu).

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155018.GA6347@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the trivial helper which simply returns cpumask_of() or
    cpu_possible_mask depending on bp->cpu.

    Change fetch_bp_busy_slots() and toggle_bp_slot() to always do
    for_each_cpu(cpumask_of_bp) to simplify the code and avoid the
    code duplication.
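
    A sketch of the trivial helper described above:

    /* Per-CPU breakpoints iterate one CPU; task-bound ones iterate all. */
    static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
    {
        if (bp->cpu >= 0)
            return cpumask_of(bp->cpu);
        return cpu_possible_mask;
    }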

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20130620155015.GA6340@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov