09 Aug, 2014
1 commit
-
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fnAfter:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfreeAs you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.oThis patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Vladimir Davydov
Signed-off-by: Hugh Dickins
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Jul, 2014
2 commits
-
Signed-off-by: Ingo Molnar
-
Pull perf fixes from Thomas Gleixner:
"A bunch of fixes for perf and kprobes:
- revert a commit that caused a perf group regression
- silence dmesg spam
- fix kprobe probing errors on ia64 and ppc64
- filter kprobe faults from userspace
- lockdep fix for perf exit path
- prevent perf #GP in KVM guest
- correct perf event and filters"* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
kprobes/x86: Don't try to resolve kprobe faults from userspace
perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
perf/x86/intel: Protect LBR and extra_regs against KVM lying
perf: Fix lockdep warning on process exit
perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
perf: Revert ("perf: Always destroy groups on exit")
17 Jul, 2014
1 commit
-
Pull perf fixes from Ingo Molnar:
"Tooling fixes and an Intel PMU driver fixlet"* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not allow optimized switch for non-cloned events
perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
perf symbols: Get kernel start address by symbol name
perf tools: Fix segfault in cumulative.callchain report
16 Jul, 2014
3 commits
-
The following patch added another way to get mmap name: 78d683e838a6
("mm, fs: Add vm_ops->name as an alternative to arch_vma_name")The vdso vma mapping already switch to this and we no longer get vdso
name via arch_vma_name function. Adding this way to the perf mmap
event name retrieval code.Caught this via perf test:
$ sudo ./perf test -v 7
7: Validate PERF_RECORD_* events & perf_sample fields :
--- start ---SNIP
PERF_RECORD_MMAP for [vdso] missing!
test child finished with 255
---- end ----
Validate PERF_RECORD_* events & perf_sample fields: FAILED!Signed-off-by: Jiri Olsa
Acked-by: Andy Lutomirski
Signed-off-by: Peter Zijlstra
Cc: Namhyung Kim
Cc: Arnaldo Carvalho de Melo
Cc: Paul Mackerras
Cc: Corey Ashford
Cc: David Ahern
Cc: Frederic Weisbecker
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1405353439-14211-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar -
Sasha Levin reported:
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel I've stumbled on the following spew:
>
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.15.0-next-20140613-sasha-00026-g6dd125d-dirty #654 Not tainted
> -------------------------------------------------------
> trinity-c578/9725 is trying to acquire lock:
> (&(&pool->lock)->rlock){-.-...}, at: __queue_work (kernel/workqueue.c:1346)
>
> but task is already holding lock:
> (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
>
> which lock already depends on the new lock.> 1 lock held by trinity-c578/9725:
> #0: (&ctx->lock){-.....}, at: perf_event_exit_task (kernel/events/core.c:7471 kernel/events/core.c:7533)
>
> Call Trace:
> dump_stack (lib/dump_stack.c:52)
> print_circular_bug (kernel/locking/lockdep.c:1216)
> __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
> lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
> _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
> __queue_work (kernel/workqueue.c:1346)
> queue_work_on (kernel/workqueue.c:1424)
> free_object (lib/debugobjects.c:209)
> __debug_check_no_obj_freed (lib/debugobjects.c:715)
> debug_check_no_obj_freed (lib/debugobjects.c:727)
> kmem_cache_free (mm/slub.c:2683 mm/slub.c:2711)
> free_task (kernel/fork.c:221)
> __put_task_struct (kernel/fork.c:250)
> put_ctx (include/linux/sched.h:1855 kernel/events/core.c:898)
> perf_event_exit_task (kernel/events/core.c:907 kernel/events/core.c:7478 kernel/events/core.c:7533)
> do_exit (kernel/exit.c:766)
> do_group_exit (kernel/exit.c:884)
> get_signal_to_deliver (kernel/signal.c:2347)
> do_signal (arch/x86/kernel/signal.c:698)
> do_notify_resume (arch/x86/kernel/signal.c:751)
> int_signal (arch/x86/kernel/entry_64.S:600)Urgh.. so the only way I can make that happen is through:
perf_event_exit_task_context()
raw_spin_lock(&child_ctx->lock);
unclone_ctx(child_ctx)
put_ctx(ctx->parent_ctx);
raw_spin_unlock_irqrestore(&child_ctx->lock);And we can avoid this by doing the change below.
I can't immediately see how this changed recently, but given that you
say it's easy to reproduce, lets fix this.Reported-by: Sasha Levin
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Dave Jones
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140623141242.GB19860@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
Vince reported that commit 15a2d4de0eab5 ("perf: Always destroy groups
on exit") causes a regression with grouped events. In particular his
read_group_attached.c test fails.https://github.com/deater/perf_event_tests/blob/master/tests/bugs/read_group_attached.c
Because of the context switch optimization in
perf_event_context_sched_out() the 'original' event may end up in the
child process and when that exits the change in the patch in question
destroys the actual grouping.Therefore revert that change and only destroy inherited groups.
Reported-by: Vince Weaver
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/n/tip-zedy3uktcp753q8fw8dagx7a@git.kernel.org
Signed-off-by: Ingo Molnar
05 Jul, 2014
1 commit
-
Leftover from '8dc85d5 perf: Multiple task contexts'.
Signed-off-by: Jiri Olsa
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Link: http://lkml.kernel.org/r/1403598026-2310-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar
04 Jul, 2014
1 commit
-
…it/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code
where running:# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whateverkills the uprobe. Along the way he found some other minor bugs and
clean ups that he fixed up making it a total of 4 patches.Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when
ftrace was the only user, but now that it's used by perf and kprobes,
this is a bug where reading trace can disable kprobes and perf. A
very unexpected side effect and should be fixed"* tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove ftrace_stop/start() from reading the trace file
tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
tracing/uprobes: Revert "Support mix of ftrace and perf"
02 Jul, 2014
1 commit
-
The context check in perf_event_context_sched_out allows
non-cloned context to be part of the optimized schedule
out switch.This could move non-cloned context into another workload
child. Once this child exits, the context is closed and
leaves all original (parent) events in closed state.Any other new cloned event will have closed state and not
measure anything. And probably causing other odd bugs.Signed-off-by: Jiri Olsa
Signed-off-by: Peter Zijlstra
Cc:
Cc: Arnaldo Carvalho de Melo
Cc: Paul Mackerras
Cc: Frederic Weisbecker
Cc: Namhyung Kim
Cc: Paul Mackerras
Cc: Corey Ashford
Cc: David Ahern
Cc: Jiri Olsa
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1403598026-2310-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar
01 Jul, 2014
1 commit
-
Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure
that nobody tries to play with the dead uprobe/consumer. This helps
to catch the bugs like the one fixed by the previous patch.In the longer term we should fix this poorly designed interface.
uprobe_register() should return "struct uprobe *" which should be
passed to apply/unregister. Plus other semantic changes, see the
changelog in commit 41ccba029e94.Link: http://lkml.kernel.org/p/20140627170140.GA18322@redhat.com
Acked-by: Namhyung Kim
Acked-by: Srikar Dronamraju
Signed-off-by: Oleg Nesterov
Signed-off-by: Steven Rostedt
14 Jun, 2014
1 commit
-
Signed-off-by: Ingo Molnar
13 Jun, 2014
1 commit
-
Pull more perf updates from Ingo Molnar:
"A second round of perf updates:- wide reaching kprobes sanitization and robustization, with the hope
of fixing all 'probe this function crashes the kernel' bugs, by
Masami Hiramatsu.- uprobes updates from Oleg Nesterov: tmpfs support, corner case
fixes and robustization work.- perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo
et al:
* Add support to accumulate hist periods (Namhyung Kim)
* various fixes, refactorings and enhancements"* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
perf: Differentiate exec() and non-exec() comm events
perf: Fix perf_event_comm() vs. exec() assumption
uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
perf/documentation: Add description for conditional branch filter
perf/x86: Add conditional branch filtering support
perf/tool: Add conditional branch filter 'cond' to perf record
perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
uprobes: Teach copy_insn() to support tmpfs
uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
perf/x86: Use common PMU interrupt disabled code
perf/ARM: Use common PMU interrupt disabled code
perf: Disable sampled events if no PMU interrupt
perf: Fix use after free in perf_remove_from_context()
perf tools: Fix 'make help' message error
perf record: Fix poll return value propagation
perf tools: Move elide bool into perf_hpp_fmt struct
perf tools: Remove elide setup for SORT_MODE__MEMORY mode
perf tools: Fix "==" into "=" in ui_browser__warning assignment
perf tools: Allow overriding sysfs and proc finding with env var
perf tools: Consider header files outside perf directory in tags target
...
10 Jun, 2014
1 commit
-
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchyDocumentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
09 Jun, 2014
2 commits
-
This reverts commit 3090ffb5a2515990182f3f55b0688a7817325488.
Re-enable the mmap2 interface as we will have a user soon.
Since things have changed since perf disabled mmap2, small tweaks
to the revert had to be done:o commit 9d4ecc88 forced (n!=8) to become (n
Link: http://lkml.kernel.org/r/1401461382-209586-1-git-send-email-dzickus@redhat.com
Signed-off-by: Jiri Olsa -
The mmap2 interface was missing the protection and flags bits needed to
accurately determine if a mmap memory area was shared or private and
if it was readable or not.Signed-off-by: Peter Zijlstra
[tweaked patch to compile and wrote changelog]
Signed-off-by: Don Zickus
Link: http://lkml.kernel.org/r/1400526833-141779-2-git-send-email-dzickus@redhat.com
Signed-off-by: Jiri Olsa
06 Jun, 2014
4 commits
-
perf tools like 'perf report' can aggregate samples by comm strings,
which generally works. However, there are other potential use-cases.
For example, to pair up 'calls' with 'returns' accurately (from branch
events like Intel BTS) it is necessary to identify whether the process
has exec'd. Although a comm event is generated when an 'exec' happens
it is also generated whenever the comm string is changed on a whim
(e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event
to differentiate one case from the other.In order to determine whether the kernel supports the new flag, a
selection bit named 'exec' is added to struct perf_event_attr. The
bit does nothing but will cause perf_event_open() to fail if the bit
is set on kernels that do not have it defined.Signed-off-by: Adrian Hunter
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com
Cc: Paul Mackerras
Cc: Dave Jones
Cc: Arnaldo Carvalho de Melo
Cc: David Ahern
Cc: Jiri Olsa
Cc: Alexander Viro
Cc: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
Conflicts:
arch/x86/kernel/traps.cSigned-off-by: Ingo Molnar
-
perf_event_comm() assumes that set_task_comm() is only called on
exec(), and in particular that its only called on current.Neither are true, as Dave reported a WARN triggered by set_task_comm()
being called on !current.Separate the exec() hook from the comm hook.
Reported-by: Dave Jones
Signed-off-by: Peter Zijlstra
Cc: Alexander Viro
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140521153219.GH5226@laptop.programming.kicks-ass.net
[ Build fix. ]
Signed-off-by: Ingo Molnar -
Pull ARM updates from Russell King:
- Major clean-up of the L2 cache support code. The existing mess was
becoming rather unmaintainable through all the additions that others
have done over time. This turns it into a much nicer structure, and
implements a few performance improvements as well.- Clean up some of the CP15 control register tweaks for alignment
support, moving some code and data into alignment.c- DMA properties for ARM, from Santosh and reviewed by DT people. This
adds DT properties to specify bus translations we can't discover
automatically, and to indicate whether devices are coherent.- Hibernation support for ARM
- Make ftrace work with read-only text in modules
- add suspend support for PJ4B CPUs
- rework interrupt masking for undefined instruction handling, which
allows us to enable interrupts earlier in the handling of these
exceptions.- support for big endian page tables
- fix stacktrace support to exclude stacktrace functions from the
trace, and add save_stack_trace_regs() implementation so that kprobes
can record stack traces.- Add support for the Cortex-A17 CPU.
- Remove last vestiges of ARM710 support.
- Removal of ARM "meminfo" structure, finally converting us solely to
memblock to handle the early memory initialisation.* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (142 commits)
ARM: ensure C page table setup code follows assembly code (part II)
ARM: ensure C page table setup code follows assembly code
ARM: consolidate last remaining open-coded alignment trap enable
ARM: remove global cr_no_alignment
ARM: remove CPU_CP15 conditional from alignment.c
ARM: remove unused adjust_cr() function
ARM: move "noalign" command line option to alignment.c
ARM: provide common method to clear bits in CPU control register
ARM: 8025/1: Get rid of meminfo
ARM: 8060/1: mm: allow sub-architectures to override PCI I/O memory type
ARM: 8066/1: correction for ARM patch 8031/2
ARM: 8049/1: ftrace/add save_stack_trace_regs() implementation
ARM: 8065/1: remove last use of CONFIG_CPU_ARM710
ARM: 8062/1: Modify ldrt fixup handler to re-execute the userspace instruction
ARM: 8047/1: rwsem: use asm-generic rwsem implementation
ARM: l2c: trial at enabling some Cortex-A9 optimisations
ARM: l2c: add warnings for stuff modifying aux_ctrl register values
ARM: l2c: print a warning with L2C-310 caches if the cache size is modified
ARM: l2c: remove old .set_debug method
ARM: l2c: kill L2X0_AUX_CTRL_MASK before anyone else makes use of this
...
05 Jun, 2014
5 commits
-
tmpfs is widely used but as Denys reports shmem_aops doesn't have
->readpage() and thus you can't probe a binary on this filesystem.As Hugh suggested we can use shmem_read_mapping_page() in this case,
just we need to check shmem_mapping() if ->readpage == NULL.Reported-by: Denys Vlasenko
Suggested-by: Hugh Dickins
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140519184136.GB6750@redhat.com
Signed-off-by: Ingo Molnar -
copy_insn() fails with -EIO if ->readpage == NULL, but this error
is not propagated unless uprobe_register() path finds ->mm which
already mmaps this file. In this case (say) "perf record" does not
actually install the probe, but the user can't know about this.Move this check into uprobe_register() so that this problem can be
detected earlier and reported to user.Note: this is still not perfect,
- copy_insn() and arch_uprobe_analyze_insn() should be called
by uprobe_register() but this is not simple, we need vm_file
for read_mapping_page() (although perhaps we can pass NULL),
and we need ->mm for is_64bit_mm() (although this logic is
broken anyway).- uprobe_register() should be called by create_trace_uprobe(),
not by probe_event_enable(), so that an error can be detected
at "perf probe -x" time. This also needs more changes in the
core uprobe code, uprobe register/unregister interface was
poorly designed from the very beginning.Reported-by: Denys Vlasenko
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Signed-off-by: Peter Zijlstra
Cc: Hugh Dickins
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140519184054.GA6750@redhat.com
Signed-off-by: Ingo Molnar -
Add common code to generate -ENOTSUPP at event creation time if an
architecture attempts to create a sampled event and
PERF_PMU_NO_INTERRUPT is set.This adds a new pmu->capabilities flag. Initially we only support
PERF_PMU_NO_INTERRUPT (to indicate a PMU has no support for generating
hardware interrupts) but there are other capabilities that can be
added later.Signed-off-by: Vince Weaver
Acked-by: Will Deacon
[peterz: rename to PERF_PMU_CAP_* and moved the pmu::capabilities word into a hole]
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1405161708060.11099@vincent-weaver-1.umelst.maine.edu
Signed-off-by: Ingo MolnarSigned-off-by: Ingo Molnar
-
While that mutex should guard the elements, it doesn't guard against the
use-after-free that's from list_for_each_entry_rcu().
__perf_event_exit_task() can actually free the event.And because list addition/deletion is guarded by both ctx->mutex and
ctx->lock, holding ctx->mutex is sufficient for reading the list, so we
don't actually need the rcu list iteration.Fixes: 3a497f48637e ("perf: Simplify perf_event_exit_task_context()")
Reported-by: Sasha Levin
Tested-by: Sasha Levin
Signed-off-by: Peter Zijlstra
Cc: Dave Jones
Cc: acme@ghostprotocols.net
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140529170024.GA2315@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
These bits from Oleg are fully cooked, ship them to Linus.
Signed-off-by: Ingo Molnar
04 Jun, 2014
1 commit
-
…git/tip/tip into next
Pull perf updates from Ingo Molnar:
"The tooling changes maintained by Jiri Olsa until Arnaldo is on
vacation:User visible changes:
- Add -F option for specifying output fields (Namhyung Kim)
- Propagate exit status of a command line workload for record command
(Namhyung Kim)
- Use tid for finding thread (Namhyung Kim)
- Clarify the output of perf sched map plus small sched command
fixes (Dongsheng Yang)
- Wire up perf_regs and unwind support for ARM64 (Jean Pihet)
- Factor hists statistics counts processing which in turn also fixes
several bugs in TUI report command (Namhyung Kim)
- Add --percentage option to control absolute/relative percentage
output (Namhyung Kim)
- Add --list-cmds to 'kmem', 'mem', 'lock' and 'sched', for use by
completion scripts (Ramkumar Ramachandra)Development/infrastructure changes and fixes:
- Android related fixes for pager and map dso resolving (Michael
Lentine)
- Add libdw DWARF post unwind support for ARM (Jean Pihet)
- Consolidate types.h for ARM and ARM64 (Jean Pihet)
- Fix possible null pointer dereference in session.c (Masanari Iida)
- Cleanup, remove unused variables in map_switch_event() (Dongsheng
Yang)
- Remove nr_state_machine_bugs in perf latency (Dongsheng Yang)
- Remove usage of trace_sched_wakeup(.success) (Peter Zijlstra)
- Cleanups for perf.h header (Jiri Olsa)
- Consolidate types.h and export.h within tools (Borislav Petkov)
- Move u64_swap union to its single user's header, evsel.h (Borislav
Petkov)
- Fix for s390 to properly parse tracepoints plus test code
(Alexander Yarygin)
- Handle EINTR error for readn/writen (Namhyung Kim)
- Add a test case for hists filtering (Namhyung Kim)
- Share map_groups among threads of the same group (Arnaldo Carvalho
de Melo, Jiri Olsa)
- Making some code (cpu node map and report parse callchain callback)
global to be usable by upcomming changes (Don Zickus)
- Fix pmu object compilation error (Jiri Olsa)Kernel side changes:
- intrusive uprobes fixes from Oleg Nesterov. Since the interface is
admin-only, and the bug only affects user-space ("any probed
jmp/call can kill the application"), we queued these fixes via the
development tree, as a special exception.
- more fuzzer motivated race fixes and related refactoring and
robustization.
- allow PMU drivers to be built as modules. (No actual module yet,
because the x86 Intel uncore module wasn't ready in time for this)"* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
perf tools: Add automatic remapping of Android libraries
perf tools: Add cat as fallback pager
perf tests: Add a testcase for histogram output sorting
perf tests: Factor out print_hists_*()
perf tools: Introduce reset_output_field()
perf tools: Get rid of obsolete hist_entry__sort_list
perf hists: Reset width of output fields with header length
perf tools: Skip elided sort entries
perf top: Add --fields option to specify output fields
perf report/tui: Fix a bug when --fields/sort is given
perf tools: Add ->sort() member to struct sort_entry
perf report: Add -F option to specify output fields
perf tools: Call perf_hpp__init() before setting up GUI browsers
perf tools: Consolidate management of default sort orders
perf tools: Allow hpp fields to be sort keys
perf ui: Get rid of callback from __hpp__fmt()
perf tools: Consolidate output field handling to hpp format routines
perf tools: Use hpp formats to sort final output
perf tools: Support event grouping in hpp ->sort()
perf tools: Use hpp formats to sort hist entries
...
26 May, 2014
1 commit
-
After instruction write into xol area, on ARM V7
architecture code need to flush dcache and icache to sync
them up for given set of addresses. Having just
'flush_dcache_page(page)' call is not enough - it is
possible to have stale instruction sitting in icache
for given xol area slot address.Introduce arch_uprobe_ixol_copy weak function
that by default calls uprobes copy_to_page function and
than flush_dcache_page function and on ARM define new one
that handles xol slot copy in ARM specific wayflush_uprobe_xol_access function shares/reuses implementation
with/of flush_ptrace_access function and takes care of writing
instruction to user land address space on given variety of
different cache types on ARM CPUs. Because
flush_uprobe_xol_access does not have vma around
flush_ptrace_access was split into two parts. First that
retrieves set of condition from vma and common that receives
those conditions as flags.Note ARM cache flush function need kernel address
through which instruction write happened, so instead
of using uprobes copy_to_page function changed
code to explicitly map page and do memcpy.Note arch_uprobe_copy_ixol function, in similar way as
copy_to_user_page function, has preempt_disable/preempt_enable.Signed-off-by: Victor Kamensky
Acked-by: Oleg Nesterov
Reviewed-by: David A. Long
Signed-off-by: Russell King
19 May, 2014
4 commits
-
... in 3a497f48637 ("perf: Simplify perf_event_exit_task_context()")
Signed-off-by: Borislav Petkov
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1399720259-28275-1-git-send-email-bp@alien8.de
Signed-off-by: Thomas Gleixner -
Alexander noticed that we use RCU iteration on rb->event_list but do
not use list_{add,del}_rcu() to add,remove entries to that list, nor
do we observe proper grace periods when re-using the entries.Merge ring_buffer_detach() into ring_buffer_attach() such that
attaching to the NULL buffer is detaching.Furthermore, ensure that between any 'detach' and 'attach' of the same
event we observe the required grace period, but only when strictly
required. In effect this means that only ioctl(.request =
PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
normal initial attach and final detach will not be delayed.This patch should, I think, do the right thing under all
circumstances, the 'normal' cases all should never see the extra grace
period, but the two cases:1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
ring_buffer set, will now observe the required grace period between
removing itself from the old and attaching itself to the new buffer.This case is 'simple' in that both buffers are present in
perf_event_set_output() one could think an unconditional
synchronize_rcu() would be sufficient; however...2) an event that has a buffer attached, the buffer is destroyed
(munmap) and then the event is attached to a new/different buffer
using PERF_EVENT_IOC_SET_OUTPUT.This case is more complex because the buffer destruction does:
ring_buffer_attach(.rb = NULL)
followed by the ioctl() doing:
ring_buffer_attach(.rb = foo);and we still need to observe the grace period between these two
calls due to us reusing the event->rb_entry list_head.In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
call.Cc: Paul Mackerras
Cc: Stephane Eranian
Cc: Andi Kleen
Cc: "Paul E. McKenney"
Cc: Ingo Molnar
Cc: Frederic Weisbecker
Cc: Mike Galbraith
Reported-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner -
The perf cpu offline callback takes down all cpu context
events and releases swhash->swevent_hlist.This could race with task context software event being just
scheduled on this cpu via perf_swevent_add while cpu hotplug
code already cleaned up event's data.The race happens in the gap between the cpu notifier code
and the cpu being actually taken down. Note that only cpu
ctx events are terminated in the perf cpu hotplug code.It's easily reproduced with:
$ perf record -e faults perf bench sched pipewhile putting one of the cpus offline:
# echo 0 > /sys/devices/system/cpu/cpu1/onlineConsole emits following warning:
WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
Modules linked in:
CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G W 3.14.0+ #256
Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
Call Trace:
[] dump_stack+0x4f/0x7c
[] warn_slowpath_common+0x8c/0xc0
[] warn_slowpath_null+0x1a/0x20
[] perf_swevent_add+0x18d/0x1a0
[] event_sched_in.isra.75+0x9e/0x1f0
[] group_sched_in+0x6a/0x1f0
[] ? sched_clock_local+0x25/0xa0
[] ctx_sched_in+0x1f6/0x450
[] perf_event_sched_in+0x6b/0xa0
[] perf_event_context_sched_in+0x7b/0xc0
[] __perf_event_task_sched_in+0x43e/0x460
[] ? put_lock_stats.isra.18+0xe/0x30
[] finish_task_switch+0xb8/0x100
[] __schedule+0x30e/0xad0
[] ? pipe_read+0x3e2/0x560
[] ? preempt_schedule_irq+0x3e/0x70
[] ? preempt_schedule_irq+0x3e/0x70
[] preempt_schedule_irq+0x44/0x70
[] retint_kernel+0x20/0x30
[] ? lockdep_sys_exit+0x1a/0x90
[] lockdep_sys_exit_thunk+0x35/0x67
[] ? sysret_check+0x5/0x56Fixing this by tracking the cpu hotplug state and displaying
the WARN only if current cpu is initialized properly.Cc: Corey Ashford
Cc: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Paul Mackerras
Cc: Arnaldo Carvalho de Melo
Cc: stable@vger.kernel.org
Reported-by: Fengguang Wu
Signed-off-by: Jiri Olsa
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
Signed-off-by: Thomas Gleixner -
Vince reported that using a large sample_period (one with bit 63 set)
results in wreckage since while the sample_period is fundamentally
unsigned (negative periods don't make sense) the way we implement
things very much rely on signed logic.So limit sample_period to 63 bits to avoid tripping over this.
Reported-by: Vince Weaver
Signed-off-by: Peter Zijlstra
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
Signed-off-by: Thomas Gleixner
14 May, 2014
3 commits
-
If the probed insn triggers a trap, ->si_addr = regs->ip is technically
correct, but this is not what the signal handler wants; we need to pass
the address of the probed insn, not the address of xol slot.Add the new arch-agnostic helper, uprobe_get_trap_addr(), and change
fill_trap_info() and math_error() to use it. !CONFIG_UPROBES case in
uprobes.h uses a macro to avoid include hell and ensure that it can be
compiled even if an architecture doesn't define instruction_pointer().Test-case:
#include
#include
#includeextern void probe_div(void);
void sigh(int sig, siginfo_t *info, void *c)
{
int passed = (info->si_addr == probe_div);
printf(passed ? "PASS\n" : "FAIL\n");
_exit(!passed);
}int main(void)
{
struct sigaction sa = {
.sa_sigaction = sigh,
.sa_flags = SA_SIGINFO,
};sigaction(SIGFPE, &sa, NULL);
asm (
"xor %ecx,%ecx\n"
".globl probe_div; probe_div:\n"
"idiv %ecx\n"
);return 0;
}it fails if probe_div() is probed.
Note: show_unhandled_signals users should probably use this helper too,
but we need to cleanup them first.Signed-off-by: Oleg Nesterov
Reviewed-by: Masami Hiramatsu -
Hugh says:
The one I noticed was that it forgets all about memcg (because
it was copied from KSM, and there the replacement page has already
been charged to a memcg). See how mm/memory.c do_anonymous_page()
does a mem_cgroup_charge_anon().Hopefully not a big problem, uprobes is a system-wide thing and only
root can insert the probes. But I agree, should be fixed anyway.Add mem_cgroup_{un,}charge_anon() into uprobe_write_opcode(). To simplify
the error handling (and avoid the new "uncharge" label) the patch also
moves anon_vma_prepare() up before we alloc/charge the new page.While at it fix the comment about ->mmap_sem, it is held for write.
Suggested-by: Hugh Dickins
Signed-off-by: Oleg Nesterov -
Unlike the more usual refcnting, what css_tryget() provides is the
distinction between online and offline csses instead of protection
against upping a refcnt which already reached zero. cgroup is
planning to provide actual tryget which fails if the refcnt already
reached zero. Let's rename the existing trygets so that they clearly
indicate that they're onliness.I thought about keeping the existing names as-are and introducing new
names for the planned actual tryget; however, given that each
controller participates in the synchronization of the online state, it
seems worthwhile to make it explicit that these functions are about
on/offline state.Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
to css_tryget_online_from_dir(). This is pure rename.v2: cgroup_freezer grew new usages of css_tryget(). Update
accordingly.Signed-off-by: Tejun Heo
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Li Zefan
Cc: Vivek Goyal
Cc: Jens Axboe
Cc: Peter Zijlstra
Cc: Paul Mackerras
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
07 May, 2014
6 commits
-
Instead of jumping through hoops to make sure to find (and exit) each
event, do it the simple straight fwd way.Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vince Weaver
Cc: Stephane Eranian
Link: http://lkml.kernel.org/n/tip-tij931199thfkys8vbnokdpf@git.kernel.org
Signed-off-by: Ingo Molnar -
Primarily make perf_event_release_kernel() into put_event(), this will
allow kernel space to create per-task inherited events, and is safer
in general.Also, document the free_event() assumptions.
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vince Weaver
Cc: Stephane Eranian
Link: http://lkml.kernel.org/n/tip-rk9pvr6e1d0559lxstltbztc@git.kernel.org
Signed-off-by: Ingo Molnar -
Document and validate the locking assumption of event_sched_in().
Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vince Weaver
Cc: Stephane Eranian
Link: http://lkml.kernel.org/n/tip-sybq1publ9xt5no77cwvi0eo@git.kernel.org
Signed-off-by: Ingo Molnar -
Commit 38b435b16c36 ("perf: Fix tear-down of inherited group events")
states that we need to destroy groups for inherited events, but it
doesn't make any sense to not also destroy groups for normal events.And while it usually makes no difference (the normal events won't
leak, and its very likely all the group events will die in quick
succession) it does make the code more consistent and closes a
potential hole for trouble.Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vince Weaver
Cc: Stephane Eranian
Link: http://lkml.kernel.org/n/tip-426egt8zmsm12d2q8k2xz4tt@git.kernel.org
Signed-off-by: Ingo Molnar -
Make sure all events in a group have the same inherit state. It was
possible for group leaders to have inherit set while sibling events
would not have inherit set.In this case we'd still inherit the siblings, leading to some
non-fatal weirdness.Signed-off-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vince Weaver
Cc: Stephane Eranian
Link: http://lkml.kernel.org/n/tip-r32tt8yldvic3jlcghd3g35u@git.kernel.org
Signed-off-by: Ingo Molnar -
Signed-off-by: Ingo Molnar