22 Oct, 2012

1 commit


13 Oct, 2012

1 commit

  • Pull perf updates from Ingo Molnar:
    "This tree includes some late late perf items that missed the first
    round:

    tools:

    - Bash auto completion improvements, now we can auto complete the
    tools long options, tracepoint event names, etc, from Namhyung Kim.

    - Look up thread using tid instead of pid in 'perf sched'.

    - Move global variables into a perf_kvm struct, from David Ahern.

    - Hists refactorings, preparatory for improved 'diff' command, from
    Jiri Olsa.

    - Hists refactorings, preparatory for event group viewing work, from
    Namhyung Kim.

    - Remove double negation on optional feature macro definitions, from
    Namhyung Kim.

    - Remove several cases of needless global variables, on most
    builtins.

    - misc fixes

    kernel:

    - sysfs support for IBS on AMD CPUs, from Robert Richter.

    - Support for an upcoming Intel CPU, the Xeon-Phi / Knights Corner
    HPC blade PMU, from Vince Weaver.

    - misc fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    perf: Fix perf_cgroup_switch for sw-events
    perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu
    perf/AMD/IBS: Add sysfs support
    perf hists: Add more helpers for hist entry stat
    perf hists: Move he->stat.nr_events initialization to a template
    perf hists: Introduce struct he_stat
    perf diff: Removing the total_period argument from output code
    perf tool: Add hpp interface to enable/disable hpp column
    perf tools: Removing hists pair argument from output path
    perf hists: Separate overhead and baseline columns
    perf diff: Refactor diff displacement possition info
    perf hists: Add struct hists pointer to struct hist_entry
    perf tools: Complete tracepoint event names
    perf/x86: Add support for Intel Xeon-Phi Knights Corner PMU
    perf evlist: Remove some unused methods
    perf evlist: Introduce add_newtp method
    perf kvm: Move global variables into a perf_kvm struct
    perf tools: Convert to BACKTRACE_SUPPORT
    perf tools: Long option completion support for each subcommands
    perf tools: Complete long option names of perf command
    ...

    Linus Torvalds
     

09 Oct, 2012

3 commits

  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.
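
    The rule for change_pte clients can be pictured with a small userspace
    sketch. Everything here is an assumption made for illustration: the
    demo_client type, its counters and the demo_* helpers are hypothetical
    and are not the kernel's mmu notifier API. The client counts outstanding
    range invalidations and ignores change_pte while more than one is open.

```c
#include <assert.h>

/* Hypothetical notifier client, for illustration only. */
struct demo_client {
    int ranges_active;  /* range_start calls that have not ended yet */
    int pte_updates;    /* times change_pte actually updated a mapping */
};

static void demo_range_start(struct demo_client *c)
{
    c->ranges_active++;
}

static void demo_range_end(struct demo_client *c)
{
    assert(c->ranges_active > 0);
    c->ranges_active--;
}

static void demo_change_pte(struct demo_client *c)
{
    /*
     * With more than one range invalidation in flight, acting here
     * could miss a later invalidation, so the update is skipped.
     */
    if (c->ranges_active > 1)
        return;
    c->pte_updates++;
}
```

    With a single open range the update is applied; once a second,
    overlapping range has started, demo_change_pte() becomes a no-op.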

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.
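
    The template idea can be sketched in plain C. This is only a sketch under
    stated assumptions: INTERVAL_DEFINE, struct demo_vma and the DEMO_*
    accessors are hypothetical names, and a linear array scan stands in for
    the real augmented tree walk. What it shows is how the preprocessor fills
    in the node type and the way interval endpoints are derived from a node.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Template: generates an overlap query for any node type.
 *   NODE  - the node struct type
 *   START - macro yielding the first index covered by a node
 *   LAST  - macro yielding the last index covered by a node
 *   NAME  - prefix for the generated functions
 * A real implementation would walk a tree; the array scan below
 * only keeps the sketch short.
 */
#define INTERVAL_DEFINE(NODE, START, LAST, NAME)                         \
static int NAME##_overlaps(const NODE *n, unsigned long start,           \
                           unsigned long last)                           \
{                                                                        \
    return START(n) <= last && start <= LAST(n);                         \
}                                                                        \
static const NODE *NAME##_first(const NODE *nodes, size_t count,         \
                                unsigned long start, unsigned long last) \
{                                                                        \
    assert(nodes != NULL || count == 0);                                 \
    for (size_t i = 0; i < count; i++)                                   \
        if (NAME##_overlaps(&nodes[i], start, last))                     \
            return &nodes[i];                                            \
    return NULL;                                                         \
}

/* Hypothetical stand-in for a VMA: the endpoints are derived, not stored. */
struct demo_vma {
    unsigned long pgoff;    /* first page offset within the file */
    unsigned long nr_pages; /* length of the mapping in pages */
};

#define DEMO_START(v) ((v)->pgoff)
#define DEMO_LAST(v)  ((v)->pgoff + (v)->nr_pages - 1)

INTERVAL_DEFINE(struct demo_vma, DEMO_START, DEMO_LAST, demo_vma_interval)
```

    A second node type would reuse the same template with its own accessor
    macros, which is the point of moving the common algorithm out of the
    type-specific code.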

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA.
    It has since lost its original meaning but still has some effects:

     | effect                 | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. Nobody seems
    to care about it: it is not exported to userspace directly, it only
    reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Oct, 2012

6 commits

  • Multiple threads can manipulate uprobe->flags, this is obviously
    unsafe. For example mmap can set UPROBE_COPY_INSN while register
    tries to set UPROBE_RUN_HANDLER, the latter can also race with
    can_skip_sstep() which clears UPROBE_SKIP_SSTEP.

    Change this code to use bitops.
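
    A userspace analogue of the bitops approach, as a hedged sketch: C11
    atomics stand in for the kernel's set_bit()/clear_bit()/test_bit(), and
    the demo_* helpers and the bit positions are assumptions chosen for
    illustration. Only the three flag names come from the text above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers (not masks); the positions are assumed for this sketch. */
enum { DEMO_RUN_HANDLER = 0, DEMO_COPY_INSN = 1, DEMO_SKIP_SSTEP = 2 };

struct demo_uprobe {
    atomic_ulong flags;
};

/* Atomically set one bit without disturbing concurrent updates. */
static void demo_set_bit(int nr, atomic_ulong *flags)
{
    assert(nr >= 0 && nr <= DEMO_SKIP_SSTEP);
    atomic_fetch_or(flags, 1UL << nr);
}

/* Atomically clear one bit. */
static void demo_clear_bit(int nr, atomic_ulong *flags)
{
    assert(nr >= 0 && nr <= DEMO_SKIP_SSTEP);
    atomic_fetch_and(flags, ~(1UL << nr));
}

/* Read one bit from a single atomic snapshot of the word. */
static bool demo_test_bit(int nr, atomic_ulong *flags)
{
    return (atomic_load(flags) >> nr) & 1UL;
}
```

    Because each update is a single atomic read-modify-write, a thread
    setting one flag cannot overwrite another thread's change to a different
    flag, which is what a plain "flags |= X" cannot guarantee.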

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • install_breakpoint() is called under mm->mmap_sem, this protects
    set_swbp() but not prepare_uprobe(). Two or more different tasks
    can call install_breakpoint()->prepare_uprobe() at the same time,
    this leads to numerous problems if UPROBE_COPY_INSN is not set.

    Just for example, the second copy_insn() can corrupt the already
    analyzed/fixed-up uprobe->arch.insn and race with handle_swbp().

    This patch simply adds uprobe->copy_mutex to serialize this code.
    We could probably reuse ->consumer_rwsem, but this would mean that
    consumer->handler() can not use mm->mmap_sem, not good.

    Note: this is another temporary ugly hack until we move this logic
    into uprobe_register().
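
    The serialization can be pictured with a pthread mutex. This is a sketch
    under assumptions: struct demo_probe, its fields and demo_prepare() are
    hypothetical, and a counter stands in for the copy-and-analyze step. Only
    the idea of a per-probe copy_mutex comes from the message above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical probe object, for illustration only. */
struct demo_probe {
    pthread_mutex_t copy_mutex; /* serializes the one-time preparation */
    bool insn_copied;           /* plays the role of the "copied" flag */
    int copies;                 /* how often the preparation really ran */
};

static int demo_prepare(struct demo_probe *p)
{
    assert(p != NULL);
    pthread_mutex_lock(&p->copy_mutex);
    if (!p->insn_copied) {
        p->copies++;            /* stands in for copy + analyze */
        p->insn_copied = true;
    }
    pthread_mutex_unlock(&p->copy_mutex);
    return 0;
}
```

    However many callers arrive at once, the preparation body runs a single
    time and later callers see the finished result.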

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Preparation. Extract the copy_insn/arch_uprobe_analyze_insn code
    from install_breakpoint() into the new helper, prepare_uprobe().

    And move uprobe->flags defines from uprobes.h to uprobes.c, nobody
    else can use them anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Strictly speaking this race was added by me in 56bb4cf6. However
    I think that this bug is just another indication that we should
    move copy_insn/uprobe_analyze_insn code from install_breakpoint()
    to uprobe_register(), there are a lot of other reasons for that.
    Until then, add a hack to close the race.

    A task can hit uprobe U1, but before it calls find_uprobe() this
    uprobe can be unregistered *AND* another uprobe U2 can be added to
    uprobes_tree at the same inode/offset. In this case handle_swbp()
    will use the not-fully-initialized U2, in particular its arch.insn
    for xol.

    Add the additional !UPROBE_COPY_INSN check into handle_swbp(),
    if this flag is not set we simply restart as if the new uprobe was
    not inserted yet. This is not very nice, we need barriers, but we
    will remove this hack when we change uprobe_register().

    Note: with or without this patch install_breakpoint() can race with
    itself, yet another reason to kill UPROBE_COPY_INSN altogether. And
    even the usage of uprobe->flags is not safe. See the next patches.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • delete_uprobe() must not be called if register_for_each_vma(false)
    fails to remove all breakpoints, __uprobe_unregister() is correct.
    The problem is that register_for_each_vma(false) always returns 0
    and thus this logic does not work.

    1. Change verify_opcode() to return 0 rather than -EINVAL when
    unregister detects the !is_swbp insn, we can treat this case
    as success and currently unregister paths ignore the error
    code anyway.

    2. Change remove_breakpoint() to propagate the error code from
    write_opcode().

    3. Change register_for_each_vma(is_register => false) to remove
    as many breakpoints as possible but return non-zero if
    remove_breakpoint() fails at least once.
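
    Points 2 and 3 amount to a "keep going, but remember the failure" loop.
    A minimal sketch, assuming a hypothetical demo_site type in place of the
    real per-mapping state and a simulated write failure:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one mapping that holds a breakpoint. */
struct demo_site {
    int installed;   /* 1 while the breakpoint is present */
    int write_fails; /* simulates a failing opcode write */
};

/* Propagate the error from the (simulated) write instead of hiding it. */
static int demo_remove_breakpoint(struct demo_site *s)
{
    if (s->write_fails)
        return -EFAULT;
    s->installed = 0;
    return 0;
}

/* Try every site; keep the first error but do not stop on it. */
static int demo_unregister_all(struct demo_site *sites, size_t count)
{
    int err = 0;

    assert(sites != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int ret = demo_remove_breakpoint(&sites[i]);

        if (ret && !err)
            err = ret;
    }
    return err;
}
```

    A caller can then skip the final teardown whenever the return value is
    non-zero, since some breakpoint may still be in place.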

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • If alloc_uprobe() fails, uprobe_register() should return -ENOMEM, not 0.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     

05 Oct, 2012

2 commits

  • Jiri reported that he could trigger the WARN_ON_ONCE() in
    perf_cgroup_switch() using sw-events. This is because sw-events share
    a cpuctx with multiple PMUs.

    Use the ->unique_pmu pointer to limit the pmu iteration to unique
    cpuctx instances.

    Reported-and-Tested-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-so7wi2zf3jjzrwcutm2mkz0j@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Stephane thought the perf_cpu_context::active_pmu name confusing and
    suggested using 'unique_pmu' instead.

    This pointer is a pointer to a 'random' pmu sharing the cpuctx
    instance, therefore limiting a for_each_pmu loop to those where
    cpuctx->unique_pmu matches the pmu we get a loop over unique cpuctx
    instances.
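
    The loop described above can be sketched as follows. The demo_* types
    are hypothetical simplifications: each PMU points at a context that may
    be shared, and each context names one representative PMU.

```c
#include <assert.h>
#include <stddef.h>

struct demo_pmu;

/* Hypothetical per-CPU context, possibly shared by several PMUs. */
struct demo_cpuctx {
    const struct demo_pmu *unique_pmu; /* one representative PMU */
    int visits;                        /* how often the loop touched it */
};

struct demo_pmu {
    struct demo_cpuctx *cpuctx;
};

/* Visit every distinct context exactly once; return how many there were. */
static size_t demo_visit_unique(struct demo_pmu *pmus, size_t count)
{
    size_t n = 0;

    assert(pmus != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        struct demo_cpuctx *ctx = pmus[i].cpuctx;

        if (ctx->unique_pmu != &pmus[i])
            continue; /* another PMU represents this context */
        ctx->visits++;
        n++;
    }
    return n;
}
```

    Two PMUs sharing one context therefore cause a single visit, which
    avoids handling the same context twice.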

    Suggested-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-kxyjqpfj2fn9gt7kwu5ag9ks@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Oct, 2012

2 commits

  • Pull vfs update from Al Viro:

    "- big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c splitoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     

02 Oct, 2012

1 commit

  • Pull perf update from Ingo Molnar:
    "Lots of changes in this cycle as well, with hundreds of commits from
    over 30 contributors. Most of the activity was on the tooling side.

    Higher level changes:

    - New 'perf kvm' analysis tool, from Xiao Guangrong.

    - New 'perf trace' system-wide tracing tool

    - uprobes fixes + cleanups from Oleg Nesterov.

    - Lots of patches to make perf build on Android out of box, from
    Irina Tirdea

    - Extend ftrace function tracing utility to be more dynamic for its
    users. It allows for data passing to the callback functions, as
    well as reading regs as if a breakpoint were to trigger at function
    entry.

    The main goal of this patch series was to allow kprobes to use
    ftrace as an optimized probe point when a probe is placed on an
    ftrace nop. With lots of help from Masami Hiramatsu, and going
    through lots of iterations, we finally came up with a good
    solution.

    - Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng.

    - Various tracing updates from Steve Rostedt

    - Clean up and improve 'perf sched' performance by eliminating lots
    of needless calls to libtraceevent.

    - Event group parsing support, from Jiri Olsa

    - UI/gtk refactorings and improvements from Namhyung Kim

    - Add support for non-tracepoint events in perf script python, from
    Feng Tang

    - Add --symbols to 'script', similar to the one in 'report', from
    Feng Tang.

    Infrastructure enhancements and fixes:

    - Convert the trace builtins to use the growing evsel/evlist
    tracepoint infrastructure, removing several open coded constructs
    like switch like series of strcmp to dispatch events, etc.
    Basically what had already been showcased in 'perf sched'.

    - Add evsel constructor for tracepoints, that uses libtraceevent just
    to parse the /format events file, use it in a new 'perf test' to
    make sure the libtraceevent format parsing regressions can be more
    readily caught.

    - Some strange errors were happening in some builds, but not on the
    next, reported by several people, problem was some parser related
    files, generated during the build, didn't have proper make deps, fix
    from Eric Sandeen.

    - Introduce struct and cache information about the environment where
    a perf.data file was captured, from Namhyung Kim.

    - Fix handling of unresolved samples when --symbols is used in
    'report', from Feng Tang.

    - Add union member access support to 'probe', from Hyeoncheol Lee.

    - Fixups to die() removal, from Namhyung Kim.

    - Render fixes for the TUI, from Namhyung Kim.

    - Don't enable annotation in non symbolic view, from Namhyung Kim.

    - Fix pipe mode in 'report', from Namhyung Kim.

    - Move related stats code from stat to util/, will be used by the
    'stat' kvm tool, from Xiao Guangrong.

    - Remove die()/exit() calls from several tools.

    - Resolve vdso callchains, from Jiri Olsa

    - Don't pass const char pointers to basename, so that we can
    unconditionally use libgen.h and thus avoid ifdef BIONIC lines,
    from David Ahern

    - Refactor hist formatting so that it can be reused with the GTK
    browser, from Namhyung Kim

    - Fix build for another rbtree.c change, from Adrian Hunter.

    - Make 'perf diff' command work with evsel hists, from Jiri Olsa.

    - Use the only field_sep var that is set up: symbol_conf.field_sep,
    fix from Jiri Olsa.

    - .gitignore compiled python binaries, from Namhyung Kim.

    - Get rid of die() in more libtraceevent places, from Namhyung Kim.

    - Rename libtraceevent 'private' struct member to 'priv' so that it
    works in C++, from Steven Rostedt

    - Remove lots of exit()/die() calls from tools so that the main perf
    exit routine can take place, from David Ahern

    - Fix x86 build on x86-64, from David Ahern.

    - {int,str,rb}list fixes from Suzuki K Poulose

    - perf.data header fixes from Namhyung Kim

    - Allow user to indicate objdump path, needed in cross environments,
    from Maciek Borzecki

    - Fix hardware cache event name generation, fix from Jiri Olsa

    - Add round trip test for sw, hw and cache event names, catching the
    problem Jiri fixed, after Jiri's patch, the test passes
    successfully.

    - Clean target should do clean for lib/traceevent too, fix from David
    Ahern

    - Check the right variable for allocation failure, fix from Namhyung
    Kim

    - Set up evsel->tp_format regardless of evsel->name being set
    already, fix from Namhyung Kim

    - Oprofile fixes from Robert Richter.

    - Remove perf_event_attr needless version inflation, from Jiri Olsa

    - Introduce libtraceevent strerror like error reporting facility,
    from Namhyung Kim

    - Add pmu mappings to perf.data header and use event names from cmd
    line, from Robert Richter

    - Fix include order for bison/flex-generated C files, from Ben
    Hutchings

    - Build fixes and documentation corrections from David Ahern

    - Assorted cleanups from Robert Richter

    - Let O= makes handle relative paths, from Steven Rostedt

    - perf script python fixes, from Feng Tang.

    - Initial bash completion support, from Frederic Weisbecker

    - Allow building without libelf, from Namhyung Kim.

    - Support DWARF CFI based unwind to have callchains when %bp based
    unwinding is not possible, from Jiri Olsa.

    - Symbol resolution fixes, while fixing support PPC64 files with an
    .opt ELF section was the end goal, several fixes for code that
    handles all architectures and cleanups are included, from Cody
    Schafer.

    - Assorted fixes for Documentation and build in 32 bit, from Robert
    Richter

    - Cache the libtraceevent event_format associated to each evsel
    early, so that we avoid relookups, i.e. calling pevent_find_event
    repeatedly when processing tracepoint events.

    [ This is to reduce the surface contact with libtraceevents and
    make clear what it is that the perf tools need from that lib: so
    far parsing the common and per event fields. ]

    - Don't stop the build if the audit libraries are not installed, fix
    from Namhyung Kim.

    - Fix bfd.h/libbfd detection with recent binutils, from Markus
    Trippelsdorf.

    - Improve warning message when libunwind devel packages not present,
    from Jiri Olsa"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
    perf trace: Add aliases for some syscalls
    perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables
    perf tools: Check libaudit availability for perf-trace builtin
    perf hists: Add missing period_* fields when collapsing a hist entry
    perf trace: New tool
    perf evsel: Export the event_format constructor
    perf evsel: Introduce rawptr() method
    perf tools: Use perf_evsel__newtp in the event parser
    perf evsel: The tracepoint constructor should store sys:name
    perf evlist: Introduce set_filter() method
    perf evlist: Renane set_filters method to apply_filters
    perf test: Add test to check we correctly parse and match syscall open parms
    perf evsel: Handle endianity in intval method
    perf evsel: Know if byte swap is needed
    perf tools: Allow handling a NULL cpu_map as meaning "all cpus"
    perf evsel: Improve tracepoint constructor setup
    tools lib traceevent: Fix error path on pevent_parse_event
    perf test: Fix build failure
    trace: Move trace event enable from fs_initcall to core_initcall
    tracing: Add an option for disabling markers
    ...

    Linus Torvalds
     

30 Sep, 2012

12 commits

  • After the previous change is_swbp_at_addr() is always called with
    current->mm. Remove this check and move it close to its single caller.

    Also, remove the obsolete comment about is_swbp_at_addr() and
    uprobe_state.count.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Unlike set_swbp(), set_orig_insn()->is_swbp_at_addr() makes sense,
    although it can't prevent all confusions.

    But the usage of is_swbp_at_addr() is equally confusing, and it adds
    the extra get_user_pages() we can avoid.

    This patch removes set_orig_insn()->is_swbp_at_addr() but changes
    write_opcode() to do the necessary checks before replace_page().

    Perhaps it also makes sense to ensure PAGE_MAPPING_ANON in unregister
    case.

    find_active_uprobe() becomes the only user of is_swbp_at_addr(),
    we can change its semantics.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • No functional changes, preparations.

    1. Extract the kmap-and-memcpy code from read_opcode() into the
    new trivial helper, copy_opcode(). The next patch will add
    another user.

    2. read_opcode() becomes really trivial, fold it into its single
    caller, is_swbp_at_addr().

    3. Remove "auprobe" argument from write_opcode(), it is not used
    since f403072c6.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • A separate patch for better documentation.

    set_swbp()->is_swbp_at_addr() is not needed for correctness, it is
    harmless to do the unnecessary __replace_page(old_page, new_page)
    when these 2 pages are identical.

    And it can not be counted as optimization. mmap/register races are
    very unlikely, while in the likely case is_swbp_at_addr() adds the
    extra get_user_pages() even if the caller is uprobe_mmap(current->mm)
    and returns false.

    Note also that the semantics/usage of is_swbp_at_addr() in uprobe.c
    is confusing. set_swbp() uses it to detect the case when this insn
    was already modified by uprobes, that is why it should always compare
    the opcode with UPROBE_SWBP_INSN even if the hardware (like powerpc)
    has other trap insns. It doesn't matter if this breakpoint was in fact
    installed by gdb or application itself, we are going to "steal" this
    breakpoint anyway and execute the original insn from vm_file even if
    it no longer matches the memory.

    OTOH, handle_swbp()->find_active_uprobe() uses is_swbp_at_addr() to
    figure out whether we need to send SIGTRAP or not if we can not find
    uprobe, so in this case it should return true for all trap variants,
    not only for UPROBE_SWBP_INSN.

    This patch removes set_swbp()->is_swbp_at_addr(), the next patches
    will remove it from set_orig_insn() which is similar to set_swbp()
    in this respect. So the only caller will be handle_swbp() and we
    can make its semantics clear.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • valid_vma(false) ignores ->vm_flags, this is not actually right.
    We should never try to write into MAP_SHARED mapping, this can
    confuse an application which actually writes to ->vm_file.

    With this patch valid_vma(false) ignores VM_WRITE only but checks
    other (immutable) bits checked by valid_vma(true). This can also
    speedup uprobe_munmap() and uprobe_unregister().

    Note: even after this patch _unregister can confuse the probed
    application if it does mprotect(PROT_WRITE) after _register and
    installs "int3", but this is hardly possible to avoid and this
    doesn't differ from gdb case.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • uprobe_register() or uprobe_mmap() requires VM_READ | VM_EXEC, this
    is not right. An application can do mprotect(PROT_EXEC) later and
    execute this code.

    Change valid_vma(is_register => true) to check VM_MAYEXEC instead.
    No need to check VM_MAYREAD, it is always set.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • write_opcode()->get_user_pages() needs FOLL_FORCE to ensure we can
    read the page even if the probed task did mprotect(PROT_NONE) after
    uprobe_register(). Without FOLL_WRITE, FOLL_FORCE doesn't have any
    side effect but allows to read the !VM_READ memory.

    Otherwise the subsequent uprobe_unregister()->set_orig_insn() fails
    and we leak "int3". If that task does mprotect(PROT_READ | EXEC) and
    execute the probed insn later it will be killed.

    Note: in fact this is also needed for _register, see the next patch.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Move clear_thread_flag(TIF_UPROBE) from do_notify_resume() to
    uprobe_notify_resume() for !CONFIG_UPROBES case.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Kill UTASK_BP_HIT state, it buys nothing but complicates the code.
    It is only used in uprobe_notify_resume() to decide who should be
    called, we can check utask->active_uprobe != NULL instead. And this
    allows us to simplify handle_swbp(), no need to clear utask->state.

    Likewise we could kill UTASK_SSTEP, but UTASK_BP_HIT is worse and
    imho should die. The problem is, it creates the special case when
    task->utask is NULL, we can't distinguish RUNNING and BP_HIT. With
    this patch utask == NULL always means RUNNING.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • If handle_swbp()->add_utask() fails but UPROBE_SKIP_SSTEP is set,
    the cleanup_ret: path does not restart the insn, this is wrong. Remove
    this check and add the additional label for can_skip_sstep() = T
    case.

    Note also that UPROBE_SKIP_SSTEP can be false positive, we simply
    can not trust it unless arch_uprobe_skip_sstep() was already called.

    Also, move another UPROBE_SKIP_SSTEP check before can_skip_sstep()
    into this helper, this looks more clean and understandable.

    Note: probably we should rename "skip" to "emulate" and I think
    that "clear UPROBE_SKIP_SSTEP" should be moved to arch_can_skip.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • handle_swbp() sets utask->active_uprobe before handler_chain(),
    and UTASK_SSTEP before pre_ssout(). This complicates the code
    for no reason, arch_ hooks or consumer->handler() should not
    (and can't) use this info.

    Change handle_swbp() to initialize them after pre_ssout(), and
    remove the no longer needed cleanup-utask code.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • If handle_swbp()->find_active_uprobe() fails we return with
    utask->state = UTASK_BP_HIT.

    Change handle_swbp() to reset utask->state at the start. Note
    that we do this unconditionally, see the next patch(es).

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     

27 Sep, 2012

2 commits


15 Sep, 2012

6 commits

  • As Oleg pointed out in [0] uprobe should not use the ptrace interface
    for enabling/disabling single stepping.

    [0] http://lkml.kernel.org/r/20120730141638.GA5306@redhat.com

    Add the new "__weak arch" helpers which simply call user_*_single_step()
    as a preparation. This is only needed to not break the powerpc port, we
    will fold this logic into arch_uprobe_pre/post_xol() hooks later.

    We should also change handle_singlestep(), _disable_step(&uprobe->arch)
    should be called before put_uprobe().

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Sebastian Andrzej Siewior
     
  • The wrong MMF_HAS_UPROBES doesn't really hurt, just it triggers
    the "slow" and unnecessary handle_swbp() path if the task hits
    the non-uprobe breakpoint.

    So this patch changes find_active_uprobe() to check every valid
    vma and clear MMF_HAS_UPROBES if no uprobes were found. This
    adds the slow O(n) path, but it is only called in unlikely case
    when the task hits the normal breakpoint first time after
    uprobe_unregister().

    Note the "not strictly accurate" comment in mmf_recalc_uprobes().
    We can fix this, we only need to teach vma_has_uprobes() to return
    a bit more info, but I am not sure this is worth the trouble.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Add the new MMF_RECALC_UPROBES flag, it means that MMF_HAS_UPROBES
    can be false positive after remove_breakpoint() or uprobe_munmap().
    It is also set by uprobe_dup_mmap(), this is not optimal but simple.
    We could add the new hook, uprobe_dup_vma(), to set MMF_HAS_UPROBES
    only if the new mm actually has uprobes, but I don't think this
    makes sense.

    The next patch will use this flag to clear MMF_HAS_UPROBES.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • Nobody plays with uprobes_tree/uprobes_treelock in interrupt context,
    no need to disable irqs.

    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Oleg Nesterov
     
  • alloc_uprobe() might return a NULL pointer, put_uprobe() can't deal with
    this.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju

    Sebastian Andrzej Siewior
     
  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.
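
    The warn-once behaviour might look like the sketch below. Only the
    broken_hierarchy field name comes from the message above; the "warned"
    field, the demo_* types, the helper and the message text are assumptions
    made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified subsystem and cgroup objects. */
struct demo_subsys {
    const char *name;
    bool broken_hierarchy; /* set for controllers lacking hierarchy support */
    bool warned;           /* assumed helper state: warning already shown */
};

struct demo_cgroup {
    struct demo_cgroup *parent; /* NULL for the root cgroup */
};

/*
 * Called when a cgroup is created below 'parent'. Returns true if a
 * warning was emitted. Children of the root are not considered nested.
 */
static bool demo_check_nesting(struct demo_subsys *ss,
                               const struct demo_cgroup *parent)
{
    assert(ss != NULL && parent != NULL);
    if (!ss->broken_hierarchy || ss->warned || parent->parent == NULL)
        return false;

    fprintf(stderr, "cgroup: %s does not handle nested cgroups properly\n",
            ss->name);
    ss->warned = true;
    return true;
}
```

    Creating a cgroup directly under the root stays silent; the first truly
    nested creation warns, and later ones do not repeat the message.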

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

04 Sep, 2012

2 commits

  • While debugging a warning message on PowerPC while using hardware
    breakpoints, it was discovered that when perf_event_disable is invoked
    through hw_breakpoint_handler function with interrupts disabled, a
    subsequent IPI in the code path would trigger a WARN_ON_ONCE message in
    smp_call_function_single function.

    This patch calls __perf_event_disable() when interrupts are already
    disabled, instead of perf_event_disable().

    Reported-by: Edjunior Barbosa Machado
    Signed-off-by: K.Prasad
    [naveen.n.rao@linux.vnet.ibm.com: v3: Check to make sure we target current task]
    Signed-off-by: Naveen N. Rao
    Acked-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120802081635.5811.17737.stgit@localhost.localdomain
    [ Fixed build error on MIPS. ]
    Signed-off-by: Ingo Molnar

    K.Prasad
     
  • Don't mess with file refcounts (or keep a reference to file, for
    that matter) in perf_event. Use explicit refcount of its own
    instead. Deal with the race between the final reference to event
    going away and new children getting created for it by use of
    atomic_long_inc_not_zero() in inherit_event(); just have the
    latter free what it had allocated and return NULL, that works
    out just fine (children of siblings of something doomed are
    created as singletons, same as if the child of leader had been
    created and immediately killed).
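
    The increment-unless-zero step can be illustrated in userspace with C11
    atomics. This is a sketch, not the kernel's atomic_long_inc_not_zero():
    demo_inc_not_zero() is a hypothetical helper built on a compare-exchange
    loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still alive. Returns false
 * when the count has already dropped to zero, so the caller must
 * give up instead of resurrecting a dying object.
 */
static bool demo_inc_not_zero(atomic_long *refcount)
{
    long old = atomic_load(refcount);

    assert(old >= 0);
    while (old != 0) {
        if (atomic_compare_exchange_weak(refcount, &old, old + 1))
            return true;
        /* The failed exchange reloaded 'old'; retry with the new value. */
    }
    return false;
}
```

    A caller that gets false back frees whatever it had allocated and
    returns without a child, which matches the behaviour described above.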

    Signed-off-by: Al Viro
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120820135925.GG23464@ZenIV.linux.org.uk
    Signed-off-by: Ingo Molnar

    Al Viro
     

29 Aug, 2012

2 commits