02 Jun, 2012

1 commit

  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     

01 Jun, 2012

7 commits


31 May, 2012

1 commit


30 May, 2012

1 commit

  • The "if (mm)" check is not required in find_vma, as the kernel code
    calls find_vma only when it is absolutely sure that the mm_struct arg to
    it is non-NULL.

    Remove the if(mm) check and adding the a WARN_ONCE(!mm) for now. This
    will serve the purpose of mandating that the execution
    context(user-mode/kernel-mode) be known before find_vma is called. Also
    fixed 2 checkpatch.pl errors in the declaration of the rb_node and
    vma_tmp local variables.

    I was browsing through the internet and read a discussion at
    https://lkml.org/lkml/2012/3/27/342 which discusses removal of the
    validation check within find_vma. Since no-one responded, I decided to
    send this patch with Andrew's suggestions.

    [akpm@linux-foundation.org: add remove-me comment]
    Signed-off-by: Rajman Mekaco
    Cc: Kautuk Consul
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rajman Mekaco
     

25 May, 2012

1 commit

  • Pull user-space probe instrumentation from Ingo Molnar:
    "The uprobes code originates from SystemTap and has been used for years
    in Fedora and RHEL kernels. This version is much rewritten, reviews
    from PeterZ, Oleg and myself shaped the end result.

    This tree includes uprobes support in 'perf probe' - but SystemTap
    (and other tools) can take advantage of user probe points as well.

    Sample usage of uprobes via perf, for example to profile malloc()
    calls without modifying user-space binaries.

    First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

    If you don't know which function you want to probe you can pick one
    from 'perf top' or can get a list all functions that can be probed
    within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

    To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

    You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

    Make use of it to create a call graph (as the flat profile is going to
    look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712

    $ perf report | less

    32.03% git libc-2.15.so [.] malloc
    |
    --- malloc

    29.49% cc1 libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--0.95%-- 0x208eb1000000000
    |
    |--0.63%-- htab_traverse_noresize

    11.04% as libc-2.15.so [.] malloc
    |
    --- malloc
    |

    7.15% ld libc-2.15.so [.] malloc
    |
    --- malloc
    |

    5.07% sh libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.99% python-config libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.54% make libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--7.34%-- glob
    | |
    | |--93.18%-- 0x41588f
    | |
    | --6.82%-- glob
    | 0x41588f

    ...

    Or:

    $ perf report -g flat | less

    # Overhead Command Shared Object Symbol
    # ........ ............. ............. ..........
    #
    32.03% git libc-2.15.so [.] malloc
    27.19%
    malloc

    29.49% cc1 libc-2.15.so [.] malloc
    24.77%
    malloc

    11.04% as libc-2.15.so [.] malloc
    11.02%
    malloc

    7.15% ld libc-2.15.so [.] malloc
    6.57%
    malloc

    ...

    The core uprobes design is fairly straightforward: uprobes probe
    points register themselves at (inode:offset) addresses of
    libraries/binaries, after which all existing (or new) vmas that map
    that address will have a software breakpoint injected at that address.
    vmas are COW-ed to preserve original content. The probe points are
    kept in an rbtree.

    If user-space executes the probed inode:offset instruction address
    then an event is generated which can be recovered from the regular
    perf event channels and mmap-ed ring-buffer.

    Multiple probes at the same address are supported, they create a
    dynamic callback list of event consumers.

    The basic model is further complicated by the XOL speedup: the
    original instruction that is probed is copied (in an architecture
    specific fashion) and executed out of line when the probe triggers.
    The XOL area is a single vma per process, with a fixed number of
    entries (which limits probe execution parallelism).

    The API: uprobes are installed/removed via
    /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
    align with the kprobes interface as much as possible, but is separate
    to it.

    Injecting a probe point is privileged operation, which can be relaxed
    by setting perf_paranoid to -1.

    You can use multiple probes as well and mix them with kprobes and
    regular PMU events or tracepoints, when instrumenting a task."

    Fix up trivial conflicts in mm/memory.c due to previous cleanup of
    unmap_single_vma().

    * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf probe: Detect probe target when m/x options are absent
    perf probe: Provide perf interface for uprobes
    tracing: Fix kconfig warning due to a typo
    tracing: Provide trace events interface for uprobes
    tracing: Extract out common code for kprobes/uprobes trace events
    tracing: Modify is_delete, is_return from int to bool
    uprobes/core: Decrement uprobe count before the pages are unmapped
    uprobes/core: Make background page replacement logic account for rss_stat counters
    uprobes/core: Optimize probe hits with the help of a counter
    uprobes/core: Allocate XOL slots for uprobes use
    uprobes/core: Handle breakpoint and singlestep exceptions
    uprobes/core: Rename bkpt to swbp
    uprobes/core: Make order of function parameters consistent across functions
    uprobes/core: Make macro names consistent
    uprobes: Update copyright notices
    uprobes/core: Move insn to arch specific structure
    uprobes/core: Remove uprobe_opcode_sz
    uprobes/core: Make instruction tables volatile
    uprobes: Move to kernel/events/
    uprobes/core: Clean up, refactor and improve the code
    ...

    Linus Torvalds
     

14 May, 2012

1 commit


07 May, 2012

2 commits

  • The VM accounting makes no sense at this level, and half of the callers
    didn't ever actually use the end result. The only time we want to
    unaccount the memory is when we actually remove the vma, so do the
    accounting at that point instead.

    This simplifies the interfaces (no need to pass down that silly page
    counter to functions that really don't care), and also makes it much
    more obvious what is actually going on: we do vm_[un]acct_memory() when
    adding or removing the vma, not on random page walking.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • None of the callers want to pass in 'zap_details', and it doesn't even
    make sense for the case of actually unmapping vma's. So remove the
    argument, and clean up the interface.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Apr, 2012

4 commits

  • it's always current->mm

    Signed-off-by: Al Viro

    Al Viro
     
  • This continues the theme started with vm_brk() and vm_munmap():
    vm_mmap() does the same thing as do_mmap(), but additionally does the
    required VM locking.

    This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
    duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have
    to export our internal do_mmap_pgoff() function.

    Some day we hopefully don't have to export do_mmap() either, if all
    modular users can become the simpler vm_mmap() instead. We're actually
    very close to that already, with the notable exception of the (broken)
    use in i810, and a couple of stragglers in binfmt_elf.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Like the vm_brk() function, this is the same as "do_munmap()", except it
    does the VM locking for the caller.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It does the same thing as "do_brk()", except it handles the VM locking
    too.

    It turns out that all external callers want that anyway, so we can make
    do_brk() static to just mm/mmap.c while at it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Apr, 2012

2 commits

  • Uprobes has a callback (uprobe_munmap()) in the unmap path to
    maintain the uprobes count.

    In the exit path this callback gets called in unlink_file_vma().
    However by the time unlink_file_vma() is called, the pages would
    have been unmapped (in unmap_vmas()) and the task->rss_stat counts
    accounted (in zap_pte_range()).

    If the exiting process has probepoints, uprobe_munmap() checks if
    the breakpoint instruction was around before decrementing the probe
    count.

    This results in a file backed page being reread by uprobe_munmap()
    and hence it does not find the breakpoint.

    This patch fixes this problem by moving the callback to
    unmap_single_vma(). Since unmap_single_vma() may not unmap the
    complete vma, add start and end parameters to uprobe_munmap().

    This bug became apparent courtesy of commit c3f0327f8e9d
    ("mm: add rss counters consistency check").

    Signed-off-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Ananth N Mavinakayanahalli
    Cc: Jim Keniston
    Cc: Linux-mm
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Anton Arapov
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120411103527.23245.9835.sendpatchset@srdronam.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • Merge in latest upstream (and the latest perf development tree),
    to prepare for tooling changes, and also to pick up v3.4 MM
    changes that the uprobes code needs to take care of.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

31 Mar, 2012

1 commit

  • Maintain a per-mm counter: number of uprobes that are inserted
    on this process address space.

    This counter can be used at probe hit time to determine if we
    need a lookup in the uprobes rbtree. Everytime a probe gets
    inserted successfully, the probe count is incremented and
    everytime a probe gets removed, the probe count is decremented.

    The new uprobe_munmap hook ensures the count is correct on a
    unmap or remap of a region. We expect that once a
    uprobe_munmap() is called, the vma goes away. So
    uprobe_unregister() finding a probe to unregister would either
    mean unmap event hasnt occurred yet or a mmap event on the same
    executable file occured after a unmap event.

    Additionally, uprobe_mmap hook now also gets called:

    a. on every executable vma that is COWed at fork.
    b. a vma of interest is newly mapped; breakpoint insertion also
    happens at the required address.

    On process creation, make sure the probes count in the child is
    set correctly.

    Special cases that are taken care include:

    a. mremap
    b. VM_DONTCOPY vmas on fork()
    c. insertion/removal races in the parent during fork().

    Signed-off-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Ananth N Mavinakayanahalli
    Cc: Jim Keniston
    Cc: Linux-mm
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Anton Arapov
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120330182646.10018.85805.sendpatchset@srdronam.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton : (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

6 commits

  • The comment above __insert_vm_struct seems to suggest that this function
    is also going to link the VMA with the anon_vma, but this is not true.
    This function only links the VMA to the mm->mm_rb tree and the mm->mmap
    linked list.

    [akpm@linux-foundation.org: improve comment layout and text]
    Signed-off-by: Kautuk Consul
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
    PAGE_SIZE, but this is not sufficient.

    Modify hugetlb_file_setup() to align requests to the huge page size, and
    to accept an address argument so that all alignment checks can be
    performed in hugetlb_file_setup(), rather than in its callers. Change
    newseg() and mmap_pgoff() to match the new prototype and eliminate a now
    redundant alignment check.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Steven Truelove
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Truelove
     
  • If the required size is bigger than cached_hole_size it is better to
    search from free_area_cache - it is easier to get a free region,
    specifically for the 64 bit process whose address space is large enough

    Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c

    Signed-off-by: Xiao Guangrong
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • In the current code, cached_hole_size is set to the maximum value if the
    unmapped vma is less that free_area_cache so the next search will search
    from the base address.

    Actually, we can keep cached_hole_size so that if the next required size
    is more than cached_hole_size, it can search from free_area_cache.

    Signed-off-by: Xiao Guangrong
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Pull munmap/truncate race fixes from Al Viro:
    "Fixes for racy use of unmap_vmas() on truncate-related codepaths"

    * 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VM: make zap_page_range() callers that act on a single VMA use separate helper
    VM: make unmap_vmas() return void
    VM: don't bother with feeding upper limit to tlb_finish_mmu() in exit_mmap()
    VM: make zap_page_range() return void
    VM: can't go through the inner loop in unmap_vmas() more than once...
    VM: unmap_page_range() can return void

    Linus Torvalds
     
  • Pull security subsystem updates for 3.4 from James Morris:
    "The main addition here is the new Yama security module from Kees Cook,
    which was discussed at the Linux Security Summit last year. Its
    purpose is to collect miscellaneous DAC security enhancements in one
    place. This also marks a departure in policy for LSM modules, which
    were previously limited to being standalone access control systems.
    Chromium OS is using Yama, and I believe there are plans for Ubuntu,
    at least.

    This patchset also includes maintenance updates for AppArmor, TOMOYO
    and others."

    Fix trivial conflict in due to the jumo_label->static_key
    rename.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
    AppArmor: Fix location of const qualifier on generated string tables
    TOMOYO: Return error if fails to delete a domain
    AppArmor: add const qualifiers to string arrays
    AppArmor: Add ability to load extended policy
    TOMOYO: Return appropriate value to poll().
    AppArmor: Move path failure information into aa_get_name and rename
    AppArmor: Update dfa matching routines.
    AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
    AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
    AppArmor: Add const qualifiers to generated string tables
    AppArmor: Fix oops in policy unpack auditing
    AppArmor: Fix error returned when a path lookup is disconnected
    KEYS: testing wrong bit for KEY_FLAG_REVOKED
    TOMOYO: Fix mount flags checking order.
    security: fix ima kconfig warning
    AppArmor: Fix the error case for chroot relative path name lookup
    AppArmor: fix mapping of META_READ to audit and quiet flags
    AppArmor: Fix underflow in xindex calculation
    AppArmor: Fix dropping of allowed operations that are force audited
    AppArmor: Add mising end of structure test to caps unpacking
    ...

    Linus Torvalds
     

21 Mar, 2012

2 commits


07 Mar, 2012

2 commits

  • Commit 6bd4837de96e ("mm: simplify find_vma_prev()") broke memory
    management on PA-RISC.

    After application of the patch, programs that allocate big arrays on the
    stack crash with segfault, for example, this will crash if compiled
    without optimization:

    int main()
    {
    char array[200000];
    array[199999] = 0;
    return 0;
    }

    The reason is that PA-RISC has up-growing stack and the stack is usually
    the last memory area. In the above example, a page fault happens above
    the stack.

    Previously, if we passed too high address to find_vma_prev, it returned
    NULL and stored the last VMA in *pprev. After "simplify find_vma_prev"
    change, it stores NULL in *pprev. Consequently, the stack area is not
    found and it is not expanded, as it used to be before the change.

    This patch restores the old behavior and makes it return the last VMA in
    *pprev if the requested address is higher than address of any other VMA.

    Signed-off-by: Mikulas Patocka
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • Currently error is -ENOMEM when rejecting VM_GROWSDOWN|VM_GROWSUP
    from shared anonymous: hoist the file case's -EINVAL up for both.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Mar, 2012

1 commit


17 Feb, 2012

2 commits

  • Make the uprobes code readable to me:

    - improve the Kconfig text so that a mere mortal gets some idea
    what CONFIG_UPROBES=y is really about

    - do trivial renames to standardize around the uprobes_*() namespace

    - clean up and simplify various code flow details

    - separate basic blocks of functionality

    - line break artifact and white space related removal

    - use standard local varible definition blocks

    - use vertical spacing to make things more readable

    - remove unnecessary volatile

    - restructure comment blocks to make them more uniform and
    more readable in general

    Cc: Srikar Dronamraju
    Cc: Jim Keniston
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Masami Hiramatsu
    Cc: Arnaldo Carvalho de Melo
    Cc: Anton Arapov
    Cc: Ananth N Mavinakayanahalli
    Link: http://lkml.kernel.org/n/tip-ewbwhb8o6navvllsauu7k07p@git.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Add uprobes support to the core kernel, with x86 support.

    This commit adds the kernel facilities, the actual uprobes
    user-space ABI and perf probe support comes in later commits.

    General design:

    Uprobes are maintained in an rb-tree indexed by inode and offset
    (the offset here is from the start of the mapping). For a unique
    (inode, offset) tuple, there can be at most one uprobe in the
    rb-tree.

    Since the (inode, offset) tuple identifies a unique uprobe, more
    than one user may be interested in the same uprobe. This provides
    the ability to connect multiple 'consumers' to the same uprobe.

    Each consumer defines a handler and a filter (optional). The
    'handler' is run every time the uprobe is hit, if it matches the
    'filter' criteria.

    The first consumer of a uprobe causes the breakpoint to be
    inserted at the specified address and subsequent consumers are
    appended to this list. On subsequent probes, the consumer gets
    appended to the existing list of consumers. The breakpoint is
    removed when the last consumer unregisters. For all other
    unregisterations, the consumer is removed from the list of
    consumers.

    Given a inode, we get a list of the mms that have mapped the
    inode. Do the actual registration if mm maps the page where a
    probe needs to be inserted/removed.

    We use a temporary list to walk through the vmas that map the
    inode.

    - The number of maps that map the inode, is not known before we
    walk the rmap and keeps changing.
    - extending vm_area_struct wasn't recommended, it's a
    size-critical data structure.
    - There can be more than one maps of the inode in the same mm.

    We add callbacks to the mmap methods to keep an eye on text vmas
    that are of interest to uprobes. When a vma of interest is mapped,
    we insert the breakpoint at the right address.

    Uprobe works by replacing the instruction at the address defined
    by (inode, offset) with the arch specific breakpoint
    instruction. We save a copy of the original instruction at the
    uprobed address.

    This is needed for:

    a. executing the instruction out-of-line (xol).
    b. instruction analysis for any subsequent fixups.
    c. restoring the instruction back when the uprobe is unregistered.

    We insert or delete a breakpoint instruction, and this
    breakpoint instruction is assumed to be the smallest instruction
    available on the platform. For fixed size instruction platforms
    this is trivially true, for variable size instruction platforms
    the breakpoint instruction is typically the smallest (often a
    single byte).

    Writing the instruction is done by COWing the page and changing
    the instruction during the copy, this even though most platforms
    allow atomic writes of the breakpoint instruction. This also
    mirrors the behaviour of a ptrace() memory write to a PRIVATE
    file map.

    The core worker is derived from KSM's replace_page() logic.

    In essence, similar to KSM:

    a. allocate a new page and copy over contents of the page that
    has the uprobed vaddr
    b. modify the copy and insert the breakpoint at the required
    address
    c. switch the original page with the copy containing the
    breakpoint
    d. flush page tables.

    replace_page() is being replicated here because of some minor
    changes in the type of pages and also because Hugh Dickins had
    plans to improve replace_page() for KSM specific work.

    Instruction analysis on x86 is based on instruction decoder and
    determines if an instruction can be probed and determines the
    necessary fixups after singlestep. Instruction analysis is done
    at probe insertion time so that we avoid having to repeat the
    same analysis every time a probe is hit.

    A lot of code here is due to the improvement/suggestions/inputs
    from Peter Zijlstra.

    Changelog:

    (v10):
    - Add code to clear REX.B prefix as suggested by Denys Vlasenko
    and Masami Hiramatsu.

    (v9):
    - Use insn_offset_modrm as suggested by Masami Hiramatsu.

    (v7):

    Handle comments from Peter Zijlstra:

    - Dont take reference to inode. (expect inode to uprobe_register to be sane).
    - Use PTR_ERR to set the return value.
    - No need to take reference to inode.
    - use PTR_ERR to return error value.
    - register and uprobe_unregister share code.

    (v5):

    - Modified del_consumer as per comments from Peter.
    - Drop reference to inode before dropping reference to uprobe.
    - Use i_size_read(inode) instead of inode->i_size.
    - Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
    - Includes errno.h as recommended by Stephen Rothwell to fix a build issue
    on sparc defconfig
    - Remove restrictions while unregistering.
    - Earlier code leaked inode references under some conditions while
    registering/unregistering.
    - Continue the vma-rmap walk even if the intermediate vma doesnt
    meet the requirements.
    - Validate the vma found by find_vma before inserting/removing the
    breakpoint
    - Call del_consumer under mutex_lock.
    - Use hash locks.
    - Handle mremap.
    - Introduce find_least_offset_node() instead of close match logic in
    find_uprobe
    - Uprobes no more depends on MM_OWNER; No reference to task_structs
    while inserting/removing a probe.
    - Uses read_mapping_page instead of grab_cache_page so that the pages
    have valid content.
    - pass NULL to get_user_pages for the task parameter.
    - call SetPageUptodate on the new page allocated in write_opcode.
    - fix leaking a reference to the new page under certain conditions.
    - Include Instruction Decoder if Uprobes gets defined.
    - Remove const attributes for instruction prefix arrays.
    - Uses mm_context to know if the application is 32 bit.

    Signed-off-by: Srikar Dronamraju
    Also-written-by: Jim Keniston
    Reviewed-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Steven Rostedt
    Cc: Roland McGrath
    Cc: Masami Hiramatsu
    Cc: Arnaldo Carvalho de Melo
    Cc: Anton Arapov
    Cc: Ananth N Mavinakayanahalli
    Cc: Stephen Rothwell
    Cc: Denys Vlasenko
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Linux-mm
    Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
    [ Made various small edits to the commit log ]
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

14 Feb, 2012

2 commits


11 Jan, 2012

2 commits

  • commit 297c5eee37 ("mm: make the vma list be doubly linked") added the
    vm_prev member to vm_area_struct. We can simplify find_vma_prev() by
    using it. Also, this change helps to improve page fault performance
    because it has stronger locality of reference.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Shaohua Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some pte.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are a lots of parents or childs
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy practically for the whole
    duration of mremap.

    Update: Hugh noticed special care is needed in the error path where
    move_page_tables goes in the reverse direction, a second
    anon_vma_moveto_tail() call is needed in the error path.

    This program exercises the anon_vma_moveto_tail:

    ===

    int main()
    {
    static struct timeval oldstamp, newstamp;
    long diffsec;
    char *p, *p2, *p3, *p4;
    if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
    perror("memalign"), exit(1);
    if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
    perror("memalign"), exit(1);
    if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
    perror("memalign"), exit(1);

    memset(p, 0xff, SIZE);
    printf("%p\n", p);
    memset(p2, 0xff, SIZE);
    memset(p3, 0x77, 4096);
    if (memcmp(p, p2, SIZE))
    printf("error\n");
    p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
    if (p4 != p3)
    perror("mremap"), exit(1);
    p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
    if (p4 != p+SIZE/2)
    perror("mremap"), exit(1);
    if (memcmp(p, p2, SIZE))
    printf("error\n");
    printf("ok\n");

    return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds