26 Oct, 2012

1 commit

  • On s390, any write to a page (even from the kernel itself) sets the
    architecture-specific page dirty bit. Thus when a page is written to via a
    buffered write, the HW dirty bit gets set, and when we later map and unmap
    the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().

    Dirtying a page which shouldn't be dirty can cause all sorts of
    problems for filesystems. The bug we observed in practice is that
    buffers from the page get freed, so when the page later gets marked
    dirty and writeback writes it, XFS crashes on the assertion
    BUG_ON(!PagePrivate(page)) in page_buffers(), called from
    xfs_count_page_state().

    A similar problem can also happen when the zero_user_segment() call
    from xfs_vm_writepage() (or block_write_full_page(), for that matter)
    sets the hardware dirty bit during writeback; later the buffers get
    freed and the page gets unmapped.

    Fix the issue by ignoring the s390 HW dirty bit for page cache pages of
    mappings with mapping_cap_account_dirty(). This is safe because for
    such mappings, when a page gets marked writeable in the PTE it is also
    marked dirty in do_wp_page() or do_page_fault(), and when the dirty bit
    is cleared by clear_page_dirty_for_io(), the page gets write-protected
    in page_mkclean(). So a pagecache page is writeable if and only if it
    is dirty.
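
    The shape of the resulting check in page_remove_rmap() is roughly the
    following (a minimal sketch of the idea, not the exact upstream hunk;
    "anon" stands for a PageAnon(page) test computed earlier):

    /*
     * Sketch: only transfer the s390 HW dirty bit to the struct page for
     * anon/swapcache pages or for mappings that do not account dirty
     * pages; for accounted mappings, writeable <=> dirty already holds.
     */
    if ((!anon || PageSwapCache(page)) &&
        page_test_and_clear_dirty(page_to_pfn(page), 1) &&
        (!page->mapping || !mapping_cap_account_dirty(page->mapping)))
            set_page_dirty(page);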

    Thanks to Hugh Dickins for pointing out mapping has to have
    mapping_cap_account_dirty() for things to work and proposing a cleaned
    up variant of the patch.

    The patch has survived about two hours of running fsx-linux on tmpfs
    while heavily swapping, and several days of running on our build
    machines where the original problem was triggered.

    Signed-off-by: Jan Kara
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Heiko Carstens
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

09 Oct, 2012

7 commits

  • In order to allow sleeping during mmu notifier calls, we need to avoid
    invoking them under the page table spinlock. This patch solves the
    problem by calling invalidate_page notification after releasing the lock
    (but before freeing the page itself), or by wrapping the page invalidation
    with calls to invalidate_range_begin and invalidate_range_end.

    To prevent accidental changes to the invalidate_range_end arguments after
    the call to invalidate_range_begin, the patch introduces a convention of
    saving the arguments in consistently named locals:

    unsigned long mmun_start; /* For mmu_notifiers */
    unsigned long mmun_end; /* For mmu_notifiers */

    ...

    mmun_start = ...
    mmun_end = ...
    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

    ...

    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

    The patch changes code to use this convention for all calls to
    mmu_notifier_invalidate_range_start/end, except those where the calls are
    close enough so that anyone who glances at the code can see the values
    aren't changing.

    This patchset is a preliminary step towards on-demand paging design to be
    added to the RDMA stack.

    Why do we want on-demand paging for Infiniband?

    Applications register memory with an RDMA adapter using system calls,
    and subsequently post IO operations that refer to the corresponding
    virtual addresses directly to HW. Until now, this was achieved by
    pinning the memory during the registration calls. The goal of on demand
    paging is to avoid pinning the pages of registered memory regions (MRs).
    This will allow users the same flexibility they get when swapping any
    other part of their process's address space. Instead of requiring the
    entire MR to fit in physical memory, we can allow the MR to be larger,
    and only fit the current working set in physical memory.

    Why should anyone care? What problems are users currently experiencing?

    This can make programming with RDMA much simpler. Today, developers
    that are working with more data than their RAM can hold need either to
    deregister and reregister memory regions throughout their process's
    life, or keep a single memory region and copy the data to it. On demand
    paging will allow these developers to register a single MR at the
    beginning of their process's life, and let the operating system manage
    which pages need to be fetched at a given time. In the future, we
    might be able to provide a single memory access key for each process
    that would expose the entire process's address space as one large
    memory region, and developers wouldn't need to register memory regions
    at all.

    Is there any prospect that any other subsystems will utilise these
    infrastructural changes? If so, which and how, etc?

    As for other subsystems, I understand that XPMEM wanted to sleep in
    MMU notifiers, as Christoph Lameter wrote at
    http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
    perhaps Andrea knows about other use cases.

    Sleeping in mmu notifiers is required since we need to sync the
    hardware with changes to the secondary page tables. A TLB flush of an
    IO device is inherently slower than a CPU TLB flush, so our design
    works by sending the invalidation request to the device and waiting
    for an interrupt before exiting the mmu notifier handler.
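
    A hedged sketch of what such a sleeping notifier handler might look
    like on the driver side (my_dev, my_dev_post_invalidate and inval_done
    are illustrative names, not an existing API):

    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
    {
            struct my_dev *dev = container_of(mn, struct my_dev, mn);

            /* Ask the HW to drop its translations for [start, end). */
            my_dev_post_invalidate(dev, start, end);

            /* Sleep until the device's interrupt handler signals
             * completion; this is only legal now that the notifier is not
             * called under the page table spinlock. */
            wait_for_completion(&dev->inval_done);
    }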

    Avi said:

    kvm may be a buyer. kvm::mmu_lock, which serializes guest page
    faults, also protects long operations such as destroying large ranges.
    It would be good to convert it into a spinlock, but as it is used inside
    mmu notifiers, this cannot be done.

    (there are alternatives, such as keeping the spinlock and using a
    generation counter to do the teardown in O(1), which is what the "may"
    is doing up there).

    [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Haggai Eran
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.
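
    A minimal sketch of where that check lands, based on the description
    above (illustrative, not the exact upstream hunk):

    /* In page_remove_rmap(): a page losing a mapping must not stay
     * PageMlocked behind our back; mlock only makes sense while mapped. */
    if (unlikely(PageMlocked(page)))
            clear_page_mlock(page);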

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
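
    In other words, the interface change is simply (sketch):

    /* before */ int page_evictable(struct page *page, struct vm_area_struct *vma);
    /* after  */ int page_evictable(struct page *page);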

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In file and anon rmap, we use interval trees to find potentially relevant
    vmas and then call vma_address() to find the virtual address the given
    page might be found at in these vmas. vma_address() used to include a
    check that the returned address falls within the limits of the vma, but
    this check isn't necessary now that we always use interval trees in rmap:
    the interval tree just doesn't return any vmas which this check would find
    to be irrelevant. As a result, we can replace the use of -EFAULT error
    code (which then needed to be checked in every call site) with a
    VM_BUG_ON().
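
    A sketch of the simplified helper (names and the page-offset
    arithmetic are approximate): with interval trees feeding rmap, the
    computed address is always inside the vma, so a debug assertion
    replaces the old -EFAULT return.

    static inline unsigned long
    vma_address(struct page *page, struct vm_area_struct *vma)
    {
            pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
            unsigned long address;

            address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
            /* The interval tree only hands us vmas that overlap this page. */
            VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
            return address;
    }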

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When a large VMA (anon or private file mapping) is first touched, which
    will populate its anon_vma field, and then split into many regions through
    the use of mprotect(), the original anon_vma ends up linking all of the
    vmas on a linked list. This can cause rmap to become inefficient, as we
    have to walk potentially thousands of irrelevant vmas before finding the
    one a given anon page might fall into.

    By replacing the same_anon_vma linked list with an interval tree (where
    each avc's interval is determined by its vma's start and last pgoffs), we
    can make rmap efficient for this use case again.

    While the change is large, all of its pieces are fairly simple.

    Most places that were walking the same_anon_vma list were looking for a
    known pgoff, so they can just use the anon_vma_interval_tree_foreach()
    interval tree iterator instead. The exception here is ksm, where the
    page's index is not known. It would probably be possible to rework ksm so
    that the index would be known, but for now I have decided to keep things
    simple and just walk the entirety of the interval tree there.

    When updating vma's that already have an anon_vma assigned, we must take
    care to re-index the corresponding avc's on their interval tree. This is
    done through the use of anon_vma_interval_tree_pre_update_vma() and
    anon_vma_interval_tree_post_update_vma(), which remove the avc's from
    their interval tree before the update and re-insert them after the update.
    The anon_vma stays locked during the update, so there is no chance that
    rmap would miss the vmas that are being updated.
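
    The update pattern described above looks roughly like this (sketch;
    the lock helper names are approximate):

    anon_vma_lock(vma->anon_vma);
    anon_vma_interval_tree_pre_update_vma(vma);   /* take avcs off the tree */
    /* ... adjust vma->vm_start / vm_end / vm_pgoff ... */
    anon_vma_interval_tree_post_update_vma(vma);  /* re-insert with new range */
    anon_vma_unlock(vma->anon_vma);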

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • mremap() had a clever optimization where move_ptes() did not take the
    anon_vma lock to avoid a race with anon rmap users such as page migration.
    Instead, the avc's were ordered in such a way that the origin vma was
    always visited by rmap before the destination. This ordering, together
    with the use of page table locks, made rmap usage safe. However, we
    want to replace the use
    of linked lists in anon rmap with an interval tree, and this will make it
    harder to impose such ordering as the interval tree will always be sorted
    by the avc->vma->vm_pgoff value. For now, let's replace the
    anon_vma_moveto_tail() ordering function with proper anon_vma locking in
    move_ptes(). Once we have the anon interval tree in place, we will
    re-introduce an optimization to avoid taking these locks in the most
    common cases.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.
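
    For the VMA case, the interval endpoints are derived from the mapping's
    file page offsets, roughly as follows (illustrative helpers, not
    necessarily the exact upstream names):

    /* a vma covers file page offsets [start, last], inclusive */
    static inline unsigned long vma_start_pgoff(struct vm_area_struct *v)
    {
            return v->vm_pgoff;
    }

    static inline unsigned long vma_last_pgoff(struct vm_area_struct *v)
    {
            return v->vm_pgoff + ((v->vm_end - v->vm_start) >> PAGE_SHIFT) - 1;
    }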

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

30 May, 2012

1 commit

  • The swap token code no longer fits in with the current VM model. It
    does not play well with cgroups or the better NUMA placement code in
    development, since we have only one swap token globally.

    It also has the potential to mess with scalability of the system, by
    increasing the number of non-reclaimable pages on the active and
    inactive anon LRU lists.

    Last but not least, the swap token code has been broken for a year
    without complaints, as reported by Konstantin Khlebnikov. This suggests
    we no longer have much use for it.

    The days of sub-1G memory systems with heavy use of swap are over. If
    we ever need thrashing reducing code in the future, we will have to
    implement something that does scale.

    Signed-off-by: Rik van Riel
    Cc: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Bob Picco
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

22 Mar, 2012

3 commits

  • Currently, per-memcg page stats are recorded in a per-page_cgroup flag by
    duplicating the page's status into that flag. The reason is that memcg
    has a feature to move a page from one group to another, and we have a
    race between "move" and "page stat accounting".

    Under the current logic, assume CPU-A and CPU-B. CPU-A does the "move"
    and CPU-B does the "page stat accounting".

    When CPU-A goes 1st:

                CPU-A                                   CPU-B
                                                update "struct page" info.
    move_lock_mem_cgroup(memcg)
    see pc->flags
    copy page stat to new group
    overwrite pc->mem_cgroup.
    move_unlock_mem_cgroup(memcg)
                                                move_lock_mem_cgroup(mem)
                                                set pc->flags
                                                update page stat accounting
                                                move_unlock_mem_cgroup(mem)

    stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
    (CPU-A) doesn't see changes in "struct page" information.

    But it's costly to have the same information both in 'struct page' and
    'struct page_cgroup'. And, there is a potential problem.

    For example, assume we have PG_dirty accounting in memcg.
    PG_... is a flag for struct page.
    PCG_... is a flag for struct page_cgroup.
    (This is just an example. The same problem can be found in any
    kind of page stat accounting.)

                CPU-A                                   CPU-B
    TestSet PG_dirty
    (delay)                                     TestClear PG_dirty
                                                if (TestClear(PCG_dirty))
                                                        memcg->nr_dirty--
    if (TestSet(PCG_dirty))
            memcg->nr_dirty++

    Here, memcg->nr_dirty ends up at +1, which is wrong. This race was
    reported by Greg Thelen. For now, only FILE_MAPPED is supported, and
    fortunately it is serialized by the page table lock, so this is not a
    real bug _now_.

    If this potential problem is caused by having duplicated information in
    struct page and struct page_cgroup, we may be able to fix it by using
    the original 'struct page' information. But then we have a problem with
    "move account".

    Assume we use only PG_dirty.

                CPU-A                                   CPU-B
    TestSet PG_dirty
    (delay)                                     move_lock_mem_cgroup()
                                                if (PageDirty(page))
                                                        new_memcg->nr_dirty++
                                                pc->mem_cgroup = new_memcg;
                                                move_unlock_mem_cgroup()
    move_lock_mem_cgroup()
    memcg = pc->mem_cgroup
    new_memcg->nr_dirty++

    The accounting information may be double-counted. This was the original
    reason to have the PCG_xxx flags, but it seems PCG_xxx has another
    problem.

    I think we need a bigger lock, as in:

    move_lock_mem_cgroup(page)
    TestSetPageDirty(page)
    update page stats (without any checks)
    move_unlock_mem_cgroup(page)

    This fixes both problems and we don't have to duplicate the page flag
    into page_cgroup. Please note: move_lock_mem_cgroup() is held only when
    there is a possibility of "account move" in the system. So, in most
    paths, the status update will proceed without atomic locks.

    This patch introduces mem_cgroup_begin_update_page_stat() and
    mem_cgroup_end_update_page_stat(); both should be called when modifying
    'struct page' information that memcg takes care of, as in:

    mem_cgroup_begin_update_page_stat()
    modify page information
    mem_cgroup_update_page_stat()
    => never check any 'struct page' info, just update counters.
    mem_cgroup_end_update_page_stat().

    This patch is slow because we need to call begin_update_page_stat()/
    end_update_page_stat() regardless of whether the accounted state will
    change or not. A following patch adds an easy optimization and reduces
    the cost.
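
    A hedged sketch of how a caller such as page_add_file_rmap() might use
    the pair (the bool/flags plumbing and the MEMCG_NR_FILE_MAPPED counter
    name are assumptions about the helper interface):

    bool locked;
    unsigned long flags;

    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
    if (atomic_inc_and_test(&page->_mapcount))
            mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
    mem_cgroup_end_update_page_stat(page, &locked, &flags);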

    [akpm@linux-foundation.org: s/lock/locked/]
    [hughd@google.com: fix deadlock by avoiding stat lock when anon]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Reduce code duplication by calling anon_vma_chain_link() from
    anon_vma_prepare().

    Also move anon_vma_chain_link() to a more suitable location in the file.

    Signed-off-by: Kautuk Consul
    Acked-by: Peter Zijlstra
    Cc: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(),
    IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method
    for checking config options in C expressions.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

11 Jan, 2012

1 commit

  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serialize properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some pte.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are lots of parents or children
    sharing the anon_vma root lock, this should still scale better than
    taking the anon_vma root lock around every pte copy for practically the
    whole duration of mremap.

    Update: Hugh noticed that special care is needed in the error path,
    where move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises anon_vma_moveto_tail():

    ===

    int main()
    {
            static struct timeval oldstamp, newstamp;
            long diffsec;
            char *p, *p2, *p3, *p4;
            if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);

            memset(p, 0xff, SIZE);
            printf("%p\n", p);
            memset(p2, 0xff, SIZE);
            memset(p3, 0x77, 4096);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
            if (p4 != p3)
                    perror("mremap"), exit(1);
            p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
            if (p4 != p+SIZE/2)
                    perror("mremap"), exit(1);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            printf("ok\n");

            return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

24 Jul, 2011

1 commit

  • On x86 a page without a mapper is by definition not referenced / old.
    The s390 architecture keeps the reference bit in the storage key and
    the current code will check the storage key for page without a mapper.
    This leads to an interesting effect: the first time an s390 system
    needs to write pages to swap it only finds referenced pages. This
    causes a lot of pages to get added and written to the swap device.
    To avoid this behaviour, change page_referenced() to query the storage
    key only if there is a mapper of the page.
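
    The gist of the change, as a sketch (not the exact upstream hunk):

    /* Only consult the s390 storage key when the page is actually mapped;
     * an unmapped page is treated as unreferenced, as on x86. */
    if (page_mapped(page) &&
        page_test_and_clear_young(page_to_pfn(page)))
            referenced++;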

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

21 Jul, 2011

1 commit

  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
      simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
      that counts the number of pending direct I/O requests. Truncate can't
      proceed as long as it's non-zero
    - when i_dio_count drops to zero we wake up a pending truncate using
      wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
      it to drop to zero because the direct I/O count always needs i_mutex
      (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).
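
    A hedged sketch of the counting scheme described above (the flag-bit
    name is an assumption, and the truncate-side wait is shown as a
    simplified poll loop rather than the bit-waitqueue sleep the
    description implies):

    /* start of a direct I/O request, with i_mutex or an equivalent held */
    atomic_inc(&inode->i_dio_count);

    /* completion of a direct I/O request */
    if (atomic_dec_and_test(&inode->i_dio_count))
            wake_up_bit(&inode->i_flags, __I_DIO_WAKEUP);   /* bit name assumed */

    /* truncate side, under i_mutex: no new requests can start, so just
     * wait for the pending count to drain to zero */
    while (atomic_read(&inode->i_dio_count))
            schedule_timeout_uninterruptible(1);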

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

18 Jun, 2011

3 commits

  • Hugh Dickins points out that lockdep (correctly) spots a potential
    deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
    of anon_vma_chain while doing anon_vma_clone(). The problem is that
    page reclaim will want to take the anon_vma lock of any anonymous pages
    that it will try to reclaim.

    So re-organize the code in anon_vma_clone() slightly: first do just a
    GFP_NOWAIT allocation, which will usually work fine. But if that fails,
    let's just drop the lock and re-do the allocation, now with GFP_KERNEL.

    End result: not only do we avoid the locking problem, this also ends up
    getting better concurrency in case the allocation does need to block.
    Tim Chen reports that with all these anon_vma locking tweaks, we're now
    almost back up to the spinlock performance.
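
    The reorganised allocation ends up looking roughly like this (a sketch
    of the idea, assuming anon_vma_chain_alloc() takes a gfp mask and an
    unlock_anon_vma_root() helper exists):

    avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
    if (unlikely(!avc)) {
            unlock_anon_vma_root(root);              /* drop the anon_vma lock */
            root = NULL;
            avc = anon_vma_chain_alloc(GFP_KERNEL);  /* now we may block */
            if (!avc)
                    goto enomem_failure;
    }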

    Reported-and-tested-by: Hugh Dickins
    Tested-by: Tim Chen
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This matches the anon_vma_clone() case, and uses the same lock helper
    functions. Because of the need to potentially release the anon_vma's,
    it's a bit more complex, though.

    We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
    the anon_vma lock (with the helper function that only takes the lock
    once for the whole loop), and removes any entries that don't need any
    more processing.

    The second phase just traverses the remaining list entries (without
    holding the anon_vma lock), and does any actual freeing of the
    anon_vma's that is required.

    Signed-off-by: Peter Zijlstra
    Tested-by: Hugh Dickins
    Tested-by: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
    vma, locking the anon_vma for each entry.

    But they are all going to have the same root entry, which means that
    we're locking and unlocking the same lock over and over again. Which is
    expensive in locked operations, but can get _really_ expensive when that
    root entry sees any kind of lock contention.

    In fact, Tim Chen reports a big performance regression due to this: when
    we switched to use a mutex instead of a spinlock, the contention case
    gets much worse.

    So to alleviate this all, this commit creates a small helper function
    (lock_anon_vma_root()) that can be used to take the lock just once
    rather than taking and releasing it over and over again.

    We still have the same "take the lock and release" it behavior in the
    exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
    since we're actually freeing the anon_vma entries as we go, and that
    will touch the lock too.
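
    A sketch of such a helper (assuming the anon_vma lock is the root's
    mutex, per the mutex conversion further down this log):

    static inline struct anon_vma *
    lock_anon_vma_root(struct anon_vma *root, struct anon_vma *anon_vma)
    {
            struct anon_vma *new_root = anon_vma->root;

            if (new_root != root) {
                    if (root)                       /* already holding one? */
                            mutex_unlock(&root->mutex);
                    root = new_root;
                    mutex_lock(&root->mutex);       /* take it once for the loop */
            }
            return root;
    }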

    Reported-and-tested-by: Tim Chen
    Tested-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    [hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old]
    [akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

29 May, 2011

2 commits

  • On one machine I've been getting hangs, a page fault's anon_vma_prepare()
    waiting in anon_vma_lock(), other processes waiting for that page's lock.

    This is a replay of last year's f18194275c39 "mm: fix hang on
    anon_vma->root->lock".

    The new page_lock_anon_vma() places too much faith in its refcount: when
    it has acquired the mutex_trylock(), it's possible that a racing task in
    anon_vma_alloc() has just reallocated the struct anon_vma, set refcount
    to 1, and is about to reset its anon_vma->root.

    Fix this by saving anon_vma->root, and relying on the usual page_mapped()
    check instead of a refcount check: if page is still mapped, the anon_vma
    is still ours; if page is not still mapped, we're no longer interested.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I've hit the "address >= vma->vm_end" check in do_page_add_anon_rmap()
    just once. The stack showed khugepaged allocation trying to compact
    pages: the call to page_add_anon_rmap() coming from remove_migration_pte().

    That path holds anon_vma lock, but does not hold mmap_sem: it can
    therefore race with a split_vma(), and in commit 5f70b962ccc2 "mmap:
    avoid unnecessary anon_vma lock" we just took away the anon_vma lock
    protection when adjusting vma->vm_end.

    I don't think that particular BUG_ON ever caught anything interesting,
    so better replace it by a comment, than reinstate the anon_vma locking.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 May, 2011

6 commits

  • Optimize the page_lock_anon_vma() fast path to be one atomic op, instead
    of two.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Straightforward conversion of anon_vma->lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Convert page_lock_anon_vma() over to use refcounts. This is done to
    prepare for the conversion of anon_vma from spinlock to mutex.

    Sadly this increases the cost of page_lock_anon_vma() from one atomic
    op to two; a follow-up patch addresses this, so let's keep it simple
    for now.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • A slightly more verbose comment to go along with the trickery in
    page_lock_anon_vma().

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • It's beyond ugly and gets in the way.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Namhyung Kim
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

23 May, 2011

1 commit

  • The page_clear_dirty primitive always sets the default storage key
    which resets the access control bits and the fetch protection bit.
    That will surprise a KVM guest that sets non-zero access control
    bits or the fetch protection bit. Merge page_test_dirty and
    page_clear_dirty back to a single function and only clear the
    dirty bit from the storage key.

    In addition, move the functions page_test_and_clear_dirty and
    page_test_and_clear_young to page.h where they belong. This requires
    changing the parameter from a struct page * to a page
    frame number.
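
    The resulting primitive is roughly (a sketch of the merged interface;
    the extra "mapped" hint argument alongside the pfn is an assumption):

    int page_test_and_clear_dirty(unsigned long pfn, int mapped);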

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

25 Mar, 2011

2 commits

  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

23 Mar, 2011

3 commits

  • This patch changes the anon_vma refcount to be 0 when the object is free.
    It does this by counting the anon_vma being in use (i.e. the
    anon_vma->head list being non-empty) as one reference.

    This allows a simpler release scheme without having to check both the
    refcount and the list as well as avoids taking a ref for each entry on the
    list.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • We need the anon_vma refcount unconditionally to simplify the anon_vma
    lifetime rules.

    Signed-off-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • The normal code pattern used in the kernel is: get/put.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra