27 May, 2011

3 commits

  • Two new stats in per-memcg memory.stat which track the number of page
    faults and the number of major page faults.

    "pgfault"
    "pgmajfault"

    They differ from the "pgpgin"/"pgpgout" stats, which count the number
    of pages charged/uncharged to the cgroup and say nothing about
    reading/writing pages to disk.

    Tracking these two stats is valuable both for measuring an
    application's performance and for gauging the efficiency of the kernel
    page reclaim path. Counting page faults per process is useful, but we
    also need the aggregated value, since processes are monitored and
    controlled on a per-cgroup basis in memcg.
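
    A rough sketch of where the counters are bumped, based on this
    description (treat the exact call sites as illustrative):

    /* illustrative sketch, not the verbatim patch */

    /* in handle_mm_fault() (mm/memory.c): */
    count_vm_event(PGFAULT);
    mem_cgroup_count_vm_event(mm, PGFAULT);

    /* in filemap_fault() (mm/filemap.c), when the fault needed real IO: */
    count_vm_event(PGMAJFAULT);
    mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);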

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads,
    faulting in 15G of anon pages in a 16G container. No regression is
    noticed on the "flt/cpu/s" metric.

    Sample output from pft:

    TAG pft:anon-sys-default:
    Gb  Thr  CLine  User   System    Wall    flt/cpu/s  fault/wsec
    15  16   1      0.67s  233.41s   14.76s  16798.546  266356.260

    +-------------------------------------------------------------------------+
        N         Min         Max        Median         Avg      Stddev
    x  10   16682.962   17344.027   16913.524   16928.812    166.5362
    +  10   16695.568   16923.896   16820.604   16824.652   84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
    xen: cleancache shim to Xen Transcendent Memory
    ocfs2: add cleancache support
    ext4: add cleancache support
    btrfs: add cleancache support
    ext3: add cleancache support
    mm/fs: add hooks to support cleancache
    mm: cleancache core ops functions and config
    fs: add field to superblock to support cleancache
    mm/fs: cleancache documentation

    Fix up trivial conflict in fs/btrfs/extent_io.c due to includes

    Linus Torvalds
     
  • This fourth patch of eight in this cleancache series provides the
    core hooks in VFS for: initializing cleancache per filesystem;
    capturing clean pages reclaimed by page cache; attempting to get
    pages from cleancache before filesystem read; and ensuring coherency
    between pagecache, disk, and cleancache. Note that the placement
    of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
    change was required due to a patchset in 2.6.39.

    All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
    a check of a boolean global if CONFIG_CLEANCACHE is set but no
    cleancache "backend" has claimed cleancache_ops.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt
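
    A minimal sketch of the hook shape (names follow the cleancache
    documentation; the call sites are simplified here):

    /* on mount: */
    cleancache_init_fs(sb);

    /* before the filesystem issues a read for a missing page: */
    if (cleancache_get_page(page) == 0)
            goto page_was_in_cleancache;    /* filled without doing IO */

    /* when a clean page cache page is evicted: */
    cleancache_put_page(page);

    /* on truncate/invalidate, to keep coherency: */
    cleancache_flush_page(mapping, page);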

    [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
    Signed-off-by: Chris Mason
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
     

25 May, 2011

6 commits

  • Previously, mmap sequential readahead was triggered by updating
    ra->prev_pos on each page fault and comparing it with the current page
    offset.

    That dirties the cache line on each _minor_ page fault. So remove the
    ra->prev_pos recording, and instead tag PG_readahead to trigger the
    possible sequential readahead. It's not only simpler, but will also
    work more reliably and reduce cache line bouncing on concurrent page
    faults on a shared struct file.

    In the mosbench exim benchmark, which does multi-threaded page faults
    on a shared struct file, the ra->mmap_miss and ra->prev_pos updates
    are found to cause excessive cache line bouncing on tmpfs, which has
    readahead disabled entirely (shmem_backing_dev_info.ra_pages == 0).

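    With the change, the asynchronous readahead trigger keys off
    PG_readahead instead of ra->prev_pos; a simplified sketch of the
    resulting do_async_mmap_readahead() (details illustrative):

    static void do_async_mmap_readahead(struct vm_area_struct *vma,
                                        struct file_ra_state *ra,
                                        struct file *file,
                                        struct page *page,
                                        pgoff_t offset)
    {
            struct address_space *mapping = file->f_mapping;

            /* If we don't want any read-ahead, don't bother */
            if (VM_RandomReadHint(vma))
                    return;
            if (ra->mmap_miss > 0)
                    ra->mmap_miss--;
            /* PG_readahead set on this page means: keep reading ahead */
            if (PageReadahead(page))
                    page_cache_async_readahead(mapping, ra, file,
                                               page, offset, ra->ra_pages);
    }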

    Signed-off-by: Wu Fengguang
    Tested-by: Tim Chen
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The original INT_MAX is too large; reduce it to:

    - avoid unnecessarily dirtying/bouncing the cache line

    - restore mmap read-around faster on changed access pattern

    Background: in the mosbench exim benchmark, which does multi-threaded
    page faults on a shared struct file, the ra->mmap_miss updates are
    found to cause excessive cache line bouncing on tmpfs. The ra state
    updates are needless for tmpfs, which has readahead disabled entirely
    (shmem_backing_dev_info.ra_pages == 0).
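
    The cap amounts to a small guard in the miss accounting (sketch;
    MMAP_LOTSAMISS is the existing miss threshold in mm/filemap.c):

    /* Avoid banging the cache line if not needed */
    if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
            ra->mmap_miss++;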

    Tested-by: Tim Chen
    Signed-off-by: Andi Kleen
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Reduce readahead overheads by returning early in do_sync_mmap_readahead().

    tmpfs has ra_pages=0 and it can page fault really fast (not
    constrained by IO when not swapping).
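
    The change is essentially an early bail-out at the top of
    do_sync_mmap_readahead() (sketch):

    /* If we don't want any read-ahead, don't bother */
    if (!ra->ra_pages)
            return;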

    Signed-off-by: Wu Fengguang
    Tested-by: Tim Chen
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • When an oom killing occurs, almost all processes are getting stuck at the
    following two points.

    1) __alloc_pages_nodemask
    2) __lock_page_or_retry

    1) is not very problematic because TIF_MEMDIE leads to an allocation
    failure, and the task gets out of the page allocator.

    2) is more problematic. In an OOM situation, zones typically don't
    have any page cache, and memory starvation might lead to greatly
    reduced IO performance. When a fork bomb occurs, TIF_MEMDIE tasks
    don't die quickly, meaning that the fork bomb may create new processes
    faster than the oom-killer can kill them. Then the system may become
    livelocked.

    This patch makes the pagefault interruptible by SIGKILL.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit 2687a356 ("Add lock_page_killable") introduced a killable
    lock_page(). Similarly, this patch introduces a killable
    wait_on_page_locked().
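
    A sketch of the new helper's shape (the killable bit-wait primitive
    shown is an assumption about the internals; the real version waits on
    PG_locked in TASK_KILLABLE state):

    int wait_on_page_locked_killable(struct page *page)
    {
            if (!PageLocked(page))
                    return 0;       /* already unlocked, nothing to wait for */
            /* wait for PG_locked to clear, but let SIGKILL interrupt us */
            return wait_on_page_bit_killable(page, PG_locked);
    }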

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 Mar, 2011

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation that
    was required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by taking the inode->i_lock around the state change
    and wakeup. Because the wakeup is no longer tricky, remove the
    wake_up_inode() function and open-code the wakeup where necessary.
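
    List walkers then hold inode->i_lock across the state check and the
    reference grab; a sketch of the traversal pattern described above:

    spin_lock(&inode->i_lock);
    if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
            spin_unlock(&inode->i_lock);
            continue;       /* skip inodes already being torn down */
    }
    __iget(inode);          /* ref taken atomically with the state check */
    spin_unlock(&inode->i_lock);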

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

7 commits

  • Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
    truncate_inode_pages_range() - stop looking further when it returns 0.

    But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
    especially if we have preemptible RCU enabled, isn't it conceivable
    that all 14 pages returned could be removed from the page cache by
    shrink_page_list() before find_get_pages() gets to process them,
    causing it to return 0 although there may be plenty more pages beyond?

    Make find_get_pages() and find_get_pages_tag() check for this unlikely
    case, and restart should it occur; callers of find_get_pages_contig(),
    though, have no such expectation - it's okay for that to return 0
    early.

    I have not seen this in practice, just worried by the possibility.
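
    The check boils down to restarting when every looked-up entry slipped
    away before it could be secured (sketch of the added logic in
    find_get_pages()):

    /*
     * If all entries were removed before we could secure them,
     * try again, because callers stop trying once 0 is returned.
     */
    if (unlikely(!ret && nr_found))
            goto restart;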

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The radix_tree_deref_retry() case in find_get_pages() has a strange little
    excrescence, not seen in the other gang lookups: it looks like the start
    of an abandoned attempt to guarantee forward progress in a case that
    cannot arise.

    ret should always be 0 here: if it isn't, then going back to restart will
    leak references to pages already gotten. There used to be a comment
    saying nr_found is necessarily 1 here: that's not quite true, but the
    radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
    race with it being moved out of the radix_tree root or back.

    Remove the worrisome two lines, add a brief comment here and in
    find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
    find_get_pages() should it ever be seen elsewhere than at 0.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that remove_from_page_cache() has been renamed to
    delete_from_page_cache(), rename the internal page cache handling
    function as well, for consistency with __remove_from_swap_cache() and
    remove_from_swap_cache().

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now delete_from_page_cache() replaces remove_from_page_cache(), so
    remove remove_from_page_cache() entirely; filesystems and other code
    out of mainline will notice it at compile time and can fix themselves.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Presently we increase the page refcount in add_to_page_cache() but
    don't decrease it in remove_from_page_cache(). Such asymmetry adds
    confusion, requiring callers to notice it and a comment to explain why
    they release a page reference. It's not a good API.

    A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140)
    but gave up because reiser4's drop_page() had to unlock the page
    between removing it from the page cache and doing the
    page_cache_release(). But the situation has changed now: I think
    nothing in current mainline has any such obstacle. The problem is
    out-of-mainline filesystems - if they do such things as reiser4 did,
    this patch could be a problem for them, but they will discover it at
    compile time, since we remove remove_from_page_cache().

    This patch:

    This function is just a wrapper around remove_from_page_cache(); the
    difference is that it drops the page reference itself, so the caller
    has to make sure it holds a page reference before calling it.

    This patch is ready for removing remove_from_page_cache().
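
    A simplified sketch of the new wrapper (the real function also handles
    the mapping's freepage callback and the memcg uncharge):

    void delete_from_page_cache(struct page *page)
    {
            struct address_space *mapping = page->mapping;

            BUG_ON(!PageLocked(page));

            spin_lock_irq(&mapping->tree_lock);
            __remove_from_page_cache(page);
            spin_unlock_irq(&mapping->tree_lock);
            page_cache_release(page);       /* drop the page cache reference */
    }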

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.
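
    The helper's signature (sketch):

    int replace_page_cache_page(struct page *old, struct page *new,
                                gfp_t gfp_mask);

    The caller must hold a reference on 'old'; on success, 'new' takes
    old's place (and index) in the mapping.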

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • A GUP user may want to try to acquire a reference to a page if it is
    already in memory, but not if IO is needed to bring it in. For
    example, KVM may tell a vcpu to schedule another guest process if the
    current one is trying to access a swapped-out page. Meanwhile, the
    page will be swapped in, and the guest process that depends on it will
    be able to run again.

    This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
    FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
    conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
    it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
    instead.
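
    In the GUP path, the new flag translates into the fault flags roughly
    as follows (sketch):

    if (foll_flags & FOLL_NOWAIT)
            /*
             * Allow the retry path, but don't sleep on a locked page:
             * return VM_FAULT_RETRY to the caller instead.
             */
            fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;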

    [akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
    Signed-off-by: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gleb Natapov
     

14 Jan, 2011

3 commits

  • Running the annotated branch profiler on a box doing average work
    (firefox, evolution, xchat, distcc farm), the likely() used in
    grab_cache_page_write_begin() was incorrect most of the time:

    correct   incorrect   %   Function                      File       Line
    -------   ---------   -   --------                      ----       ----
    1924262   71332401   97   grab_cache_page_write_begin   filemap.c  2206

    Adding a trace_printk() and running the function tracer limited to
    just this function I can see:

    gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
    gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Temporary IO failures, e.g. due to loss of both multipath paths, can
    permanently leave the PageError bit set on a page, resulting in msync
    or fsync returning -EIO over and over again, even if IO is now getting
    to the disk correctly.

    We already clear the AS_ENOSPC and AS_EIO bits in mapping->flags in
    the filemap_fdatawait_range function. Also clearing the PageError bit
    on the page allows subsequent msync or fsync calls on this file to
    return without an error, if the subsequent IO succeeds.

    Unfortunately data written out in the msync or fsync call that returned
    -EIO can still get lost, because the page dirty bit appears to not get
    restored on IO error. However, the alternative could be potentially all
    of memory filling up with uncleanable dirty pages, hanging the system, so
    there is no nice choice here...
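
    The per-page change in filemap_fdatawait_range() is roughly (sketch):

    wait_on_page_writeback(page);
    if (TestClearPageError(page))   /* report the error once, then clear it */
            ret = -EIO;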

    Signed-off-by: Rik van Riel
    Acked-by: Valerie Aurora
    Acked-by: Jeff Layton
    Cc: Theodore Ts'o
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Testing ->mapping and ->index without a ref is not stable as the page
    may have been reused at this point.

    Signed-off-by: Nick Piggin
    Reviewed-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

02 Dec, 2010

1 commit

  • NFS needs to be able to release objects that are stored in the page
    cache once the page itself is no longer visible from the page cache.

    This patch adds a callback to the address space operations that allows
    filesystems to perform page cleanups once the page has been removed
    from the page cache.
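
    The callback is a new address_space operation, invoked once the page
    is gone from the page cache (sketch):

    struct address_space_operations {
            ...
            /* cleanup hook: the page has been removed from the cache */
            void (*freepage)(struct page *);
    };

    /* at the removal sites: */
    if (mapping->a_ops->freepage)
            mapping->a_ops->freepage(page);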

    Original patch by: Linus Torvalds
    [trondmy: cover the cases of invalidate_inode_pages2() and
    truncate_inode_pages()]
    Signed-off-by: Trond Myklebust

    Linus Torvalds
     

12 Nov, 2010

2 commits

  • Salman Qazi describes the following radix-tree bug:

    In the following case, we can get a deadlock:

    0. The radix tree contains two items, one having index 0.
    1. The reader (in this case find_get_pages) takes the rcu_read_lock.
    2. The reader acquires slot(s) for item(s) including the index 0 item.
    3. The non-zero index item is deleted, and as a consequence the other
       item is moved to the root of the tree. The place where it used to
       be is queued for deletion after the readers finish.
    3b. The zero item is deleted, removing it from the direct slot; it
        remains in the rcu-delayed indirect node.
    4. The reader looks at the index 0 slot, and finds that the page has a
       0 ref count.
    5. The reader looks at it again, hoping that the item will either be
       freed or the ref count will increase. This never happens, as the
       slot it is looking at will never be updated. Also, this slot can
       never be reclaimed because the reader is holding rcu_read_lock and
       is in an infinite loop.

    The fix is to generalize the existing "indirect" pointer case, which
    already requires a slot lookup retry, into a general "retry the
    lookup" bit.

    Signed-off-by: Nick Piggin
    Reported-by: Salman Qazi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • 70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
    ran into a NULL dereference in here:

    int block_is_partially_uptodate(struct page *page,
                                    read_descriptor_t *desc,
                                    unsigned long from)
    {
    ---->   struct inode *inode = page->mapping->host;

    It looks like page->mapping was the culprit. (xmon trace is below).
    After closer examination, I realized that do_generic_file_read() does a
    find_get_page(), and eventually locks the page before calling
    block_is_partially_uptodate(). However, it doesn't revalidate the
    page->mapping after the page is locked. So, there's a small window
    between the find_get_page() and ->is_partially_uptodate() where the page
    could get truncated and page->mapping cleared.

    We _have_ a reference, so it can't get reclaimed, but it certainly
    can be truncated.

    I think the correct thing is to check page->mapping after the
    trylock_page(), and jump out if it got truncated. This patch has been
    running in the test environment for a month or so now, and we have not
    seen this bug pop up again.
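
    The fix is a recheck between taking the page lock and calling
    ->is_partially_uptodate(); a sketch of the do_generic_file_read()
    change:

    if (!trylock_page(page))
            goto page_not_up_to_date;
    /* Did it get truncated before we got the lock? */
    if (!page->mapping)
            goto page_not_up_to_date_locked;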

    xmon info:

    1f:mon> e
    cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
    pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
    lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
    sp: c0000002ae36f9f0
    msr: 8000000000009032
    dar: 0
    dsisr: 40000000
    current = 0xc000000378f99e30
    paca = 0xc000000000f66300
    pid = 21946, comm = bash
    1f:mon> r
    R00 = 0025c0500000006d R16 = 0000000000000000
    R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
    R02 = c000000000e8cd80 R18 = ffffffffffffffff
    R03 = c0000000031d0f88 R19 = 0000000000000001
    R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
    R05 = 0000000000000000 R21 = c0000002ae36fa68
    R06 = 0000000000000000 R22 = 0000000000000000
    R07 = 0000000000000001 R23 = c0000002ae36fbb0
    R08 = 0000000000000002 R24 = 0000000000000000
    R09 = 0000000000000000 R25 = c000000362cd3a80
    R10 = 0000000000000000 R26 = 0000000000000002
    R11 = c0000000001e7b60 R27 = 0000000000000000
    R12 = 0000000042000484 R28 = 0000000000000001
    R13 = c000000000f66300 R29 = c0000003bb97b9b8
    R14 = 0000000000000001 R30 = c000000000e28a08
    R15 = 000000000000ffff R31 = c0000000031d0f88
    pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
    lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
    msr = 8000000000009032 cr = 22000488
    ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
    dar = 0000000000000000 dsisr = 40000000
    1f:mon> t
    [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
    [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
    [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
    [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
    [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
    [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
    --- Exception: c00 (System Call) at 00000080a840bc54
    SP (fffca15df30) is in userspace
    1f:mon> di c0000000001e7a6c
    c0000000001e7a6c e9290000 ld r9,0(r9)
    c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
    c0000000001e7a74 e9440008 ld r10,8(r4)
    c0000000001e7a78 78a80020 clrldi r8,r5,32
    c0000000001e7a7c 3c000001 lis r0,1
    c0000000001e7a80 812900a8 lwz r9,168(r9)
    c0000000001e7a84 39600001 li r11,1
    c0000000001e7a88 7c080050 subf r0,r8,r0
    c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
    c0000000001e7a90 7d6b4830 slw r11,r11,r9
    c0000000001e7a94 796b0020 clrldi r11,r11,32
    c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
    c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
    c0000000001e7aa0 7d004214 add r8,r0,r8
    c0000000001e7aa4 79080020 clrldi r8,r8,32
    c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100

    Signed-off-by: Dave Hansen
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

27 Oct, 2010

3 commits

  • 'end' shadows an earlier declaration and is not necessary at all.
    Remove it and use 'pos' instead. This removes the following sparse
    warnings:

    mm/filemap.c:2180:24: warning: symbol 'end' shadows an earlier one
    mm/filemap.c:2132:25: originally declared here

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • This change reduces mmap_sem hold times that are caused by waiting for
    disk transfers when accessing file mapped VMAs.

    It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
    site wants mmap_sem to be released if blocking on a pending disk transfer.
    In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
    do_page_fault() will then re-acquire mmap_sem and retry the page fault.

    It is expected that the retry will hit the same page which will now be
    cached, and thus it will complete with a low mmap_sem hold time.
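
    On the arch side, the fault handler opts in and retries once after
    re-taking mmap_sem; a simplified sketch of the do_page_fault()
    pattern (vma lookup and access checks elided):

    unsigned int flags = FAULT_FLAG_ALLOW_RETRY;
    int fault;

    retry:
            down_read(&mm->mmap_sem);
            vma = find_vma(mm, address);    /* re-lookup after any retry */
            /* ... access checks ... */
            fault = handle_mm_fault(mm, vma, address, flags);
            if (fault & VM_FAULT_RETRY) {
                    /* mmap_sem was dropped for us; allow only one retry */
                    flags &= ~FAULT_FLAG_ALLOW_RETRY;
                    goto retry;
            }
            up_read(&mm->mmap_sem);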

    Tests:

    - microbenchmark: thread A mmaps a large file and does random read accesses
    to the mmaped area - achieves about 55 iterations/s. Thread B does
    mmap/munmap in a loop at a separate location - achieves 55 iterations/s
    before, 15000 iterations/s after.

    - We are seeing related effects in some applications in house, which show
    significant performance regressions when running without this change.

    [akpm@linux-foundation.org: fix warning & crash]
    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Acked-by: Linus Torvalds
    Cc: Nick Piggin
    Reviewed-by: Wu Fengguang
    Cc: Ying Han
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: "H. Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Introduce a single location where filemap_fault() locks the desired
    page. There used to be two such places, depending on whether the
    initial find_get_page() was successful or not.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Acked-by: Linus Torvalds
    Cc: Nick Piggin
    Reviewed-by: Wu Fengguang
    Cc: Ying Han
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     

27 May, 2010

1 commit

  • I/O errors can happen due to temporary failures, like multipath
    errors or losing network contact with the iSCSI server. Because
    of that, the VM will retry readpage on the page.

    However, do_generic_file_read does not clear PG_error. This
    causes the system to be unable to actually use the data in the
    page cache page, even if the subsequent readpage completes
    successfully!

    The function filemap_fault has had a ClearPageError before
    readpage forever. This patch simply adds the same to
    do_generic_file_read.
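
    The change, in the readpage path of do_generic_file_read() (sketch):

    readpage:
            /*
             * A previous I/O error may have been due to temporary
             * failures, e.g. multipath errors.  PG_error will be
             * set again if readpage fails.
             */
            ClearPageError(page);
            /* Start the actual read. The read will unlock the page. */
            error = mapping->a_ops->readpage(filp, page);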

    Signed-off-by: Jeff Moyer
    Signed-off-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

25 May, 2010

3 commits

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing
    all old disallowed bits later. But in between, the allocator may find
    that there is no node from which to allocate memory.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes on which the allocator can allocate pages, for example:

    (mpol: mempolicy)
    task1                          task1's mpol    task2
    alloc page                     1
      alloc on node0? NO           1
                                   1               change mems from 1 to 0
                                   1               rebind task1's mpol
                                   0-1               set new bits
                                   0                 clear disallowed bits
      alloc on node1? NO           0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first
    (setting the newly allowed bits) and shrinking it lazily (clearing the
    newly disallowed bits). A variable tells the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears
    the newly disallowed nodes only after the read-side task finishes its
    current memory allocation.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Shaohua Li reported that a parallel file copy on tmpfs can lead to
    the OOM killer. This is a regression caused by commit 9ff473b9a7
    ("vmscan: evict streaming IO first"). Wow, it is a 2-year-old patch!

    Currently, tmpfs file cache is inserted into the active list at first.
    This means that the insertion doesn't only increase the number of
    pages on the anon LRU, it also reduces the anon scanning ratio.
    Therefore, vmscan gets totally confused: it scans almost only the file
    LRU even though the system has plenty of unused tmpfs pages.

    Historically, lru_cache_add_active_anon() was used for two reasons:
    1) To prioritize shmem pages over regular file cache.
    2) To avoid reclaim priority inversion of used-once pages.

    But we've lost both motivations, because (1) now we have separate anon
    and file LRU lists, so inserting into the active list no longer
    provides such prioritization, and (2) in the past, a single pte access
    bit would cause page activation, so inserting into the inactive list
    with the pte access bit set meant higher priority than inserting into
    the active list. That priority inversion could lead to unintended LRU
    churn, but it was already solved by commit 645747462 ("vmscan: detect
    mapped file pages used only once"). (Thanks Hannes, you are great!)

    Thus, now we can use lru_cache_add_anon() instead.
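
    The change in add_to_page_cache_lru() boils down to (sketch):

    if (page_is_file_cache(page))
            lru_cache_add_file(page);
    else
            lru_cache_add_anon(page);   /* was lru_cache_add_active_anon() */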

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Henrique de Moraes Holschuh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This is similar to what already happens in the write case. If we have
    a short read while doing O_DIRECT, instead of just returning, fall
    through and try to read the rest via buffered IO. BTRFS needs this
    because if we encounter a compressed or inline extent during DIO, we
    need to fall back to buffered IO. If the extent is compressed, we need
    to read the entire thing into memory and decompress it into the user's
    pages. I have tested this with fsx and everything works great. Thanks,
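
    The fallback in generic_file_aio_read() looks roughly like this
    (simplified sketch):

    retval = mapping->a_ops->direct_IO(READ, iocb, iov, pos, nr_segs);
    if (retval > 0) {
            *ppos = pos + retval;
            count -= retval;
    }
    /*
     * A short DIO read means we hit something (e.g. a compressed
     * extent) that direct IO can't handle.  If there was no error,
     * we still have bytes to read, and we're not at EOF, fall
     * through to buffered IO for the rest of the read.
     */
    if (retval < 0 || !count || *ppos >= size) {
            file_accessed(filp);
            goto out;
    }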

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik