25 Mar, 2011

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.
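
    The requirement that the reference grab be atomic with the i_state
    check can be sketched in userspace as below. This is only an
    illustration under assumptions: struct sk_inode, SK_I_FREEING and the
    sk_* helpers are hypothetical stand-ins, and a pthread mutex plays the
    role of inode->i_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define SK_I_FREEING 0x1u   /* hypothetical stand-in for I_FREEING */

struct sk_inode {
    pthread_mutex_t i_lock; /* protects i_state and i_count */
    unsigned int i_state;
    int i_count;
};

/*
 * Take a reference only if the inode is not being freed. The state
 * check and the increment share one critical section, so no other
 * thread can set SK_I_FREEING between the two.
 */
static bool sk_try_iget(struct sk_inode *inode)
{
    bool got = false;

    pthread_mutex_lock(&inode->i_lock);
    assert(inode->i_count >= 0);
    if (!(inode->i_state & SK_I_FREEING)) {
        inode->i_count++;
        got = true;
    }
    pthread_mutex_unlock(&inode->i_lock);
    return got;
}

/* State transitions take the same lock as the check above. */
static void sk_mark_freeing(struct sk_inode *inode)
{
    pthread_mutex_lock(&inode->i_lock);
    inode->i_state |= SK_I_FREEING;
    pthread_mutex_unlock(&inode->i_lock);
}
```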

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

7 commits

  • Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
    truncate_inode_pages_range() - stop looking further when it returns 0.

    But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
    especially if we have preemptible RCU enabled, isn't it conceivable that
    all 14 pages returned could be removed from the page cache by
    shrink_page_list(), before find_get_pages() gets to process them? So
    causing it to return 0 although there may be plenty more pages beyond.

    Make find_get_pages() and find_get_pages_tag() check for this unlikely
    case, and restart should it occur; but callers of find_get_pages_contig()
    have no such expectation, it's okay for that to return 0 early.

    I have not seen this in practice, just worried by the possibility.
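
    A minimal userspace model of the restart check, assuming a flat array
    in place of the radix tree. All sk_* names, SK_BATCH and the race
    callback are hypothetical and exist only to make the race
    reproducible in a single thread.

```c
#include <assert.h>
#include <stddef.h>

#define SK_BATCH 4   /* hypothetical batch size; the text above mentions 14 */

struct sk_page {
    int refcount;
    int in_cache;    /* nonzero while the page is still in the cache */
};

/* The "gang lookup": collect up to SK_BATCH candidates from 'start' on. */
static size_t sk_gang_lookup(struct sk_page *cache, size_t n, size_t start,
                             struct sk_page **out)
{
    size_t found = 0;

    for (size_t i = start; i < n && found < SK_BATCH; i++)
        if (cache[i].in_cache)
            out[found++] = &cache[i];
    return found;
}

/* Simulated reclaim: evicts the first SK_BATCH pages of the cache. */
static void sk_evict_first_batch(struct sk_page *cache, size_t n)
{
    for (size_t i = 0; i < n && i < SK_BATCH; i++)
        cache[i].in_cache = 0;
}

static size_t sk_find_get_pages(struct sk_page *cache, size_t n, size_t start,
                                struct sk_page **pages,
                                void (*race)(struct sk_page *, size_t))
{
    struct sk_page *cand[SK_BATCH];
    size_t nr_found, ret;

    assert(pages != NULL);
restart:
    nr_found = sk_gang_lookup(cache, n, start, cand);
    if (race) {                 /* reclaim runs right after the lookup */
        race(cache, n);
        race = NULL;            /* fire once only */
    }
    ret = 0;
    for (size_t i = 0; i < nr_found; i++) {
        if (!cand[i]->in_cache)
            continue;           /* removed before we could secure it */
        cand[i]->refcount++;
        pages[ret++] = cand[i];
    }
    /* Candidates existed but all vanished: look again instead of returning 0. */
    if (ret == 0 && nr_found != 0)
        goto restart;
    return ret;
}
```

    Without the final check, a caller that stops at a return value of 0
    would never see the pages that lie beyond the evicted batch.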

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The radix_tree_deref_retry() case in find_get_pages() has a strange little
    excrescence, not seen in the other gang lookups: it looks like the start
    of an abandoned attempt to guarantee forward progress in a case that
    cannot arise.

    ret should always be 0 here: if it isn't, then going back to restart will
    leak references to pages already gotten. There used to be a comment
    saying nr_found is necessarily 1 here: that's not quite true, but the
    radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
    race with it being moved out of the radix_tree root or back.

    Remove the worrisome two lines, add a brief comment here and in
    find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
    find_get_pages() should it ever be seen elsewhere than at 0.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that remove_from_page_cache has been renamed to
    delete_from_page_cache, rename the internal page cache handling function
    too, for consistency with __remove_from_swap_cache and
    remove_from_swap_cache.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now delete_from_page_cache() replaces remove_from_page_cache(), so
    remove remove_from_page_cache(). Filesystems or other code outside
    mainline will notice the removal at compile time and can fix it.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Presently we increase the page refcount in add_to_page_cache() but don't
    decrease it in remove_from_page_cache(). Such asymmetry adds confusion,
    requiring that callers notice it and a comment explaining why they release
    a page reference. It's not a good API.

    A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
    gave up because reiser4's drop_page() had to unlock the page between
    removing it from page cache and doing the page_cache_release(). But now
    the situation is changed. I think at least things in current mainline
    don't have any obstacles. The problem is for out-of-mainline filesystems
    - if they have done such things as reiser4, this patch could be a problem
    but they will discover this at compile time since we remove
    remove_from_page_cache().

    This patch:

    This function works as just a wrapper around remove_from_page_cache().
    The difference is that it decrements the page reference count itself, so
    the caller has to make sure it holds a page reference before calling.

    This patch is ready for removing remove_from_page_cache().
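
    The reference counting asymmetry and the wrapper can be modelled
    roughly as follows. This is a sketch under assumptions, not the
    kernel code: struct sk_page and the sk_* functions are hypothetical,
    and a plain int stands in for the page refcount.

```c
#include <assert.h>
#include <stddef.h>

struct sk_page {
    int refcount;
    void *mapping;   /* non-NULL while the page is in a page cache */
};

static void sk_add_to_page_cache(struct sk_page *page, void *mapping)
{
    page->refcount++;            /* the cache takes its own reference */
    page->mapping = mapping;
}

/* Old style: detaches the page but leaves the cache's reference behind. */
static void sk_remove_from_page_cache(struct sk_page *page)
{
    page->mapping = NULL;
}

static void sk_page_cache_release(struct sk_page *page)
{
    assert(page->refcount > 0);
    page->refcount--;
}

/*
 * New style: detach and drop the cache's reference in one call.
 * The caller must hold its own reference, or the page could reach
 * a refcount of zero while the caller still uses it.
 */
static void sk_delete_from_page_cache(struct sk_page *page)
{
    assert(page->refcount >= 2); /* cache reference + caller reference */
    sk_remove_from_page_cache(page);
    sk_page_cache_release(page);
}
```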

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.
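
    The atomic swap can be pictured with the userspace sketch below. It
    assumes a fixed slot array guarded by a pthread mutex instead of the
    real page cache, all sk_* names are hypothetical, and the memory
    cgroup charge move is not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SK_SLOTS 8   /* hypothetical cache size for the sketch */

struct sk_page {
    int refcount;
};

struct sk_mapping {
    pthread_mutex_t tree_lock;          /* guards slot[] */
    struct sk_page *slot[SK_SLOTS];
};

/*
 * Swap newpage in for oldpage at 'index' within one critical section,
 * so the slot is never observed empty and the "add" half cannot lose
 * a race. Returns 0 on success, -1 if oldpage is not at that index.
 */
static int sk_replace_page(struct sk_mapping *m, size_t index,
                           struct sk_page *oldpage, struct sk_page *newpage)
{
    int ret = -1;

    assert(index < SK_SLOTS);
    pthread_mutex_lock(&m->tree_lock);
    if (m->slot[index] == oldpage) {
        newpage->refcount++;        /* the cache now references newpage */
        m->slot[index] = newpage;
        oldpage->refcount--;        /* and no longer references oldpage */
        ret = 0;
    }
    pthread_mutex_unlock(&m->tree_lock);
    return ret;
}
```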

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • A GUP user may want to acquire a reference to a page if it is already
    in memory, but not if IO is needed to bring it in. For example, KVM may
    tell a vcpu to schedule another guest process if the current one is
    trying to access a swapped-out page. Meanwhile, the page will be swapped
    in and the guest process that depends on it will be able to run again.

    This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
    FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
    conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
    it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
    instead.
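
    The flag combination described above can be sketched as a small
    decision function. The flag values, struct sk_fault_ctx and
    sk_lock_page_or_retry() are assumptions made for illustration; only
    the behaviour (return the retry status without dropping the
    semaphore or sleeping) follows the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values; only the names mirror the text above. */
#define SK_FAULT_FLAG_ALLOW_RETRY  0x1u
#define SK_FAULT_FLAG_RETRY_NOWAIT 0x2u
#define SK_VM_FAULT_RETRY          0x400u

struct sk_fault_ctx {
    bool page_under_io;   /* page is still being read in */
    bool mmap_sem_held;
    int waits;            /* how often we slept waiting for the page */
};

static unsigned int sk_lock_page_or_retry(struct sk_fault_ctx *ctx,
                                          unsigned int flags)
{
    assert(ctx->mmap_sem_held);
    if (!ctx->page_under_io)
        return 0;                       /* page is ready, no retry needed */

    if (!(flags & SK_FAULT_FLAG_ALLOW_RETRY)) {
        ctx->waits++;                   /* classic path: sleep, sem held */
        ctx->page_under_io = false;
        return 0;
    }
    if (!(flags & SK_FAULT_FLAG_RETRY_NOWAIT)) {
        ctx->mmap_sem_held = false;     /* drop the sem, then wait for IO */
        ctx->waits++;
        ctx->page_under_io = false;
    }
    /* With RETRY_NOWAIT: keep the sem, do not wait, just report retry. */
    return SK_VM_FAULT_RETRY;
}
```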

    [akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
    Signed-off-by: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gleb Natapov
     

10 Mar, 2011

2 commits


14 Jan, 2011

3 commits

  • Running the annotated branch profiler on a box doing average work
    (firefox, evolution, xchat, distcc farm), the likely() used in
    grab_cache_page_write_begin() was incorrect most of the time:

     correct incorrect  %  Function                     File       Line
     ------- ---------  -  --------                     ----       ----
     1924262  71332401 97  grab_cache_page_write_begin  filemap.c  2206

    Adding a trace_printk() and running the function tracer limited to
    just this function I can see:

    gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
    gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Temporary IO failures, eg. due to loss of both multipath paths, can
    permanently leave the PageError bit set on a page, resulting in msync or
    fsync returning -EIO over and over again, even if IO is now getting to the
    disk correctly.

    We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the
    filemap_fdatawait_range function. Also clearing the PageError bit on the
    page allows subsequent msync or fsync calls on this file to return without
    an error, if the subsequent IO succeeds.

    Unfortunately data written out in the msync or fsync call that returned
    -EIO can still get lost, because the page dirty bit appears to not get
    restored on IO error. However, the alternative could be potentially all
    of memory filling up with uncleanable dirty pages, hanging the system, so
    there is no nice choice here...

    Signed-off-by: Rik van Riel
    Acked-by: Valerie Aurora
    Acked-by: Jeff Layton
    Cc: Theodore Ts'o
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Testing ->mapping and ->index without a ref is not stable as the page
    may have been reused at this point.

    Signed-off-by: Nick Piggin
    Reviewed-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Jan, 2011

1 commit


02 Dec, 2010

1 commit

  • NFS needs to be able to release objects that are stored in the page
    cache once the page itself is no longer visible from the page cache.

    This patch adds a callback to the address space operations that allows
    filesystems to perform page cleanups once the page has been removed
    from the page cache.

    Original patch by: Linus Torvalds
    [trondmy: cover the cases of invalidate_inode_pages2() and
    truncate_inode_pages()]
    Signed-off-by: Trond Myklebust

    Linus Torvalds
     

12 Nov, 2010

2 commits

  • Salman Qazi describes the following radix-tree bug:

    In the following case, we can get a deadlock:

    0.  The radix tree contains two items, one of which has index 0.
    1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
    2.  The reader acquires slot(s) for item(s) including the index 0 item.
    3.  The non-zero index item is deleted, and as a consequence the other
        item is moved to the root of the tree. The place where it used to be
        is queued for deletion after the readers finish.
    3b. The zero item is deleted, removing it from the direct slot; it
        remains in the rcu-delayed indirect node.
    4.  The reader looks at the index 0 slot, and finds that the page has a
        ref count of 0.
    5.  The reader looks at it again, hoping that the item will either be
        freed or the ref count will increase. This never happens, as the slot
        it is looking at will never be updated. Also, this slot can never be
        reclaimed because the reader is holding rcu_read_lock and is in an
        infinite loop.

    The fix is to re-use the same "indirect" pointer case that requires a slot
    lookup retry into a general "retry the lookup" bit.

    Signed-off-by: Nick Piggin
    Reported-by: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • 70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
    ran into a NULL dereference in here:

    int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
                                    unsigned long from)
    {
    ---->   struct inode *inode = page->mapping->host;

    It looks like page->mapping was the culprit. (xmon trace is below).
    After closer examination, I realized that do_generic_file_read() does a
    find_get_page(), and eventually locks the page before calling
    block_is_partially_uptodate(). However, it doesn't revalidate the
    page->mapping after the page is locked. So, there's a small window
    between the find_get_page() and ->is_partially_uptodate() where the page
    could get truncated and page->mapping cleared.

    We _have_ a reference, so it can't get reclaimed, but it certainly
    can be truncated.

    I think the correct thing is to check page->mapping after the
    trylock_page(), and jump out if it got truncated. This patch has been
    running in the test environment for a month or so now, and we have not
    seen this bug pop up again.
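
    The check described above amounts to revalidating the mapping once
    the page lock is held. A minimal userspace sketch, assuming a
    boolean in place of the page lock; struct sk_page, the sk_* helpers
    and enum sk_result are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_page {
    bool locked;
    void *mapping;   /* cleared when the page is truncated */
};

enum sk_result { SK_CHECKED, SK_PAGE_BUSY, SK_PAGE_TRUNCATED };

static bool sk_trylock_page(struct sk_page *page)
{
    if (page->locked)
        return false;
    page->locked = true;
    return true;
}

static void sk_unlock_page(struct sk_page *page)
{
    assert(page->locked);
    page->locked = false;
}

static enum sk_result sk_check_partially_uptodate(struct sk_page *page)
{
    if (!sk_trylock_page(page))
        return SK_PAGE_BUSY;

    /*
     * A reference keeps the page from being reclaimed, but not from
     * being truncated. Only under the page lock is this test stable.
     */
    if (page->mapping == NULL) {
        sk_unlock_page(page);
        return SK_PAGE_TRUNCATED;
    }

    /* Here it would be safe to follow page->mapping. */
    sk_unlock_page(page);
    return SK_CHECKED;
}
```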

    xmon info:

    1f:mon> e
    cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
    pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
    lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
    sp: c0000002ae36f9f0
    msr: 8000000000009032
    dar: 0
    dsisr: 40000000
    current = 0xc000000378f99e30
    paca = 0xc000000000f66300
    pid = 21946, comm = bash
    1f:mon> r
    R00 = 0025c0500000006d R16 = 0000000000000000
    R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
    R02 = c000000000e8cd80 R18 = ffffffffffffffff
    R03 = c0000000031d0f88 R19 = 0000000000000001
    R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
    R05 = 0000000000000000 R21 = c0000002ae36fa68
    R06 = 0000000000000000 R22 = 0000000000000000
    R07 = 0000000000000001 R23 = c0000002ae36fbb0
    R08 = 0000000000000002 R24 = 0000000000000000
    R09 = 0000000000000000 R25 = c000000362cd3a80
    R10 = 0000000000000000 R26 = 0000000000000002
    R11 = c0000000001e7b60 R27 = 0000000000000000
    R12 = 0000000042000484 R28 = 0000000000000001
    R13 = c000000000f66300 R29 = c0000003bb97b9b8
    R14 = 0000000000000001 R30 = c000000000e28a08
    R15 = 000000000000ffff R31 = c0000000031d0f88
    pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
    lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
    msr = 8000000000009032 cr = 22000488
    ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
    dar = 0000000000000000 dsisr = 40000000
    1f:mon> t
    [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
    [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
    [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
    [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
    [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
    [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
    --- Exception: c00 (System Call) at 00000080a840bc54
    SP (fffca15df30) is in userspace
    1f:mon> di c0000000001e7a6c
    c0000000001e7a6c e9290000 ld r9,0(r9)
    c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
    c0000000001e7a74 e9440008 ld r10,8(r4)
    c0000000001e7a78 78a80020 clrldi r8,r5,32
    c0000000001e7a7c 3c000001 lis r0,1
    c0000000001e7a80 812900a8 lwz r9,168(r9)
    c0000000001e7a84 39600001 li r11,1
    c0000000001e7a88 7c080050 subf r0,r8,r0
    c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
    c0000000001e7a90 7d6b4830 slw r11,r11,r9
    c0000000001e7a94 796b0020 clrldi r11,r11,32
    c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
    c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
    c0000000001e7aa0 7d004214 add r8,r0,r8
    c0000000001e7aa4 79080020 clrldi r8,r8,32
    c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100

    Signed-off-by: Dave Hansen
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

03 Nov, 2010

1 commit


27 Oct, 2010

3 commits

  • 'end' shadows earlier one and is not necessary at all. Remove it and use
    'pos' instead. This removes following sparse warnings:

    mm/filemap.c:2180:24: warning: symbol 'end' shadows an earlier one
    mm/filemap.c:2132:25: originally declared here

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • This change reduces mmap_sem hold times that are caused by waiting for
    disk transfers when accessing file mapped VMAs.

    It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
    site wants mmap_sem to be released if blocking on a pending disk transfer.
    In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
    do_page_fault() will then re-acquire mmap_sem and retry the page fault.

    It is expected that the retry will hit the same page which will now be
    cached, and thus it will complete with a low mmap_sem hold time.
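
    The caller side of the retry protocol can be sketched like this. It
    is a single-threaded model under assumptions: the sk_* names and
    the flag value are hypothetical, the IO is assumed to finish while
    the semaphore is released, and retrying only once is a choice made
    in this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_VM_FAULT_RETRY 0x400u   /* hypothetical value */

struct sk_mm {
    int sem_acquisitions;   /* how often mmap_sem was taken */
    bool page_cached;
};

/* Fault handler: asks for a retry instead of blocking with the sem held. */
static unsigned int sk_filemap_fault(struct sk_mm *mm, bool allow_retry,
                                     bool *sem_held)
{
    if (mm->page_cached)
        return 0;
    if (allow_retry) {
        *sem_held = false;          /* released before waiting on the disk */
        mm->page_cached = true;     /* IO completes while we are asleep */
        return SK_VM_FAULT_RETRY;
    }
    mm->page_cached = true;         /* blocked on IO with the sem held */
    return 0;
}

/* Returns the number of attempts the fault needed. */
static int sk_do_page_fault(struct sk_mm *mm)
{
    bool allow_retry = true;
    bool sem_held;
    int attempts = 0;

retry:
    sem_held = true;                /* (re-)acquire mmap_sem */
    mm->sem_acquisitions++;
    attempts++;
    if (sk_filemap_fault(mm, allow_retry, &sem_held) & SK_VM_FAULT_RETRY) {
        assert(!sem_held);
        allow_retry = false;        /* retry once only in this sketch */
        goto retry;
    }
    return attempts;
}
```

    The second attempt finds the page cached, so the semaphore is held
    only briefly on each pass.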

    Tests:

    - microbenchmark: thread A mmaps a large file and does random read accesses
    to the mmaped area - achieves about 55 iterations/s. Thread B does
    mmap/munmap in a loop at a separate location - achieves 55 iterations/s
    before, 15000 iterations/s after.

    - We are seeing related effects in some applications in house, which show
    significant performance regressions when running without this change.

    [akpm@linux-foundation.org: fix warning & crash]
    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Acked-by: Linus Torvalds
    Cc: Nick Piggin
    Reviewed-by: Wu Fengguang
    Cc: Ying Han
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Introduce a single location where filemap_fault() locks the desired page.
    There used to be two such places, depending if the initial find_get_page()
    was successful or not.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Acked-by: Linus Torvalds
    Cc: Nick Piggin
    Reviewed-by: Wu Fengguang
    Cc: Ying Han
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

10 Aug, 2010

1 commit


31 May, 2010

1 commit


28 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     

27 May, 2010

1 commit

  • I/O errors can happen due to temporary failures, like multipath
    errors or losing network contact with the iSCSI server. Because
    of that, the VM will retry readpage on the page.

    However, do_generic_file_read does not clear PG_error. This
    causes the system to be unable to actually use the data in the
    page cache page, even if the subsequent readpage completes
    successfully!

    The function filemap_fault has had a ClearPageError before
    readpage forever. This patch simply adds the same to
    do_generic_file_read.
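
    Why the stale error bit matters can be shown with a toy model. The
    struct, the sk_* functions and the io_ok parameter are hypothetical;
    a boolean stands in for PG_error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct sk_page {
    bool error;      /* stand-in for PG_error */
    bool uptodate;
};

/* Simulated ->readpage: io_ok decides whether the transfer succeeds. */
static void sk_readpage(struct sk_page *page, bool io_ok)
{
    if (io_ok)
        page->uptodate = true;
    else
        page->error = true;
}

/* Read path that forgets an earlier failure before trying again. */
static int sk_read_page_retry(struct sk_page *page, bool io_ok)
{
    page->error = false;         /* the ClearPageError step */
    sk_readpage(page, io_ok);
    if (page->error || !page->uptodate)
        return -EIO;
    return 0;
}
```

    If the first line of sk_read_page_retry() were dropped, a page that
    failed once would keep reporting -EIO even after a successful
    re-read.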

    Signed-off-by: Jeff Moyer
    Signed-off-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

25 May, 2010

4 commits

  • Before applying this patch, cpuset updates task->mems_allowed and the
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old disallowed bits later. But along the way, the allocator may find that
    there is no node to allocate memory on.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                   task1's mpol    task2
    alloc page              1
      alloc on node0? NO    1
                            1               change mems from 1 to 0
                            1               rebind task1's mpol
                            0-1               set new bits
                            0                 clear disallowed bits
      alloc on node1? NO    0
      ...
    can't alloc page
      goto oom

    This patch fixes this problem by expanding the node range first (setting
    newly allowed bits) and shrinking it lazily (clearing newly disallowed
    bits). So we use a variable to tell the write-side task that the
    read-side task is reading the nodemask, and the write-side task clears
    newly disallowed nodes after the read-side task ends the current memory
    allocation.
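
    The expand-first, shrink-lazily idea can be sketched with a plain
    bitmask. This is a single-threaded model under assumptions: struct
    sk_task and the sk_* functions are hypothetical, and the deferred
    clear is applied when the reader finishes, which simplifies how the
    writer and reader actually coordinate.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_task {
    unsigned long mems_allowed;   /* bit n set: node n may be used */
    bool allocating;              /* reader is inside an allocation */
    unsigned long pending_clear;  /* bits to drop once the reader is done */
};

static void sk_apply_pending(struct sk_task *t)
{
    t->mems_allowed &= ~t->pending_clear;
    t->pending_clear = 0;
}

/* Writer side: never leaves the mask empty while a reader is active. */
static void sk_change_mems(struct sk_task *t, unsigned long newmask)
{
    assert(newmask != 0);
    t->pending_clear = t->mems_allowed & ~newmask; /* newly disallowed */
    t->mems_allowed |= newmask;                    /* expand first */
    if (!t->allocating)
        sk_apply_pending(t);                       /* shrink right away */
}

static void sk_alloc_begin(struct sk_task *t)
{
    t->allocating = true;
}

static void sk_alloc_end(struct sk_task *t)
{
    t->allocating = false;
    sk_apply_pending(t);                           /* shrink lazily */
}
```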

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Shaohua Li reported that parallel file copy on tmpfs can lead to the OOM
    killer. This is a regression caused by commit 9ff473b9a7 ("vmscan: evict
    streaming IO first"). Wow, it is a 2-year-old patch!

    Currently, tmpfs file cache is inserted into the active list at first.
    This means that the insertion doesn't only increase the number of pages
    in the anon LRU, but also reduces the anon scanning ratio. Therefore,
    vmscan will get totally confused. It scans almost only the file LRU even
    though the system has plenty of unused tmpfs pages.

    Historically, lru_cache_add_active_anon() was used for two reasons.
    1) To prioritize shmem pages over regular file cache.
    2) To avoid reclaim priority inversion of used-once pages.

    But we've lost both motivations because (1) we now have separate anon and
    file LRU lists, so inserting into the active list doesn't help such
    prioritization. (2) In the past, one pte access bit would cause page
    activation, so inserting into the inactive list with the pte access bit
    set meant higher priority than inserting into the active list. This
    priority inversion could lead to unintended LRU churn, but it was already
    solved by commit 645747462 (vmscan: detect mapped file pages used only
    once). (Thanks Hannes, you are great!)

    Thus, now we can use lru_cache_add_anon() instead.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Henrique de Moraes Holschuh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This is similar to what already happens in the write case. If we have a short
    read while doing O_DIRECT, instead of just returning, fall through and try to
    read the rest via buffered IO. BTRFS needs this because if we encounter a
    compressed or inline extent during DIO, we need to fall back on buffered IO.
    If the extent is compressed we need to read the entire thing into memory and
    decompress it into the user's pages. I have tested this with fsx and
    everything works great. Thanks,
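
    The fall-through can be illustrated with an in-memory file. This is
    a sketch under assumptions: struct sk_file, its dio_limit field (the
    offset at which direct IO has to stop, for instance at a compressed
    extent) and the sk_* functions are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

struct sk_file {
    const char *data;
    size_t size;        /* file length in bytes */
    size_t dio_limit;   /* hypothetical: direct IO cannot read past this */
};

/* Copy up to 'len' bytes from [pos, end) of the file into buf. */
static size_t sk_copy_range(const struct sk_file *f, size_t end,
                            char *buf, size_t len, size_t pos)
{
    size_t n;

    if (pos >= end)
        return 0;
    n = end - pos;
    if (n > len)
        n = len;
    memcpy(buf, f->data + pos, n);
    return n;
}

static ssize_t sk_file_read(const struct sk_file *f, char *buf,
                            size_t len, size_t pos, int o_direct)
{
    size_t done = 0;

    assert(f->dio_limit <= f->size);
    if (o_direct) {
        done = sk_copy_range(f, f->dio_limit, buf, len, pos);
        if (done == len || pos + done >= f->size)
            return (ssize_t)done;   /* complete, or end of file */
        /* Short direct read: fall through to the buffered path. */
    }
    done += sk_copy_range(f, f->size, buf + done, len - done, pos + done);
    return (ssize_t)done;
}
```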

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This is needed to enable moving pages into the page cache in fuse with
    splice(..., SPLICE_F_MOVE).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • Make sure the compiler won't do weird things with limits. E.g. fetching
    them twice may return 2 different values after writable limits are
    implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.
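
    The double-fetch hazard and its cure can be sketched in userspace.
    SK_ACCESS_ONCE is a stand-in written for this example (it relies on
    the GCC/Clang __typeof__ extension); struct sk_rlimit and
    sk_clamp_write() are hypothetical.

```c
#include <assert.h>

/* Force exactly one load of x by going through a volatile lvalue. */
#define SK_ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct sk_rlimit {
    unsigned long rlim_cur;   /* may be changed by another thread */
};

/*
 * Clamp a write of 'count' bytes at offset 'pos' to the size limit.
 * The limit is read once into a local, so the comparison and the
 * subtraction below always see the same value.
 */
static unsigned long sk_clamp_write(struct sk_rlimit *rl,
                                    unsigned long pos, unsigned long count)
{
    unsigned long limit = SK_ACCESS_ONCE(rl->rlim_cur);

    if (pos >= limit)
        return 0;
    if (count > limit - pos)
        count = limit - pos;
    return count;
}
```

    Reading rl->rlim_cur separately for the test and for the
    subtraction could underflow if the limit shrank in between.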

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

04 Mar, 2010

1 commit

  • No one is calling this anymore as everyone switched to
    invalidate_mapping_pages a long time ago. Also update a few
    references to it in comments. nfs has two more, but I can't
    easily figure out what they are actually referring to, so I left
    them as-is.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

03 Feb, 2010

1 commit

  • The cache alias problem will happen if changes to a user shared mapping
    are not flushed before copying: the user and kernel mappings may then be
    mapped to two different cache lines, and it is impossible to guarantee
    coherence after iov_iter_copy_from_user_atomic. So the right steps
    should be:

    flush_dcache_page(page);
    kmap_atomic(page);
    write to page;
    kunmap_atomic(page);
    flush_dcache_page(page);

    More precisely, we might create two new APIs flush_dcache_user_page and
    flush_dcache_kern_page to replace the two flush_dcache_page accordingly.

    Here is a snippet tested on omap2430 with VIPT cache, and I think it is
    not ARM-specific:

    int val = 0x11111111;
    fd = open("abc", O_RDWR);
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    *(addr+0) = 0x44444444;
    tmp = *(addr+0);
    *(addr+1) = 0x77777777;
    write(fd, &val, sizeof(int));
    close(fd);

    The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.

    Signed-off-by: Anfei
    Cc: Russell King
    Cc: Miklos Szeredi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    anfei zhou
     

28 Jan, 2010

1 commit

  • It's a simplified 'read_cache_page()' which takes a page allocation
    flag, so that different paths can control how aggressive the memory
    allocations are that populate an address space.

    In particular, the intel GPU object mapping code wants to be able to do
    a certain amount of own internal memory management by automatically
    shrinking the address space when memory starts getting tight. This
    allows it to dynamically use different memory allocation policies on a
    per-allocation basis, rather than depend on the (static) address space
    gfp policy.

    The actual new function is a one-liner, but re-organizing the helper
    functions to the point where you can do this with a single line of code
    is what most of the patch is all about.

    Tested-by: Chris Wilson
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Dec, 2009

1 commit

  • In the case of direct I/O falling back to buffered I/O we sync data
    twice currently: once at the end of generic_file_buffered_write using
    filemap_write_and_wait_range and once a little later in
    __generic_file_aio_write using do_sync_mapping_range with all flags set.

    The wait before write of the do_sync_mapping_range call does not make
    any sense, so just keep the filemap_write_and_wait_range call and move
    it to the right spot.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Dec, 2009

1 commit


04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping",
    "access", "default", "reasonable", "[con]currently", "temperature",
    "channel", "[un]used", "application", "example", "hierarchy", "therefore",
    "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa