09 Oct, 2012

2 commits

  • .fault can now retry, and the retry can break .fault's state machine.
    In filemap_fault, if the page is missed, ra->mmap_miss is increased;
    on the second try, since the page is now in the page cache,
    ra->mmap_miss is decreased. Both happen within a single fault, so we
    can't detect random mmap file access.

    Add a new flag to indicate that .fault has already been tried once.
    On the second try, skip the ra->mmap_miss decrease; the filemap_fault
    state machine is fine with that.

    I only tested x86, not other archs, but the change for the other
    archs looks obvious, though who knows :)
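
    In filemap_fault terms the fix is roughly this (a sketch, assuming
    the new bit is named FAULT_FLAG_TRIED in vmf->flags; not the literal
    patch):

        /* only count a cache hit against mmap_miss on the first try */
        if (!(vmf->flags & FAULT_FLAG_TRIED) && ra->mmap_miss > 0)
                ra->mmap_miss--;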

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping
    support; a filesystem that uses filemap_fault() can simply use
    generic_file_remap_pages().

    Device drivers can now implement this method too and obtain
    non-linear vma support.
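
    For a filemap_fault()-based filesystem the wiring looks roughly like
    this (a sketch of the pattern, not any particular filesystem):

        static const struct vm_operations_struct example_file_vm_ops = {
                .fault        = filemap_fault,
                .page_mkwrite = filemap_page_mkwrite,
                .remap_pages  = generic_file_remap_pages, /* non-linear */
        };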

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

09 Aug, 2012

2 commits

  • Move unplugging for direct I/O from around ->direct_IO() down to
    do_blockdev_direct_IO(). This implicitly adds plugging for direct
    writes.
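
    The plugging around the submission is the stock block-layer pattern,
    roughly (sketch):

        struct blk_plug plug;

        blk_start_plug(&plug);
        /* ... build and submit the direct-I/O bios ... */
        blk_finish_plug(&plug); /* flush the batched requests */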

    CC: Li Shaohua
    Acked-by: Jeff Moyer
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Fengguang Wu
     
  • Buffered write(2) is not directly tied to I/O, so
    generic_file_aio_write() is not a suitable place to handle plugging.

    Note that plugging for O_SYNC writes is also removed. The user may pass
    arbitrary @size arguments, which may be much larger than the preferable
    I/O size, or may cross extent/device boundaries. Let the lower layers
    handle the plugging. The plugging code here actually turns them into
    no-ops.

    CC: Li Shaohua
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Fengguang Wu
     

31 Jul, 2012

2 commits

  • There are several entry points which dirty pages in a filesystem:
    mmap (handled by block_page_mkwrite()), buffered write (handled by
    __generic_file_aio_write()), splice write (generic_file_splice_write()),
    and truncate and fallocate (which can dirty the last partial page;
    handled inside each filesystem separately). Protect these places with
    sb_start_write() and sb_end_write().

    ->page_mkwrite() calls are particularly complex since they are called with
    mmap_sem held and thus we cannot use standard sb_start_write() due to lock
    ordering constraints. We solve the problem by using a special freeze protection
    sb_start_pagefault() which ranks below mmap_sem.
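
    A ->page_mkwrite handler under this scheme ends up shaped roughly
    like the following (a sketch; the truncate-race check and error
    handling are elided):

        int filemap_page_mkwrite(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
        {
                struct inode *inode = vma->vm_file->f_mapping->host;
                int ret = VM_FAULT_LOCKED;

                /* freeze protection that ranks below mmap_sem */
                sb_start_pagefault(inode->i_sb);
                file_update_time(vma->vm_file);
                lock_page(vmf->page);
                /* ... verify the page is still attached to this mapping ... */
                sb_end_pagefault(inode->i_sb);
                return ret;
        }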

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Make the default vm_ops provide a ->page_mkwrite handler. Currently
    it only updates the file's modification time and returns a locked
    page, but later it will also handle filesystem freezing.

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

02 Jun, 2012

2 commits

  • Pull vfs changes from Al Viro.
    "A lot of misc stuff. The obvious groups:
    * Miklos' atomic_open series; kills the damn abuse of
    ->d_revalidate() by NFS, which was the major stumbling block for
    all work in that area.
    * ripping security_file_mmap() and dealing with deadlocks in the
    area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
    general.
    * ->encode_fh() switched to saner API; insane fake dentry in
    mm/cleancache.c gone.
    * assorted annotations in fs (endianness, __user)
    * parts of Artem's ->s_dirty work (jffs2 and reiserfs parts)
    * ->update_time() work from Josef.
    * other bits and pieces all over the place.

    Normally it would've been in two or three pull requests, but
    signal.git stuff had eaten a lot of time during this cycle ;-/"

    Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
    'truncate_range' inode method was removed by the VM changes, the VFS
    update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
    to sparse fix added twice, with other changes nearby).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
    nfs: don't open in ->d_revalidate
    vfs: retry last component if opening stale dentry
    vfs: nameidata_to_filp(): don't throw away file on error
    vfs: nameidata_to_filp(): inline __dentry_open()
    vfs: do_dentry_open(): don't put filp
    vfs: split __dentry_open()
    vfs: do_last() common post lookup
    vfs: do_last(): add audit_inode before open
    vfs: do_last(): only return EISDIR for O_CREAT
    vfs: do_last(): check LOOKUP_DIRECTORY
    vfs: do_last(): make ENOENT exit RCU safe
    vfs: make follow_link check RCU safe
    vfs: do_last(): use inode variable
    vfs: do_last(): inline walk_component()
    vfs: do_last(): make exit RCU safe
    vfs: split do_lookup()
    Btrfs: move over to use ->update_time
    fs: introduce inode operation ->update_time
    reiserfs: get rid of resierfs_sync_super
    reiserfs: mark the superblock as dirty a bit later
    ...

    Linus Torvalds
     
  • Btrfs has to make sure we have space to allocate new blocks in order
    to modify the inode, so updating the time can fail. We've gotten
    around this by having our own file_update_time, but this is kind of a
    pain, and Christoph has indicated he would like to make xfs do
    something different with atime updates. So introduce ->update_time,
    where we will deal with i_version and a/m/c time updates and indicate
    which changes need to be made. The normal version just does what it
    has always done (updates the time and marks the inode dirty), and
    filesystems can then choose to do something different.

    I've gone through all of the users of file_update_time and made them
    check for errors, with the exception of the fault code, since it's
    complicated and I wasn't quite sure what to do there. Also, Jan is
    going to be pushing the file time updates into page_mkwrite for those
    who have it, so that should satisfy btrfs and make it not a big deal
    to check the file_update_time() return code in the generic fault
    path. Thanks,
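
    The shape of the new operation is roughly (a sketch; the S_* flags
    are the ones file_update_time already computes):

        /* in struct inode_operations */
        int (*update_time)(struct inode *inode, struct timespec *time,
                           int flags);

        /* flags is a mask of S_ATIME | S_MTIME | S_CTIME | S_VERSION,
         * telling the filesystem which fields need updating; the
         * default behaviour sets the times and marks the inode dirty,
         * while e.g. btrfs may return an error if it cannot allocate
         * space for the update. */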

    Signed-off-by: Josef Bacik

    Josef Bacik
     

29 Mar, 2012

1 commit

  • Replace radix_tree_gang_lookup_slot() and
    radix_tree_gang_lookup_tag_slot() in the page-cache lookup functions
    with the brand-new direct radix-tree iterating. This avoids the
    double scanning and pointer copying.

    The iterator doesn't stop after nr_pages page-get failures in a row;
    it continues the lookup until the end of the radix tree. Thus we can
    safely remove those restart conditions.

    Unfortunately, the old implementation didn't forbid nr_pages == 0.
    This corner case does not fit into the new code, so the patch adds an
    extra check at the beginning.
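
    The new lookup loop is roughly this (a sketch of the iterator
    pattern, not the full find_get_pages()):

        struct radix_tree_iter iter;
        void **slot;
        unsigned ret = 0;

        if (unlikely(!nr_pages)) /* the new corner-case check */
                return 0;

        rcu_read_lock();
        radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                struct page *page = radix_tree_deref_slot(slot);
                if (unlikely(!page))
                        continue;
                /* ... handle deref-retry and take a page reference ... */
                pages[ret] = page;
                if (++ret == nr_pages)
                        break;
        }
        rcu_read_unlock();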

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

23 Mar, 2012

1 commit

  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     

22 Mar, 2012

3 commits

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
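
    The allocator fast path then becomes a read-side retry loop, roughly
    (a sketch using the get/put_mems_allowed pair described above):

        unsigned int cpuset_mems_cookie;
        struct page *page;

        do {
                cpuset_mems_cookie = get_mems_allowed(); /* seqcount begin */
                page = __alloc_pages_nodemask(gfp_mask, order,
                                              zonelist, nodemask);
                /* retry only if the allocation failed and the nodemask
                 * changed underneath us (seqcount retry) */
        } while (unlikely(!page && !put_mems_allowed(cpuset_mems_cookie)));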

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                               3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla     nobarrier-v2r1
    Clients 1 UserTime          0.07 (  0.00%)     0.08 (-14.19%)
    Clients 2 UserTime          0.07 (  0.00%)     0.07 (  2.72%)
    Clients 4 UserTime          0.08 (  0.00%)     0.07 (  3.29%)
    Clients 1 SysTime           0.70 (  0.00%)     0.65 (  6.65%)
    Clients 2 SysTime           0.85 (  0.00%)     0.82 (  3.65%)
    Clients 4 SysTime           1.41 (  0.00%)     1.41 (  0.32%)
    Clients 1 WallTime          0.77 (  0.00%)     0.74 (  4.19%)
    Clients 2 WallTime          0.47 (  0.00%)     0.45 (  3.73%)
    Clients 4 WallTime          0.38 (  0.00%)     0.37 (  1.58%)
    Clients 1 Flt/sec/cpu  497620.28 (  0.00%)  520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu  414639.05 (  0.00%)  429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu  257959.16 (  0.00%)  258761.48 (  0.31%)
    Clients 1 Flt/sec      495161.39 (  0.00%)  517292.87 (  4.47%)
    Clients 2 Flt/sec      820325.95 (  0.00%)  850289.77 (  3.65%)
    Clients 4 Flt/sec     1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)        135.68    132.17
    User+Sys Time Running Test (seconds)   164.2     160.13
    Total Elapsed Time (seconds)           123.46    120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds); with the commit, everything obviously worked
    fine. With this patch applied it also worked fine, so the fix should
    be functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When i_mmap_lock changed to a mutex the locking order in memory failure
    was changed to take the sleeping lock first. But the big fat mm lock
    ordering comment (BFMLO) wasn't updated. Do this here.

    Pointed out by Andrew.

    Signed-off-by: Andi Kleen
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • There is not much point in skipping zones during allocation based on
    dirty usage to which they'll never contribute. And we'd like to avoid
    page-reclaim waits when writing to ramfs/sysfs etc.

    Signed-off-by: Fengguang Wu
    Acked-by: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

04 Feb, 2012

1 commit

  • Herbert Poetzl reported a performance regression since 2.6.39. The
    test is a simple dd read, but with a big block size. The reason:

    T1: ra (A, A+128k), (A+128k, A+256k)
    T2: lock_page for page A, submit the 256k
    T3: hit page A+128k, ra (A+256k, A+384k). The range isn't submitted
        because of the plug, and there isn't any lock_page until we hit
        page A+256k, because all pages from A to A+256k are in memory.
    T4: hit page A+256k, ra (A+384k, A+512k). Because of the plug, the
        range isn't submitted again.
    T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task
        is waiting for (A+256k, A+512k) to finish.

    There is no request to disk in T3 and T4, so the readahead pipeline
    breaks.

    We really don't need the block plug in generic_file_aio_read() for
    buffered I/O. Readahead already has a plug and fine-grained control
    over when I/O should be submitted. Deleting the plug for buffered I/O
    fixes the regression.

    One side effect is that with the plug the request size is 256k, while
    without it it is 128k. That is because the default ra size is 128k,
    and is not a reason we need the plug here.

    Vivek said:

    : We submit some readahead IO to the device request queue, but
    : because of the nested plug, the queue never gets unplugged. When
    : the read logic reaches a page which is not in the page cache, it
    : waits for the page to be read from disk (lock_page_killable()) and
    : at that time we flush the plug list.
    :
    : So effectively the readahead logic is kind of broken in parts
    : because of the nested plugging. Removing the top-level plug
    : (generic_file_aio_read()) for buffered reads will allow unplugging
    : the queue earlier for readahead.

    Signed-off-by: Shaohua Li
    Signed-off-by: Wu Fengguang
    Reported-by: Herbert Poetzl
    Tested-by: Eric Dumazet
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

24 Jan, 2012

1 commit

  • Per akpm's suggestion, alter the use of the term "flush" to
    "invalidate". The next patch will do this across all of MM.

    This change is completely cosmetic.

    [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]

    Signed-off-by: Dan Magenheimer
    Cc: Kamezawa Hiroyuki
    Cc: Jan Beulich
    Reviewed-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    [v10: Fixed fs: move code out of buffer.c conflict change]
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

13 Jan, 2012

1 commit

  • Commit ef6a3c6311 ("mm: add replace_page_cache_page() function")
    added a function replace_page_cache_page(), which replaces a page in
    the radix-tree with a new page. When doing this, the memory cgroup
    needs to fix up the accounting information: memcg needs to check the
    PCG_USED bit etc.

    In some (many?) cases, 'newpage' is on the LRU before
    replace_page_cache() is called. So memcg's LRU accounting information
    should be fixed too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old
    hooks. In that function, the old page is unaccounted without touching
    res_counter, and the new page is accounted to the memcg (of the old
    page). When overwriting pc->mem_cgroup of the new page, take
    zone->lru_lock to avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by the FUSE code in its splice()
    handling. Here, 'newpage' replaces 'oldpage', but the new page is not
    newly allocated and may be on the LRU. LRU mis-accounting is critical
    for the memory cgroup, because rmdir() checks that the whole LRU is
    empty and there is no account leak. If a page is on a different LRU
    than it should be, rmdir() will fail.

    This bug was added in March 2011, but there has been no bug report
    yet. I guess there are not many people who use memcg and FUSE at the
    same time with upstream kernels.

    The result of this bug is that an admin cannot destroy a memcg
    because of the account leak. So: no panic, no deadlock. And even if
    an active cgroup exists, umount can succeed, so there is no problem
    at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

11 Jan, 2012

1 commit

  • Tell the page allocator that pages allocated through
    grab_cache_page_write_begin() are expected to become dirty soon.
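
    Concretely, the hint is a gfp flag on the page-cache allocation,
    roughly (sketch):

        /* in grab_cache_page_write_begin(), roughly */
        gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;

        page = __page_cache_alloc(gfp_mask); /* allocator may prefer
                                                zones below their dirty
                                                limits */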

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

02 Dec, 2011

1 commit

  • Currently write(2) to a file is not interruptible by any signal.
    Sometimes this is desirable, e.g. when you want to quickly kill a
    process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
    task in balance_dirty_pages() killable"), it's necessary to abort the
    current write accordingly, to avoid it quickly dirtying lots more
    pages at an unthrottled rate.

    This patch makes write interruptible by SIGKILL. We do not allow
    write to be interruptible by any other signal, because that has a
    larger potential of breaking some badly written applications.
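
    The check sits in the buffered write loop, roughly (a sketch of the
    pattern):

        /* once per iteration of generic_perform_write()'s copy loop */
        if (fatal_signal_pending(current)) {
                status = -EINTR; /* only SIGKILL gets through */
                break;
        }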

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Acked-by: Matthew Wilcox
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

28 Oct, 2011

1 commit

  • Currently, when you call iov_iter_advance, the pointer to the iovec
    array can be incremented, but the nr_segs value in the iov_iter
    struct is not decremented. The result is an iov_iter struct with an
    nr_segs value that goes beyond the end of the array.

    While I'm not aware of anything that's specifically broken by this, it
    seems odd and a bit dangerous not to decrement that value. If someone
    were to trust the nr_segs value to be correct, then they could end up
    walking off the end of the array.

    Changing this might also provide some micro-optimization when dealing
    with the last iovec in an array. Many of the other routines that deal
    with iov_iter have optimized codepaths when nr_segs == 1.
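
    The fix amounts to keeping nr_segs in step when the iovec pointer is
    advanced, roughly (a sketch; 'iov' is the advanced pointer and 'base'
    the offset into the new first segment):

        i->nr_segs -= iov - i->iov; /* drop fully-consumed segments */
        i->iov = iov;
        i->iov_offset = base;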

    Cc: Nick Piggin
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig

    Jeff Layton
     

15 Sep, 2011

1 commit

  • The entries found by find_get_pages() could all be swap entries. In
    that case we skip the entries, but make sure the skipped entries are
    accounted, so we don't keep looping.

    Use nr_found > nr_skip to simplify the code, as suggested by Eric.

    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Shaohua Li
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 Aug, 2011

4 commits

  • Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

    It's hard to devise a suitable snappy name that illuminates the use by
    shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
    generality. And akpm points out that /* radix_tree_deref_retry(page) */
    comments look like calls that have been commented out for unknown
    reason.

    Skirt the naming difficulty by rearranging these blocks to handle the
    transient radix_tree_deref_retry(page) case first; then just explain the
    remaining shmem/tmpfs swap case in a comment.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageSwapBacked (!page_is_file_cache) cases from
    add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
    go through shmem_add_to_page_cache().

    Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
    and add a comment on swap entries to invalidate_mapping_pages().

    And mincore_page() uses find_get_page() on what might be shmem or a
    tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
    find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If swap entries are to be stored along with struct page pointers in a
    radix tree, they need to be distinguished as exceptional entries.

    Most of the handling of swap entries in radix tree will be contained in
    shmem.c, but a few functions in filemap.c's common code need to check
    for their appearance: find_get_page(), find_lock_page(),
    find_get_pages() and find_get_pages_contig().

    So as not to slow their fast paths, tuck those checks inside the
    existing checks for unlikely radix_tree_deref_slot(); except for
    find_lock_page(), where it is an added test. And make it a BUG in
    find_get_pages_tag(), which is not applied to tmpfs files.

    A part of the reason for eliminating shmem_readpage() earlier was to
    minimize the places where common code would need to allow for swap
    entries.

    The swp_entry_t known to swapfile.c must be massaged into a slightly
    different form when stored in the radix tree, just as it gets massaged
    into a pte_t when stored in page tables.

    In an i386 kernel this limits its information (type and page offset) to
    30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
    swapfile size of 128GB. Which is less than the 512GB we previously
    allowed with X86_PAE (where the swap entry can occupy the entire upper
    32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
    there's not a new limitation on 64-bit (where swap filesize is already
    limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
    probably still enough swap for a 64GB 32-bit machine.

    Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
    enforce filesize limit in read_swap_header(), just as for ptes.
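
    The conversions are plain shifts against the exceptional-entry tag
    bits, roughly (a sketch; constant names from the radix-tree
    exceptional-entry patch):

        static inline void *swp_to_radix_entry(swp_entry_t entry)
        {
                unsigned long value;

                value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;
                return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
        }

        static inline swp_entry_t radix_to_swp_entry(void *arg)
        {
                swp_entry_t entry;

                entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
                return entry;
        }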

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
    peculiar swap vector, instead keeping a file's swap entries in the same
    radix tree as its struct page pointers: thus saving memory, and
    simplifying its code and locking.

    This patch:

    The radix_tree is used by several subsystems for different purposes. A
    major use is to store the struct page pointers of a file's pagecache for
    memory management. But what if mm wanted to store something other than
    page pointers there too?

    The low bit of a radix_tree entry is already used to denote an indirect
    pointer, for internal use, and the unlikely radix_tree_deref_retry()
    case.

    Define the next bit as denoting an exceptional entry, and supply inline
    functions radix_tree_exception() to return non-0 in either unlikely
    case, and radix_tree_exceptional_entry() to return non-0 in the second
    case.

    If a subsystem already uses radix_tree with that bit set, no problem: it
    does not affect internal workings at all, but is defined for the
    convenience of those storing well-aligned pointers in the radix_tree.
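
    A sketch of the two predicates (assuming the low two bits are the
    indirect-pointer and exceptional-entry tags respectively):

        /* non-0 in either unlikely case: indirect or exceptional */
        static inline int radix_tree_exception(void *arg)
        {
                return unlikely((unsigned long)arg &
                        (RADIX_TREE_INDIRECT_PTR |
                         RADIX_TREE_EXCEPTIONAL_ENTRY));
        }

        /* non-0 only for an exceptional entry */
        static inline int radix_tree_exceptional_entry(void *arg)
        {
                return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
        }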

    The radix_tree_gang_lookups have an implicit assumption that the caller
    can deduce the offset of each entry returned e.g. by the page->index of
    a struct page. But that may not be feasible for some kinds of item to
    be stored there.

    radix_tree_gang_lookup_slot() now allows for an optional indices
    argument, an output array in which to return those offsets. The same
    could be added to the other radix_tree_gang_lookups, but for now keep
    it to the only one for which we need it.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

2 commits

  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The often-NULL data arg to the read_cache_page() and
    read_mapping_page() functions is misdescribed as "destination for
    read data": no, it's the first arg to the filler function, often a
    struct file * for ->readpage().

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.
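
    For reference, the prototypes being tidied are roughly:

        /* the data arg is handed straight to the filler,
         * e.g. a struct file * for ->readpage() */
        typedef int filler_t(void *data, struct page *page);

        struct page *read_cache_page(struct address_space *mapping,
                                     pgoff_t index, filler_t *filler,
                                     void *data);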

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jul, 2011

1 commit

  • i_alloc_sem is a rather special rw_semaphore. It's the last one that
    may be released by a non-owner, and its write side is always mirrored
    by real exclusion. Its intended use is to wait for all pending direct
    I/O requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
      simply fall away
    - the reader side is replaced by an i_dio_count member in struct
      inode that counts the number of pending direct I/O requests;
      truncate can't proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
      wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
      it to read zero, because the direct I/O count always needs i_mutex
      (or an equivalent like XFS's i_iolock) to start a new operation

    This scheme is much simpler, and saves the space of a spinlock_t and
    a struct list_head in struct inode (typically 160 bits on a non-debug
    64-bit system).
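
    A minimal sketch of the counting scheme (helper names assumed):

        /* direct I/O side */
        atomic_inc(&inode->i_dio_count);  /* request starts */
        /* ... */
        inode_dio_done(inode);  /* dec; wakes a waiting truncate at zero */

        /* truncate side */
        inode_dio_wait(inode);  /* sleeps until i_dio_count reads zero */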

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contention to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    [hughd@google.com: fix recursive lock when bdi_lock_two() is called
     with new the same as old]
    [akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

04 Jun, 2011

1 commit

  • Caching "we have already removed suid/caps" was overenthusiastic as
    merged. On network filesystems we might have had suid/caps set on
    another client, silently picked up by this client on revalidate, all
    of that *without* clearing the S_NOSEC flag.

    AFAICS, the only reasonably sane way to deal with that is:
    * a new superblock flag; unless it is set, S_NOSEC is not going to be
      set.
    * local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, as does btrfs's ->mount(); users of mount_bdev()
      other than local block filesystems clear it)
    * if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear
      S_NOSEC when inode attribute changes are picked up from other
      clients.

    It's not an earth-shattering hole (anybody who can set suid on
    another client will almost certainly be able to write to the file
    before doing that anyway), but it's a bug that needs fixing.
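
    A sketch of the opt-in on the local-block side:

        /* in mount_bdev() (sketch): local block filesystems are safe */
        s->s_flags |= MS_NOSEC;

    A network or cluster filesystem would instead set MS_NOSEC
    explicitly, and must clear S_NOSEC whenever inode attributes are
    revalidated from another client.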

    Signed-off-by: Al Viro

    Al Viro
     

29 May, 2011

1 commit

  • Some recent benchmarking on btrfs showed that a major scaling
    bottleneck on large systems on btrfs is currently the xattr lookup on
    every write.

    Why an xattr lookup on every write, I hear you ask?

    write wants to drop suid and security-related xattrs that could set
    capabilities for executables. To do that it currently looks up
    security.capability on EVERY write (even for non-executables) to
    decide whether to drop it or not.

    In btrfs this causes an additional tree walk, hitting some
    per-filesystem locks and quite bad scalability. In a simple read
    workload on an 8S system I saw over 90% CPU time in spinlocks related
    to that.

    Chris Mason tells me this is also a problem in ext4, where it hits
    the global mbcache lock.

    This patch adds a simple per-inode flag to avoid this problem. We
    only do the lookup once per file, and then, if there is no xattr,
    cache the decision. All xattr changes clear the flag.

    I also used the same flag to avoid the suid check, although that one
    is pretty cheap.

    A file system can also set this flag when it creates the inode, if it
    has a cheap way to do so. This is done for some common file systems
    in follow-on patches.

    With this patch a major part of the lock contention disappears for
    btrfs. Some testing on smaller systems didn't show significant
    performance changes, but at least it helps the larger systems and is
    generally more efficient.
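
    The write-path check then looks roughly like this (a sketch; details
    elided):

        int file_remove_suid(struct file *file)
        {
                struct inode *inode = file->f_path.dentry->d_inode;

                if (IS_NOSEC(inode))  /* cached: nothing to drop */
                        return 0;

                /* ... do the suid/security.capability checks once ... */

                inode->i_flags |= S_NOSEC;  /* cache "nothing to drop" */
                return 0;
        }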

    v2: Rename is_sgid. add file system helper.
    Cc: chris.mason@oracle.com
    Cc: josef@redhat.com
    Cc: viro@zeniv.linux.org.uk
    Cc: agruen@linbit.com
    Cc: Serge E. Hallyn
    Signed-off-by: Andi Kleen
    Signed-off-by: Al Viro

    Andi Kleen
     

27 May, 2011

1 commit

  • Two new stats in per-memcg memory.stat track the number of page
    faults and the number of major page faults:

    "pgfault"
    "pgmajfault"

    They are different from the "pgpgin"/"pgpgout" stats, which count the
    number of pages charged/discharged to the cgroup and carry no meaning
    of reading/writing pages to disk.

    It is valuable to track the two stats both for measuring an
    application's performance and for the efficiency of the kernel page
    reclaim path. Counting page faults per process is useful, but we also
    need the aggregated value, since processes are monitored and
    controlled on a per-cgroup basis in memcg.
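
    The hook sits next to the global vm-event counting in the fault path,
    roughly (sketch):

        /* in handle_mm_fault(), roughly */
        count_vm_event(PGFAULT);
        mem_cgroup_count_vm_event(mm, PGFAULT);  /* per-memcg "pgfault" */

        /* PGMAJFAULT is bumped analogously where a major fault
         * (a real disk read) is detected, e.g. in filemap_fault() */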

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads,
    faulting in 15G of anon pages in a 16G container. No regression was
    noticed in "flt/cpu/s".

    Sample output from pft:

    TAG pft:anon-sys-default:
      Gb  Thr  CLine  User    System    Wall     flt/cpu/s   fault/wsec
      15  16   1      0.67s   233.41s   14.76s   16798.546   266356.260

    +-------------------------------------------------------------------------+
        N          Min          Max       Median          Avg     Stddev
    x  10    16682.962    17344.027    16913.524    16928.812   166.5362
    +  10    16695.568    16923.896    16820.604    16824.652  84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han