13 Jan, 2012

1 commit

  • Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in the
    radix-tree with a new page. WHen doing this, memory cgroup needs to fix
    up the accounting information. memcg need to check PCG_USED bit etc.

    In some(many?) cases, 'newpage' is on LRU before calling
    replace_page_cache(). So, memcg's LRU accounting information should be
    fixed, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
    In that function, old pages will be unaccounted without touching
    res_counter and new page will be accounted to the memcg (of old page).
    WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
    races with LRU handling.

    Background:
    replace_page_cache_page() is called by FUSE code in its splice() handling.
    Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
    page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
    because rmdir() checks the whole LRU is empty and there is no account leak.
    If a page is on the other LRU than it should be, rmdir() will fail.

    This bug was added in March 2011, but no bug report yet. I guess there
    are not many people who use memcg and FUSE at the same time with upstream
    kernels.

    The result of this bug is that admin cannot destroy a memcg because of
    account leak. So, no panic, no deadlock. And, even if an active cgroup
    exist, umount can succseed. So no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

11 Jan, 2012

1 commit

  • Tell the page allocator that pages allocated through
    grab_cache_page_write_begin() are expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jan, 2012

1 commit


22 Dec, 2011

1 commit


02 Dec, 2011

1 commit

  • Currently write(2) to a file is not interruptible by any signal.
    Sometimes this is desirable, e.g. when you want to quickly kill a
    process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
    task in balance_dirty_pages() killable"), it's necessary to abort the
    current write accordingly to avoid it quickly dirtying lots more pages
    at unthrottled rate.

    This patch makes write interruptible by SIGKILL. We do not allow write
    to be interruptible by any other signal because that has larger
    potential of screwing some badly written applications.

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Acked-by: Matthew Wilcox
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

31 Oct, 2011

1 commit


28 Oct, 2011

1 commit

  • Currently, when you call iov_iter_advance, then the pointer to the iovec
    array can be incremented, but it does not decrement the nr_segs value in
    the iov_iter struct. The result is a iov_iter struct with a nr_segs
    value that goes beyond the end of the array.

    While I'm not aware of anything that's specifically broken by this, it
    seems odd and a bit dangerous not to decrement that value. If someone
    were to trust the nr_segs value to be correct, then they could end up
    walking off the end of the array.

    Changing this might also provide some micro-optimization when dealing
    with the last iovec in an array. Many of the other routines that deal
    with iov_iter have optimized codepaths when nr_segs == 1.

    Cc: Nick Piggin
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig

    Jeff Layton
     

15 Sep, 2011

1 commit

  • The found entries by find_get_pages() could be all swap entries. In
    this case we skip the entries, but make sure the skipped entries are
    accounted, so we don't keep looping.

    Using nr_found > nr_skip to simplify code as suggested by Eric.

    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Shaohua Li
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 Aug, 2011

4 commits

  • Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

    It's hard to devise a suitable snappy name that illuminates the use by
    shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
    generality. And akpm points out that /* radix_tree_deref_retry(page) */
    comments look like calls that have been commented out for unknown
    reason.

    Skirt the naming difficulty by rearranging these blocks to handle the
    transient radix_tree_deref_retry(page) case first; then just explain the
    remaining shmem/tmpfs swap case in a comment.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageSwapBacked (!page_is_file_cache) cases from
    add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
    go through shmem_add_to_page_cache().

    Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
    and add a comment on swap entries to invalidate_mapping_pages().

    And mincore_page() uses find_get_page() on what might be shmem or a
    tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
    find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If swap entries are to be stored along with struct page pointers in a
    radix tree, they need to be distinguished as exceptional entries.

    Most of the handling of swap entries in radix tree will be contained in
    shmem.c, but a few functions in filemap.c's common code need to check
    for their appearance: find_get_page(), find_lock_page(),
    find_get_pages() and find_get_pages_contig().

    So as not to slow their fast paths, tuck those checks inside the
    existing checks for unlikely radix_tree_deref_slot(); except for
    find_lock_page(), where it is an added test. And make it a BUG in
    find_get_pages_tag(), which is not applied to tmpfs files.

    A part of the reason for eliminating shmem_readpage() earlier, was to
    minimize the places where common code would need to allow for swap
    entries.

    The swp_entry_t known to swapfile.c must be massaged into a slightly
    different form when stored in the radix tree, just as it gets massaged
    into a pte_t when stored in page tables.

    In an i386 kernel this limits its information (type and page offset) to
    30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
    swapfile size of 128GB. Which is less than the 512GB we previously
    allowed with X86_PAE (where the swap entry can occupy the entire upper
    32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
    there's not a new limitation on 64-bit (where swap filesize is already
    limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
    probably still enough swap for a 64GB 32-bit machine.

    Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
    enforce filesize limit in read_swap_header(), just as for ptes.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
    peculiar swap vector, instead keeping a file's swap entries in the same
    radix tree as its struct page pointers: thus saving memory, and
    simplifying its code and locking.

    This patch:

    The radix_tree is used by several subsystems for different purposes. A
    major use is to store the struct page pointers of a file's pagecache for
    memory management. But what if mm wanted to store something other than
    page pointers there too?

    The low bit of a radix_tree entry is already used to denote an indirect
    pointer, for internal use, and the unlikely radix_tree_deref_retry()
    case.

    Define the next bit as denoting an exceptional entry, and supply inline
    functions radix_tree_exception() to return non-0 in either unlikely
    case, and radix_tree_exceptional_entry() to return non-0 in the second
    case.

    If a subsystem already uses radix_tree with that bit set, no problem: it
    does not affect internal workings at all, but is defined for the
    convenience of those storing well-aligned pointers in the radix_tree.

    The radix_tree_gang_lookups have an implicit assumption that the caller
    can deduce the offset of each entry returned e.g. by the page->index of
    a struct page. But that may not be feasible for some kinds of item to
    be stored there.

    radix_tree_gang_lookup_slot() allow for an optional indices argument,
    output array in which to return those offsets. The same could be added
    to other radix_tree_gang_lookups, but for now keep it to the only one
    for which we need it.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

2 commits

  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The often-NULL data arg to read_cache_page() and read_mapping_page()
    functions is misdescribed as "destination for read data": no, it's the
    first arg to the filler function, often struct file * to ->readpage().

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jul, 2011

1 commit

  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and it's write side is always mirrored by
    real exclusion. It's intended use it to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall way
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count reaches non-zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to read zero because the direct I/O count always needs i_mutex
    (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
    akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

04 Jun, 2011

1 commit

  • Caching "we have already removed suid/caps" was overenthusiastic as merged.
    On network filesystems we might have had suid/caps set on another client,
    silently picked by this client on revalidate, all of that *without* clearing
    the S_NOSEC flag.

    AFAICS, the only reasonably sane way to deal with that is
    * new superblock flag; unless set, S_NOSEC is not going to be set.
    * local block filesystems set it in their ->mount() (more accurately,
    mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
    local block ones clear it)
    * if any network filesystem (or a cluster one) wants to use S_NOSEC,
    it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
    inode attribute changes are picked from other clients.

    It's not an earth-shattering hole (anybody that can set suid on another client
    will almost certainly be able to write to the file before doing that anyway),
    but it's a bug that needs fixing.

    Signed-off-by: Al Viro

    Al Viro
     

29 May, 2011

1 commit

  • Some recent benchmarking on btrfs showed that a major scaling bottleneck
    on large systems on btrfs is currently the xattr lookup on every write.

    Why xattr lookup on every write I hear you ask?

    write wants to drop suid and security related xattrs that could set o
    capabilities for executables. To do that it currently looks up
    security.capability on EVERY write (even for non executables) to decide
    whether to drop it or not.

    In btrfs this causes an additional tree walk, hitting some per file system
    locks and quite bad scalability. In a simple read workload on a 8S
    system I saw over 90% CPU time in spinlocks related to that.

    Chris Mason tells me this is also a problem in ext4, where it hits
    the global mbcache lock.

    This patch adds a simple per inode to avoid this problem. We only
    do the lookup once per file and then if there is no xattr cache
    the decision. All xattr changes clear the flag.

    I also used the same flag to avoid the suid check, although
    that one is pretty cheap.

    A file system can also set this flag when it creates the inode,
    if it has a cheap way to do so. This is done for some common file systems
    in followon patches.

    With this patch a major part of the lock contention disappears
    for btrfs. Some testing on smaller systems didn't show significant
    performance changes, but at least it helps the larger systems
    and is generally more efficient.

    v2: Rename is_sgid. add file system helper.
    Cc: chris.mason@oracle.com
    Cc: josef@redhat.com
    Cc: viro@zeniv.linux.org.uk
    Cc: agruen@linbit.com
    Cc: Serge E. Hallyn
    Signed-off-by: Andi Kleen
    Signed-off-by: Al Viro

    Andi Kleen
     

28 May, 2011

1 commit


27 May, 2011

3 commits

  • Two new stats in per-memcg memory.stat which tracks the number of page
    faults and number of major page faults.

    "pgfault"
    "pgmajfault"

    They are different from "pgpgin"/"pgpgout" stat which count number of
    pages charged/discharged to the cgroup and have no meaning of reading/
    writing page to disk.

    It is valuable to track the two stats for both measuring application's
    performance as well as the efficiency of the kernel page reclaim path.
    Counting pagefaults per process is useful, but we also need the aggregated
    value since processes are monitored and controlled in cgroup basis in
    memcg.

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run page fault test(pft) wit 16 thread on faulting in
    15G anon pages in 16G container. There is no regression noticed on the
    "flt/cpu/s"

    Sample output from pft:

    TAG pft:anon-sys-default:
    Gb Thr CLine User System Wall flt/cpu/s fault/wsec
    15 16 1 0.67s 233.41s 14.76s 16798.546 266356.260

    +-------------------------------------------------------------------------+
    N Min Max Median Avg Stddev
    x 10 16682.962 17344.027 16913.524 16928.812 166.5362
    + 10 16695.568 16923.896 16820.604 16824.652 84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
    xen: cleancache shim to Xen Transcendent Memory
    ocfs2: add cleancache support
    ext4: add cleancache support
    btrfs: add cleancache support
    ext3: add cleancache support
    mm/fs: add hooks to support cleancache
    mm: cleancache core ops functions and config
    fs: add field to superblock to support cleancache
    mm/fs: cleancache documentation

    Fix up trivial conflict in fs/btrfs/extent_io.c due to includes

    Linus Torvalds
     
  • This fourth patch of eight in this cleancache series provides the
    core hooks in VFS for: initializing cleancache per filesystem;
    capturing clean pages reclaimed by page cache; attempting to get
    pages from cleancache before filesystem read; and ensuring coherency
    between pagecache, disk, and cleancache. Note that the placement
    of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
    change was required due to a patchset in 2.6.39.

    All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
    a check of a boolean global if CONFIG_CLEANCACHE is set but no
    cleancache "backend" has claimed cleancache_ops.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
    Signed-off-by: Chris Mason
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
     

25 May, 2011

6 commits

  • Previously the mmap sequential readahead is triggered by updating
    ra->prev_pos on each page fault and compare it with current page offset.

    It costs dirtying the cache line on each _minor_ page fault. So remove
    the ra->prev_pos recording, and instead tag PG_readahead to trigger the
    possible sequential readahead. It's not only more simple, but also will
    work more reliably and reduce cache line bouncing on concurrent page
    faults on shared struct file.

    In the mosbench exim benchmark which does multi-threaded page faults on
    shared struct file, the ra->mmap_miss and ra->prev_pos updates are found
    to cause excessive cache line bouncing on tmpfs, which actually disabled
    readahead totally (shmem_backing_dev_info.ra_pages == 0).

    So remove the ra->prev_pos recording, and instead tag PG_readahead to
    trigger the possible sequential readahead. It's not only more simple, but
    also will work more reliably on concurrent reads on shared struct file.

    Signed-off-by: Wu Fengguang
    Tested-by: Tim Chen
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The original INT_MAX is too large, reduce it to

    - avoid unnecessarily dirtying/bouncing the cache line

    - restore mmap read-around faster on changed access pattern

    Background: in the mosbench exim benchmark which does multi-threaded page
    faults on shared struct file, the ra->mmap_miss updates are found to cause
    excessive cache line bouncing on tmpfs. The ra state updates are needless
    for tmpfs because it actually disabled readahead totally
    (shmem_backing_dev_info.ra_pages == 0).

    Tested-by: Tim Chen
    Signed-off-by: Andi Kleen
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Reduce readahead overheads by returning early in do_sync_mmap_readahead().

    tmpfs has ra_pages=0 and it can page fault really fast (not constraint by
    IO if not swapping).

    Signed-off-by: Wu Fengguang
    Tested-by: Tim Chen
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • When an oom killing occurs, almost all processes are getting stuck at the
    following two points.

    1) __alloc_pages_nodemask
    2) __lock_page_or_retry

    1) is not very problematic because TIF_MEMDIE leads to an allocation
    failure and getting out from page allocator.

    2) is more problematic. In an OOM situation, zones typically don't have
    page cache at all and memory starvation might lead to greatly reduced IO
    performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly,
    meaning that a fork bomb may create new process quickly rather than the
    oom-killer killing it. Then, the system may become livelocked.

    This patch makes the pagefault interruptible by SIGKILL.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit 2687a356 ("Add lock_page_killable") introduced killable
    lock_page(). Similarly this patch introdues killable
    wait_on_page_locked().

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 Mar, 2011

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

7 commits

  • Callers of find_get_pages(), or its wrapper pagevec_lookup() - notably
    truncate_inode_pages_range() - stop looking further when it returns 0.

    But if an interrupt comes just after its radix_tree_gang_lookup_slot(),
    especially if we have preemptible RCU enabled, isn't it conceivable that
    all 14 pages returned could be removed from the page cache by
    shrink_page_list(), before find_get_pages() gets to process them? So
    causing it to return 0 although there may be plenty more pages beyond.

    Make find_get_pages() and find_get_pages_tag() check for this unlikely
    case, and restart should it occur; but callers of find_get_pages_contig()
    have no such expectation, it's okay for that to return 0 early.

    I have not seen this in practice, just worried by the possibility.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The radix_tree_deref_retry() case in find_get_pages() has a strange little
    excrescence, not seen in the other gang lookups: it looks like the start
    of an abandoned attempt to guarantee forward progress in a case that
    cannot arise.

    ret should always be 0 here: if it isn't, then going back to restart will
    leak references to pages already gotten. There used to be a comment
    saying nr_found is necessarily 1 here: that's not quite true, but the
    radix_tree_deref_retry() case is peculiar to the entry at index 0, when we
    race with it being moved out of the radix_tree root or back.

    Remove the worrisome two lines, add a brief comment here and in
    find_get_pages_contig() and find_get_pages_tag(), and a WARN_ON in
    find_get_pages() should it ever be seen elsewhere than at 0.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now we renamed remove_from_page_cache with delete_from_page_cache. As
    consistency of __remove_from_swap_cache and remove_from_swap_cache, we
    change internal page cache handling function name, too.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now delete_from_page_cache() replaces remove_from_page_cache(). So we
    remove remove_from_page_cache so fs or something out of mainline will
    notice it when compile time and can fix it.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Presently we increase the page refcount in add_to_page_cache() but don't
    decrease it in remove_from_page_cache(). Such asymmetry adds confusion,
    requiring that callers notice it and a comment explaining why they release
    a page reference. It's not a good API.

    A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
    gave up because reiser4's drop_page() had to unlock the page between
    removing it from page cache and doing the page_cache_release(). But now
    the situation is changed. I think at least things in current mainline
    don't have any obstacles. The problem is for out-of-mainline filesystems
    - if they have done such things as reiser4, this patch could be a problem
    but they will discover this at compile time since we remove
    remove_from_page_cache().

    This patch:

    This function works as just wrapper remove_from_page_cache(). The
    difference is that it decreases page references in itself. So caller have
    to make sure it has a page reference before calling.

    This patch is ready for removing remove_from_page_cache().

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • GUP user may want to try to acquire a reference to a page if it is already
    in memory, but not if IO, to bring it in, is needed. For example KVM may
    tell vcpu to schedule another guest process if current one is trying to
    access swapped out page. Meanwhile, the page will be swapped in and the
    guest process, that depends on it, will be able to run again.

    This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
    FOLL_NOWAIT follow_page flags. FAULT_FLAG_RETRY_NOWAIT, when used in
    conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
    it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
    instead.

    [akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
    Signed-off-by: Gleb Natapov
    Cc: Linus Torvalds
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gleb Natapov