27 May, 2011

1 commit

  • This fourth patch of eight in this cleancache series provides the
    core hooks in VFS for: initializing cleancache per filesystem;
    capturing clean pages reclaimed by page cache; attempting to get
    pages from cleancache before filesystem read; and ensuring coherency
    between pagecache, disk, and cleancache. Note that the placement
    of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
    change was required due to a patchset in 2.6.39.

    All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
    a check of a boolean global if CONFIG_CLEANCACHE is set but no
    cleancache "backend" has claimed cleancache_ops.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
    Signed-off-by: Chris Mason
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
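
    A minimal sketch of the read-side hook placement described above,
    assuming the hook names documented in Documentation/vm/cleancache.txt;
    the surrounding caller is a simplified, hypothetical read path:

        #include <linux/cleancache.h>

        /* Illustrative only: the real hook sits inside the VFS read path. */
        static int demo_readpage(struct address_space *mapping,
                                 struct page *page)
        {
                /* Try cleancache before asking the filesystem to do I/O;
                 * 0 means the page was filled from the (faster) cache. */
                if (cleancache_get_page(page) == 0)
                        return 0;

                return mapping->a_ops->readpage(NULL, page);
        }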
     

23 Mar, 2011

3 commits

  • A thrashing problem has recently been reported
    (http://marc.info/?l=rsync&m=128885034930933&w=2). It is triggered by
    backup workloads (e.g., a nightly rsync): such a workload generates
    use-once pages but touches each page twice, which promotes the pages
    to the active list and so evicts working-set pages.

    Some application developers would like POSIX_FADV_NOREUSE support,
    but other OSes don't implement it either.
    (http://marc.info/?l=linux-mm&m=128928979512086&w=2)

    As an alternative, application developers use POSIX_FADV_DONTNEED,
    but it has a problem: if the kernel finds a page under writeback
    during invalidate_mapping_pages, it cannot invalidate that page. This
    makes the hint awkward to use, because applications must always sync
    their data before calling fadvise(..., POSIX_FADV_DONTNEED) to make
    sure the pages are discardable. They therefore cannot benefit from
    the kernel's deferred writes, and see a performance loss.
    (http://insights.oetiker.ch/linux/fadvise.html)

    In fact, invalidation is a very strong hint to the reclaimer that we
    will not use the page any more. So let's move a page we cannot
    truncate right now to the head of the inactive list.

    The page is moved to the head of the LRU (rather than the tail)
    because a dirty/writeback page will be flushed sooner or later
    anyway; this avoids writeout via pageout(), which is less efficient
    than the flusher's writeout.

    This originally reused Peter's lru_demote patch with some changes,
    hence his Signed-off-by.

    Signed-off-by: Minchan Kim
    Reported-by: Ben Gamari
    Signed-off-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Acked-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
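
    A sketch of the idea, assuming deactivate_page() is the helper this
    series adds; the function below is a simplified stand-in for one
    iteration of invalidate_mapping_pages(), with the page already locked:

        #include <linux/mm.h>
        #include <linux/swap.h>

        static void demo_invalidate_one(struct page *page)
        {
                /* Fails for pages that are dirty or under writeback. */
                if (!invalidate_inode_page(page))
                        /*
                         * Can't drop it now: demote it so the flusher
                         * writes it back and reclaim finds it before it
                         * finds the working set.
                         */
                        deactivate_page(page);
                unlock_page(page);
        }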
     
  • Now that we have renamed remove_from_page_cache to
    delete_from_page_cache, change the internal page cache handling
    function's name too, for consistency with __remove_from_swap_cache
    and remove_from_swap_cache.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch series changes remove_from_page_cache()'s page refcounting
    rule: the page cache reference is now dropped inside
    delete_from_page_cache(), so callers no longer need to drop the page
    reference themselves.

    Signed-off-by: Minchan Kim
    Cc: Dan Magenheimer
    Cc: Andi Kleen
    Cc: Nick Piggin
    Cc: Al Viro
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
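
    The caller-visible effect, as a before/after sketch (the function
    names are real; the surrounding caller is illustrative):

        /* Before: callers dropped the page cache reference themselves. */
        remove_from_page_cache(page);
        page_cache_release(page);

        /* After: delete_from_page_cache() drops that reference itself. */
        delete_from_page_cache(page);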
     

26 Feb, 2011

1 commit

  • It seems odd that truncate_inode_pages_range(), called not only when
    truncating but also when evicting inodes, has mem_cgroup_uncharge_start
    and _end() batching in its second loop to clear up a few leftovers, but
    not in its first loop that does almost all the work: add them there too.

    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
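
    The shape of the change, as a hedged sketch: bracket the main
    truncation loop with the existing memcg begin/end batching calls so
    the uncharges triggered by page deletion are coalesced (the loop
    bounds are illustrative; the pages are assumed locked):

        #include <linux/memcontrol.h>
        #include <linux/mm.h>

        static void demo_truncate_batch(struct address_space *mapping,
                                        struct page **pages, int nr)
        {
                int i;

                mem_cgroup_uncharge_start();    /* start coalescing */
                for (i = 0; i < nr; i++)
                        truncate_inode_page(mapping, pages[i]);
                mem_cgroup_uncharge_end();      /* flush the batched total */
        }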
     

21 Jan, 2011

1 commit

  • Contrary to what the comment says, truncate_setsize() should be called
    *before* the filesystem truncates blocks.

    Signed-off-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
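
    A sketch of the corrected ordering inside a filesystem's ->setattr
    ("foofs" and its block-freeing helper are hypothetical):

        static int foofs_setattr(struct dentry *dentry, struct iattr *attr)
        {
                struct inode *inode = dentry->d_inode;

                if (attr->ia_valid & ATTR_SIZE) {
                        /* First update i_size and trim the pagecache... */
                        truncate_setsize(inode, attr->ia_size);
                        /* ...then free on-disk blocks past the new size. */
                        foofs_truncate_blocks(inode, attr->ia_size);
                }
                return 0;
        }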
     

02 Dec, 2010

1 commit

  • NFS needs to be able to release objects that are stored in the page
    cache once the page itself is no longer visible from the page cache.

    This patch adds a callback to the address space operations that allows
    filesystems to perform page cleanups once the page has been removed
    from the page cache.

    Original patch by: Linus Torvalds
    [trondmy: cover the cases of invalidate_inode_pages2() and
    truncate_inode_pages()]
    Signed-off-by: Trond Myklebust

    Linus Torvalds
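
    A hedged sketch of a filesystem supplying the new callback; the
    callback runs after the page has left the page cache, and the cleanup
    helper shown is hypothetical:

        static void foofs_freepage(struct page *page)
        {
                /* Release whatever the fs still pins for this page. */
                foofs_release_private_state(page);
        }

        static const struct address_space_operations foofs_aops = {
                .freepage = foofs_freepage,
                /* ... the usual readpage/writepage operations ... */
        };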
     

10 Aug, 2010

1 commit

  • Make sure we check the truncate constraints early on in ->setattr by adding
    those checks to inode_change_ok. Also clean up and document inode_change_ok
    to make this obvious.

    As a fallout, we no longer have to call inode_newsize_ok from
    simple_setsize, and can simplify it down to a truncate_setsize that
    doesn't return an error. This simplifies a lot of setattr
    implementations and means we use truncate_setsize almost everywhere.
    Get rid of fat_setsize now that it's trivial, and mark ext2_setsize
    static to make the calling convention obvious.

    Keep the inode_newsize_ok in vmtruncate for now as all callers need an
    audit for its removal anyway.

    Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
    needs a deeper audit, but that is left for later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
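
    The pattern in caller terms, as a sketch (a fragment, not a complete
    ->setattr): all constraint checks, including the truncate limits, now
    happen before any state is changed:

        error = inode_change_ok(inode, attr);   /* incl. size checks */
        if (error)
                return error;

        if (attr->ia_valid & ATTR_SIZE)
                truncate_setsize(inode, attr->ia_size); /* cannot fail */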
     

28 May, 2010

1 commit

  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
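
    A hedged sketch of the new sequence in a filesystem's ->setattr
    ("foofs" and its block-freeing helper are hypothetical; failing the
    call is now possible because nothing has changed yet when the
    filesystem-specific work runs):

        static int foofs_setattr(struct dentry *dentry, struct iattr *attr)
        {
                struct inode *inode = dentry->d_inode;
                loff_t oldsize;
                int error;

                error = inode_change_ok(inode, attr);
                if (error)
                        return error;

                if ((attr->ia_valid & ATTR_SIZE) &&
                    attr->ia_size != i_size_read(inode)) {
                        /* fs-specific work, done from ->setattr rather
                         * than ->truncate, with the old i_size visible */
                        error = foofs_free_blocks(inode, attr->ia_size);
                        if (error)
                                return error;
                        oldsize = inode->i_size;
                        i_size_write(inode, attr->ia_size);
                        truncate_pagecache(inode, oldsize, attr->ia_size);
                }

                /* copy the simple attributes (generic_setattr() in this
                 * series) and mark the inode dirty */
                return 0;
        }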
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming their availability. As
    this conversion needs to touch a large number of source files, the
    following script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update the includes so that
    only the necessary ones are there, i.e. gfp.h if only gfp is used,
    slab.h if slab is used.

    * When the script inserts a new include, it looks at the existing
    include blocks and tries to place the new include so that its order
    conforms to its surroundings. It is put in the include block that
    contains core kernel includes, in the same order the rest are ordered
    in - alphabetical, Christmas tree, rev-Xmas-tree - or at the end if
    there doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints an
    error message indicating which .h file needs to be added to the file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on the arch to make
    things build (like ipr on powerpc/64, which failed due to missing
    writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from the build
    tests in step 7, I'm fairly confident about the coverage of this
    conversion patch. If there is a breakage, it's likely to be something
    in one of the arch headers, which should be easily discoverable on
    most builds of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
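
    The rule the script enforces, in miniature (an illustrative file
    header; the comments note what each header is needed for):

        /* Before: kmalloc() et al. worked only because percpu.h pulled
         * in slab.h, which pulled in gfp.h. */
        #include <linux/percpu.h>

        /* After: each user states its own dependencies explicitly. */
        #include <linux/gfp.h>   /* alloc_pages(), GFP_KERNEL, ... */
        #include <linux/slab.h>  /* kmalloc(), kfree(), kmem_cache_*() */
        #include <linux/percpu.h>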
     

14 Jan, 2010

1 commit

  • If __block_prepare_write() fails in block_write_begin(), the
    allocated blocks can be outside of ->i_size.

    But the new truncate_pagecache() in vmtruncate() does nothing if
    new < old, which means the above usage no longer works.

    So, this patch fixes it by removing the "new < old" check. More
    cleanup/change will be needed, but with -rc out and the truncate
    rework in progress, this is the minimal fix.

    Acked-by: Nick Piggin
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
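
    A condensed sketch of the usage being fixed (a fragment of a
    block_write_begin()-style helper, not compilable standalone):

        status = __block_prepare_write(inode, page, from, to, get_block);
        if (unlikely(status)) {
                unlock_page(page);
                page_cache_release(page);
                /* Blocks may have been instantiated past i_size;
                 * trimming them back is a new < old truncate, so the
                 * removed check in truncate_pagecache() was defeating
                 * this cleanup. */
                if (pos + len > inode->i_size)
                        vmtruncate(inode, inode->i_size);
        }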
     

16 Dec, 2009

1 commit

  • In a massively parallel environment, res_counter can be a performance
    bottleneck. One strong technique for reducing lock contention is to
    reduce the number of calls by coalescing some amount of them into one.

    Considering the charge/uncharge characteristics:
    - charge is done one by one via demand paging.
    - uncharge is done
      - in chunks at munmap, truncate, exit, execve...
      - one by one via vmscan/paging.

    So we have a chance to coalesce uncharges to improve scalability
    at unmap/truncation.

    This patch implements coalescing for uncharge. To avoid scattering
    memcg structures into functions under mm/, it adds memcg batch
    uncharge information to the task. The reason for per-task batching is
    to make use of the caller's context information: we do batched
    (delayed) uncharge when truncation/unmap occurs, but direct uncharge
    when uncharge is called by memory reclaim (vmscan.c).

    The degree of coalescing depends on the caller:
    - at invalidate/truncate... the pagevec size
    - at unmap... ZAP_BLOCK_SIZE
    (memory itself is freed at this granularity, so we will not coalesce
    too much.)

    On an x86-64 8-CPU server, I tested the overhead of memcg at page
    fault by running a program which does map/fault/unmap in a loop,
    running one task per CPU with taskset, and summing the number of page
    faults over 60 seconds.

    [without memcg config]
    40156968 page-faults # 0.085 M/sec ( +- 0.046% )
    27.67 cache-miss/faults
    [root cgroup]
    36659599 page-faults # 0.077 M/sec ( +- 0.247% )
    31.58 miss/faults
    [in a child cgroup]
    18444157 page-faults # 0.039 M/sec ( +- 0.133% )
    69.96 miss/faults
    [child with this patch]
    27133719 page-faults # 0.057 M/sec ( +- 0.155% )
    47.16 miss/faults

    We can see a fair amount of improvement.
    (The root cgroup is not affected by this patch.)
    Another patch, for "charge", will follow this, and the numbers above
    will improve further.

    Changelog (since 2009/10/02):
    - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
    - some cleanup and commentary/description updates.
    - added initialization code to copy_process(). (possible bug fix)

    Changelog (old):
    - fixed the !CONFIG_MEM_CGROUP case.
    - rebased onto the latest mmotm + softlimit fix patches.
    - unified the patch for callers.
    - added comments.
    - made ->do_batch a bool.
    - removed css_get() et al. We don't need it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
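
    A hedged sketch of the per-task batch described above; the field
    names follow the changelog, but details in the merged patch may
    differ:

        struct memcg_batch_info {
                int do_batch;              /* nonzero between start/end */
                struct mem_cgroup *memcg;  /* memcg being batched against */
                unsigned long bytes;       /* coalesced uncharge */
                unsigned long memsw_bytes; /* mem+swap variant */
        };      /* embedded in task_struct */

        /* at a truncation/unmap site: */
        mem_cgroup_uncharge_start();  /* begin accumulating in the task */
        /* ... page deletions accumulate bytes here instead of hitting
         *     the res_counter one page at a time ... */
        mem_cgroup_uncharge_end();    /* one res_counter update in total */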
     

24 Sep, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c into
    mm/truncate.c.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

17 Jun, 2009

1 commit

  • Remove __invalidate_mapping_pages atomic variant now that its sole caller
    can sleep (fixed in eccb95cee4f0d56faa46ef22fb94dd4a3578d3eb ("vfs: fix
    lock inversion in drop_pagecache_sb()")).

    This fixes softlockups that can occur while in the drop_caches path.

    Signed-off-by: Mike Waychison
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Nick Piggin
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Waychison
     

29 May, 2009

1 commit

  • mapping->tree_lock can be acquired from interrupt context. Then the
    following deadlock can occur.

    Assume "A" is a page.

    CPU0:
      lock_page_cgroup(A)
        <interrupted>
          -> take mapping->tree_lock.
    CPU1:
      take mapping->tree_lock
        -> lock_page_cgroup(A)

    This patch fixes the above deadlock by moving memcg's hooks out of
    mapping->tree_lock; charge/uncharge of pagecache/swapcache is
    protected by the page lock, not by tree_lock.

    After this patch, lock_page_cgroup() is not called under
    mapping->tree_lock.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

03 Apr, 2009

1 commit

  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now
    also check for it. This includes things like truncation and page
    invalidation. The function page_has_private() has been added to check
    for both PG_private and PG_private_2 at the same time.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
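
    The helper in sketch form, reconstructed from the description above
    (the real definition in page-flags.h may differ in minor details):

        #define PAGE_FLAGS_PRIVATE \
                (1UL << PG_private | 1UL << PG_private_2)

        static inline int page_has_private(struct page *page)
        {
                return !!(page->flags & PAGE_FLAGS_PRIVATE);
        }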
     

20 Oct, 2008

1 commit

  • Originally by Nick Piggin

    Remove mlocked pages from the LRU using the "unevictable
    infrastructure" during mmap(), munmap(), mremap() and truncate(). Try
    to move pages back to the normal LRU lists on munmap() when the last
    mlocked mapping is removed. Remove the PageMlocked() status when a
    page is truncated from a file.

    [akpm@linux-foundation.org: cleanup]
    [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
    [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
    [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
    [akpm@linux-foundation.org: remove bogus kerneldoc token]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

03 Sep, 2008

1 commit

  • Dio write returns EIO when try_to_release_page fails because bh is
    still referenced.

    The patch

    commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
    Author: Mingming Cao
    Date: Fri Jul 25 01:46:22 2008 -0700

    jbd: fix race between free buffer and commit transaction

    was merged into 2.6.27-rc1, but I noticed that this patch is not enough
    to fix the race.

    I ran heavy fsstress tests against 2.6.27-rc1 and found that dio
    writes still sometimes got EIO through this test.

    The patch above fixed the race between freeing a buffer (dio) and
    committing a transaction (jbd), but I discovered that there is
    another race, between freeing a buffer (dio) and
    ext3/4_ordered_writepage:

    background_writeout()
      ->write_cache_pages()
        ->ext3_ordered_writepage()
            walk_page_buffers()      -> takes a bh ref
            block_write_full_page()  -> unlock_page
              : <dio write's try_to_release_page() fails here>
            walk_page_buffers()      -> releases the bh ref

    ext3_ordered_writepage holds a bh ref across unlock_page, so the page
    can be unlocked while a buffer is still referenced; this causes the
    race and the failure of try_to_release_page.

    To fix this race, I used the approach of falling back to buffered
    writes if try_to_release_page() fails on a page.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Hisashi Hifumi
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: Mingming Cao
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
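
    The rename in caller terms (a minimal sketch):

        /* Before: a double-negative test-and-set idiom. */
        if (!TestSetPageLocked(page)) {
                /* got the page lock */
        }

        /* After: the same acquire attempt, readably named. */
        if (trylock_page(page)) {
                /* got the page lock */
        }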
     

03 Aug, 2008

1 commit

  • Brian Wang reported that a FUSE filesystem exported through NFS could
    return I/O errors on read. This was traced to splice_direct_to_actor()
    returning a short or zero count when racing with page invalidation.

    However this is not FUSE or NFSD specific, other filesystems (notably
    NFS) also call invalidate_inode_pages2() to purge stale data from the
    cache.

    If this happens while such pages are sitting in a pipe buffer, then
    splice(2) from the pipe can return zero, and read(2) from the pipe can
    return ENODATA.

    The zero return is especially bad, since it implies end-of-file or
    disconnected pipe/socket, and is documented as such for splice. But
    returning an error for read() is also nasty, when in fact there was no
    error (data becoming stale is not an error).

    The same problems can be triggered by "hole punching" with
    madvise(MADV_REMOVE).

    Fix this by not clearing the PG_uptodate flag on truncation and
    invalidation.

    Signed-off-by: Miklos Szeredi
    Acked-by: Nick Piggin
    Cc: Andrew Morton
    Cc: Jens Axboe
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

27 Jul, 2008

1 commit

  • mapping->tree_lock has no read lockers. Convert the lock from an
    rwlock to a spinlock.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

28 Apr, 2008

1 commit

  • DIO invalidates the page cache through
    invalidate_inode_pages2_range(). invalidate_inode_pages2_range() sets
    ret = -EIO when invalidate_complete_page2() fails, but this ret is
    cleared if do_launder_page() succeeds on a page at the next index.

    In this case, dio is carried out even though
    invalidate_complete_page2() failed on some pages.

    This can cause inconsistency between memory and the blocks on disk,
    because the page cache still exists.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hisashi Hifumi
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: Zach Brown
    Cc: Nick Piggin
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Chuck Lever
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
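
    A hedged sketch of the bug pattern and the fix: an error recorded on
    one iteration must not be overwritten by a success on the next, so
    per-page results go into a second variable (the loop is elided):

        int ret = 0;

        /* for each page in the range: */ {
                int ret2 = do_launder_page(mapping, page);

                if (ret2 == 0 && !invalidate_complete_page2(mapping, page))
                        ret2 = -EIO;
                if (ret2 < 0)
                        ret = ret2;     /* sticky: later successes no
                                           longer clear an earlier -EIO */
        }
        return ret;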
     

06 Feb, 2008

3 commits

  • An orphaned page might still have fs-private metadata after it has
    been truncated. As the page has no mapping, page migration refuses to
    migrate it. It appears such a page is only freed by page reclaim, and
    unless the zone watermark is low the page is never freed, so
    migration always fails. Free the metadata so that such pages can be
    freed during migration, making migration more reliable.

    [akpm@linux-foundation.org: go direct to try_to_free_buffers()]
    Signed-off-by: Shaohua Li
    Acked-by: Nick Piggin
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 ("Clean up and make
    try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
    was changed to bail out if the page was dirty.

    That in turn caused truncate_complete_page to leak massive amounts of
    memory, because the dirty bit was only cleared after the call to
    try_to_free_buffers.

    So the call to cancel_dirty_page was moved up to have the dirty bit
    cleared early in 3e67c0987d7567ad666641164a153dca9a43b11d ("truncate:
    clear page dirtiness before running try_to_free_buffers()").

    The problem with that fix is, that the page can be redirtied after
    cancel_dirty_page was called, eg. like this:

    truncate_complete_page()
      cancel_dirty_page()               // PG_dirty cleared, decr. dirty pages
      do_invalidatepage()
        ext3_invalidatepage()
          journal_invalidatepage()
            journal_unmap_buffer()
              __dispose_buffer()
                __journal_unfile_buffer()
                  __journal_temp_unlink_buffer()
                    mark_buffer_dirty();  // PG_dirty set, incr. dirty pages

    And then we end up with dirty pages being wrongly accounted.

    As a result, in ecdfc9787fe527491baefc22dce8b2dbd5b2908d ("Resurrect
    'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
    were reverted, so the original reason for the massive memory leak is
    gone, and we can also revert the move of the call to cancel_dirty_page
    from truncate_complete_page and get the accounting right again.

    I'm not sure if it matters, but as opposed to the final check in
    __remove_from_page_cache, this one also cares about the task io
    accounting, so maybe we want to use this instead, although it's not
    quite the clean fix either.

    Signed-off-by: Björn Steinbrink
    Tested-by: Krzysztof Piotr Oledzki
    Cc: Jan Kara
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Thomas Osterried
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bjorn Steinbrink
     
  • Simplify page cache zeroing of segments of pages through 3 functions

    zero_user_segments(page, start1, end1, start2, end2)

    Zeros two segments of the page. It takes the positions at which to
    start and end the zeroing, which avoids length calculations and
    makes the code clearer.

    zero_user_segment(page, start, end)

    Same for a single segment.

    zero_user(page, start, length)

    Length variant for the case where we know the length.

    We remove the zero_user_page macro. Issues:

    1. It's a macro. Inline functions are preferable.

    2. The KM_USER0 macro is only defined for HIGHMEM.

    Having to treat this special case everywhere makes the code
    needlessly complex. The parameter for zeroing is always KM_USER0
    except in one single case that we open-code.

    Avoiding KM_USER0 means a lot of code no longer has to deal with the
    special casing for HIGHMEM. Dealing with kmap is only necessary for
    HIGHMEM configurations, and in those configurations we use KM_USER0
    as we do for a series of other functions defined in highmem.h.

    Since KM_USER0 depends on HIGHMEM, the existing zero_user_page could
    not be an inline function. The zero_user_* functions introduced here
    can be inline because that constant is not used when these functions
    are called.

    Also extract the flushing of the caches to be outside of the kmap.

    [akpm@linux-foundation.org: fix nfs and ntfs build]
    [akpm@linux-foundation.org: fix ntfs build some more]
    Signed-off-by: Christoph Lameter
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: David Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
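
    Usage in caller terms (a sketch; the offsets are illustrative):

        #include <linux/highmem.h>
        #include <linux/pagemap.h>      /* PAGE_CACHE_SIZE */

        /* Zero the tail of a partially written page. */
        zero_user(page, offset, PAGE_CACHE_SIZE - offset);

        /* Zero everything around the valid bytes [from, to) without
         * computing the two lengths by hand. */
        zero_user_segments(page, 0, from, to, PAGE_CACHE_SIZE);

        /* Zero the single segment [start, end). */
        zero_user_segment(page, start, end);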
     

20 Jul, 2007

2 commits

  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
    encodes the virtual address -> file offset differently from linear
    mappings.

    ->populate is a layering violation because the filesystem/pagecache
    code should not need to know anything about the virtual memory
    mapping. The hitch here is that the ->nopage handler didn't pass down
    enough information (ie. pgoff). But it is more logical to pass pgoff
    rather than have the ->nopage function calculate it itself anyway
    (because that's a similar layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place
    so there is a lot of duplication and nice cleanups that can be removed if
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
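
    A hedged sketch of the new handler shape ("foofs" and its page lookup
    are hypothetical): ->fault receives the pgoff in struct vm_fault
    instead of computing it from a virtual address, and hands the page
    back for the mm core to install:

        static int foofs_fault(struct vm_area_struct *vma,
                               struct vm_fault *vmf)
        {
                struct page *page;

                page = foofs_lookup_page(vma->vm_file, vmf->pgoff);
                if (!page)
                        return VM_FAULT_SIGBUS;

                vmf->page = page;       /* mm core installs the pte */
                return 0;
        }

        static struct vm_operations_struct foofs_vm_ops = {
                .fault = foofs_fault,
        };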
     
  • Fix the race between invalidate_inode_pages and do_no_page.

    Andrea Arcangeli identified a subtle race between invalidation of pages from
    pagecache with userspace mappings, and do_no_page.

    The issue is that invalidation has to shoot down all mappings to the page,
    before it can be discarded from the pagecache. Between shooting down ptes to
    a particular page, and actually dropping the struct page from the pagecache,
    do_no_page from any process might fault on that page and establish a new
    mapping to the page just before it gets discarded from the pagecache.

    The most common case where such invalidation is used is in file truncation.
    This case was catered for by doing a sort of open-coded seqlock between the
    file's i_size, and its truncate_count.

    Truncation will decrease i_size, then increment truncate_count before
    unmapping userspace pages; do_no_page will read truncate_count, then find the
    page if it is within i_size, and then check truncate_count under the page
    table lock and back out and retry if it had subsequently been changed (ptl
    will serialise against unmapping, and ensure a potentially updated
    truncate_count is actually visible).

    Complexity and documentation issues aside, the locking protocol fails in the
    case where we would like to invalidate pagecache inside i_size. do_no_page
    can come in anytime and filemap_nopage is not aware of the invalidation in
    progress (as it is when it is outside i_size). The end result is that
    dangling (->mapping == NULL) pages that appear to be from a particular file
    may be mapped into userspace with nonsense data. Valid mappings to the same
    place will see a different page.

    Andrea implemented two working fixes, one using a real seqlock, another using
    a page->flags bit. He also proposed using the page lock in do_no_page, but
    that was initially considered too heavyweight. However, it is not a global or
    per-file lock, and the page cacheline is modified in do_no_page to increment
    _count and _mapcount anyway, so a further modification should not be a large
    performance hit. Scalability is not an issue.

    This patch implements this latter approach. ->nopage implementations return
    with the page locked if it is possible for their underlying file to be
    invalidated (in that case, they must set a special vm_flags bit to indicate
    so). do_no_page only unlocks the page after setting up the mapping
    completely. invalidation is excluded because it holds the page lock during
    invalidation of each page (and ensures that the page is not mapped while
    holding the lock).

    This also allows significant simplifications in do_no_page, because we have
    the page locked in the right place in the pagecache from the start.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Jul, 2007

1 commit

  • It is a bug to set a page dirty if it is not uptodate unless it has
    buffers. If the page has buffers, then the page may be dirty (some buffers
    dirty) but not uptodate (some buffers not uptodate). The exception to this
    rule is if the set_page_dirty caller is racing with truncate or invalidate.

    A buffer can not be set dirty if it is not uptodate.

    If either of these situations occurs, it indicates there could be some data
    loss problem. Some of these warnings could be a harmless one where the
    page or buffer is set uptodate immediately after it is dirtied, however we
    should fix those up, and enforce this ordering.

    Bring the order of operations for truncate into line with those of
    invalidate. This will prevent a page from being able to go !uptodate while
    we're holding the tree_lock, which is probably a good thing anyway.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

17 Jul, 2007

2 commits

  • invalidate_mapping_pages() can sometimes take a long time (millions of pages
    to free). Long enough for the softlockup detector to trigger.

    We used to have a cond_resched() in there but I took it out because the
    drop_caches code calls invalidate_mapping_pages() under inode_lock.

    The patch adds a nasty flag and puts the cond_resched() back.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Fix the shrink_list name in some files under the mm/ directory.

    Signed-off-by: Anderson Briglia
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anderson Briglia