30 Apr, 2013

1 commit

  • There is an ifdef in page_cache_get_speculative() that checks for !SMP
    and TREE_RCU, which has been an impossible combination since the advent
    of TINY_RCU. The ifdef enables a fastpath that is valid when preemption
    is disabled by rcu_read_lock() in UP systems, which is the case when
    TINY_RCU is enabled. This commit therefore adjusts the ifdef to
    generate the fastpath when TINY_RCU is enabled.

    Signed-off-by: Paul E. McKenney
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     

22 Feb, 2013

1 commit

  • Create a helper function to check if a backing device requires stable
    page writes and, if so, performs the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation Count AvgLat MaxLat
    ----------------------------------------
    WriteX 109347 0.028 59.817
    ReadX 347180 0.004 3.391
    Flush 15514 29.828 287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX 105556 0.029 4.273
    ReadX 335004 0.005 4.112
    Flush 14982 30.540 298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

12 Dec, 2012

1 commit

  • Memory fragmentation introduced by ballooning might reduce significantly
    the number of 2MB contiguous memory blocks that can be used within a guest,
    thus imposing performance penalties associated with the reduced number of
    transparent huge pages that could be used by the guest workload.

    This patch introduces a common interface to help a balloon driver on
    making its page set movable to compaction, and thus allowing the system
    to better leverage the compation efforts on memory defragmentation.

    [akpm@linux-foundation.org: use PAGE_FLAGS_CHECK_AT_PREP, s/__balloon_page_flags/page_flags_cleared/, small cleanups]
    [rientjes@google.com: allow balloon compaction for any system with memory compaction enabled, which is the defconfig]
    Signed-off-by: Rafael Aquini
    Acked-by: Mel Gorman
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

01 Aug, 2012

1 commit

  • In order to teach filesystems to handle swap cache pages, three new page
    functions are introduced:

    pgoff_t page_file_index(struct page *);
    loff_t page_file_offset(struct page *);
    struct address_space *page_file_mapping(struct page *);

    page_file_index() - gives the offset of this page in the file in
    PAGE_CACHE_SIZE blocks. Like page->index is for mapped pages, this
    function also gives the correct index for PG_swapcache pages.

    page_file_offset() - uses page_file_index(), so that it will give the
    expected result, even for PG_swapcache pages.

    page_file_mapping() - gives the mapping backing the actual page; that is
    for swap cache pages it will give swap_file->f_mapping.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 May, 2012

1 commit

  • Commit f56f821feb7b ("mm: extend prefault helpers to fault in more than
    PAGE_SIZE") added in the new functions: fault_in_multipages_writeable()
    and fault_in_multipages_readable().

    However, we currently see:

    include/linux/pagemap.h:492: warning: 'ret' may be used uninitialized in this function
    include/linux/pagemap.h:492: note: 'ret' was declared here

    Unlike a lot of gcc nags, this one appears somewhat legit. i.e. passing
    in an invalid negative value of "size" does make it look like all the
    conditionals in there would be bypassed and the uninitialized value would
    be returned.

    Signed-off-by: Paul Gortmaker
    Cc: Daniel Vetter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

24 Apr, 2012

1 commit


27 Mar, 2012

1 commit

  • drm/i915 wants to read/write more than one page in its fastpath
    and hence needs to prefault more than PAGE_SIZE bytes.

    Add new functions in filemap.h to make that possible.

    Also kill a copy&pasted spurious space in both functions while at it.

    v2: As suggested by Andrew Morton, add a multipage parameter to both
    functions to avoid the additional branch for the pagemap.c hotpath.
    My gcc 4.6 here seems to dtrt and indeed reap these branches where not
    needed.

    v3: Becaus I couldn't find a way around adding a uaddr += PAGE_SIZE to
    the filemap.c hotpaths (that the compiler couldn't remove again),
    let's go with separate new functions for the multipage use-case.

    v4: Adjust comment to CodingStlye and fix spelling.

    Acked-by: Andrew Morton
    Signed-off-by: Daniel Vetter

    Daniel Vetter
     

26 Jul, 2011

1 commit

  • The often-NULL data arg to read_cache_page() and read_mapping_page()
    functions is misdescribed as "destination for read data": no, it's the
    first arg to the filler function, often struct file * to ->readpage().

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Jun, 2011

1 commit

  • Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec
    of preempt count offset independently. So that the offset
    can be updated by preempt_disable() and preempt_enable()
    even without the need for CONFIG_PREEMPT beeing set.

    This prepares to make CONFIG_DEBUG_SPINLOCK_SLEEP working
    with !CONFIG_PREEMPT where it currently doesn't detect
    code that sleeps inside explicit preemption disabled
    sections.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

25 May, 2011

2 commits

  • Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.

    readahead page allocations are completely optional. They are OK to fail
    and in particular shall not trigger OOM on themselves.

    Reported-by: Dave Young
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • commit 2687a356 ("Add lock_page_killable") introduced killable
    lock_page(). Similarly this patch introdues killable
    wait_on_page_locked().

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

4 commits

  • Now we renamed remove_from_page_cache with delete_from_page_cache. As
    consistency of __remove_from_swap_cache and remove_from_swap_cache, we
    change internal page cache handling function name, too.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now delete_from_page_cache() replaces remove_from_page_cache(). So we
    remove remove_from_page_cache so fs or something out of mainline will
    notice it when compile time and can fix it.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Presently we increase the page refcount in add_to_page_cache() but don't
    decrease it in remove_from_page_cache(). Such asymmetry adds confusion,
    requiring that callers notice it and a comment explaining why they release
    a page reference. It's not a good API.

    A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
    gave up because reiser4's drop_page() had to unlock the page between
    removing it from page cache and doing the page_cache_release(). But now
    the situation is changed. I think at least things in current mainline
    don't have any obstacles. The problem is for out-of-mainline filesystems
    - if they have done such things as reiser4, this patch could be a problem
    but they will discover this at compile time since we remove
    remove_from_page_cache().

    This patch:

    This function works as just wrapper remove_from_page_cache(). The
    difference is that it decreases page references in itself. So caller have
    to make sure it has a page reference before calling.

    This patch is ready for removing remove_from_page_cache().

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Jan, 2011

1 commit

  • The mapping_unevictable() has a likely() around the mapping parameter.
    This mapping parameter comes from page_mapping() which has an unlikely()
    that the page will be set as PAGE_MAPPING_ANON, and if so, it will return
    NULL. One would think that this unlikely() means that the mapping
    returned by page_mapping() would not be NULL, but where page_mapping() is
    used just above mapping_unevictable(), that unlikely() is incorrect most
    of the time. This means that the "likely(mapping)" in
    mapping_unevictable() is incorrect most of the time.

    Running the annotated branch profiler on my main box which runs firefox,
    evolution, xchat and is part of my distcc farm, I had this:

    correct incorrect % Function File Line
    ------- --------- - -------- ---- ----
    12872836 1269443893 98 mapping_unevictable pagemap.h 51
    35935762 1270265395 97 page_mapping mm.h 659
    1306198001 143659 0 page_mapping mm.h 657
    203131478 121586 0 page_mapping mm.h 657
    5415491 1116 0 page_mapping mm.h 657
    74899487 1116 0 page_mapping mm.h 657
    203132845 224 0 page_mapping mm.h 659
    5415464 27 0 page_mapping mm.h 659
    13552 0 0 page_mapping mm.h 657
    13552 0 0 page_mapping mm.h 659
    242630 0 0 page_mapping mm.h 657
    242630 0 0 page_mapping mm.h 659
    74899487 0 0 page_mapping mm.h 659

    The page_mapping() is a static inline, which is why it shows up multiple
    times. The mapping_unevictable() is also a static inline but seems to be
    used only once in my setup.

    The unlikely in page_mapping() was correct a total of 1909540379 times and
    incorrect 1270533123 times, with a 39% being incorrect. Perhaps this is
    enough to remove the unlikely from page_mapping() as well.

    Signed-off-by: Steven Rostedt
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

27 Oct, 2010

1 commit

  • This change reduces mmap_sem hold times that are caused by waiting for
    disk transfers when accessing file mapped VMAs.

    It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
    site wants mmap_sem to be released if blocking on a pending disk transfer.
    In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
    do_page_fault() will then re-acquire mmap_sem and retry the page fault.

    It is expected that the retry will hit the same page which will now be
    cached, and thus it will complete with a low mmap_sem hold time.

    Tests:

    - microbenchmark: thread A mmaps a large file and does random read accesses
    to the mmaped area - achieves about 55 iterations/s. Thread B does
    mmap/munmap in a loop at a separate location - achieves 55 iterations/s
    before, 15000 iterations/s after.

    - We are seeing related effects in some applications in house, which show
    significant performance regressions when running without this change.

    [akpm@linux-foundation.org: fix warning & crash]
    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Acked-by: Linus Torvalds
    Cc: Nick Piggin
    Reviewed-by: Wu Fengguang
    Cc: Ying Han
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: "H. Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

13 Aug, 2010

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
    hwpoison: rename CONFIG
    HWPOISON, hugetlb: support hwpoison injection for hugepage
    HWPOISON, hugetlb: detect hwpoison in hugetlb code
    HWPOISON, hugetlb: isolate corrupted hugepage
    HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
    HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
    HWPOISON, hugetlb: enable error handling path for hugepage
    hugetlb, rmap: add reverse mapping for hugepage
    hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

    Fix up trivial conflicts in mm/memory-failure.c

    Linus Torvalds
     

11 Aug, 2010

2 commits

  • This patch adds reverse mapping feature for hugepage by introducing
    mapcount for shared/private-mapped hugepage and anon_vma for
    private-mapped hugepage.

    While hugepage is not currently swappable, reverse mapping can be useful
    for memory error handler.

    Without this patch, memory error handler cannot identify processes
    using the bad hugepage nor unmap it from them. That is:
    - for shared hugepage:
    we can collect processes using a hugepage through pagecache,
    but can not unmap the hugepage because of the lack of mapcount.
    - for privately mapped hugepage:
    we can neither collect processes nor unmap the hugepage.
    This patch solves these problems.

    This patch include the bug fix given by commit 23be7468e8, so reverts it.

    Dependency:
    "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"

    ChangeLog since May 24.
    - create hugetlb_inline.h and move is_vm_hugetlb_index() in it.
    - move functions setting up anon_vma for hugepage into mm/rmap.c.

    ChangeLog since May 13.
    - rebased to 2.6.34
    - fix logic error (in case that private mapping and shared mapping coexist)
    - move is_vm_hugetlb_page() into include/linux/mm.h to use this function
    from linear_page_index()
    - define and use linear_hugepage_index() instead of compound_order()
    - use page_move_anon_rmap() in hugetlb_cow()
    - copy exclusive switch of __set_page_anon_rmap() into hugepage counterpart.
    - revert commit 24be7468 completely

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • is_vm_hugetlb_page() is a widely used inline function to insert hooks
    into hugetlb code.
    But we can't use it in pagemap.h because of circular dependency of
    the header files. This patch removes this limitation.

    Acked-by: Mel Gorman
    Acked-by: Fengguang Wu
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     

10 Aug, 2010

1 commit


28 Jan, 2010

1 commit

  • It's a simplified 'read_cache_page()' which takes a page allocation
    flag, so that different paths can control how aggressive the memory
    allocations are that populate a address space.

    In particular, the intel GPU object mapping code wants to be able to do
    a certain amount of own internal memory management by automatically
    shrinking the address space when memory starts getting tight. This
    allows it to dynamically use different memory allocation policies on a
    per-allocation basis, rather than depend on the (static) address space
    gfp policy.

    The actual new function is a one-liner, but re-organizing the helper
    functions to the point where you can do this with a single line of code
    is what most of the patch is all about.

    Tested-by: Chris Wilson
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Aug, 2009

1 commit

  • A couple of references to CONFIG_CLASSIC_RCU have survived.
    Although these are harmless, it is past time for them to go.
    The one in hardirq.h is strictly a readability problem.

    The two in pagemap.h appear to disable a !SMP performance
    optimization (which this patch re-enables).

    This does raise the issue as to whether pagemap.h should really
    be referring to the CPU implementation. Long term, I intend to
    make the RCU implementation driven by CONFIG_PREEMPT, at which
    point these should change from defined(CONFIG_TREE_RCU) to
    !defined(CONFIG_PREEMPT). In the meantime, is there something
    else that could be done in pagemap.h?

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: akpm@linux-foundation.org
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josht@linux.vnet.ibm.com
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

17 Jun, 2009

1 commit

  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

03 Apr, 2009

2 commits

  • Add a function to install a monitor on the page lock waitqueue for a particular
    page, thus allowing the page being unlocked to be detected.

    This is used by CacheFiles to detect read completion on a page in the backing
    filesystem so that it can then copy the data to the waiting netfs page.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • A new "address_space flag"--AS_MM_ALL_LOCKS--was defined to use the next
    available AS flag while the Unevictable LRU was under development. The
    Unevictable LRU was using the same flag and "no one" noticed. Current
    mainline, since 2.6.28, has same value for two symbolic flag names.

    So, define a unique flag value for AS_UNEVICTABLE--up close to the other
    flags, [at the cost of an additional #ifdef] so we'll notice next time.
    Note that #ifdef is not actually required, if we don't mind having the
    unused flag value defined.

    Replace #defines with an enum.

    Signed-off-by: Lee Schermerhorn
    Cc: [2.6.28.x, 2.6.29.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

05 Jan, 2009

1 commit

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in ints write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Oct, 2008

4 commits

  • trylock_page, unlock_page open and close a critical section. Hence,
    we can use the lock bitops to get the desired memory ordering.

    Also, mark trylock as likely to succeed (and remove the annotation from
    callers).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Setting and clearing the page locked when inserting it into swapcache /
    pagecache when it has no other references can use non-atomic page flags
    operations because no other CPU may be operating on it at this time.

    This saves one atomic operation when inserting a page into pagecache.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
    kept on the normal LRU, since scanning them is a waste of time and might
    throw off kswapd's balancing algorithms. Place them on the unevictable
    LRU list instead.

    Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
    memory regions as unevictable. Then these pages will be culled off the
    normal LRU lists during vmscan.

    Add new wrapper function to clear the mapping's unevictable state when/if
    shared memory segment is munlocked.

    Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
    the shmem segment's mapping [struct address_space] for evictability now
    that they're no longer locked. If so, move them to the appropriate zone
    lru list.

    Changes depend on [CONFIG_]UNEVICTABLE_LRU.

    [kosaki.motohiro@jp.fujitsu.com: revert shm change]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Kosaki Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Christoph Lameter pointed out that ram disk pages also clutter the LRU
    lists. When vmscan finds them dirty and tries to clean them, the ram disk
    writeback function just redirties the page so that it goes back onto the
    active list. Round and round she goes...

    With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
    longer the case, as ram disk pages are no longer maintained on the lru.
    [This makes them unmigratable for defrag or memory hot remove, but that
    can be addressed by a separate patch series.] However, the ramfs pages
    behave like ram disk pages used to, so:

    Define new address_space flag [shares address_space flags member with
    mapping's gfp mask] to indicate that the address space contains all
    unevictable pages. This will provide for efficient testing of ramfs pages
    in page_evictable().

    Also provide wrapper functions to set/test the unevictable state to
    minimize #ifdefs in ramfs driver and any other users of this facility.

    Set the unevictable state on address_space structures for new ramfs
    inodes. Test the unevictable state in page_evictable() to cull
    unevictable pages.

    These changes depend on [CONFIG_]UNEVICTABLE_LRU.

    [riel@redhat.com: undo the brd.c part]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Debugged-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

30 Jul, 2008

1 commit

  • Implement lockless get_user_pages_fast for 64-bit powerpc.

    Page table existence is guaranteed with RCU, and speculative page references
    are used to take a reference to the pages without having a prior existence
    guarantee on them.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Nick Piggin
     

29 Jul, 2008

1 commit

  • mm_take_all_locks holds off reclaim from an entire mm_struct. This allows
    mmu notifiers to register into the mm at any time with the guarantee that
    no mmu operation is in progress on the mm.

    This operation locks against the VM for all pte/vma/mm related operations
    that could ever happen on a certain mm. This includes vmtruncate,
    try_to_unmap, and all page faults.

    The caller must take the mmap_sem in write mode before calling
    mm_take_all_locks(). The caller isn't allowed to release the mmap_sem
    until mm_drop_all_locks() returns.

    mmap_sem in write mode is required in order to block all operations that
    could modify pagetables and free pages without need of altering the vma
    layout (for example populate_range() with nonlinear vmas). It's also
    needed in write mode to avoid new anon_vmas to be associated with existing
    vmas.

    A single task can't take more than one mm_take_all_locks() in a row or it
    would deadlock.

    mm_take_all_locks() and mm_drop_all_locks are expensive operations that
    may have to take thousand of locks.

    mm_take_all_locks() can fail if it's interrupted by signals.

    When mmu_notifier_register returns, we must be sure that the driver is
    notified if some task is in the middle of a vmtruncate for the 'mm' where
    the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
    is run around the vmtruncation but mmu_notifier_register can run after
    mmu_notifier_invalidate_range_start and before
    mmu_notifier_invalidate_range_end). Same problem for rmap paths. And
    we've to remove page pinning to avoid replicating the tlb_gather logic
    inside KVM (and GRU doesn't work well with page pinning regardless of
    needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
    the page, kvm would have no way to notice that it mapped into sptes a page
    that is going into the freelist without a chance of any further
    mmu_notifier notification.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Linus Torvalds
    Cc: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Jul, 2008

1 commit

  • If we can be sure that elevating the page_count on a pagecache page will
    pin it, we can speculatively run this operation, and subsequently check to
    see if we hit the right page rather than relying on holding a lock or
    otherwise pinning a reference to the page.

    This can be done if get_page/put_page behaves consistently throughout the
    whole tree (ie. if we "get" the page after it has been used for something
    else, we must be able to free it with a put_page).

    Actually, there is a period where the count behaves differently: when the
    page is free or if it is a constituent page of a compound page. We need
    an atomic_inc_not_zero operation to ensure we don't try to grab the page
    in either case.

    This patch introduces the core locking protocol to the pagecache (ie.
    adds page_cache_get_speculative, and tweaks some update-side code to make
    it work).

    Thanks to Hugh for pointing out an improvement to the algorithm setting
    page_count to zero when we have control of all references, in order to
    hold off speculative getters.

    [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
    [hugh@veritas.com: fix add_to_page_cache]
    [akpm@linux-foundation.org: repair a comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

25 Jul, 2008

1 commit


14 Feb, 2008

1 commit


07 Dec, 2007

1 commit