26 Jul, 2011

40 commits

  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.
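
    For illustration only (the exact code is arch-specific and the register
    names below are made up), the cleanup amounts to dropping the explicit
    set_fs() from the arch's start_thread()-style exec setup, since
    flush_old_exec() in fs/exec.c has already done set_fs(USER_DS) for the
    new image:

        /* Hypothetical arch start_thread() after the cleanup: no set_fs()
         * here, flush_old_exec() has already set the limit to USER_DS. */
        static inline void start_thread(struct pt_regs *regs,
                                        unsigned long pc, unsigned long usp)
        {
                regs->pc = pc;          /* entry point of the new program */
                regs->sp = usp;         /* initial user stack pointer */
        }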

    Signed-off-by: Mathias Krause
    Cc: Jesper Nilsson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • Fix some harmless warnings such as

    arch/cris/arch-v32/mach-a3/pinmux.c:273: warning: ISO C90 forbids mixed declarations and code
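
    For reference, this class of warning comes from declaring a variable after
    a statement inside a block, which C90 (and the kernel's
    -Wdeclaration-after-statement) forbids. A generic illustration, not the
    pinmux.c code, and its fix:

        /* warns: declaration after a statement */
        int example(int flag)
        {
                if (!flag)
                        return 0;
                int value = flag * 2;
                return value;
        }

        /* C90-clean: declare at the top of the block, assign later */
        int example_fixed(int flag)
        {
                int value;

                if (!flag)
                        return 0;
                value = flag * 2;
                return value;
        }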

    Signed-off-by: WANG Cong
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.

    Signed-off-by: Mathias Krause
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The address limit is already set in flush_old_exec() so this
    set_fs(USER_DS) is redundant.

    Signed-off-by: Mathias Krause
    Cc: Hirokazu Takata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The address limit is already set in flush_old_exec() so this
    set_fs(USER_DS) is redundant.

    Signed-off-by: Mathias Krause
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • The address limit is already set in flush_old_exec() so those calls to
    set_fs(USER_DS) are redundant.

    Signed-off-by: Mathias Krause
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • NR_WRITTEN is now accounted at block IO enqueue time, which is not very
    accurate as to common understanding. This moves NR_WRITTEN accounting to
    the IO completion time and makes it more consistent with BDI_WRITTEN,
    which is used for bandwidth estimation.

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • shmem_unuse_inode() and shmem_writepage() contain a little code to cope
    with pages inserted independently into the filecache, probably by a
    filesystem stacked on top of tmpfs, then fed to its ->readpage() or
    ->writepage().

    Unionfs was indeed experimenting with working in that way three years ago,
    but I find no current examples: nowadays the stacking filesystems use vfs
    interfaces to the lower filesystem.

    It's now illegal: remove most of that code, adding some WARN_ON_ONCEs.

    Signed-off-by: Hugh Dickins
    Cc: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We can now simplify shmem_getpage_gfp(): there is no longer a dilemma of
    filepage passed in via shmem_readpage(), then swappage found, which must
    then be copied over to it.

    Although at first it's tempting to replace the **pagep arg by returning
    struct page *, that makes a mess of IS_ERR_OR_NULL(page)s in all the
    callers, so leave as is.

    Insert BUG_ON(!PageUptodate) when we find and lock page: some of the
    complication came from uninitialized pages inserted into filecache prior
    to readpage; but now we're in control, and only release pagelock on
    filecache once it's uptodate (if an error occurs in reading back from
    swap, the page remains in swapcache, never moved to filecache).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The prealloc_page handling in shmem_getpage_gfp() is unnecessarily
    complicated: first simplify that before going on to filepage/swappage.

    That's right, don't report ENOMEM when the preallocation fails: we may or
    may not need the page. But simply report ENOMEM once we find we do need
    it, instead of dropping lock, repeating allocation, unwinding on failure
    etc. And leave the out label on the fast path, don't goto.

    Fix something that looks like a bug but turns out not to be: set
    PageSwapBacked on prealloc_page before its mem_cgroup_cache_charge(), as
    the removed case was doing. That's important before adding to LRU
    (determines which LRU the page goes on), and does affect which path it
    takes through memcontrol.c, but in the end MEM_CGROUP_CHARGE_TYPE_SHMEM
    is handled no differently from CACHE.

    Signed-off-by: Hugh Dickins
    Acked-by: Shaohua Li
    Cc: "Zhang, Yanmin"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove that pernicious shmem_readpage() at last: the things we needed it
    for (splice, loop, sendfile, i915 GEM) are now fully taken care of by
    shmem_file_splice_read() and shmem_read_mapping_page_gfp().

    This removal clears the way for a simpler shmem_getpage_gfp(), since page
    is never passed in; but leave most of that cleanup until after.

    sys_readahead() and sys_fadvise(POSIX_FADV_WILLNEED) will now EINVAL,
    instead of unexpectedly trying to read ahead on tmpfs: if that proves to
    be an issue for someone, then we can either arrange for them to return
    success instead, or try to implement async readahead on tmpfs.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make shmem_getpage() a wrapper, passing mapping_gfp_mask() down to
    shmem_getpage_gfp(), which in turn passes gfp down to shmem_swp_alloc().
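
    A minimal sketch of the resulting wrapper shape (the exact parameter list
    in mm/shmem.c may differ slightly):

        static inline int shmem_getpage(struct inode *inode, pgoff_t index,
                                        struct page **pagep, enum sgp_type sgp,
                                        int *fault_type)
        {
                /* gfp now comes from the mapping instead of being hard-coded,
                 * and is also what shmem_swp_alloc() will see. */
                return shmem_getpage_gfp(inode, index, pagep, sgp,
                                         mapping_gfp_mask(inode->i_mapping),
                                         fault_type);
        }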

    Change shmem_read_mapping_page_gfp() to use shmem_getpage_gfp() in the
    CONFIG_SHMEM case; but leave tiny !SHMEM using read_cache_page_gfp().

    Add a BUG_ON() in case anyone happens to call this on a non-shmem mapping;
    though we might later want to let that case route to read_cache_page_gfp().

    It annoys me to have these two almost-redundant args, gfp and fault_type:
    I can't find a better way; but initialize fault_type only in shmem_fault().

    Note that before, read_cache_page_gfp() was allocating i915_gem's pages
    with __GFP_NORETRY as intended; but the corresponding swap vector pages
    got allocated without it, leaving a small possibility of OOM.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up shmem_file_splice_read():

    Remove readahead: okay, we could implement shmem readahead on swap,
    but have never done so before, swap being the slow exceptional path.

    Use shmem_getpage() instead of find_or_create_page() plus ->readpage().

    Remove several comments: sorry, I found them more distracting than
    helpful, and this will not be the reference version of splice_read().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Copy __generic_file_splice_read() and generic_file_splice_read() from
    fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make
    page_cache_pipe_buf_ops and spd_release_page() accessible to it.

    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go in a loop of trying the atomic access, failing, trying gup to
    "fix it up", getting success from gup, going back to the atomic access,
    failing again because dirty wasn't fixed etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare-ish because the affected architectures are
    embedded and tend not to swap much (if at all), so we probably rarely hit
    the case where dirty is missing or young is missing, but I think Shan has
    a piece of SW that can reliably reproduce it using a shared writable
    mapping & fork or something like that.

    On archs that use SW tracking of dirty & young, a page without dirty is
    effectively mapped read-only, and a page without young is inaccessible in
    the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
    "fix it up" by calling get_user_pages(), which would then be equivalent to
    taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for this
    already too complex function, and instead added a new simpler
    fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
    which the futex code can call.
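
    A sketch of the idea only, not the exact patch (the real fixup_user_fault()
    may differ in signature and detail): look up the VMA and call
    handle_mm_fault() just as a real fault would, instead of going through
    get_user_pages().

        int fixup_user_fault_sketch(struct mm_struct *mm, unsigned long address,
                                    unsigned int fault_flags)
        {
                struct vm_area_struct *vma;
                int ret;

                down_read(&mm->mmap_sem);
                vma = find_vma(mm, address);
                if (!vma || address < vma->vm_start) {
                        up_read(&mm->mmap_sem);
                        return -EFAULT;
                }
                /* handle_mm_fault() fixes up PTE dirty/young and lets the
                 * arch do any "spurious fault" TLB flushing, unlike gup. */
                ret = handle_mm_fault(mm, vma, address, fault_flags);
                up_read(&mm->mmap_sem);

                return (ret & VM_FAULT_ERROR) ? -EFAULT : 0;
        }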

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • radix_tree_tagged() is lockless - it reads from a member of the radix-tree
    root node. It does not require any protection.
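
    For illustration (hypothetical helper), a caller can therefore test a tag
    without taking the tree lock or rcu_read_lock():

        static int mapping_has_tag(struct address_space *mapping, int tag)
        {
                /* only reads the tag bitmap in the radix-tree root node */
                return radix_tree_tagged(&mapping->page_tree, tag);
        }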

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With zone_reclaim_mode enabled, it's possible for zones to be considered
    full in the zonelist_cache so they are skipped in the future. If the
    process enters direct reclaim, the ZLC may still consider zones to be full
    even after reclaiming pages. Reconsider all zones for allocation if
    direct reclaim returns successfully.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla. In these cases, zone_reclaim was enabled by
    default due to large NUMA distances. In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was disable zone_reclaim).

    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free. After a failure, it might fallback
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs. In bad cases, each page allocated results in a full scan of the
    preferred zone.

    Patch 1 checks the preferred zone for recent allocation failure
    which is particularly important if zone_reclaim has failed
    recently. This avoids rescanning the zone in the near future and
    instead falling back to another node. This may hurt node locality
    in some cases but a failure to zone_reclaim is more expensive than
    a remote access.

    Patch 2 clears the zlc information after direct reclaim.
    Otherwise, zone_reclaim can mark zones full, direct reclaim can
    reclaim enough pages but the zone is still not considered for
    allocation.

    This was tested on a 24-thread 2-node x86_64 machine. The tests were
    focused on large amounts of IO. All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes. The kernels tested are

    3.0-rc6-vanilla Vanilla 3.0-rc6
    zlcfirst Patch 1 applied
    zlcreconsider Patches 1+2 applied

    FS-Mark
    ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
    fsmark-3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirs zlcreconsider
    Files/s min 54.90 ( 0.00%) 49.80 (-10.24%) 49.10 (-11.81%)
    Files/s mean 100.11 ( 0.00%) 135.17 (25.94%) 146.93 (31.87%)
    Files/s stddev 57.51 ( 0.00%) 138.97 (58.62%) 158.69 (63.76%)
    Files/s max 361.10 ( 0.00%) 834.40 (56.72%) 802.40 (55.00%)
    Overhead min 76704.00 ( 0.00%) 76501.00 ( 0.27%) 77784.00 (-1.39%)
    Overhead mean 1485356.51 ( 0.00%) 1035797.83 (43.40%) 1594680.26 (-6.86%)
    Overhead stddev 1848122.53 ( 0.00%) 881489.88 (109.66%) 1772354.90 ( 4.27%)
    Overhead max 7989060.00 ( 0.00%) 3369118.00 (137.13%) 10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 501.49 493.91 499.93
    Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

    MMTests Statistics: vmstat
    Page Ins 46268 63840 66008
    Page Outs 90821596 90671128 88043732
    Swap Ins 0 0 0
    Swap Outs 0 0 0
    Direct pages scanned 13091697 8966863 8971790
    Kswapd pages scanned 0 1830011 1831116
    Kswapd pages reclaimed 0 1829068 1829930
    Direct pages reclaimed 13037777 8956828 8648314
    Kswapd efficiency 100% 99% 99%
    Kswapd velocity 0.000 810.643 826.346
    Direct efficiency 99% 99% 96%
    Direct velocity 5340.128 3972.068 4048.788
    Percentage direct scans 100% 83% 83%
    Page writes by reclaim 0 3 0
    Slabs scanned 796672 720640 720256
    Direct inode steals 7422667 7160012 7088638
    Kswapd inode steals 0 1736840 2021238

    Test completes far faster with a large increase in the number of files
    created per second. Standard deviation is high as a small number of
    iterations were much higher than the mean. The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.

    LARGE DD
    3.0-rc6 3.0-rc6 3.0-rc6
    vanilla zlcfirst zlcreconsider
    download tar 59 ( 0.00%) 59 ( 0.00%) 55 ( 7.27%)
    dd source files 527 ( 0.00%) 296 (78.04%) 320 (64.69%)
    delete source 36 ( 0.00%) 19 (89.47%) 20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 125.03 118.98 122.01
    Total Elapsed Time (seconds) 624.56 375.02 398.06

    MMTests Statistics: vmstat
    Page Ins 3594216 439368 407032
    Page Outs 23380832 23380488 23377444
    Swap Ins 0 0 0
    Swap Outs 0 436 287
    Direct pages scanned 17482342 69315973 82864918
    Kswapd pages scanned 0 519123 575425
    Kswapd pages reclaimed 0 466501 522487
    Direct pages reclaimed 5858054 2732949 2712547
    Kswapd efficiency 100% 89% 90%
    Kswapd velocity 0.000 1384.254 1445.574
    Direct efficiency 33% 3% 3%
    Direct velocity 27991.453 184832.737 208171.929
    Percentage direct scans 100% 99% 99%
    Page writes by reclaim 0 5082 13917
    Slabs scanned 17280 29952 35328
    Direct inode steals 115257 1431122 332201
    Kswapd inode steals 0 0 979532

    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with. Time to
    completion is reduced. The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered. The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.

    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 124.47 111.67 112.64
    Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

    MMTests Statistics: vmstat
    Page Ins 90760 89124 89516
    Page Outs 121028340 120199524 120736696
    Swap Ins 0 86 55
    Swap Outs 0 0 0
    Direct pages scanned 114989363 96461439 96330619
    Kswapd pages scanned 56430948 56965763 57075875
    Kswapd pages reclaimed 27743219 27752044 27766606
    Direct pages reclaimed 49777 46884 36655
    Kswapd efficiency 49% 48% 48%
    Kswapd velocity 26392.541 31363.631 30561.736
    Direct efficiency 0% 0% 0%
    Direct velocity 53780.091 53108.759 51581.004
    Percentage direct scans 67% 62% 62%
    Page writes by reclaim 385 122 1513
    Slabs scanned 43008 39040 42112
    Direct inode steals 0 10 8
    Kswapd inode steals 733 534 477

    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.

    The gains are mostly down to two things. In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently. Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead, improving overall memory utilisation.

    This patch: initialise ZLC for first zone eligible for zone_reclaim.

    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently. The intention is to
    avoid a high cost scanning extremely long zonelists or scanning within the
    zone uselessly.

    Currently the zonelist cache is setup only after the first zone has been
    considered and zone_reclaim() has been called. The objective was to avoid
    a costly setup but zone_reclaim is itself quite expensive. If it is
    failing regularly, such as when the first eligible zone has mostly mapped
    pages, the cost in scanning and allocation stalls is far higher than the
    ZLC initialisation step.

    This patch initialises ZLC before the first eligible zone calls
    zone_reclaim(). Once initialised, it is checked whether the zone failed
    zone_reclaim recently. If it has, the zone is skipped. As the first zone
    is now being checked, additional care has to be taken about zones marked
    full. A zone can be marked "full" because it should not have enough
    unmapped pages for zone_reclaim but this is excessive as direct reclaim or
    kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
    zone_reclaim fails if it failed to reclaim enough pages after scanning.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently we are keeping the faulted page locked throughout the whole
    __do_fault call (except for the page_mkwrite code path) after calling the
    file system's fault code. If we do early COW, we allocate a new page which
    has to be charged to a memcg (mem_cgroup_newpage_charge).

    This function, however, might block for an unbounded amount of time if the
    memcg oom killer is disabled or a fork-bomb is running, because the only
    way out of the OOM situation is either an external event or an
    OOM-situation fix.

    In the end we are keeping the faulted page locked and blocking other
    processes from faulting it in, which is not good at all because we are
    basically punishing a potentially unrelated process for an OOM condition
    in a different group (I have seen a stuck system because of ld-2.11.1.so
    being locked).

    We can test this easily.

    % cgcreate -g memory:A
    % cgset -r memory.limit_in_bytes=64M A
    % cgset -r memory.memsw.limit_in_bytes=64M A
    % cd kernel_dir; cgexec -g memory:A make -j

    Then, the whole system will be live-locked until you kill 'make -j'
    by hand (or hit reboot...). This is because some important pages in a
    shared library are locked.

    Considering this again, the new page does not need to be allocated with
    lock_page() held. And the usual page allocation may dive into a long
    memory reclaim loop while holding lock_page(), which can cause very long
    latency.

    There are 3 ways.
    1. do allocation/charge before lock_page()
    Pros. - simple and can handle page allocation in the same manner.
    This will reduce holding time of lock_page() in general.
    Cons. - we do page allocation even if ->fault() returns error.

    2. do charge after unlock_page(). Even if charge fails, it's just OOM.
    Pros. - no impact to non-memcg path.
    Cons. - the implementation requires special care of the LRU and we need to
    modify page_add_new_anon_rmap()...

    3. do unlock->charge->lock again method.
    Pros. - no impact to non-memcg path.
    Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to release
    lock and get it again...

    This patch moves "charge" and memory allocation for COW page
    before lock_page(). Then, we can avoid scanning LRU with holding
    a lock on a page and latency under lock_page() will be reduced.

    Then, above livelock disappears.

    [akpm@linux-foundation.org: fix code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Lutz Vieweg
    Original-idea-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 2.6.36's 7e496299d4d2 ("tmpfs: make tmpfs scalable with percpu_counter for
    used blocks"), to make tmpfs scalable with percpu_counter, used
    inode->i_lock in place of sbinfo->stat_lock around i_blocks updates; but
    that was adverse to scalability, and unnecessary, since info->lock is
    already held there in the fast paths.

    Remove those uses of i_lock, and add info->lock in the three error paths
    where it's then needed across shmem_free_blocks(). It's not actually
    needed across shmem_unacct_blocks(), but they're so often paired that it
    looks wrong to split them apart.

    Signed-off-by: Hugh Dickins
    Acked-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • truncate_inode_pages_range()'s final loop has a nice pincer property,
    bringing start and end together, squeezing out the last pages. But the
    range handling missed out on that, just sliding up the range, perhaps
    letting pages come in behind it. Add one more test to give it the same
    pincer effect.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use consistent variable names in truncate_pagecache(), truncate_setsize(),
    vmtruncate() and vmtruncate_range().

    unmap_mapping_range() and vmtruncate_range() have mismatched interfaces:
    don't change either, but make the vmtruncates more precise about what they
    expect unmap_mapping_range() to do.

    vmtruncate_range() is currently called only with page-aligned start and
    end+1: can handle unaligned start, but unaligned end+1 would hit BUG_ON in
    truncate_inode_pages_range() (lacks partial clearing of the end page).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Correct comment on truncate_inode_pages*() in linux/mm.h; and remove
    declaration of page_unuse(), it didn't exist even in 2.2.26 or 2.4.0!

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The often-NULL data arg to read_cache_page() and read_mapping_page()
    functions is misdescribed as "destination for read data": no, it's the
    first arg to the filler function, often struct file * to ->readpage().
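
    The interface shape (approximate) makes that clearer: "data" is simply
    forwarded as the filler's first argument. A hypothetical filler that hands
    a struct file * on to ->readpage():

        struct page *read_cache_page(struct address_space *mapping,
                                     pgoff_t index,
                                     int (*filler)(void *, struct page *),
                                     void *data);

        static int file_readpage_filler(void *data, struct page *page)
        {
                struct file *file = data;       /* not a destination buffer */

                return file->f_mapping->a_ops->readpage(file, page);
        }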

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • Luckily there are still a few software PTE bits remaining and they even
    match up in both the sun4u and sun4v pte layouts.

    Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • Make use of the generic RCU page table freeing on Sparc64, doing so allows
    for race-free software page-table walkers like gup_fast().

    Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • With the recent mmu_gather changes that included generic RCU freeing of
    page-tables, it is now quite straightforward to implement gup_fast() on
    sparc64.

    This patch:

    Remove the page table quicklists. They are pointless and make it harder
    to use RCU page table freeing and share code with other architectures.

    BTW, this is the second time this has happened, see commit 3c936465249f
    ("[SPARC64]: Kill pgtable quicklists and use SLAB.")

    Signed-off-by: David S. Miller
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David S. Miller
     
  • - shmem pages are not immediately available, but they are not
    potentially available either: even if we swap them out, they will just
    relocate from memory into swap; the total amount of immediately and
    potentially available memory is not going to be affected, so we
    shouldn't count them as potentially free in the first place.

    - nr_free_pages() is not an expensive operation anymore; there is no
    need to split the decision making into two halves and repeat code.

    Signed-off-by: Dmitry Fink
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Fink
     
  • RED_INACTIVE is a slab thing, and reusing it for memblock was
    inappropriate, because memblock is dealing with phys_addr_t's which have a
    Kconfigurable sizeof().

    Create a new poison type for this application. Fixes the sparse warning

    warning: cast truncates bits from constant value (9f911029d74e35b becomes 9d74e35b)

    Reported-by: H Hartley Sweeten
    Tested-by: H Hartley Sweeten
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
    according to Documentation/feature-removal-schedule.txt.

    This patch makes the warning more verbose by making it appear as a more
    serious problem (the presence of a stack trace and being multiline should
    attract more attention) so that applications still using the old interface
    can get fixed.
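
    As a hypothetical user-space sketch of that fix (the file paths are real,
    the helper name is made up): write the replacement tunable, whose range is
    -1000..1000 rather than oom_adj's -17..15.

        #include <stdio.h>

        /* Adjust this process's OOM score via the non-deprecated interface. */
        static int set_oom_score_adj(int value)
        {
                FILE *f = fopen("/proc/self/oom_score_adj", "w");

                if (!f)
                        return -1;
                fprintf(f, "%d\n", value);      /* -1000 (never kill) .. 1000 */
                return fclose(f);
        }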

    Very popular users of the old interface have been converted since the oom
    killer rewrite has been introduced. udevd switched to the
    /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
    opensshd switched in 5.7p1.

    At the start of 2012, this should be changed into a WARN() to emit all
    such incidents and then finally remove the tunable in August 2012 as
    scheduled.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The badness() function in the oom killer was renamed to oom_badness() in
    a63d83f427fb ("oom: badness heuristic rewrite") for clarity, since it is a
    globally exported function.

    The prototype for the old function still existed in linux/oom.h, so remove
    it. There are no existing users.

    Also fixes documentation and comment references to badness() and adjusts
    them accordingly.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • ZAP_BLOCK_SIZE became unused in the preemptible-mmu_gather work ("mm:
    Remove i_mmap_lock lockbreak"). So zap it.

    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Fix coding style issues flagged by checkpatch.pl

    Signed-off-by: Chris Forbes
    Acked-by: Eric B Munson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Forbes
     
  • The lock is released first thing in all three branches. Simplify this by
    unconditionally releasing the lock and removing the else clause, which was
    only there to be sure the lock was released.

    Signed-off-by: Chris Wright
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
     
  • Commit a539f3533b78e3 ("mm: add SECTION_ALIGN_UP() and
    SECTION_ALIGN_DOWN() macro") introduced the SECTION_ALIGN_UP() and
    SECTION_ALIGN_DOWN() macros. Use those macros to increase code
    readability.
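
    For reference, the helpers are roughly (as defined in
    include/linux/mmzone.h):

        #define SECTION_ALIGN_UP(pfn)   (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
        #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)

    so open-coded rounding like "(pfn + PAGES_PER_SECTION - 1) &
    PAGE_SECTION_MASK" becomes SECTION_ALIGN_UP(pfn).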

    Signed-off-by: Daniel Kiper
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     
  • In commit a2c8990aed5ab ("memsw: remove noswapaccount kernel parameter"),
    Michal forgot to remove some leftover pieces of noswapaccount in the tree;
    this patch removes them all.

    Signed-off-by: WANG Cong
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • Commit bae9c19bf1 ("thp: split_huge_page_mm/vma") changed the locking
    behavior of walk_page_range(), so this patch updates the comment too.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Originally, walk_hugetlb_range() didn't require the caller to take any
    lock. But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range") changed that rule, because it added a find_vma() call in
    walk_hugetlb_range().

    Any commit that changes a locking rule should document it too.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro