26 Jul, 2011

40 commits

  • * Merge akpm patch series: (122 commits)
    drivers/connector/cn_proc.c: remove unused local
    Documentation/SubmitChecklist: add RCU debug config options
    reiserfs: use hweight_long()
    reiserfs: use proper little-endian bitops
    pnpacpi: register disabled resources
    drivers/rtc/rtc-tegra.c: properly initialize spinlock
    drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time()
    drivers/rtc: add support for Qualcomm PMIC8xxx RTC
    drivers/rtc/rtc-s3c.c: support clock gating
    drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200
    init: skip calibration delay if previously done
    misc/eeprom: add eeprom access driver for digsy_mtc board
    misc/eeprom: add driver for microwire 93xx46 EEPROMs
    checkpatch.pl: update $logFunctions
    checkpatch: make utf-8 test --strict
    checkpatch.pl: add ability to ignore various messages
    checkpatch: add a "prefer __aligned" check
    checkpatch: validate signature styles and To: and Cc: lines
    checkpatch: add __rcu as a sparse modifier
    checkpatch: suggest using min_t or max_t
    ...

    Did this as a merge because of (trivial) conflicts in
    - Documentation/feature-removal-schedule.txt
    - arch/xtensa/include/asm/uaccess.h
    that were just easier to fix up in the merge than in the patch series.

    Linus Torvalds
     
  • devres uses the pointer value as key after it's freed, which is safe but
    triggers spurious use-after-free warnings on some static analysis tools.
    Rearrange code to avoid such warnings.

    Signed-off-by: Maxin B. John
    Reviewed-by: Rolf Eike Beer
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxin B John
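
The reordering can be pictured with a small userspace sketch. This is not the devres code: the tracking table and the names `track()`, `untrack()` and `release_tracked()` are hypothetical. It only illustrates dropping the bookkeeping entry, keyed by the pointer value, before the memory is freed.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical stand-in for a devres-style registry: a fixed table of
 * tracked pointers, keyed by pointer value. */
#define MAX_TRACKED 8
static void *tracked[MAX_TRACKED];

/* Record ptr in the first empty slot; returns 0 on success, -1 if full. */
int track(void *ptr)
{
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == NULL) {
            tracked[i] = ptr;
            return 0;
        }
    }
    return -1;
}

/* Drop the entry whose key equals ptr; returns 0 if found, -1 otherwise. */
int untrack(void *ptr)
{
    if (ptr == NULL)
        return -1;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == ptr) {
            tracked[i] = NULL;
            return 0;
        }
    }
    return -1;
}

/* Rearranged order: use the pointer as a key while it is still valid,
 * and only then free it. */
int release_tracked(void *ptr)
{
    int rc = untrack(ptr);
    free(ptr);
    return rc;
}
```

Doing the lookup first keeps the behaviour the same while giving analysis tools nothing to flag.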
     
  • NR_WRITTEN is currently accounted at block IO enqueue time, which does not
    match the common understanding of "written". This moves NR_WRITTEN
    accounting to IO completion time and makes it more consistent with
    BDI_WRITTEN, which is used for bandwidth estimation.

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • shmem_unuse_inode() and shmem_writepage() contain a little code to cope
    with pages inserted independently into the filecache, probably by a
    filesystem stacked on top of tmpfs, then fed to its ->readpage() or
    ->writepage().

    Unionfs was indeed experimenting with working in that way three years ago,
    but I find no current examples: nowadays the stacking filesystems use vfs
    interfaces to the lower filesystem.

    It's now illegal: remove most of that code, adding some WARN_ON_ONCEs.

    Signed-off-by: Hugh Dickins
    Cc: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We can now simplify shmem_getpage_gfp(): there is no longer a dilemma of
    filepage passed in via shmem_readpage(), then swappage found, which must
    then be copied over to it.

    Although at first it's tempting to replace the **pagep arg by returning
    struct page *, that makes a mess of IS_ERR_OR_NULL(page)s in all the
    callers, so leave as is.

    Insert BUG_ON(!PageUptodate) when we find and lock page: some of the
    complication came from uninitialized pages inserted into filecache prior
    to readpage; but now we're in control, and only release pagelock on
    filecache once it's uptodate (if an error occurs in reading back from
    swap, the page remains in swapcache, never moved to filecache).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The prealloc_page handling in shmem_getpage_gfp() is unnecessarily
    complicated: first simplify that before going on to filepage/swappage.

    That's right, don't report ENOMEM when the preallocation fails: we may or
    may not need the page. But simply report ENOMEM once we find we do need
    it, instead of dropping lock, repeating allocation, unwinding on failure
    etc. And leave the out label on the fast path, don't goto.

    Fix something that looks like a bug but turns out not to be: set
    PageSwapBacked on prealloc_page before its mem_cgroup_cache_charge(), as
    the removed case was doing. That's important before adding to LRU
    (determines which LRU the page goes on), and does affect which path it
    takes through memcontrol.c, but in the end MEM_CGROUP_CHARGE_TYPE_SHMEM
    is handled no differently from CACHE.

    Signed-off-by: Hugh Dickins
    Acked-by: Shaohua Li
    Cc: "Zhang, Yanmin"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove that pernicious shmem_readpage() at last: the things we needed it
    for (splice, loop, sendfile, i915 GEM) are now fully taken care of by
    shmem_file_splice_read() and shmem_read_mapping_page_gfp().

    This removal clears the way for a simpler shmem_getpage_gfp(), since page
    is never passed in; but leave most of that cleanup until after.

    sys_readahead() and sys_fadvise(POSIX_FADV_WILLNEED) will now EINVAL,
    instead of unexpectedly trying to read ahead on tmpfs: if that proves to
    be an issue for someone, then we can either arrange for them to return
    success instead, or try to implement async readahead on tmpfs.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make shmem_getpage() a wrapper, passing mapping_gfp_mask() down to
    shmem_getpage_gfp(), which in turn passes gfp down to shmem_swp_alloc().

    Change shmem_read_mapping_page_gfp() to use shmem_getpage_gfp() in the
    CONFIG_SHMEM case; but leave tiny !SHMEM using read_cache_page_gfp().

    Add a BUG_ON() in case anyone happens to call this on a non-shmem mapping;
    though we might later want to let that case route to read_cache_page_gfp().

    It annoys me to have these two almost-redundant args, gfp and fault_type:
    I can't find a better way; but initialize fault_type only in shmem_fault().

    Note that before, read_cache_page_gfp() was allocating i915_gem's pages
    with __GFP_NORETRY as intended; but the corresponding swap vector pages
    got allocated without it, leaving a small possibility of OOM.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up shmem_file_splice_read():

    Remove readahead: okay, we could implement shmem readahead on swap,
    but have never done so before, swap being the slow exceptional path.

    Use shmem_getpage() instead of find_or_create_page() plus ->readpage().

    Remove several comments: sorry, I found them more distracting than
    helpful, and this will not be the reference version of splice_read().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Copy __generic_file_splice_read() and generic_file_splice_read() from
    fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make
    page_cache_pipe_buf_ops and spd_release_page() accessible to it.

    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go in a loop of trying the atomic access, failing, trying gup to
    "fix it up", getting success from gup, going back to the atomic access,
    failing again because dirty wasn't fixed etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare-ish because the affected architectures are
    embedded and tend not to swap much (if at all), so we probably rarely hit
    the case where dirty is missing or young is missing, but I think Shan has
    a piece of SW that can reliably reproduce it using a shared writable
    mapping & fork or something like that.

    On archs that use SW tracking of dirty & young, a page without dirty is
    effectively mapped read-only and a page without young is inaccessible in
    the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
    "fix it up" by calling get_user_pages(), which would then be equivalent to
    taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for this
    already too complex function, and instead added a new simpler
    fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
    which the futex code can call.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
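
The retry loop described above can be modelled in userspace. The sketch below is a toy, not kernel code: `struct toy_pte`, `fixup_gup_like()`, `fixup_fault_like()` and `retry_write()` are hypothetical names, and the retry cap exists only so that the livelock shows up as a return value instead of a hang.

```c
#include <stdbool.h>

/* Toy PTE for an architecture that tracks dirty/young in software:
 * a write only succeeds when the page is writable AND already dirty. */
struct toy_pte {
    bool writable;
    bool dirty;
};

/* Models the atomic user access done with page faults disabled. */
bool atomic_write_ok(const struct toy_pte *pte)
{
    return pte->writable && pte->dirty;
}

/* gup-like fixup: sees a writable PTE, reports success, and leaves the
 * PTE dirty bit untouched. */
bool fixup_gup_like(struct toy_pte *pte)
{
    return pte->writable;
}

/* Fault-like fixup: behaves as a real write fault would and sets dirty. */
bool fixup_fault_like(struct toy_pte *pte)
{
    if (!pte->writable)
        return false;
    pte->dirty = true;
    return true;
}

/* Retry loop shaped like the futex path. Returns the number of attempts
 * that were needed, or -1 if the fixup failed or max_tries ran out
 * (the latter standing in for the livelock). */
int retry_write(struct toy_pte *pte, bool (*fixup)(struct toy_pte *),
                int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (atomic_write_ok(pte))
            return attempt;
        if (!fixup(pte))
            return -1;
    }
    return -1;
}
```

With a writable but clean PTE, the gup-like fixup never converges however many tries are allowed, while the fault-like fixup succeeds on the second attempt.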
     
  • radix_tree_tagged() is lockless - it reads from a member of the radix-tree
    root node. It does not require any protection.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With zone_reclaim_mode enabled, it's possible for zones to be considered
    full in the zonelist_cache so they are skipped in the future. If the
    process enters direct reclaim, the ZLC may still consider zones to be full
    even after reclaiming pages. Reconsider all zones for allocation if
    direct reclaim returns successfully.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
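
A toy model of the stale "full" marking may help. It assumes a hypothetical `struct toy_zlc` with one bit per zone, and is only a sketch of the idea: a cached "full" bit keeps a zone from being tried again until the cache is cleared, even after reclaim has freed pages there.

```c
#include <stdbool.h>

#define NR_ZONES 4

/* Hypothetical model of the zonelist cache: one "full" bit per zone. */
struct toy_zlc {
    unsigned int fullzones;        /* bit i set: skip zone i */
    int free_pages[NR_ZONES];      /* pages currently free in each zone */
};

/* Try each zone in order, skipping zones cached as full. A zone that
 * turns out to be empty is marked full. Returns the zone used, or -1. */
int alloc_page_from(struct toy_zlc *zlc)
{
    for (int i = 0; i < NR_ZONES; i++) {
        if (zlc->fullzones & (1u << i))
            continue;
        if (zlc->free_pages[i] > 0) {
            zlc->free_pages[i]--;
            return i;
        }
        zlc->fullzones |= 1u << i;
    }
    return -1;
}

/* Direct reclaim frees pages in a zone; clear_zlc selects whether the
 * cached "full" bits are reset afterwards, as the patch does. */
void direct_reclaim(struct toy_zlc *zlc, int zone, int pages, bool clear_zlc)
{
    zlc->free_pages[zone] += pages;
    if (clear_zlc)
        zlc->fullzones = 0;
}
```

Without the reset, an allocation attempted right after a successful reclaim still skips every zone in this model.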
     
  • There have been a small number of complaints about significant stalls
    while copying large amounts of data on NUMA machines reported on a
    distribution bugzilla. In these cases, zone_reclaim was enabled by
    default due to large NUMA distances. In general, the complaints have not
    been about the workload itself unless it was a file server (in which case
    the recommendation was to disable zone_reclaim).

    The stalls are mostly due to significant amounts of time spent scanning
    the preferred zone for pages to free. After a failure, it might fall back
    to another node (as zonelists are often node-ordered rather than
    zone-ordered) but stall quickly again when the next allocation attempt
    occurs. In bad cases, each page allocated results in a full scan of the
    preferred zone.

    Patch 1 checks the preferred zone for recent allocation failure,
    which is particularly important if zone_reclaim has failed
    recently. This avoids rescanning the zone in the near future and
    instead falls back to another node. This may hurt node locality
    in some cases but a failure to zone_reclaim is more expensive than
    a remote access.

    Patch 2 clears the zlc information after direct reclaim.
    Otherwise, zone_reclaim can mark zones full, direct reclaim can
    reclaim enough pages but the zone is still not considered for
    allocation.

    This was tested on a 24-thread 2-node x86_64 machine. The tests were
    focused on large amounts of IO. All tests were bound to the CPUs on
    node-0 to avoid disturbances due to processes being scheduled on different
    nodes. The kernels tested are

    3.0-rc6-vanilla   Vanilla 3.0-rc6
    zlcfirst          Patch 1 applied
    zlcreconsider     Patches 1+2 applied

    FS-Mark
    ./fs_mark -d /tmp/fsmark-10813 -D 100 -N 5000 -n 208 -L 35 -t 24 -S0 -s 524288
                      fsmark-3.0-rc6                3.0-rc6                 3.0-rc6
                             vanilla               zlcfirst           zlcreconsider
    Files/s min           54.90 ( 0.00%)        49.80 (-10.24%)        49.10 (-11.81%)
    Files/s mean         100.11 ( 0.00%)       135.17 (25.94%)        146.93 (31.87%)
    Files/s stddev        57.51 ( 0.00%)       138.97 (58.62%)        158.69 (63.76%)
    Files/s max          361.10 ( 0.00%)       834.40 (56.72%)        802.40 (55.00%)
    Overhead min       76704.00 ( 0.00%)     76501.00 ( 0.27%)      77784.00 (-1.39%)
    Overhead mean    1485356.51 ( 0.00%)   1035797.83 (43.40%)    1594680.26 (-6.86%)
    Overhead stddev  1848122.53 ( 0.00%)    881489.88 (109.66%)   1772354.90 ( 4.27%)
    Overhead max     7989060.00 ( 0.00%)   3369118.00 (137.13%)  10135324.00 (-21.18%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 501.49 493.91 499.93
    Total Elapsed Time (seconds) 2451.57 2257.48 2215.92

    MMTests Statistics: vmstat
    Page Ins 46268 63840 66008
    Page Outs 90821596 90671128 88043732
    Swap Ins 0 0 0
    Swap Outs 0 0 0
    Direct pages scanned 13091697 8966863 8971790
    Kswapd pages scanned 0 1830011 1831116
    Kswapd pages reclaimed 0 1829068 1829930
    Direct pages reclaimed 13037777 8956828 8648314
    Kswapd efficiency 100% 99% 99%
    Kswapd velocity 0.000 810.643 826.346
    Direct efficiency 99% 99% 96%
    Direct velocity 5340.128 3972.068 4048.788
    Percentage direct scans 100% 83% 83%
    Page writes by reclaim 0 3 0
    Slabs scanned 796672 720640 720256
    Direct inode steals 7422667 7160012 7088638
    Kswapd inode steals 0 1736840 2021238

    Test completes far faster with a large increase in the number of files
    created per second. Standard deviation is high as a small number of
    iterations were much higher than the mean. The number of pages scanned by
    zone_reclaim is reduced and kswapd is used for more work.

    LARGE DD
                          3.0-rc6         3.0-rc6         3.0-rc6
                          vanilla        zlcfirst   zlcreconsider
    download tar       59 ( 0.00%)     59 ( 0.00%)     55 ( 7.27%)
    dd source files   527 ( 0.00%)    296 (78.04%)    320 (64.69%)
    delete source      36 ( 0.00%)     19 (89.47%)     20 (80.00%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 125.03 118.98 122.01
    Total Elapsed Time (seconds) 624.56 375.02 398.06

    MMTests Statistics: vmstat
    Page Ins 3594216 439368 407032
    Page Outs 23380832 23380488 23377444
    Swap Ins 0 0 0
    Swap Outs 0 436 287
    Direct pages scanned 17482342 69315973 82864918
    Kswapd pages scanned 0 519123 575425
    Kswapd pages reclaimed 0 466501 522487
    Direct pages reclaimed 5858054 2732949 2712547
    Kswapd efficiency 100% 89% 90%
    Kswapd velocity 0.000 1384.254 1445.574
    Direct efficiency 33% 3% 3%
    Direct velocity 27991.453 184832.737 208171.929
    Percentage direct scans 100% 99% 99%
    Page writes by reclaim 0 5082 13917
    Slabs scanned 17280 29952 35328
    Direct inode steals 115257 1431122 332201
    Kswapd inode steals 0 0 979532

    This test downloads a large tarfile and copies it with dd a number of
    times - similar to the most recent bug report I've dealt with. Time to
    completion is reduced. The number of pages scanned directly is still
    disturbingly high with a low efficiency but this is likely due to the
    number of dirty pages encountered. The figures could probably be improved
    with more work around how kswapd is used and how dirty pages are handled
    but that is separate work and this result is significant on its own.

    Streaming Mapped Writer
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 124.47 111.67 112.64
    Total Elapsed Time (seconds) 2138.14 1816.30 1867.56

    MMTests Statistics: vmstat
    Page Ins 90760 89124 89516
    Page Outs 121028340 120199524 120736696
    Swap Ins 0 86 55
    Swap Outs 0 0 0
    Direct pages scanned 114989363 96461439 96330619
    Kswapd pages scanned 56430948 56965763 57075875
    Kswapd pages reclaimed 27743219 27752044 27766606
    Direct pages reclaimed 49777 46884 36655
    Kswapd efficiency 49% 48% 48%
    Kswapd velocity 26392.541 31363.631 30561.736
    Direct efficiency 0% 0% 0%
    Direct velocity 53780.091 53108.759 51581.004
    Percentage direct scans 67% 62% 62%
    Page writes by reclaim 385 122 1513
    Slabs scanned 43008 39040 42112
    Direct inode steals 0 10 8
    Kswapd inode steals 733 534 477

    This test just creates a large file mapping and writes to it linearly.
    Time to completion is again reduced.

    The gains are mostly down to two things. In many cases, there is less
    scanning as zone_reclaim simply gives up faster due to recent failures.
    The second reason is that memory is used more efficiently. Instead of
    scanning the preferred zone every time, the allocator falls back to
    another zone and uses it instead improving overall memory utilisation.

    This patch: initialise ZLC for first zone eligible for zone_reclaim.

    The zonelist cache (ZLC) is used among other things to record if
    zone_reclaim() failed for a particular zone recently. The intention is to
    avoid a high cost scanning extremely long zonelists or scanning within the
    zone uselessly.

    Currently the zonelist cache is setup only after the first zone has been
    considered and zone_reclaim() has been called. The objective was to avoid
    a costly setup but zone_reclaim is itself quite expensive. If it is
    failing regularly such as the first eligible zone having mostly mapped
    pages, the cost in scanning and allocation stalls is far higher than the
    ZLC initialisation step.

    This patch initialises ZLC before the first eligible zone calls
    zone_reclaim(). Once initialised, it is checked whether the zone failed
    zone_reclaim recently. If it has, the zone is skipped. As the first zone
    is now being checked, additional care has to be taken about zones marked
    full. A zone can be marked "full" because it should not have enough
    unmapped pages for zone_reclaim but this is excessive as direct reclaim or
    kswapd may succeed where zone_reclaim fails. Only mark zones "full" after
    zone_reclaim fails if it failed to reclaim enough pages after scanning.

    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently the faulted page stays locked throughout the whole __do_fault()
    call (except on the page_mkwrite code path) after the file system's fault
    code has been called. If we do early COW, we allocate a new page which has
    to be charged to a memcg (mem_cgroup_newpage_charge).

    This function, however, might block for an unbounded amount of time if the
    memcg oom killer is disabled or a fork-bomb is running, because the only
    way out of the OOM situation is either an external event or a fix of the
    OOM situation.

    In the end we keep the faulted page locked and block other processes from
    faulting it in, which is not good at all because we are basically
    punishing a potentially unrelated process for an OOM condition in a
    different group (I have seen a stuck system because of ld-2.11.1.so being
    locked).

    This is easy to test:

    % cgcreate -g memory:A
    % cgset -r memory.limit_in_bytes=64M A
    % cgset -r memory.memsw.limit_in_bytes=64M A
    % cd kernel_dir; cgexec -g memory:A make -j

    Then the whole system will be live-locked until you kill 'make -j'
    by hand (or push reboot...). This is because some important pages in
    a shared library are locked.

    On reflection, the new page does not need to be allocated with
    lock_page() held. Also, ordinary page allocation may dive into a
    long memory reclaim loop while holding lock_page() and cause
    very long latency.

    There are 3 ways.
    1. Do allocation/charge before lock_page().
       Pros. - simple, and can handle page allocation in the same manner.
               This will reduce the holding time of lock_page() in general.
       Cons. - we do page allocation even if ->fault() returns an error.

    2. Do charge after unlock_page(). Even if charge fails, it's just OOM.
       Pros. - no impact on the non-memcg path.
       Cons. - implementation requires special care of the LRU and we need to
               modify page_add_new_anon_rmap()...

    3. Do unlock->charge->lock again.
       Pros. - no impact on the non-memcg path.
       Cons. - this may kill the LOCK_PAGE_RETRY optimization. We need to
               release the lock and get it again...

    This patch moves the "charge" and the memory allocation for the COW page
    before lock_page(). We then avoid scanning the LRU while holding
    a lock on a page, and latency under lock_page() is reduced.

    With that, the livelock above disappears.

    [akpm@linux-foundation.org: fix code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Lutz Vieweg
    Original-idea-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 2.6.36's 7e496299d4d2 ("tmpfs: make tmpfs scalable with percpu_counter for
    used blocks") to make tmpfs scalable with percpu_counter used
    inode->i_lock in place of sbinfo->stat_lock around i_blocks updates; but
    that was adverse to scalability, and unnecessary, since info->lock is
    already held there in the fast paths.

    Remove those uses of i_lock, and add info->lock in the three error paths
    where it's then needed across shmem_free_blocks(). It's not actually
    needed across shmem_unacct_blocks(), but they're so often paired that it
    looks wrong to split them apart.

    Signed-off-by: Hugh Dickins
    Acked-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • truncate_inode_pages_range()'s final loop has a nice pincer property,
    bringing start and end together, squeezing out the last pages. But the
    range handling missed out on that, just sliding up the range, perhaps
    letting pages come in behind it. Add one more test to give it the same
    pincer effect.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pagevec_lookup loops in truncate_inode_pages_range(),
    invalidate_mapping_pages() and invalidate_inode_pages2_range() more
    consistent with each other.

    They were relying upon page->index of an unlocked page, but apologizing
    for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
    index handling.

    invalidate_inode_pages2_range() had special handling for a wrapped
    page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
    near there, and a corrupt page->index in the radix_tree could cause more
    trouble than that would catch. Remove that wrapped handling.

    invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
    when near the end of the range: copy that into the other two, although
    it's less useful than you might think (it limits the use of the buffer,
    rather than the indices looked up).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
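
The min() clamp on the lookup size can be sketched as a standalone helper. The function name `lookup_count()` is hypothetical, and the buffer capacity of 14 is an assumption for illustration; the point is only the shape of the arithmetic for an inclusive range [index, end] with index <= end.

```c
#define PAGEVEC_SIZE 14UL  /* assumed buffer capacity for this sketch */

/* Number of pages to ask for when looking up [index, end] inclusive.
 * Clamping (end - index) before adding 1 keeps the result correct even
 * when end is the maximum unsigned long, where end - index + 1 could wrap.
 * The caller must ensure index <= end. */
unsigned long lookup_count(unsigned long index, unsigned long end)
{
    unsigned long span = end - index;
    unsigned long cap = PAGEVEC_SIZE - 1;

    return (span < cap ? span : cap) + 1;
}
```

As the message notes, this bounds how much of the buffer is filled, not which indices the lookup may return.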
     
  • Use consistent variable names in truncate_pagecache(), truncate_setsize(),
    vmtruncate() and vmtruncate_range().

    unmap_mapping_range() and vmtruncate_range() have mismatched interfaces:
    don't change either, but make the vmtruncates more precise about what they
    expect unmap_mapping_range() to do.

    vmtruncate_range() is currently called only with page-aligned start and
    end+1: can handle unaligned start, but unaligned end+1 would hit BUG_ON in
    truncate_inode_pages_range() (lacks partial clearing of the end page).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The often-NULL data arg to read_cache_page() and read_mapping_page()
    functions is misdescribed as "destination for read data": no, it's the
    first arg to the filler function, often struct file * to ->readpage().

    Satisfy checkpatch.pl on those filler prototypes, and tidy up the
    declarations in linux/pagemap.h.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • - shmem pages are not immediately available, and they are not
    potentially available either: even if we swap them out, they just
    relocate from memory into swap. The total amount of immediately and
    potentially available memory is not affected, so we shouldn't count
    them as potentially free in the first place.

    - nr_free_pages() is no longer an expensive operation, so there is no
    need to split the decision making into two halves and repeat code.

    Signed-off-by: Dmitry Fink
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Fink
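
The accounting argument can be written down as a small estimate. This is a sketch under assumptions: `struct mem_counters` and its field names are illustrative, not kernel identifiers, and which other terms belong in the sum (free swap, reclaimable slab) is assumed here. Only the subtraction of shmem from the page cache figure comes from the message above.

```c
/* Hypothetical snapshot of the counters involved; field names are
 * illustrative, not kernel identifiers. All values are in pages. */
struct mem_counters {
    long free_pages;
    long file_pages;        /* page cache, which includes shmem pages */
    long shmem_pages;
    long swap_pages;        /* free swap slots */
    long slab_reclaimable;
};

/* Estimate of immediately plus potentially available pages. shmem is
 * subtracted because swapping it out only moves it from RAM to swap. */
long potentially_free(const struct mem_counters *c)
{
    return c->free_pages
         + c->file_pages - c->shmem_pages
         + c->swap_pages
         + c->slab_reclaimable;
}
```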
     
  • RED_INACTIVE is a slab thing, and reusing it for memblock was
    inappropriate, because memblock is dealing with phys_addr_t's which have a
    Kconfigurable sizeof().

    Create a new poison type for this application. Fixes the sparse warning

    warning: cast truncates bits from constant value (9f911029d74e35b becomes 9d74e35b)

    Reported-by: H Hartley Sweeten
    Tested-by: H Hartley Sweeten
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
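
The truncation in the warning is easy to reproduce in userspace. The sketch below assumes the constant is the 64-bit value quoted in the warning (written here with its leading zero); `truncate_to_32()` is a hypothetical helper standing in for a 32-bit phys_addr_t.

```c
#include <stdint.h>

/* 64-bit value quoted in the sparse warning above. */
#define RED_INACTIVE_64 0x09F911029D74E35BULL

/* What a 32-bit physical address type keeps of the 64-bit constant:
 * only the low 32 bits survive the cast. */
uint32_t truncate_to_32(uint64_t value)
{
    return (uint32_t)value;
}
```

Passing `RED_INACTIVE_64` yields `0x9D74E35B`, the value named in the warning.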
     
  • The badness() function in the oom killer was renamed to oom_badness() in
    a63d83f427fb ("oom: badness heuristic rewrite") for clarity, since it is a
    globally exported function.

    The prototype for the old function still existed in linux/oom.h, so remove
    it. There are no existing users.

    Also fixes documentation and comment references to badness() and adjusts
    them accordingly.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • ZAP_BLOCK_SIZE became unused in the preemptible-mmu_gather work ("mm:
    Remove i_mmap_lock lockbreak"). So zap it.

    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Fix coding style issues flagged by checkpatch.pl

    Signed-off-by: Chris Forbes
    Acked-by: Eric B Munson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Forbes
     
  • The lock is released first thing in all three branches. Simplify this by
    unconditionally releasing lock and remove else clause which was only there
    to be sure lock was released.

    Signed-off-by: Chris Wright
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
     
  • Commit a539f3533b78e3 ("mm: add SECTION_ALIGN_UP() and
    SECTION_ALIGN_DOWN() macro") introduced the SECTION_ALIGN_UP() and
    SECTION_ALIGN_DOWN() macros. Use those macros to increase code
    readability.

    Signed-off-by: Daniel Kiper
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
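
For readers unfamiliar with the macros, here is a userspace sketch of what they compute. The section size of 2^15 pages is an assumption for illustration, since the real value depends on architecture and configuration. The wrapper functions are hypothetical and exist only to make the macros easy to call.

```c
/* Assumed section size for this sketch: 2^15 pages per section. The real
 * value depends on the architecture and configuration. */
#define PFN_SECTION_SHIFT 15
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION - 1))

/* Round a page frame number up or down to a section boundary. */
#define SECTION_ALIGN_UP(pfn) \
    (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
#define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)

/* Thin wrappers so the macros can be exercised like functions. */
unsigned long section_align_up(unsigned long pfn)
{
    return SECTION_ALIGN_UP(pfn);
}

unsigned long section_align_down(unsigned long pfn)
{
    return SECTION_ALIGN_DOWN(pfn);
}
```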
     
  • In commit a2c8990aed5ab ("memsw: remove noswapaccount kernel parameter"),
    Michal forgot to remove some leftover pieces of noswapaccount in the tree;
    this patch removes them all.

    Signed-off-by: WANG Cong
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • Commit bae9c19bf1 ("thp: split_huge_page_mm/vma") changed locking behavior
    of walk_page_range(). Thus this patch changes the comment too.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Originally, walk_hugetlb_range() didn't require the caller to take any
    lock. But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range") changed that rule, because it added a find_vma() call
    in walk_hugetlb_range().

    Any commit that changes a locking rule should document it too.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, walk_page_range() calls find_vma() on every page table walk
    iteration, but that is completely unnecessary if walk->hugetlb_entry is
    unused. And we should not assume find_vma() is a lightweight
    operation. So this patch checks walk->hugetlb_entry and avoids the
    find_vma() call if possible.

    This patch also makes some cleanups: 1) remove the ugly
    uninitialized_var() and 2) remove the #ifdef in the function body.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The doc of find_vma() says,

    /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
    (snip)

    Thus, caller should confirm whether the returned vma matches a desired one.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
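
The pitfall is easier to see with a toy lookup. The sketch below is a userspace model with hypothetical names (`struct toy_vma`, `toy_find_vma()`, `toy_vma_covering()`) over a sorted array; it mirrors only the documented contract "first area with addr < end".

```c
#include <stddef.h>

/* Toy VMA: half-open range [start, end). */
struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/* Like find_vma(): first area (in address order) with addr < end,
 * or NULL if none. The result may start above addr. */
const struct toy_vma *toy_find_vma(const struct toy_vma *vmas, size_t n,
                                   unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

/* Caller-side check: accept the result only if it really covers addr. */
const struct toy_vma *toy_vma_covering(const struct toy_vma *vmas, size_t n,
                                       unsigned long addr)
{
    const struct toy_vma *vma = toy_find_vma(vmas, n, addr);

    if (vma && vma->start <= addr)
        return vma;
    return NULL;
}
```

An address that falls in a hole between two areas gets the following area back from the plain lookup, which is why the extra `start <= addr` check is needed.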
     
  • Document some swap token aging design decisions.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • global_faults and last_aging are only used in grab_swap_token(). Move
    them into grab_swap_token().

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • http://www.cs.wm.edu/~sjiang/token.pdf is now dead. Replace it with a
    live alternative.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This patch contains online_page_callback and appropriate functions for
    registering/unregistering online page callbacks. It allows
    machine-specific tasks to be done during the page onlining stage, which
    is required to implement memory hotplug in virtual machines. Currently
    this patch is required by the latest memory hotplug support patch for
    the Xen balloon driver, which will be posted soon.

    Additionally, the original online_page() function was split into the
    following functions doing "atomic" operations:

    - __online_page_set_limits() - set new limits for memory management code,
    - __online_page_increment_counters() - increment totalram_pages and totalhigh_pages,
    - __online_page_free() - free page to allocator.

    It was done to:
    - not duplicate existing code,
    - ease hotplug code development by using a well-defined interface,
    - avoid stupid bugs which are unavoidable when the same code
    (by design) is developed in many places.
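
    The callback mechanism described above can be sketched in userspace as
    follows. This is an illustration under stated assumptions, not the
    kernel implementation: every name here (struct page as a one-field
    stand-in, the model_* helpers, balloon_online_page) is hypothetical, the
    limit-setting step is omitted, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* One-field stand-in for the kernel's struct page. */
struct page {
    int freed_to_allocator;
};

typedef void (*online_page_cb_t)(struct page *page);

static unsigned long model_total_pages;   /* pages accounted as RAM */
static unsigned long model_free_pages;    /* pages handed to the allocator */

/* Two of the small "atomic" steps the generic path is built from. */
static void model_increment_counters(struct page *page)
{
    (void)page;
    model_total_pages++;
}

static void model_free_page(struct page *page)
{
    page->freed_to_allocator = 1;
    model_free_pages++;
}

/* Default behaviour: account the page and free it to the allocator. */
static void model_generic_online_page(struct page *page)
{
    model_increment_counters(page);
    model_free_page(page);
}

static online_page_cb_t model_online_page_cb = model_generic_online_page;

/* Install a callback; fails if another non-default callback is active. */
static int model_set_online_page_cb(online_page_cb_t cb)
{
    if (model_online_page_cb != model_generic_online_page)
        return -1;
    model_online_page_cb = cb;
    return 0;
}

/* Restore the default; only the currently installed callback may do so. */
static int model_restore_online_page_cb(online_page_cb_t cb)
{
    if (model_online_page_cb != cb)
        return -1;
    model_online_page_cb = model_generic_online_page;
    return 0;
}

/* Hotplug path: hand each new page to whichever callback is installed. */
static void model_online_page(struct page *page)
{
    assert(model_online_page_cb != NULL);
    (*model_online_page_cb)(page);   /* explicit indirect-call syntax */
}

/* Hypothetical balloon-style callback: account the page but hold it back. */
static void balloon_online_page(struct page *page)
{
    model_increment_counters(page);
    /* The page is deliberately not freed to the allocator here. */
}
```

    Because the custom callback reuses the same small helpers as the
    generic path, it does not need to duplicate the accounting code.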

    [akpm@linux-foundation.org: use explicit indirect-call syntax]
    Signed-off-by: Daniel Kiper
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Ian Campbell
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     
  • Vito said:

    : The system has many usb disks coming and going day to day, with their
    : respective bdi's having min_ratio set to 1 when inserted. It works for
    : some time until eventually min_ratio can no longer be set, even when the
    : active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to
    : anywhere near 100.
    :
    : This then leads to an unrelated starvation problem caused by write-heavy
    : fuse mounts being used atop the usb disks, a problem the min_ratio setting
    : at the underlying devices bdi effectively prevents.

    Fix this leakage by resetting the bdi min_ratio when unregistering the
    BDI.
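
    A rough userspace model of the leak and the fix, assuming a single
    global sum of per-device minimum ratios that must stay below 100. The
    names model_bdi, model_min_ratio_sum, model_set_min_ratio() and
    model_unregister() are hypothetical and only mirror the idea; they are
    not the kernel's code.

```c
#include <assert.h>

/* Stand-in for a backing device; only the field of interest is modelled. */
struct model_bdi {
    unsigned int min_ratio;
};

/* Sum of min_ratio over all devices; must stay below 100 percent. */
static unsigned int model_min_ratio_sum;

/* Set a device's minimum ratio, rejecting values that overcommit the sum. */
static int model_set_min_ratio(struct model_bdi *bdi, unsigned int ratio)
{
    unsigned int new_sum;

    if (ratio > 100)
        return -1;

    /* Replace this device's old contribution with the new one. */
    new_sum = model_min_ratio_sum - bdi->min_ratio + ratio;
    if (new_sum >= 100)
        return -1;

    model_min_ratio_sum = new_sum;
    bdi->min_ratio = ratio;
    return 0;
}

/*
 * Unregister a device. Resetting min_ratio here returns the device's
 * share to the global sum; without this step every removed device would
 * leave its share behind and the sum would creep towards the limit.
 */
static void model_unregister(struct model_bdi *bdi)
{
    int ret = model_set_min_ratio(bdi, 0);

    assert(ret == 0);
    assert(bdi->min_ratio == 0);
    (void)ret;   /* avoid an unused warning when NDEBUG is defined */
}
```

    With the reset in place, any number of devices can come and go with
    min_ratio set to 1. If the reset is skipped, the model starts rejecting
    new settings once 99 devices have been removed, which matches the
    symptom in the report.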

    Signed-off-by: Peter Zijlstra
    Reported-by: Vito Caputo
    Cc: Wu Fengguang
    Cc: Miklos Szeredi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • These uses are read-only and in a subsequent patch I have a const struct
    page in my hand...

    [akpm@linux-foundation.org: fix warnings in lowmem_page_address()]
    Signed-off-by: Ian Campbell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Campbell
     
  • This is needed on HIGHMEM systems - we don't always have a virtual
    address so store the physical address and map it in as needed.

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Becky Bruce
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Becky Bruce
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds