07 Jan, 2009

40 commits

  • At this point we already know that 'addr' is not NULL, so get rid of
    the redundant 'if'. gcc probably eliminates it in an optimization pass
    anyway.

    [akpm@linux-foundation.org: use __weak, too]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Wassim Dagash reported the following kswapd infinite-loop problem.

    kswapd runs in some infinite loop trying to swap until order 10 of zone
    highmem is OK.... kswapd will continue to try to balance order 10 of zone
    highmem forever (or until someone releases a very large chunk of highmem).

    For non order-0 allocations, the system may never be balanced due to
    fragmentation but kswapd should not infinitely loop as a result.

    Instead, recheck all watermarks at order-0, as they are the most
    important ones. If those watermarks are OK, kswapd will go back to
    sleep.
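    A sketch of the shape of the fix, at the bottom of balance_pgdat() in
    mm/vmscan.c (hedged: the control flow here is illustrative, not the
    literal patch; zone_watermark_ok() and zone->pages_high are the real
    2.6.28-era symbols):

        /*
         * Fragmentation may mean the node can never be balanced for a
         * high order.  Before looping again, recheck the (most
         * important) order-0 watermarks; if those are fine, stop
         * reclaiming and go back to sleep.
         */
        if (!all_zones_ok && order) {
                int i, zeroes_ok = 1;

                for (i = 0; i < pgdat->nr_zones; i++) {
                        struct zone *zone = pgdat->node_zones + i;

                        if (!populated_zone(zone))
                                continue;
                        if (!zone_watermark_ok(zone, 0, zone->pages_high,
                                               0, 0))
                                zeroes_ok = 0;
                }
                if (zeroes_ok)
                        goto out;       /* balanced at order-0: sleep */
        }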

    [akpm@linux-foundation.org: fix comment]
    Reported-by: wassim dagash
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Nick Piggin
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Moving the request details print-out before the sanity checks that
    might panic() enables us to analyse invalid requests without having
    access to the line information of the stack dump.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When dup_mmap() OOMs we can end up with mm->mmap == NULL. The error
    path does mmput(), and unmap_vmas() gets a NULL vma which it
    dereferences.

    In exit_mmap() there is nothing to do at all for this case, so we can
    bail out of the call path right there.
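    The fix itself is a two-line guard; a simplified sketch of exit_mmap()
    in mm/mmap.c (the real function does more setup first):

        void exit_mmap(struct mm_struct *mm)
        {
                struct vm_area_struct *vma = mm->mmap;

                /* Can happen if dup_mmap() received an OOM */
                if (!vma)
                        return;

                /* ... otherwise proceed: unmap_vmas(), free_pgtables() ... */
        }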

    [akpm@linux-foundation.org: add sorely-needed comment]
    Signed-off-by: Johannes Weiner
    Reported-by: Akinobu Mita
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_queue_congested() was introduced in 2002, but it was never used.
    Remove it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • No architectures use CONFIG_OUT_OF_LINE_PFN_TO_PAGE - it can be removed.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses
    mm->hiwater_xxx directly; this leads to two problems:

    - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any
    moment before the task exits, so we should check the current values of
    rss/vm anyway.

    - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can
    be preempted right before mm->hiwater_xxx = new_val, and another thread
    can use A_LOT of memory and exit in between. When the first thread
    resumes it can be the last thread in the thread group, in that case we
    report the wrong hiwater_xxx values which do not take A_LOT into
    account.

    Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change
    xacct_add_tsk() to use them. The first helper will also be used by
    rusage->ru_maxrss accounting.

    Kill the do_exit()->update_hiwater_xxx() calls. Unless we are going to
    decrease rss/vm there is no point in updating mm->hiwater_xxx, and
    nobody can look at this mm_struct by the time exit_mmap() actually
    unmaps the memory.
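    The helpers are one-liners; a sketch of their likely shape, given the
    description above (hedged: field and helper names as used in this
    changelog):

        static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
        {
                return max(mm->hiwater_rss, get_mm_rss(mm));
        }

        static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
        {
                return max(mm->hiwater_vm, mm->total_vm);
        }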

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Frustratingly, gfp_t is really divided into two classes of flags. One
    class is the context-dependent flags (can we sleep? can we enter the
    filesystem? the block subsystem? should we use some extra reserves?
    etc.). The other is the type of memory required, which depends on how
    the algorithm is implemented rather than on the point at which the
    memory is allocated (highmem? DMA memory? etc.).

    Some of the functions which allocate a page and add it to page cache
    take a gfp_t, but sometimes those functions or their callers aren't
    really doing the right thing: when allocating a pagecache page, the
    memory type should be mapping_gfp_mask(mapping); when allocating
    radix-tree nodes, the memory type should be kernel-mapped (not highmem)
    memory. The gfp_t argument should really only be needed for the
    context-dependent options.

    This patch doesn't really solve that tangle in a nice way, but it does
    attempt to fix a couple of bugs.

    - find_or_create_page() changes its radix-tree allocation to include
    only the main context-dependent flags, so that the pagecache page may
    be allocated from arbitrary types of memory without affecting the
    radix tree. In practice, slab allocations don't come from highmem
    anyway, and the radix tree only uses slab allocations, so there is no
    practical change (unless some fs uses GFP_DMA for pages).

    - grab_cache_page_nowait() is changed to allocate radix-tree nodes with
    GFP_NOFS, because it is not supposed to reenter the filesystem. This
    bug could cause lock recursion if a filesystem is not expecting the
    function to reenter the fs (as-per documentation).

    Filesystems should be careful about exactly what semantics they want
    and what they get when fiddling with gfp_t masks to allocate pagecache:
    be as liberal as possible with the type of memory that can be used, and
    the same for the context-specific flags.
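    To illustrate the first fix, a hedged sketch of find_or_create_page()
    after the change: the page itself may come from any memory type, while
    the radix-tree nodes see only the context-dependent bits (the mask of
    those bits is the one mm/internal.h calls GFP_RECLAIM_MASK):

        struct page *find_or_create_page(struct address_space *mapping,
                                         pgoff_t index, gfp_t gfp_mask)
        {
                struct page *page;
                int err;
        repeat:
                page = find_lock_page(mapping, index);
                if (!page) {
                        page = __page_cache_alloc(gfp_mask);
                        if (!page)
                                return NULL;
                        /* radix-tree nodes are slab: strip memory-type bits */
                        err = add_to_page_cache_lru(page, mapping, index,
                                            gfp_mask & GFP_RECLAIM_MASK);
                        if (unlikely(err)) {
                                page_cache_release(page);
                                page = NULL;
                                if (err == -EEXIST)
                                        goto repeat;
                        }
                }
                return page;
        }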

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Direct IO can invalidate and sync a lot of pagecache pages in the mapping.
    A 4K direct IO will actually try to sync and/or invalidate the pagecache
    of the entire file, for example (which might be many GB or TB large).

    Improve this by doing range syncs. Also, memory no longer has to be
    unmapped to catch the dirty bits for syncing, as dirty bits would remain
    coherent due to dirty mmap accounting.

    This fixes the immediate DM deadlocks when doing direct IO reads to a
    block device with a mounted filesystem, if only by papering over the
    problem somewhat rather than addressing the fsync starvation cases.
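    The range-sync idea, sketched (hedged: filemap_write_and_wait_range()
    and invalidate_inode_pages2_range() are existing kernel helpers; the
    variable names here are illustrative):

        /* Sync and invalidate only what the direct IO actually covers */
        loff_t pos = offset;
        loff_t end = offset + count - 1;
        int ret;

        ret = filemap_write_and_wait_range(mapping, pos, end);
        if (!ret)
                ret = invalidate_inode_pages2_range(mapping,
                                        pos >> PAGE_CACHE_SHIFT,
                                        end >> PAGE_CACHE_SHIFT);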

    Signed-off-by: Nick Piggin
    Reviewed-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix a little of the coding style in mm/mmap.c

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: ZhenwenXu
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    ZhenwenXu
     
  • tiny-shmem shares most of its 130 lines of code with shmem and tends to
    break when particular bits of shmem get modified. Unifying saves code and
    makes keeping these two in sync much easier.

    before:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        396      72       8     476     1dc mm/tiny-shmem.o

    after:
       text    data     bss     dec     hex filename
      14367     392      24   14783    39bf mm/shmem.o
        412      72       8     492     1ec mm/shmem.o (tiny config)

    Signed-off-by: Matt Mackall
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • The initial implementation of checking TIF_MEMDIE covers the OOM-kill
    case: if the process has been OOM killed, TIF_MEMDIE is set and
    get_user_pages() returns immediately. This patch does two things:

    1. Adds the case where the SIGKILL is sent by a user process. The
    process can try to get_user_pages() unlimited memory even if a user
    process has sent it a SIGKILL (maybe a monitor found the process
    exceeding its memory limit and tried to kill it). In the old
    implementation, the SIGKILL wasn't handled until get_user_pages()
    returned.

    2. Changes the return value to ERESTARTSYS. It makes no sense to
    return ENOMEM if get_user_pages() returned because a SIGKILL was
    received. The general convention for a system call interrupted by a
    signal is ERESTARTSYS, so the new return value is consistent with
    that.

    Lee:

    An unfortunate side effect of "make-get_user_pages-interruptible" is
    that it prevents a SIGKILL'd task from munlock-ing pages that it had
    mlocked, resulting in freeing of mlocked pages. Freeing of mlocked
    pages, in itself, is not so bad. We just count them now, although I had
    hoped to remove this stat and add PG_MLOCKED to the free-page flags
    check.

    However, consider pages in shared libraries mapped by more than one task
    that a task mlocked--e.g., via mlockall(). If the task that mlocked the
    pages exits via SIGKILL, these pages would be left mlocked and
    unevictable.

    Proposed fix:

    Add another GUP flag to ignore SIGKILL when calling get_user_pages()
    from munlock(), similar to KOSAKI Motohiro's 'IGNORE_VMA_PERMISSIONS'
    flag for the same purpose. We are not actually allocating memory in
    this case, which "make-get_user_pages-interruptible" intends to avoid.
    We're just munlocking pages that are already resident and mapped, and
    we're reusing get_user_pages() to access those pages.

    ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS' and
    'IGNORE_SIGKILL' into a single flag: GUP_FLAGS_MUNLOCK ???
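    A sketch of the combined result (hedged: the flag spelling here follows
    the discussion above; the exact name in the patch may differ):

        /* Inside the per-page loop of get_user_pages(): */
        if (!(gup_flags & GUP_FLAGS_IGNORE_SIGKILL) &&
            fatal_signal_pending(current)) {
                /*
                 * Don't keep faulting in (and possibly allocating)
                 * pages for a task with a pending SIGKILL.  Callers
                 * like munlock(), which only touch already-resident
                 * pages, pass the ignore flag instead.
                 */
                return i ? i : -ERESTARTSYS;
        }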

    [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
    Signed-off-by: Paul Menage
    Signed-off-by: Ying Han
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Lee Schermerhorn
    Cc: Rohit Seth
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • These three statements manipulate local variables and do not need lock
    coverage.

    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
    which I've followed in print_bad_pte(). These are serious system errors,
    on a par with BUGs, but they're not quite emergencies, and we do our best
    to carry on: say KERN_ALERT "BUG: " like the x86 oops does.

    And remove the "Trying to fix it up, but a reboot is needed" line: it's
    not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() and bad_page() might each need ratelimiting, especially
    for their dump_stack()s, which are almost never of interest yet not
    quite dispensable. Correlating corruption across neighbouring entries
    can be very helpful, so allow a burst of 60 reports before keeping
    quiet for the remainder of that minute (or allow a steady drip of one
    report per second).
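    A sketch of the burst-then-quiet logic (static state inside
    bad_page()/print_bad_pte(); jiffies arithmetic as usual):

        static unsigned long resume;
        static unsigned long nr_shown;
        static unsigned long nr_unshown;

        /*
         * Allow a burst of 60 reports, then keep quiet for that minute;
         * or allow a steady drip of one report per second.
         */
        if (nr_shown == 60) {
                if (time_before(jiffies, resume)) {
                        nr_unshown++;   /* suppressed; just count it */
                        return;
                }
                if (nr_unshown) {
                        printk(KERN_ALERT
                               "BUG: Bad page: %lu messages suppressed\n",
                               nr_unshown);
                        nr_unshown = 0;
                }
                nr_shown = 0;
        }
        if (nr_shown++ == 0)
                resume = jiffies + 60 * HZ;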

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Complete zap_pte_range()'s coverage of bad pagetable entries by calling
    print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
    That needs free_swap_and_cache() to tell it, which will also have shown
    one of those "swap_free" errors (but with much less information).

    Similar checks in fork's copy_one_pte()? No, that would be more noisy
    than helpful: we'll see them when parent and child exec or exit.

    Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
    case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
    VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
    with what happens if do_swap_page() operates a bad swap entry; but don't
    we have patches to be more careful about killing when VM_FAULT_OOM?

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() is so far being called only when zap_pte_range() finds
    a negative page_mapcount, or when there's a fault on a pte_file where
    it does not belong. That's weak coverage when we suspect pagetable
    corruption.

    Originally, it was called when vm_normal_page() found an invalid pfn: but
    pfn_valid is expensive on some architectures and configurations, so 2.6.24
    put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
    2.6.26 replaced it by a VM_BUG_ON (likewise).

    Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper
    test than pfn_valid(): memmap_init_zone() (used in bootup and hotplug)
    keeps a __read_mostly note of the highest_memmap_pfn, and
    vm_normal_page() then checks the pfn against that. We could call this
    pfn_plausible() or pfn_sane(), but I doubt we'll need it elsewhere: of
    course it's not reliable, but it gives much stronger pagetable
    validation on many boxes.

    Also use print_bad_pte() when the pte_special bit is found outside a
    VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
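    Sketched (hedged: print_bad_pte()'s signature here follows the earlier
    patches in this series):

        /* mm/page_alloc.c */
        unsigned long highest_memmap_pfn __read_mostly;

        /* in memmap_init_zone(), for each pfn range being initialized: */
        if (highest_memmap_pfn < end_pfn - 1)
                highest_memmap_pfn = end_pfn - 1;

        /* mm/memory.c, vm_normal_page(): cheaper than pfn_valid() */
        if (unlikely(pfn > highest_memmap_pfn)) {
                print_bad_pte(vma, addr, pte, NULL);
                return NULL;
        }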

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that bad pages are kept out of circulation, there is no need for the
    infamous page_remove_rmap() BUG() - once that page is freed, its negative
    mapcount will issue a "Bad page state" message and the page won't be
    freed. Removing the BUG() allows more info, on subsequent pages, to be
    gathered.

    We do have more info about the page at this point than bad_page() can know
    - notably, what the pmd is, which might pinpoint something like low 64kB
    corruption - but page_remove_rmap() isn't given the address to find that.

    In practice, there is only one call to page_remove_rmap() which has ever
    reported anything, that from zap_pte_range() (usually on exit, sometimes
    on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
    message and leave it all to zap_pte_range().

    mm/memory.c already has a hardly used print_bad_pte() function, showing
    some of the appropriate info: extend it to show what we want for the rmap
    case: pte info, page info (when there is a page) and vma info to compare.
    zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
    use if it works that out for itself.

    Some of this info is also shown in bad_page()'s "Bad page state" message.
    Keep them separate, but adjust them to match each other as far as
    possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
    there too.

    print_bad_pte() shows current->comm unconditionally (though it should
    get repeated in the usually irrelevant stack trace): sorry, I misled
    Nick Piggin into making it conditional on vm_mm == current->mm, but
    current->mm is already NULL in the exit case. Usually current->comm is
    good, though exceptionally it may not be that of the mm (when "swapoff"
    for example).

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Until now the bad_page() checkers have special-cased PageReserved, keeping
    those pages out of circulation thereafter. Now extend the special case to
    all: we want to keep ANY page with bad state out of circulation - the
    "free" page may well be in use by something.

    Leave the bad state of those pages untouched, for examination by
    debuggers; except for PageBuddy - leaving that set would risk bringing the
    page back.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
    a page: check the same flags as before when freeing, clear ALL the flags
    (unless PageReserved) when freeing, check ALL flags off when allocating.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • zone_is_near_oom() is unused.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The vmscan bail-out patch moved the nr_reclaimed variable into struct
    scan_control. Unfortunately, the indirect access can easily cause
    cache misses.

    Under heavy memory pressure that's OK: cache misses are already
    plentiful, so it is not observable. But if memory pressure is light,
    the performance degradation is observable.

    I compared the following three patterns (each measured 10 times):

    hackbench 125 process 3000
    hackbench 130 process 3000
    hackbench 135 process 3000

                  2.6.28-rc6                bail-out
    num      125    130    135         125    130    135
    ==============================================================
    71.866 75.86 81.274 93.414 73.254 193.382
    74.145 78.295 77.27 74.897 75.021 80.17
    70.305 77.643 75.855 70.134 77.571 79.896
    74.288 73.986 75.955 77.222 78.48 80.619
    72.029 79.947 78.312 75.128 82.172 79.708
    71.499 77.615 77.042 74.177 76.532 77.306
    76.188 74.471 83.562 73.839 72.43 79.833
    73.236 75.606 78.743 76.001 76.557 82.726
    69.427 77.271 76.691 76.236 79.371 103.189
    72.473 76.978 80.643 69.128 78.932 75.736

    avg 72.545 76.767 78.534 76.017 77.03 93.256
    std 1.89 1.71 2.41 6.29 2.79 34.16
    min 69.427 73.986 75.855 69.128 72.43 75.736
    max 76.188 79.947 83.562 93.414 82.172 193.382

    about a 4-5% degradation.

    This patch therefore introduces a temporary local variable.
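    The shape of the fix, sketched: accumulate on the stack and write back
    to scan_control once (the elided loop body is the existing code):

        static void shrink_zone(int priority, struct zone *zone,
                                struct scan_control *sc)
        {
                /*
                 * Local copy: the hot loop below would otherwise take a
                 * cache miss dereferencing sc on every iteration.
                 */
                unsigned long nr_reclaimed = sc->nr_reclaimed;

                /* ... for each LRU list, while there is work to do: */
                nr_reclaimed += shrink_list(l, nr_to_scan, zone, sc,
                                            priority);

                /* single write-back at the end */
                sc->nr_reclaimed = nr_reclaimed;
        }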

    result:

                  2.6.28-rc6               this patch
    num      125    130    135         125    130    135
    ==============================================================
    71.866 75.86 81.274 67.302 68.269 77.161
    74.145 78.295 77.27 72.616 72.712 79.06
    70.305 77.643 75.855 72.475 75.712 77.735
    74.288 73.986 75.955 69.229 73.062 78.814
    72.029 79.947 78.312 71.551 74.392 78.564
    71.499 77.615 77.042 69.227 74.31 78.837
    76.188 74.471 83.562 70.759 75.256 76.6
    73.236 75.606 78.743 69.966 76.001 78.464
    69.427 77.271 76.691 69.068 75.218 80.321
    72.473 76.978 80.643 72.057 77.151 79.068

    avg 72.545 76.767 78.534 70.425 74.2083 78.462
    std 1.89 1.71 2.41 1.66 2.34 1.00
    min 69.427 73.986 75.855 67.302 68.269 76.6
    max 76.188 79.947 83.562 72.616 77.151 80.321

    The degradation has disappeared.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • When the VM is under pressure, it can happen that several direct reclaim
    processes are in the pageout code simultaneously. It also happens that
    the reclaiming processes run into mostly referenced, mapped and dirty
    pages in the first round.

    This results in multiple direct reclaim processes having a lower
    pageout priority, which corresponds to a higher target of pages to
    scan.

    This in turn can result in each direct reclaim process freeing
    many pages. Together, they can end up freeing way too many pages.

    This kicks useful data out of memory (in some cases more than half
    of all memory is swapped out). It also impacts performance by
    keeping tasks stuck in the pageout code for too long.

    A 30% improvement in hackbench has been observed with this patch.

    The fix is relatively simple: in shrink_zone() we check how many pages
    we have already freed; direct reclaim tasks break out of the scanning
    loop if they have already freed enough pages and have reached a lower
    priority level.

    We do not break out of shrink_zone() when priority == DEF_PRIORITY,
    to ensure that equal pressure is applied to every zone in the common
    case.

    However, in order to do this we do need to know how many pages we already
    freed, so move nr_reclaimed into scan_control.
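
    The check itself is small; a sketch of the bail-out at the bottom of
    shrink_zone()'s scan loop (DEF_PRIORITY and current_is_kswapd() are the
    real kernel symbols):

        /*
         * Freed enough?  Direct reclaimers below the starting priority
         * can stop; kswapd keeps going so zones stay balanced.
         */
        if (sc->nr_reclaimed > sc->swap_cluster_max &&
            priority < DEF_PRIORITY && !current_is_kswapd())
                break;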

    akpm: a historical interlude...

    We tried this in 2004:

    :commit e468e46a9bea3297011d5918663ce6d19094cf87
    :Author: akpm
    :Date: Thu Jun 24 15:53:52 2004 +0000
    :
    :[PATCH] vmscan.c: dont reclaim too many pages
    :
    : The shrink_zone() logic can, under some circumstances, cause far too many
    : pages to be reclaimed. Say, we're scanning at high priority and suddenly hit
    : a large number of reclaimable pages on the LRU.
    : Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.

    And we reverted it in 2006:

    :commit 210fe530305ee50cd889fe9250168228b2994f32
    :Author: Andrew Morton
    :Date: Fri Jan 6 00:11:14 2006 -0800
    :
    : [PATCH] vmscan: balancing fix
    :
    : Revert a patch which went into 2.6.8-rc1. The changelog for that patch was:
    :
    : The shrink_zone() logic can, under some circumstances, cause far too many
    : pages to be reclaimed. Say, we're scanning at high priority and suddenly
    : hit a large number of reclaimable pages on the LRU.
    :
    : Change things so we bale out when SWAP_CLUSTER_MAX pages have been
    : reclaimed.
    :
    : Problem is, this change caused significant imbalance in inter-zone scan
    : balancing by truncating scans of larger zones.
    :
    : Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone
    : balancing algorithm would require that if we're scanning 100 pages of
    : ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will
    : cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
    : reclaimed. Thus effectively causing smaller zones to be scanned relatively
    : harder than large ones.
    :
    : Now I need to remember what the workload was which caused me to write this
    : patch originally, then fix it up in a different way...

    And we haven't demonstrated that whatever problem caused that reversion is
    not being reintroduced by this change in 2008.

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Fix the following sparse warnings:

    mm/hugetlb.c:375:3: warning: returning void-valued expression
    mm/hugetlb.c:408:3: warning: returning void-valued expression
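    The fix for this class of warning is mechanical; a minimal illustration
    (hypothetical helper names, not the hugetlb code itself):

        static void frob(void);

        static void caller(void)
        {
                /*
                 * before: "returning void-valued expression"
                 *      return frob();
                 * after: plain call, then return
                 */
                frob();
                return;
        }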

    Signed-off-by: Hannes Eder
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hannes Eder
     
  • Remove the srandom32((u32)get_seconds()) from non-rotational swapon:
    there's been a coincidental discussion of earlier randomization; assume
    that goes ahead, and let swapon be a client rather than stirring for
    itself.

    Signed-off-by: Hugh Dickins
    Cc: David Woodhouse
    Cc: Donjun Shin
    Cc: James Bottomley
    Cc: Jens Axboe
    Cc: Joern Engel
    Cc: KAMEZAWA Hiroyuki
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to
    sector_t: given the constraints on swap offsets (in particular, the 5 bits
    of swap type accommodated in the same unsigned long), pgoff_t was actually
    safe as is, but it certainly looked worrying when shifted left.

    [akpm@linux-foundation.org: fix shift overflow]
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though attempting to find free clusters (Andrea), swap allocation has
    always restarted its searches from the beginning of the swap area (sct),
    to reduce seek times between swap pages, by not scattering them all over
    the partition.

    But on a solidstate swap device, seeks are cheap, and block remapping to
    level the wear may be limited by zones: in that case it's better to cycle
    around the whole partition.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap allocation has always started from the beginning of the swap
    area; but if we're dealing with a solidstate swap device which can only
    remap blocks within limited zones, that would sooner wear out the first
    zone.

    Therefore sys_swapon() tests whether the block queue is non-rotational,
    and if so randomizes the cluster_next starting position for allocation.

    If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
    with an "SS" at the right end of the kernel's "Adding ... swap" message
    (so that if it's both nonrot and discardable, "SSD" will be shown there).
    Perhaps something should be shown in /proc/swaps (swapon -s), but we have
    to be more cautious before making any addition to that format.
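    Sketched (hedged: blk_queue_nonrot() and random32() are the 2.6.28-era
    interfaces; 'p' is the swap_info_struct being set up):

        /* in sys_swapon(), once the backing bdev is known: */
        if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
                p->flags |= SWP_SOLIDSTATE;
                /* random start position helps SSD wear levelling */
                p->cluster_next = 1 + (random32() % p->highest_bit);
        }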

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When scan_swap_map() finds a free cluster of swap pages to allocate,
    discard the old contents of the cluster if the device supports discard.
    But don't bother when swap is so fragmented that we allocate single pages.

    Be careful about racing allocations made while we're scanning for a
    cluster; and hold up allocations made while we're discarding.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When adding swap, all the old data on swap can be forgotten:
    sys_swapon() discards all but the header page of the swap partition (or
    every extent but the header of the swap file), to give a solidstate
    swap device the opportunity to optimize its wear-levelling.

    If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
    "D" at the right end of the kernel's "Adding ... swap" message. Perhaps
    something should be shown in /proc/swaps (swapon -s), but we have to be
    more cautious before making any addition to that format.
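    The discard is delegated to the block layer; a hedged sketch of the
    per-extent loop (blkdev_issue_discard() is the real helper; the extent
    walk is simplified):

        static int discard_swap(struct swap_info_struct *si)
        {
                struct swap_extent *se;
                int err = 0;

                list_for_each_entry(se, &si->extent_list, list) {
                        sector_t start_block = se->start_block
                                                << (PAGE_SHIFT - 9);
                        sector_t nr_blocks = (sector_t)se->nr_pages
                                                << (PAGE_SHIFT - 9);

                        if (se->start_page == 0) {
                                /* never discard the swap header page */
                                start_block += 1 << (PAGE_SHIFT - 9);
                                nr_blocks -= 1 << (PAGE_SHIFT - 9);
                                if (!nr_blocks)
                                        continue;
                        }
                        err = blkdev_issue_discard(si->bdev, start_block,
                                                   nr_blocks, GFP_KERNEL);
                        if (err)
                                break;
                        cond_resched();
                }
                return err;     /* caller notes SWP_DISCARDABLE on success */
        }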

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before making functional changes, rearrange scan_swap_map() to simplify
    subsequent diffs. Actually, there is one functional change in there:
    leave cluster_nr negative while scanning for a new cluster - resetting it
    early increased the likelihood that when we have difficulty finding a free
    cluster, another task may come in and try doing exactly the same - just a
    waste of cpu.

    Before making functional changes, rearrange struct swap_info_struct
    slightly: flags will be needed as an unsigned long (for wait_on_bit), next
    is a good int to pair with prio, old_block_size is uninteresting so shift
    it to the end.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The kernel has not supported v0 SWAP-SPACE since 2.5.22: I think we can
    now safely drop its "version 0 swap is no longer supported" message - just
    say "Unable to find swap-space signature" as usual. This removes one
    level of indentation from a stretch of sys_swapon().

    I'd have liked to be specific, saying "Unable to find SWAPSPACE2
    signature", but it's just too confusing that the version 1 signature shows
    the number 2.

    Irrelevant nearby cleanup: kmap(page) already gives page_address(page).

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove trailing whitespace from swapfile.c, and odd swap_show() alignment.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • sys_swapon()'s swapfilesize (better renamed swapfilepages) is declared
    as an int, but should be an unsigned long like the maxpages it's
    compared against: on 64-bit (with 4kB pages) a swapfile of 2^44 bytes
    is 2^32 pages, which overflows an int, so it was rejected with "Swap
    area shorter than signature indicates".

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Sparse emits the following warning:

    mm/page_alloc.c:4301:6: warning: symbol 'setup_per_zone_inactive_ratio' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Sparse emits the following warning:

    mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Sparse emits the following warning:

    mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Sparse emits the following warning:

    mm/memcontrol.c:782:5: warning: symbol 'mem_cgroup_resize_limit' was not declared. Should it be static?

    Clean it up by declaring the function static.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro