01 Nov, 2011

40 commits

  • A process spent 30 minutes exiting, just munlocking the pages of a large
    anonymous area that had been alternately mprotected into page-sized vmas:
    for every single page there's an anon_vma walk through all the other
    little vmas to find the right one.

    A general fix to that would be a lot more complicated (use prio_tree on
    anon_vma?), but there's one very simple thing we can do to speed up the
    common case: if a page to be munlocked is mapped only once, then it is our
    vma that it is mapped into, and there's no need whatever to walk through
    all the others.
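
    A minimal sketch of that fast path, with a hypothetical helper name
    (the munlock code lives in mm/mlock.c); this is an illustration of the
    idea, not the exact kernel diff:

        /*
         * Sketch: the page is locked and known to be mlocked. If it is
         * mapped exactly once, that mapping must be ours, so the
         * expensive rmap walk can be skipped.
         */
        static void __munlock_page_sketch(struct page *page)
        {
                ClearPageMlocked(page);
                if (page_mapcount(page) > 1)
                        try_to_munlock(page);   /* slow path: walk the vmas */
                /* mapcount == 1: no other vma can hold this page mlocked */
        }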

    Okay, there is a very remote race in munlock_vma_pages_range(): if,
    between its follow_page() and lock_page(), another process were to
    munlock the same page, then page reclaim were to remove it from our
    vma, then another process were to mlock it again, we would find it with
    page_mapcount 1 while it is still mlocked in another process. But never
    mind: that's much less likely than the down_read_trylock() failure
    which munlocking already tolerates (in try_to_unmap_one()); in due
    course page reclaim will discover the page and move it to the
    unevictable list instead.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Hugh Dickins
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There are three calls to update_mmu_cache() in the file, and the one in
    collapse_huge_page() has a typo in its last parameter, which is
    corrected based on the other two calls.

    Given how x86, currently the only arch that implements THP, defines
    update_mmu_cache() (as a no-op), the change has no visible effect
    there, but it could save a minute or two of effort for the archs that
    are likely to support THP in the future.

    Signed-off-by: Hillf Danton
    Cc: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • The THP copy-on-write handler falls back to regular-sized pages for a huge
    page replacement upon allocation failure or if THP has been individually
    disabled in the target VMA. The loop responsible for copying page-sized
    chunks accidentally uses multiples of PAGE_SHIFT instead of PAGE_SIZE as
    the virtual address arg for copy_user_highpage().
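
    A sketch of the corrected copy loop (variable names approximate the
    mm/huge_memory.c code; haddr is the huge-page-aligned fault address):

        for (i = 0; i < HPAGE_PMD_NR; i++)
                copy_user_highpage(pages[i], page + i,
                                   haddr + PAGE_SIZE * i, /* was PAGE_SHIFT * i */
                                   vma);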

    Signed-off-by: Hillf Danton
    Acked-by: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • MCL_FUTURE does not move pages between LRU lists, and draining the
    per-CPU LRU pagevecs is a nasty activity. Avoid doing it unnecessarily.
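
    Roughly, the fix makes the drain conditional on the flags, sketched
    here:

        /* sketch of sys_mlockall(): pages move onto the unevictable
         * LRU only for MCL_CURRENT, so only drain in that case */
        if (flags & MCL_CURRENT)
                lru_add_drain_all();    /* skipped for MCL_FUTURE-only */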

    Signed-off-by: Christoph Lameter
    Cc: David Rientjes
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If compaction can proceed, shrink_zones() stops doing any work but its
    callers still call shrink_slab() which raises the priority and potentially
    sleeps. This is unnecessary and wasteful so this patch aborts direct
    reclaim/compaction entirely if compaction can proceed.
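
    The caller side, sketched with shrink_zones() assumed to return
    whether to abort:

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                if (shrink_zones(priority, zonelist, sc))
                        break;  /* compaction can proceed: stop reclaiming */
                shrink_slab(shrink, sc->nr_scanned, lru_pages);
                /* ... rest of the direct reclaim loop ... */
        }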

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Josh Boyer
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When suffering from memory fragmentation due to unfreeable pages, THP page
    faults will repeatedly try to compact memory. Due to the unfreeable
    pages, compaction fails.

    Needless to say, at that point page reclaim also fails to create free
    contiguous 2MB areas. However, that doesn't stop the current code from
    trying, over and over again, and freeing a minimum of 4MB (2UL <<
    sc->order pages) at every single invocation.

    This resulted in my 12GB system having 2-3GB free memory, a corresponding
    amount of used swap and very sluggish response times.

    This can be avoided by having the direct reclaim code not reclaim from
    zones that already have plenty of free memory available for compaction.

    If compaction still fails due to unmovable memory, doing additional
    reclaim will only hurt the system, not help.
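
    A sketch of the per-zone check in shrink_zones(), using the compaction
    helpers of that era (should_abort_reclaim is a stand-in flag):

        /* sketch: for costly orders, skip reclaiming a zone that
         * already has enough free memory for compaction to run */
        if (COMPACTION_BUILD &&
            sc->order > PAGE_ALLOC_COSTLY_ORDER &&
            compaction_suitable(zone, sc->order) == COMPACT_CONTINUE) {
                should_abort_reclaim = true;
                continue;
        }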

    [jweiner@redhat.com: change comment to explain the order check]
    Signed-off-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When a race between putback_lru_page() and shmem_lock with lock=0
    happens, program execution order is as follows, but the clear_bit on
    processor #1 could be reordered to right before processor #1's
    spin_unlock. Then the page would be stranded on the unevictable list.

    putback_lru_page() (processor #2)   shmem_lock(lock=0) (processor #1)
    ---------------------------------   ---------------------------------
    spin_lock
    SetPageLRU
    spin_unlock
                                        clear_bit(AS_UNEVICTABLE)
                                        spin_lock
                                        if PageLRU()
                                            if !test_bit(AS_UNEVICTABLE)
                                                move evictable list
    smp_mb
    if !test_bit(AS_UNEVICTABLE)
        move evictable list
                                        spin_unlock

    But pagevec_lookup() in scan_mapping_unevictable_pages() has
    rcu_read_[un]lock(), which happens to prevent the reordering before
    test_bit(AS_UNEVICTABLE) is reached on processor #1, so the problem
    never actually occurs. Still, that is an unexpected side effect, and we
    should solve the problem properly.

    This patch adds a barrier after mapping_clear_unevictable.

    I didn't meet this problem but just found during review.
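
    A sketch of shmem_lock(lock=0) with the barrier in place:

        mapping_clear_unevictable(mapping);
        /* pairs with the test in putback_lru_page(): make the cleared
         * bit visible before the unevictable scan re-tests it */
        smp_mb__after_clear_bit();
        scan_mapping_unevictable_pages(mapping);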

    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Quiet the sparse noise:

    warning: symbol 'khugepaged_scan' was not declared. Should it be static?
    warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock

    Signed-off-by: H Hartley Sweeten
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Quiet the sparse noise:

    warning: symbol 'default_policy' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: KOSAKI Motohiro
    Cc: Stephen Wilson
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Quiet the following sparse noise:

    warning: symbol 'swap_token_memcg' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • Quiet the following sparse noise in this file:

    warning: symbol 'memblock_overlaps_region' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: Tomi Valkeinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • At one point, anonymous pages were supposed to go on the unevictable list
    when no swap space was configured, and the idea was to manually rescue
    those pages after adding swap and making them evictable again. But
    nowadays, swap-backed pages on the anon LRU list are not scanned without
    available swap space anyway, so there is no point in moving them to a
    separate list anymore.

    The manual rescue could also be used in case pages were stranded on the
    unevictable list due to race conditions. But the code has been around for
    a while now and newly discovered bugs should be properly reported and
    dealt with instead of relying on such a manual fixup.

    In addition to the lack of a usecase, the sysfs interface to rescue pages
    from a specific NUMA node has been broken since its introduction, so it's
    unlikely that anybody ever relied on that.

    This patch removes the functionality behind the sysctl and the
    node-interface and emits a one-time warning when somebody tries to access
    either of them.

    Signed-off-by: Johannes Weiner
    Reported-by: Kautuk Consul
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • write_scan_unevictable_node() checks the value req returned by
    strict_strtoul() and returns 1 if req is 0.

    However, strict_strtoul() returning 0 means the conversion of buf to
    unsigned long succeeded.

    Because of this, the function was not proceeding to scan the zones for
    unevictable pages even when a valid value was written to the
    scan_unevictable_pages sys file.

    Change this check slightly, to check both for an invalid value in buf
    and for a 0 stored in res after a successful conversion via
    strict_strtoul(). In both cases, we do not perform the scanning of this
    node's zones.
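
    The corrected check, sketched with the variable names from the text:

        unsigned long res;
        int req = strict_strtoul(buf, 10, &res);

        if (req || res == 0)
                return 1;       /* invalid input, or an explicit 0: no scan */
        /* otherwise fall through and scan this node's zones */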

    Signed-off-by: Kautuk Consul
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • Signed-off-by: Li Haifeng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Haifeng
     
  • There's no compact_zone_order() user outside file scope, so make it static.

    Signed-off-by: Kyungmin Park
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyungmin Park
     
  • Commit fb46e73520940b ("HWPOISON: Convert pr_debugs to pr_info")
    authored by Andi Kleen converted a number of pr_debug()s to pr_info()s.

    About the same time, additional code with pr_debug()s was added by two
    other commits, 8c6c2ecb4466 ("HWPOSION, hugetlb: recover from free
    hugepage error when !MF_COUNT_INCREASED") and d950b95882f3d ("HWPOISON,
    hugetlb: soft offlining for hugepage"), and these pr_debug()s failed to
    get converted to pr_info()s.

    This patch converts them as well, and does some minor related
    whitespace cleanup.

    Signed-off-by: Dean Nelson
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
     
  • On the ext4 mailing list[1], we got some reports about errors in
    __find_get_block_slow(), but the information is very limited.

    If the device information is given, we can know the name of the sick
    volume. Furthermore, we can get the corresponding status of that
    block (group, inode block, etc.) by analyzing the disk layout.

    [1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2

    Signed-off-by: Tao Ma
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
  • The ret variable is really not needed in mm_take_all_locks().

    Signed-off-by: Kautuk Consul
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • The callback must not return -1 when nr_to_scan is zero. Fix the bug in
    fs/super.c and add this requirement to the callback specification.
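
    A well-behaved callback under the clarified contract, sketched against
    a hypothetical object cache (free_some_cached_objects() and
    count_cached_objects() are made-up names):

        static int sketch_shrink(struct shrinker *shrink,
                                 struct shrink_control *sc)
        {
                if (sc->nr_to_scan)
                        free_some_cached_objects(sc->nr_to_scan);
                /* with nr_to_scan == 0, just report; never return -1 */
                return count_cached_objects();
        }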

    Signed-off-by: Mikulas Patocka
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • fiddle wording

    Cc: Jan Kara
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • try_to_unmap_one() is called by try_to_unmap_ksm(), too.

    Signed-off-by: Wanlong Gao
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanlong Gao
     
  • Some vmalloc failure paths do not report OOM conditions.

    Add warn_alloc_failed(), which also does a dump_stack(), to those
    failure paths.

    This allows more site-specific vmalloc failure-logging printks to be
    removed.
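
    The shape of a converted failure path, sketched:

        area = __get_vm_area_node(size, align, flags, start, end,
                                  node, gfp_mask, caller);
        if (!area) {
                warn_alloc_failed(gfp_mask, 0,
                                  "vmalloc: allocation failure: %lu bytes\n",
                                  real_size);
                return NULL;
        }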

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • There are two places where kswapd reads the pgdat: on return from a
    successful balance, and when it is woken from sleep. new_order and
    new_classzone_idx hold the order and classzone_idx with which the
    balance was requested.

    But currently new_order and new_classzone_idx are not reassigned after
    kswapd_try_to_sleep(), which causes a bug in the following scenario.

    1: after a successful balance, kswapd goes to sleep, and new_order = 0;
    new_classzone_idx = __MAX_NR_ZONES - 1;

    2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL

    3: while balance_pgdat() is running, a new balance wakeup happens with
    order = 5 and classzone_idx = ZONE_NORMAL

    4: the first wakeup (order = 3) finishes successfully and returns
    order = 3, but new_order is still 0, so this balancing is treated as a
    failed balance, and the second, tighter balancing is missed.

    So, to avoid the above problem, new_order and new_classzone_idx need
    to be reassigned for the later comparison to succeed.
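
    A sketch of the fix in kswapd()'s main loop:

        kswapd_try_to_sleep(pgdat, order, classzone_idx);
        /* resync the comparison inputs after waking, so the next
         * balance result is compared against the current request */
        order = new_order = pgdat->kswapd_max_order;
        classzone_idx = new_classzone_idx = pgdat->classzone_idx;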

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Tested-by: Pádraig Brady
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • warning: function 'memblock_memory_can_coalesce'
    with external linkage has definition.

    Signed-off-by: Jonghwan Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonghwan Choi
     
  • In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from
    pgdat when reclaiming successfully"), Mel Gorman said kswapd had better
    sleep after an unsuccessful balancing if a tighter reclaim request is
    pending in the balancing. But in the following scenario, kswapd does
    something that does not match that expectation; the patch fixes this
    issue.

    1, a pgdat request A (classzone_idx, order = 3) is read
    2, balance_pgdat() runs
    3, during the balancing, a new pgdat request B (classzone_idx, order =
    5) is placed
    4, balance_pgdat() returns, but the balancing is treated as failed
    since the returned order = 0
    5, request A's pgdat is handed to balance_pgdat() again and balancing
    is redone, while the expected behavior is for kswapd to try to sleep.

    Signed-off-by: Alex Shi
    Reviewed-by: Tim Chen
    Acked-by: Mel Gorman
    Tested-by: Pádraig Brady
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • This adds highmem page poisoning and verification support to the
    debug-pagealloc feature, for the case where the architecture provides
    no implementation of its own.
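
    The core of the generic poisoning then becomes (sketch; PAGE_POISON is
    the existing poison pattern):

        static void poison_page(struct page *page)
        {
                void *addr = kmap_atomic(page);

                memset(addr, PAGE_POISON, PAGE_SIZE);
                kunmap_atomic(addr);
        }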

    [akpm@linux-foundation.org: remove unneeded preempt_disable/enable]
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Add __attribute__((format(printf, ...))) to the function to validate
    the format and arguments. Use the vsprintf extension %pV to avoid any
    possible message interleaving. Coalesce the format string. Convert
    printks/pr_warning to pr_warn.
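
    The combined idiom, sketched (hypothetical function name):

        __printf(3, 4)
        static void warn_alloc_failed_sketch(gfp_t gfp_mask, int order,
                                             const char *fmt, ...)
        {
                struct va_format vaf;
                va_list args;

                va_start(args, fmt);
                vaf.fmt = fmt;
                vaf.va = &args;
                pr_warn("%pV", &vaf);   /* one printk: no interleaving */
                va_end(args);
        }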

    [akpm@linux-foundation.org: use the __printf() macro]
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • On NOMMU architectures, if physical memory doesn't start at 0,
    ARCH_PFN_OFFSET is defined to generate the page index in the mem_map
    array. Because the virtual address is equal to the physical address,
    PAGE_OFFSET is always 0. virt_to_page and page_to_virt should not
    index pages by PAGE_OFFSET directly.
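
    The corrected generic macros go through the pfn instead, sketched:

        /* index mem_map via the pfn, honouring ARCH_PFN_OFFSET, rather
         * than adding or subtracting PAGE_OFFSET directly */
        #define virt_to_page(addr)      pfn_to_page(virt_to_pfn(addr))
        #define page_to_virt(page)      pfn_to_virt(page_to_pfn(page))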

    Signed-off-by: Sonic Zhang
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonic Zhang
     
  • This adds THP support to mremap (decreases the number of split_huge_page()
    calls).

    Here are also some benchmarks with a proggy like this:

    ===
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/mman.h>

    #define SIZE (5UL*1024*1024*1024)

    int main()
    {
            static struct timeval oldstamp, newstamp;
            long diffsec;
            char *p, *p2, *p3, *p4;

            if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
                    perror("memalign"), exit(1);

            memset(p, 0xff, SIZE);
            memset(p2, 0xff, SIZE);
            memset(p3, 0x77, 4096);
            gettimeofday(&oldstamp, NULL);
            p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
            gettimeofday(&newstamp, NULL);
            diffsec = newstamp.tv_sec - oldstamp.tv_sec;
            diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
            printf("usec %ld\n", diffsec);
            if (p == MAP_FAILED || p4 != p3)
            //if (p == MAP_FAILED)
                    perror("mremap"), exit(1);
            if (memcmp(p4, p2, SIZE))
                    printf("mremap bug\n"), exit(1);
            printf("ok\n");

            return 0;
    }
    ===

    THP on

    Performance counter stats for './largepage13' (3 runs):

    69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
    60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
    676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
    29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
    1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
    8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)

    7.314454164 seconds time elapsed ( +- 0.023% )

    THP off

    Performance counter stats for './largepage13' (3 runs):

    1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
    9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
    2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
    3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
    6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
    8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)

    9.693655243 seconds time elapsed ( +- 0.069% )

    grep thp /proc/vmstat
    thp_fault_alloc 35849
    thp_fault_fallback 0
    thp_collapse_alloc 3
    thp_collapse_alloc_failed 0
    thp_split 0

    thp_split 0 confirms no thp split despite plenty of hugepages allocated.

    The measurement below covers only the mremap time (so excluding the
    three long memsets and the final memcmp, which accesses 10GB of
    memory):

    THP on

    usec 14824
    usec 14862
    usec 14859

    THP off

    usec 256416
    usec 255981
    usec 255847

    With an older kernel, without the mremap optimizations (the patch
    below optimizes the non-THP version too):

    THP on

    usec 392107
    usec 390237
    usec 404124

    THP off

    usec 444294
    usec 445237
    usec 445820

    I guess that with a threaded program sending more IPIs on a large SMP
    system, it would create an even larger difference.

    All debug options are off except DEBUG_VM to avoid skewing the
    results.

    The only caveat for a native 2MB mremap like the one above: both the
    source and destination addresses must be 2MB aligned, or the huge pmd
    can't be moved without a split. But that is a hardware limitation.

    [akpm@linux-foundation.org: coding-style nitpicking]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
    flush_tlb_range() at the end of the loop, to avoid sending one IPI for
    each page.
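
    The loop shape after the change, sketched (range_start is a stand-in
    marking where the moved region begins):

        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
                                   new_pte++, new_addr += PAGE_SIZE) {
                if (pte_none(*old_pte))
                        continue;
                /* no flush here: just clear and move the pte */
                pte = ptep_get_and_clear(mm, old_addr, old_pte);
                set_pte_at(mm, new_addr, new_pte, pte);
        }
        /* one IPI for the whole range instead of one per page */
        flush_tlb_range(vma, range_start, old_addr);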

    The mmu_notifier_invalidate_range_start/end section is enlarged
    accordingly, but this is not going to fundamentally change things. It
    was more by accident that the region under mremap was for the most
    part still available for secondary MMUs: the primary MMU was never
    allowed to reliably access that region for the duration of the mremap
    (modulo trapping SIGSEGV on the old address range, which sounds
    impractical and flaky). If users want secondary MMUs not to lose
    access to a large region under mremap, they should reduce the mremap
    size accordingly in userland and run multiple calls. Overall this will
    run faster, so it is actually going to reduce the time the region is
    under mremap for the primary MMU, which should provide a net benefit
    to apps.

    For KVM this is a noop because the guest physical memory is never
    mremapped; there's just no point in ever moving it while the guest
    runs. One target of this optimization is JVM GC (so it is unrelated to
    the mmu notifier logic).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
    those are reasonable requirements but the check remains obscure and it
    looks more like an off by one error than an overflow check. This I feel
    will improve readability.
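
    Before and after, sketched:

        next = (old_addr + PMD_SIZE) & PMD_MASK;
        /* old, obscure form:  if (next - 1 > old_end) next = old_end; */
        /* new form: the deltas make the bounding explicit */
        extent = next - old_addr;
        if (extent > old_end - old_addr)
                extent = old_end - old_addr;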

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With the NO_BOOTMEM symbol added, architectures may now use the
    following syntax to tell that they do not need bootmem:

    select NO_BOOTMEM

    This is much more convenient than adding a new kconfig symbol, which
    was otherwise required.

    Adding this symbol does not conflict with the architectures that
    already define their own symbol.

    Signed-off-by: Sam Ravnborg
    Cc: Yinghai Lu
    Acked-by: Tejun Heo
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     
  • SPARC32 requires access to the start address. Add a new helper,
    memblock_start_of_DRAM(), to give access to the base address of the
    first memblock, which contains the lowest address.

    The awkward name was chosen to match the already present
    memblock_end_of_DRAM().
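
    The helper is essentially a one-liner, sketched:

        phys_addr_t __init_memblock memblock_start_of_DRAM(void)
        {
                /* regions are kept sorted: the first base is the lowest */
                return memblock.memory.regions[0].base;
        }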

    Signed-off-by: Sam Ravnborg
    Cc: "David S. Miller"
    Cc: Yinghai Lu
    Acked-by: Tejun Heo
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     
  • The /proc/vmallocinfo file shows information about vmalloc allocations
    in vmlist, a linked list of vm_struct. It may, however, access the
    pages field of a vm_struct whose pages have not been allocated. This
    results in a NULL pointer access and leads to a kernel panic.

    Why this happens: in __vmalloc_node_range(), called from vmalloc(),
    the newly allocated vm_struct is added to vmlist in
    __get_vm_area_node(), and only afterwards are fields of the vm_struct
    such as nr_pages and pages set in __vmalloc_area_node(). In other
    words, it is added to vmlist before it is fully initialized. If
    /proc/vmallocinfo is read at the same time, show_numa_info() accesses
    the pages field according to the nr_pages field, and a NULL pointer
    access happens.

    The patch adds the newly allocated vm_struct to the vmlist *after* it
    is fully initialized, so the pages field can no longer be accessed
    while pages are unallocated when show_numa_info() is called.
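
    The ordering after the fix, sketched (assuming a helper along the
    lines of insert_vmalloc_vmlist() to defer the vmlist insertion):

        /* fully initialize the area first ... */
        area->pages = pages;
        area->nr_pages = nr_pages;
        /* ... and only then make it visible to /proc/vmallocinfo */
        insert_vmalloc_vmlist(area);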

    Signed-off-by: Mitsuo Hayasaka
    Cc: Andrew Morton
    Cc: David Rientjes
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitsuo Hayasaka
     
  • Use newly introduced memchr_inv() for page verification.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • memchr_inv() is mainly used to check whether the whole buffer is filled
    with just a specified byte.

    The function name and prototype are stolen from logfs and the
    implementation is from SLUB.
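
    The semantics, sketched without the word-at-a-time optimization of the
    real implementation:

        /* return the first byte that differs from c, or NULL if the
         * whole buffer is filled with c */
        void *memchr_inv(const void *start, int c, size_t bytes)
        {
                const unsigned char *p = start;

                while (bytes--) {
                        if (*p != (unsigned char)c)
                                return (void *)p;
                        p++;
                }
                return NULL;
        }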

    Signed-off-by: Akinobu Mita
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Acked-by: Joern Engel
    Cc: Marcin Slusarz
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • printk_ratelimit() should not be used, because it shares ratelimiting
    state with all other unrelated printk_ratelimit() callsites.
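
    The preferred pattern gives each callsite its own state (my_rs is a
    made-up name):

        static DEFINE_RATELIMIT_STATE(my_rs, DEFAULT_RATELIMIT_INTERVAL,
                                      DEFAULT_RATELIMIT_BURST);

        if (__ratelimit(&my_rs))
                pr_warn("this message rate-limits independently\n");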

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • It's possible that a zone's watermark is already ok when entering the
    balance_pgdat() loop while the zone is within the requested
    classzone_idx. Count pages from such a zone into `balanced', so that
    we can skip shrinking zones too much for high-order allocations.
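
    A sketch of the idea (the exact placement in balance_pgdat() may
    differ):

        /* zone already balanced for this order within classzone_idx:
         * account it as balanced instead of shrinking it further */
        if (zone_watermark_ok_safe(zone, order,
                                   high_wmark_pages(zone), end_zone, 0))
                balanced += zone->present_pages;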

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When direct reclaim encounters a dirty page, the page gets recycled
    around the LRU for another cycle. This patch marks the page
    PageReclaim, as deactivate_page() does, so that the page gets
    reclaimed almost immediately after it is cleaned. This avoids
    reclaiming clean pages that are younger than a dirty page encountered
    at the end of the LRU that might have been something like a use-once
    page.
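
    The gist in shrink_page_list(), sketched (cannot_write_now is a
    stand-in for the real write-back policy check):

        if (PageDirty(page) && cannot_write_now) {
                /* tag it so end-of-writeback puts it straight back on
                 * the reclaim path instead of another LRU cycle */
                SetPageReclaim(page);
                goto keep_locked;
        }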

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Workloads that are allocating frequently and writing files place a large
    number of dirty pages on the LRU. With use-once logic, it is possible for
    them to reach the end of the LRU quickly requiring the reclaimer to scan
    more to find clean pages. Ordinarily, processes that are dirtying memory
    will get throttled by dirty balancing but this is a global heuristic and
    does not take into account that LRUs are maintained on a per-zone basis.
    This can lead to a situation whereby reclaim is scanning heavily, skipping
    over a large number of pages under writeback and recycling them around the
    LRU consuming CPU.

    This patch checks how many of the pages isolated from the LRU were
    dirty and under writeback. If a percentage of them are under
    writeback, the process will be throttled if a backing device or the
    zone is congested. Note that this applies whether it is anonymous or
    file-backed pages that are under writeback, meaning that swapping is
    potentially throttled too. This is intentional: if the swap device is
    congested, scanning more pages and dispatching more IO is not going to
    help matters.

    The percentage that must be in writeback depends on the priority. At
    default priority, all of them must be; at DEF_PRIORITY-1, 50% of them
    must be; at DEF_PRIORITY-2, 25%; etc. I.e., as pressure increases, the
    greater the likelihood the process will get throttled, allowing the
    flusher threads to make some progress.
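
    The throttling check, sketched:

        /* nr_writeback out of nr_taken isolated pages; the threshold
         * halves with each priority step below DEF_PRIORITY */
        if (nr_writeback && nr_writeback >=
                        (nr_taken >> (DEF_PRIORITY - priority)))
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);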

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman