13 Jan, 2012

3 commits

  • Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • While implementing cmpxchg_double() on s390 I realized that we don't set
    CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

    However, setting that option would increase the size of struct page by
    eight bytes on 64 bit, which we certainly do not want. It also doesn't
    make sense that the mere presence of a cpu feature should increase the
    size of struct page.

    Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
    that it should depend on CMPXCHG_DOUBLE instead.

    This patch:

    If an architecture supports CMPXCHG_LOCAL, this shouldn't automatically
    result in a larger struct page when the SLUB allocator is used. Instead,
    introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which can be
    selected if a double-word-aligned struct page is required. Also update
    the x86 Kconfig so that it works as before.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
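
    A minimal sketch of the resulting struct page alignment rule, assuming
    the usual __aligned() helper; the field list is abbreviated and this is
    not the literal patch:

    /* Align struct page to a double word only when an architecture
     * explicitly selects CONFIG_HAVE_ALIGNED_STRUCT_PAGE (e.g. because
     * SLUB wants to use cmpxchg_double() on it), instead of tying the
     * alignment to cmpxchg_local()/CMPXCHG_LOCAL support.
     */
    struct page {
            unsigned long flags;
            /* ... remaining fields elided ... */
    }
    #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
            __aligned(2 * sizeof(unsigned long))
    #endif
    ;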
     

12 Jan, 2012

3 commits

  • * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/numa: Add constraints check for nid parameters
    mm, x86: Remove debug_pagealloc_enabled
    x86/mm: Initialize high mem before free_all_bootmem()
    arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
    arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map()
    x86: Fix mmap random address range
    x86, mm: Unify zone_sizes_init()
    x86, mm: Prepare zone_sizes_init() for unification
    x86, mm: Use max_low_pfn for ZONE_NORMAL on 64-bit
    x86, mm: Wrap ZONE_DMA32 with CONFIG_ZONE_DMA32
    x86, mm: Use max_pfn instead of highend_pfn
    x86, mm: Move zone init from paging_init() on 64-bit
    x86, mm: Use MAX_DMA_PFN for ZONE_DMA on 32-bit

    Linus Torvalds
     
  • * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    slub: disallow changing cpu_partial from userspace for debug caches
    slub: add missed accounting
    slub: Extract get_freelist from __slab_alloc
    slub: Switch per cpu partial page support off for debugging
    slub: fix a possible memleak in __slab_alloc()
    slub: fix slub_max_order Documentation
    slub: add missed accounting
    slab: add taint flag outputting to debug paths.
    slub: add taint flag outputting to debug paths
    slab: introduce slab_max_order kernel parameter
    slab: rename slab_break_gfp_order to slab_max_order

    Linus Torvalds
     
  • Pekka Enberg
     

11 Jan, 2012

34 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
    vmap_area->private is a void *, but the field is not used for various
    purposes; it only ever holds a vm_struct pointer. Change it to a
    struct vm_struct * with a matching name to improve readability and type
    checking.

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
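
    A sketch of the resulting structure, assuming the new member is simply
    named vm; unrelated fields are elided:

    struct vmap_area {
            unsigned long va_start;
            unsigned long va_end;
            /* ... */
            struct vm_struct *vm;   /* was: void *private, only ever a vm_struct */
    };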
     
    It is the cursor page, not the tag page, that we should process; this
    looks like a typo.

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
    better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
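
    A sketch of the idea in the lumpy-isolation loop; the variable names
    (nr_swap_pages, cursor_page) follow common vmscan conventions and are
    assumptions here:

    /* When there is no swap, a swap-backed page (anon, shmem, tmpfs)
     * cannot be reclaimed, so give up on this contiguous block early.
     */
    if (nr_swap_pages <= 0 && PageSwapBacked(cursor_page))
            break;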
     
  • lru_to_page is not used in mm/migrate.c.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Mel Gorman
    Acked-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
    If we have to hand the newly allocated huge page back to the page
    allocator, for any reason, the counter that was changed should be
    restored.

    This affects only s390 at present.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
    mempool modifies gfp_mask so that the backing allocator doesn't try too
    hard or trigger warning messages when there's a pool to fall back on. In
    addition, for the first try, it removes __GFP_WAIT and __GFP_IO, so that
    it doesn't trigger reclaim or wait when the allocation can be fulfilled
    from the pool; however, when that allocation fails and the pool is empty
    too, it waits for the pool to be replenished before retrying.

    Allocation which could have succeeded after a bit of reclaim has to wait
    on the reserved items and it's not like mempool doesn't retry with
    __GFP_WAIT and IO. It just does that *after* someone returns an element,
    pointlessly delaying things.

    Fix it by retrying immediately if the first round of allocation attempts
    w/o __GFP_WAIT and IO fails.

    [akpm@linux-foundation.org: shorten the lock hold time]
    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
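
    A simplified sketch of the retry flow described above; it mirrors the
    description rather than quoting mempool_alloc():

    /* First pass: be opportunistic, don't reclaim or wait. */
    gfp_t gfp_temp = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

    repeat_alloc:
            element = pool->alloc(gfp_temp, pool->pool_data);
            if (element)
                    return element;

            /* ... try to take an element from the reserved pool ... */

            if (gfp_temp != gfp_mask) {
                    /* Pool is empty and the opportunistic attempt failed:
                     * retry immediately with the caller's full gfp mask
                     * instead of sleeping until an element is returned.
                     */
                    gfp_temp = gfp_mask;
                    goto repeat_alloc;
            }

            /* Only when even that fails, wait on pool->wait and retry. */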
     
    mempool_destroy() is a thin wrapper around free_pool(). The only thing it
    adds is BUG_ON(pool->curr_nr != pool->min_nr). The intention seems to be
    to enforce that all allocated elements are freed; however, the BUG_ON()
    can't achieve that (it doesn't know anything about objects above min_nr)
    and is incorrect, as mempool_resize() is allowed to leave the pool
    extended but not filled. Furthermore, panicking is way worse than any
    memory leak and there are better debug tools to track memory leaks.

    Drop the BUG_ON() from mempool_destroy() and, as that leaves the function
    identical to free_pool(), replace it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • mempool_alloc/free() use undocumented smp_mb()'s. The code is slightly
    broken and misleading.

    The lockless part is in mempool_free(). It wants to determine whether the
    item being freed needs to be returned to the pool or backing allocator
    without grabbing pool->lock. Two things need to be guaranteed for correct
    operation.

    1. pool->curr_nr + #allocated should never dip below pool->min_nr.
    2. Waiters shouldn't be left dangling.

    For #1, the only necessary condition is that the curr_nr visible at free is
    from after the allocation of the element being freed (details in the
    comment). For most cases, this is true without any barrier but there can
    be fringe cases where the allocated pointer is passed to the freeing task
    without going through memory barriers. To cover this case, wmb is
    necessary before returning from allocation and rmb is necessary before
    reading curr_nr. IOW,

    ALLOCATING TASK                    FREEING TASK

    update pool state after alloc;
    wmb();
    pass pointer to freeing task;
                                       read pointer;
                                       rmb();
                                       read pool state to free;

    The current code doesn't have wmb after pool update during allocation and
    may theoretically, on machines where unlock doesn't behave as full wmb,
    lead to pool depletion and deadlock. smp_wmb() needs to be added after
    successful allocation from reserved elements and smp_mb() in
    mempool_free() can be replaced with smp_rmb().

    For #2, the waiter needs to add itself to waitqueue and then check the
    wait condition and the waker needs to update the wait condition and then
    wake up. Because waitqueue operations always go through full spinlock
    synchronization, there is no need for extra memory barriers.

    Furthermore, mempool_alloc() is already holding pool->lock when it decides
    that it needs to wait. There is no reason to do unlock - add waitqueue -
    test condition again. It can simply add itself to waitqueue while holding
    pool->lock and then unlock and sleep.

    This patch adds smp_wmb() after successful allocation from the reserved
    pool, replaces smp_mb() in mempool_free() with smp_rmb(), and extends
    pool->lock over the waitqueue addition. More importantly, it explains
    what the memory barriers do and how the lockless testing is correct.

    -v2: Oleg pointed out that unlock doesn't imply wmb. Added explicit
    smp_wmb() after successful allocation from reserved pool and
    updated comments accordingly.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
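
    A sketch of the barrier pairing described above, split into the
    allocation and free sides; simplified and not the literal patch:

    /* Allocation side: after taking an element from the reserved pool,
     * make the updated curr_nr visible before the pointer can reach a
     * freeing task.
     */
    element = pool->elements[--pool->curr_nr];
    spin_unlock_irqrestore(&pool->lock, flags);
    smp_wmb();                      /* pairs with smp_rmb() below */

    /* Free side: read curr_nr only after the pointer has been observed,
     * so the value seen is from after the matching allocation.
     */
    smp_rmb();                      /* was smp_mb(); a read barrier suffices */
    if (unlikely(pool->curr_nr < pool->min_nr)) {
            /* return the element to the pool under pool->lock */
    }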
     
  • migration_entry_wait() can also be called from hugetlb_fault() now.
    Remove the incorrect comment.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • mpol_equal() logically returns a boolean. Use a bool type to slightly
    improve readability.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    The computation of pgoff is incorrect; at least the

    (vma->vm_pgoff >> PAGE_SHIFT)

    term is wrong. Fix it with the available helper when HPAGE_SIZE is
    involved in the page cache lookup.

    [akpm@linux-foundation.org: use vma_hugecache_offset() directly, per Michal]
    Signed-off-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • It's pointless to continue reclaiming when we have no swap space and lots
    of anon pages in the inactive list.

    Without this patch, it is possible when swap is disabled to continue
    trying to reclaim when there are only anonymous pages in the system even
    though that will not make any progress.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
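
    A sketch of the check, in the style of should_continue_reclaim(); the
    exact placement is an assumption:

    /* Only count inactive anon pages as pending reclaim work when there
     * is swap to move them to; without swap, scanning them cannot make
     * progress.
     */
    inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
    if (nr_swap_pages > 0)
            inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);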
     
  • The loop that frees pages to the page allocator while bootstrapping tries
    to free higher-order blocks only when the starting address is aligned to
    that block size. Otherwise it will free all pages on that node
    one-by-one.

    Change it to free individual pages up to the first aligned block and then
    try higher-order frees from there.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
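
    A sketch of the freeing strategy (idea only, not the patch): free single
    pages up to the first BITS_PER_LONG-aligned pfn, then switch to
    block-sized frees:

    /* Free order-0 pages until the cursor is aligned ... */
    while (start < end && (start & (BITS_PER_LONG - 1))) {
            __free_pages_bootmem(pfn_to_page(start), 0);
            start++;
    }
    /* ... then free whole aligned blocks in one go. */
    while (start + BITS_PER_LONG <= end) {
            __free_pages_bootmem(pfn_to_page(start), ilog2(BITS_PER_LONG));
            start += BITS_PER_LONG;
    }
    /* Any remaining tail pages are freed one by one. */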
     
    The area node_bootmem_map represents is aligned to BITS_PER_LONG, and all
    bits in any aligned word of that map are valid. When the represented area
    extends beyond the end of the node, the non-existent pages will be marked
    as reserved.

    As a result, when freeing a page block, doing an explicit range check for
    whether that block is within the node's range is redundant as the bitmap
    is consulted anyway to see whether all pages in the block are unreserved.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __free_pages_bootmem() used to special-case higher-order frees to save
    individual page checking with free_pages_bulk().

    Nowadays, both zero order and non-zero order frees use free_pages(), which
    checks each individual page anyway, and so there is little point in making
    the distinction anymore. The higher-order loop will work just fine for
    zero order pages.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    oom_score_adj is used to guard processes from the OOM killer. One
    problem is that it is inherited at fork(): when a daemon sets
    oom_score_adj and spawns children, it is hard to know where the value was
    set.

    This patch adds three tracepoints useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable these tracepoints. Filtering may be useful, for example:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • commit 297c5eee37 ("mm: make the vma list be doubly linked") added the
    vm_prev member to vm_area_struct. We can simplify find_vma_prev() by
    using it. Also, this change helps to improve page fault performance
    because it has stronger locality of reference.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Shaohua Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
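
    A sketch of the simplified lookup, using the vm_prev link the entry
    describes:

    struct vm_area_struct *
    find_vma_prev(struct mm_struct *mm, unsigned long addr,
                  struct vm_area_struct **pprev)
    {
            struct vm_area_struct *vma;

            /* vm_prev (from the doubly linked vma list) gives the previous
             * vma directly; no need to rediscover it by walking the rbtree.
             */
            vma = find_vma(mm, addr);
            *pprev = vma ? vma->vm_prev : NULL;
            return vma;
    }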
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some pte.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are a lot of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy practically for the whole
    duration of mremap.

    Update: Hugh noticed that special care is needed in the error path, where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises anon_vma_moveto_tail():

    ===

    #define _GNU_SOURCE             /* for the 5-argument mremap() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    /* SIZE is left undefined in the original message; any large,
     * 2MB-multiple value works for the exercise. */
    #define SIZE (256UL*1024*1024)

    int main(void)
    {
            static struct timeval oldstamp, newstamp;
            long diffsec;
            char *p, *p2, *p3, *p4;

            if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);

            memset(p, 0xff, SIZE);
            printf("%p\n", p);
            memset(p2, 0xff, SIZE);
            memset(p3, 0x77, 4096);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
            if (p4 != p3)
                    perror("mremap"), exit(1);
            p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
            if (p4 != p+SIZE/2)
                    perror("mremap"), exit(1);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            printf("ok\n");

            return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
    threshold when memory is low") changed how free_pages is calculated, but
    forgot that we used to subtract ((1 << order) - 1) from it, so we ended up
    with an off-by-two error when calculating free_pages.

    Reported-by: Wang Sheng-Hui
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
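
    A sketch of the restored accounting in the watermark check (simplified;
    the names follow __zone_watermark_ok()):

    /* The (1 << order) - 1 pages beyond the first one must again be
     * subtracted before comparing against the watermark, as was done
     * before commit 88f5acf88ae6.
     */
    free_pages -= (1 << order) - 1;
    if (free_pages <= min + lowmem_reserve)
            return false;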
     
  • The first entry of bdata->node_bootmem_map holds the data for
    bdata->node_min_pfn up to bdata->node_min_pfn + BITS_PER_LONG - 1. So the
    test for freeing all pages of a single map entry can be slightly relaxed.

    Moreover use DIV_ROUND_UP in another place instead of open coding it.

    Signed-off-by: Uwe Kleine-König
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
    Once a pfn has been isolated, it no longer needs to be scanned and
    isolated if another round is necessary, so push the isolate_migratepages
    search base of the given compact_control one step ahead.

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Tell the page allocator that pages allocated through
    grab_cache_page_write_begin() are expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
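
    A minimal sketch of the hint in grab_cache_page_write_begin(); the
    surrounding code is abbreviated:

    /* Pages grabbed for write_begin are about to be dirtied, so tell the
     * allocator via __GFP_WRITE (used by the per-zone dirty limits).
     */
    gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;

    page = __page_cache_alloc(gfp_mask);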
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                  seconds                       nr_vmscan_write
                  (stddev)               min|       median|          max
    xfs
    vanilla:   549.747(  3.492)        0.000|       0.000|        0.000
    patched:   550.996(  3.802)        0.000|       0.000|        0.000

    fuse-ntfs
    vanilla:  1183.094( 53.178)    54349.000|   59341.000|    65163.000
    patched:   558.049( 17.914)        0.000|       0.000|       43.000

    btrfs
    vanilla:   573.679( 14.015)   156657.000|  460178.000|   606926.000
    patched:   563.365( 11.368)        0.000|       0.000|     1362.000

    ext4
    vanilla:   561.197( 15.782)        0.000| 2725438.000|  4143837.000
    patched:   568.806( 17.496)        0.000|       0.000|        0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
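
    A sketch of the placement decision in the allocator's zonelist walk;
    zone_dirty_ok() is the per-zone limit check this work introduces, and the
    loop context is abbreviated:

    for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
            /* The caller told us the page will be dirtied soon; skip zones
             * that already hold their fair share of dirty pages and fall
             * through to a lower zone instead.
             */
            if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
                    continue;

            /* ... usual watermark checks and allocation ... */
    }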
     
  • The next patch will introduce per-zone dirty limiting functions in
    addition to the traditional global dirty limiting.

    Rename determine_dirtyable_memory() to global_dirtyable_memory() before
    adding the zone-specific version, and fix up its documentation.

    Also, move the functions to determine the dirtyable memory and the
    function to calculate the dirty limit based on that together so that their
    relationship is more apparent and that they can be commented on as a
    group.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                      |
    |          |  |                      |
    |          |  -- dirty limit old    -- flusher old
    |          |  |
    +----------+  --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
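
    A sketch of the global side of the accounting; dirty_balance_reserve
    stands for the summed per-zone reserves and the helper name follows the
    neighbouring entry rather than quoting the patch:

    static unsigned long global_dirtyable_memory(void)
    {
            unsigned long x;

            x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();

            /* Lowmem reserves and high watermarks are kept free by the
             * allocator and kswapd; they are not available for dirty page
             * cache, so exclude them from the dirtyable total.
             */
            x -= min(x, dirty_balance_reserve);

            return x + 1;   /* make sure we never return 0 */
    }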
     
    If we need to understand a usecase, the calling program's name is
    critically important. Show it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    David Rientjes
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Previously POSIX_FADV_DONTNEED would start writeback for the entire file
    when the bdi was not write congested. This negatively impacts performance
    if the file contains dirty pages outside of the requested range. This
    change uses __filemap_fdatawrite_range() to only initiate writeback for
    the requested range.

    Signed-off-by: Shawn Bohrer
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Bohrer
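
    A sketch of the fadvise path after the change; the surrounding switch in
    the fadvise implementation is abbreviated:

    case POSIX_FADV_DONTNEED:
            /* Write back only the requested range rather than the whole
             * file, then drop the now-clean pages in that range.
             */
            if (!bdi_write_congested(mapping->backing_dev_info))
                    __filemap_fdatawrite_range(mapping, offset, endbyte,
                                               WB_SYNC_NONE);
            /* ... invalidate_mapping_pages() on the range follows ... */
            break;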
     
    Disable slub debug facilities and allocate slabs at minimal order when
    debug_guardpage_minorder > 0 to increase the probability of catching
    random memory corruption via a CPU exception.

    Signed-off-by: Stanislaw Gruszka
    Cc: "Rafael J. Wysocki"
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Stanislaw Gruszka
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
    on access (read,write) to an unallocated page, which permits us to catch
    code which corrupts memory. However the kernel is trying to maximise
    memory usage, hence there are usually few free pages in the system and
    buggy code usually corrupts some crucial data.

    This patch changes the buddy allocator to keep more free/protected pages
    and to interlace free/protected and allocated pages to increase the
    probability of catching corruption.

    When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
    debug_guardpage_minorder defines the minimum order used by the page
    allocator to grant a request. The requested size will be returned with
    the remaining pages used as guard pages.

    The default value of debug_guardpage_minorder is zero: no change from
    current behaviour.

    [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
    Signed-off-by: Stanislaw Gruszka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Rafael J. Wysocki"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • handle_mm_fault() passes 'faulted' address to hugetlb_fault(). This
    address is not aligned to a hugepage boundary.

    Most of the functions for hugetlb pages are aware of that and calculate an
    alignment themselves. However some functions such as
    copy_user_huge_page() and clear_huge_page() don't handle alignment by
    themselves.

    This patch makes hugetlb_fault() fix the alignment and pass an aligned
    address (the address of the faulted hugepage) to those functions.

    [akpm@linux-foundation.org: use &=]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
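
    The fix boils down to masking the address in hugetlb_fault(); a one-line
    sketch, with h being the hstate as usual:

    /* Helpers such as copy_user_huge_page()/clear_huge_page() expect a
     * hugepage-aligned address, so align the faulting address first.
     */
    address &= huge_page_mask(h);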
     
    Let's make it clear that we cannot race with other fault handlers thanks
    to the hugetlb (global) mutex. Also make it clear that we want to keep
    the pte_same checks anyway, to make a transition away from the global
    mutex easier.

    Signed-off-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Currently we are not rechecking pte_same in hugetlb_cow after we take the
    ptl lock again in the page allocation failure code path, and we simply
    retry. This is not an issue at the moment because the hugetlb fault path
    is protected by hugetlb_instantiation_mutex, so we cannot race.

    The original page is locked and so we cannot race even with the page
    migration.

    Let's add the pte_same check anyway as we want to be consistent with the
    other check later in this function and be safe if we ever remove the
    mutex.

    [mhocko@suse.cz: reworded the changelog]
    Signed-off-by: Hillf Danton
    Signed-off-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
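
    A sketch of the recheck in hugetlb_cow()'s allocation-failure path;
    simplified rather than the literal patch:

    spin_lock(&mm->page_table_lock);
    ptep = huge_pte_offset(mm, address & huge_page_mask(h));
    if (likely(pte_same(huge_ptep_get(ptep), pte))) {
            /* The pte did not change while the lock was dropped, so it is
             * safe to retry the copy-on-write.
             */
    }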
     
  • Colin Cross reported;

    Under the following conditions, __alloc_pages_slowpath can loop forever:
    gfp_mask & __GFP_WAIT is true
    gfp_mask & __GFP_FS is false
    reclaim and compaction make no progress
    order
    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman