11 Jan, 2012

40 commits

  • commit 0c6967b5a0 ("serial:blackfin: rename Blackfin serial driver to
    bfin_uart.c") renamed the file, update the pattern.

    Signed-off-by: Joe Perches
    Acked-by: Sonic Zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 4860c73804c ("staging: Move media drivers to staging/media") moved
    the files; update the F: patterns.

    Signed-off-by: Joe Perches
    Acked-by: Mauro Carvalho Chehab
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 61cf45d0199 ("encrypted-keys: create encrypted-keys directory")
    moved the files; update the patterns.

    Signed-off-by: Joe Perches
    Cc: Mimi Zohar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 1fe003fd424 ("greth: Move the Aeroflex Gaisler driver") moved the
    files; update the patterns.

    Signed-off-by: Joe Perches
    Cc: Kristoffer Glembo
    Cc: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit a88394cfb58 ("ewrk3/tulip: Move the DEC - Tulip drivers") moved the
    files; update the patterns.

    Signed-off-by: Joe Perches
    Acked-by: Grant Grundler
    Cc: Jeff Kirsher
    Cc: Tobias Ringstrom
    Cc: Grant Grundler
    Cc: David Davies
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 38576af1f8c ("mmc: sdhci: make sdhci-of device drivers self
    registered") moved the files around. Update the patterns.

    Signed-off-by: Joe Perches
    Cc: Shawn Guo
    Cc: Chris Ball
    Acked-by: Anton Vorontsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 8959e74399c ("mfd: Delete ab3550 driver") removed the driver;
    update the patterns.

    Signed-off-by: Joe Perches
    Acked-by: Linus Walleij
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Commit f8fc729870ee ("[media] marvell-cam: Move cafe-ccic into its own
    directory") moved the files, update the pattern.

    Signed-off-by: Joe Perches
    Cc: Jonathan Corbet
    Acked-by: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Commit c103de240439d ("gpio: reorganize drivers") renamed the file; update
    the pattern.

    Signed-off-by: Joe Perches
    Cc: Grant Likely
    Cc: Michael Buesch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Commit c103de240439df ("gpio: reorganize drivers") renamed the files;
    update the patterns.

    Signed-off-by: Joe Perches
    Acked-by: Grant Likely
    Acked-by: Michael Hennerich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Track renames and missing or deleted files.

    Signed-off-by: Joe Perches
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • I happen to have had a commit to various network drivers since the big
    renaming/reorg which happened to drivers/net recently. This means that I
    now appear to be among the top few commit signers (by percentage) for many
    of them, so I am getting sent all sorts of stuff while the people who are
    actually involved with the driver are not. E.g. (to pick one at random):

    $ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
    "David S. Miller" (commit_signer:5/7=71%)
    Ian Campbell (commit_signer:2/7=29%)
    Eric Dumazet (commit_signer:1/7=14%)
    Jeff Kirsher (commit_signer:1/7=14%)
    Jiri Pirko (commit_signer:1/7=14%)
    netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
    linux-kernel@vger.kernel.org (open list)

    With the following patch the renames are followed and the result appears
    much more sensible:

    $ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
    "David S. Miller" (commit_signer:31/34=91%)
    Joe Perches (commit_signer:11/34=32%)
    Szymon Janc (commit_signer:5/34=15%)
    Jiri Pirko (commit_signer:3/34=9%)
    Paul (commit_signer:2/34=6%)
    netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
    linux-kernel@vger.kernel.org (open list)

    Signed-off-by: Ian Campbell
    Acked-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Campbell
     
  • vmap_area->private is a void *, but the field is not used for various
    purposes; it only ever holds a vm_struct pointer. So change it to a
    vm_struct * with a more descriptive name, to improve readability and type
    checking.
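
    A minimal userspace sketch of what the typed pointer buys (simplified,
    stand-in structures, not the kernel's definitions; the new field name is
    only illustrative):

        #include <stdio.h>

        struct vm_struct {
            void *addr;
            unsigned long size;
        };

        struct vmap_area {
            unsigned long va_start;
            unsigned long va_end;
            /* was: void *private; -- it only ever held a vm_struct */
            struct vm_struct *vm;   /* typed: misuse is now a compile error */
        };

        int main(void)
        {
            struct vm_struct area = { .addr = (void *)0x1000, .size = 4096 };
            struct vmap_area va = {
                .va_start = 0x1000, .va_end = 0x2000, .vm = &area
            };

            /* no cast needed at the use site any more */
            printf("mapped size: %lu\n", va.vm->size);
            return 0;
        }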

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • It is the cursor page, not the tag page, that we should process; the
    existing code looks like a typo.

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
    better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.
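
    An illustrative-only sketch of the changed stop test (toy structures and
    flags, not the kernel's page flags or isolation code):

        #include <stdbool.h>
        #include <stdio.h>

        struct page { bool anon; bool swap_backed; };

        /* old check: with no swap, only anonymous pages ended the lumpy scan */
        static bool stop_old(const struct page *p, bool have_swap)
        {
            return !have_swap && p->anon;
        }

        /* new check: any swap-backed page ends it, shmem/tmpfs included */
        static bool stop_new(const struct page *p, bool have_swap)
        {
            return !have_swap && p->swap_backed;
        }

        int main(void)
        {
            struct page shmem = { .anon = false, .swap_backed = true };

            printf("old stops at shmem: %d, new stops at shmem: %d\n",
                   stop_old(&shmem, false), stop_new(&shmem, false));
            return 0;
        }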

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_to_page is not used in mm/migrate.c.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Mel Gorman
    Acked-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • If we have to hand the newly allocated huge page back to the page
    allocator for any reason, the counter that was changed should be restored.

    This affects only s390 at present.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • mempool modifies gfp_mask so that the backing allocator doesn't try too
    hard or trigger a warning message when there's a pool to fall back on. In
    addition, for the first try, it removes __GFP_WAIT and IO, so that it
    doesn't trigger reclaim or wait when allocation can be fulfilled from
    pool; however, when that allocation fails and pool is empty too, it waits
    for the pool to be replenished before retrying.

    Allocation which could have succeeded after a bit of reclaim has to wait
    on the reserved items and it's not like mempool doesn't retry with
    __GFP_WAIT and IO. It just does that *after* someone returns an element,
    pointlessly delaying things.

    Fix it by retrying immediately if the first round of allocation attempts
    w/o __GFP_WAIT and IO fails.
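
    A toy sketch of the retry policy described above, with stand-in flags and
    a fake backing allocator (none of this is mempool's actual code):

        #include <stdio.h>
        #include <stdlib.h>

        #define GFP_WAIT 0x1    /* stand-ins for __GFP_WAIT / __GFP_IO */
        #define GFP_IO   0x2

        /* fake backing allocator: pretend the cheap, non-blocking attempt
         * fails, while the blocking one (which may reclaim) succeeds */
        static void *backing_alloc(unsigned flags)
        {
            if (!(flags & GFP_WAIT))
                return NULL;
            return malloc(32);
        }

        static void *pool_alloc(unsigned gfp)
        {
            unsigned relaxed = gfp & ~(GFP_WAIT | GFP_IO);
            void *p;

            p = backing_alloc(relaxed);     /* opportunistic first try */
            if (p)
                return p;
            /* pool is empty too (pool handling omitted): retry right away
             * with the original flags instead of sleeping until an element
             * is returned to the pool */
            if (gfp & GFP_WAIT)
                p = backing_alloc(gfp);
            return p;
        }

        int main(void)
        {
            void *p = pool_alloc(GFP_WAIT | GFP_IO);

            printf(p ? "allocated after immediate retry\n"
                     : "would have to sleep on the pool\n");
            free(p);
            return 0;
        }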

    [akpm@linux-foundation.org: shorten the lock hold time]
    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • mempool_destroy() is a thin wrapper around free_pool(). The only thing it
    adds is BUG_ON(pool->curr_nr != pool->min_nr). The intention seems to be
    to enforce that all allocated elements are freed; however, the BUG_ON()
    can't achieve that (it doesn't know anything about objects above min_nr),
    and it is incorrect anyway, as mempool_resize() is allowed to leave the
    pool extended but not filled. Furthermore, panicking is far worse than any
    memory leak, and there are better debug tools for tracking memory leaks.

    Drop the BUG_ON() from mempool_destroy() and, as that leaves the function
    identical to free_pool(), replace it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • mempool_alloc/free() use undocumented smp_mb()'s. The code is slightly
    broken and misleading.

    The lockless part is in mempool_free(). It wants to determine whether the
    item being freed needs to be returned to the pool or backing allocator
    without grabbing pool->lock. Two things need to be guaranteed for correct
    operation.

    1. pool->curr_nr + #allocated should never dip below pool->min_nr.
    2. Waiters shouldn't be left dangling.

    For #1, the only necessary condition is that the curr_nr visible at free is
    from after the allocation of the element being freed (details in the
    comment). For most cases, this is true without any barrier but there can
    be fringe cases where the allocated pointer is passed to the freeing task
    without going through memory barriers. To cover this case, wmb is
    necessary before returning from allocation and rmb is necessary before
    reading curr_nr. IOW,

    ALLOCATING TASK                     FREEING TASK

    update pool state after alloc;
    wmb();
    pass pointer to freeing task;
                                        read pointer;
                                        rmb();
                                        read pool state to free;

    The current code doesn't have wmb after pool update during allocation and
    may theoretically, on machines where unlock doesn't behave as full wmb,
    lead to pool depletion and deadlock. smp_wmb() needs to be added after
    successful allocation from reserved elements and smp_mb() in
    mempool_free() can be replaced with smp_rmb().
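
    A minimal C11 sketch of the wmb/rmb pairing above, using userspace fences
    as stand-ins for the kernel barriers (not mempool code): the allocating
    side publishes its state before handing over the pointer, and the freeing
    side reads the pointer before looking at that state.

        #include <stdatomic.h>
        #include <pthread.h>
        #include <stdio.h>

        static int pool_state;                  /* "pool state" stand-in */
        static _Atomic(int *) shared_ptr;       /* pointer handed over */

        static void *alloc_side(void *arg)
        {
            (void)arg;
            pool_state = 42;                            /* update pool state */
            atomic_thread_fence(memory_order_release);  /* ~ smp_wmb() */
            atomic_store_explicit(&shared_ptr, &pool_state,
                                  memory_order_relaxed); /* pass pointer */
            return NULL;
        }

        static void *free_side(void *arg)
        {
            int *p;

            (void)arg;
            while (!(p = atomic_load_explicit(&shared_ptr,
                                              memory_order_relaxed)))
                ;                                       /* read pointer */
            atomic_thread_fence(memory_order_acquire);  /* ~ smp_rmb() */
            printf("pool state seen as %d\n", *p);      /* read pool state */
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;

            pthread_create(&b, NULL, free_side, NULL);
            pthread_create(&a, NULL, alloc_side, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
        }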

    For #2, the waiter needs to add itself to waitqueue and then check the
    wait condition and the waker needs to update the wait condition and then
    wake up. Because waitqueue operations always go through full spinlock
    synchronization, there is no need for extra memory barriers.

    Furthermore, mempool_alloc() is already holding pool->lock when it decides
    that it needs to wait. There is no reason to do unlock - add waitqueue -
    test condition again. It can simply add itself to waitqueue while holding
    pool->lock and then unlock and sleep.

    This patch adds smp_wmb() after successful allocation from the reserved
    pool, replaces smp_mb() in mempool_free() with smp_rmb(), and extends
    pool->lock over the waitqueue addition. More importantly, it explains what memory
    barriers do and how the lockless testing is correct.

    -v2: Oleg pointed out that unlock doesn't imply wmb. Added explicit
    smp_wmb() after successful allocation from reserved pool and
    updated comments accordingly.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • migration_entry_wait() can also be called from hugetlb_fault() now.
    Remove the incorrect comment.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • mpol_equal() logically returns a boolean. Use a bool type to slightly
    improve readability.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The computation of pgoff is incorrect, at least where

    (vma->vm_pgoff >> PAGE_SHIFT)

    is involved. Fix it by using the available helper for HPAGE_SIZE-based
    page cache lookups.

    [akpm@linux-foundation.org: use vma_hugecache_offset() directly, per Michal]
    Signed-off-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • It's pointless to continue reclaiming when we have no swap space and lots
    of anon pages in the inactive list.

    Without this patch, it is possible when swap is disabled to continue
    trying to reclaim when there are only anonymous pages in the system even
    though that will not make any progress.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The loop that frees pages to the page allocator while bootstrapping tries
    to free higher-order blocks only when the starting address is aligned to
    that block size. Otherwise it will free all pages on that node
    one-by-one.

    Change it to free individual pages up to the first aligned block and then
    try higher-order frees from there.
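
    A simplified sketch of the loop structure described above, with printf
    standing in for the actual page frees and BITS_PER_LONG pages assumed as
    the block size:

        #include <stdio.h>

        #define BITS_PER_LONG 64UL

        static void free_range(unsigned long start, unsigned long end)
        {
            unsigned long pfn = start;

            /* individual pages up to the first aligned block boundary */
            while (pfn < end && (pfn & (BITS_PER_LONG - 1)))
                printf("free order-0 page, pfn %lu\n", pfn++);

            /* then whole aligned blocks while they fit */
            while (pfn + BITS_PER_LONG <= end) {
                printf("free block of %lu pages at pfn %lu\n",
                       BITS_PER_LONG, pfn);
                pfn += BITS_PER_LONG;
            }

            /* trailing partial block, again page by page */
            while (pfn < end)
                printf("free order-0 page, pfn %lu\n", pfn++);
        }

        int main(void)
        {
            /* unaligned start: pfns 5..63 go out one by one, then blocks */
            free_range(5, 200);
            return 0;
        }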

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The area node_bootmem_map represents is aligned to BITS_PER_LONG, and all
    bits in any aligned word of that map are valid. When the represented area
    extends beyond the end of the node, the non-existent pages will be marked
    as reserved.

    As a result, when freeing a page block, doing an explicit range check for
    whether that block is within the node's range is redundant as the bitmap
    is consulted anyway to see whether all pages in the block are unreserved.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __free_pages_bootmem() used to special-case higher-order frees to save
    individual page checking with free_pages_bulk().

    Nowadays, both zero order and non-zero order frees use free_pages(), which
    checks each individual page anyway, and so there is little point in making
    the distinction anymore. The higher-order loop will work just fine for
    zero order pages.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • oom_score_adj is used to guard processes from the OOM killer. One problem
    is that it is inherited at fork(): when a daemon sets oom_score_adj and
    then creates children, it's hard to know where the value was set.

    This patch adds three tracepoints useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable the relevant tracepoints. Filtering may be
    useful, for example:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • commit 297c5eee37 ("mm: make the vma list be doubly linked") added the
    vm_prev member to vm_area_struct. We can simplify find_vma_prev() by
    using it. Also, this change helps to improve page fault performance
    because it has stronger locality of reference.
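
    A sketch of the simplification with a toy doubly linked list (not the
    kernel's vm_area_struct or find_vma_prev()): with a back pointer, the
    previous element no longer has to be rediscovered by walking from the
    head.

        #include <stdio.h>
        #include <stddef.h>

        struct vma {
            unsigned long start, end;
            struct vma *next, *prev;    /* like vm_next / vm_prev */
        };

        /* old style: walk from the head, remembering the predecessor */
        static struct vma *find_prev_by_walk(struct vma *head, struct vma *target)
        {
            struct vma *prev = NULL;

            for (struct vma *v = head; v; prev = v, v = v->next)
                if (v == target)
                    return prev;
            return NULL;
        }

        /* new style: the back pointer makes it a single dereference */
        static struct vma *find_prev(struct vma *target)
        {
            return target->prev;
        }

        int main(void)
        {
            struct vma a = { .start = 0, .end = 4096 };
            struct vma b = { .start = 4096, .end = 8192, .prev = &a };

            a.next = &b;
            printf("walk: %p  direct: %p\n",
                   (void *)find_prev_by_walk(&a, &b), (void *)find_prev(&b));
            return 0;
        }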

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Shaohua Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some ptes.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are a lot of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy for practically the whole
    duration of mremap.

    Update: Hugh noticed that special care is needed in the error path, where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises anon_vma_moveto_tail():

    ===

    /* The includes and the SIZE definition below were not part of the
     * original snippet; they are added so the program builds.  The SIZE
     * value is an assumption -- any multiple of 2MB will do. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    #define SIZE (128UL*1024*1024)

    int main()
    {
        /* left over from a timing variant of the test; unused here */
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        printf("%p\n", p);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        if (memcmp(p, p2, SIZE))
            printf("error\n");

        /* move the second half of p onto p3, then move it back */
        p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        if (p4 != p3)
            perror("mremap"), exit(1);
        p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
        if (p4 != p+SIZE/2)
            perror("mremap"), exit(1);
        if (memcmp(p, p2, SIZE))
            printf("error\n");
        printf("ok\n");

        return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
    threshold when memory is low") changed the way free_pages is calculated,
    but forgot that we used to do free_pages - ((1 << order) - 1), so we ended
    up with an off-by-two when calculating free_pages.

    Reported-by: Wang Sheng-Hui
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The first entry of bdata->node_bootmem_map holds the data for
    bdata->node_min_pfn up to bdata->node_min_pfn + BITS_PER_LONG - 1. So the
    test for freeing all pages of a single map entry can be slightly relaxed.

    Moreover use DIV_ROUND_UP in another place instead of open coding it.

    Signed-off-by: Uwe Kleine-König
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • Once isolated, the current pfn will no longer need to be scanned and
    isolated if another round is necessary, so push the isolate_migratepages
    search base of the given compact_control one step ahead.

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Tell the page allocator that pages allocated for a buffered write are
    expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Tell the page allocator that pages allocated through
    grab_cache_page_write_begin() are expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                 seconds             nr_vmscan_write
                 (stddev)            min |      median |         max
    xfs
    vanilla:   549.747 ( 3.492)        0.000 |       0.000 |       0.000
    patched:   550.996 ( 3.802)        0.000 |       0.000 |       0.000

    fuse-ntfs
    vanilla:  1183.094 (53.178)    54349.000 |   59341.000 |   65163.000
    patched:   558.049 (17.914)        0.000 |       0.000 |      43.000

    btrfs
    vanilla:   573.679 (14.015)   156657.000 |  460178.000 |  606926.000
    patched:   563.365 (11.368)        0.000 |       0.000 |    1362.000

    ext4
    vanilla:   561.197 (15.782)        0.000 | 2725438.000 | 4143837.000
    patched:   568.806 (17.496)        0.000 |       0.000 |       0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The next patch will introduce per-zone dirty limiting functions in
    addition to the traditional global dirty limiting.

    Rename determine_dirtyable_memory() to global_dirtyable_memory() before
    adding the zone-specific version, and fix up its documentation.

    Also, move the functions to determine the dirtyable memory and the
    function to calculate the dirty limit based on that together so that their
    relationship is more apparent and that they can be commented on as a
    group.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |  |
    +----------+ --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If we need to know the use case, the caller's program name is critically
    important. Show it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    David Rientjes
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Calling alloc_pages_exact_node() means the allocation only passes the
    zonelist of a single node into the page allocator. If that node isn't
    online, its zonelist may never have been initialized, causing a strange
    oops that may not immediately be clear.

    I recently debugged an issue where node 0 wasn't online and an allocator
    was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
    pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
    would be nice to catch this a bit earlier.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes