01 Nov, 2011

40 commits

  • The ret variable is really not needed in mm_take_all_locks().

    Signed-off-by: Kautuk Consul
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • The callback must not return -1 when nr_to_scan is zero. Fix the bug in
    fs/super.c and add this requirement to the callback specification.
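
    A minimal userspace sketch of the rule, not the fs/super.c code. The
    names toy_cache and toy_shrink() are hypothetical, and the convention
    modelled here (report the object count on a zero-scan query, return -1
    only when a real scan cannot proceed) is an assumption drawn from the
    description above.

    ```c
    /* Hypothetical userspace model of a cache with a shrinker-style callback. */
    struct toy_cache {
        long nr_objects;   /* objects that could be freed */
        int can_reclaim;   /* 0 models "cannot take the needed locks right now" */
    };

    /*
     * nr_to_scan == 0: pure query, report the object count, never -1.
     * nr_to_scan  > 0: free up to nr_to_scan objects and return what is
     *                  left, or return -1 if reclaim cannot proceed.
     */
    static long toy_shrink(struct toy_cache *c, long nr_to_scan)
    {
        if (nr_to_scan == 0)
            return c->nr_objects;
        if (!c->can_reclaim)
            return -1;
        if (nr_to_scan > c->nr_objects)
            nr_to_scan = c->nr_objects;
        c->nr_objects -= nr_to_scan;
        return c->nr_objects;
    }
    ```

    The zero-scan branch is checked first so that a caller sizing the cache
    always gets a usable count, even when an actual scan would have to bail
    out.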

    Signed-off-by: Mikulas Patocka
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • fiddle wording

    Cc: Jan Kara
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • try_to_unmap_one() is called by try_to_unmap_ksm(), too.

    Signed-off-by: Wanlong Gao
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanlong Gao
     
  • Some vmalloc failure paths do not report OOM conditions.

    Add warn_alloc_failed, which also does a dump_stack, to those failure
    paths.

    This allows more site specific vmalloc failure logging message printks to
    be removed.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • There are 2 places where kswapd reads pgdat: on return from a successful
    balance, and on wakeup from kswapd sleeping. The new_order and
    new_classzone_idx represent the balance input order and classzone_idx.

    But currently new_order and new_classzone_idx are not assigned after
    kswapd_try_to_sleep(), which causes a bug in the following scenario.

    1: after a successful balance, kswapd goes to sleep with new_order = 0
    and new_classzone_idx = __MAX_NR_ZONES - 1;

    2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL;

    3: while balance_pgdat() is running, a new balance wakeup arrives with
    order = 5 and classzone_idx = ZONE_NORMAL;

    4: the first wakeup (order = 3) finishes successfully and returns
    order = 3, but new_order is still 0, so this balancing is treated as a
    failed balance and the second, tighter balancing is missed.

    So, to avoid the above problem, new_order and new_classzone_idx need to
    be assigned so that the later success comparison works.
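
    The scenario can be sketched with a simplified model of the success
    comparison. This is not the kswapd() code: struct kswapd_req and
    balance_succeeded() are hypothetical names, and the exact form of the
    comparison is an assumption.

    ```c
    /* Hypothetical model of the state kswapd compares after a balance. */
    struct kswapd_req {
        int order;
        int classzone_idx;
    };

    /*
     * Assumed form of the "last balance succeeded" test: the balanced
     * result must match the request kswapd believes it was working on.
     */
    static int balance_succeeded(struct kswapd_req balanced,
                                 struct kswapd_req wanted)
    {
        return balanced.classzone_idx >= wanted.classzone_idx &&
               balanced.order == wanted.order;
    }
    ```

    With the stale order of 0 left over from before the sleep, an order-3
    result compares as a failure. Once the woken-up request is copied into
    new_order and new_classzone_idx, the same result compares as a success.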

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Tested-by: Pádraig Brady
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex,Shi
     
  • warning: function 'memblock_memory_can_coalesce'
    with external linkage has definition.

    Signed-off-by: Jonghwan Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonghwan Choi
     
  • In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
    when reclaiming successfully"), Mel Gorman said it is better for kswapd
    to sleep after an unsuccessful balancing if a tighter reclaim request is
    pending during the balancing. But in the following scenario, kswapd does
    not behave as expected. The patch fixes this issue.

    1, Read pgdat request A (classzone_idx, order = 3)
    2, balance_pgdat()
    3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
    4, balance_pgdat() returns but failed since returned order = 0
    5, pgdat of request A is assigned to balance_pgdat(), which balances again,
    whereas the expected behavior is for kswapd to try to sleep.

    Signed-off-by: Alex Shi
    Reviewed-by: Tim Chen
    Acked-by: Mel Gorman
    Tested-by: Pádraig Brady
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex,Shi
     
  • This adds support for highmem page poisoning and verification to the
    debug-pagealloc feature for architectures without native support.

    [akpm@linux-foundation.org: remove unneeded preempt_disable/enable]
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Add __attribute__((format (printf...) to the function to validate format
    and arguments. Use vsprintf extension %pV to avoid any possible message
    interleaving. Coalesce format string. Convert printks/pr_warning to
    pr_warn.
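
    A userspace sketch of the format-checking part only. The kernel-only
    %pV extension is not available here, so the sketch passes the va_list
    to vsnprintf() instead; format_warning() is a hypothetical name.

    ```c
    #include <stdarg.h>
    #include <stdio.h>

    /*
     * Hypothetical helper. The attribute says that argument 3 is a
     * printf-style format string and the variadic arguments start at
     * position 4, so GCC and Clang can warn about mismatches at every
     * call site.
     */
    __attribute__((format(printf, 3, 4)))
    static int format_warning(char *buf, size_t len, const char *fmt, ...)
    {
        va_list args;
        int n;

        va_start(args, fmt);
        n = vsnprintf(buf, len, fmt, args);  /* one formatting pass, one buffer */
        va_end(args);
        return n;
    }
    ```

    Formatting the whole message in a single call is what keeps it from
    being split into pieces that other output could interleave with.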

    [akpm@linux-foundation.org: use the __printf() macro]
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • On NOMMU architectures, if physical memory doesn't start from 0,
    ARCH_PFN_OFFSET is defined to generate page index in mem_map array.
    Because virtual address is equal to physical address, PAGE_OFFSET is
    always 0. virt_to_page and page_to_virt should not index page by
    PAGE_OFFSET directly.
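
    The indexing problem can be sketched in userspace as follows. The
    PHYS_START value and the helper names are hypothetical; only the
    arithmetic is the point.

    ```c
    #define PAGE_SHIFT      12
    #define PAGE_OFFSET     0UL            /* NOMMU: virtual == physical */
    #define PHYS_START      0x20000000UL   /* hypothetical start of RAM */
    #define ARCH_PFN_OFFSET (PHYS_START >> PAGE_SHIFT)

    /* Indexing by PAGE_OFFSET alone overshoots when RAM does not start at 0. */
    static unsigned long naive_index(unsigned long addr)
    {
        return (addr - PAGE_OFFSET) >> PAGE_SHIFT;
    }

    /* Going through the pfn and subtracting ARCH_PFN_OFFSET gives the slot. */
    static unsigned long virt_to_index(unsigned long addr)
    {
        return (addr >> PAGE_SHIFT) - ARCH_PFN_OFFSET;
    }

    /* Inverse mapping from a mem_map slot back to an address. */
    static unsigned long index_to_virt(unsigned long idx)
    {
        return (idx + ARCH_PFN_OFFSET) << PAGE_SHIFT;
    }
    ```

    In this sketch the first page of RAM maps to slot 0 only in the
    pfn-based version.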

    Signed-off-by: Sonic Zhang
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonic Zhang
     
  • This adds THP support to mremap (decreases the number of split_huge_page()
    calls).

    Here are also some benchmarks with a proggy like this:

    ===
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    #define SIZE (5UL*1024*1024*1024)

    int main(void)
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        /* 2M-aligned buffers so that THP can back them */
        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);

        /* time only the mremap of p onto p3 */
        gettimeofday(&oldstamp, NULL);
        p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        gettimeofday(&newstamp, NULL);
        diffsec = newstamp.tv_sec - oldstamp.tv_sec;
        diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
        printf("usec %ld\n", diffsec);
        if (p4 == MAP_FAILED || p4 != p3)
        //if (p4 == MAP_FAILED)
            perror("mremap"), exit(1);
        /* the moved data must still match the untouched copy in p2 */
        if (memcmp(p4, p2, SIZE))
            printf("mremap bug\n"), exit(1);
        printf("ok\n");

        return 0;
    }
    ===

    THP on

    Performance counter stats for './largepage13' (3 runs):

    69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
    60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
    676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
    29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
    1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
    8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)

    7.314454164 seconds time elapsed ( +- 0.023% )

    THP off

    Performance counter stats for './largepage13' (3 runs):

    1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
    9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
    2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
    3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
    6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
    8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)

    9.693655243 seconds time elapsed ( +- 0.069% )

    grep thp /proc/vmstat
    thp_fault_alloc 35849
    thp_fault_fallback 0
    thp_collapse_alloc 3
    thp_collapse_alloc_failed 0
    thp_split 0

    thp_split 0 confirms no thp split despite plenty of hugepages allocated.

    The measurement of only the mremap time (so excluding the 3 long
    memset and final long 10GB memory accessing memcmp):

    THP on

    usec 14824
    usec 14862
    usec 14859

    THP off

    usec 256416
    usec 255981
    usec 255847

    With an older kernel without the mremap optimizations (the below patch
    optimizes the non THP version too).

    THP on

    usec 392107
    usec 390237
    usec 404124

    THP off

    usec 444294
    usec 445237
    usec 445820

    I guess with a threaded program that sends more IPI on large SMP it'd
    create an even larger difference.

    All debug options are off except DEBUG_VM to avoid skewing the
    results.

    The only limitation of native 2M mremap, as exercised above, is that both
    the source and destination addresses must be 2M aligned, or the hugepmd
    can't be moved without a split; that is a hardware limitation.
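
    The alignment condition can be sketched as below. The name
    can_move_huge_pmd() is hypothetical, the 2M size assumes x86-64 style
    huge pages, and the real code may check further conditions.

    ```c
    #define HPAGE_PMD_SIZE (2UL * 1024 * 1024)   /* assumed 2M huge page */

    /* Returns 1 when both addresses sit on a 2M boundary, 0 otherwise. */
    static int can_move_huge_pmd(unsigned long old_addr, unsigned long new_addr)
    {
        return !(old_addr & (HPAGE_PMD_SIZE - 1)) &&
               !(new_addr & (HPAGE_PMD_SIZE - 1));
    }
    ```

    In the benchmark above both buffers come from posix_memalign() with a
    2M alignment, which is why every hugepmd can be moved whole.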

    [akpm@linux-foundation.org: coding-style nitpicking]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
    flush_tlb_range() at the end of the loop, to avoid sending one IPI for
    each page.

    The mmu_notifier_invalidate_range_start/end section is enlarged
    accordingly but this is not going to fundamentally change things. It was
    more by accident that the region under mremap was for the most part still
    available for secondary MMUs: the primary MMU was never allowed to
    reliably access that region for the duration of the mremap (modulo
    trapping SIGSEGV on the old address range which sounds impractical and
    flaky). If users want secondary MMUs not to lose access to a large
    region under mremap they should reduce the mremap size accordingly in
    userland and run multiple calls. Overall this will run faster so it's
    actually going to reduce the time the region is under mremap for the
    primary MMU which should provide a net benefit to apps.

    For KVM this is a noop because the guest physical memory is never
    mremapped; there's just no point in ever moving it while the guest runs. One
    target of this optimization is JVM GC (so unrelated to the mmu notifier
    logic).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
    those are reasonable requirements but the check remains obscure and it
    looks more like an off by one error than an overflow check. This I feel
    will improve readability.
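
    A sketch of an overflow-safe check written with deltas, assuming the
    check clamps a per-PMD step to the end of the range. The function name
    pmd_extent() and the 2M PMD_SIZE are assumptions, not taken from the
    patch.

    ```c
    #define PMD_SIZE (2UL * 1024 * 1024)   /* assumed PMD coverage */
    #define PMD_MASK (~(PMD_SIZE - 1))

    /* Bytes to handle in this step: up to the next PMD boundary or old_end. */
    static unsigned long pmd_extent(unsigned long old_addr, unsigned long old_end)
    {
        /* next may wrap to 0 when old_addr lies in the last PMD */
        unsigned long next = (old_addr + PMD_SIZE) & PMD_MASK;
        /* unsigned subtraction still gives the right distance after a wrap */
        unsigned long extent = next - old_addr;

        if (extent > old_end - old_addr)
            extent = old_end - old_addr;
        return extent;
    }
    ```

    Comparing two distances from old_addr avoids both the "- 1" and any
    dependence on old_end being page aligned.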

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With the NO_BOOTMEM symbol added architectures may now use the following
    syntax to tell that they do not need bootmem:

    select NO_BOOTMEM

    This is much more convenient than adding a new kconfig symbol which was
    otherwise required.

    Adding this symbol does not conflict with the architectures that already
    define their own symbol.

    Signed-off-by: Sam Ravnborg
    Cc: Yinghai Lu
    Acked-by: Tejun Heo
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     
  • SPARC32 requires access to the start address. Add a new helper
    memblock_start_of_DRAM() to give access to the address of the first
    memblock - which contains the lowest address.

    The awkward name was chosen to match the already present
    memblock_end_of_DRAM().
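
    The relationship between the two helpers can be sketched over a plain
    array. This is a userspace model with hypothetical names; it assumes
    the regions are sorted by base address and that there is at least one.

    ```c
    /* Hypothetical stand-in for one memory region. */
    struct region {
        unsigned long base;
        unsigned long size;
    };

    /* Lowest address: the base of the first (lowest) region. */
    static unsigned long start_of_dram(const struct region *r, int cnt)
    {
        (void)cnt;   /* unused, kept for symmetry with end_of_dram() */
        return r[0].base;
    }

    /* Highest address: one past the end of the last region. */
    static unsigned long end_of_dram(const struct region *r, int cnt)
    {
        return r[cnt - 1].base + r[cnt - 1].size;
    }
    ```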

    Signed-off-by: Sam Ravnborg
    Cc: "David S. Miller"
    Cc: Yinghai Lu
    Acked-by: Tejun Heo
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Ravnborg
     
  • /proc/vmallocinfo shows information about the vmalloc allocations in
    vmlist, which is a linked list of vm_struct. It may, however, access the
    pages field of a vm_struct whose pages have not been allocated yet. This
    results in a null pointer access and leads to a kernel panic.

    Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
    allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
    some fields of vm_struct such as nr_pages and pages are set at
    __vmalloc_area_node(). In other words, it is added to vmlist before it is
    fully initialized. At the same time, when the /proc/vmallocinfo is read,
    it accesses the pages field of vm_struct according to the nr_pages field
    at show_numa_info(). Thus, a null pointer access happens.

    The patch adds the newly allocated vm_struct to the vmlist *after* it is
    fully initialized. So, it can avoid accessing the pages field with
    unallocated page when show_numa_info() is called.

    Signed-off-by: Mitsuo Hayasaka
    Cc: Andrew Morton
    Cc: David Rientjes
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Jeremy Fitzhardinge
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitsuo Hayasaka
     
  • Use newly introduced memchr_inv() for page verification.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • memchr_inv() is mainly used to check whether the whole buffer is filled
    with just a specified byte.

    The function name and prototype are stolen from logfs and the
    implementation is from SLUB.
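
    For illustration, a naive byte-at-a-time userspace sketch of the same
    contract. It is not the SLUB-derived implementation, and the name
    memchr_inv_sketch() is hypothetical.

    ```c
    #include <stddef.h>

    /*
     * Returns a pointer to the first byte in [start, start + bytes) that
     * differs from c, or NULL if the whole buffer is filled with c.
     */
    static const void *memchr_inv_sketch(const void *start, int c, size_t bytes)
    {
        const unsigned char *p = start;
        unsigned char value = (unsigned char)c;
        size_t i;

        for (i = 0; i < bytes; i++)
            if (p[i] != value)
                return p + i;
        return NULL;
    }
    ```

    A poison check then reduces to testing the result against NULL.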

    Signed-off-by: Akinobu Mita
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Acked-by: Joern Engel
    Cc: Marcin Slusarz
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • printk_ratelimit() should not be used, because it shares ratelimiting
    state with all other unrelated printk_ratelimit() callsites.
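
    Why shared state is a problem can be sketched with a toy rate limiter.
    This is a userspace model with hypothetical names (rl_state,
    rl_allow()), not the kernel's ratelimit code; time is passed in
    explicitly to keep it deterministic.

    ```c
    /* Allow at most `burst` messages per window of `interval` ticks. */
    struct rl_state {
        long interval;
        int burst;
        long begin;     /* start of the current window */
        int printed;    /* messages allowed so far in this window */
    };

    /* Returns 1 if the caller may print at time `now`, 0 if suppressed. */
    static int rl_allow(struct rl_state *rs, long now)
    {
        if (now - rs->begin >= rs->interval) {
            rs->begin = now;
            rs->printed = 0;
        }
        if (rs->printed < rs->burst) {
            rs->printed++;
            return 1;
        }
        return 0;
    }
    ```

    If a noisy call site exhausts a shared rl_state, an unrelated call site
    using the same state is suppressed too; giving each call site its own
    state avoids that.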

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • It's possible a zone watermark is ok when entering the balance_pgdat()
    loop, while the zone is within the requested classzone_idx. Count pages
    from this zone into `balanced'. In this way, we can skip shrinking zones
    too much for high order allocation.

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When direct reclaim encounters a dirty page, it gets recycled around the
    LRU for another cycle. This patch marks the page PageReclaim similar to
    deactivate_page() so that the page gets reclaimed almost immediately after
    the page gets cleaned. This is to avoid reclaiming clean pages that are
    younger than a dirty page encountered at the end of the LRU that might
    have been something like a use-once page.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Workloads that are allocating frequently and writing files place a large
    number of dirty pages on the LRU. With use-once logic, it is possible for
    them to reach the end of the LRU quickly requiring the reclaimer to scan
    more to find clean pages. Ordinarily, processes that are dirtying memory
    will get throttled by dirty balancing but this is a global heuristic and
    does not take into account that LRUs are maintained on a per-zone basis.
    This can lead to a situation whereby reclaim is scanning heavily, skipping
    over a large number of pages under writeback and recycling them around the
    LRU consuming CPU.

    This patch checks how many of the pages isolated from the LRU were dirty
    and under writeback. If a given percentage of them are under writeback,
    the process will be throttled if a backing device or the zone is
    congested. Note that this applies whether it is anonymous or file-backed
    pages that are under writeback meaning that swapping is potentially
    throttled. This is intentional due to the fact if the swap device is
    congested, scanning more pages and dispatching more IO is not going to
    help matters.

    The percentage that must be in writeback depends on the priority. At
    default priority, all of them must be dirty. At DEF_PRIORITY-1, 50% of
    them must be, DEF_PRIORITY-2, 25% etc. i.e. as pressure increases the
    greater the likelihood the process will get throttled to allow the flusher
    threads to make some progress.
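
    The priority-dependent threshold can be sketched as a shift. The
    function name should_throttle() is hypothetical and the exact
    expression is an assumption; the congestion test that the text also
    requires is left out.

    ```c
    #define DEF_PRIORITY 12

    /*
     * Threshold halves with each step below DEF_PRIORITY:
     * all pages at DEF_PRIORITY, 50% at DEF_PRIORITY-1, 25% at
     * DEF_PRIORITY-2, and so on.
     */
    static int should_throttle(unsigned long nr_taken,
                               unsigned long nr_writeback, int priority)
    {
        return nr_writeback &&
               nr_writeback >= (nr_taken >> (DEF_PRIORITY - priority));
    }
    ```

    The explicit nr_writeback test keeps the sketch from throttling when
    the shifted threshold reaches zero but nothing is under writeback.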

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is preferable that no dirty pages are dispatched for cleaning from the
    page reclaim path. At normal priorities, this patch prevents kswapd
    writing pages.

    However, page reclaim does have a requirement that pages be freed in a
    particular zone. If it is failing to make sufficient progress (reclaiming
    < SWAP_CLUSTER_MAX at any priority), the priority is raised to
    scan more pages. A priority of DEF_PRIORITY - 3 is considered to be the
    point where kswapd is getting into trouble reclaiming pages. If this
    priority is reached, kswapd will dispatch pages for writing.
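
    The resulting policy for a dirty page can be sketched as below. The
    function name may_write_dirty_page() is hypothetical and the exact
    comparison is an assumption. Note that a numerically lower priority
    means higher reclaim pressure.

    ```c
    #define DEF_PRIORITY 12

    /* Returns 1 if reclaim may write the dirty page itself in this sketch. */
    static int may_write_dirty_page(int is_file_page, int is_kswapd, int priority)
    {
        if (!is_file_page)
            return 1;   /* anonymous pages are still written to swap */
        if (!is_kswapd)
            return 0;   /* direct reclaim leaves file pages to the flushers */
        /* kswapd writes file pages only from DEF_PRIORITY - 3 downwards */
        return priority < DEF_PRIORITY - 2;
    }
    ```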

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Direct reclaim should never writeback pages. Warn if an attempt is made.

    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Direct reclaim should never writeback pages. For now, handle the
    situation and warn about it. Ultimately, this will be a BUG_ON.

    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Lumpy reclaim worked with two passes - the first which queued pages for IO
    and the second which waited on writeback. As direct reclaim can no longer
    write pages there is some dead code. This patch removes it but direct
    reclaim will continue to wait on pages under writeback while in
    synchronous reclaim mode.

    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Testing from the XFS folk revealed that there is still too much I/O from
    the end of the LRU in kswapd. Previously it was considered acceptable by
    VM people for a small number of pages to be written back from reclaim with
    testing generally showing about 0.3% of pages reclaimed were written back
    (higher if memory was low). That writing back a small number of pages is
    ok has been heavily disputed for quite some time and Dave Chinner
    explained it well;

    It doesn't have to be a very high number to be a problem. IO
    is orders of magnitude slower than the CPU time it takes to
    flush a page, so the cost of making a bad flush decision is
    very high. And single page writeback from the LRU is almost
    always a bad flush decision.

    To complicate matters, filesystems respond very differently to requests
    from reclaim according to Christoph Hellwig;

    xfs tries to write it back if the requester is kswapd
    ext4 ignores the request if it's a delayed allocation
    btrfs ignores the request

    As a result, each filesystem has different performance characteristics
    when under memory pressure and there are many pages being dirtied. In
    some cases, the request is ignored entirely so the VM cannot depend on the
    IO being dispatched.

    The objective of this series is to reduce writing of filesystem-backed
    pages from reclaim, play nicely with writeback that is already in progress
    and throttle reclaim appropriately when writeback pages are encountered.
    The assumption is that the flushers will always write pages faster than if
    reclaim issues the IO.

    A secondary goal is to avoid the problem whereby direct reclaim splices
    two potentially deep call stacks together.

    There is a potential new problem as reclaim has less control over how long
    before a page in a particular zone or container is cleaned and direct
    reclaimers depend on kswapd or flusher threads to do the necessary work.
    However, as filesystems sometimes ignore direct reclaim requests already,
    it is not expected to be a serious issue.

    Patch 1 disables writeback of filesystem pages from direct reclaim
    entirely. Anonymous pages are still written.

    Patch 2 removes dead code in lumpy reclaim as it is no longer able
    to synchronously write pages. This hurts lumpy reclaim but
    there is an expectation that compaction is used for hugepage
    allocations these days and lumpy reclaim's days are numbered.

    Patches 3-4 add warnings to XFS and ext4 if called from
    direct reclaim. With patch 1, this "never happens" and is
    intended to catch regressions in this logic in the future.

    Patch 5 disables writeback of filesystem pages from kswapd unless
    the priority is raised to the point where kswapd is considered
    to be in trouble.

    Patch 6 throttles reclaimers if too many dirty pages are being
    encountered and the zones or backing devices are congested.

    Patch 7 invalidates dirty pages found at the end of the LRU so they
    are reclaimed quickly after being written back rather than
    waiting for a reclaimer to find them

    I consider this series to be orthogonal to the writeback work but it is
    worth noting that the writeback work affects the viability of patch 8 in
    particular.

    I tested this on ext4 and xfs using fs_mark, a simple writeback test based
    on dd and a micro benchmark that does a streaming write to a large mapping
    (exercises use-once LRU logic) followed by streaming writes to a mix of
    anonymous and file-backed mappings. The command line for fs_mark when
    booted with 512M looked something like

    ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

    The number of files was adjusted depending on the amount of available
    memory so that the files created was about 3xRAM. For multiple threads,
    the -d switch is specified multiple times.

    The test machine is x86-64 with an older generation of AMD processor with
    4 cores. The underlying storage was 4 disks configured as RAID-0 as this
    was the best configuration of storage I had available. Swap is on a
    separate disk. Dirty ratio was tuned to 40% instead of the default of
    20%.

    Testing was run with and without monitors to both verify that the patches
    were operating as expected and that any performance gain was real and not
    due to interference from monitors.

    Here is a summary of results based on testing XFS.

    512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%)
    512M1P-xfs Elapsed Time fsmark 51.41 48.29
    512M1P-xfs Elapsed Time simple-wb 114.09 108.61
    512M1P-xfs Elapsed Time mmap-strm 113.46 109.34
    512M1P-xfs Kswapd efficiency fsmark 62% 63%
    512M1P-xfs Kswapd efficiency simple-wb 56% 61%
    512M1P-xfs Kswapd efficiency mmap-strm 44% 42%
    512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%)
    512M-xfs Elapsed Time fsmark 56.08 48.90
    512M-xfs Elapsed Time simple-wb 112.22 98.13
    512M-xfs Elapsed Time mmap-strm 219.15 196.67
    512M-xfs Kswapd efficiency fsmark 54% 56%
    512M-xfs Kswapd efficiency simple-wb 54% 55%
    512M-xfs Kswapd efficiency mmap-strm 45% 44%
    512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%)
    512M-4X-xfs Elapsed Time fsmark 63.26 55.88
    512M-4X-xfs Elapsed Time simple-wb 100.90 90.25
    512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38
    512M-4X-xfs Kswapd efficiency fsmark 49% 50%
    512M-4X-xfs Kswapd efficiency simple-wb 54% 56%
    512M-4X-xfs Kswapd efficiency mmap-strm 37% 36%
    512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%)
    512M-16X-xfs Elapsed Time fsmark 67.47 58.25
    512M-16X-xfs Elapsed Time simple-wb 103.22 90.89
    512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82
    512M-16X-xfs Kswapd efficiency fsmark 45% 46%
    512M-16X-xfs Kswapd efficiency simple-wb 53% 55%
    512M-16X-xfs Kswapd efficiency mmap-strm 33% 33%

    Up until 512-4X, the FSmark improvements were statistically significant.
    For the 4X and 16X tests the results were within standard deviations but
    just barely. The time to completion for all tests is improved which is an
    important result. In general, kswapd efficiency is not affected by
    skipping dirty pages.

    1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%)
    1024M1P-xfs Elapsed Time fsmark 84.14 80.41
    1024M1P-xfs Elapsed Time simple-wb 210.77 184.78
    1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34
    1024M1P-xfs Kswapd efficiency fsmark 69% 75%
    1024M1P-xfs Kswapd efficiency simple-wb 71% 77%
    1024M1P-xfs Kswapd efficiency mmap-strm 43% 44%
    1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%)
    1024M-xfs Elapsed Time fsmark 94.59 91.00
    1024M-xfs Elapsed Time simple-wb 229.84 195.08
    1024M-xfs Elapsed Time mmap-strm 405.38 440.29
    1024M-xfs Kswapd efficiency fsmark 79% 71%
    1024M-xfs Kswapd efficiency simple-wb 74% 74%
    1024M-xfs Kswapd efficiency mmap-strm 39% 42%
    1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%)
    1024M-4X-xfs Elapsed Time fsmark 103.33 97.74
    1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57
    1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88
    1024M-4X-xfs Kswapd efficiency fsmark 81% 70%
    1024M-4X-xfs Kswapd efficiency simple-wb 73% 72%
    1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38%
    1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%)
    1024M-16X-xfs Elapsed Time fsmark 103.11 99.11
    1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24
    1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82
    1024M-16X-xfs Kswapd efficiency fsmark 84% 69%
    1024M-16X-xfs Kswapd efficiency simple-wb 74% 73%
    1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40%

    All FSMark tests up to 16X had statistically significant improvements.
    For the most part, tests are completing faster with the exception of the
    streaming writes to a mixture of anonymous and file-backed mappings which
    were slower in two cases.

    In the cases where the mmap-strm tests were slower, there was more
    swapping due to dirty pages being skipped. The number of additional pages
    swapped is almost identical to the fewer number of pages written from
    reclaim. In other words, roughly the same number of pages were reclaimed
    but swapping was slower. As the test is a bit unrealistic and stresses
    memory heavily, the small shift is acceptable.

    4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%)
    4608M1P-xfs Elapsed Time fsmark 512.01 492.15
    4608M1P-xfs Elapsed Time simple-wb 618.18 566.24
    4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07
    4608M1P-xfs Kswapd efficiency fsmark 93% 86%
    4608M1P-xfs Kswapd efficiency simple-wb 88% 84%
    4608M1P-xfs Kswapd efficiency mmap-strm 46% 45%
    4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%)
    4608M-xfs Elapsed Time fsmark 555.96 532.34
    4608M-xfs Elapsed Time simple-wb 659.72 571.85
    4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38
    4608M-xfs Kswapd efficiency fsmark 89% 91%
    4608M-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-xfs Kswapd efficiency mmap-strm 48% 46%
    4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%)
    4608M-4X-xfs Elapsed Time fsmark 592.91 564.00
    4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07
    4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53
    4608M-4X-xfs Kswapd efficiency fsmark 90% 94%
    4608M-4X-xfs Kswapd efficiency simple-wb 87% 82%
    4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43%
    4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%)
    4608M-16X-xfs Elapsed Time fsmark 602.69 585.78
    4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81
    4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86
    4608M-16X-xfs Kswapd efficiency fsmark 98% 98%
    4608M-16X-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42%

    Unlike the other tests, the fsmark results are not statistically
    significant but the min and max times are both improved and for the most
    part, tests completed faster.

    There are other indications that this is an improvement as well. For
    example, in the vast majority of cases, there were fewer pages scanned by
    direct reclaim implying in many cases that stalls due to direct reclaim
    are reduced. KSwapd is scanning more due to skipping dirty pages which is
    unfortunate but the CPU usage is still acceptable.

    In an earlier set of tests, I used blktrace and in almost all cases
    throughput throughout the entire test was higher. However, I ended up
    discarding those results as recording blktrace data was too heavy for my
    liking.

    On a laptop, I plugged in a USB stick and ran a similar set of tests
    using it as backing storage. A desktop environment was running and for
    the entire duration of the tests, firefox and gnome terminal were
    launching and exiting to vaguely simulate a user.

    1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
    1024M-xfs Elapsed Time fsmark 2053.52 1641.03
    1024M-xfs Elapsed Time simple-wb 1229.53 768.05
    1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
    1024M-xfs Kswapd efficiency fsmark 84% 85%
    1024M-xfs Kswapd efficiency simple-wb 92% 81%
    1024M-xfs Kswapd efficiency mmap-strm 60% 51%
    1024M-xfs Avg wait ms fsmark 5404.53 4473.87
    1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
    1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53

    The mmap-strm results were hurt because firefox launching had a tendency
    to push the test out of memory. On the positive side, firefox launched
    marginally faster with the patches applied. Time to completion for many
    tests was faster but more importantly - the "Avg wait" time as measured by
    iostat was far lower implying the system would be more responsive. It was
    also the case that "Avg wait ms" on the root filesystem was lower. I
    tested it manually and while the system felt slightly more responsive
    while copying data to a USB stick, it was marginal enough that it could be
    my imagination.

    This patch: do not writeback filesystem pages in direct reclaim.

    When kswapd is failing to keep zones above the min watermark, a process
    will enter direct reclaim in the same manner kswapd does. If a dirty page
    is encountered during the scan, this page is written to backing storage
    using mapping->writepage.

    This causes two problems. First, it can result in very deep call stacks,
    particularly if the target storage or filesystem are complex. Some
    filesystems ignore write requests from direct reclaim as a result. The
    second is that a single-page flush is inefficient in terms of IO. While
    there is an expectation that the elevator will merge requests, this does
    not always happen. Quoting Christoph Hellwig:

    The elevator has a relatively small window it can operate on,
    and can never fix up a bad large scale writeback pattern.

    This patch prevents direct reclaim writing back filesystem pages by
    checking if current is kswapd. Anonymous pages are still written to swap
    as there is not the equivalent of a flusher thread for anonymous pages.
    If the dirty pages cannot be written back, they are placed back on the LRU
    lists. There is now a direct dependency on dirty page balancing to
    prevent too many pages in the system being dirtied which would prevent
    reclaim making forward progress.
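
    The decision described above can be modelled in a few lines of userspace
    C. This is only a sketch under stated assumptions, not the kernel code:
    struct page_model, enum reclaim_action and decide_dirty_page() are
    hypothetical names, and the is_kswapd argument stands in for the check
    on whether current is kswapd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the page properties reclaim looks at. */
struct page_model {
    bool dirty;       /* page holds data not yet written to backing storage */
    bool file_backed; /* true: filesystem page, false: anonymous page */
};

enum reclaim_action {
    RECLAIM_FREE,       /* clean page, can be reclaimed right away */
    RECLAIM_WRITEBACK,  /* write the page out from reclaim context */
    RECLAIM_KEEP_ON_LRU /* put it back on the LRU and leave it to the flusher */
};

/*
 * Model of the policy in the text: only kswapd writes back dirty
 * filesystem pages; dirty anonymous pages are still written to swap
 * by any reclaimer because no flusher thread exists for them.
 */
enum reclaim_action decide_dirty_page(const struct page_model *page,
                                      bool is_kswapd)
{
    assert(page != NULL);

    if (!page->dirty)
        return RECLAIM_FREE;

    /* Direct reclaim must not write back filesystem pages. */
    if (page->file_backed && !is_kswapd)
        return RECLAIM_KEEP_ON_LRU;

    return RECLAIM_WRITEBACK;
}
```

    In this model a dirty file page seen by direct reclaim is simply kept on
    the LRU, which is why forward progress then depends on dirty page
    balancing keeping the number of dirty pages bounded.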

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add comments to explain the page statistics field in the mm_struct.

    [akpm@linux-foundation.org: add missing ;]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Some kernel components pin user space memory (infiniband and perf) (by
    increasing the page count) and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockalled process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    Once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needed to pin the RDMA
    memory.

    This patch introduces a separate counter for pinned pages and
    accounts them separately.
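
    A minimal userspace sketch of the accounting split follows. The struct
    and function names (mm_model, account_mlock(), account_pin()) and the
    field layout are assumptions made for illustration; only the idea of
    keeping pinned pages in their own counter comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the per-mm page accounting. */
struct mm_model {
    unsigned long total_vm;  /* virtual size of the process, in pages */
    unsigned long locked_vm; /* pages accounted as mlocked */
    unsigned long pinned_vm; /* pages pinned through an elevated refcount */
};

/* Account pages that were mlocked. */
void account_mlock(struct mm_model *mm, unsigned long npages)
{
    assert(mm != NULL);
    mm->locked_vm += npages;
}

/*
 * Account pages pinned by a driver. With a separate counter, pinning
 * an already mlocked page no longer inflates locked_vm.
 */
void account_pin(struct mm_model *mm, unsigned long npages)
{
    assert(mm != NULL);
    mm->pinned_vm += npages;
}
```

    With this split, a process that has locked all of its pages and then
    has some of them pinned keeps locked_vm no larger than its virtual size.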

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The nr_force_scan[] tuple holds the effective scan numbers for anon and
    file pages in case the situation called for a forced scan and the
    regularly calculated scan numbers turned out zero.

    However, the effective scan number can always be assumed to be
    SWAP_CLUSTER_MAX right before the division into anon and file. The
    numerators and denominator are properly set up for all cases, be it force
    scan for just file, just anon, or both, to do the right thing.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When we get a bad_page bug report, it's useful to see what modules the
    user had loaded.

    Signed-off-by: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Add the leading word "tmpfs" to the Kconfig string to make it blindingly
    obvious that this selection refers to tmpfs.

    Signed-off-by: Robert P. J. Day
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
    PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
    current's oom_score_adj for ksm and swapoff without requiring an
    additional per-process flag.

    Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
    then reinstate the previous value is racy since it's possible that
    userspace can set the value to something else itself before the old value
    is reinstated. That results in userspace setting current's oom_score_adj
    to a different value and then the kernel immediately setting it back to
    its previous value without notification.

    To fix this, a new compare_swap_oom_score_adj() function is introduced
    with the same semantics as the compare and swap CAS instruction, or
    CMPXCHG on x86. It is used to reinstate the previous value of
    oom_score_adj if and only if the present value is the same as the old
    value.
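
    The compare-and-swap semantics can be illustrated with C11 atomics in
    userspace. This is a sketch of the semantics only: the kernel function
    may well be implemented with locking rather than an atomic instruction,
    and test_set_adj() and compare_swap_adj() are hypothetical names
    operating on a plain atomic_int rather than on current.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Set a new value and return the previous one (test-and-set style). */
int test_set_adj(atomic_int *adj, int new_val)
{
    assert(adj != NULL);
    return atomic_exchange(adj, new_val);
}

/*
 * Store new_val only if the current value still equals old_val.
 * Returns true when the swap happened, false when someone else
 * changed the value in the meantime.
 */
bool compare_swap_adj(atomic_int *adj, int old_val, int new_val)
{
    assert(adj != NULL);

    /* The library call overwrites 'expected' on failure, so use a copy. */
    int expected = old_val;
    return atomic_compare_exchange_strong(adj, &expected, new_val);
}
```

    A caller would save the value returned by test_set_adj(), do its work,
    and then call compare_swap_adj() with the temporary value as old_val
    and the saved value as new_val. If userspace wrote a different value in
    between, the swap fails and the userspace value is preserved.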

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • After selecting a task to kill, the oom killer iterates all processes and
    kills all other threads that share the same mm_struct in different thread
    groups. It would not otherwise be helpful to kill a thread if its memory
    would not be subsequently freed.

    A kernel thread, however, may assume a user thread's mm by using
    use_mm(). This is only temporary and should not result in sending a
    SIGKILL to that kthread.

    This patch ensures that only user threads and not kthreads are sent a
    SIGKILL if they share the same mm_struct as the oom killed task.
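
    The selection rule can be sketched as a userspace model. Everything
    here is hypothetical and simplified: struct task_model, its fields and
    kill_mm_sharers() are illustrative names, an integer id stands in for
    the mm_struct pointer, and setting a flag stands in for sending
    SIGKILL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task descriptor. */
struct task_model {
    int mm_id;       /* identifies the address space the task is using */
    int tgid;        /* thread group id */
    bool is_kthread; /* kernel thread, possibly borrowing a user mm */
    bool killed;     /* set when the model "sends SIGKILL" */
};

/*
 * Kill every task that shares the victim's mm but lives in a different
 * thread group, skipping kernel threads that only borrow the mm.
 * Returns the number of tasks killed.
 */
size_t kill_mm_sharers(struct task_model *tasks, size_t ntasks,
                       const struct task_model *victim)
{
    size_t nkilled = 0;

    assert(tasks != NULL && victim != NULL);

    for (size_t i = 0; i < ntasks; i++) {
        struct task_model *t = &tasks[i];

        if (t->mm_id != victim->mm_id)
            continue; /* different address space */
        if (t->tgid == victim->tgid)
            continue; /* same thread group as the victim */
        if (t->is_kthread)
            continue; /* kthread only borrows the mm temporarily */

        t->killed = true;
        nkilled++;
    }
    return nkilled;
}
```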

    Signed-off-by: David Rientjes
    Reviewed-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If a thread has been oom killed and is frozen, thaw it before returning to
    the page allocator. Otherwise, it can stay frozen indefinitely and no
    memory will be freed.

    Signed-off-by: David Rientjes
    Reported-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Looks like someone got distracted after adding the comment characters.

    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-task block plugging can reduce block queue lock contention and increase
    request merging. Currently page reclaim doesn't support it. I originally
    thought page reclaim didn't need it, because the number of kswapd threads
    is limited and file cache writes are mostly done by the flusher threads.

    When I tested a workload with heavy swap on a 4-node machine, each CPU was
    doing direct page reclaim and swap. This caused block queue lock
    contention. In my test, without this patch, the CPU utilization was about
    2% ~ 7%. With the patch, the CPU utilization was about 1% ~ 3%. Disk
    throughput didn't change. This should improve normal kswapd writes and
    file cache writes too (by increasing request merging, for example), but the
    gain might be less obvious for the reasons explained above.

    Signed-off-by: Shaohua Li
    Cc: Jens Axboe
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was
    unsafe and removed in 2.6.34, but the pointless saw_unset_tag was left
    behind.

    Remove it now, and return 0 as soon as we see an unset tag - we already
    rely upon the root tag to be correct, returning 0 immediately if it's not
    set.
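
    The early-return lookup can be sketched on a toy tree in userspace.
    This is not the kernel's radix tree: the node layout, the fanout of 4
    and the names node_model, tree_model and model_tag_get() are all
    assumptions chosen to keep the example short, and only a single tag is
    modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SHIFT  2                     /* bits of index consumed per level */
#define MODEL_FANOUT (1u << MODEL_SHIFT)   /* slots per node */
#define MODEL_MASK   (MODEL_FANOUT - 1)

struct node_model {
    unsigned int tags;                      /* bit i set: slot i is tagged */
    struct node_model *slots[MODEL_FANOUT]; /* children; unused in leaves */
};

struct tree_model {
    int root_tagged;         /* summary: is anything in the tree tagged? */
    unsigned int height;     /* number of node levels below the root pointer */
    struct node_model *root;
};

/*
 * Return 1 if the item at 'index' is tagged, 0 otherwise. The walk
 * stops at the first level whose tag bit is clear; no state about
 * previously seen unset tags is carried along. Index bits above the
 * tree's range are ignored in this toy model.
 */
int model_tag_get(const struct tree_model *tree, unsigned long index)
{
    assert(tree != NULL);

    if (!tree->root_tagged)
        return 0;

    const struct node_model *node = tree->root;
    unsigned int level = tree->height;

    while (node != NULL && level > 0) {
        unsigned int shift = (level - 1) * MODEL_SHIFT;
        unsigned int offset = (unsigned int)(index >> shift) & MODEL_MASK;

        if (!(node->tags & (1u << offset)))
            return 0; /* unset tag: nothing below can be tagged */
        if (level == 1)
            return 1; /* reached the leaf level with the tag set */

        node = node->slots[offset];
        level--;
    }
    return 0;
}
```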

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins