08 Apr, 2014

5 commits

  • A new dump_page() routine was recently added, and marked
    EXPORT_SYMBOL_GPL. dump_page() was also added to the VM_BUG_ON_PAGE()
    macro, and so the end result is that non-GPL code can no longer call
    get_page() and a few other routines.

    This only happens if the kernel was compiled with CONFIG_DEBUG_VM.

    Change dump_page() to be EXPORT_SYMBOL.

    Longer explanation:

    Prior to commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON
    using VM_BUG_ON_PAGE"), it was possible to build MIT-licensed (non-GPL)
    drivers on Fedora. Fedora is semi-unique, in that it sets
    CONFIG_DEBUG_VM.

    Because Fedora sets CONFIG_DEBUG_VM, they end up pulling in dump_page(),
    via VM_BUG_ON_PAGE, via get_page(). As one of the authors of NVIDIA's
    new, open source, "UVM-Lite" kernel module, I originally chose to use
    the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
    because get_page() has always seemed to be very clearly intended for use
    by non-GPL driver code.

    So I'm hoping that making get_page() widely accessible again will not be
    too controversial. We did check with Fedora first, and they responded
    (https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
    try to get upstream changed before asking Fedora to change. Their
    reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
    Fedora to help catch mm bugs.
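
    A minimal sketch of the change, as the definition stood at the time
    (only the export annotation differs):

      void dump_page(struct page *page, const char *reason)
      {
              dump_page_badflags(page, reason, 0);
      }
      EXPORT_SYMBOL(dump_page);  /* was: EXPORT_SYMBOL_GPL(dump_page) */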

    Signed-off-by: John Hubbard
    Cc: Sasha Levin
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • This is a small cleanup.

    Signed-off-by: Emil Medve
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Emil Medve
     
  • On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.
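
    A sketch of the batch reset described above; reading the raw vm_stat
    counter directly lets underflowed (negative) values enter the delta:

      static void reset_alloc_batches(struct zone *preferred_zone)
      {
              struct zone *zone = preferred_zone->zone_pgdat->node_zones;

              do {
                      mod_zone_page_state(zone, NR_ALLOC_BATCH,
                              high_wmark_pages(zone) - low_wmark_pages(zone) -
                              atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
              } while (zone++ != preferred_zone);
      }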

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • I tried to use 'dump_page(page, __func__)' for debugging, but it
    triggers a warning:

    warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]

    Let's convert 'reason' to 'const char *' in dump_page() and friends: we
    shouldn't modify it anyway.
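
    The resulting prototype change is simply:

      /* before */
      void dump_page(struct page *page, char *reason);

      /* after: string literals and __func__ can be passed without warnings */
      void dump_page(struct page *page, const char *reason);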

    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We had a report about strange OOM killer strikes on a PPC machine,
    although there was a lot of swap free and tons of anonymous memory
    which could have been swapped out. In the end it turned out that the
    OOM was a side effect of zone reclaim, which wasn't unmapping and
    swapping out, and so the system was pushed to OOM. Although this
    sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct
    reclaim interaction, the numactl output on the said hardware suggests
    that zone reclaim should not have been set in the first place:

    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus:
    node 2 size: 7168 MB
    node 2 free: 6019 MB
    node distances:
    node   0   2
      0:  10  40
      2:  40  10

    So all the CPUs are associated with Node0 which doesn't have any memory
    while Node2 contains all the available memory. Node distances cause an
    automatic zone_reclaim_mode enabling.

    Zone reclaim is intended to keep allocations local, but this doesn't
    make any sense on memoryless nodes. So let's exclude such nodes in
    init_zone_allows_reclaim(), which evaluates the zone reclaim behavior
    and the suitable reclaim_nodes.
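
    A sketch of the idea: iterate only over nodes that actually have
    memory (N_MEMORY) instead of all online nodes:

      static void __paginginit init_zone_allows_reclaim(int nid)
      {
              int i;

              /* was: for_each_online_node(i), which included memoryless nodes */
              for_each_node_state(i, N_MEMORY)
                      if (node_distance(nid, i) <= RECLAIM_DISTANCE)
                              node_set(i, NODE_DATA(nid)->reclaim_nodes);
                      else
                              zone_reclaim_mode = 1;
      }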

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Nishanth Aravamudan
    Tested-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Apr, 2014

1 commit

  • Since put_mems_allowed() is strictly optional (it is a seqcount
    retry), we don't need to evaluate the function if the allocation was
    in fact successful, saving an smp_rmb(), some loads, and comparisons
    on some relatively fast paths.

    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
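
    A sketch of the resulting usage pattern; note the inverted return
    value, with read_mems_allowed_retry() returning true when a retry is
    needed:

      unsigned int cpuset_mems_cookie;
      struct page *page;

      do {
              cpuset_mems_cookie = read_mems_allowed_begin();
              page = __alloc_pages_nodemask(gfp, order, zonelist, nodemask);
              /* the retry check is skipped entirely on success */
      } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));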

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Mar, 2014

2 commits

  • Jan Stancek reports manual page migration encountering allocation
    failures after some pages when there is still plenty of memory free, and
    bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
    zone allocator policy").

    The problem is that GFP_THISNODE obeys the zone fairness allocation
    batches on one hand, but doesn't reset them and wake kswapd on the other
    hand. After a few of those allocations, the batches are exhausted and
    the allocations fail.

    Fixing this means either having GFP_THISNODE wake up kswapd, or
    GFP_THISNODE not participating in zone fairness at all. The latter
    seems safer as an acute bugfix; we can clean up later.
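
    A sketch of the exemption in the zonelist scan, assuming the
    NR_ALLOC_BATCH counter from the fair zone allocator policy:

      if (alloc_flags & ALLOC_WMARK_LOW) {
              /*
               * GFP_THISNODE may not wake kswapd, so it must not be
               * throttled by fairness batches it can never reset.
               */
              if ((gfp_mask & GFP_THISNODE) != GFP_THISNODE &&
                  zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
                      continue;
      }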

    Reported-by: Jan Stancek
    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier which ensures
    that, if PageTail(head) is true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path, and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception; we don't enforce a store memory barrier
    during init since no race is possible.
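
    A sketch of the barrier pairing:

      static inline struct page *compound_head(struct page *page)
      {
              if (unlikely(PageTail(page))) {
                      struct page *head = page->first_page;

                      /*
                       * Pairs with the smp_wmb() in prep_compound_page():
                       * if PageTail is still set, first_page is valid.
                       */
                      smp_rmb();
                      if (likely(PageTail(page)))
                              return head;
              }
              return page;
      }

      /* in prep_compound_page(), for each tail page p: */
      p->first_page = page;
      smp_wmb();      /* publish first_page before setting PageTail */
      __SetPageTail(p);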

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Jan, 2014

4 commits

  • min_free_kbytes may be raised during THP's initialization. Sometimes,
    this will change the value which was set by the user. Showing this
    message will clarify this confusion.

    Only show this message when changing a value which was set by the user
    according to Michal Hocko's suggestion.

    Show the old value of min_free_kbytes according to Dave Hansen's
    suggestion. This will give the user a chance to restore the old value of
    min_free_kbytes.

    Signed-off-by: Han Pingtian
    Reviewed-by: Michal Hocko
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • If you do "echo -1 > /proc/sys/vm/min_free_kbytes", the system will
    hang. Changing proc_dointvec() to proc_dointvec_minmax() in
    min_free_kbytes_sysctl_handler() can prevent this from happening.

    mhocko said:

    : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make
    : your machine unusable but I agree that proc_dointvec_minmax is more
    : suitable here as we already have:
    :
    : .proc_handler = min_free_kbytes_sysctl_handler,
    : .extra1 = &zero,
    :
    : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
    : to ctl_name and strategy from the generic sysctl table") has removed
    : sysctl_intvec strategy and so extra1 is ignored.
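
    A sketch of the fixed handler; .extra1 = &zero now takes effect
    because proc_dointvec_minmax() honors it:

      int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
              void __user *buffer, size_t *length, loff_t *ppos)
      {
              int rc;

              /* was proc_dointvec(), which ignores .extra1/.extra2 */
              rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
              if (rc)
                      return rc;

              if (write)
                      setup_per_zone_wmarks();
              return 0;
      }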

    Signed-off-by: Han Pingtian
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page to various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
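
    A sketch of the macro, assuming the usual CONFIG_DEBUG_VM guard:

      #ifdef CONFIG_DEBUG_VM
      #define VM_BUG_ON_PAGE(cond, page)                      \
              do {                                            \
                      if (unlikely(cond)) {                   \
                              dump_page(page, NULL);          \
                              BUG();                          \
                      }                                       \
              } while (0)
      #else
      #define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond)
      #endif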

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • bad_page() is cool in that it prints out a bunch of data about the page.
    But, I can never remember which page flags are good and which are bad,
    or whether ->index or ->mapping is required to be NULL.

    This patch allows bad/dump_page() callers to specify a string about why
    they are dumping the page and adds explanation strings to a number of
    places. It also adds a 'bad_flags' argument to bad_page(), which it
    then dumps out separately from the flags which are actually set.

    This way, the messages will show specifically why the page was bad
    and, if a page flag combination was the problem, specifically which
    flags it is complaining about.
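
    For example, callers in the free path might now look like this
    (sketch):

      if (unlikely(page_mapcount(page)))
              bad_page(page, "nonzero mapcount", 0);

      if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_FREE))
              bad_page(page, "PAGE_FLAGS_CHECK_AT_FREE flag(s) set",
                       page->flags & PAGE_FLAGS_CHECK_AT_FREE);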

    [akpm@linux-foundation.org: switch to pr_alert]
    Signed-off-by: Dave Hansen
    Reviewed-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

22 Jan, 2014

6 commits

  • Allocations with __GFP_NOFAIL may still return NULL when coupled with
    GFP_NOWAIT or GFP_ATOMIC.

    Luckily, nothing currently does such craziness. So instead of causing
    such allocations to loop (potentially forever), we maintain the current
    behavior and also warn about the new users of the deprecated flag.

    Suggested-by: Andrew Morton
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of successful allocation, the deferred status and counter are
    reset completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.
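
    A sketch of the new helper:

      static inline void compaction_defer_reset(struct zone *zone, int order,
              bool alloc_success)
      {
              if (alloc_success) {
                      /* full reset on successful allocation */
                      zone->compact_considered = 0;
                      zone->compact_defer_shift = 0;
              }
              /* track the lowest order expected to fail */
              if (order >= zone->compact_order_failed)
                      zone->compact_order_failed = order + 1;
      }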

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Switch to memblock interfaces for the early memory allocator instead
    of the bootmem allocator. No functional change in behavior from the
    bootmem users' point of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers built on top of memblock. For
    the archs which still use bootmem, these new APIs just fall back to
    the existing bootmem APIs.
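
    For illustration, a typical conversion looks like this (a hedged
    sketch; memblock_virt_alloc() is one of the new wrappers, and the
    bootmem fallback happens inside it on unconverted architectures):

      /* before: bootmem wrapper */
      ptr = alloc_bootmem(size);

      /* after: memblock-backed API; on !NO_BOOTMEM architectures this
       * falls back internally to the existing bootmem allocator */
      ptr = memblock_virt_alloc(size, 0);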

    Signed-off-by: Grygorii Strashko
    Signed-off-by: Santosh Shilimkar
    Cc: Yinghai Lu
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tony Lindgren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     
  • If users specify the original movablecore=nn@ss boot option, the kernel
    will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot
    option is similar except it specifies ZONE_NORMAL ranges.

    Now, if users specify "movable_node" in kernel commandline, the kernel
    will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users
    do this, all the other movablecore=nn@ss and kernelcore=nn@ss options
    should be ignored.

    For those who don't want this, just specify nothing. The kernel will
    act as before.
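
    For illustration, possible boot command lines (option forms taken
    from this log):

      movable_node           hotpluggable SRAT ranges become ZONE_MOVABLE;
                             movablecore=/kernelcore= are then ignored
      movablecore=nn@ss      without movable_node: [ss, ss+nn) is ZONE_MOVABLE
      kernelcore=nn@ss       without movable_node: [ss, ss+nn) is ZONE_NORMAL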

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Reviewed-by: Wanpeng Li
    Cc: "H. Peter Anvin"
    Cc: "Rafael J . Wysocki"
    Cc: Chen Tang
    Cc: Gong Chen
    Cc: Ingo Molnar
    Cc: Jiang Liu
    Cc: Johannes Weiner
    Cc: Lai Jiangshan
    Cc: Larry Woodman
    Cc: Len Brown
    Cc: Liu Jiang
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Prarit Bhargava
    Cc: Rik van Riel
    Cc: Taku Izumi
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Thomas Renninger
    Cc: Toshi Kani
    Cc: Vasilis Liaskovitis
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Commit 4b59e6c47309 ("mm, show_mem: suppress page counts in
    non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to
    suppress PFN walks on large memory machines. Commit c78e93630d15 ("mm:
    do not walk all of system memory during show_mem") avoided a PFN walk in
    the generic show_mem helper which removes the requirement for
    SHOW_MEM_FILTER_PAGE_COUNT in that case.

    This patch removes PFN walkers from the arch-specific implementations
    that report on a per-node or per-zone granularity. ARM and unicore32
    still do a PFN walk, as they report memory usage on each bank, which
    is a much finer granularity where the debugging information may still
    be of use. As the remaining arches doing PFN walks have relatively
    small
    amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.

    [akpm@linux-foundation.org: fix parisc]
    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Tony Luck
    Cc: Russell King
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Yasuaki Ishimatsu reported that memory hot-add spent more than 5
    _hours_ on a 9TB memory machine, since onlining memory sections is too
    slow. We found out that setup_zone_migrate_reserve() spent >90% of the
    time.

    The problem is that setup_zone_migrate_reserve() scans all pageblocks
    unconditionally, but it is only necessary if the number of reserved
    blocks was reduced (i.e. memory hot remove).

    Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
    that the number of reserved pageblocks is almost always unchanged.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve() dramatically. The following table shows
    the time to online a memory section.

    Amount of memory     | 128GB | 192GB | 256GB |
    ---------------------+-------+-------+-------+
    linux-3.12           |  23.9 |  31.4 |  44.5 |
    This patch           |   8.3 |   8.3 |   8.6 |
    Mel's proposal patch |  10.9 |  19.2 |  31.3 |
    ---------------------+-------+-------+-------+
    (milliseconds)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea in the following thread:
    https://lkml.org/lkml/2013/10/30/272
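
    A sketch of the early exit inside setup_zone_migrate_reserve(),
    assuming the new per-zone counter named in this log:

      /* at most 2 MIGRATE_RESERVE pageblocks, sized by the min watermark */
      reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
                                                      pageblock_order;
      reserve = min(2, reserve);
      old_reserve = zone->nr_migrate_reserve_block;

      /* on memory hot-add the target is almost always unchanged, so the
       * expensive scan over every pageblock in the zone can be skipped */
      if (reserve == old_reserve)
              return;
      zone->nr_migrate_reserve_block = reserve;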

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

21 Dec, 2013

2 commits

  • Commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") meant
    to bring aging fairness among zones in the system, but it was overzealous
    and badly regressed basic workloads on NUMA systems.

    Due to the way kswapd and the page allocator interact, we still want to
    make sure that all zones in any given node are used equally for all
    allocations to maximize memory utilization and prevent thrashing on the
    highest zone in the node.

    While the same principle applies to NUMA nodes - memory utilization is
    obviously improved by spreading allocations throughout all nodes -
    remote references can be costly and so many workloads prefer locality
    over memory utilization. The original change assumed that
    zone_reclaim_mode would be a good enough predictor for that, but it
    turned out to be as indicative as a coin flip.

    Revert the NUMA aspect of the fairness until we can find a proper way to
    make it configurable and agree on a sane default.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Cc: # 3.12
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This reverts commit 73f038b863df. The NUMA behaviour of this patch is
    less than ideal. An alternative approach is to interleave allocations
    only within local zones, which is implemented in the next patch.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

1 commit

  • Dave Hansen noted a regression in a microbenchmark that loops around
    open() and close() on an 8-node NUMA machine and bisected it down to
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").
    That change forces the slab allocations of the file descriptor to spread
    out to all 8 nodes, causing remote references in the page allocator and
    slab.

    The round-robin policy is only there to provide fairness among memory
    allocations that are reclaimed involuntarily based on pressure in each
    zone. It does not make sense to apply it to unreclaimable kernel
    allocations that are freed manually, in this case instantly after the
    allocation, and incur the remote reference costs twice for no reason.

    Only round-robin allocations that are usually freed through page reclaim
    or slab shrinking.

    Bisected by Dave Hansen.

    Signed-off-by: Johannes Weiner
    Cc: Dave Hansen
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Nov, 2013

7 commits

  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • When __rmqueue_fallback() doesn't find a free block of the required
    size, it splits a larger page and puts the rest of the page onto the
    free list.

    But it has one serious mistake. When putting pages back,
    __rmqueue_fallback() always uses start_migratetype if the type is not
    CMA. However, __rmqueue_fallback() is only called when all of the
    start_migratetype queues are empty. That is, __rmqueue_fallback()
    always puts memory back on the wrong queue, except when
    try_to_steal_freepages() changed the pageblock type (i.e. the
    requested size is smaller than half of the pageblock). The end result
    is that the antifragmentation framework increases fragmentation
    instead of decreasing it.

    Mel's original anti-fragmentation code did the right thing, but commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores the sane old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").
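
    A sketch of the fix: the remainder of the split block goes back on
    the list of the migratetype the block actually has now:

      new_type = try_to_steal_freepages(zone, page, start_migratetype,
                                        migratetype);

      /* was: expand(..., start_migratetype), which put the remainder on
       * a queue that is empty by definition at this point */
      expand(zone, page, order, current_order, area, new_type);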

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In general, every tracepoint should have zero overhead if it is
    disabled. However, trace_mm_page_alloc_extfrag() is one exception. It
    evaluates "new_type == start_migratetype" even if the tracepoint is
    disabled.

    However, the code can be moved into the tracepoint's TP_fast_assign(),
    and TP_fast_assign() exists for exactly this purpose. This patch does
    that.
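
    A trimmed sketch of the idea, keeping only the field relevant here;
    the comparison moves out of the call site and into TP_fast_assign(),
    which only runs when the tracepoint fires:

      TRACE_EVENT(mm_page_alloc_extfrag,

              TP_PROTO(struct page *page, int alloc_order, int fallback_order,
                       int alloc_migratetype, int fallback_migratetype,
                       int new_migratetype),

              TP_ARGS(page, alloc_order, fallback_order,
                      alloc_migratetype, fallback_migratetype, new_migratetype),

              TP_STRUCT__entry(
                      __field(int, change_ownership)
              ),

              TP_fast_assign(
                      /* evaluated only when the tracepoint is enabled */
                      __entry->change_ownership =
                              (new_migratetype == alloc_migratetype);
              ),

              TP_printk("change_ownership=%d", __entry->change_ownership)
      );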

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
    MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It
    rewrites the argument to MIGRATE_UNMOVABLE and these attributes are
    lost.

    The problem was introduced by commit 49255c619fbd ("page allocator:
    move check for disabled anti-fragmentation out of fastpath"). A
    4-year-old issue may mean that nobody uses
    page_group_by_mobility_disabled.

    But anyway, this patch fixes the problem.
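
    A sketch of the fix:

      void set_pageblock_migratetype(struct page *page, int migratetype)
      {
              /* collapse only the ordinary types to UNMOVABLE; leave
               * MIGRATE_CMA and MIGRATE_ISOLATE intact */
              if (unlikely(page_group_by_mobility_disabled &&
                           migratetype < MIGRATE_PCPTYPES))
                      migratetype = MIGRATE_UNMOVABLE;

              set_pageblock_flags_group(page, (unsigned long)migratetype,
                                        PB_migrate, PB_migrate_end);
      }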

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Use helper function to check if we need to deal with oom condition.

    Signed-off-by: Qiang Huang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Use "if (zone->present_pages)" instead of "if (zone->present_pages)".
    Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

09 Oct, 2013

2 commits

  • Change the per-page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that private accesses are more
    important for cpu->memory locality in the general case. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.

    Signed-off-by: Mel Gorman
    [ Fix compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

01 Oct, 2013

1 commit

  • This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
    overflow when offline pages").

    The bug fixed by commit cea27eb2a202 was fixed another way by commit
    3dcc0571cd64 ("mm: correctly update zone->managed_pages"). That commit
    enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
    memory, for details please refer to:

    http://marc.info/?l=linux-mm&m=136957578620221&w=2

    As a result, commit cea27eb2a202 currently causes duplicated decreasing
    of totalhigh_pages, thus the revert.

    Signed-off-by: Joonyoung Shim
    Reviewed-by: Wanpeng Li
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonyoung Shim
     

12 Sep, 2013

9 commits

  • Set _mapcount to PAGE_BUDDY_MAPCOUNT_VALUE to mark the page as buddy,
    not the magic number -2.
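
    That is (sketch):

      /* mark the page as buddy via _mapcount, using the named constant
       * rather than a bare magic number */
      atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);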

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • This patch is based on KOSAKI's work, and I add a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that the system can enter a state where there are lots of free
    pages in a zone, but only order-0 and order-1 pages, which means the
    zone is heavily fragmented. A high-order allocation can then cause a
    long stall (e.g. 60 seconds) in the direct reclaim path, especially in
    an environment with no swap and no compaction. This problem happened
    on v3.4, but the issue still lives in the current tree; the reason is
    that do_try_to_free_pages() enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there's little point in trying
    all over again to avoid an infinite loop. Instead it changes the order
    from high-order to order-0, because kswapd thinks order-0 is the most
    important. See commit 73ce02e9 for the details. If the watermarks are
    ok, kswapd will go back to sleep and may leave zone->all_unreclaimable
    = 0. It assumes high-order users can still perform direct reclaim if
    they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages
    forever while kswapd sleeps forever, until someone like a watchdog
    detects the situation and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because direct reclaim doesn't take any locks, so that approach would
    be racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because it is racy. Commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
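
    A sketch of the recalculated state, using the threshold from this
    patch:

      unsigned long zone_reclaimable_pages(struct zone *zone)
      {
              int nr;

              nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                   zone_page_state(zone, NR_INACTIVE_FILE);

              if (get_nr_swap_pages() > 0)
                      nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                            zone_page_state(zone, NR_INACTIVE_ANON);

              return nr;
      }

      bool zone_reclaimable(struct zone *zone)
      {
              /* give up after scanning 6x the reclaimable pages */
              return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
      }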

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • cpuset_zone_allowed() was changed to cpuset_zone_allowed_softwall()
    and the comment was moved to __cpuset_node_allowed_softwall(). So fix
    this comment.

    Signed-off-by: SeungHun Lee
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeungHun Lee
     
  • The current early_pfn_to_nid() on arches that support memblock goes
    over memblock.memory one entry at a time, so it takes too many tries
    near the end.

    We can use the existing memblock_search() to find the node id for a
    given pfn, which can save some time on bigger systems that have many
    entries in the memblock.memory array.
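
    A sketch of the binary-search version, assuming a memblock_search()
    helper that returns the index of the matching region:

      int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
              unsigned long *start_pfn, unsigned long *end_pfn)
      {
              struct memblock_type *type = &memblock.memory;
              int mid = memblock_search(type, (phys_addr_t)pfn << PAGE_SHIFT);

              if (mid == -1)
                      return -1;

              *start_pfn = type->regions[mid].base >> PAGE_SHIFT;
              *end_pfn = (type->regions[mid].base + type->regions[mid].size)
                              >> PAGE_SHIFT;
              return memblock_get_region_node(&type->regions[mid]);
      }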

    Here are the timing differences for several machines. In each case with
    the patch less time was spent in __early_pfn_to_nid().

                            3.11-rc5   with patch   difference (%)
                            --------   ----------   --------------
    UV1: 256 nodes  9TB:      411.66       402.47    -9.19 (2.23%)
    UV2: 255 nodes 16TB:     1141.02      1138.12    -2.90 (0.25%)
    UV2:  64 nodes  2TB:      128.15       126.53    -1.62 (1.26%)
    UV2:  32 nodes  2TB:      121.87       121.07    -0.80 (0.66%)
    Time in seconds.

    Signed-off-by: Yinghai Lu
    Cc: Tejun Heo
    Acked-by: Russ Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Until now we couldn't offline memory blocks which contain hugepages
    because a hugepage is considered an unmovable page. But now with this patch
    series, a hugepage has become movable, so by using hugepage migration we
    can offline such memory blocks.

    What's different from other users of hugepage migration is that we need to
    decompose all the hugepages inside the target memory block into free buddy
    pages after hugepage migration, because otherwise free hugepages
    remaining in the memory block interfere with the memory offlining. For
    this reason we
    introduce new functions dissolve_free_huge_page() and
    dissolve_free_huge_pages().

    Other than that, what this patch does is straightforward: it adds
    hugepage migration code, that is, adding hugepage code to the
    functions which scan
    over pfn and collect hugepages to be migrated, and adding a hugepage
    allocation function to alloc_migrate_target().

    As for larger hugepages (1GB for x86_64), it's not easy to do
    hotremove over them because such a hugepage is larger than a memory
    block. So for now we simply leave it to fail.
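
    A simplified sketch of the dissolve step (the real code scans with
    the minimum hugepage order across all hstates):

      /* decompose any free hugepages in [start_pfn, end_pfn) into free
       * buddy pages so they cannot block the offline operation */
      void dissolve_free_huge_pages(unsigned long start_pfn,
                                    unsigned long end_pfn)
      {
              unsigned int order = huge_page_order(&default_hstate);
              unsigned long pfn;

              for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
                      dissolve_free_huge_page(pfn_to_page(pfn));
      }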

    [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
    Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • The main idea behind this patchset is to reduce the vmstat update overhead
    by avoiding interrupt enable/disable and the use of per cpu atomics.

    This patch (of 3):

    It is better to have a separate folding function because
    refresh_cpu_vm_stats() also does other things like expire pages in the
    page allocator caches.

    If we have a separate function then refresh_cpu_vm_stats() is only called
    from the local cpu which allows additional optimizations.

    The folding function is only called when a cpu is being downed and
    therefore no other processor will be accessing the counters. This also
    simplifies synchronization.
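
    A sketch of the separate fold function; since the cpu is already
    offline, its per-cpu diffs can be read and cleared without
    irq-disable or per-cpu atomics:

      void cpu_vm_stats_fold(int cpu)
      {
              struct zone *zone;
              int i;

              for_each_populated_zone(zone) {
                      struct per_cpu_pageset *p =
                              per_cpu_ptr(zone->pageset, cpu);

                      for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                              if (p->vm_stat_diff[i]) {
                                      int v = p->vm_stat_diff[i];

                                      p->vm_stat_diff[i] = 0;
                                      atomic_long_add(v, &zone->vm_stat[i]);
                                      atomic_long_add(v, &vm_stat[i]);
                              }
              }
      }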

    [akpm@linux-foundation.org: fix UP build]
    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We rarely allocate a page with ALLOC_NO_WATERMARKS, and it is used in
    the slow path. To help compiler optimization, add the unlikely() macro
    to the ALLOC_NO_WATERMARKS check.

    This patch doesn't have any effect now, because gcc already optimizes
    this properly. But we cannot assume that gcc always does the right
    thing, and nobody re-evaluates whether gcc optimizes properly after
    each change; for example, it is not optimized properly on v3.10. So
    adding a compiler hint here is reasonable.
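
    That is (sketch):

      /* ALLOC_NO_WATERMARKS is rare and only set in the slow path */
      if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
              goto try_this_zone;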

    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced twice and
    activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim are now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).
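
    A sketch of the batch mechanism, assuming the NR_ALLOC_BATCH vmstat
    counter the patch introduces (simplified from the zonelist scan):

      for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
                                      nodemask) {
              /* fairness pass: skip zones that used up their batch */
              if ((alloc_flags & ALLOC_WMARK_LOW) &&
                  zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
                      continue;

              page = buffered_rmqueue(preferred_zone, zone, order,
                                      gfp_mask, migratetype);
              if (page) {
                      /* charge the allocation against the zone's batch */
                      __mod_zone_page_state(zone, NR_ALLOC_BATCH,
                                            -(1 << order));
                      break;
              }
      }
      /* batches are reset when the slowpath wakes kswapd */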

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner