22 Nov, 2014

3 commits

  • commit 5bcc9f86ef09a933255ee66bd899d4601785dad5 upstream.

    For MIGRATE_RESERVE pages, it is desirable that they do not get
    misplaced on the free_list of another migratetype, otherwise they might
    get allocated prematurely and e.g. fragment the MIGRATE_RESERVE
    pageblocks.
    While this cannot be avoided completely when allocating new
    MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should
    prevent the misplacement where possible.

    Currently, it is possible for the misplacement to happen when a
    MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a
    fallback for other desired migratetype, and then later freed back
    through free_pcppages_bulk() without being actually used. This happens
    because free_pcppages_bulk() uses get_freepage_migratetype() to choose
    the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
    the *desired* migratetype and not the page's original MIGRATE_RESERVE
    migratetype.

    This patch fixes the problem by moving the call to
    set_freepage_migratetype() from rmqueue_bulk() down to
    __rmqueue_smallest() and __rmqueue_fallback() where the actual page's
    migratetype (i.e. which free_list the page is taken from) is used.
    Note that this migratetype might be different from the pageblock's
    migratetype due to freepage stealing decisions. This is OK, as page
    stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
    to leave all MIGRATE_CMA pages on the correct freelist.

    Therefore, as an additional benefit, the call to
    get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
    be removed completely. This relies on the fact that MIGRATE_CMA
    pageblocks are created only during system init, and the above. The
    related is_migrate_isolate() check is also unnecessary, as memory
    isolation has other ways to move pages between freelists, and drain pcp
    lists containing pages that should be isolated. The buffered_rmqueue()
    can also benefit from calling get_freepage_migratetype() instead of
    get_pageblock_migratetype().
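
    A minimal sketch of the resulting flow (assuming the helpers named above;
    this is not the upstream diff): the migratetype recorded in the page is
    the one of the free_list it was actually taken from, so a later
    free_pcppages_bulk() puts it back on the same list.

        /* Allocation side, now inside __rmqueue_smallest()/__rmqueue_fallback():
         * 'migratetype' is the free_list the page was really taken from, which
         * may differ from the type the caller originally asked for.
         */
        set_freepage_migratetype(page, migratetype);

        /* Free side: free_pcppages_bulk() trusts the recorded value. */
        int mt = get_freepage_migratetype(page);
        __free_one_page(page, zone, 0, mt);     /* page returns to free_list[mt] */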

    Signed-off-by: Vlastimil Babka
    Reported-by: Yong-Taek Lee
    Reported-by: Bartlomiej Zolnierkiewicz
    Suggested-by: Joonsoo Kim
    Acked-by: Joonsoo Kim
    Suggested-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Marek Szyprowski
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Michal Nazarewicz
    Cc: "Wang, Yalin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream.

    We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault path to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.
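
    Roughly, the conversion looks like this (a sketch, not the full diff; the
    enum already exists in include/linux/migrate_mode.h):

        enum migrate_mode {
                MIGRATE_ASYNC,
                MIGRATE_SYNC_LIGHT,
                MIGRATE_SYNC,
        };

        struct compact_control {
                /* ... other fields unchanged ... */
                enum migrate_mode mode;         /* was: bool sync */
        };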

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream.

    Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
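
    A sketch of the extended interface (signatures as described above):

        typedef void free_page_t(struct page *page, unsigned long private);

        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          free_page_t put_new_page, unsigned long private,
                          enum migrate_mode mode, int reason);

        /* Passing NULL for put_new_page keeps the old behaviour of freeing
         * failed destination pages back to the system. */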

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

15 Nov, 2014

2 commits

  • commit abe5f972912d086c080be4bde67750630b6fb38b upstream.

    The zone allocation batches can easily underflow due to higher-order
    allocations or spills to remote nodes. On SMP that's fine, because
    underflows are expected from concurrency and dealt with by returning 0.
    But on UP, zone_page_state will just return a wrapped unsigned long,
    which will get past the <= 0 check and keep the zone eligible for
    allocation.
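
    A tiny userspace illustration of the arithmetic (hypothetical code, not
    taken from the kernel; 64-bit unsigned long assumed):

        #include <stdio.h>

        int main(void)
        {
                unsigned long batch = 1;  /* zone's remaining allocation batch */

                batch -= 4;   /* a higher-order allocation underflows it */

                /* On SMP the counter is clamped to 0 when read; on UP the raw
                 * wrapped value comes back, so "batch exhausted?" never fires.
                 */
                printf("batch = %lu, batch <= 0 ? %d\n", batch, batch <= 0);
                /* prints: batch = 18446744073709551613, batch <= 0 ? 0 */
                return 0;
        }
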
    Reported-by: Leon Romanovsky
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.

    PM freezer relies on having all tasks frozen by the time devices are
    getting frozen so that no task will touch them while they are getting
    frozen. But OOM killer is allowed to kill an already frozen task in
    order to handle an OOM situation. In order to protect from late wake ups
    OOM killer is disabled after all tasks are frozen. This, however, still
    keeps a window open when a killed task didn't manage to die by the time
    freeze_processes finishes.

    Reduce the race window by checking all tasks after OOM killer has been
    disabled. This is still not race free completely unfortunately because
    oom_killer_disable cannot stop an already ongoing OOM killer so a task
    might still wake up from the fridge and get killed without
    freeze_processes noticing. Full synchronization of OOM and freezer is,
    however, too heavyweight for this highly unlikely case.

    Introduce and check oom_kills counter which gets incremented early when
    the allocator enters __alloc_pages_may_oom path and only check all the
    tasks if the counter changes during the freezing attempt. The counter
    is updated so early to reduce the race window since allocator checked
    oom_killer_disabled which is set by PM-freezing code. A false positive
    will push the PM-freezer into a slow path but that is not a big deal.
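
    The shape of the mechanism, sketched (assuming helper names along the
    lines described above):

        static atomic_t oom_kills = ATOMIC_INIT(0);

        void note_oom_kill(void)        /* called from __alloc_pages_may_oom() */
        {
                atomic_inc(&oom_kills);
        }

        int oom_kills_count(void)       /* compared before/after by the freezer */
        {
                return atomic_read(&oom_kills);
        }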

    Changes since v1
    - push the re-check loop out of freeze_processes into
    check_frozen_processes and invert the condition to make the code more
    readable as per Rafael

    Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

10 Oct, 2014

2 commits

  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

    Since put_mems_allowed() is strictly optional (it is a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb and some loads and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
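
    Typical usage after the rename follows the usual seqcount pattern (a
    sketch; gfp/order/zonelist/nodemask stand for the caller's arguments):

        unsigned int cpuset_mems_cookie;
        struct page *page;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                page = __alloc_pages_nodemask(gfp, order, zonelist, nodemask);
                /* retry only if the allocation failed *and* mems_allowed
                 * changed underneath us */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));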

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.

    We had a report about strange OOM killer strikes on a PPC machine
    although there was a lot of swap free and tons of anonymous memory
    which could be swapped out. In the end it turned out that the OOM was a
    side effect of zone reclaim, which wasn't unmapping and swapping out, and
    so the system was pushed to OOM. Although this sounds like a bug
    somewhere in the kswapd vs. zone reclaim vs. direct reclaim
    interaction, numactl on the said hardware suggests that zone reclaim
    should not have been enabled in the first place:

    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus:
    node 2 size: 7168 MB
    node 2 free: 6019 MB
    node distances:
    node   0   2
      0:  10  40
      2:  40  10

    So all the CPUs are associated with Node0 which doesn't have any memory
    while Node2 contains all the available memory. Node distances cause an
    automatic zone_reclaim_mode enabling.

    Zone reclaim is intended to keep the allocations local but this doesn't
    make any sense on the memoryless nodes. So let's exclude such nodes for
    init_zone_allows_reclaim which evaluates zone reclaim behavior and
    suitable reclaim_nodes.
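
    Roughly the shape of the fix (a sketch; only nodes that actually have
    memory are considered when building reclaim_nodes):

        static void init_zone_allows_reclaim(int nid)
        {
                int i;

                /* was: for_each_online_node(i), which included memoryless nodes */
                for_each_node_state(i, N_MEMORY)
                        if (node_distance(nid, i) <= RECLAIM_DISTANCE)
                                node_set(i, NODE_DATA(nid)->reclaim_nodes);
                        else
                                zone_reclaim_mode = 1;
        }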

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Nishanth Aravamudan
    Tested-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Aug, 2014

1 commit

  • commit b104a35d32025ca740539db2808aa3385d0f30eb upstream.

    The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
    should be set in allocflags. ALLOC_CPUSET controls if a page allocation
    should be restricted only to the set of allowed cpuset mems.

    Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
    the fault path from using memory compaction or direct reclaim. Thus, it
    is unfairly able to allocate outside of its cpuset mems restriction as a
    side-effect.

    This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
    truly GFP_ATOMIC by verifying it is also not a thp allocation.
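
    The key change, sketched: a thp fault clears __GFP_WAIT but still carries
    __GFP_NO_KSWAPD, which is what lets gfp_to_alloc_flags() tell it apart
    from a real atomic allocation (not the exact upstream hunk):

        int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
        const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));

        if (atomic) {
                if (!(gfp_mask & __GFP_NOMEMALLOC))
                        alloc_flags |= ALLOC_HARDER;
                /* only truly atomic allocations may ignore cpuset mems */
                alloc_flags &= ~ALLOC_CPUSET;
        }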

    Signed-off-by: David Rientjes
    Reported-by: Alex Thorlton
    Tested-by: Alex Thorlton
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: Hedi Berriche
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Srivatsa S. Bhat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

10 Jul, 2014

2 commits

  • commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5 upstream.

    With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
    the following is triggered at early boot:

    SMP: Total of 8 processors activated.
    devtmpfs: initialized
    Unable to handle kernel NULL pointer dereference at virtual address 00000008
    pgd = fffffe0000050000
    [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
    Internal error: Oops: 96000006 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
    task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
    PC is at __list_add+0x10/0xd4
    LR is at free_one_page+0x270/0x638
    ...
    Call trace:
    __list_add+0x10/0xd4
    free_one_page+0x26c/0x638
    __free_pages_ok.part.52+0x84/0xbc
    __free_pages+0x74/0xbc
    init_cma_reserved_pageblock+0xe8/0x104
    cma_init_reserved_areas+0x190/0x1e4
    do_one_initcall+0xc4/0x154
    kernel_init_freeable+0x204/0x2a8
    kernel_init+0xc/0xd4

    This happens because init_cma_reserved_pageblock() calls
    __free_one_page() with pageblock_order as page order but it is bigger
    than MAX_ORDER. This in turn causes accesses past zone->free_list[].

    Fix the problem by changing init_cma_reserved_pageblock() such that it
    splits pageblock into individual MAX_ORDER pages if pageblock is bigger
    than a MAX_ORDER page.

    In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
    architectures except ia64, powerpc and tile at the moment, the
    “pageblock_order > MAX_ORDER” condition will be optimised out since both
    sides of the operator are constants. In cases where pageblock size is
    variable, the performance degradation should not be significant anyway
    since init_cma_reserved_pageblock() is called only at boot time at most
    MAX_CMA_AREAS times which by default is eight.
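
    The splitting logic, roughly (a sketch of the fixed function; details such
    as clearing PG_reserved and adjusting managed page counts are omitted):

        void __init init_cma_reserved_pageblock(struct page *page)
        {
                unsigned i = pageblock_nr_pages;
                struct page *p = page;

                set_pageblock_migratetype(page, MIGRATE_CMA);

                if (pageblock_order >= MAX_ORDER) {
                        /* free the pageblock as MAX_ORDER-1 sized chunks */
                        do {
                                set_page_refcounted(p);
                                __free_pages(p, MAX_ORDER - 1);
                                p += MAX_ORDER_NR_PAGES;
                        } while (i -= MAX_ORDER_NR_PAGES);
                } else {
                        set_page_refcounted(page);
                        __free_pages(page, pageblock_order);
                }
        }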

    Signed-off-by: Michal Nazarewicz
    Reported-by: Mark Salter
    Tested-by: Mark Salter
    Tested-by: Christopher Covington
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Marek Szyprowski
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Nazarewicz
     
  • commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream.

    Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.
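
    The accepted values, sketched as a handler check (assuming a
    MIN_PERCPU_PAGELIST_FRACTION constant of 8, matching the documentation):

        ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
        if (!write || ret < 0)
                return ret;

        /* 0 restores the kernel default; anything in [1, 7] is rejected */
        if (percpu_pagelist_fraction &&
            percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION)
                return -EINVAL;

        /* otherwise recompute each zone's pcp->high from the new fraction */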

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

01 Jul, 2014

1 commit

  • commit e58469bafd0524e848c3733bc3918d854595e20f upstream.

    The test_bit operations in get/set pageblock flags are expensive. This
    patch reads the bitmap on a word basis and uses shifts and masks to
    isolate the bits of interest. Similarly, masks are used to set a local
    copy of the bitmap and then cmpxchg is used to update the bitmap if there
    have been no other changes made in parallel.
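
    The update side, sketched as a generic cmpxchg loop over the bitmap word
    (not the exact upstream helper):

        static void set_pageblock_word_sketch(unsigned long *bitmap,
                        unsigned long word_bitidx, unsigned long mask,
                        unsigned long flags)
        {
                unsigned long word, old_word;

                word = ACCESS_ONCE(bitmap[word_bitidx]);
                for (;;) {
                        old_word = cmpxchg(&bitmap[word_bitidx], word,
                                           (word & ~mask) | flags);
                        if (old_word == word)
                                break;           /* our update went in unmodified */
                        word = old_word;         /* lost a race; retry on new value */
                }
        }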

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported the case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore, during development of
    unrelated compaction patches, it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand
    times and has resulted in NULL pointer dereferences in move_freepages()
    and free_one_page() in places where free_list[migratetype] is
    manipulated by e.g. list_move(). Further debugging confirmed that
    migratetype had invalid value of 6, causing out of bounds access to the
    free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and currently only fatal where page isolation is performed due to
    memory hot remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race has not been observed in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

06 May, 2014

1 commit

  • commit 3a025760fc158b3726eac89ee95d7f29599e9dfa upstream.

    On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.
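
    The batch-reset detail, roughly: read the raw, possibly negative counter
    so the refill delta stays correct even after an underflow (a sketch):

        /* instead of zone_page_state(), which clamps negative values to 0 */
        mod_zone_page_state(zone, NR_ALLOC_BATCH,
                high_wmark_pages(zone) - low_wmark_pages(zone) -
                atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));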

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     

04 Mar, 2014

2 commits

  • Jan Stancek reports manual page migration encountering allocation
    failures after some pages when there is still plenty of memory free, and
    bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
    zone allocator policy").

    The problem is that GFP_THISNODE obeys the zone fairness allocation
    batches on one hand, but doesn't reset them and wake kswapd on the other
    hand. After a few of those allocations, the batches are exhausted and
    the allocations fail.

    Fixing this means either having GFP_THISNODE wake up kswapd, or
    GFP_THISNODE not participating in zone fairness at all. The latter
    seems safer as an acute bugfix, we can clean up later.

    Reported-by: Jan Stancek
    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that,
    if PageTail(head) is true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting, PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception, we don't enforce a store memory barrier
    during init since no race is possible.
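
    The resulting pairing, sketched:

        static inline struct page *compound_head(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        struct page *head = page->first_page;

                        /* paired with the smp_wmb() in prep_compound_page():
                         * if PageTail is still set, first_page is valid */
                        smp_rmb();
                        if (likely(PageTail(page)))
                                return head;
                }
                return page;
        }

        /* prep_compound_page(), for each tail page p: */
        p->first_page = page;
        smp_wmb();              /* publish first_page before PageTail is visible */
        __SetPageTail(p);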

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Jan, 2014

4 commits

  • min_free_kbytes may be raised during THP's initialization. Sometimes,
    this will change the value which was set by the user. Printing a message
    when that happens makes the change visible and avoids confusion.

    Per Michal Hocko's suggestion, only show this message when changing a
    value which was set by the user.

    Per Dave Hansen's suggestion, also show the old value of min_free_kbytes.
    This gives the user the chance to restore the old value of
    min_free_kbytes.

    Signed-off-by: Han Pingtian
    Reviewed-by: Michal Hocko
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • If you echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang.
    Changing proc_dointvec() to proc_dointvec_minmax() in
    min_free_kbytes_sysctl_handler() prevents this from happening.

    mhocko said:

    : You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make
    : your machine unusable but I agree that proc_dointvec_minmax is more
    : suitable here as we already have:
    :
    : .proc_handler = min_free_kbytes_sysctl_handler,
    : .extra1 = &zero,
    :
    : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
    : to ctl_name and strategy from the generic sysctl table") has removed
    : sysctl_intvec strategy and so extra1 is ignored.
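
    The resulting handler, sketched (proc_dointvec_minmax() now honours the
    existing .extra1 = &zero and rejects negative values):

        int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
                void __user *buffer, size_t *length, loff_t *ppos)
        {
                int rc;

                rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
                if (rc)
                        return rc;

                if (write)
                        setup_per_zone_wmarks();
                return 0;
        }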

    Signed-off-by: Han Pingtian
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the page
    at various VM_BUG_ON sites, I've noticed that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
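
    The macro, sketched (at this point dump_page() still takes only the page):

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON_PAGE(cond, page)                              \
                do {                                                    \
                        if (unlikely(cond)) {                           \
                                dump_page(page);                        \
                                BUG();                                  \
                        }                                               \
                } while (0)
        #else
        #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
        #endif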

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • bad_page() is cool in that it prints out a bunch of data about the page.
    But, I can never remember which page flags are good and which are bad,
    or whether ->index or ->mapping is required to be NULL.

    This patch allows bad/dump_page() callers to specify a string about why
    they are dumping the page and adds explanation strings to a number of
    places. It also adds a 'bad_flags' argument to bad_page(), which it
    then dumps out separately from the flags which are actually set.

    This way, the messages will show specifically why the page was bad and,
    if a page flag combination was the problem, which flags it is complaining
    about.

    [akpm@linux-foundation.org: switch to pr_alert]
    Signed-off-by: Dave Hansen
    Reviewed-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

22 Jan, 2014

6 commits

  • An allocation with __GFP_NOFAIL may return NULL when coupled with
    GFP_NOWAIT or GFP_ATOMIC.

    Luckily, nothing currently does such craziness. So instead of causing
    such allocations to loop (potentially forever), we maintain the current
    behavior and also warn about the new users of the deprecated flag.
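
    The warning, sketched, in the spot of __alloc_pages_slowpath() where a
    !__GFP_WAIT allocation gives up:

        if (!wait) {
                /* __GFP_NOFAIL cannot be honoured without being able to sleep;
                 * warn so new users of the deprecated flag get noticed. */
                WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
                goto nopage;
        }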

    Suggested-by: Andrew Morton
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of successful allocation, the deferred status and counter is reset
    completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.
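
    The new helper, sketched:

        static inline void compaction_defer_reset(struct zone *zone, int order,
                                                  bool alloc_success)
        {
                if (alloc_success) {
                        zone->compact_considered = 0;
                        zone->compact_defer_shift = 0;
                }
                if (order >= zone->compact_order_failed)
                        zone->compact_order_failed = order + 1;
        }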

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Switch to memblock interfaces for the early memory allocator instead of
    the bootmem allocator. No functional change in behavior from the bootmem
    users' point of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers built on top of memblock. For the
    archs which still use bootmem, these new APIs just fall back to the
    existing bootmem APIs.

    Signed-off-by: Grygorii Strashko
    Signed-off-by: Santosh Shilimkar
    Cc: Yinghai Lu
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tony Lindgren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     
  • If users specify the original movablecore=nn@ss boot option, the kernel
    will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot
    option is similar except it specifies ZONE_NORMAL ranges.

    Now, if users specify "movable_node" in kernel commandline, the kernel
    will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users
    do this, all the other movablecore=nn@ss and kernelcore=nn@ss options
    should be ignored.

    For those who don't want this, just specify nothing. The kernel will
    act as before.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Reviewed-by: Wanpeng Li
    Cc: "H. Peter Anvin"
    Cc: "Rafael J . Wysocki"
    Cc: Chen Tang
    Cc: Gong Chen
    Cc: Ingo Molnar
    Cc: Jiang Liu
    Cc: Johannes Weiner
    Cc: Lai Jiangshan
    Cc: Larry Woodman
    Cc: Len Brown
    Cc: Liu Jiang
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Prarit Bhargava
    Cc: Rik van Riel
    Cc: Taku Izumi
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Thomas Renninger
    Cc: Toshi Kani
    Cc: Vasilis Liaskovitis
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Commit 4b59e6c47309 ("mm, show_mem: suppress page counts in
    non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to
    suppress PFN walks on large memory machines. Commit c78e93630d15 ("mm:
    do not walk all of system memory during show_mem") avoided a PFN walk in
    the generic show_mem helper which removes the requirement for
    SHOW_MEM_FILTER_PAGE_COUNT in that case.

    This patch removes PFN walkers from the arch-specific implementations
    that report on a per-node or per-zone granularity. ARM and unicore32
    still do a PFN walk as they report memory usage on each bank which is a
    much finer granularity where the debugging information may still be of
    use. As the remaining arches doing PFN walks have relatively small
    amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.

    [akpm@linux-foundation.org: fix parisc]
    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Tony Luck
    Cc: Russell King
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_
    on a 9TB memory machine because onlining memory sections is too slow, and
    we found out that setup_zone_migrate_reserve() spent >90% of the time.

    The problem is that setup_zone_migrate_reserve() scans all pageblocks
    unconditionally, but it is only necessary when the number of reserved
    blocks was reduced (i.e. memory hot remove).

    Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
    that the number of reserved pageblocks is almost always unchanged.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve dramatically. The following table shows time
    of onlining a memory section.

    Amount of memory         | 128GB | 192GB | 256GB |
    --------------------------------------------------
    linux-3.12               |  23.9 |  31.4 |  44.5 |
    This patch               |   8.3 |   8.3 |   8.6 |
    Mel's proposal patch(*1) |  10.9 |  19.2 |  31.3 |
    --------------------------------------------------
    (millisecond)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea in the following thread:
    https://lkml.org/lkml/2013/10/30/272

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

21 Dec, 2013

2 commits

  • Commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") meant
    to bring aging fairness among zones in the system, but it was overzealous
    and badly regressed basic workloads on NUMA systems.

    Due to the way kswapd and page allocator interacts, we still want to
    make sure that all zones in any given node are used equally for all
    allocations to maximize memory utilization and prevent thrashing on the
    highest zone in the node.

    While the same principle applies to NUMA nodes - memory utilization is
    obviously improved by spreading allocations throughout all nodes -
    remote references can be costly and so many workloads prefer locality
    over memory utilization. The original change assumed that
    zone_reclaim_mode would be a good enough predictor for that, but it
    turned out to be as indicative as a coin flip.

    Revert the NUMA aspect of the fairness until we can find a proper way to
    make it configurable and agree on a sane default.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Cc: # 3.12
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This reverts commit 73f038b863df. The NUMA behaviour of this patch is
    less than ideal. An alternative approach is to interleave allocations
    only within local zones which is implemented in the next patch.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

1 commit

  • Dave Hansen noted a regression in a microbenchmark that loops around
    open() and close() on an 8-node NUMA machine and bisected it down to
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").
    That change forces the slab allocations of the file descriptor to spread
    out to all 8 nodes, causing remote references in the page allocator and
    slab.

    The round-robin policy is only there to provide fairness among memory
    allocations that are reclaimed involuntarily based on pressure in each
    zone. It does not make sense to apply it to unreclaimable kernel
    allocations that are freed manually, in this case instantly after the
    allocation, and incur the remote reference costs twice for no reason.

    Only round-robin allocations that are usually freed through page reclaim
    or slab shrinking.

    Bisected by Dave Hansen.

    Signed-off-by: Johannes Weiner
    Cc: Dave Hansen
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Nov, 2013

7 commits

  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • When __rmqueue_fallback() doesn't find a free block with the required size
    it splits a larger page and puts the rest of the page onto the free list.

    But it has one serious mistake. When putting back, __rmqueue_fallback()
    always uses start_migratetype if the type is not CMA. However,
    __rmqueue_fallback() is only called when all of the start_migratetype
    queues are empty. That is, __rmqueue_fallback() always puts back memory
    to the wrong queue except when try_to_steal_freepages() has changed the
    pageblock type (i.e. the requested size is smaller than half of the page
    block). The end result is that the anti-fragmentation framework increases
    fragmentation instead of decreasing it.

    Mel's original anti-fragmentation code does the right thing. But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores sane and old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In general, every tracepoint should be zero overhead if it is disabled.
    However, trace_mm_page_alloc_extfrag() is one exception. It evaluates
    "new_type == start_migratetype" even if the tracepoint is disabled.

    However, the code can be moved into the tracepoint's TP_fast_assign(),
    and TP_fast_assign() exists for exactly this purpose. This patch does that.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
    MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites
    the argument to MIGRATE_UNMOVABLE and these attributes are lost.

    The problem was introduced by commit 49255c619fbd ("page allocator: move
    check for disabled anti-fragmentation out of fastpath"). The fact that
    this 4-year-old issue went unnoticed may mean that nobody uses
    page_group_by_mobility_disabled.

    But anyway, this patch fixes the problem.
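
    The fix, sketched: only the ordinary pcp migratetypes are collapsed to
    MIGRATE_UNMOVABLE, leaving MIGRATE_CMA and MIGRATE_ISOLATE alone (roughly
    the resulting function):

        void set_pageblock_migratetype(struct page *page, int migratetype)
        {
                if (unlikely(page_group_by_mobility_disabled &&
                             migratetype < MIGRATE_PCPTYPES))
                        migratetype = MIGRATE_UNMOVABLE;

                set_pageblock_flags_group(page, (unsigned long)migratetype,
                                          PB_migrate, PB_migrate_end);
        }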

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Use helper function to check if we need to deal with oom condition.

    Signed-off-by: Qiang Huang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Use "if (zone->present_pages)" instead of "if (zone->present_pages)".
    Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

09 Oct, 2013

2 commits

  • Change the per page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and lookup the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that in general the private accesses
    are more important for cpu->memory locality in the general case. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.

    Signed-off-by: Mel Gorman
    [ Fix compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

01 Oct, 2013

1 commit

  • This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
    overflow when offline pages").

    The bug fixed by commit cea27eb2a202 was fixed in another way by commit
    3dcc0571cd64 ("mm: correctly update zone->managed_pages"). That commit
    enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
    memory; for details please refer to:

    http://marc.info/?l=linux-mm&m=136957578620221&w=2

    As a result, commit cea27eb2a202 currently causes duplicated decreasing
    of totalhigh_pages, thus the revert.

    Signed-off-by: Joonyoung Shim
    Reviewed-by: Wanpeng Li
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonyoung Shim
     

12 Sep, 2013

3 commits

  • Set _mapcount to PAGE_BUDDY_MAPCOUNT_VALUE to mark the page as a buddy
    page, instead of using the magic number -2.

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • This patch is based on KOSAKI's work, and I have added a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    Currently, the system can enter a state where there are lots of free
    pages in a zone but only order-0 and order-1 pages, which means the zone
    is heavily fragmented. Then a high-order allocation can cause long stalls
    (e.g. 60 seconds) in the direct reclaim path, especially in an environment
    with no swap and no compaction. This problem happened on v3.4, but the
    issue still seems to live in the current tree; the reason is that
    do_try_to_free_pages() enters a live lock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, as kswapd thinks there's little point in trying all over
    again and wants to avoid an infinite loop. Instead it changes the order
    from high-order to order-0 because kswapd thinks order-0 is the most
    important (look at 73ce02e9 for details). If the watermarks are ok, kswapd
    will go back to sleep and may leave zone->all_unreclaimable = 0. It
    assumes high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill.
    So it means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue page reclaim forever while
    kswapd sleeps forever, until something like a watchdog detects it and
    finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so that way would
    be racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy; commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
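
    The replacement check, sketched (reclaimability is derived from counters
    the zone already maintains):

        static bool zone_reclaimable(struct zone *zone)
        {
                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
        }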

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • cpuset_zone_allowed was changed to cpuset_zone_allowed_softwall and the
    comment was moved to __cpuset_node_allowed_softwall, so fix this comment.

    Signed-off-by: SeungHun Lee
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeungHun Lee