19 May, 2016

1 commit

  • commit 44f43e99fe70833058482d183e99fdfd11220996 upstream.

    zs_can_compact() has two race conditions in its core calculation:

        unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
                                   zs_stat_get(class, OBJ_USED);

    1) classes are not locked, so the numbers of allocated and used
       objects can be changed by concurrent ops happening on other CPUs
    2) the shrinker invokes it from a preemptible context

    Thus, depending on the circumstances, OBJ_ALLOCATED can momentarily
    become less than OBJ_USED, which can result in either a very high or
    a negative `total_scan' value calculated later in do_shrink_slab().
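
    A stand-alone C program (not kernel code; the counter values are made
    up) shows the unsigned wrap-around on a 64-bit machine:

        #include <stdio.h>

        int main(void)
        {
                /* Hypothetical counter values: concurrent updates made
                 * "used" momentarily exceed "allocated". */
                unsigned long allocated = 100, used = 162;
                unsigned long wasted = allocated - used;        /* wraps around */

                printf("wasted = %lu (as signed: %ld)\n", wasted, (long)wasted);
                return 0;
        }

    It prints 18446744073709551554, i.e. -62 when read as signed, which
    matches the "nr=-62" lines below.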

    do_shrink_slab() has some logic to prevent those cases:

    vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
    vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
    vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64
    vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
    vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
    vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62

    However, due to the way `total_scan' is calculated, not every
    shrinker->count_objects() overflow can be spotted and handled.
    To demonstrate the latter, I added some debugging code to do_shrink_slab()
    (x86_64) and the results were:

    vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
    vmscan: but total_scan > 0: 92679974445502
    vmscan: resulting total_scan: 92679974445502
    [..]
    vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
    vmscan: but total_scan > 0: 22634041808232578
    vmscan: resulting total_scan: 22634041808232578

    Even though shrinker->count_objects() has returned an overflowed value,
    the resulting `total_scan' is positive, and, what is more worrisome, it
    is insanely huge. This value is getting used later on in
    shrinker->scan_objects() loop:

        while (total_scan >= batch_size ||
               total_scan >= freeable) {
                unsigned long ret;
                unsigned long nr_to_scan = min(batch_size, total_scan);

                shrinkctl->nr_to_scan = nr_to_scan;
                ret = shrinker->scan_objects(shrinker, shrinkctl);
                if (ret == SHRINK_STOP)
                        break;
                freed += ret;

                count_vm_events(SLABS_SCANNED, nr_to_scan);
                total_scan -= nr_to_scan;

                cond_resched();
        }

    `total_scan >= batch_size' is true for a very, very long time, and
    `total_scan >= freeable' is also true for quite some time, because
    `freeable < 0' and `total_scan' is large enough, for example,
    22634041808232578. The only break condition, in the given scheme of
    things, is the shrinker->scan_objects() == SHRINK_STOP test, which is
    a bit too weak to rely on, especially in heavy zsmalloc-usage
    scenarios.
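
    For a sense of scale, a stand-alone calculation, assuming the
    shrinker's default batch size of 128 (an assumption about
    SHRINK_BATCH, not something stated in this commit):

        #include <stdio.h>

        int main(void)
        {
                unsigned long total_scan = 22634041808232578UL;
                unsigned long batch_size = 128; /* assumed SHRINK_BATCH default */

                printf("scan_objects() calls needed: %lu\n",
                       total_scan / batch_size);
                return 0;
        }

    That is on the order of 10^14 scan_objects() calls before the loop
    can terminate on its own.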

    To fix the issue, take a pool stat snapshot and use it instead of
    racy zs_stat_get() calls.
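
    A minimal user-space sketch of the idea, with a hypothetical helper
    rather than the actual patch: read the counters once into locals and
    treat an inconsistent snapshot as "nothing to compact":

        #include <stdio.h>

        /* Hypothetical model of the guarded calculation, not the actual patch. */
        static unsigned long can_compact(unsigned long allocated, unsigned long used)
        {
                if (allocated <= used)
                        return 0;       /* inconsistent snapshot: nothing to compact */
                return allocated - used;
        }

        int main(void)
        {
                printf("%lu\n", can_compact(100, 162)); /* 0 instead of a huge value */
                printf("%lu\n", can_compact(162, 100)); /* 62 */
                return 0;
        }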

    Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     

11 May, 2016

4 commits

  • commit 74d369443325063a5f0260e63971decb950fd8fa upstream.

    Commit 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use
    wb_domain aware operations") unintentionally changed this function's
    meaning from "are there more dirty pages than the background writeback
    threshold" to "are there more dirty pages than the writeback threshold".
    The background writeback threshold is typically half of the writeback
    threshold, so this had the effect of raising the number of dirty pages
    required to cause a writeback worker to perform background writeout.

    This can cause a very severe performance regression when a BDI uses
    BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker
    can now disagree on whether writeback should be initiated.

    For example, in a system having 1GB of RAM, a single spinning disk, and a
    "pass-through" FUSE filesystem mounted over the disk, application code
    mmapped a 128MB file on the disk and was randomly dirtying pages in that
    mapping.

    Because FUSE uses strictlimit and has a default max_ratio of only 1%, in
    balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the
    dirty_freerun_ceiling is the average of those, ~150. So, it pauses the
    dirtying processes when we have 151 dirty pages and wakes up a background
    writeback worker. But the worker tests the wrong threshold (200 instead of
    100), so it does not initiate writeback and just returns.
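
    The disagreement can be reproduced with plain arithmetic in a
    stand-alone program, using the numbers from the example above
    (dirty_freerun_ceiling being the average of the two thresholds):

        #include <stdio.h>

        int main(void)
        {
                /* Page counts from the FUSE strictlimit example above. */
                unsigned long thresh = 200, bg_thresh = 100;
                unsigned long freerun = (thresh + bg_thresh) / 2;       /* 150 */
                unsigned long nr_dirty = 151;

                printf("dirty_freerun_ceiling = %lu\n", freerun);
                printf("balance_dirty_pages throttles: %s\n",
                       nr_dirty > freerun ? "yes" : "no");
                printf("worker acts if it tests thresh (the bug): %s\n",
                       nr_dirty > thresh ? "yes" : "no");
                printf("worker acts if it tests bg_thresh (intended): %s\n",
                       nr_dirty > bg_thresh ? "yes" : "no");
                return 0;
        }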

    Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
    worker who will do nothing. It remains stuck in this state until the few
    dirty pages that we have finally expire and we write them back for that
    reason. Then the whole process repeats, resulting in near-zero throughput
    through the FUSE BDI.

    The fix is to call the parameterized variant of wb_calc_thresh, so that
    the worker will do writeback if bg_thresh is exceeded, which was the
    behavior before the referenced commit.

    Fixes: 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations")
    Signed-off-by: Howard Cochran
    Acked-by: Tejun Heo
    Signed-off-by: Miklos Szeredi
    Tested-by: Sedat Dilek
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Howard Cochran
     
  • commit bc22af74f271ef76b2e6f72f3941f91f0da3f5f8 upstream.

    Khugepaged attempts to raise min_free_kbytes if it's set too low.
    However, on boot khugepaged sets min_free_kbytes first from
    subsys_initcall(), and then the mm 'core' overrides min_free_kbytes
    afterwards from init_per_zone_wmark_min(), via a module_init() call.

    Khugepaged used to use a late_initcall() to set min_free_kbytes (such
    that it occurred after the core initialization), however this was
    removed when the initialization of min_free_kbytes was integrated into
    the starting of the khugepaged thread.

    The fix here is simply to invoke the core initialization using a
    core_initcall() instead of module_init(), such that the previous
    initialization ordering is restored. I didn't restore the
    late_initcall() since start_stop_khugepaged() already sets
    min_free_kbytes via set_recommended_min_free_kbytes().

    This was noticed when we had a number of page allocation failures when
    moving a workload to a kernel with this new initialization ordering. On
    an 8GB system this restores min_free_kbytes back to 67584 from 11365
    when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
    CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
    CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.

    Fixes: 79553da293d3 ("thp: cleanup khugepaged startup")
    Signed-off-by: Jason Baron
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jason Baron
     
  • commit 32a4e169039927bfb6ee9f0ccbbe3a8aaf13a4bc upstream.

    Instead of using "zswap" as the name for all zpools created, add an
    atomic counter and use "zswap%x" with the counter number for each zpool
    created, to provide a unique name for each new zpool.

    As zsmalloc, one of the zpool implementations, requires/expects a unique
    name for each pool created, zswap should provide a unique name. The
    zsmalloc pool creation does not fail if a new pool with a conflicting
    name is created, unless CONFIG_ZSMALLOC_STAT is enabled; in that case,
    zsmalloc pool creation fails with -ENOMEM. Then zswap will be unable to
    change its compressor parameter if its zpool is zsmalloc; it also will
    be unable to change its zpool parameter back to zsmalloc, if it has any
    existing old zpool using zsmalloc with page(s) in it. Attempts to
    change the parameters will result in failure to create the zpool. This
    changes zswap to provide a unique name for each zpool creation.
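
    The naming scheme itself is simple; a stand-alone C11 sketch with a
    hypothetical counter (not the zswap code itself):

        #include <stdio.h>
        #include <stdatomic.h>

        static atomic_uint pool_counter;        /* hypothetical stand-in */

        int main(void)
        {
                char name[16];

                for (int i = 0; i < 3; i++) {
                        /* "zswap%x" plus a per-creation counter yields a
                         * unique name for every pool. */
                        snprintf(name, sizeof(name), "zswap%x",
                                 atomic_fetch_add(&pool_counter, 1));
                        printf("%s\n", name);   /* zswap0, zswap1, zswap2 */
                }
                return 0;
        }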

    Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
    Signed-off-by: Dan Streetman
    Reported-by: Sergey Senozhatsky
    Reviewed-by: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Streetman
     
  • commit 14af4a5e9b26ad251f81c174e8a43f3e179434a5 upstream.

    /proc/sys/vm/stat_refresh warns that nr_isolated_anon and
    nr_isolated_file go increasingly negative under compaction: which would
    add delay when there should be none, or no delay when there should be
    one. The bug in compaction was due to a recent mmotm patch, but a much
    older instance of the bug was also noticed in
    isolate_migratepages_range(), which is used for CMA and gigantic
    hugepage allocations.

    The bug is caused by putback_movable_pages() in an error path
    decrementing the isolated counters without them being previously
    incremented by acct_isolated(). Fix isolate_migratepages_range() by
    removing the error-path putback, thus reaching acct_isolated() with
    migratepages still isolated, and leaving putback to caller like most
    other places do.

    Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
    [vbabka@suse.cz: expanded the changelog]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

05 May, 2016

6 commits

  • commit d7e69488bd04de165667f6bc741c1c0ec6042ab9 upstream.

    Currently, migration code increases num_poisoned_pages on a *failed*
    migration page as well as on a successfully migrated one at the trial
    of memory-failure. This makes the stat wrong. As well, it marks the
    page as PG_HWPoison even if the migration trial failed. That would
    mean we cannot recover the corrupted page using the memory-failure
    facility.

    This patch fixes it.

    Signed-off-by: Minchan Kim
    Reported-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit 7bf52fb891b64b8d61caf0b82060adb9db761aec upstream.

    We used to reclaim the highmem zone if buffer_heads was over the limit,
    but commit 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from
    shrink_zone()") changed the behavior so the highmem zone is no longer
    reclaimed even when buffer_heads is over the limit. This patch restores
    the logic.

    Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit 28093f9f34cedeaea0f481c58446d9dac6dd620f upstream.

    In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
    because the layouts may differ depending on the architecture. On s390
    this will lead to inaccurate numa_maps accounting in /proc because of
    misguided pte_present() and pte_dirty() checks on the fake pte.

    On other architectures pte_present() and pte_dirty() may work by chance,
    but there may be an issue with direct-access (dax) mappings w/o
    underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
    available. In vm_normal_page() the fake pte will be checked with
    pte_special() and because there is no "special" bit in a pmd, this will
    always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
    skipped. On dax mappings w/o struct pages, an invalid struct page
    pointer would then be returned that can crash the kernel.

    This patch fixes the numa_maps THP handling by introducing new "_pmd"
    variants of the can_gather_numa_stats() and vm_normal_page() functions.

    Signed-off-by: Gerald Schaefer
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Dan Williams
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michael Holzheu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     
  • commit 3486b85a29c1741db99d0c522211c82d2b7a56d0 upstream.

    Khugepaged detects its own VMAs by checking vm_file and vm_ops, but this
    way it cannot distinguish private /dev/zero mappings from other special
    mappings like /dev/hpet, which has no vm_ops and populates PTEs in mmap.

    This fixes a false-positive VM_BUG_ON and prevents installing THPs where
    they are not expected.

    Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com
    Fixes: 78f11a255749 ("mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups")
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Dmitry Vyukov
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Dmitry Vyukov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • commit 264a0ae164bc0e9144bebcd25ff030d067b1a878 upstream.

    Hello,

    So, this ended up a lot simpler than I originally expected. I tested
    it lightly and it seems to work fine. Petr, can you please test these
    two patches w/o the lru drain drop patch and see whether the problem
    is gone?

    Thanks.
    ------ 8< ------
    If charge moving is used, memcg performs relabeling of the affected
    pages from its ->attach callback, which is called under
    cgroup_threadgroup_rwsem and thus can't create new kthreads. This is
    fragile as various operations may depend on workqueues making forward
    progress, which relies on the ability to create new kthreads.

    There's no reason to perform charge moving from ->attach which is deep
    in the task migration path. Move it to ->post_attach which is called
    after the actual migration is finished and cgroup_threadgroup_rwsem is
    dropped.

    * move_charge_struct->mm is added and ->can_attach is now responsible
    for pinning and recording the target mm. mem_cgroup_clear_mc() is
    updated accordingly. This also simplifies mem_cgroup_move_task().

    * mem_cgroup_move_task() is now called from ->post_attach instead of
    ->attach.

    Signed-off-by: Tejun Heo
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Debugged-and-tested-by: Petr Mladek
    Reported-by: Cyril Hrubis
    Reported-by: Johannes Weiner
    Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 376bf125ac781d32e202760ed7deb1ae4ed35d31 upstream.

    This change is primarily an attempt to make it easier to realize the
    optimizations the compiler performs in case CONFIG_MEMCG_KMEM is not
    enabled.

    Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the
    overhead is zero. This is because, as long as no process has enabled
    kmem cgroups accounting, the assignment is replaced by asm NOP
    operations. This is possible because memcg_kmem_enabled() uses a
    static_key_false() construct.

    It also helps readability as it avoids accessing the p[] array like
    p[size - 1], which "exposes" that the array is processed backwards
    inside the helper function build_detached_freelist().

    Lastly, this also makes the code more robust in error cases like
    passing NULL pointers in the array, which had been handled before
    commit 033745189b1b ("slub: add missing kmem cgroup support to
    kmem_cache_free_bulk").

    Fixes: 033745189b1b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk")
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jesper Dangaard Brouer
     

20 Apr, 2016

1 commit

  • commit 6f25a14a7053b69917e2ebea0d31dd444cd31fd5 upstream.

    It is incorrect to use next_node to find a target node; it will return
    MAX_NUMNODES or an invalid node. This will lead to a crash in buddy
    system allocation.

    Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Signed-off-by: Xishi Qiu
    Acked-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: "Laura Abbott"
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Xishi Qiu
     

13 Apr, 2016

3 commits

  • commit d9dddbf556674bf125ecd925b24e43a5cf2a568a upstream.

    Hanjun Guo has reported that a CMA stress test causes broken accounting of
    CMA and free pages:

    > Before the test, I got:
    > -bash-4.3# cat /proc/meminfo | grep Cma
    > CmaTotal: 204800 kB
    > CmaFree: 195044 kB
    >
    >
    > After running the test:
    > -bash-4.3# cat /proc/meminfo | grep Cma
    > CmaTotal: 204800 kB
    > CmaFree: 6602584 kB
    >
    > So the freed CMA memory is more than total..
    >
    > Also the MemFree is more than mem total:
    >
    > -bash-4.3# cat /proc/meminfo
    > MemTotal: 16342016 kB
    > MemFree: 22367268 kB
    > MemAvailable: 22370528 kB

    Laura Abbott has confirmed the issue and suspected the freepage accounting
    rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
    caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
    pageblocks:

    > CMA isolates MAX_ORDER aligned blocks, but, during the process,
    > partially isolated block exists. If MAX_ORDER is 11 and
    > pageblock_order is 9, two pageblocks make up MAX_ORDER
    > aligned block and I can think following scenario because pageblock
    > (un)isolation would be done one by one.
    >
    > (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
    > MIGRATE_ISOLATE, respectively.)
    >
    > CC -> IC -> II (Isolation)
    > II -> CI -> CC (Un-isolation)
    >
    > If some pages are freed at this intermediate state such as IC or CI,
    > that page could be merged to the other page that is resident on
    > different type of pageblock and it will cause wrong freepage count.

    This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
    but since it doesn't hold the zone->lock between pageblocks, a race
    window does exist.

    It's also likely that unexpected merging can occur between
    MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in
    __free_one_page() since commit 3c605096d315 ("mm/page_alloc: restrict
    max order of merging on isolated pageblock"). However, we only check
    the migratetype of the pageblock where buddy merging has been initiated,
    not the migratetype of the buddy pageblock (or group of pageblocks)
    which can be MIGRATE_ISOLATE.

    Joonsoo has suggested checking for buddy migratetype as part of
    page_is_buddy(), but that would add extra checks in allocator hotpath
    and bloat-o-meter has shown significant code bloat (the function is
    inline).

    This patch reduces the bloat at some expense of more complicated code.
    The buddy-merging while-loop in __free_one_page() is initially bounded
    to pageblock_order and without any migratetype checks. The checks are
    placed outside, bumping the max_order if merging is allowed, and
    returning to the while-loop with a statement which can't possibly be
    considered harmful.

    This fixes the accounting bug and also removes the arguably weird state
    in the original commit 3c605096d315 where buddies could be left
    unmerged.
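
    The cross-pageblock merge is easy to see from the buddy-index
    arithmetic; a stand-alone sketch, assuming pageblock_order == 9 as in
    the example above and using the usual pfn XOR (1 << order) buddy
    formula:

        #include <stdio.h>

        int main(void)
        {
                unsigned long pageblock_order = 9;
                unsigned long order = 9;                /* merging two order-9 buddies */
                unsigned long pfn = 0;                  /* free page at pfn 0 */
                unsigned long buddy = pfn ^ (1UL << order);     /* 512 */

                printf("page in pageblock %lu, buddy in pageblock %lu\n",
                       pfn >> pageblock_order, buddy >> pageblock_order);
                return 0;
        }

    The order-9 buddy lives in the neighbouring pageblock, which may still
    be MIGRATE_ISOLATE while our own pageblock no longer is, hence the need
    to check migratetypes before letting the merge span pageblocks.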

    Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
    Link: https://lkml.org/lkml/2016/3/2/280
    Signed-off-by: Vlastimil Babka
    Reported-by: Hanjun Guo
    Tested-by: Hanjun Guo
    Acked-by: Joonsoo Kim
    Debugged-by: Laura Abbott
    Debugged-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit b6e6edcfa40561e9c8abe5eecf1c96f8e5fd9c6f upstream.

    Setting the original memory.limit_in_bytes hardlimit is subject to a
    race condition when the desired value is below the current usage. The
    code tries a few times to first reclaim and then see if the usage has
    dropped to where we would like it to be, but there is no locking, and
    the workload is free to continue making new charges up to the old limit.
    Thus, attempting to shrink a workload relies on pure luck and hope that
    the workload happens to cooperate.

    To fix this in the cgroup2 memory.max knob, do it the other way round:
    set the limit first, then try enforcement. And if reclaim is not able
    to succeed, trigger OOM kills in the group. Keep going until the new
    limit is met, we run out of OOM victims and there's only unreclaimable
    memory left, or the task writing to memory.max is killed. This allows
    users to shrink groups reliably, and the behavior is consistent with
    what happens when new charges are attempted in excess of memory.max.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 588083bb37a3cea8533c392370a554417c8f29cb upstream.

    When setting memory.high below usage, nothing happens until the next
    charge comes along, and then it will only reclaim its own charge and not
    the now potentially huge excess of the new memory.high. This can cause
    groups to stay in excess of their memory.high indefinitely.

    To fix that, when shrinking memory.high, kick off a reclaim cycle that
    goes after the delta.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     

04 Mar, 2016

4 commits

  • commit 3ed47db34f480df7caf44436e3e63e555351ae9a upstream.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 21ea9fb69e7c4b1b1559c3e410943d3ff248ffcb upstream.

    In balloon_page_dequeue, pages_lock should cover the loop
    (ie, list_for_each_entry_safe). Otherwise, the cursor page could
    be isolated by compaction, and then the list_del done by isolation
    could poison the page->lru.{prev,next}, so the loop could finally
    access a wrong address, as shown below. This patch fixes the bug.

    general protection fault: 0000 [#1] SMP
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 2 PID: 82 Comm: vballoon Not tainted 4.4.0-rc5-mm1-access_bit+ #1906
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff8800a7ff0000 ti: ffff8800a7fec000 task.ti: ffff8800a7fec000
    RIP: 0010:[] [] balloon_page_dequeue+0x54/0x130
    RSP: 0018:ffff8800a7fefdc0 EFLAGS: 00010246
    RAX: ffff88013fff9a70 RBX: ffffea000056fe00 RCX: 0000000000002b7d
    RDX: ffff88013fff9a70 RSI: ffffea000056fe00 RDI: ffff88013fff9a68
    RBP: ffff8800a7fefde8 R08: ffffea000056fda0 R09: 0000000000000000
    R10: ffff8800a7fefd90 R11: 0000000000000001 R12: dead0000000000e0
    R13: ffffea000056fe20 R14: ffff880138809070 R15: ffff880138809060
    FS: 0000000000000000(0000) GS:ffff88013fc40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f229c10e000 CR3: 00000000b8b53000 CR4: 00000000000006a0
    Stack:
    0000000000000100 ffff880138809088 ffff880138809000 ffff880138809060
    0000000000000046 ffff8800a7fefe28 ffffffff812c86d3 ffff880138809020
    ffff880138809000 fffffffffff91900 0000000000000100 ffff880138809060
    Call Trace:
    [] leak_balloon+0x93/0x1a0
    [] balloon+0x217/0x2a0
    [] ? __schedule+0x31e/0x8b0
    [] ? abort_exclusive_wait+0xb0/0xb0
    [] ? update_balloon_stats+0xf0/0xf0
    [] kthread+0xc9/0xe0
    [] ? kthread_park+0x60/0x60
    [] ret_from_fork+0x3f/0x70
    [] ? kthread_park+0x60/0x60
    Code: 8d 60 e0 0f 84 af 00 00 00 48 8b 43 20 a8 01 75 3b 48 89 d8 f0 0f ba 28 00 72 10 48 8b 03 f6 c4 08 75 2f 48 89 df e8 8c 83 f9 ff 8b 44 24 20 4d 8d 6c 24 20 48 83 e8 20 4d 39 f5 74 7a 4c 89
    RIP [] balloon_page_dequeue+0x54/0x130
    RSP
    ---[ end trace 43cf28060d708d5f ]---
    Kernel panic - not syncing: Fatal exception
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Kernel Offset: disabled

    Signed-off-by: Minchan Kim
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Rafael Aquini
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit 8479eba7781fa9ffb28268840de6facfc12c35a7 upstream.

    Commit 4167e9b2cf10 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
    flag combination due to confusing semantics. It noted that
    alloc_misplaced_dst_page() was one such user after changes made by
    commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify").

    Unfortunately when GFP_THISNODE was removed, users of
    alloc_misplaced_dst_page() started waking kswapd and entering direct
    reclaim because the wrong GFP flags are cleared. The consequence is
    that workloads that used to fit into memory now get reclaimed which is
    addressed by this patch.

    The problem can be demonstrated with "mutilate" that exercises memcached
    which is software dedicated to memory object caching. The configuration
    uses 80% of memory and is run 3 times for varying numbers of clients.
    The results on a 4-socket NUMA box are

    mutilate
                                 4.4.0                  4.4.0
                               vanilla            numaswap-v1
    Hmean    1       8394.71 (  0.00%)      8395.32 (  0.01%)
    Hmean    4      30024.62 (  0.00%)     34513.54 ( 14.95%)
    Hmean    7      32821.08 (  0.00%)     70542.96 (114.93%)
    Hmean    12     55229.67 (  0.00%)     93866.34 ( 69.96%)
    Hmean    21     39438.96 (  0.00%)     85749.21 (117.42%)
    Hmean    30     37796.10 (  0.00%)     50231.49 ( 32.90%)
    Hmean    47     18070.91 (  0.00%)     38530.13 (113.22%)
    The metric is queries/second; more is better. The results are way
    outside of the noise, and the reason for the improvement is obvious
    from some of the vmstats:

                                   4.4.0           4.4.0
                                 vanilla   numaswap-v1r1
    Minor Faults              1929399272      2146148218
    Major Faults                19746529            3567
    Swap Ins                    57307366            9913
    Swap Outs                   50623229           17094
    Allocation stalls              35909             443
    DMA allocs                         0               0
    DMA32 allocs                72976349       170567396
    Normal allocs             5306640898      5310651252
    Movable allocs                     0               0
    Direct pages scanned       404130893          799577
    Kswapd pages scanned       160230174               0
    Kswapd pages reclaimed      55928786               0
    Direct pages reclaimed       1843936           41921
    Page writes file                2391               0
    Page writes anon            50623229           17094

    The vanilla kernel is swapping like crazy with large amounts of direct
    reclaim and kswapd activity. The figures are aggregate but it's known
    that the bad activity is throughout the entire test.

    Note that simple streaming anon/file memory consumers also see this
    problem but it's not as obvious. In those cases, kswapd is awake when
    it should not be.

    As there are at least two reclaim-related bugs out there, it's worth
    spelling out the user-visible impact. This patch only addresses bugs
    related to excessive reclaim on NUMA hardware when the working set is
    larger than a NUMA node. There is a bug related to high kswapd CPU
    usage, but the reports are against laptops and other UMA hardware and
    it is not addressed by this patch.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 upstream.

    pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
    introduced to locklessly (but atomically) detect when a pmd is a regular
    (stable) pmd or when the pmd is unstable and can infinitely transition
    between pmd_none() and pmd_trans_huge() from under us, while only
    holding the mmap_sem for reading (not for writing).

    While holding the mmap_sem only for reading, MADV_DONTNEED can run from
    under us and so before we can assume the pmd to be a regular stable pmd
    we need to compare it against pmd_none() and pmd_trans_huge() in an
    atomic way, with pmd_trans_unstable(). The old pmd_trans_huge() left a
    tiny window for a race.

    Useful applications are unlikely to notice the difference as doing
    MADV_DONTNEED concurrently with a page fault would lead to undefined
    behavior.

    [akpm@linux-foundation.org: tidy up comment grammar/layout]
    Signed-off-by: Andrea Arcangeli
    Reported-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

26 Feb, 2016

7 commits

  • commit 6a6ac72fd6ea32594b316513e1826c3f6db4cc93 upstream.

    This showed up on ARC when running LMBench bw_mem tests as an
    Overlapping TLB Machine Check Exception triggered due to an STLB entry
    (2M pages) overlapping some NTLB entry (regular 8K page).

    bw_mem 2m touches a large chunk of vaddr, creating NTLB entries. In the
    interim, khugepaged kicks in, collapsing the contiguous ptes into a
    single pmd. pmdp_collapse_flush()->flush_pmd_tlb_range() is called to
    flush out NTLB entries for the ptes. On ARC this (by design) can only
    shoot down STLB entries (for the pmd). The stray NTLB entries cause the
    overlap with the subsequent STLB entry for the collapsed page. So make
    pmdp_collapse_flush() call the pte flush interface, not the pmd flush.

    Note that originally all thp flush call sites in generic code called
    flush_tlb_range() leaving it to architecture to implement the flush for
    pte and/or pmd. Commit 12ebc1581ad11454 changed this by calling a new
    opt-in API flush_pmd_tlb_range() which made the semantics more explicit
    but failed to distinguish the pte vs pmd flush in generic code, which is
    what this patch fixes.

    Note that ARC could be fixed w/o touching the generic
    pmdp_collapse_flush() by defining an ARC-specific version, but that
    defeats the purpose of the generic version; plus, semantically this is
    the right thing to do.

    Fixes STAR 9000961194: LMBench on AXS103 triggering duplicate TLB
    exceptions with super pages

    Fixes: 12ebc1581ad11454 ("mm,thp: introduce flush_pmd_tlb_range")
    Signed-off-by: Vineet Gupta
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vineet Gupta
     
  • commit 6611d8d76132f86faa501de9451a89bf23fb2371 upstream.

    A spare array holding mem cgroup threshold events is kept around to make
    sure we can always safely deregister an event and have an array to store
    the new set of events in.

    In the scenario where we're going from 1 to 0 registered events, the
    pointer to the primary array containing 1 event is copied to the spare
    slot, and then the spare slot is freed because no events are left.
    However, it is freed before calling synchronize_rcu(), which means
    readers may still be accessing threshold->primary after it is freed.

    Fixed by only freeing after synchronize_rcu().

    Signed-off-by: Martijn Coenen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Martijn Coenen
     
  • commit 48f7df329474b49d83d0dffec1b6186647f11976 upstream.

    Grazvydas Ignotas has reported a regression in remap_file_pages()
    emulation.

    Testcase:
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <sys/mman.h>

    #define SIZE (4096 * 3)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i;

            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                    p[i * 4096 / sizeof(*p)] = i;

            if (remap_file_pages(p, 4096, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            assert(p[0] == 1);

            munmap(p, SIZE);

            return 0;
    }

    The second remap_file_pages() fails with -EINVAL.

    The reason is that the remap_file_pages() emulation assumes that the
    target vma covers the whole area we want to overmap. That assumption is
    broken by the first remap_file_pages() call: it splits the area into
    two vmas.

    The solution is to check whether the next adjacent vmas map the same
    file with the same flags.

    Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Grazvydas Ignotas
    Tested-by: Grazvydas Ignotas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 12352d3cae2cebe18805a91fab34b534d7444231 upstream.

    The sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
    an anon_vma appeared between the lock and the unlock. We have to check
    anon_vma first or call anon_vma_prepare() to be sure that it's there.
    There are only a few users of these legacy helpers. Let's get rid of
    them.

    This patch fixes an anon_vma lock imbalance in validate_mm(). A write
    lock isn't required here, a read lock is enough.

    It also reorders expand_downwards/expand_upwards: security_mmap_addr()
    and the wrapping-around check don't have to be under the anon_vma lock.

    Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • commit 7162a1e87b3e380133dadc7909081bb70d0a7041 upstream.

    Tetsuo Handa reported underflow of NR_MLOCK on munlock.

    Testcase:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BASE ((void *)0x400000000000)
    #define SIZE (1UL << 21)

    int main(int argc, char *argv[])
    {
            void *addr;

            system("grep Mlocked /proc/meminfo");
            addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
                        -1, 0);
            if (addr == MAP_FAILED)
                    printf("mmap() failed\n"), exit(1);
            munmap(addr, SIZE);
            system("grep Mlocked /proc/meminfo");
            return 0;
    }

    It happens in munlock_vma_page() due to an unfortunate choice of the
    nr_pages data type:

        __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

    For an unsigned int nr_pages, implicitly cast to long in
    __mod_zone_page_state(), it becomes something around UINT_MAX.

    munlock_vma_page() is usually called for THP, as small pages go through
    the pagevec.

    Let's make nr_pages signed int.

    Similar fixes in 6cdb18ad98a4 ("mm/vmstat: fix overflow in
    mod_zone_page_state()") used `long' type, but `int' here is OK for a
    count of the number of sub-pages in a huge page.
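
    The effect of the type choice can be seen in a stand-alone program
    (512 is just an illustrative THP sub-page count):

        #include <stdio.h>

        int main(void)
        {
                unsigned int nr_pages_uint = 512;       /* e.g. THP sub-pages */
                int nr_pages_int = 512;

                /* What __mod_zone_page_state() would see as its delta: */
                printf("-(unsigned int)512 as long: %ld\n", (long)-nr_pages_uint);
                printf("-(int)512 as long:          %ld\n", (long)-nr_pages_int);
                return 0;
        }

    The first line prints 4294966784, something around UINT_MAX, instead
    of -512.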

    Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Cc: Michel Lespinasse
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit d96b339f453997f2f08c52da3f41423be48c978f upstream.

    I saw the following BUG_ON triggered in a testcase where a process calls
    madvise(MADV_SOFT_OFFLINE) on THPs, along with a background process that
    runs the migratepages command repeatedly (doing ping-pong among
    different NUMA nodes) for the first process:

    Soft offlining page 0x60000 at 0x700000600000
    __get_any_page: 0x60000 free buddy page
    page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1
    flags: 0x1fffc0000000000()
    page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/include/linux/mm.h:342!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
    CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
    RIP: 0010:[] [] put_page+0x5c/0x60
    RSP: 0018:ffff88007c213e00 EFLAGS: 00010246
    Call Trace:
    put_hwpoison_page+0x4e/0x80
    soft_offline_page+0x501/0x520
    SyS_madvise+0x6bc/0x6f0
    entry_SYSCALL_64_fastpath+0x12/0x6a
    Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
    RIP [] put_page+0x5c/0x60
    RSP

    The root cause resides in get_any_page(), which retries getting a
    refcount of the page to be soft-offlined. This function calls
    put_hwpoison_page(), expecting that the target page has been put back
    on the LRU list. But it can also have been freed to the buddy
    allocator, so the second check needs to handle that case.

    Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()")
    Signed-off-by: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream.

    By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
        should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
        directories that contain files with lax permissions, e.g. in
        this scenario:
            lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
            drwx------ root root /root
            drwxr-xr-x root root /root/foobar
            -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

18 Feb, 2016

2 commits

  • commit 564e81a57f9788b1475127012e0fd44e9049e342 upstream.

    Jan Stancek has reported that the system occasionally hangs after the
    "oom01" testcase from LTP triggers OOM. Guessing from the result that
    there is a kworker thread doing memory allocation and that the values
    between "Node 0 Normal free:" and "Node 0 Normal:" differ when hanging,
    vmstat is not up-to-date for some reason.

    According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to
    discover memory reclaim doesn't make any progress"), it was meant to
    force the kworker thread to take a short sleep, but by mistake it used
    schedule_timeout(1). We missed that schedule_timeout() in the
    TASK_RUNNING state doesn't do anything.

    Fix it by using schedule_timeout_uninterruptible(1) which forces the
    kworker thread to take a short sleep in order to make sure that vmstat
    is up-to-date.

    Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
    Signed-off-by: Tetsuo Handa
    Reported-by: Jan Stancek
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     
  • commit c102f07ca0b04f2cb49cfc161c83f6239d17f491 upstream.

    record_obj() in migrate_zspage() does not preserve the handle's
    HANDLE_PIN_BIT, set by find_alloced_obj()->trypin_tag(), and implicitly
    (accidentally) un-pins the handle, while migrate_zspage() still performs
    an explicit unpin_tag() on that handle. This additional explicit
    unpin_tag() introduces a race condition with zs_free(), which can pin
    that handle by this time, so the handle becomes un-pinned.

    Schematically, it goes like this:

    CPU0                                        CPU1
    migrate_zspage
      find_alloced_obj
        trypin_tag
          set HANDLE_PIN_BIT                    zs_free()
                                                  pin_tag()
      obj_malloc() -- new object, no tag
      record_obj() -- remove HANDLE_PIN_BIT       set HANDLE_PIN_BIT
      unpin_tag() -- remove zs_free's HANDLE_PIN_BIT

    The race condition may result in a NULL pointer dereference:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
    PC is at get_zspage_mapping+0x0/0x24
    LR is at obj_free.isra.22+0x64/0x128
    Call trace:
    get_zspage_mapping+0x0/0x24
    zs_free+0x88/0x114
    zram_free_page+0x64/0xcc
    zram_slot_free_notify+0x90/0x108
    swap_entry_free+0x278/0x294
    free_swap_and_cache+0x38/0x11c
    unmap_single_vma+0x480/0x5c8
    unmap_vmas+0x44/0x60
    exit_mmap+0x50/0x110
    mmput+0x58/0xe0
    do_exit+0x320/0x8dc
    do_group_exit+0x44/0xa8
    get_signal+0x538/0x580
    do_signal+0x98/0x4b8
    do_notify_resume+0x14/0x5c

    This patch keeps the lock bit in the migration path and updates the
    value atomically.

    Signed-off-by: Junil Lee
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junil Lee
     

09 Jan, 2016

1 commit

  • kernel test robot has reported the following crash:

    BUG: unable to handle kernel NULL pointer dereference at 00000100
    IP: [] __queue_work+0x26/0x390
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
    Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
    CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
    Workqueue: events vmstat_shepherd
    task: cb684600 ti: cb7ba000 task.ti: cb7ba000
    EIP: 0060:[] EFLAGS: 00010046 CPU: 0
    EIP is at __queue_work+0x26/0x390
    EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
    ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
    Stack:
    Call Trace:
    __queue_delayed_work+0xa1/0x160
    queue_delayed_work_on+0x36/0x60
    vmstat_shepherd+0xad/0xf0
    process_one_work+0x1aa/0x4c0
    worker_thread+0x41/0x440
    kthread+0xb0/0xd0
    ret_from_kernel_thread+0x21/0x40

    The reason is that start_shepherd_timer schedules the shepherd work item
    which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
    that workqueue, so if the further initialization takes more than HZ we
    might end up scheduling on a NULL vmstat_wq. This is really unlikely
    but not impossible.

    Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
    Reported-by: kernel test robot
    Signed-off-by: Michal Hocko
    Tested-by: Tetsuo Handa
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Dec, 2015

3 commits

  • mod_zone_page_state() takes a "delta" integer argument. delta contains
    the number of pages that should be added or subtracted from a struct
    zone's vm_stat field.

    If a zone is larger than 8TB this will cause overflows. E.g. for a
    zone with a size slightly larger than 8TB the line

    mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

    in mm/page_alloc.c:free_area_init_core() will result in a negative
    result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
    contain 0x8xxxxxxx pages which will be sign extended to a negative
    value.

    Fix this by changing the delta argument to long type.
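
    The truncation is easy to demonstrate in a stand-alone program
    (0x80000000 pages of 4 KiB each is an 8TB zone; the int conversion
    shown is what common 64-bit toolchains do):

        #include <stdio.h>

        int main(void)
        {
                /* 0x80000000 pages of 4 KiB each is an 8 TB zone. */
                unsigned long managed_pages = 0x80000000UL;

                int  delta_int  = managed_pages;        /* truncated to 32 bits */
                long delta_long = managed_pages;

                printf("as int:  %ld\n", (long)delta_int);      /* -2147483648 */
                printf("as long: %ld\n", delta_long);           /*  2147483648 */
                return 0;
        }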

    This could fix an early boot problem seen on s390, where we have a 9TB
    system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the
    rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is
    completely empty, so it tries to reclaim pages in an endless loop.

    This was seen on a heavily patched 3.10 kernel. One possible
    explanation seems to be the overflows caused by mod_zone_page_state().
    Unfortunately I did not have the chance to verify that this patch
    actually fixes the problem, since I don't have access to the system
    right now. However the overflow problem does exist anyway.

    Given the description that a system with slightly less than 8TB does
    work, this seems to be a candidate for the observed problem.

    Signed-off-by: Heiko Carstens
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • test_pages_in_a_zone() does not account for the possibility of missing
    sections in the given pfn range. pfn_valid_within always returns 1 when
    CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
    sections to pass the test, leading to a kernel oops.

    Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
    for missing sections before proceeding into the zone-check code.

    This also prevents a crash from offlining memory devices with missing
    sections. Despite this, it may be a good idea to keep the related patch
    '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
    missing sections' because missing sections in a memory block may lead to
    other problems not covered by the scope of this fix.

    Signed-off-by: Andrew Banman
    Acked-by: Alex Thorlton
    Cc: Russ Anderson
    Cc: Alex Thorlton
    Cc: Yinghai Lu
    Cc: Greg KH
    Cc: Seth Jennings
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Banman
     
  • Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
    once enough pages have been reclaimed, in which case, in contrast to a
    full round-trip over a cgroup sub-tree, the current position stored in
    mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
    and so is left holding the reference to the last scanned cgroup. If the
    target cgroup does not get scanned again (we might have just reclaimed
    the last page or all processes might exit and free their memory
    voluntary), we will leak it, because there is nobody to put the
    reference held by the iterator.

    The problem is easy to reproduce by running the following command
    sequence in a loop:

    mkdir /sys/fs/cgroup/memory/test
    echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
    memhog 150M
    echo $$ > /sys/fs/cgroup/memory/cgroup.procs
    rmdir test

    The cgroups generated by it will never get freed.

    This patch fixes this issue by making mem_cgroup_iter avoid taking
    reference to the current position. In order not to hit use-after-free
    bug while running reclaim in parallel with cgroup deletion, we make use
    of ->css_released cgroup callback to clear references to the dying
    cgroup in all reclaim iterators that might refer to it. This callback
    is called right before scheduling the rcu work which will free the css,
    so if we access iter->position from an rcu read section, we can be sure
    it won't go away under us.

    [hannes@cmpxchg.org: clean up css ref handling]
    Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

19 Dec, 2015

1 commit

  • Change the use of strncmp in zswap_pool_find_get() to strcmp.

    The use of strncmp is no longer correct, now that zswap_zpool_type is
    not an array; sizeof() will return the size of a pointer, which isn't
    the right length to compare. We don't need to use strncmp anyway,
    because the existing params and the passed in params are all guaranteed
    to be null terminated, so strcmp should be used.
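
    A stand-alone illustration of why sizeof() of a pointer is the wrong
    length (the strings are made up, not zswap's actual parameter values):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                const char *existing  = "zsmalloc";     /* 8 chars + NUL */
                const char *requested = "zsmallocXX";   /* differs after 8 chars */

                /* sizeof(existing) is the pointer size (8 on 64-bit), not the
                 * string length, so only a prefix gets compared. */
                printf("strncmp with sizeof(pointer): %d\n",
                       strncmp(existing, requested, sizeof(existing)));
                printf("strcmp:                       %d\n",
                       strcmp(existing, requested));
                return 0;
        }

    With null-terminated strings on both sides, strcmp() is both simpler
    and correct.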

    Signed-off-by: Dan Streetman
    Reported-by: Weijie Yang
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

13 Dec, 2015

7 commits

  • It's possible that an oom killed victim shares an ->mm with the init
    process and thus oom_kill_process() would end up trying to kill init as
    well.

    This has been shown in practice:

    Out of memory: Kill process 9134 (init) score 3 or sacrifice child
    Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
    Kill process 1 (init) sharing same memory
    ...
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

    And this will result in a kernel panic.

    If a process is forked by init and selected for oom kill while still
    sharing init_mm, then it's likely this system is in a recoverable state.
    However, it's better not to try to kill init and allow the machine to
    panic due to unkillable processes.

    [rientjes@google.com: rewrote changelog]
    [akpm@linux-foundation.org: fix inverted test, per Ben]
    Signed-off-by: Chen Jie
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Ben Hutchings
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Jie
     
  • Dmitry Vyukov provides a little program, autogenerated by syzkaller,
    which races a fault on a mapping of a sparse memfd object, against
    truncation of that object below the fault address: run repeatedly for a
    few minutes, it reliably generates shmem_evict_inode()'s
    WARN_ON(inode->i_blocks).

    (But there's nothing specific to memfd here, nor to the fstat which it
    happened to use to generate the fault: though that looked suspicious,
    since a shmem_recalc_inode() had been added there recently. The same
    problem can be reproduced with open+unlink in place of memfd_create, and
    with fstatfs in place of fstat.)

    v3.7 commit 0f3c42f522dc ("tmpfs: change final i_blocks BUG to WARNING")
    explains one cause of such a warning (a race with shmem_writepage to
    swap), and possible solutions; but we never took it further, and this
    syzkaller incident turns out to have a different cause.

    shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
    then found to be beyond eof, looks plausible - decrementing the alloced
    count that was just before incremented - but in fact can go wrong, if a
    racing thread (the truncator, for example) gets its shmem_recalc_inode()
    in just after our delete_from_page_cache(). delete_from_page_cache()
    decrements nrpages, so that shmem_recalc_inode() will balance the books
    by decrementing alloced itself; then our decrement of alloced takes it
    one too low, leading to the WARNING when the object is finally evicted.

    Once the new page has been exposed in the page cache,
    shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
    the accounting right in all cases (and not fall through from "trunc:" to
    "decused:"). Adjust that error recovery block; and the reinitialization
    of info and sbinfo can be removed too.

    While we're here, fix shmem_writepage() to avoid the original issue: it
    will be safe against a racing shmem_recalc_inode(), if it merely
    increments swapped before the shmem_delete_from_page_cache() which
    decrements nrpages (but it must then do its own shmem_recalc_inode()
    before that, while still in balance, instead of after). (Aside: why do
    we shmem_recalc_inode() here in the swap path? Because its raison d'etre
    is to cope with clean sparse shmem pages being reclaimed behind our
    back: so here when swapping is a good place to look for that case.) But
    I've not now managed to reproduce this bug, even without the patch.

    I don't see why I didn't do that earlier: perhaps inhibited by the
    preference to eliminate shmem_recalc_inode() altogether. Driven by this
    incident, I do now have a patch to do so at last; but still want to sit
    on it for a bit, there's a couple of questions yet to be resolved.

    Signed-off-by: Hugh Dickins
    Reported-by: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Dmitry Vyukov reported the following memory leak

    unreferenced object 0xffff88002eaafd88 (size 32):
    comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
    hex dump (first 32 bytes):
    28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmalloc include/linux/slab.h:458
    region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
    __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
    vma_needs_reservation mm/hugetlb.c:1813
    alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
    hugetlb_no_page mm/hugetlb.c:3543
    hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
    follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
    __get_user_pages+0x542/0xf30 mm/gup.c:497
    populate_vma_page_range+0xde/0x110 mm/gup.c:919
    __mm_populate+0x1c7/0x310 mm/gup.c:969
    do_mlock+0x291/0x360 mm/mlock.c:637
    SYSC_mlock2 mm/mlock.c:658
    SyS_mlock2+0x4b/0x70 mm/mlock.c:648

    Dmitry identified a potential memory leak in the routine region_chg,
    where a region descriptor is not free'ed on an error path.

    However, the root cause for the above memory leak resides in region_del.
    In this specific case, a "placeholder" entry is created in region_chg.
    The associated page allocation fails, and the placeholder entry is left
    in the reserve map. This is "by design" as the entry should be deleted
    when the map is released. The bug is in the region_del routine which is
    used to delete entries within a specific range (and when the map is
    released). region_del did not handle the case where a placeholder entry
    exactly matched the start of the range to be deleted. In this
    case, the entry would not be deleted and was leaked. The fix is to take
    these special placeholder entries into account in region_del.

    The region_chg error path leak is also fixed.

    Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries")
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
    and check whether the obtained *ptep is a migration/hwpoison entry or
    not. And if not, then we get to call huge_pte_alloc(). This is racy
    because the *ptep could turn into migration/hwpoison entry after the
    huge_pte_offset() check. This race results in BUG_ON in
    huge_pte_alloc().

    We don't have to call huge_pte_alloc() when huge_pte_offset()
    returns non-NULL, so let's fix this bug by moving the code into the
    else block.

    Note that the *ptep could turn into a migration/hwpoison entry after
    this block, but that's not a problem because we have another
    !pte_present check later (we never go into hugetlb_no_page() in that
    case.)

    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Mike Kravetz
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Whoops, I missed removing the kerneldoc comment of the lrucare arg
    removed from mem_cgroup_replace_page; but it's a good comment, keep it.

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Tetsuo Handa has reported that the system might basically livelock in
    an OOM condition without triggering the OOM killer.

    The issue is caused by internal dependency of the direct reclaim on
    vmstat counter updates (via zone_reclaimable) which are performed from
    the workqueue context. If all the current workers get assigned to an
    allocation request, though, they will be looping inside the allocator
    trying to reclaim memory but zone_reclaimable can see stalled numbers so
    it will consider a zone reclaimable even though it has been scanned way
    too much. The WQ concurrency logic will not consider this situation as
    a congested workqueue, because it relies on the fact that a worker
    would have to sleep in such a situation. This also means that it
    doesn't try to spawn new workers or invoke the rescuer thread if one is
    assigned to the queue.

    In order to fix this issue we need to do two things. First we have to
    let wq concurrency code know that we are in trouble so we have to do a
    short sleep. In order to prevent from issues handled by 0e093d99763e
    ("writeback: do not sleep on the congestion queue if there are no
    congested BDIs or if significant congestion is not being encountered in
    the current zone") we limit the sleep only to worker threads which are
    the ones of the interest anyway.

    The second thing to do is to create a dedicated workqueue for vmstat and
    mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
    have a spare worker thread for it.

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 016c13daa5c9 ("mm, page_alloc: use masks and shifts when
    converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
    MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names
    wasn't updated to reflect that.

    As a result, the file /proc/pagetypeinfo shows the counts for Movable as
    Reclaimable and vice versa.
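
    The mismatch is a classic enum-versus-name-table skew; a toy
    stand-alone model (not the kernel's actual definitions):

        #include <stdio.h>

        /* Toy model, not the kernel's definitions: the enum order changed,
         * but the name table did not follow. */
        enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE };

        static const char * const names[] = { "Unmovable", "Reclaimable", "Movable" };

        int main(void)
        {
                printf("MIGRATE_MOVABLE prints as \"%s\"\n", names[MIGRATE_MOVABLE]);
                return 0;
        }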

    Additionally, commit 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks
    for high-order atomic allocations on demand") introduced
    MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
    show_migration_types(), so it doesn't appear in the listing of free
    areas during page alloc failures or oom kills.

    This patch fixes both problems. The atomic reserves will show with a
    letter 'H' in the free areas listings.

    Fixes: 016c13daa5c9 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
    Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka