17 Apr, 2019

2 commits

  • commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.

    Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") memcg dirty and writeback counters are managed
    as:

    1) per-memcg per-cpu values in range of [-32..32]

    2) per-memcg atomic counter

    When a per-cpu counter cannot fit in [-32..32] it's flushed to the
    atomic. Stat readers only check the atomic. Thus readers such as
    balance_dirty_pages() may see a nontrivial error margin: 32 pages per
    cpu.

    Assuming 100 cpus:
    4k x86 page_size: 13 MiB error per memcg
    64k ppc page_size: 200 MiB error per memcg

    Considering that dirty+writeback are used together for some decisions,
    the errors double.

    This inaccuracy can lead to undeserved oom kills. One nasty case is
    when all per-cpu counters hold positive values offsetting an atomic
    negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
    balance_dirty_pages() only consults the atomic and does not consider
    throttling the next n_cpu*32 dirty pages. If the file_lru is in the
    13..200 MiB range then there's absolutely no dirty throttling, which
    burdens vmscan with only dirty+writeback pages and thus resorts to oom
    kill.

    It could be argued that tiny containers are not supported, but it's more
    subtle. It's the amount of space available for the file lru that matters.
    If a container has memory.max-200MiB of non-reclaimable memory, then it
    will also suffer such oom kills on a 100 cpu machine.

    The following test reliably ooms without this patch. This patch avoids
    oom kills.

    $ cat test
    mount -t cgroup2 none /dev/cgroup
    cd /dev/cgroup
    echo +io +memory > cgroup.subtree_control
    mkdir test
    cd test
    echo 10M > memory.max
    (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
    (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

    $ cat memcg-writeback-stress.c
    /*
     * Dirty pages from all but one cpu.
     * Clean pages from the non dirtying cpu.
     * This is to stress per cpu counter imbalance.
     * On a 100 cpu machine:
     * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
     * - per memcg atomic is -99*32 pages
     * - thus the complete dirty limit: sum of all counters 0
     * - balance_dirty_pages() only sees atomic count -99*32 pages, which
     *   it max()s to 0.
     * - So a workload can dirty -99*32 pages before balance_dirty_pages()
     *   cares.
     */
    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/sysinfo.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char *buf;
    static int bufSize;

    static void set_affinity(int cpu)
    {
            cpu_set_t affinity;

            CPU_ZERO(&affinity);
            CPU_SET(cpu, &affinity);
            if (sched_setaffinity(0, sizeof(affinity), &affinity))
                    err(1, "sched_setaffinity");
    }

    static void dirty_on(int output_fd, int cpu)
    {
            int i, wrote;

            set_affinity(cpu);
            for (i = 0; i < 32; i++) {
                    for (wrote = 0; wrote < bufSize; ) {
                            int ret = write(output_fd, buf+wrote, bufSize-wrote);
                            if (ret == -1)
                                    err(1, "write");
                            wrote += ret;
                    }
            }
    }

    int main(int argc, char **argv)
    {
            int cpu, flush_cpu = 1, output_fd;
            const char *output;

            if (argc != 2)
                    errx(1, "usage: output_file");

            output = argv[1];
            bufSize = getpagesize();
            buf = malloc(getpagesize());
            if (buf == NULL)
                    errx(1, "malloc failed");

            output_fd = open(output, O_CREAT|O_RDWR);
            if (output_fd == -1)
                    err(1, "open(%s)", output);

            for (cpu = 0; cpu < get_nprocs(); cpu++) {
                    if (cpu != flush_cpu)
                            dirty_on(output_fd, cpu);
            }

            set_affinity(flush_cpu);
            if (fsync(output_fd))
                    err(1, "fsync(%s)", output);
            if (close(output_fd))
                    err(1, "close(%s)", output);
            free(buf);
    }

    Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
    collect exact per memcg counters. This avoids the aforementioned oom
    kills.
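
    The exact read is essentially the atomic counter plus the not-yet-flushed
    per-cpu deltas, along the lines of the sketch below (helper name and the
    struct mem_cgroup field names are illustrative for that kernel
    generation, not a verbatim quote of the patch):

    static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
    {
            long x = atomic_long_read(&memcg->stat[idx]);
            int cpu;

            /* fold in per-cpu deltas that have not been flushed to the atomic */
            for_each_online_cpu(cpu)
                    x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
            if (x < 0)
                    x = 0;
            return x;
    }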

    This does not affect the overhead of memory.stat, which still reads the
    single atomic counter.

    Why not use percpu_counter? memcg already handles cpus going offline, so
    no need for that overhead from percpu_counter. And the percpu_counter
    spinlocks are more heavyweight than is required.

    It probably also makes sense to use exact dirty and writeback counters
    in memcg oom reports. But that is saved for later.

    Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: [4.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     
  • commit c6f3c5ee40c10bb65725047a220570f718507001 upstream.

    With some architectures like ppc64, set_pmd_at() cannot cope with a
    situation where there is already some (different) valid entry present.

    Use pmdp_set_access_flags() instead to modify the pfn which is built to
    deal with modifying existing PMD entries.

    This is similar to commit cae85cb8add3 ("mm/memory.c: fix modifying of
    page protection by insert_pfn()")

    We also do a similar update w.r.t. insert_pfn_pud even though ppc64
    doesn't support pud pfn entries now.
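
    In insert_pfn_pmd() the write path then goes through the access-flags
    helper for an already-present entry, roughly as sketched below (not the
    verbatim patch):

    if (!pmd_none(*pmd)) {
            if (write) {
                    /* modify the existing entry instead of set_pmd_at() */
                    entry = pmd_mkyoung(*pmd);
                    entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
                    if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
                            update_mmu_cache_pmd(vma, addr, pmd);
            }
            goto out_unlock;
    }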

    Without this patch we also see the below message in kernel log "BUG:
    non-zero pgtables_bytes on freeing mm:"

    Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Chandan Rajendra
    Reviewed-by: Jan Kara
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

06 Apr, 2019

10 commits

  • [ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]

    KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
    It triggers false positives in the allocation path:

    BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
    Read of size 8 at addr ffff88881f800000 by task swapper/0
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    __asan_report_load8_noabort+0x19/0x20
    memchr_inv+0x2ea/0x330
    kernel_poison_pages+0x103/0x3d5
    get_page_from_freelist+0x15e7/0x4d90

    because KASAN has not yet unpoisoned the shadow for the page being
    allocated when memchr_inv() checks it, so it only finds a stale poison
    pattern.

    Also, false positives in free path,

    BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
    Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
    CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    check_memory_region+0x22d/0x250
    memset+0x28/0x40
    kernel_poison_pages+0x29e/0x3d5
    __free_pages_ok+0x75f/0x13e0

    because KASAN adds poisoned redzones around slab objects, while page
    poisoning needs to poison the whole page.
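
    The fix is essentially to tell KASAN to stand aside while the poison
    pattern is written or checked, e.g. for the poisoning side (a sketch,
    not the verbatim patch):

    static void poison_page(struct page *page)
    {
            void *addr = kmap_atomic(page);

            /* KASAN still thinks the page is in use, so silence it here */
            kasan_disable_current();
            memset(addr, PAGE_POISON, PAGE_SIZE);
            kasan_enable_current();
            kunmap_atomic(addr);
    }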

    Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 92d1d07daad65c300c7d0b68bbef8867e9895d54 ]

    Kmemleak throws endless warnings during boot due to this code in
    __alloc_alien_cache():

    alc = kmalloc_node(memsize, gfp, node);
    init_arraycache(&alc->ac, entries, batch);
    kmemleak_no_scan(ac);

    Kmemleak does not track the array cache (alc->ac) but the alien cache
    (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
    out of init_arraycache().

    There is another place that calls init_arraycache(), but
    alloc_kmem_cache_cpus() uses the percpu allocation, which will never be
    considered a leak.
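
    In other words, __alloc_alien_cache() ends up annotating the object that
    kmemleak actually tracks, roughly (a sketch, not the verbatim patch):

    alc = kmalloc_node(memsize, gfp, node);
    if (alc) {
            /* annotate the alien cache itself, which kmemleak tracks */
            kmemleak_no_scan(alc);
            init_arraycache(&alc->ac, entries, batch);
    }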

    kmemleak: Found object by alias at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    lookup_object+0x84/0xac
    find_and_get_object+0x84/0xe4
    kmemleak_no_scan+0x74/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18
    kmemleak: Object 0xffff8007b9aa7e00 (size 256):
    kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
    kmemleak: min_count = 1
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:
    kmemleak_alloc+0x84/0xb8
    kmem_cache_alloc_node_trace+0x31c/0x3a0
    __kmalloc_node+0x58/0x78
    setup_kmem_cache_node+0x26c/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    kmemleak_no_scan+0x90/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18

    Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
    Fixes: 1fe00d50a9e8 ("slab: factor out initialization of array cache")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit afd07389d3f4933c7f7817a92fb5e053d59a3182 ]

    One of the vmalloc stress test cases triggers the kernel BUG():

    [60.562151] ------------[ cut here ]------------
    [60.562154] kernel BUG at mm/vmalloc.c:512!
    [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
    [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390

    It can happen due to a big align request resulting in overflow of the
    calculated address, i.e. it becomes 0 after ALIGN()'s fixup.

    Fix it by checking if the calculated address is within the vstart/vend
    range.
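
    The added check in alloc_vmap_area() is roughly (a sketch of the idea,
    not the verbatim patch):

    /* the ALIGN() fixup may have wrapped addr around to 0 */
    if (addr + size < addr || addr + size > vend)
            goto overflow;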

    Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Uladzislau Rezki (Sony)
     
  • [ Upstream commit 2e25644e8da4ed3a27e7b8315aaae74660be72dc ]

    Syzbot with KMSAN reports (excerpt):

    ==================================================================
    BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
    BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x173/0x1d0 lib/dump_stack.c:113
    kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
    __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
    mpol_rebind_policy mm/mempolicy.c:353 [inline]
    mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
    update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
    update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
    cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728

    ...

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
    kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
    kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
    kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
    mpol_new mm/mempolicy.c:276 [inline]
    do_mbind mm/mempolicy.c:1180 [inline]
    kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
    __do_sys_mbind mm/mempolicy.c:1354 [inline]

    As it's difficult to report where exactly the uninit value resides in
    the mempolicy object, we have to guess a bit. mm/mempolicy.c:353
    contains this part of mpol_rebind_policy():

    if (!mpol_store_user_nodemask(pol) &&
        nodes_equal(pol->w.cpuset_mems_allowed, *newmask))

    "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
    ever see being uninitialized after leaving mpol_new(). So I'll guess
    it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
    but still part of statement starting on line 353.

    For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
    reachable for a mempolicy where mpol_set_nodemask() is called in
    do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy
    with empty set of nodes, i.e. MPOL_LOCAL equivalent, with MPOL_F_LOCAL
    flag. Let's exclude such policies from the nodes_equal() check. Note
    the uninit access should be benign anyway, as rebinding this kind of
    policy is always a no-op. Therefore no actual need for stable
    inclusion.
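
    The resulting check in mpol_rebind_policy() is roughly (a sketch, not
    the verbatim patch):

    if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
        nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
            return;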

    Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
    Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Yisheng Xie
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vlastimil Babka
     
  • [ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

    If a memory cgroup contains a single process with many threads
    (including different process groups sharing the mm) then it is possible
    to trigger a race where the oom killer complains that there are no oom
    eligible tasks and complains to the log, which is both annoying and
    confusing because there is no actual problem. The race looks as
    follows:

    P1                                 oom_reaper                 P2
    try_charge                                                    try_charge
      mem_cgroup_out_of_memory
        mutex_lock(oom_lock)
          out_of_memory
            oom_kill_process(P1,P2)
             wake_oom_reaper
        mutex_unlock(oom_lock)
                                       oom_reap_task
                                                                  mutex_lock(oom_lock)
                                                                    select_bad_process # no victim

    The problem is more visible with many threads.

    Fix this by checking for fatal_signal_pending from
    mem_cgroup_out_of_memory when the oom_lock is already held.

    The oom bypass is safe because we do the same early in the try_charge
    path already. The situation might have changed in the meantime. It
    should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
    but for better code readability abstract the current charge bypass
    condition into should_force_charge and reuse it from that path.
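
    The shape of the change is roughly as follows (a sketch, not the
    verbatim patch; should_force_charge is the helper named above):

    static bool should_force_charge(void)
    {
            return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
                    (current->flags & PF_EXITING);
    }

    /* in mem_cgroup_out_of_memory(), after taking oom_lock */
    ret = should_force_charge() || out_of_memory(&oc);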

    Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     
  • [ Upstream commit d342a0b38674867ea67fde47b0e1e60ffe9f17a2 ]

    Since moving the global init process into some memory cgroup is
    technically possible, oom_kill_memcg_member() must check for it.

    Tasks in /test1 are going to be killed due to memory.oom.group set
    Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
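
    The added check in oom_kill_memcg_member() is roughly (a sketch, not the
    verbatim patch; the oom_score_adj test is pre-existing):

    if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
        !is_global_init(task)) {
            get_task_struct(task);
            __oom_kill_process(task);
    }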

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int main(int argc, char *argv[])
    {
            static char buffer[10485760];
            static int pipe_fd[2] = { EOF, EOF };
            unsigned int i;
            int fd;
            char buf[64] = { };
            if (pipe(pipe_fd))
                    return 1;
            if (chdir("/sys/fs/cgroup/"))
                    return 1;
            fd = open("cgroup.subtree_control", O_WRONLY);
            write(fd, "+memory", 7);
            close(fd);
            mkdir("test1", 0755);
            fd = open("test1/memory.oom.group", O_WRONLY);
            write(fd, "1", 1);
            close(fd);
            fd = open("test1/cgroup.procs", O_WRONLY);
            write(fd, "1", 1);
            snprintf(buf, sizeof(buf) - 1, "%d", getpid());
            write(fd, buf, strlen(buf));
            close(fd);
            snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
            fd = open("test1/memory.max", O_WRONLY);
            write(fd, buf, strlen(buf));
            close(fd);
            for (i = 0; i < 10; i++)
                    if (fork() == 0) {
                            char c;
                            close(pipe_fd[1]);
                            read(pipe_fd[0], &c, 1);
                            memset(buffer, 0, sizeof(buffer));
                            sleep(3);
                            _exit(0);
                    }
            close(pipe_fd[0]);
            close(pipe_fd[1]);
            sleep(3);
            return 0;
    }

    [ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.062954][ T9185] Call Trace:
    [ 37.063976][ T9185] dump_stack+0x67/0x95
    [ 37.065263][ T9185] dump_header+0x51/0x570
    [ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110
    [ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.069967][ T9185] oom_kill_process+0x18d/0x210
    [ 37.071515][ T9185] out_of_memory+0x11b/0x380
    [ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0
    [ 37.074601][ T9185] try_charge+0x790/0x820
    [ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0
    [ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30
    [ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0
    [ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070
    [ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0
    [ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0
    [ 37.085181][ T9185] __do_page_fault+0x255/0x4c0
    [ 37.086529][ T9185] do_page_fault+0x28/0x260
    [ 37.087788][ T9185] ? page_fault+0x8/0x30
    [ 37.088978][ T9185] page_fault+0x1e/0x30
    [ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
    [ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
    [ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
    [ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
    [ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
    [ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
    [ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
    [ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
    [ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
    [ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
    [ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
    [ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
    [ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
    [ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    [ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
    [ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
    [ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
    [ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
    [ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
    [ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.361685][ T1] Call Trace:
    [ 37.363239][ T1] dump_stack+0x67/0x95
    [ 37.365010][ T1] panic+0xfc/0x2b0
    [ 37.366853][ T1] do_exit+0xd55/0xd60
    [ 37.368595][ T1] do_group_exit+0x47/0xc0
    [ 37.370415][ T1] get_signal+0x32a/0x920
    [ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.374596][ T1] do_signal+0x32/0x6e0
    [ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b
    [ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0
    [ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b
    [ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0
    [ 37.384594][ T1] ? page_fault+0x8/0x30
    [ 37.386453][ T1] retint_user+0x8/0x18
    [ 37.388160][ T1] RIP: 0033:0x7f42c06974a8
    [ 37.389922][ T1] Code: Bad RIP value.
    [ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
    [ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
    [ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
    [ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
    [ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
    [ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0

    Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
    Fixes: 3d8b38eb81cac813 ("mm, oom: introduce memory.oom.group")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     
  • [ Upstream commit c10d38cc8d3e43f946b6c2bf4602c86791587f30 ]

    Dan Carpenter reports a potential NULL dereference in
    get_swap_page_of_type:

    Smatch complains that the NULL checks on "si" aren't consistent. This
    seems like a real bug because we have not ensured that the type is
    valid and so "si" can be NULL.

    Add the missing check for NULL, taking care to use a read barrier to
    ensure CPU1 observes CPU0's updates in the correct order:

    CPU0                                CPU1
    alloc_swap_info()                   if (type >= nr_swapfiles)
      swap_info[type] = p                       /* handle invalid entry */
      smp_wmb()                                 smp_rmb()
      ++nr_swapfiles                            p = swap_info[type]

    Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
    CPU0's write to swap_info[type] and read NULL from swap_info[type].

    Ying Huang noticed other places in swapfile.c don't order these reads
    properly. Introduce swap_type_to_swap_info to encourage correct usage.

    Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
    (see tools/memory-model/Documentation/explanation.txt).

    This ordering need not be enforced in places where swap_lock is held
    (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
    and the swap_info array.
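
    The helper is roughly as follows (a sketch, not the verbatim patch):

    static struct swap_info_struct *swap_type_to_swap_info(int type)
    {
            if (type >= READ_ONCE(nr_swapfiles))
                    return NULL;

            smp_rmb();      /* pairs with smp_wmb() in alloc_swap_info() */
            return READ_ONCE(swap_info[type]);
    }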

    Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
    Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
    Signed-off-by: Daniel Jordan
    Reported-by: Dan Carpenter
    Suggested-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Acked-by: Peter Zijlstra (Intel)
    Cc: Alan Stern
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Omar Sandoval
    Cc: Paul McKenney
    Cc: Shaohua Li
    Cc: Stephen Rothwell
    Cc: Tejun Heo
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Daniel Jordan
     
  • [ Upstream commit 0c81585499601acd1d0e1cbf424cabfaee60628c ]

    After offlining a memory block, kmemleak scan will trigger a crash, as
    it encounters a page ext address that has already been freed during
    memory offlining. The code calls kmemleak_alloc() at the beginning of
    alloc_page_ext(), but it does not call kmemleak_free() in
    free_page_ext().
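
    The missing counterpart in free_page_ext() is roughly (a sketch, not the
    verbatim patch):

    /* before handing the memory back */
    kmemleak_free(addr);            /* balance kmemleak_alloc() in alloc_page_ext() */
    free_pages_exact(addr, table_size);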

    BUG: unable to handle kernel paging request at ffff888453d00000
    PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800ffffbac2ff060
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
    Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
    RIP: 0010:scan_block+0xb5/0x290
    Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 8b 3b e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
    RSP: 0018:ffff8881ec57f8e0 EFLAGS: 00010082
    RAX: 0000000000000000 RBX: ffff888453d00000 RCX: ffffffffa61e5a54
    RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888453d00000
    RBP: ffff8881ec57f920 R08: fffffbfff4ed588d R09: fffffbfff4ed588c
    R10: fffffbfff4ed588c R11: ffffffffa76ac463 R12: dffffc0000000000
    R13: ffff888453d00ff9 R14: ffff8881f80cef48 R15: ffff8881f80cef48
    FS: 00007f6c0e3f8740(0000) GS:ffff8881f7680000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff888453d00000 CR3: 00000001c4244003 CR4: 00000000001606a0
    Call Trace:
    scan_gray_list+0x269/0x430
    kmemleak_scan+0x5a8/0x10f0
    kmemleak_write+0x541/0x6ca
    full_proxy_write+0xf8/0x190
    __vfs_write+0xeb/0x980
    vfs_write+0x15a/0x4f0
    ksys_write+0xd2/0x1b0
    __x64_sys_write+0x73/0xb0
    do_syscall_64+0xeb/0xaaa
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f6c0dad73b8
    Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
    RSP: 002b:00007ffd5b863cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f6c0dad73b8
    RDX: 0000000000000005 RSI: 000055a9216e1710 RDI: 0000000000000001
    RBP: 000055a9216e1710 R08: 000000000000000a R09: 00007ffd5b863840
    R10: 000000000000000a R11: 0000000000000246 R12: 00007f6c0dda9780
    R13: 0000000000000005 R14: 00007f6c0dda4740 R15: 0000000000000005
    Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod efivarfs
    CR2: ffff888453d00000
    ---[ end trace ccf646c7456717c5 ]---
    Kernel panic - not syncing: Fatal exception
    Shutting down cpus with NMI
    Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range:
    0xffffffff80000000-0xffffffffbfffffff)
    ---[ end Kernel panic - not syncing: Fatal exception ]---

    Link: http://lkml.kernel.org/r/20190227173147.75650-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 0d3bd18a5efd66097ef58622b898d3139790aa9d ]

    In case cma_init_reserved_mem() fails, we need to free the memblock
    allocated by memblock_reserve() or memblock_alloc_range().

    Quote Catalin's comments:
    https://lkml.org/lkml/2019/2/26/482

    Kmemleak is supposed to work with the memblock_{alloc,free} pair and it
    ignores the memblock_reserve() as a memblock_alloc() implementation
    detail. It is, however, tolerant to memblock_free() being called on
    a sub-range or just a different range from a previous memblock_alloc().
    So the original patch looks fine to me. FWIW:
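
    The error path in cma_declare_contiguous() then becomes roughly (a
    sketch of the idea, not the verbatim patch):

    ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma);
    if (ret)
            goto free_mem;
    ...
    free_mem:
            memblock_free(base, size);
    err:
            ...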

    Link: http://lkml.kernel.org/r/20190227144631.16708-1-peng.fan@nxp.com
    Signed-off-by: Peng Fan
    Reviewed-by: Catalin Marinas
    Reviewed-by: Mike Rapoport
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Marek Szyprowski
    Cc: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Peng Fan
     
  • [ Upstream commit d778015ac95bc036af73342c878ab19250e01fe1 ]

    next_present_section_nr() can only return -1 as an unsigned number, so
    just check for it specifically; compilers will convert -1 to unsigned
    where needed.

    mm/sparse.c: In function 'sparse_init_nid':
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:478:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:497:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/sparse.c: In function 'sparse_init':
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:520:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin + 1, pnum_end) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
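
    The macro then checks for the -1 sentinel explicitly instead of the
    always-true ">= 0" test, roughly (a sketch, not the verbatim patch):

    #define for_each_present_section_nr(start, section_nr)            \
            for (section_nr = next_present_section_nr(start - 1);     \
                 ((section_nr != -1) &&                               \
                  (section_nr <= __highest_present_section_nr));      \
                 section_nr = next_present_section_nr(section_nr))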

    Link: http://lkml.kernel.org/r/20190228181839.86504-1-cai@lca.pw
    Fixes: c4e1be9ec113 ("mm, sparsemem: break out of loops early")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     

03 Apr, 2019

3 commits

  • commit d2b2c6dd227ba5b8a802858748ec9a780cb75b47 upstream.

    Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL
    and SIGSEGV that could not be traced back to a userspace code bug. They
    had all the magic signs of an I/D cache coherency issue.

    Now recently we noticed that the /proc/sys/vm/compact_memory interface
    was quite efficient at provoking this class of userspace crashes.

    Studying the code in mm/migrate.c there is a distinction made between
    migrating a page that is mapped at the instant of migration and one that
    is not mapped. Our problem turned out to be the non-mapped pages.

    For the non-mapped page the code performs a copy of the page content and
    all relevant meta-data of the page without doing the required D-cache
    maintenance. This leaves dirty data in the D-cache of the CPU and on
    the 1004K cores this data is not visible to the I-cache. A subsequent
    page-fault that triggers a mapping of the page will happily serve the
    process with potentially stale code.

    What about ARM then, this bug should have seen greater exposure? Well
    ARM became immune to this flaw back in 2010, see commit c01778001a4f
    ("ARM: 6379/1: Assume new page cache pages have dirty D-cache").

    My proposed fix moves the D-cache maintenance inside move_to_new_page to
    make it common for both cases.
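
    Conceptually that amounts to flushing the destination page's D-cache in
    move_to_new_page() once the contents have been copied, roughly (a sketch
    of the idea, not the verbatim patch):

    /* in move_to_new_page(), after the copy succeeded */
    if (likely(!is_zone_device_page(newpage)))
            flush_dcache_page(newpage);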

    Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com
    Fixes: 97ee0524614 ("flush cache before installing new page at migraton")
    Signed-off-by: Lars Persson
    Reviewed-by: Paul Burton
    Acked-by: Mel Gorman
    Cc: Ralf Baechle
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Lars Persson
     
  • commit a7f40cfe3b7ada57af9b62fd28430eeb4a7cfcb7 upstream.

    When MPOL_MF_STRICT was specified and an existing page was already on a
    node that does not follow the policy, mbind() should return -EIO. But
    commit 6f4576e3687b ("mempolicy: apply page table walker on
    queue_pages_range()") broke the rule.

    And commit c8633798497c ("mm: mempolicy: mbind and migrate_pages support
    thp migration") didn't return the correct value for THP mbind() too.

    If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure it
    reaches queue_pages_to_pte_range() or queue_pages_pmd() to check if an
    existing page was already on a node that does not follow the policy.
    And, since a non-migratable vma may be in use, return -EIO too if
    MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified.
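
    In queue_pages_test_walk() the idea is to stop skipping non-migratable
    vmas when MPOL_MF_STRICT is set, roughly (a sketch of the idea, not the
    verbatim patch):

    /* MPOL_MF_STRICT still needs the per-page policy check */
    if (!vma_migratable(vma) && !(flags & MPOL_MF_STRICT))
            return 1;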

    Tested with https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbind/mbind02.c

    [akpm@linux-foundation.org: tweak code comment]
    Link: http://lkml.kernel.org/r/1553020556-38583-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()")
    Signed-off-by: Yang Shi
    Signed-off-by: Oscar Salvador
    Reported-by: Cyril Hrubis
    Suggested-by: Kirill A. Shutemov
    Acked-by: Rafael Aquini
    Reviewed-by: Oscar Salvador
    Acked-by: David Rientjes
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • commit 6d6ea1e967a246f12cfe2f5fb743b70b2e608d4a upstream.

    Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).
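
    A user of the new flag would then look roughly like this (an
    illustrative sketch; the cache name and sizes are made up):

    struct kmem_cache *l2_tables = kmem_cache_create("io-pgtable-l2",
                                    SZ_1K, SZ_1K, SLAB_CACHE_DMA32, NULL);
    ...
    void *table = kmem_cache_alloc(l2_tables, GFP_KERNEL); /* no GFP_DMA32 here */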

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Boichat
     

24 Mar, 2019

9 commits

  • commit fc8efd2ddfed3f343c11b693e87140ff358d7ff5 upstream.

    LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
    This is a stress test, where one thread mmaps/writes/munmaps memory area
    and other thread is trying to read from it:

    CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
    Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
    Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
    Call Trace:
    ([] (null))
    [] lock_acquire+0xec/0x258
    [] _raw_spin_lock_bh+0x5c/0x98
    [] page_table_free+0x48/0x1a8
    [] do_fault+0xdc/0x670
    [] __handle_mm_fault+0x416/0x5f0
    [] handle_mm_fault+0x1b0/0x320
    [] do_dat_exception+0x19c/0x2c8
    [] pgm_check_handler+0x19e/0x200

    page_table_free() is called with NULL mm parameter, but because "0" is a
    valid address on s390 (see S390_lowcore), it keeps going until it
    eventually crashes in lockdep's lock_acquire. This crash is
    reproducible at least since 4.14.

    The problem is that "vmf->vma" used in do_fault() can become stale. Because
    mmap_sem may be released, other threads can come in, call munmap() and
    cause "vma" to be returned to the kmem cache, where it gets zeroed,
    re-initialized and re-used:

    handle_mm_fault                  |
      __handle_mm_fault              |
        do_fault                     |
          vma = vmf->vma             |
          do_read_fault              |
            __do_fault               |
              vma->vm_ops->fault(vmf);
              mmap_sem is released   |
                                     |
                                     | do_munmap()
                                     |   remove_vma_list()
                                     |     remove_vma()
                                     |       vm_area_free()
                                     |         # vma is released
                                     | ...
                                     | # same vma is allocated
                                     | # from kmem cache
                                     | do_mmap()
                                     |   vm_area_alloc()
                                     |     memset(vma, 0, ...)
                                     |
          pte_free(vma->vm_mm, ...); |
            page_table_free          |
              spin_lock_bh(&mm->context.lock);
                                     |

    Cache mm_struct to avoid using potentially stale "vma".
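
    The fix is essentially to cache the mm pointer up front, roughly (a
    sketch, not the verbatim patch):

    static vm_fault_t do_fault(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            struct mm_struct *vm_mm = vma->vm_mm;   /* vma may become stale later */
            ...
                    pte_free(vm_mm, vmf->prealloc_pte);
    }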

    [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

    Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
    Signed-off-by: Jan Stancek
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Matthew Wilcox
    Acked-by: Rafael Aquini
    Reviewed-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Souptick Joarder
    Cc: Jerome Glisse
    Cc: Aneesh Kumar K.V
    Cc: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Stancek
     
  • commit 401592d2e095947344e10ec0623adbcd58934dd4 upstream.

    When VM_NO_GUARD is not set, area->size includes the adjacent guard page,
    thus for correct size checking get_vm_area_size() should be used instead
    of area->size.

    This fixes possible kernel oops when userspace tries to mmap an area on
    1 page bigger than was allocated by vmalloc_user() call: the size check
    inside remap_vmalloc_range_partial() accounts non-existing guard page
    also, so check successfully passes but vmalloc_to_page() returns NULL
    (guard page does not physically exist).

    The following code pattern example should trigger an oops:

    static int oops_mmap(struct file *file, struct vm_area_struct *vma)
    {
            void *mem;

            mem = vmalloc_user(4096);
            BUG_ON(!mem);
            /* Do not care about mem leak */

            return remap_vmalloc_range(vma, mem, 0);
    }

    And userspace simply mmaps size + PAGE_SIZE:

    mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);

    Possible candidates for oops which do not have any explicit size
    checks:

    *** drivers/media/usb/stkwebcam/stk-webcam.c:
    v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0);

    Or the following one:

    *** drivers/video/fbdev/core/fbmem.c
    static int
    fb_mmap(struct file *file, struct vm_area_struct * vma)
    ...
    res = fb->fb_mmap(info, vma);

    Where fb_mmap callback calls remap_vmalloc_range() directly without any
    explicit checks:

    *** drivers/video/fbdev/vfb.c
    static int vfb_mmap(struct fb_info *info,
                        struct vm_area_struct *vma)
    {
            return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
    }
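
    The size check in remap_vmalloc_range_partial() therefore becomes
    roughly (a sketch, not the verbatim patch):

    if (kaddr + size > area->addr + get_vm_area_size(area))
            return -EINVAL;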

    Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Acked-by: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Joe Perches
    Cc: "Luis R. Rodriguez"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Penyaev
     
  • commit 46612b751c4941c5c0472ddf04027e877ae5990f upstream.

    When soft_offline_in_use_page() runs on a thp tail page after pmd is
    split, we trigger the following VM_BUG_ON_PAGE():

    Memory failure: 0x3755ff: non anonymous thp
    __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
    Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
    page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
    flags: 0x2fffff80000000()
    raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
    ------------[ cut here ]------------
    kernel BUG at ./include/linux/mm.h:519!

    soft_offline_in_use_page() passed refcount and page lock from tail page
    to head page, which is not needed because we can pass any subpage to
    split_huge_page().

    Naoya had fixed a similar issue in c3901e722b29 ("mm: hwpoison: fix thp
    split handling in memory_failure()"). But he missed fixing soft
    offline.
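
    The gist of the fix is to lock and split the subpage we already hold a
    reference on, rather than its compound head, roughly (a sketch of the
    idea, not the verbatim patch):

    /* in soft_offline_in_use_page(): operate on page, not compound_head(page) */
    lock_page(page);
    if (!PageAnon(page) || unlikely(split_huge_page(page))) {
            unlock_page(page);
            ...
            return -EBUSY;
    }
    unlock_page(page);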

    Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
    Fixes: 61f5d698cc97 ("mm: re-enable THP")
    Signed-off-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    zhongjiang
     
  • [ Upstream commit 29b00e609960ae0fcff382f4c7079dd0874a5311 ]

    When we made the shmem_reserve_inode call in shmem_link conditional, we
    forgot to update the declaration for ret so that it always has a known
    value. Dan Carpenter pointed out this deficiency in the original patch.
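
    I.e. the declaration gains a known initial value (a sketch):

    int ret = 0;    /* shmem_reserve_inode() is no longer called unconditionally */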

    Fixes: 1062af920c07 ("tmpfs: fix link accounting when a tmpfile is linked in")
    Reported-by: Dan Carpenter
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Hugh Dickins
    Cc: Matej Kupljen
    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 1062af920c07f5b54cf5060fde3339da6df0cf6b ]

    tmpfs has a peculiarity of accounting hard links as if they were
    separate inodes: so that when the number of inodes is limited, as it is
    by default, a user cannot soak up an unlimited amount of unreclaimable
    dcache memory just by repeatedly linking a file.

    But when v3.11 added O_TMPFILE, and the ability to use linkat() on the
    fd, we missed accommodating this new case in tmpfs: "df -i" shows that
    an extra "inode" remains accounted after the file is unlinked and the fd
    closed and the actual inode evicted. If a user repeatedly links
    tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they
    are deleted.

    Just skip the extra reservation from shmem_link() in this case: there's
    a sense in which this first link of a tmpfile is then cheaper than a
    hard link of another file, but the accounting works out, and there's
    still good limiting, so no need to do anything more complicated.
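
    In shmem_link() this amounts to roughly (a sketch, not the verbatim
    patch):

    /*
     * The first link of an O_TMPFILE file must skip the extra inode
     * reservation; the tmpfile was already accounted when created.
     */
    if (inode->i_nlink) {
            ret = shmem_reserve_inode(inode->i_sb);
            if (ret)
                    goto out;
    }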

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils
    Fixes: f4e0c30c191 ("allow the temp files created by open() to be linked to")
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Hugh Dickins
    Reported-by: Matej Kupljen
    Acked-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 6ea183d60c469560e7b08a83c9804299e84ec9eb ]

    Since for_each_cpu(cpu, mask) added by commit 2d3854a37e8b767a
    ("cpumask: introduce new API, without changing anything") did not
    evaluate the mask argument if NR_CPUS == 1 due to CONFIG_SMP=n,
    lru_add_drain_all() is hitting WARN_ON() at __flush_work() added by
    commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without
    INIT_WORK().") by unconditionally calling flush_work() [1].

    Workaround this issue by using CONFIG_SMP=n specific lru_add_drain_all
    implementation. There is no real need to defer the implementation to
    the workqueue as the draining is going to happen on the local cpu. So
    alias lru_add_drain_all to lru_add_drain which does all the necessary
    work.
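
    I.e. for CONFIG_SMP=n (a sketch, not the verbatim patch):

    #ifndef CONFIG_SMP
    void lru_add_drain_all(void)
    {
            lru_add_drain();        /* everything lives on the one and only cpu */
    }
    #endif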

    [akpm@linux-foundation.org: fix various build warnings]
    [1] https://lkml.kernel.org/r/18a30387-6aa5-6123-e67c-57579ecc3f38@roeck-us.net
    Link: http://lkml.kernel.org/r/20190213124334.GH4525@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Guenter Roeck
    Debugged-by: Tetsuo Handa
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     
  • [ Upstream commit 2c2ade81741c66082f8211f0b96cf509cc4c0218 ]

    The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
    number of references that we might need to create in the fastpath later,
    the bump-allocation fastpath only has to modify the non-atomic bias value
    that tracks the number of extra references we hold instead of the atomic
    refcount. The maximum number of allocations we can serve (under the
    assumption that no allocation is made with size 0) is nc->size, so that's
    the bias used.

    However, even when all memory in the allocation has been given away, a
    reference to the page is still held; and in the `offset < 0` slowpath, the
    page may be reused if everyone else has dropped their references.
    This means that the necessary number of references is actually
    `nc->size+1`.
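
    So the refill path has to pre-charge one more reference than the number
    of frags it can hand out, roughly (a sketch, not the verbatim patch;
    PAGE_FRAG_CACHE_MAX_SIZE is the largest frag-cache page size):

    /* pre-charge one more reference than frags we can hand out */
    page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
    nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;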

    Luckily, from a quick grep, it looks like the only path that can call
    page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
    requires CAP_NET_ADMIN in the init namespace and is only intended to be
    used for kernel testing and fuzzing.

    To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
    `offset < 0` path, below the virt_to_page() call, and then repeatedly call
    writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
    with a vector consisting of 15 elements containing 1 byte each.

    Signed-off-by: Jann Horn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Jann Horn
     
  • [ Upstream commit 2f1ee0913ce58efe7f18fbd518bd54c598559b89 ]

    This reverts commit fe53ca54270a ("mm: use early_pfn_to_nid in
    page_ext_init").

    When booting a system with "page_owner=on",

    start_kernel
    page_ext_init
    invoke_init_callbacks
    init_section_page_ext
    init_page_owner
    init_early_allocated_pages
    init_zones_in_node
    init_pages_in_zone
    lookup_page_ext
    page_to_nid

    The issue here is that page_to_nid() will not work since some page flags
    have no node information until later in page_alloc_init_late() due to
    DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
    access with an invalid nid.

    UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
    index 7 is out of range for type 'zone [5]'

    Also, the kernel will panic since flags were poisoned earlier with:

    CONFIG_DEBUG_VM_PGFLAGS=y
    CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

    start_kernel
    setup_arch
    pagetable_init
    paging_init
    sparse_init
    sparse_init_nid
    memblock_alloc_try_nid_raw

    It did not handle it well in init_pages_in_zone() which ends up calling
    page_to_nid().

    page:ffffea0004200000 is uninitialized and poisoned
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    page_owner info is not active (free page?)
    kernel BUG at include/linux/mm.h:990!
    RIP: 0010:init_page_owner+0x486/0x520

    This means that assumptions behind commit fe53ca54270a ("mm: use
    early_pfn_to_nid in page_ext_init") are incomplete. Therefore, revert
    the commit for now. A proper way to move the page_owner initialization
    to sooner is to hook into memmap initialization.

    Link: http://lkml.kernel.org/r/20190115202812.75820-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Pasha Tatashin
    Cc: Mel Gorman
    Cc: Yang Shi
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 414fd080d125408cb15d04ff4907e1dd8145c8c7 ]

    For dax pmd, pmd_trans_huge() returns false but pmd_huge() returns true
    on x86. So the function works as long as hugetlb is configured.
    However, dax doesn't depend on hugetlb.

    Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Reviewed-by: Jan Kara
    Cc: Dan Williams
    Cc: Huang Ying
    Cc: Matthew Wilcox
    Cc: Keith Busch
    Cc: "Michael S . Tsirkin"
    Cc: John Hubbard
    Cc: Wei Yang
    Cc: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Yu Zhao
     

14 Mar, 2019

3 commits

  • [ Upstream commit 891cb2a72d821f930a39d5900cb7a3aa752c1d5b ]

    Rong Chen has reported the following boot crash:

    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    RIP: 0010:page_mapping+0x12/0x80
    Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48
    RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202
    RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a
    RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000
    RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13
    R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000
    R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001
    FS: 00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0
    Call Trace:
    __dump_page+0x14/0x2c0
    is_mem_section_removable+0x24c/0x2c0
    removable_show+0x87/0xa0
    dev_attr_show+0x25/0x60
    sysfs_kf_seq_show+0xba/0x110
    seq_read+0x196/0x3f0
    __vfs_read+0x34/0x180
    vfs_read+0xa0/0x150
    ksys_read+0x44/0xb0
    do_syscall_64+0x5e/0x4a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    and bisected it down to commit efad4e475c31 ("mm, memory_hotplug:
    is_mem_section_removable do not pass the end of a zone").

    The reason for the crash is that the mapping is garbage for poisoned
    (uninitialized) page. This shouldn't happen as all pages in the zone's
    boundary should be initialized.

    Later debugging revealed that the actual problem is an off-by-one when
    evaluating the end_page. 'start_pfn + nr_pages' resp 'zone_end_pfn'
    refers to a pfn after the range and as such it might belong to a
    differen memory section.

    Together with CONFIG_SPARSEMEM this makes the loop condition completely
    bogus, because pointer arithmetic doesn't work for pages from two
    different sections in that memory model.

    Fix the issue by reworking is_pageblock_removable to be pfn based and
    to only use struct page where necessary. This makes the code slightly
    easier to follow and removes the problematic pointer arithmetic
    completely.
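
    A minimal sketch of what a pfn-based walk looks like (illustrative
    only; the helper name and the PageReserved check are placeholders, not
    the actual patch):

    static bool range_looks_removable(unsigned long start_pfn,
                                      unsigned long nr_pages)
    {
        unsigned long pfn, end_pfn = start_pfn + nr_pages; /* exclusive */

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
            struct page *page;

            if (!pfn_valid(pfn))
                continue;
            /* fresh lookup per pfn; never 'page + i' across sections */
            page = pfn_to_page(pfn);
            if (PageReserved(page))
                return false;
        }
        return true;
    }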

    Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org
    Fixes: efad4e475c31 ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone")
    Signed-off-by: Michal Hocko
    Reported-by:
    Tested-by:
    Acked-by: Mike Rapoport
    Reviewed-by: Oscar Salvador
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     
  • [ Upstream commit 24feb47c5fa5b825efb0151f28906dfdad027e61 ]

    If the memory end is not aligned with the sparse memory section
    boundary, the mapping of such a section is only partly initialized.
    This may lead to a VM_BUG_ON due to accesses to uninitialized struct
    pages from the test_pages_in_a_zone() function, triggered by the
    memory_hotplug sysfs handlers.

    Here are the panic examples:
    CONFIG_DEBUG_VM_PGFLAGS=y
    kernel parameter mem=2050M
    --------------------------
    page:000003d082008000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    test_pages_in_a_zone+0xde/0x160
    show_valid_zones+0x5c/0x190
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    test_pages_in_a_zone+0xde/0x160
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    Fix this by checking whether the pfn being examined is within the zone.
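
    The guard can be sketched roughly as follows (simplified; only the
    zone_spans_pfn() test reflects the idea above, the helper around it is
    illustrative):

    static unsigned long next_pfn_in_zone(struct zone *zone,
                                          unsigned long pfn,
                                          unsigned long end_pfn)
    {
        /* Never dereference the struct page for a pfn the zone does not
         * span: it may sit in a partially initialized (poisoned) section
         * past the end of memory. */
        while (pfn < end_pfn && !zone_spans_pfn(zone, pfn))
            pfn++;
        return pfn;
    }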

    [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
    Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org

    Signed-off-by: Michal Hocko
    Signed-off-by: Mikhail Zaslonko
    Tested-by: Mikhail Gavrilov
    Reviewed-by: Oscar Salvador
    Tested-by: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Mikhail Gavrilov
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Mikhail Zaslonko
     
  • [ Upstream commit efad4e475c312456edb3c789d0996d12ed744c13 ]

    Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.

    Mikhail Zaslonko posted fixes for the two bugs quite some time ago [1].
    I pushed back on those fixes because I believed that it would be much
    better to plug the problem at initialization time rather than play
    whack-a-mole all over the hotplug code and find all the places which
    expect the full memory section to be initialized.

    We ended up with commit 2830bf6f05fb ("mm, memory_hotplug: initialize
    struct pages for the full memory section") merged, which caused a
    regression [2][3]. The reason is that there might be memory layouts
    where two NUMA nodes share the same memory section, so the merged fix
    is simply incorrect.

    In order to plug this hole we really have to be zone range aware in
    those handlers. I have split up the original patch into two. One is
    unchanged (patch 2) and I took a different approach for `removable'
    crash.

    [1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
    [3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz

    This patch (of 2):

    Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
    removable state of a memory block:

    page:000003d08300c000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    is_mem_section_removable+0xb4/0x190
    show_mem_removable+0x9a/0xd8
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    is_mem_section_removable+0xb4/0x190
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    The reason is that the memory block spans the zone boundary and we are
    stumbling over an uninitialized struct page. Fix this by enforcing the
    zone range in is_mem_section_removable so that we never run away from a
    zone.

    Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Mikhail Zaslonko
    Debugged-by: Mikhail Zaslonko
    Tested-by: Gerald Schaefer
    Tested-by: Mikhail Gavrilov
    Reviewed-by: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     

06 Mar, 2019

3 commits

  • commit cb6acd01e2e43fd8bad11155752b7699c3d0fb76 upstream.

    hugetlb pages should only be migrated if they are 'active'. The
    routines set/clear_page_huge_active() modify the active state of hugetlb
    pages.

    When a new hugetlb page is allocated at fault time, set_page_huge_active
    is called before the page is locked. Therefore, another thread could
    race and migrate the page while it is being added to page table by the
    fault code. This race is somewhat hard to trigger, but can be seen by
    strategically adding udelay to simulate worst case scheduling behavior.
    Depending on 'how' the code races, various BUG()s could be triggered.

    To address this issue, simply delay the set_page_huge_active call until
    after the page is successfully added to the page table.
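
    In sketch form (simplified; install_new_huge_page() is a hypothetical
    stand-in for the fault path that locks the page and sets up the page
    table entry):

    int err;

    err = install_new_huge_page(vma, addr, page);
    if (!err)
        /* publish only after the page is fully in the page table, so a
         * racing migration never sees a half-installed page */
        set_page_huge_active(page);
    unlock_page(page);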

    Hugetlb pages can also be leaked at migration time if the pages are
    associated with a file in an explicitly mounted hugetlbfs filesystem.
    For example, consider a two node system with 4GB worth of huge pages
    available. A program mmaps a 2G file in a hugetlbfs filesystem. It
    then migrates the pages associated with the file from one node to
    another. When the program exits, huge page counts are as follows:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    0 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    That is as expected. 2G of huge pages are taken from the free_hugepages
    counts, and 2G is the size of the file in the explicitly mounted
    filesystem. If the file is then removed, the counts become:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    1024 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    Note that the filesystem still shows 2G of pages used, while there
    actually are no huge pages in use. The only way to 'fix' the filesystem
    accounting is to unmount the filesystem.

    If a hugetlb page is associated with an explicitly mounted filesystem,
    this information is contained in the page_private field. At migration
    time, this information is not preserved. To fix, simply transfer
    page_private from the old to the new page at migration time if necessary.
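
    A minimal sketch of the transfer (simplified; not the verbatim hunk
    from the migration path):

    /* carry the hugetlbfs subpool/reservation info across migration */
    if (page_private(oldpage) && !page_private(newpage)) {
        set_page_private(newpage, page_private(oldpage));
        set_page_private(oldpage, 0);
    }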

    There is a related race with removing a huge page from a file and
    migration. When a huge page is removed from the pagecache, the
    page_mapping() field is cleared, yet page_private remains set until the
    page is actually freed by free_huge_page(). A page could be migrated
    while in this state. However, since page_mapping() is not set, the
    hugetlbfs-specific routine to transfer page_private is not called and we
    leak the page count in the filesystem.

    To fix that, check for this condition before migrating a huge page. If
    the condition is detected, return EBUSY for the page.

    Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
    Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc:
    [mike.kravetz@oracle.com: v2]
    Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
    [mike.kravetz@oracle.com: update comment and changelog]
    Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • commit 0a1d52994d440e21def1c2174932410b4f2a98a1 upstream.

    security_mmap_addr() does a capability check with current_cred(), but
    we can reach this code from contexts like a VFS write handler where
    current_cred() must not be used.

    This can be abused on systems without SMAP to make NULL pointer
    dereferences exploitable again.

    Fixes: 8869477a49c3 ("security: protect from stack expansion into low vm addresses")
    Cc: stable@kernel.org
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • [ Upstream commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 ]

    sync_inodes_sb() can race against cgwb (cgroup writeback) membership
    switches and fail to writeback some inodes. For example, if an inode
    switches to another wb while sync_inodes_sb() is in progress, the new
    wb might not be visible to bdi_split_work_to_wbs() at all or the inode
    might jump from a wb which hasn't issued writebacks yet to one which
    already has.

    This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
    switch path against sync_inodes_sb() so that sync_inodes_sb() is
    guaranteed to see all the target wbs and inodes can't jump wbs to
    escape syncing.
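
    Conceptually (a sketch, with the locking granularity simplified): the
    membership-switch path takes the new rwsem for reading while
    sync_inodes_sb() takes it for writing, so a sync observes a stable wb
    membership:

    /* cgwb switch path (sketch) */
    down_read(&bdi->wb_switch_rwsem);
    /* ... move the inode to its new wb ... */
    up_read(&bdi->wb_switch_rwsem);

    /* sync_inodes_sb() (sketch) */
    down_write(&bdi->wb_switch_rwsem);
    /* ... bdi_split_work_to_wbs() and wait for the issued writeback ... */
    up_write(&bdi->wb_switch_rwsem);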

    v2: Fixed misplaced rwsem init. Spotted by Jiufei.

    Signed-off-by: Tejun Heo
    Reported-by: Jiufei Xue
    Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tejun Heo
     

27 Feb, 2019

1 commit

  • commit 050c17f239fd53adb55aa768d4f41bc76c0fe045 upstream.

    The system call, get_mempolicy() [1], passes an unsigned long *nodemask
    pointer and an unsigned long maxnode argument which specifies the length
    of the user's nodemask array in bits (which is rounded up). The manual
    page says that if the maxnode value is too small, get_mempolicy will
    return EINVAL but there is no system call to return this minimum value.
    To determine this value, some programs search /proc/<pid>/status for a
    line starting with "Mems_allowed:" and use the number of digits in the
    mask to determine the minimum value. A recent change to the way this line
    is formatted [2] causes these programs to compute a value less than
    MAX_NUMNODES so get_mempolicy() returns EINVAL.

    Change get_mempolicy(), the older compat version of get_mempolicy(), and
    the copy_nodes_to_user() function to use nr_node_ids instead of
    MAX_NUMNODES, thus preserving the defacto method of computing the minimum
    size for the nodemask array and the maxnode argument.

    [1] http://man7.org/linux/man-pages/man2/get_mempolicy.2.html
    [2] https://lore.kernel.org/lkml/1545405631-6808-1-git-send-email-longman@redhat.com
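
    For reference, a minimal userspace illustration of the interface in
    question (this example is mine, not from the patch; it needs a libnuma
    recent enough to provide MPOL_F_MEMS_ALLOWED in <numaif.h> and linking
    with -lnuma):

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long maxnode = 1024;   /* guessed upper bound, in bits */
        unsigned long mask[1024 / (8 * sizeof(unsigned long))] = { 0 };
        int mode = 0;

        /* With MPOL_F_MEMS_ALLOWED the kernel fills 'mask' with the set
         * of allowed nodes; if maxnode is smaller than the kernel's node
         * count, the call fails with EINVAL. */
        if (get_mempolicy(&mode, mask, maxnode, NULL, MPOL_F_MEMS_ALLOWED)) {
            perror("get_mempolicy");
            return 1;
        }
        printf("first word of allowed node mask: 0x%lx\n", mask[0]);
        return 0;
    }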

    Link: http://lkml.kernel.org/r/20190211180245.22295-1-rcampbell@nvidia.com
    Fixes: 4fb8e5b89bcbbbb ("include/linux/nodemask.h: use nr_node_ids (not MAX_NUMNODES) in __nodemask_pr_numnodes()")
    Signed-off-by: Ralph Campbell
    Suggested-by: Alexander Duyck
    Cc: Waiman Long
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ralph Campbell
     

20 Feb, 2019

1 commit

  • commit a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96 upstream.

    This reverts commit 172b06c32b9497 ("mm: slowly shrink slabs with a
    relatively small number of objects").

    This change alters the aggressiveness of shrinker reclaim, causing
    small cache and low priority reclaim to greatly increase scanning
    pressure on small caches. As a result, light memory pressure has a
    disproportionate effect on small caches, and causes large caches to be
    reclaimed much faster than previously.

    As a result, it greatly perturbs the delicate balance of the VFS caches
    (dentry/inode vs file page cache) such that the inode/dentry caches are
    reclaimed much, much faster than the page cache and this drives us into
    several other caching imbalance related problems.

    As such, this is a bad change and needs to be reverted.

    [ Needs some massaging to retain the later seekless shrinker
    modifications.]

    Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
    Fixes: 172b06c32b9497 ("mm: slowly shrink slabs with a relatively small number of objects")
    Signed-off-by: Dave Chinner
    Cc: Wolfgang Walter
    Cc: Roman Gushchin
    Cc: Spock
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     

13 Feb, 2019

2 commits

  • [ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ]

    When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
    pages initialization can take a long time. Below were the reported init
    times on an 8-socket 96-core 4TB IvyBridge system.

    1) Non-debug kernel without CONFIG_KASAN
    [ 8.764222] node 1 initialised, 132086516 pages in 7027ms

    2) Debug kernel with CONFIG_KASAN
    [ 146.288115] node 1 initialised, 132075466 pages in 143052ms

    So the page init time in a debug kernel was about 20X that of the
    non-debug kernel. The long init time can be problematic as the page
    initialization is done with interrupts disabled. In this particular
    case, it caused the following warning messages as well as NMI
    backtraces of all the cores that were doing the initialization.

    [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
    [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
    [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
    [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
    [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
    [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
    [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
    [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)

    The long init time was mainly caused by the call to kasan_free_pages() to
    poison the newly initialized pages. On a 4TB system, we are talking about
    almost 500GB of memory probably on the same node.

    In reality, we may not need to poison the newly initialized pages before
    they are ever allocated. So KASAN poisoning of freed pages before the
    completion of deferred memory initialization is now disabled. Those pages
    will be properly poisoned when they are allocated or freed after deferred
    pages initialization is done.

    With this change, the new page initialization time became:

    [ 21.948010] node 1 initialised, 132075466 pages in 18702ms

    This was still about double the non-debug kernel time, but was much
    better than before.
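
    In sketch form, the gating looks roughly like this (simplified; the
    helper follows the description above and 'deferred_pages' is assumed to
    be the existing deferred-init static key in page_alloc):

    /* Skip KASAN poisoning while deferred struct page init is still
     * running; such pages are poisoned normally once they are allocated
     * or freed later. */
    static void kasan_free_nondeferred_pages(struct page *page, int order)
    {
        if (!static_branch_unlikely(&deferred_pages))
            kasan_free_pages(page, order);
    }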

    Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Pasha Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     
  • [ Upstream commit 6ab7d47bcbf0144a8cb81536c2cead4cde18acfe ]

    From Michael Cree:
    "Bisection led to commit b38d08f3181c ("percpu: restructure
    locking") as being the cause of lockups at initial boot on
    the kernel built for generic Alpha.

    On a suggestion by Tejun Heo that:

    So, the only thing I can think of is that it's calling
    spin_unlock_irq() while irq handling isn't set up yet.
    Can you please try the followings?

    1. Convert all spin_[un]lock_irq() to
    spin_lock_irqsave/unlock_irqrestore()."
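
    Concretely, the conversion looks like this (a sketch against
    mm/percpu.c's pcpu_lock):

    unsigned long flags;

    /* safe even before IRQ handling is fully initialized at early boot */
    spin_lock_irqsave(&pcpu_lock, flags);
    /* ... manipulate chunk lists and stats ... */
    spin_unlock_irqrestore(&pcpu_lock, flags);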

    Fixes: b38d08f3181c ("percpu: restructure locking")
    Reported-and-tested-by: Michael Cree
    Acked-by: Tejun Heo
    Signed-off-by: Dennis Zhou
    Signed-off-by: Sasha Levin

    Dennis Zhou
     

07 Feb, 2019

6 commits

  • commit e0a352fabce61f730341d119fbedf71ffdb8663f upstream.

    We had a race in the old balloon compaction code before b1123ea6d3b3
    ("mm: balloon: use general non-lru movable page feature") refactored
    it, which became visible after backporting 195a8c43e93d
    ("virtio-balloon: deflate via a page list") without the refactoring.

    The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction:
    redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon:
    use general non-lru movable page feature"). d6d86c0a7f8d
    ("mm/balloon_compaction: redesign ballooned pages management") was
    backported to 3.12, so the broken kernels are stable kernels [3.12 -
    4.7].

    There was a subtle race between dropping the page lock of the newpage in
    __unmap_and_move() and checking for __is_movable_balloon_page(newpage).

    Just after dropping this page lock, virtio-balloon could go ahead and
    deflate the newpage, effectively dequeueing it and clearing PageBalloon,
    in turn making __is_movable_balloon_page(newpage) fail.

    This resulted in dropping the reference of the newpage via
    putback_lru_page(newpage) instead of put_page(newpage), leading to
    page->lru getting modified and a !LRU page ending up in the LRU lists.
    With 195a8c43e93d ("virtio-balloon: deflate via a page list")
    backported, one would suddenly get corrupted lists in
    release_pages_balloon():

    - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
    - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100

    Nowadays this race is no longer possible, but it is hidden behind very
    ugly handling of __ClearPageMovable() and __PageMovable().

    __ClearPageMovable() will not make __PageMovable() fail, only
    PageMovable(). So the new check (__PageMovable(newpage)) will still
    hold even after newpage was dequeued by virtio-balloon.

    If anybody would ever change that special handling, the BUG would be
    introduced again. So instead, make it explicit and use the information
    of the original isolated page before migration.
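
    Roughly (a sketch of the idea, not the verbatim hunk; 'page' is the
    original isolated page and 'newpage' its migration target):

    /* decide LRU-ness from the original isolated page, before migration
     * can race with e.g. virtio-balloon deflating the newpage */
    bool is_lru = !__PageMovable(page);

    /* ... migration ... */

    if (unlikely(!is_lru))
        put_page(newpage);
    else
        putback_lru_page(newpage);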

    This patch can be backported fairly easily to stable kernels (in
    contrast to the refactoring).

    Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
    Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management")
    Signed-off-by: David Hildenbrand
    Reported-by: Vratislav Bendel
    Acked-by: Michal Hocko
    Acked-by: Rafael Aquini
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Dominik Brodowski
    Cc: Matthew Wilcox
    Cc: Vratislav Bendel
    Cc: Rafael Aquini
    Cc: Konstantin Khlebnikov
    Cc: Minchan Kim
    Cc: [3.12 - 4.7]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b upstream.

    Currently memory_failure() is racy against a process exiting, which can
    result in a kernel crash by NULL pointer dereference.

    The root cause is that memory_failure() uses force_sig() to forcibly
    kill asynchronous (meaning not in the current context) processes. As
    discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
    fixes, this is not the right thing to do. OOM solved this issue by using
    do_send_sig_info() as done in commit d2d393099de2 ("signal:
    oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
    patch suggests doing the same for hwpoison. do_send_sig_info()
    properly accesses the siglock with lock_task_sighand(), so it is free
    from the reported race.
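
    Sketched (the exact argument list differs between kernel versions, so
    take this as illustrative; 't' is the target task in a
    kill_proc()-style helper):

    /* asynchronous (action optional) case: deliver SIGBUS via
     * do_send_sig_info(), which takes the target's siglock through
     * lock_task_sighand(), instead of force_sig() */
    ret = do_send_sig_info(SIGBUS, SEND_SIG_PRIV, t, PIDTYPE_TGID);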

    I confirmed that the reported bug reproduces by inserting some delay
    in kill_procs(), and it never reproduces with this patch.

    Note that memory_failure() can send another type of signal using
    force_sig_mceerr(), and the reported race shouldn't happen with it
    because force_sig_mceerr() is called only for synchronous processes
    (i.e. BUS_MCEERR_AR happens only when some process accesses the
    corrupted memory.)

    Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Jane Chu
    Reviewed-by: Dan Williams
    Reviewed-by: William Kucharski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit cefc7ef3c87d02fc9307835868ff721ea12cc597 upstream.

    A syzbot instance running on an upstream kernel found a use-after-free
    bug in oom_kill_process. On further inspection it seems like the
    process selected to be oom-killed has exited even before reaching
    read_lock(&tasklist_lock) in oom_kill_process(). More specifically,
    tsk->usage is 1, which is due to the get_task_struct() in
    oom_evaluate_task(), so the put_task_struct() within for_each_thread()
    frees the tsk while for_each_thread() is still accessing it. The
    easiest fix is to do a get/put across the for_each_thread() on the
    selected task.
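
    As a sketch (simplified; 'victim' stands for the task selected by
    select_bad_process()):

    struct task_struct *t;

    get_task_struct(victim);            /* pin across the child scan */
    read_lock(&tasklist_lock);
    for_each_thread(victim, t) {
        /* consider handing the kill over to a suitable child */
    }
    read_unlock(&tasklist_lock);
    put_task_struct(victim);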

    Now the next question is: should we continue with the oom-kill if the
    previously selected task has exited? Before adding more complexity and
    heuristics, though, let's answer why we even look at the children of
    the oom-kill selected task. select_bad_process() has already selected
    the worst process in the system/memcg. Due to the race, the selected
    process might not be the worst at kill time, but does that matter?
    Userspace can use the oom_score_adj interface to prefer children to be
    killed before the parent. I looked at the history, but it seems this
    behavior predates git history.

    Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
    Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
    Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     
  • commit eeb0efd071d821a88da3fbd35f2d478f40d3b2ea upstream.

    This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm,
    page_alloc: fix has_unmovable_pages for HugePages").

    Gigantic hugepages cross several memblocks, so it can be that the page
    we get in scan_movable_pages() is a page-tail belonging to a
    1G-hugepage. If that happens, page_hstate()->size_to_hstate() will
    return NULL, and we will blow up in hugepage_migration_supported().

    The splat is as follows:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 1 PID: 1350 Comm: bash Tainted: G E 5.0.0-rc1-mm1-1-default+ #27
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:__offline_pages+0x6ae/0x900
    Call Trace:
    memory_subsys_offline+0x42/0x60
    device_offline+0x80/0xa0
    state_store+0xab/0xc0
    kernfs_fop_write+0x102/0x180
    __vfs_write+0x26/0x190
    vfs_write+0xad/0x1b0
    ksys_write+0x42/0x90
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)
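
    One way to avoid this, sketched below (simplified, not the exact hunk),
    is to always consult the head page when a huge page is encountered and
    to skip the remaining tail pfns in one step:

    if (PageHuge(page)) {
        struct page *head = compound_head(page);
        unsigned long skip;

        /* page_hstate() on the head never returns NULL, even for a tail
         * of a 1G hugepage */
        if (hugepage_migration_supported(page_hstate(head)) &&
            page_huge_active(head))
            return pfn;

        /* skip the remaining pfns of this (possibly gigantic) page */
        skip = (1 << compound_order(head)) - (page - head);
        pfn += skip - 1;
    }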

    [akpm@linux-foundation.org: fix brace layout, per David. Reduce indentation]
    Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Anthony Yznaga
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oscar Salvador
     
  • commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.

    Arkadiusz reported that enabling memcg's group oom killing causes
    strange memcg statistics where there is no task in a memcg even though
    the reported number of tasks in that memcg is not 0. It turned out that
    there is a bug in wake_oom_reaper() which allows the same task to be
    enqueued twice, which makes it impossible to decrease the number of
    tasks in that memcg due to a refcount leak.

    This bug has existed since the OOM reaper became invokable from the
    task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

    T1@P1 |T2@P1 |T3@P1 |OOM reaper
    ----------+----------+----------+------------
    # Processing an OOM victim in a different memcg domain.
    try_charge()
    mem_cgroup_out_of_memory()
    mutex_lock(&oom_lock)
    try_charge()
    mem_cgroup_out_of_memory()
    mutex_lock(&oom_lock)
    try_charge()
    mem_cgroup_out_of_memory()
    mutex_lock(&oom_lock)
    out_of_memory()
    oom_kill_process(P1)
    do_send_sig_info(SIGKILL, @P1)
    mark_oom_victim(T1@P1)
    wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
    mutex_unlock(&oom_lock)
    out_of_memory()
    mark_oom_victim(T2@P1)
    wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
    mutex_unlock(&oom_lock)
    out_of_memory()
    mark_oom_victim(T1@P1)
    wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
    mutex_unlock(&oom_lock)
    # Completed processing an OOM victim in a different memcg domain.
    spin_lock(&oom_reaper_lock)
    # T1P1 is dequeued.
    spin_unlock(&oom_reaper_lock)

    but memcg's group oom killing made it easier to trigger this bug by
    calling wake_oom_reaper() on the same task from one out_of_memory()
    request.

    Fix this bug using an approach used by commit 855b018325737f76 ("oom,
    oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a
    side effect, this patch also avoids enqueuing multiple threads sharing
    memory via the task_will_free_mem(current) path.
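
    The shape of the fix is roughly the following (a sketch; the flag name
    is an assumption on my part and the queueing details are simplified):

    static void wake_oom_reaper(struct task_struct *tsk)
    {
        /* already queued? make a second enqueue of the same victim a no-op */
        if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
            return;

        get_task_struct(tsk);

        spin_lock(&oom_reaper_lock);
        tsk->oom_reaper_list = oom_reaper_list;
        oom_reaper_list = tsk;
        spin_unlock(&oom_reaper_lock);
        wake_up(&oom_reaper_wait);
    }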

    Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
    Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
    Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
    Signed-off-by: Tetsuo Handa
    Reported-by: Arkadiusz Miskiewicz
    Tested-by: Arkadiusz Miskiewicz
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Tejun Heo
    Cc: Aleksa Sarai
    Cc: Jay Kamat
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     
  • commit 1ac25013fb9e4ed595cd608a406191e93520881e upstream.

    hugetlb needs the same fix as faultin_nopage (which was applied in
    commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle
    FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already
    released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't
    in the FOLL_NOWAIT case.
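
    In sketch form (flag plumbing only, simplified from what the gup fault
    path does):

    if (flags & FOLL_NOWAIT)
        /* ask the fault handler to bail out instead of sleeping, and do
         * not drop mmap_sem on VM_FAULT_RETRY */
        fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;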

    Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com
    Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
    Signed-off-by: Andrea Arcangeli
    Tested-by: "Dr. David Alan Gilbert"
    Reported-by: "Dr. David Alan Gilbert"
    Reviewed-by: Mike Kravetz
    Reviewed-by: Peter Xu
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli