17 Dec, 2018

1 commit

  • [ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]

    init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
    'zone_idx(zone) + 1' unconditionally. This is correct in the normal
    case, but not in the hot-plug situation.

    This function is used in two places:

    * free_area_init_core()
    * move_pfn_range_to_zone()

    In the first case, we are sure the zone index increases monotonically,
    while in the second one it is under the user's control.

    One way to reproduce this is:
    ----------------------------

    1. create a virtual machine with empty node1

    -m 4G,slots=32,maxmem=32G \
    -smp 4,maxcpus=8 \
    -numa node,nodeid=0,mem=4G,cpus=0-3 \
    -numa node,nodeid=1,mem=0G,cpus=4-7

    2. hot-add cpu 3-7

    cpu-add [3-7]

    3. hot-add memory to node1

    object_add memory-backend-ram,id=ram0,size=1G
    device_add pc-dimm,id=dimm0,memdev=ram0,node=1

    4. online the memory in the following order

    echo online_movable > memory47/state
    echo online > memory40/state

    After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
    instead of (ZONE_MOVABLE + 1).
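
    To make the consequence concrete, here is a minimal userspace sketch
    (not the kernel code; the enum values and the simplified init_zone_*()
    helpers are stand-ins) of why an unconditional 'nr_zones = zone_idx + 1'
    goes wrong when zones are onlined out of order, and how a grow-only
    update avoids it:

    #include <stdio.h>

    enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

    static int nr_zones; /* stands in for pgdat->nr_zones */

    /* old behaviour: unconditional assignment */
    static void init_zone_buggy(enum zone_type idx) { nr_zones = idx + 1; }

    /* patched behaviour for the hot-plug path: only ever grow nr_zones */
    static void init_zone_fixed(enum zone_type idx)
    {
            if (idx + 1 > nr_zones)
                    nr_zones = idx + 1;
    }

    int main(void)
    {
            /* onlining order from the reproducer: MOVABLE first, then NORMAL */
            nr_zones = 0;
            init_zone_buggy(ZONE_MOVABLE);
            init_zone_buggy(ZONE_NORMAL);
            printf("old:   nr_zones = %d (ZONE_NORMAL + 1)\n", nr_zones);

            nr_zones = 0;
            init_zone_fixed(ZONE_MOVABLE);
            init_zone_fixed(ZONE_NORMAL);
            printf("fixed: nr_zones = %d (ZONE_MOVABLE + 1)\n", nr_zones);
            return 0;
    }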

    Michal said:
    "Having an incorrect nr_zones might result in all sorts of problems
    which would be quite hard to debug (e.g. reclaim not considering the
    movable zone). I do not expect many users would suffer from this,
    but still this is trivial and obviously the right thing to do, so
    backporting to the stable tree shouldn't be harmful (last famous
    words)"

    Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Sasha Levin

    Wei Yang
     

13 Dec, 2018

1 commit

  • [ Upstream commit 400e22499dd92613821374c8c6c88c7225359980 ]

    Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
    long") was a great step towards reducing the possibility of silent
    hang-up problems caused by memory allocation stalls. But this commit
    reverts it, because it is possible to trigger OOM lockups and/or soft
    lockups when many threads concurrently call warn_alloc() (in order to
    warn about memory allocation stalls) due to the current implementation
    of printk(), and it is difficult to obtain useful information due to
    the limitations of the synchronous warning approach.

    Current printk() implementation flushes all pending logs using the
    context of a thread which called console_unlock(). printk() should be
    able to flush all pending logs eventually unless somebody continues
    appending to printk() buffer.

    Since warn_alloc() started appending to the printk() buffer while waiting
    for oom_kill_process() to make forward progress when oom_kill_process()
    is processing pending logs, it became possible for warn_alloc() to force
    oom_kill_process() to loop inside printk(). As a result, warn_alloc()
    significantly increased the possibility of preventing oom_kill_process()
    from making forward progress.

    ---------- Pseudo code start ----------
    Before warn_alloc() was introduced:

    retry:
      if (mutex_trylock(&oom_lock)) {
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_lock)
      }
      goto retry;

    After warn_alloc() was introduced:

    retry:
      if (mutex_trylock(&oom_lock)) {
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_lock)
      } else if (waited_for_10seconds()) {
        atomic_inc(&printk_pending_logs);
      }
      goto retry;
    ---------- Pseudo code end ----------

    Although waited_for_10seconds() becomes true once per 10 seconds, an
    unbounded number of threads can call waited_for_10seconds() at the same
    time. Also, since threads doing waited_for_10seconds() keep spinning in
    an almost-busy loop, the thread doing print_one_log() gets little CPU
    time. Therefore, this situation can be simplified to

    ---------- Pseudo code start ----------
    retry:
      if (mutex_trylock(&oom_lock)) {
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_lock)
      } else {
        atomic_inc(&printk_pending_logs);
      }
      goto retry;
    ---------- Pseudo code end ----------

    when printk() is called faster than print_one_log() can process a log.

    One possible mitigation would be to introduce a new lock in order to
    make sure that no other series of printk() (either oom_kill_process() or
    warn_alloc()) can append to the printk() buffer when one series of
    printk() (either oom_kill_process() or warn_alloc()) is already in
    progress.

    Such serialization would also help with obtaining kernel messages in a
    readable form.

    ---------- Pseudo code start ----------
    retry:
      if (mutex_trylock(&oom_lock)) {
        mutex_lock(&oom_printk_lock);
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_printk_lock);
        mutex_unlock(&oom_lock)
      } else {
        if (mutex_trylock(&oom_printk_lock)) {
          atomic_inc(&printk_pending_logs);
          mutex_unlock(&oom_printk_lock);
        }
      }
      goto retry;
    ---------- Pseudo code end ----------

    But this commit does not go in that direction, because we don't want to
    introduce a new lock dependency, and we are unlikely to be able to obtain
    useful information even if we serialized oom_kill_process() and
    warn_alloc().

    The synchronous approach is prone to unexpected results (e.g. too late
    [1], too frequent [2], overlooked [3]). As far as I know, warn_alloc()
    never helped with providing information other than "something is going
    wrong". I want to consider an asynchronous approach which can obtain
    information during stalls with possibly relevant threads (e.g. the owner
    of oom_lock and kswapd-like threads) and serve as a trigger for actions
    (e.g. turn on/off tracepoints, ask the libvirt daemon to take a memory
    dump of a stalling KVM guest for diagnostic purposes).

    This commit temporarily loses the ability to report e.g. an OOM lockup
    caused by being unable to invoke the OOM killer due to a !__GFP_FS
    allocation request. But an asynchronous approach will be able to detect
    such a situation and emit a warning. Thus, let's remove warn_alloc().

    [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
    [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
    [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))

    Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reported-by: Cong Wang
    Reported-by: yuwang.yuwang
    Reported-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Sergey Senozhatsky
    Cc: Petr Mladek
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Signed-off-by: Sasha Levin

    Tetsuo Handa
     

01 Dec, 2018

1 commit

  • [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]

    Konstantin has noticed that kvmalloc might trigger the following
    warning:

    WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
    [...]
    Call Trace:
    fragmentation_index+0x76/0x90
    compaction_suitable+0x4f/0xf0
    shrink_node+0x295/0x310
    node_reclaim+0x205/0x250
    get_page_from_freelist+0x649/0xad0
    __alloc_pages_nodemask+0x12a/0x2a0
    kmalloc_large_node+0x47/0x90
    __kmalloc_node+0x22b/0x2e0
    kvmalloc_node+0x3e/0x70
    xt_alloc_table_info+0x3a/0x80 [x_tables]
    do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
    nf_setsockopt+0x44/0x60
    SyS_setsockopt+0x6f/0xc0
    do_syscall_64+0x67/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The problem is that we only check for an out-of-bound order in the slow
    path and node reclaim might already happen from the fast path. This
    is fixable by making sure that kvmalloc never uses kmalloc for
    requests that are larger than KMALLOC_MAX_SIZE, but this also shows that
    the code is rather fragile. A recent UBSAN report further underlines
    this:

    UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
    shift exponent 51 is too large for 32-bit type 'int'
    CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xd2/0x148 lib/dump_stack.c:113
    ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
    __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
    __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
    zone_watermark_fast mm/page_alloc.c:3216 [inline]
    get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
    __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
    alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
    alloc_pages include/linux/gfp.h:509 [inline]
    __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
    dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
    raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
    raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
    fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
    fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
    __blkdev_driver_ioctl block/ioctl.c:303 [inline]
    blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
    block_ioctl+0x105/0x150 fs/block_dev.c:1883
    vfs_ioctl fs/ioctl.c:46 [inline]
    do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
    ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
    __do_sys_ioctl fs/ioctl.c:709 [inline]
    __se_sys_ioctl fs/ioctl.c:707 [inline]
    __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
    do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Note that this is not a kvmalloc path. It is just that the fast path
    really depends on having a sanitized order as well. Therefore move the
    order check to the fast path.
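
    As a rough illustration (a standalone model, not the kernel code;
    MAX_ORDER and the helper names below are assumptions), the point is that
    the shift by 'order' in the watermark math must only ever see a sanitized
    value, so the bounds check has to sit in front of the fast path:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_ORDER 11 /* illustrative; matches the common kernel default */

    /* toy stand-in for the watermark check: it shifts by 'order', so an
     * unchecked huge order (e.g. 51) would be undefined behaviour */
    static bool watermark_ok(long free_pages, long mark, unsigned int order)
    {
            free_pages -= (1L << order) - 1;
            return free_pages > mark;
    }

    /* fast path entry: reject insane orders before any watermark math runs */
    static bool alloc_fastpath(unsigned int order)
    {
            if (order >= MAX_ORDER)
                    return false;
            return watermark_ok(1024, 128, order);
    }

    int main(void)
    {
            printf("order 3:  %d\n", alloc_fastpath(3));
            printf("order 51: %d (rejected before the shift)\n", alloc_fastpath(51));
            return 0;
    }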

    Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Konstantin Khlebnikov
    Reported-by: Kyungtae Kim
    Acked-by: Vlastimil Babka
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Byoungyoung Lee
    Cc: "Dae R. Jeong"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     

18 Oct, 2018

1 commit

  • commit 034ebf65c3c21d85b963d39f992258a64a85e3a9 upstream.

    Adjust /proc/meminfo MemAvailable calculation by adding the amount of
    indirectly reclaimable memory (rounded to the PAGE_SIZE).

    Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

26 Jun, 2018

1 commit

  • commit 7810e6781e0fcbca78b91cf65053f895bf59e85f upstream.

    In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
    allocations that can ignore memory policies. The zonelist is obtained
    from current CPU's node. This is a problem for __GFP_THISNODE
    allocations that want to allocate on a different node, e.g. because the
    allocating thread has been migrated to a different CPU.

    This has been observed to break SLAB in our 4.4-based kernel, because
    there it relies on __GFP_THISNODE working as intended. If a slab page
    is put on the wrong node's list, then further list manipulations may
    corrupt the list because page_to_nid() is used to determine which node's
    list_lock should be locked, and thus we may take the wrong lock and race.

    Current SLAB implementation seems to be immune by luck thanks to commit
    511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
    arbitrary node") but there may be others assuming that __GFP_THISNODE
    works as promised.

    We can fix it by simply removing the zonelist reset completely. There
    is actually no reason to reset it, because memory policies and cpusets
    don't affect the zonelist choice in the first place. This was different
    when commit 183f6371aac2 ("mm: ignore mempolicies when using
    ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
    own restricted zonelists.
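
    A tiny model of the problem (standalone; the struct and helper names are
    illustrative, not the kernel API): with the reset in place, a
    __GFP_THISNODE request that asked for node 1 silently ends up on the
    current CPU's node, while dropping the reset keeps the caller's choice:

    #include <stdio.h>

    struct alloc_context { int preferred_nid; };

    static int current_cpu_node; /* 0: the thread has migrated to a node 0 CPU */

    /* model of the slowpath handling of a __GFP_THISNODE allocation */
    static int slowpath_node(struct alloc_context *ac, int do_reset)
    {
            if (do_reset)
                    ac->preferred_nid = current_cpu_node; /* the removed reset */
            return ac->preferred_nid; /* __GFP_THISNODE: allocate from this node only */
    }

    int main(void)
    {
            struct alloc_context ac = { .preferred_nid = 1 }; /* caller wants node 1 */

            printf("with reset (old):    node %d\n", slowpath_node(&ac, 1));
            ac.preferred_nid = 1;
            printf("without reset (new): node %d\n", slowpath_node(&ac, 0));
            return 0;
    }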

    We might consider this for 4.17 although I don't know if there's
    anything currently broken.

    SLAB is currently not affected, but in kernels older than 4.7 that don't
    yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
    allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
    ones I'll have to check.

    So stable backports should be more important, but will have to be
    reviewed carefully, as the code went through many changes. BTW I think
    that also the ac->preferred_zoneref reset is currently useless if we
    don't also reset ac->nodemask from a mempolicy to NULL first (which we
    probably should for the OOM victims etc?), but I would leave that for a
    separate patch.

    Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

29 Mar, 2018

2 commits

  • commit f59f1caf72ba00d519c793c3deb32cd3be32edc2 upstream.

    This reverts commit b92df1de5d28 ("mm: page_alloc: skip over regions of
    invalid pfns where possible"). The commit is meant to be a boot init
    speed up skipping the loop in memmap_init_zone() for invalid pfns.

    But given some specific memory mapping on x86_64 (or more generally
    theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID), the
    implementation also skips valid pfns, which is plain wrong and causes
    'kernel BUG at mm/page_alloc.c:1389!'

    crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
    kernel BUG at mm/page_alloc.c:1389!
    invalid opcode: 0000 [#1] SMP
    --
    RIP: 0010: move_freepages+0x15e/0x160
    --
    Call Trace:
    move_freepages_block+0x73/0x80
    __rmqueue+0x263/0x460
    get_page_from_freelist+0x7e1/0x9e0
    __alloc_pages_nodemask+0x176/0x420
    --

    crash> page_init_bug -v | grep RAM
    1000 - 9bfff System RAM (620.00 KiB)
    100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
    4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
    4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    100000000 - 67fffffff System RAM ( 22.00 GiB)

    crash> page_init_bug | head -6
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    1fffff00000000 0 1 DMA32 4096 1048575
    505736 505344 505855
    0 0 0 DMA 1 4095
    1fffff00000400 0 1 DMA32 4096 1048575
    BUG, zones differ!

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
    PAGE              PHYSICAL  MAPPING  INDEX  CNT  FLAGS
    ffffea0001e00000  78000000  0        0      0    0
    ffffea0001ed7fc0  7b5ff000  0        0      0    0
    ffffea0001ed8000  7b600000  0        0      0    0      <<<<
    ffffea0001ede1c0  7b787000  0        0      0    0
    ffffea0001ede200  7b788000  0        0      1    1fffff00000000

    Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Signed-off-by: Daniel Vacek
    Acked-by: Ard Biesheuvel
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Paul Burton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Vacek
     
  • commit 2e517d681632326ed98399cb4dd99519efe3e32c upstream.

    Dave Jones reported fs_reclaim lockdep warnings.

    ============================================
    WARNING: possible recursive locking detected
    4.15.0-rc9-backup-debug+ #1 Not tainted
    --------------------------------------------
    sshd/24800 is trying to acquire lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    but task is already holding lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(fs_reclaim);
    lock(fs_reclaim);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by sshd/24800:
    #0: (sk_lock-AF_INET6){+.+.}, at: [] tcp_sendmsg+0x19/0x40
    #1: (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    stack backtrace:
    CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
    Call Trace:
    dump_stack+0xbc/0x13f
    __lock_acquire+0xa09/0x2040
    lock_acquire+0x12e/0x350
    fs_reclaim_acquire.part.102+0x29/0x30
    kmem_cache_alloc+0x3d/0x2c0
    alloc_extent_state+0xa7/0x410
    __clear_extent_bit+0x3ea/0x570
    try_release_extent_mapping+0x21a/0x260
    __btrfs_releasepage+0xb0/0x1c0
    btrfs_releasepage+0x161/0x170
    try_to_release_page+0x162/0x1c0
    shrink_page_list+0x1d5a/0x2fb0
    shrink_inactive_list+0x451/0x940
    shrink_node_memcg.constprop.88+0x4c9/0x5e0
    shrink_node+0x12d/0x260
    try_to_free_pages+0x418/0xaf0
    __alloc_pages_slowpath+0x976/0x1790
    __alloc_pages_nodemask+0x52c/0x5c0
    new_slab+0x374/0x3f0
    ___slab_alloc.constprop.81+0x47e/0x5a0
    __slab_alloc.constprop.80+0x32/0x60
    __kmalloc_track_caller+0x267/0x310
    __kmalloc_reserve.isra.40+0x29/0x80
    __alloc_skb+0xee/0x390
    sk_stream_alloc_skb+0xb8/0x340
    tcp_sendmsg_locked+0x8e6/0x1d30
    tcp_sendmsg+0x27/0x40
    inet_sendmsg+0xd0/0x310
    sock_write_iter+0x17a/0x240
    __vfs_write+0x2ab/0x380
    vfs_write+0xfb/0x260
    SyS_write+0xb6/0x140
    do_syscall_64+0x1e5/0xc05
    entry_SYSCALL64_slow_path+0x25/0x25

    This warning is caused by commit d92a8cfcb37e ("locking/lockdep:
    Rework FS_RECLAIM annotation") which replaced the use of
    lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
    and lockdep_trace_alloc() in slab_pre_alloc_hook() with
    fs_reclaim_acquire()/ fs_reclaim_release().

    Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
    __GFP_NOWARN to gfp_mask, and the reclaim path simply propagates
    __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() tries to
    grab the 'fake' lock again when __perform_reclaim() has already
    grabbed the 'fake' lock.

    The

    /* this guy won't enter reclaim */
    if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
            return false;

    test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
    was added by commit cf40bd16fdad ("lockdep: annotate reclaim context
    (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread
    won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
    341ce06f69ab ("page allocator: calculate the alloc_flags for allocation
    only once") added the PF_MEMALLOC safeguard (

    /* Avoid recursion of direct reclaim */
    if (p->flags & PF_MEMALLOC)
            goto nopage;

    in __alloc_pages_slowpath()).

    Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC check
    and allow __need_fs_reclaim() to return false.
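
    A hedged, standalone model of that decision (the flag values and helper
    name are stand-ins, not the kernel definitions): the outdated test
    acquires the fake lock again for a PF_MEMALLOC task whenever
    __GFP_NOMEMALLOC is set, while the fixed test always skips it:

    #include <stdbool.h>
    #include <stdio.h>

    #define PF_MEMALLOC        0x1u
    #define __GFP_NOMEMALLOC   0x2u

    /* toy model of __need_fs_reclaim(): should the lockdep 'fs_reclaim'
     * fake lock be taken for this allocation? */
    static bool need_fs_reclaim(unsigned int task_flags, unsigned int gfp_mask,
                                bool patched)
    {
            if (patched) {
                    /* fixed: a PF_MEMALLOC task never enters reclaim */
                    if (task_flags & PF_MEMALLOC)
                            return false;
            } else {
                    /* outdated test: only skipped when __GFP_NOMEMALLOC is clear */
                    if ((task_flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
                            return false;
            }
            return true;
    }

    int main(void)
    {
            /* the __alloc_skb() path: PF_MEMALLOC task, gfp has __GFP_NOMEMALLOC */
            printf("old:     %s\n",
                   need_fs_reclaim(PF_MEMALLOC, __GFP_NOMEMALLOC, false)
                   ? "acquire fake lock again -> lockdep splat" : "skip");
            printf("patched: %s\n",
                   need_fs_reclaim(PF_MEMALLOC, __GFP_NOMEMALLOC, true)
                   ? "acquire" : "skip");
            return 0;
    }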

    Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
    Fixes: d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation")
    Signed-off-by: Tetsuo Handa
    Reported-by: Dave Jones
    Tested-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Nikolay Borisov
    Cc: Michal Hocko
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

31 Jan, 2018

1 commit

  • commit b050e3769c6b4013bb937e879fc43bf1847ee819 upstream.

    Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
    order-0 allocations"), __zone_watermark_ok() check for high-order
    allocations will shortcut per-migratetype free list checks for
    ALLOC_HARDER allocations, and return true as long as there's free page
    of any migratetype. The intention is that ALLOC_HARDER can allocate
    from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.

    However, as a side effect, the watermark check will then also return
    true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
    CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
    allocation cannot actually obtain isolated pages, and might not be able
    to obtain CMA pages, this can result in a false positive.

    The condition should be rare and perhaps the outcome is not a fatal one.
    Still, it's better if the watermark check is correct. There also
    shouldn't be a performance tradeoff here.
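
    A simplified standalone model of the corrected check (the structures are
    stand-ins, not the kernel's free_area/zone layout): instead of accepting
    any free page at a given order, an ALLOC_HARDER request should only count
    the MIGRATE_HIGHATOMIC reserve:

    #include <stdbool.h>
    #include <stdio.h>

    enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_HIGHATOMIC,
                       MIGRATE_ISOLATE, MIGRATE_TYPES };

    /* toy free_area: free pages of each migratetype at one order */
    struct free_area { unsigned long nr[MIGRATE_TYPES]; };

    static bool order_ok_alloc_harder(const struct free_area *area, bool patched)
    {
            if (!patched) {
                    /* old shortcut: any free page of any migratetype counts */
                    for (int mt = 0; mt < MIGRATE_TYPES; mt++)
                            if (area->nr[mt])
                                    return true;
                    return false;
            }
            /* fixed: ALLOC_HARDER may only tap the highatomic reserve */
            return area->nr[MIGRATE_HIGHATOMIC] != 0;
    }

    int main(void)
    {
            /* only isolated pages are free at this order */
            struct free_area area = { .nr = { [MIGRATE_ISOLATE] = 4 } };

            printf("old:     %d (false positive)\n", order_ok_alloc_harder(&area, false));
            printf("patched: %d\n", order_ok_alloc_harder(&area, true));
            return 0;
    }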

    Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
    Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations")
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

25 Dec, 2017

2 commits

  • commit 629a359bdb0e0652a8227b4ff3125431995fec6e upstream.

    Since commit:

    83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")

    we allocate the mem_section array dynamically in sparse_memory_present_with_active_regions(),
    but some architectures, like arm64, don't call the routine to initialize sparsemem.

    Let's move the initialization into memory_present(); it should cover all
    architectures.

    Reported-and-tested-by: Sudeep Holla
    Tested-by: Bjorn Andersson
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
    Link: http://lkml.kernel.org/r/20171107083337.89952-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar
    Cc: Dan Rue
    Cc: Naresh Kamboju
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 83e3c48729d9ebb7af5a31a504f3fd6aff0348c4 upstream.

    Size of the mem_section[] array depends on the size of the physical address space.

    In preparation for boot-time switching between paging modes on x86-64
    we need to make the allocation of mem_section[] dynamic, because otherwise
    we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
    for 4-level paging and 2MB for 5-level paging mode.

    The patch allocates the array on the first call to sparse_memory_present_with_active_regions().

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

05 Dec, 2017

2 commits

  • commit 63cd448908b5eb51d84c52f02b31b9b4ccd1cb5a upstream.

    If the call to __alloc_contig_migrate_range() in alloc_contig_range()
    returns -EBUSY, processing continues so that test_pages_isolated() is called
    where there is a tracepoint to identify the busy pages. However, it is
    possible for busy pages to become available between the calls to these
    two routines. In this case, the range of pages may be allocated.
    Unfortunately, the original return code (ret == -EBUSY) is still set and
    returned to the caller. Therefore, the caller believes the pages were
    not allocated and they are leaked.

    Update the comment to indicate that allocation is still possible even if
    __alloc_contig_migrate_range returns -EBUSY. Also, clear the return code
    in this case so that it is not accidentally used or returned to the
    caller.
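
    A minimal sketch of the control-flow fix (standalone model; the two
    helpers stand in for __alloc_contig_migrate_range() and
    test_pages_isolated()): when the later isolation check succeeds, the
    stale -EBUSY from the earlier step has to be cleared so the caller
    doesn't treat an allocated range as a failure:

    #include <errno.h>
    #include <stdio.h>

    /* stand-ins for the two kernel helpers */
    static int migrate_range(void)  { return -EBUSY; } /* pages were busy here... */
    static int pages_isolated(void) { return 0; }      /* ...but became free meanwhile */

    static int alloc_contig_range_model(int apply_fix)
    {
            int ret = migrate_range();

            if (ret && ret != -EBUSY)
                    return ret;             /* hard failure */

            if (pages_isolated() == 0) {
                    /* the range is ours; without the fix the stale -EBUSY leaks out */
                    if (apply_fix)
                            ret = 0;
            } else {
                    ret = -EBUSY;
            }
            return ret;
    }

    int main(void)
    {
            printf("old:   ret = %d (caller thinks it failed, pages leak)\n",
                   alloc_contig_range_model(0));
            printf("fixed: ret = %d\n", alloc_contig_range_model(1));
            return 0;
    }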

    Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com
    Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation failure")
    Signed-off-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • commit 4b81cb2ff69c8a8e297a147d2eb4d9b5e8d7c435 upstream.

    drain_all_pages backs off when called from a kworker context since
    commit 0ccce3b92421 ("mm, page_alloc: drain per-cpu pages from workqueue
    context") because the original IPI based pcp draining has been replaced
    by a WQ based one and the check wanted to prevent recursion and
    inter-worker dependencies. This made some sense at the time
    because the system WQ was used and one worker holding the lock
    could be blocked while waiting for new workers to emerge, which can be a
    problem under OOM conditions.

    Since then, commit ce612879ddc7 ("mm: move pcp and lru-pcp draining into
    single wq") has moved draining to a dedicated WQ (mm_percpu_wq) with a
    rescuer, so we shouldn't depend on any other WQ activity to make
    forward progress. Calling drain_all_pages from a worker context is
    therefore safe as long as this doesn't happen from mm_percpu_wq itself,
    which is not the case because all its workers are required to _not_
    depend on any MM locks.

    Why is this a problem in the first place? ACPI driven memory hot-remove
    (acpi_device_hotplug) is executed from the worker context. We end up
    calling __offline_pages to free all the pages and that requires both
    lru_add_drain_all_cpuslocked and drain_all_pages to do their job
    otherwise we can have dangling pages on pcp lists and fail the offline
    operation (__test_page_isolated_in_pageblock would see a page with 0 ref
    count but without PageBuddy set).

    Fix the issue by removing the worker check in drain_all_pages.
    lru_add_drain_all_cpuslocked doesn't have this restriction so it works
    as expected.

    Link: http://lkml.kernel.org/r/20170828093341.26341-1-mhocko@kernel.org
    Fixes: 0ccce3b924212 ("mm, page_alloc: drain per-cpu pages from workqueue context")
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

24 Nov, 2017

1 commit

  • commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.

    In reset_deferred_meminit() we determine number of pages that must not
    be deferred. We initialize pages for at least 2G of memory, but also
    pages for reserved memory in this node.

    The reserved memory is determined in this function:
    memblock_reserved_memory_within(), which operates over physical
    addresses, and returns size in bytes. However, reset_deferred_meminit()
    assumes that this function operates with pfns, and returns a page
    count.

    The result is that in the best case machine boots slower than expected
    due to initializing more pages than needed in single thread, and in the
    worst case panics because fewer than needed pages are initialized early.
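
    The unit mismatch is easy to model (standalone sketch; the PAGE_SHIFT
    value and the helper name are assumptions): the reserved size comes back
    in bytes and must be converted to a page count before being added to a
    page counter:

    #include <stdio.h>

    #define PAGE_SHIFT 12 /* 4K pages, as an assumption */

    /* stand-in for memblock_reserved_memory_within(): returns BYTES */
    static unsigned long reserved_bytes(void) { return 64UL << 20; } /* 64 MB */

    int main(void)
    {
            unsigned long nr_pages = 2UL << (30 - PAGE_SHIFT); /* pages in 2G */

            unsigned long buggy = nr_pages + reserved_bytes();                 /* bytes used as pages */
            unsigned long fixed = nr_pages + (reserved_bytes() >> PAGE_SHIFT); /* convert first */

            printf("buggy deferred threshold: %lu pages\n", buggy);
            printf("fixed deferred threshold: %lu pages\n", fixed);
            return 0;
    }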

    Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
    Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     

04 Oct, 2017

2 commits

    memmap_init_zone gets a pfn range to initialize and it can be really
    large, resulting in a soft lockup on non-preemptible kernels:

    NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
    [...]
    task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
    RIP: move_pfn_range_to_zone+0x185/0x1d0
    [...]
    Call Trace:
    devm_memremap_pages+0x2c7/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70

    Fix this by adding a scheduling point once per page block.
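
    The shape of the fix is a periodic scheduling point in the pfn loop; the
    sketch below is a userspace model where the pageblock size is a made-up
    constant and cond_resched() is stood in by a counter:

    #include <stdio.h>

    #define PAGEBLOCK_NR_PAGES 512UL /* illustrative pageblock size */

    static unsigned long resched_points;

    static void cond_resched_model(void) { resched_points++; } /* stands in for cond_resched() */

    static void memmap_init_model(unsigned long start_pfn, unsigned long end_pfn)
    {
            for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++) {
                    /* once per pageblock, give the scheduler a chance to run */
                    if (!(pfn & (PAGEBLOCK_NR_PAGES - 1)))
                            cond_resched_model();
                    /* ... initialise the struct page for this pfn ... */
            }
    }

    int main(void)
    {
            memmap_init_model(0, 1UL << 20); /* 1M pfns */
            printf("scheduling points taken: %lu\n", resched_points);
            return 0;
    }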

    Link: http://lkml.kernel.org/r/20170918121410.24466-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The function is called from __meminit context and calls other __meminit
    functions but is not itself marked as such today:

    WARNING: vmlinux.o(.text.unlikely+0x4516): Section mismatch in reference from the function init_reserved_page() to the function .meminit.text:early_pfn_to_nid()
    The function init_reserved_page() references the function __meminit early_pfn_to_nid().
    This is often because init_reserved_page lacks a __meminit annotation or the annotation of early_pfn_to_nid is wrong.

    On most compilers, we don't notice this because the function gets
    inlined all the time. Adding __meminit here fixes the harmless warning
    for the old versions and is generally the correct annotation.

    Link: http://lkml.kernel.org/r/20170915193149.901180-1-arnd@arndb.de
    Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
    Signed-off-by: Arnd Bergmann
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

09 Sep, 2017

2 commits

    We are erroneously initializing alloc_flags before gfp_allowed_mask is
    applied. This could cause problems after pm_restrict_gfp_mask() is called
    during a suspend operation. Apply gfp_allowed_mask before initializing
    alloc_flags so that the first allocation attempt uses the correct flags.
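
    The fix is purely an ordering change. A hedged sketch of the idea
    (standalone, with made-up flag values and helper names):

    #include <stdio.h>

    #define __GFP_IO        0x1u
    #define __GFP_FS        0x2u
    #define __GFP_HARDWALL  0x4u

    static unsigned int gfp_allowed_mask = ~0u; /* pm_restrict_gfp_mask() clears IO/FS here */

    static unsigned int first_attempt_flags(unsigned int gfp_mask, int patched)
    {
            unsigned int alloc_gfp;

            if (patched)
                    gfp_mask &= gfp_allowed_mask;   /* apply the restriction first */
            alloc_gfp = gfp_mask | __GFP_HARDWALL;  /* stands in for the alloc_flags/mask setup */
            if (!patched)
                    gfp_mask &= gfp_allowed_mask;   /* old code: too late for the first attempt */
            return alloc_gfp;
    }

    int main(void)
    {
            gfp_allowed_mask = ~(__GFP_IO | __GFP_FS); /* as during suspend */
            printf("old:     first attempt uses 0x%x (IO/FS still set)\n",
                   first_attempt_flags(__GFP_IO | __GFP_FS, 0));
            printf("patched: first attempt uses 0x%x\n",
                   first_attempt_flags(__GFP_IO | __GFP_FS, 1));
            return 0;
    }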

    Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
    Fixes: 83d4ca8148fd9092 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Patch series "Separate NUMA statistics from zone statistics", v2.

    Each page allocation updates a set of per-zone statistics with a call to
    zone_statistics(). As discussed at the 2017 MM summit, these are a
    substantial source of overhead in the page allocator and are very rarely
    consumed. The significant overhead comes from cache bouncing caused by
    the zone counters (NUMA-associated counters) being updated in parallel
    during multi-threaded page allocation (pointed out by Dave Hansen).

    A link to the MM summit slides:
    http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

    To mitigate this overhead, this patchset separates NUMA statistics from
    the zone statistics framework and updates the NUMA counter threshold to a
    fixed size of MAX_U16 - 2, as a small threshold greatly increases the
    update frequency of the global counter from the local per-CPU counters
    (suggested by Ying Huang). The rationale is that these statistics
    counters don't need to be read often, unlike other VM counters, so it's
    not a problem to use a large threshold and make readers more expensive.
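
    The core idea is an ordinary per-CPU diff counter whose fold-over
    threshold is pushed near the u16 limit. A simplified single-threaded
    model (not the kernel's vmstat code; the names and the small threshold
    value are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define NUMA_STATS_THRESHOLD (UINT16_MAX - 2) /* 65533, as in the patchset */

    struct numa_counter {
            uint16_t pcp_diff;   /* per-CPU part, cheap to bump */
            long     global;     /* expensive shared counter */
            long     folds;      /* how often we touched the shared counter */
    };

    static void numa_counter_inc(struct numa_counter *c, unsigned int threshold)
    {
            if (++c->pcp_diff >= threshold) {     /* fold into the global counter rarely */
                    c->global += c->pcp_diff;
                    c->pcp_diff = 0;
                    c->folds++;
            }
    }

    int main(void)
    {
            struct numa_counter small = {0}, large = {0};

            for (long i = 0; i < 1000000; i++) {
                    numa_counter_inc(&small, 125);                  /* old zone-stats style threshold */
                    numa_counter_inc(&large, NUMA_STATS_THRESHOLD); /* new NUMA stats threshold */
            }
            printf("threshold 125:   %ld global updates\n", small.folds);
            printf("threshold 65533: %ld global updates\n", large.folds);
            return 0;
    }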

    With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369, see
    below) for single page allocation and reclaim on Jesper's
    page_bench03 benchmark. Meanwhile, this patchset keeps the same style
    of virtual memory statistics with little end-user-visible effect (it only
    moves the NUMA stats to show behind the zone page stats; see the first
    patch for details).

    I did an experiment of single page allocation and reclaim running
    concurrently, using Jesper's page_bench03 benchmark on a 2-socket
    Broadwell-based server (88 processors with 126G memory), with different
    sizes of the pcp counter threshold.

    Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    Threshold   CPU cycles      Throughput (88 threads)
        32          799             241760478
        64          640             301628829
       125          537             358906028    (system default)
       256          468             412397590
       512          428             450550704
      4096          399             482520943
     20000          394             489009617
     30000          395             488017817
     65533          369 (-31.3%)    521661345 (+45.3%)    (with this patchset)
       N/A          342 (-36.3%)    562900157 (+56.8%)    (zone_statistics disabled)

    This patch (of 3):

    In this patch, NUMA statistics are separated from the zone statistics
    framework and all the call sites of NUMA stats are changed to use
    numa-stats-specific functions. There is no functional change
    except that the number of NUMA stats is shown behind the zone page stats
    when users *read* the zone info.

    E.g. cat /proc/zoneinfo

        ***Base***                          ***With this patch***
        nr_free_pages          3976         nr_free_pages          3976
        nr_zone_inactive_anon  0            nr_zone_inactive_anon  0
        nr_zone_active_anon    0            nr_zone_active_anon    0
        nr_zone_inactive_file  0            nr_zone_inactive_file  0
        nr_zone_active_file    0            nr_zone_active_file    0
        nr_zone_unevictable    0            nr_zone_unevictable    0
        nr_zone_write_pending  0            nr_zone_write_pending  0
        nr_mlock               0            nr_mlock               0
        nr_page_table_pages    0            nr_page_table_pages    0
        nr_kernel_stack        0            nr_kernel_stack        0
        nr_bounce              0            nr_bounce              0
        nr_zspages             0            nr_zspages             0
        numa_hit               0            *nr_free_cma           0*
        numa_miss              0            numa_hit               0
        numa_foreign           0            numa_miss              0
        numa_interleave        0            numa_foreign           0
        numa_local             0            numa_interleave        0
        numa_other             0            numa_local             0
        *nr_free_cma           0*           numa_other             0
        ...                                 ...
        vm stats threshold: 10              vm stats threshold: 10
        ...                                 ...

    The next patch updates the numa stats counter size and threshold.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ying Huang
    Cc: Aaron Lu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     

07 Sep, 2017

10 commits

  • For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full access
    to memory reserves. There are a few shortcomings of this implementation,
    though.

    First of all and the most serious one is that the full access to memory
    reserves is quite dangerous because we leave no safety room for the
    system to operate and potentially do last emergency steps to move on.

    Secondly this flag is per task_struct while the OOM killer operates on
    mm_struct granularity so all processes sharing the given mm are killed.
    Giving the full access to all these task_structs could lead to a quick
    memory reserves depletion. We have tried to reduce this risk by giving
    TIF_MEMDIE only to the main thread and the currently allocating task but
    that doesn't really solve this problem while it surely opens up a room
    for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the
    allocator without access to memory reserves because a particular thread
    was not the group leader.

    Now that we have the oom reaper and that all oom victims are reapable
    after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
    kthreads") we can be more conservative and grant only partial access to
    memory reserves because there are reasonable chances of the parallel
    memory freeing. We still want some access to reserves because we do not
    want other consumers to eat up the victim's freed memory. oom victims
    will still contend with __GFP_HIGH users but those shouldn't be so
    aggressive to starve oom victims completely.

    Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
    the half of the reserves. This makes the access to reserves independent
    of which task has passed through mark_oom_victim. Also drop any usage
    of TIF_MEMDIE from the page allocator proper and replace it by
    tsk_is_oom_victim as well which will make page_alloc.c completely
    TIF_MEMDIE free finally.

    CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
    ALLOC_NO_WATERMARKS approach.

    There is a demand to make the oom killer memcg aware which will imply
    many tasks killed at once. This change will allow such a usecase
    without worrying about complete memory reserves depletion.
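
    A hedged sketch of the watermark side of this change (a standalone model,
    not the exact kernel arithmetic): instead of ignoring the watermark
    entirely as ALLOC_NO_WATERMARKS did, an ALLOC_OOM request only gets the
    minimum watermark cut in half:

    #include <stdbool.h>
    #include <stdio.h>

    #define ALLOC_HIGH          0x1u /* __GFP_HIGH callers */
    #define ALLOC_OOM           0x2u /* tsk_is_oom_victim() callers, per this patch */
    #define ALLOC_NO_WATERMARKS 0x4u /* old full access to reserves */

    static bool watermark_ok(long free_pages, long min, unsigned int alloc_flags)
    {
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                    return true;            /* old behaviour: reserves fully exposed */
            if (alloc_flags & ALLOC_HIGH)
                    min -= min / 2;
            if (alloc_flags & ALLOC_OOM)
                    min -= min / 2;         /* OOM victims get half of the reserves */
            return free_pages > min;
    }

    int main(void)
    {
            long min = 1024, free_pages = 600;  /* below min, i.e. inside the reserves */

            printf("TIF_MEMDIE (no watermarks): %d\n",
                   watermark_ok(free_pages, min, ALLOC_NO_WATERMARKS));
            printf("ALLOC_OOM (half reserves):  %d\n",
                   watermark_ok(free_pages, min, ALLOC_OOM));
            printf("plain allocation:           %d\n",
                   watermark_ok(free_pages, min, 0));
            return 0;
    }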

    Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • global_page_state is error prone as a recent bug report pointed out [1].
    It only returns proper values for zone based counters as the enum it
    gets suggests. We already have global_node_page_state so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
    potential race while building zonelist for new populated zone") to
    protect zonelist building from races. This is no longer needed though
    because both memory online and offline are fully serialized. New users
    have grown since then.

    Notably setup_per_zone_wmarks wants to prevent races between memory
    hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
    (see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes")). Let's
    add a private lock for that purpose. This will not prevent us from
    seeing a halfway-through memory hotplug operation, but that shouldn't be
    a big deal because memory hotplug will update watermarks explicitly, so
    we will eventually get a full picture. The lock just makes sure we won't
    race when updating watermarks, which could lead to weird results.

    Also __build_all_zonelists manipulates global data so add a private lock
    for it as well. This doesn't seem to be necessary today but it is more
    robust to have a lock there.
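
    A hedged sketch of the direction (userspace model using pthreads; the
    lock name is made up): setup_per_zone_wmarks() gets its own static lock
    instead of piggybacking on zonelists_mutex:

    #include <pthread.h>

    /* private lock only for watermark updates (illustrative name) */
    static pthread_mutex_t wmark_update_lock = PTHREAD_MUTEX_INITIALIZER;

    static void __setup_per_zone_wmarks(void)
    {
            /* ... recompute and publish min/low/high watermarks ... */
    }

    /* callers: memory hotplug, khugepaged setup, min_free_kbytes sysctl */
    void setup_per_zone_wmarks(void)
    {
            pthread_mutex_lock(&wmark_update_lock);
            __setup_per_zone_wmarks();
            pthread_mutex_unlock(&wmark_update_lock);
    }

    int main(void)
    {
            setup_per_zone_wmarks(); /* racing callers would serialize on the lock */
            return 0;
    }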

    While we are at it make sure we document that memory online/offline
    depends on a full serialization either via mem_hotplug_begin() or
    device_lock.

    Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Haicheng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    build_all_zonelists has been (ab)using stop_machine to make sure that
    zonelists do not change while somebody is looking at them. This is
    just a gross hack because a) it complicates the context from which we
    can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
    switch locking to a percpu rwsem")), b) it is not really necessary,
    especially after "mm, page_alloc: simplify zonelist initialization", and
    c) it doesn't really provide the protection it claims (see below).

    Updates of the zonelists happen very seldom, basically only when a zone
    becomes populated during memory online or when it loses all the memory
    during offline. A racing iteration over zonelists could either miss a
    zone or try to work on one zone twice. Both of these are something we
    can live with occasionally because there will always be at least one
    zone visible so we are not likely to fail allocation too easily for
    example.

    Please note that the original stop_machine approach doesn't really
    provide better exclusion because the iteration might be interrupted
    halfway (unless the whole iteration is preempt-disabled, which is not
    the case in most cases), so some zones could still be seen twice or a
    zone missed.

    I have run the pathological online/offline of the single memblock in the
    movable zone while stressing the same small node with some memory
    pressure.

    Node 1, zone DMA
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
        protection: (0, 943, 943, 943)
    Node 1, zone DMA32
      pages free     227310
            min      8294
            low      10367
            high     12440
            spanned  262112
            present  262112
            managed  241436
        protection: (0, 0, 0, 0)
    Node 1, zone Normal
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
        protection: (0, 0, 0, 1024)
    Node 1, zone Movable
      pages free     32722
            min      85
            low      117
            high     149
            spanned  32768
            present  32768
            managed  32768
        protection: (0, 0, 0, 0)

    root@test1:/sys/devices/system/node/node1# while true
    do
            echo offline > memory34/state
            echo online_movable > memory34/state
    done

    root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4

    and it survived without any unexpected behavior. While this is not
    really great testing coverage, it should exercise the allocation path
    quite a lot.

    Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_zonelists gradually builds zonelists from the nearest to the most
    distant node. As we do not know how many populated zones we will have
    in each node, we rely on the _zoneref to terminate the initialized part
    of the zonelist by a NULL zone. While this is functionally correct it is
    quite suboptimal because we cannot allow updaters to race with zonelists
    users because they could see an empty zonelist and fail the allocation
    or hit the OOM killer in the worst case.

    We can do much better, though. We can store the node ordering into an
    already existing node_order array and then give this array to
    build_zonelists_in_node_order and do the whole initialization at once.
    zonelists consumers still might see a halfway initialized state but that
    should be much more tolerable because the list will not be empty and
    they would either see some zone twice or skip over some zone(s) in the
    worst case, which shouldn't lead to immediate failures.

    While at it let's simplify build_zonelists_node which is rather
    confusing now. It gets an index into the zoneref array and returns the
    updated index for the next iteration. Let's rename the function to
    build_zonerefs_node to better reflect its purpose and give it the
    zoneref array to update. The function doesn't return the index anymore.
    It just returns the number of added zones so that the caller can advance
    the zoneref array start for the next update.
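
    A compact sketch of the new calling convention (standalone model; the
    structures are simplified stand-ins): the helper fills the zoneref array
    it is handed and returns how many entries it added, and the caller
    advances the array by that amount before writing the terminator once at
    the end:

    #include <stdio.h>

    #define MAX_ZONES_PER_NODE 4
    #define MAX_NODES          2

    struct zoneref { int nid, zone_idx; };

    /* stand-in for build_zonerefs_node(): append this node's populated zones
     * and return the number of zonerefs added */
    static int build_zonerefs_node(int nid, int populated_zones, struct zoneref *zonerefs)
    {
            int n = 0;

            for (int z = populated_zones - 1; z >= 0; z--) {
                    zonerefs[n].nid = nid;
                    zonerefs[n].zone_idx = z;
                    n++;
            }
            return n;
    }

    int main(void)
    {
            struct zoneref zonelist[MAX_NODES * MAX_ZONES_PER_NODE + 1];
            struct zoneref *pos = zonelist;
            int node_order[MAX_NODES] = { 1, 0 };     /* nearest node first */
            int populated[MAX_NODES]  = { 2, 3 };     /* populated zones per node, illustrative */

            for (int i = 0; i < MAX_NODES; i++)
                    pos += build_zonerefs_node(node_order[i], populated[node_order[i]], pos);
            pos->nid = -1;                            /* terminator, written once at the end */

            for (struct zoneref *z = zonelist; z->nid != -1; z++)
                    printf("node %d zone %d\n", z->nid, z->zone_idx);
            return 0;
    }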

    This patch alone doesn't introduce any functional change yet, though, it
    is merely a preparatory work for later changes.

    Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This will also remove a pointless zonelists
    rebuilding, which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    __build_all_zonelists reinitializes each online cpu local node for
    CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
    memoryless nodes could gain some memory during memory hotplug and so
    the local node should be changed for CPUs close to such a node. It
    makes less sense to do that unconditionally for a newly created NUMA
    node which is still offline and without any memory.

    Let's also simplify the cpu loop and use for_each_online_cpu instead of
    an explicit cpu_online check for all possible cpus.

    Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • boot_pageset is a boot time hack which gets superseded by normal
    pagesets later in the boot process. It makes zero sense to reinitialize
    it again and again during memory hotplug.

    Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "cleanup zonelists initialization", v1.

    This is aimed at cleaning up the zonelists initialization code we have,
    but the primary motivation was a bug report [2] which got resolved; the
    usage of stop_machine is just too ugly to live, though. Most patches are
    straightforward but 3 of them need special consideration.

    Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
    because this is a user visible change. As I argue in the patch
    description I do not think we have a strong usecase for it these days.
    I have kept sysctl in place and warn into the log if somebody tries to
    configure zone lists ordering. If somebody has a real usecase for it we
    can revert this patch but I do not expect anybody will actually notice
    runtime differences. This patch is not strictly needed for the rest but
    it made patch 6 easier to implement.

    Patch 7 removes stop_machine from build_all_zonelists without adding any
    special synchronization between iterators and updater which I _believe_
    is acceptable as explained in the changelog. I hope I am not missing
    anything.

    Patch 8 then removes zonelists_mutex which is kind of ugly as well and
    not really needed AFAICS, but care should be taken when double checking
    my thinking.

    This patch (of 9):

    Supporting zone ordered zonelists costs us just a lot of code while the
    usefulness is arguable, if existent at all. Mel has already made node
    ordering default on 64b systems. 32b systems are still using
    ZONELIST_ORDER_ZONE because it is considered better to fallback to a
    different NUMA node rather than consume precious lowmem zones.

    This argument is, however, weakened by the fact that the memory reclaim
    has been reworked to be node rather than zone oriented. This means that
    lowmem requests have to skip over all highmem pages on LRUs already and
    so zone ordering doesn't save the reclaim time much.
    advantage of the zone ordering is under a light memory pressure when
    highmem requests do not ever hit into lowmem zones and the lowmem
    pressure doesn't need to reclaim.

    Considering that 32b NUMA systems are rather suboptimal already and it
    is generally advisable to use a 64b kernel on such HW, I believe we
    should rather care about code maintainability and just get rid of
    ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
    somebody tries to set zone ordering either from the kernel command line
    or the sysctl.

    [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
    Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Abdul Haleem
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 9adb62a5df9c ("mm/hotplug: correctly setup fallback zonelists
    when creating new pgdat") tries to build the correct zonelist for a
    newly added node, while it is not necessary to rebuild it for already
    existing nodes.

    In build_zonelists(), it will iterate on nodes with memory. For a newly
    added node, it will have memory until node_states_set_node() is called
    in online_pages().

    This patch avoids rebuilding the zonelists for already existing nodes.

    build_zonelists_node() uses managed_zone(zone) checks, so it should not
    include empty zones anyway. So effectively we avoid some pointless work
    under stop_machine().

    [akpm@linux-foundation.org: tweak comment text]
    [akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
    Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Jiang Liu
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

04 Sep, 2017

1 commit


01 Sep, 2017

1 commit

  • We are doing a last second memory allocation attempt before calling
    out_of_memory(). But since slab shrinker functions might indirectly
    wait for other threads' __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
    allocations via sleeping locks, calling slab shrinker functions from
    node_reclaim() from get_page_from_freelist() with oom_lock held has the
    possibility of deadlock. Therefore, make sure that the last second
    memory allocation attempt does not call slab shrinker functions.

    Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

26 Aug, 2017

1 commit

    There is a problem that counting the pages for creating the
    hibernation snapshot can take a significant amount of time, especially
    on systems with large memory. Since the counting job is performed with
    irqs disabled, this might lead to an NMI lockup. The following warning
    was found on a system with 1.5TB DRAM:

    Freezing user space processes ... (elapsed 0.002 seconds) done.
    OOM killer disabled.
    PM: Preallocating image memory...
    NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
    CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
    task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
    RIP: 0010:memory_bm_find_bit+0xf4/0x100
    Call Trace:
    swsusp_set_page_free+0x2b/0x30
    mark_free_pages+0x147/0x1c0
    count_data_pages+0x41/0xa0
    hibernate_preallocate_memory+0x80/0x450
    hibernation_snapshot+0x58/0x410
    hibernate+0x17c/0x310
    state_store+0xdf/0xf0
    kobj_attr_store+0xf/0x20
    sysfs_kf_write+0x37/0x40
    kernfs_fop_write+0x11c/0x1a0
    __vfs_write+0x37/0x170
    vfs_write+0xb1/0x1a0
    SyS_write+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xa5
    ...
    done (allocated 6590003 pages)
    PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)

    It has taken nearly 20 seconds (2.10GHz CPU), thus the NMI lockup was
    triggered. In case the timeout of the NMI watchdog has been set to 1
    second, a safe interval should be 6590003/20 = 320k pages in theory.
    However, there might also be some platforms running at a lower
    frequency, so feed the watchdog every 100k pages.
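
    The shape of the fix (a standalone model; WD_PAGE_COUNT and
    touch_nmi_watchdog() are stood in by a constant and a counter, and the
    128k interval comes from the note below): keep a countdown so the
    per-page check is a cheap decrement rather than a modulus, and feed the
    watchdog whenever it hits zero:

    #include <stdio.h>

    #define WD_PAGE_COUNT (128 * 1024) /* feed the watchdog every 128k pages */

    static unsigned long watchdog_feeds;

    static void touch_nmi_watchdog_model(void) { watchdog_feeds++; }

    static void mark_free_pages_model(unsigned long nr_pages)
    {
            unsigned long page_count = WD_PAGE_COUNT;

            for (unsigned long pfn = 0; pfn < nr_pages; pfn++) {
                    if (!--page_count) {            /* cheap countdown, no modulus */
                            touch_nmi_watchdog_model();
                            page_count = WD_PAGE_COUNT;
                    }
                    /* ... inspect/mark this page for the snapshot ... */
            }
    }

    int main(void)
    {
            mark_free_pages_model(6590003); /* page count from the log above */
            printf("watchdog fed %lu times\n", watchdog_feeds);
            return 0;
    }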

    [yu.c.chen@intel.com: simplification]
    Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
    [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
    Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
    Signed-off-by: Chen Yu
    Reported-by: Jan Filipcewicz
    Suggested-by: Michal Hocko
    Reviewed-by: Michal Hocko
    Acked-by: Rafael J. Wysocki
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Len Brown
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yu
     

25 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • There is an existing use-after-free bug when deferred struct pages are
    enabled:

    memblock_add() allocates memory for the memory array if more than 128
    entries are needed. See the comment in e820__memblock_setup():

    * The bootstrap memblock region count maximum is 128 entries
    * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
    * than that - so allow memblock resizing.

    This memblock memory is freed here:
    free_low_memory_core_early()

    We access the freed memblock.memory later in boot when deferred pages
    are initialized in this path:

    deferred_init_memmap()
    for_each_mem_pfn_range()
    __next_mem_pfn_range()
    type = &memblock.memory;

    One possible explanation for why this use-after-free hasn't been hit
    before is that the limit of INIT_MEMBLOCK_REGIONS has never been
    exceeded, at least on systems where deferred struct pages were enabled.

    Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
    and verifying in qemu that this code is getting executed and that the
    freed pages are sane.
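
    One way to avoid the use-after-free, shown only as a hedged sketch
    (not necessarily the actual fix): keep the dynamically allocated
    memblock region arrays alive until deferred struct-page initialization
    has finished walking memblock.memory, and discard them afterwards.

    ---------- Sketch start ----------
    void __init page_alloc_init_late(void)
    {
            /* ... wait here for the deferred memmap init threads ... */

            /* only now is it safe to free memblock's region arrays */
            memblock_discard();
    }
    ---------- Sketch end ----------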

    Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
    Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Steven Sistare
    Reviewed-by: Daniel Jordan
    Reviewed-by: Bob Picco
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

11 Aug, 2017

3 commits

  • Conflicts:
    include/linux/mm_types.h
    mm/huge_memory.c

    I removed the smp_mb__before_spinlock() like the following commit does:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and fixed up the affected commits.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The RDMA subsystem can generate several thousand of these messages per
    second, eventually leading to a kernel crash. Ratelimit these messages
    to prevent this crash.
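
    A hedged sketch of the kind of change involved (variable names assumed
    from the alloc_contig_range() path of that era): switch the
    per-failure message to its ratelimited variant.

    ---------- Sketch start ----------
    /* was pr_info(); a flood of -EBUSY results no longer floods the log */
    pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
                        __func__, outer_start, end);
    ---------- Sketch end ----------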

    Doug said:
    "I've been carrying a version of this for several kernel versions. I
    don't remember when they started, but we have one (and only one) class
    of machines: Dell PE R730xd, that generate these errors. When it
    happens, without a rate limit, we get rcu timeouts and kernel oopses.
    With the rate limit, we just get a lot of annoying kernel messages but
    the machine continues on, recovers, and eventually the memory
    operations all succeed"

    And:
    "> Well... why are all these EBUSY's occurring? It sounds inefficient
    > (at least) but if it is expected, normal and unavoidable then
    > perhaps we should just remove that message altogether?

    I don't have an answer to that question. To be honest, I haven't
    looked real hard. We never had this at all, then it started out of the
    blue, but only on our Dell 730xd machines (and it hits all of them),
    but no other classes or brands of machines. And we have our 730xd
    machines loaded up with different brands and models of cards (for
    instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
    ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
    meant it wasn't tied to any particular brand/model of RDMA hardware.
    To me, it always smelled of a hardware oddity specific to maybe the
    CPUs or mainboard chipsets in these machines, so given that I'm not an
    mm expert anyway, I never chased it down.

    A few other relevant details: it showed up somewhere around 4.8/4.9 or
    thereabouts. It never happened before, but the printk has been there
    since the 3.18 days, so possibly the test to trigger this message was
    changed, or something else in the allocator changed such that the
    situation started happening on these machines?

    And, like I said, it is specific to our 730xd machines (but they are
    all identical, so that could mean it's something like their specific
    ram configuration is causing the allocator to hit this on these
    machines but not on other machines in the cluster; I don't want to say
    it's necessarily the model of chipset or CPU, there are other bits of
    identicalness between these machines)"

    Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
    Signed-off-by: Jonathan Toppins
    Reviewed-by: Doug Ledford
    Tested-by: Doug Ledford
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Toppins
     
  • As Tetsuo points out:
    "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
    node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
    0kB"

    In addition to /proc/meminfo, this problem also affects the slab
    counters in OOM/allocation failure info dumps, can cause early -ENOMEM
    from overcommit protection, and can miscalculate image size
    requirements during suspend-to-disk.

    This is because the patch in question switched the slab counters from
    the zone level to the node level, but forgot to update the global
    accessor functions to read the aggregate node data instead of the
    aggregate zone data.

    Use global_node_page_state() to access the global slab counters.
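
    A hedged sketch of the accessor switch (illustrative; the patch
    touches several call sites):

    ---------- Sketch start ----------
    /* before: zone-level counters, which no longer carry slab stats */
    slab = global_page_state(NR_SLAB_RECLAIMABLE) +
           global_page_state(NR_SLAB_UNRECLAIMABLE);

    /* after: aggregate the per-node counters instead */
    slab = global_node_page_state(NR_SLAB_RECLAIMABLE) +
           global_node_page_state(NR_SLAB_UNRECLAIMABLE);
    ---------- Sketch end ----------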

    Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
    Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Cc: Stefan Agner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Aug, 2017

1 commit

  • A while ago someone, and I cannot find the email just now, asked if we
    could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
    like we use for other things such as workqueues etc. I think this
    should be possible, which allows reducing the 'irq' states and will
    reduce the number of __bfs() lookups we do.

    Removing the 1 IRQ state results in 4 fewer __bfs() walks per
    dependency, improving lockdep performance. And by moving this
    annotation out of the lockdep code it becomes easier for the mm people
    to extend.
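
    A hedged sketch of the 'fake lock' approach (simplified; the real
    fs_reclaim annotation checks a few more conditions): a static
    lockdep_map is acquired around __GFP_FS-capable allocations and around
    reclaim itself, so the ordinary dependency graph catches RECLAIM_FS
    inversions without a dedicated IRQ-like state.

    ---------- Sketch start ----------
    static struct lockdep_map __fs_reclaim_map =
            STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

    static bool sketch_need_fs_reclaim(gfp_t gfp_mask)
    {
            return (gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_FS)) ==
                   (__GFP_DIRECT_RECLAIM | __GFP_FS);
    }

    static void sketch_fs_reclaim_acquire(gfp_t gfp_mask)
    {
            if (sketch_need_fs_reclaim(gfp_mask))
                    lock_map_acquire(&__fs_reclaim_map);
    }

    static void sketch_fs_reclaim_release(gfp_t gfp_mask)
    {
            if (sketch_need_fs_reclaim(gfp_mask))
                    lock_map_release(&__fs_reclaim_map);
    }
    ---------- Sketch end ----------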

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: iamjoonsoo.kim@lge.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Aug, 2017

1 commit

  • Andre Wild reported the following warning:

    WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
    Modules linked in:
    CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
    Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
    task: 00000000701d8100 task.stack: 0000000073594000
    Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
    ...
    Call Trace:
    lockdep_assert_cpus_held+0x42/0x60)
    stop_machine_cpuslocked+0x62/0xf0
    build_all_zonelists+0x92/0x150
    numa_zonelist_order_handler+0x102/0x150
    proc_sys_call_handler.isra.12+0xda/0x118
    proc_sys_write+0x34/0x48
    __vfs_write+0x3c/0x178
    vfs_write+0xbc/0x1a0
    SyS_write+0x66/0xc0
    system_call+0xc4/0x2b0
    locks held by bash/1205:
    #0: (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
    #1: (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
    #2: (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
    Last Breaking-Event-Address:
    lockdep_assert_cpus_held+0x48/0x60

    This can be easily triggered with e.g.

    echo n > /proc/sys/vm/numa_zonelist_order

    In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu
    rwsem") memory hotplug locking was changed to fix a potential deadlock.

    This also switched the stop_machine() invocation within
    build_all_zonelists() to stop_machine_cpuslocked() which now expects
    that online cpus are locked when being called.

    This assumption is not true if build_all_zonelists() is being called
    from numa_zonelist_order_handler().

    In order to fix this, simply add a mem_hotplug_begin()/mem_hotplug_done()
    pair to numa_zonelist_order_handler().
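
    A hedged sketch of the fix (the surrounding sysctl handler is elided;
    the two-argument build_all_zonelists() signature is assumed from that
    era): since the percpu-rwsem conversion, mem_hotplug_begin() also
    takes the cpus read lock, which satisfies lockdep_assert_cpus_held()
    in stop_machine_cpuslocked().

    ---------- Sketch start ----------
    mem_hotplug_begin();
    build_all_zonelists(NULL, NULL);
    mem_hotplug_done();
    ---------- Sketch end ----------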

    Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
    Fixes: 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem")
    Signed-off-by: Heiko Carstens
    Reported-by: Andre Wild
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

13 Jul, 2017

1 commit

  • __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics
    in the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of the __GFP_REPEAT flag has been removed for !costly requests,
    we can give the original flag a better name and, more importantly, a
    more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which
    tells the user that the allocator will try really hard but there is no
    promise of success. This will work independently of the order and
    overrides the default allocator behavior. Page allocator users have
    several levels of guarantee vs. cost options (take GFP_KERNEL as an
    example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which doesn't
    even kick the background reclaim. Should be used carefully because it
    might deplete the memory and the next user might hit the more
    aggressive reclaim.

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context, but it can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non-sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already relied on this semantic. No new users are added.
    __alloc_pages_slowpath() is changed to bail out for __GFP_RETRY_MAYFAIL
    if there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted
    except the most disruptive one (the OOM killer) and a user-defined
    fallback behavior is more sensible than keeping on retrying in the
    page allocator.
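
    A hedged usage sketch (illustrative caller, not from the patch): a
    costly allocation that should try hard but must tolerate failure
    instead of triggering the OOM killer.

    ---------- Sketch start ----------
    struct page *page;

    /* try hard for a high-order buffer, but accept failure */
    page = alloc_pages(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN,
                       order);
    if (!page)
            return NULL;    /* let the caller fall back, e.g. to vmalloc() */
    ---------- Sketch end ----------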

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko