25 Apr, 2018

1 commit

  • commit ce9962bf7e22bb3891655c349faff618922d4a73

    0day reported warnings at boot on 32-bit systems without NX support:

    attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff
    WARNING: CPU: 0 PID: 1 at
    arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0:
    check_pgprot at arch/x86/include/asm/pgtable.h:535
    (inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549
    (inlined by) do_anonymous_page at mm/memory.c:3169
    (inlined by) handle_pte_fault at mm/memory.c:3961
    (inlined by) __handle_mm_fault at mm/memory.c:4087
    (inlined by) handle_mm_fault at mm/memory.c:4124

    The problem is that, due to the recent commit which removed auto-massaging
    of page protections, filtering page permissions at PTE creation time is no
    longer done, so vma->vm_page_prot is passed unfiltered to PTE creation.

    Filter the page protections before they are installed in vma->vm_page_prot.
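
    As a hedged illustration of the idea (not the actual patch), the filtering
    amounts to masking the protection value against the PTE bits the CPU
    actually supports before it ever lands in vma->vm_page_prot. The helper
    name below is made up for the sketch; __supported_pte_mask is the real
    x86 mask:

        /* Sketch: drop unsupported bits (e.g. _PAGE_NX on a !NX CPU) so that
         * pfn_pte()/check_pgprot() never see them. */
        static inline pgprot_t filter_supported_pgprot(pgprot_t prot)
        {
                return __pgprot(pgprot_val(prot) & __supported_pte_mask);
        }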

    Fixes: fb43d6cb91 ("x86/mm: Do not auto-massage page protections")
    Reported-by: Fengguang Wu
    Signed-off-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Cc: Andrea Arcangeli
    Cc: Juergen Gross
    Cc: Kees Cook
    Cc: Josh Poimboeuf
    Cc: Peter Zijlstra
    Cc: David Woodhouse
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Greg Kroah-Hartman
    Cc: Nadav Amit
    Cc: Dan Williams
    Cc: Arjan van de Ven
    Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com

    Dave Hansen
     

21 Apr, 2018

5 commits

  • f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
    Unfortunately, the page cache also uses the mapping's GFP flags for
    allocating radix tree nodes. It always masked off the __GFP_HIGHMEM
    flag, and masks off __GFP_ZERO in some paths, but not all. That causes
    radix tree nodes to be allocated with a NULL list_head, which causes
    backtraces like:

    __list_del_entry+0x30/0xd0
    list_lru_del+0xac/0x1ac
    page_cache_tree_insert+0xd8/0x110

    The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
    if they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
    innermost location, and remove it from earlier in the callchain.
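
    A minimal sketch of the idea under the assumptions above (GFP_RECLAIM_MASK
    is the kernel-internal mask from mm/internal.h; the exact call site
    differs in the real patch):

        /* Mask the mapping's gfp flags down to the reclaim-relevant bits at
         * the innermost point, so caller flags such as __GFP_ZERO or
         * __GFP_DMA32 can never reach the radix tree node allocation. */
        gfp_t gfp = mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK;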

    Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Matthew Wilcox
    Reported-by: Chris Fries
    Debugged-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If there is heavy memory pressure, page allocation with __GFP_NOWAIT
    fails easily although it's an order-0 request. I got the below warning
    9 times during a normal boot.

    : page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
    .. snip ..
    Call trace:
    dump_backtrace+0x0/0x4
    dump_stack+0xa4/0xc0
    warn_alloc+0xd4/0x15c
    __alloc_pages_nodemask+0xf88/0x10fc
    alloc_slab_page+0x40/0x18c
    new_slab+0x2b8/0x2e0
    ___slab_alloc+0x25c/0x464
    __kmalloc+0x394/0x498
    memcg_kmem_get_cache+0x114/0x2b8
    kmem_cache_alloc+0x98/0x3e8
    mmap_region+0x3bc/0x8c0
    do_mmap+0x40c/0x43c
    vm_mmap_pgoff+0x15c/0x1e4
    sys_mmap+0xb0/0xc8
    el0_svc_naked+0x24/0x28
    Mem-Info:
    active_anon:17124 inactive_anon:193 isolated_anon:0
    active_file:7898 inactive_file:712955 isolated_file:55
    unevictable:0 dirty:27 writeback:18 unstable:0
    slab_reclaimable:12250 slab_unreclaimable:23334
    mapped:19310 shmem:212 pagetables:816 bounce:0
    free:36561 free_pcp:1205 free_cma:35615
    Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
    lowmem_reserve[]: 0 1842 1842
    Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
    lowmem_reserve[]: 0 0 0
    DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
    Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
    721350 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    945512 pages RAM
    0 pages HighMem/MovableOnly
    63408 pages reserved
    51200 pages cma reserved

    __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
    and the worker allocation failure is not really critical because we will
    retry on the next kmem charge. We might miss some charges but that
    shouldn't be critical. The excessive allocation failure report is not
    very helpful.
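
    A hedged sketch of the kind of change this calls for, assuming the worker
    allocation in __memcg_schedule_kmem_cache_create() looks roughly like
    this (it is best-effort and retried later, so the warning is just noise):

        cw = kzalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
        if (!cw)
                return;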

    [mhocko@kernel.org: changelog update]
    Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • My testing for the latest kernel supporting thp migration showed an
    infinite loop in offlining the memory block that is filled with shmem
    thps. We can get out of the loop with a signal, but the kernel should
    return with failure in this case.

    What happens in the loop is that scan_movable_pages() repeats returning
    the same pfn without any progress. That's because page migration always
    fails for shmem thps.

    In memory offline code, memory blocks containing unmovable pages should be
    prevented from being offline targets by has_unmovable_pages() inside
    start_isolate_page_range(). So it's possible to change migratability for
    non-anonymous thps to avoid the issue, but it introduces more complex and
    thp-specific handling in migration code, so it might not be good.

    So this patch suggests fixing the issue by enabling thp migration for
    shmem thp. Both anon and shmem thps are migratable, so we don't need to
    precheck the type of thp.

    Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
    Fixes: commit 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
    the page's memcg is undergoing move accounting, which occurs when a
    process leaves its memcg for a new one that has
    memory.move_charge_at_immigrate set.

    unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
    the given inode is switching writeback domains. Switches occur when
    enough writes are issued from a new domain.

    This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

    If an inode switch and a process memcg migration are both in-flight then
    unlocked_inode_to_wb_end() will unconditionally enable interrupts while
    still holding the lock_page_memcg() irq spinlock. This suggests the
    possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
      __cancel_dirty_page
        lock_page_memcg                      (irq-safe lock taken, irqs off)
          unlocked_inode_to_wb_begin
          unlocked_inode_to_wb_end           (mistakenly re-enables irqs)
            <interrupt>
              end_page_writeback
                test_clear_page_writeback
                  lock_page_memcg            (same lock -> deadlock)
        unlock_page_memcg

    Due to configuration limitations this deadlock is not currently possible
    because we don't mix cgroup writeback (a cgroupv2 feature) and
    memory.move_charge_at_immigrate (a cgroupv1 feature).

    If the kernel is hacked to always claim inode switching and memcg
    moving_account, then this script triggers lockup in less than a minute:

    cd /mnt/cgroup/memory
    mkdir a b
    echo 1 > a/memory.move_charge_at_immigrate
    echo 1 > b/memory.move_charge_at_immigrate
    (
      echo $BASHPID > a/cgroup.procs
      while true; do
        dd if=/dev/zero of=/mnt/big bs=1M count=256
      done
    ) &
    while true; do
      sync
    done &
    sleep 1h &
    SLEEP=$!
    while true; do
      echo $SLEEP > a/cgroup.procs
      echo $SLEEP > b/cgroup.procs
    done

    The deadlock does not seem possible, so it's debatable if there's any
    reason to modify the kernel. I suggest we should, to prevent future
    surprises. And Wang Long said "this deadlock occurs three times in our
    environment", so there's more reason to apply this, even to stable.
    Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
    see "[PATCH for-4.4] writeback: safer lock nesting"
    https://lkml.org/lkml/2018/4/11/146

    [gthelen@google.com: v4]
    Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
    [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
    Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
    Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
    Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
    Signed-off-by: Greg Thelen
    Reported-by: Wang Long
    Acked-by: Wang Long
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Nicholas Piggin
    Cc: [v4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Li Wang has reported that LTP move_pages04 test fails with the current
    tree:

    LTP move_pages04:
    TFAIL : move_pages04.c:143: status[1] is EPERM, expected EFAULT

    The test allocates an array of two pages, one is present while the other
    is not (resp. backed by zero page) and it expects EFAULT for the second
    page as the man page suggests. We are reporting EPERM which doesn't make
    any sense and this is a result of a bug from cf5f16b23ec9 ("mm: unclutter
    THP migration").

    do_pages_move tries to handle as many pages in one batch as possible so we
    queue all pages with the same node target together and that corresponds to
    [start, i] range which is then used to update status array.
    add_page_for_migration will correctly notice the zero (resp. !present)
    page and returns with EFAULT which gets written to the status. But if
    this is the last page in the array we do not update start and so the last
    store_status after the loop will overwrite the range of the last batch
    with NUMA_NO_NODE (which corresponds to EPERM).

    Fix this by simply bailing out from the last flush if the pagelist is
    empty as there is clearly nothing more to do.
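
    For reference, a self-contained userspace sketch of the scenario the LTP
    test exercises (build with gcc and -lnuma; exact status values depend on
    the kernel, but the second page is expected to report -EFAULT, not
    -EPERM):

        #include <numaif.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                long psz = sysconf(_SC_PAGESIZE);
                char *buf = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (buf == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                buf[0] = 1;                          /* page 0: really present */
                (void)*(volatile char *)(buf + psz); /* page 1: zero-page backed */

                void *pages[2] = { buf, buf + psz };
                int nodes[2] = { 0, 0 };
                int status[2] = { 1, 1 };

                if (move_pages(0, 2, pages, nodes, status, MPOL_MF_MOVE) < 0)
                        perror("move_pages");
                /* Expected: status[0] is a node id (>= 0) and status[1] is
                 * -EFAULT per the man page; the bug above turned the last
                 * batch into -EPERM. */
                printf("status[0]=%d status[1]=%d\n", status[0], status[1]);
                return 0;
        }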

    Link: http://lkml.kernel.org/r/20180418121255.334-1-mhocko@kernel.org
    Fixes: cf5f16b23ec9 ("mm: unclutter THP migration")
    Signed-off-by: Michal Hocko
    Reported-by: Li Wang
    Tested-by: Li Wang
    Cc: Zi Yan
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

16 Apr, 2018

1 commit

  • syzbot is catching so many bugs triggered by commit 9ee332d99e4d5a97
    ("sget(): handle failures of register_shrinker()"). That commit expected
    that calling kill_sb() from deactivate_locked_super() without successful
    fill_super() is safe, but the reality was different; some callers assign
    attributes which are needed for kill_sb() after sget() succeeds.

    For example, [1] is a report where sb->s_mode (which seems to be either
    FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
    assigned unless sget() succeeds. But it is not worth complicating sget()
    so that the register_shrinker() failure path can safely call
    kill_block_super() via kill_sb(). Making alloc_super() fail if memory
    allocation for register_shrinker() failed is much simpler. Let's avoid
    calling deactivate_locked_super() from sget_userns() by preallocating
    memory for the shrinker and making register_shrinker() in sget_userns()
    never fail.

    [1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Cc: Al Viro
    Cc: Michal Hocko
    Signed-off-by: Al Viro

    Tetsuo Handa
     

14 Apr, 2018

5 commits

  • cache_reap() is initially scheduled in start_cpu_timer() via
    schedule_delayed_work_on(). But then the next iterations are scheduled
    via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND.

    Thus since commit ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND
    work on wq_unbound_cpumask CPUs") there is no guarantee the future
    iterations will run on the originally intended cpu, although it's still
    preferred. I was able to demonstrate this with
    /sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also
    happen due to migrating timers in nohz context. As a result, some cpus
    would be calling cache_reap() more frequently and others never.

    This patch uses schedule_delayed_work_on() with the current cpu when
    scheduling the next iteration.
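
    A hedged, minimal kernel-module sketch of the pattern (not the slab patch
    itself): re-arm a delayed work with schedule_delayed_work_on() and the
    current CPU so each iteration stays on the CPU it started on, instead of
    falling back to WORK_CPU_UNBOUND:

        #include <linux/kernel.h>
        #include <linux/module.h>
        #include <linux/smp.h>
        #include <linux/workqueue.h>

        static struct delayed_work demo_work;

        static void demo_fn(struct work_struct *w)
        {
                pr_info("demo_fn ran on cpu %d\n", smp_processor_id());
                /* Re-schedule on the CPU we are currently running on. */
                schedule_delayed_work_on(smp_processor_id(), &demo_work, 2 * HZ);
        }

        static int __init demo_init(void)
        {
                INIT_DELAYED_WORK(&demo_work, demo_fn);
                /* First iteration pinned to CPU 0, similar to what
                 * start_cpu_timer() does once per CPU for cache_reap(). */
                schedule_delayed_work_on(0, &demo_work, 2 * HZ);
                return 0;
        }

        static void __exit demo_exit(void)
        {
                cancel_delayed_work_sync(&demo_work);
        }

        module_init(demo_init);
        module_exit(demo_exit);
        MODULE_LICENSE("GPL");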

    Link: http://lkml.kernel.org/r/20180411070007.32225-1-vbabka@suse.cz
    Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
    Signed-off-by: Vlastimil Babka
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc: Stephen Boyd
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Building orangefs on MMU-less machines now results in a link error
    because of the newly introduced use of the filemap_page_mkwrite()
    function:

    ERROR: "filemap_page_mkwrite" [fs/orangefs/orangefs.ko] undefined!

    This adds a dummy version for it, similar to the existing
    generic_file_mmap and generic_file_readonly_mmap stubs in the same file,
    to avoid the link error without adding #ifdefs in each file system that
    uses these.

    Link: http://lkml.kernel.org/r/20180409105555.2439976-1-arnd@arndb.de
    Fixes: a5135eeab2e5 ("orangefs: implement vm_ops->fault")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Martin Brandenburg
    Cc: Mike Marshall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • __get_user_pages_fast handles errors differently from
    get_user_pages_fast: the former always returns the number of pages
    pinned, while the latter might return a negative error code.

    Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • get_user_pages_fast is supposed to be a faster drop-in equivalent of
    get_user_pages. As such, callers expect it to return a negative return
    code when passed an invalid address, and never expect it to return 0
    when passed a positive number of pages, since its documentation says:

    * Returns number of pages pinned. This may be fewer than the number
    * requested. If nr_pages is 0 or negative, returns 0. If no pages
    * were pinned, returns -errno.

    When get_user_pages_fast falls back on get_user_pages this is exactly
    what happens. Unfortunately the implementation is inconsistent: it
    returns 0 if passed a kernel address, confusing callers: for example,
    the following is pretty common but does not appear to do the right thing
    with a kernel address:

    ret = get_user_pages_fast(addr, 1, writeable, &page);
    if (ret < 0)
    return ret;

    Change get_user_pages_fast to return -EFAULT when supplied a kernel
    address to make it match expectations.

    All callers have been audited for consistency with the documented
    semantics.
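
    A hedged sketch of a caller handling all three documented outcomes
    (negative errno, zero, or a short pin); purely illustrative and not a
    specific in-tree caller:

        static int pin_one_page(unsigned long addr, int write,
                                struct page **page)
        {
                int ret = get_user_pages_fast(addr, 1, write, page);

                if (ret < 0)
                        return ret;     /* e.g. -EFAULT for a kernel address */
                if (ret == 0)
                        return -EFAULT; /* defensive: nothing was pinned */
                return 0;               /* exactly one page pinned */
        }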

    Link: http://lkml.kernel.org/r/1522962072-182137-4-git-send-email-mst@redhat.com
    Fixes: 5b65c4677a57 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()")
    Signed-off-by: Michael S. Tsirkin
    Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • Patch series "mm/get_user_pages_fast fixes, cleanups", v2.

    Turns out get_user_pages_fast and __get_user_pages_fast return different
    values on error when given a single page: __get_user_pages_fast returns
    0. get_user_pages_fast returns either 0 or an error.

    Callers of get_user_pages_fast expect an error so fix it up to return an
    error consistently.

    Stress the difference between get_user_pages_fast and
    __get_user_pages_fast to make sure callers aren't confused.

    This patch (of 3):

    __gup_benchmark_ioctl does not handle the case where get_user_pages_fast
    fails:

    - a negative return code will cause a buffer overrun

    - returning with partial success will cause use of uninitialized
    memory.

    [akpm@linux-foundation.org: simplification]
    Link: http://lkml.kernel.org/r/1522962072-182137-3-git-send-email-mst@redhat.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

12 Apr, 2018

28 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Juergen Gross noticed that commit f7f99100d8d ("mm: stop zeroing memory
    during allocation in vmemmap") broke XEN PV domains when deferred struct
    page initialization is enabled.

    This is because Xen's PagePinned() flag is getting erased from
    struct pages when they are initialized later in boot.

    Juergen fixed this problem by disabling deferred pages on xen pv
    domains. It is desirable, however, to have this feature available as it
    reduces boot time. This fix re-enables the feature for PV domains, and
    fixes the problem the following way:

    The fix is to delay setting PagePinned flag until struct pages for all
    allocated memory are initialized, i.e. until after free_all_bootmem().

    A new x86_init.hyper op init_after_bootmem() is called to let xen know
    that boot allocator is done, and hence struct pages for all the
    allocated memory are now initialized. If deferred page initialization
    is enabled, the rest of struct pages are going to be initialized later
    in boot once page_alloc_init_late() is called.

    xen_after_bootmem() walks page table's pages and marks them pinned.

    Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Acked-by: Ingo Molnar
    Reviewed-by: Juergen Gross
    Tested-by: Juergen Gross
    Cc: Daniel Jordan
    Cc: Pavel Tatashin
    Cc: Alok Kataria
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Boris Ostrovsky
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Andy Lutomirski
    Cc: Laura Abbott
    Cc: Kirill A. Shutemov
    Cc: Borislav Petkov
    Cc: Mathias Krause
    Cc: Jinbum Park
    Cc: Dan Williams
    Cc: Baoquan He
    Cc: Jia Zhang
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Stefano Stabellini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

    This has started as a follow up discussion [3][4] resulting in the
    runtime failure caused by hardening patch [5] which removes MAP_FIXED
    from the elf loader because MAP_FIXED is inherently dangerous as it
    might silently clobber an existing underlying mapping (e.g. stack).
    The reason for the failure is that some architectures enforce an
    alignment for the given address hint without MAP_FIXED used (e.g. for
    shared or file backed mappings).

    One way around this would be excluding those archs which do alignment
    tricks from the hardening [6]. The patch is really trivial but it has
    been objected, rightfully so, that this screams for a more generic
    solution. We basically want a non-destructive MAP_FIXED.

    The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
    address but unlike MAP_FIXED it fails with EEXIST if the given range
    conflicts with an existing one. The flag is introduced as a completely
    new one rather than a MAP_FIXED extension because of the backward
    compatibility. We really want a never-clobber semantic even on older
    kernels which do not recognize the flag. Unfortunately mmap sucks
    wrt flags evaluation because we do not EINVAL on unknown flags. On
    those kernels we would simply use the traditional hint based semantic so
    the caller can still get a different address (which sucks) but at least
    not silently corrupt an existing mapping. I do not see a good way
    around that, except that we won't expose the new semantic to the
    userspace at all.

    It seems there are users who would like to have something like that.
    Jemalloc has been mentioned by Michael Ellerman [7]

    Florian Weimer has mentioned the following:
    : glibc ld.so currently maps DSOs without hints. This means that the kernel
    : will map them right next to each other, and the offsets between them are completely
    : predictable. We would like to change that and supply a random address in a
    : window of the address space. If there is a conflict, we do not want the
    : kernel to pick a non-random address. Instead, we would try again with a
    : random address.

    John Hubbard has mentioned a CUDA example:
    : a) Searches /proc/<pid>/maps for a "suitable" region of available
    : VA space. "Suitable" generally means it has to have a base address
    : within a certain limited range (a particular device model might
    : have odd limitations, for example), it has to be large enough, and
    : alignment has to be large enough (again, various devices may have
    : constraints that lead us to do this).
    :
    : This is of course subject to races with other threads in the process.
    :
    : Let's say it finds a region starting at va.
    :
    : b) Next it does:
    : p = mmap(va, ...)
    :
    : *without* setting MAP_FIXED, of course (so va is just a hint), to
    : attempt to safely reserve that region. If p != va, then in most cases,
    : this is a failure (almost certainly due to another thread getting a
    : mapping from that region before we did), and so this layer now has to
    : call munmap(), before returning a "failure: retry" to upper layers.
    :
    : IMPROVEMENT: --> if instead, we could call this:
    :
    : p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
    :
    : , then we could skip the munmap() call upon failure. This
    : is a small thing, but it is useful here. (Thanks to Piotr
    : Jaroszynski and Mark Hairgrove for helping me get that detail
    : exactly right, btw.)
    :
    : c) After that, CUDA suballocates from p, via:
    :
    : q = mmap(sub_region_start, ... MAP_FIXED ...)
    :
    : Interestingly enough, "freeing" is also done via MAP_FIXED, and
    : setting PROT_NONE to the subregion. Anyway, I just included (c) for
    : general interest.

    Atomic address range probing in the multithreaded programs in general
    sounds like an interesting thing to me.

    The second patch simply replaces MAP_FIXED use in elf loader by
    MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
    should follow. Actually real MAP_FIXED usages should be documented
    properly and they should be more of an exception.

    [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
    [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
    [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
    [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
    [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
    [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

    This patch (of 2):

    MAP_FIXED is used quite often to enforce mapping at the particular range.
    The main problem of this flag is, however, that it is inherently dangerous
    because it unmaps existing mappings covered by the requested range. This
    can cause silent memory corruptions. Some of them even with serious
    security implications. While the current semantic might be really
    desirable in many cases, there are others which would want to enforce the
    given range but rather see a failure than a silent memory corruption on a
    clashing range. Please note that there is no guarantee that a given range
    is obeyed by the mmap even when it is free - e.g. arch specific code is
    allowed to apply an alignment.

    Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
    behavior. It has the same semantic as MAP_FIXED wrt. the given address
    request with a single exception that it fails with EEXIST if the requested
    address is already covered by an existing mapping. We still do rely on
    get_unmapped_area to handle all the arch specific MAP_FIXED treatment and
    check for a conflicting vma after it returns.

    The flag is introduced as a completely new one rather than a MAP_FIXED
    extension because of the backward compatibility. We really want a
    never-clobber semantic even on older kernels which do not recognize the
    flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
    EINVAL on unknown flags. On those kernels we would simply use the
    traditional hint based semantic so the caller can still get a different
    address (which sucks) but at least not silently corrupt an existing
    mapping. I do not see a good way around that.
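
    A self-contained userspace sketch of the new semantic (MAP_FIXED_NOREPLACE
    needs a kernel that knows the flag; the fallback define is only for older
    headers and matches the generic uapi value):

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        #ifndef MAP_FIXED_NOREPLACE
        #define MAP_FIXED_NOREPLACE 0x100000
        #endif

        int main(void)
        {
                size_t len = 4096;
                /* Grab an address to reuse below as the "fixed" target. */
                void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (base == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                void *p = mmap(base, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS |
                               MAP_FIXED_NOREPLACE, -1, 0);
                if (p == MAP_FAILED)
                        /* Expected on new kernels: EEXIST, range is taken. */
                        printf("mmap: %s\n", strerror(errno));
                else if (p != base)
                        /* Old kernel: unknown flag ignored, hint not obeyed. */
                        printf("fell back to hint semantics: got %p\n", p);
                else
                        printf("unexpected: clobbered the mapping at %p\n", p);
                return 0;
        }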

    [mpe@ellerman.id.au: fix whitespace]
    [fail on clashing range with EEXIST as per Florian Weimer]
    [set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
    Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
    Reviewed-by: Khalid Aziz
    Signed-off-by: Michal Hocko
    Acked-by: Michael Ellerman
    Cc: Khalid Aziz
    Cc: Russell King - ARM Linux
    Cc: Andrea Arcangeli
    Cc: Florian Weimer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Abdul Haleem
    Cc: Joel Stanley
    Cc: Kees Cook
    Cc: Michal Hocko
    Cc: Jason Evans
    Cc: David Goldblatt
    Cc: Edward Tomasz Napierała
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The kasan_slab_free hook's return value denotes whether the reuse of a
    slab object must be delayed (e.g. when the object is put into memory
    quarantine).

    The current way SLUB handles this hook is by ignoring its return value
    and hardcoding checks similar (but not exactly the same) to the ones
    performed in kasan_slab_free, which is prone to making mistakes.

    The main difference between the hardcoded checks and the ones in
    kasan_slab_free is whether we want to perform a free in case when an
    invalid-free or a double-free was detected (we don't).

    This patch changes the way SLUB handles this by:
    1. taking into account the return value of kasan_slab_free for each of
    the objects that are being freed;
    2. reconstructing the freelist of objects to exclude the ones whose
    reuse must be delayed.

    [andreyknvl@google.com: eliminate unnecessary branch in slab_free]
    Link: http://lkml.kernel.org/r/a62759a2545fddf69b0c034547212ca1eb1b3ce2.1520359686.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/083f58501e54731203801d899632d76175868e97.1519400992.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Kostya Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • There was a regression report for "mm/cma: manage the memory of the CMA
    area by using the ZONE_MOVABLE" [1] and I think that it is related to
    this problem. CMA patchset makes the system use one more zone
    (ZONE_MOVABLE) and then increases min_free_kbytes. It reduces usable
    memory and it could cause regression.

    ZONE_MOVABLE only has movable pages so we don't need to keep enough
    freepages to avoid or deal with fragmentation. So, don't count it.

    This changes min_free_kbytes and thus the min watermark greatly if
    ZONE_MOVABLE is used. It will let the user use more memory.

    System:
    22GB ram, fakenuma, 2 nodes. 5 zones are used.

    Before:
    min_free_kbytes: 112640

    zone_info (min_watermark):
    Node 0, zone DMA       min 19
    Node 0, zone DMA32     min 3778
    Node 0, zone Normal    min 10191
    Node 0, zone Movable   min 0
    Node 0, zone Device    min 0
    Node 1, zone DMA       min 0
    Node 1, zone DMA32     min 0
    Node 1, zone Normal    min 14043
    Node 1, zone Movable   min 127
    Node 1, zone Device    min 0

    After:
    min_free_kbytes: 90112

    zone_info (min_watermark):
    Node 0, zone DMA       min 15
    Node 0, zone DMA32     min 3022
    Node 0, zone Normal    min 8152
    Node 0, zone Movable   min 0
    Node 0, zone Device    min 0
    Node 1, zone DMA       min 0
    Node 1, zone DMA32     min 0
    Node 1, zone Normal    min 11234
    Node 1, zone Movable   min 102
    Node 1, zone Device    min 0

    [1] lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

    Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Now all reserved pages for the CMA region belong to ZONE_MOVABLE,
    and it only serves requests with GFP_HIGHMEM && GFP_MOVABLE.

    Therefore, we don't need to maintain ALLOC_CMA at all.

    Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "mm/cma: manage the memory of the CMA area by using the
    ZONE_MOVABLE", v2.

    0. History

    This patchset is the follow-up of the discussion about the "Introduce
    ZONE_CMA (v7)" [1]. Please reference it if more information is needed.

    1. What does this patch do?

    This patch changes how the memory of the CMA area is managed in the MM
    subsystem. Currently the memory of the CMA area is managed by the zone
    its pfn belongs to. However, this approach has some problems since the
    MM subsystem doesn't have enough logic to handle a situation where
    memories with different characteristics live in a single zone. To solve
    this issue, this patch tries to manage all the memory of the CMA area by
    using the MOVABLE zone. From the MM subsystem's point of view, the
    characteristics of the memory in the MOVABLE zone and the memory of the
    CMA area are the same. So, managing the memory of the CMA area by using
    the MOVABLE zone will not cause any problem.

    2. Motivation

    There are some problems with the current approach; see the following.
    Although these problems are not inherent and could be fixed without this
    conceptual change, doing so would require adding many hooks in various
    code paths, would be intrusive to core MM, and would be really
    error-prone. Therefore, I try to solve them with this new approach.
    Anyway, the problems of the current implementation are as follows.

    o CMA memory utilization

    First, the following is the freepage calculation logic in MM.

    - For movable allocation: freepage = total freepage
    - For unmovable allocation: freepage = total freepage - CMA freepage

    Freepages in the CMA area are used only after the normal freepages in
    the zone the memory of the CMA area belongs to are exhausted. At that
    moment, the number of normal freepages is zero, so

    - For movable allocation: freepage = total freepage = CMA freepage
    - For unmovable allocation: freepage = 0

    If an unmovable allocation comes at this moment, the allocation request
    would fail to pass the watermark check and reclaim is started. After
    reclaim, normal freepages would exist again, so the freepages in the CMA
    area would not be used.

    FYI, there is another attempt [2] trying to solve this problem in lkml.
    And, as far as I know, Qualcomm also has an out-of-tree solution for this
    problem.

    Useless reclaim:

    There is no logic to distinguish CMA pages in the reclaim path. Hence,
    a CMA page is reclaimed even if the system just needs a page that is
    usable for a kernel allocation.

    Atomic allocation failure:

    This is also related to the fallback allocation policy for the memory of
    the CMA area. Consider the situation where the number of normal
    freepages is *zero* because a bunch of movable allocation requests have
    come in. Kswapd would not be woken up due to the following freepage
    calculation logic.

    - For movable allocation: freepage = total freepage = CMA freepage

    If an atomic unmovable allocation request comes at this moment, it would
    fail due to the following logic.

    - For unmovable allocation: freepage = total freepage - CMA freepage = 0

    It was reported by Aneesh [3].

    Useless compaction:

    A usual high-order allocation request is an unmovable allocation request
    and it cannot be served from the memory of the CMA area. In compaction,
    the migration scanner tries to migrate pages in the CMA area and make
    high-order pages there. As mentioned above, those cannot be used for
    unmovable allocation requests, so it's just wasted work.

    3. Current approach and new approach

    The current approach is that the memory of the CMA area is managed by
    the zone its pfn belongs to. However, this memory should be
    distinguishable since it has a strong limitation. So, it is marked as
    MIGRATE_CMA in the pageblock flags and handled specially. However, as
    mentioned in section 2, the MM subsystem doesn't have enough logic to
    deal with this special pageblock, so many problems arose.

    The new approach is that the memory of the CMA area is managed by the
    MOVABLE zone. MM already has enough logic to deal with special zones
    such as the HIGHMEM and MOVABLE zones. So, managing the memory of the
    CMA area via the MOVABLE zone just naturally works well, because the
    constraint on CMA memory, that it must always be migratable, is the same
    as the constraint for the MOVABLE zone.

    There is one side-effect for the usability of the memory of the CMA
    area. Use of the MOVABLE zone is only allowed for requests with
    GFP_HIGHMEM && GFP_MOVABLE, so now the memory of the CMA area is also
    only allowed for such gfp flags. Before this patchset, a request with
    just GFP_MOVABLE could use it. IMO, it would not be a big issue since
    most GFP_MOVABLE requests also carry the GFP_HIGHMEM flag, for example
    file cache pages and anonymous pages. However, file cache pages for
    blockdev files are an exception: requests for them have no GFP_HIGHMEM
    flag. There are pros and cons to this exception. In my experience,
    blockdev file cache pages are one of the top reasons that cause
    cma_alloc() to fail temporarily. So, we can get a better guarantee of
    cma_alloc() success by discarding this case.

    Note that there is no change from the admin's POV since this patchset is
    just an internal implementation change in the MM subsystem. The one
    minor difference for the admin is that the memory stats for the CMA area
    will be printed under the MOVABLE zone. That's all.

    4. Result

    Following is the experimental result related to utilization problem.

    8 CPUs, 1024 MB, VIRTUAL MACHINE
    make -j16

    Before this patchset:
    CMA area:       0 MB      512 MB
    Elapsed-time:   92.4      186.5
    pswpin:         82        18647
    pswpout:        160       69839

    After this patchset:
    CMA area:       0 MB      512 MB
    Elapsed-time:   93.1      93.4
    pswpin:         84        46
    pswpout:        183       92

    akpm: "kernel test robot" reported a 26% improvement in
    vm-scalability.throughput:
    http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop

    [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
    [2]: https://lkml.org/lkml/2014/10/15/623
    [3]: http://www.spinics.net/lists/linux-mm/msg100562.html

    Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Freepages on ZONE_HIGHMEM don't work for kernel memory, so reserving
    them is not that important. When ZONE_MOVABLE is used, this problem
    would theoretically decrease the usable memory for GFP_HIGHUSER_MOVABLE
    allocation requests, which are mainly used for page cache and anon page
    allocation. So, fix it by setting
    sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM] to 0.

    Also, defining the sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES - 1
    entries makes the code complex. For example, on a highmem system, the
    following reserve ratio is activated for the *NORMAL ZONE*, which could
    easily mislead people.

    #ifdef CONFIG_HIGHMEM
    32
    #endif

    This patch also fixes this situation by defining the
    sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES entries and placing
    the "#ifdef" in the right place.

    Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Tested-by: Tony Lindgren
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: "Aneesh Kumar K . V"
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Russell King
    Cc: Will Deacon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • THP migration is hacked into the generic migration with rather
    surprising semantic. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and if that is not the
    case then it allocates a simple page to migrate. unmap_and_move then
    fixes that up by splitting the THP into small pages while moving the head
    page to the newly allocated order-0 page. Remaining pages are moved to
    the LRU list by split_huge_page. The same happens if the THP allocation
    fails. This is really ugly and error prone [1].

    I also believe that split_huge_page to the LRU lists is inherently wrong
    because all tail pages are not migrated. Some callers will just work
    around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken though. e.g. madvise_inject_error will
    migrate head and then advances next pfn by the huge page size.
    do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
    will simply split the THP before migration if the THP migration is not
    supported then falls back to single page migration but it doesn't handle
    tail pages if the THP migration path is not able to allocate a fresh THP
    so we end up with ENOMEM and fail the whole migration which is a
    questionable behavior. Page compaction doesn't try to migrate large
    pages so it should be immune.

    This patch tries to unclutter the situation by moving the special THP
    handling up to the migrate_pages layer where it actually belongs. We
    simply split the THP page into the existing list if unmap_and_move fails
    with ENOMEM and retry. So we will _always_ migrate all THP subpages and
    specific migrate_pages users do not have to deal with this case in a
    special way.

    [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

    Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey node_id resp. migration error up
    to move_pages code (do_move_page_to_node_array). The error status never
    made it into the final status field and we have a better way to
    communicate node id to the status field now. All other allocation
    callbacks simply ignored the argument so we can drop it finally.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "unclutter thp migration"

    Motivation:

    THP migration is hacked into the generic migration with rather
    surprising semantic. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and if that is not the
    case then it allocates a simple page to migrate. unmap_and_move then
    fixes that up by splitting the THP into small pages while moving the
    head page to the newly allocated order-0 page. Remaining pages are
    moved to the LRU list by split_huge_page. The same happens if the THP
    allocation fails. This is really ugly and error prone [2].

    I also believe that split_huge_page to the LRU lists is inherently wrong
    because all tail pages are not migrated. Some callers will just work
    around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken though. e.g. madvise_inject_error will
    migrate head and then advances next pfn by the huge page size.
    do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
    will simply split the THP before migration if the THP migration is not
    supported then falls back to single page migration but it doesn't handle
    tail pages if the THP migration path is not able to allocate a fresh THP
    so we end up with ENOMEM and fail the whole migration which is a
    questionable behavior. Page compaction doesn't try to migrate large
    pages so it should be immune.

    The first patch reworks do_pages_move which relies on a very ugly
    calling semantic when the return status is pushed to the migration path
    via private pointer. It uses pre allocated fixed size batching to
    achieve that. We simply cannot do the same if a THP is to be split
    during the migration path which is done in the patch 3. Patch 2 is
    follow up cleanup which removes the mentioned return status calling
    convention ugliness.

    On a side note:

    There are some semantic issues I have encountered on the way when
    working on patch 1 but I am not addressing them here. E.g. trying to
    move THP tail pages will result in either success or EBUSY (the latter
    one more likely once we isolate head from the LRU list). Hugetlb
    reports EACCESS on tail pages. Some errors are reported via status
    parameter but migration failures are not even though the original
    `reason' argument suggests there was an intention to do so. From a
    quick look into git history this never worked. I have tried to keep the
    semantic unchanged.

    Then there is a relatively minor thing that the page isolation might
    fail because of pages not being on the LRU - e.g. because they are
    sitting on the per-cpu LRU caches. Easily fixable.

    This patch (of 3):

    do_pages_move is supposed to move user defined memory (an array of
    addresses) to the user defined numa nodes (an array of nodes one for
    each address). The user provided status array then contains resulting
    numa node for each address or an error. The semantic of this function
    is a little bit confusing because only some errors are reported back.
    Notably migrate_pages error is only reported via the return value. This
    patch doesn't try to address these semantic nuances but rather change
    the underlying implementation.

    Currently we are processing user input (which can be really large) in
    batches which are stored to a temporarily allocated page. Each address
    is resolved to its struct page and stored to page_to_node structure
    along with the requested target numa node. The array of these
    structures is then conveyed down the page migration path via private
    argument. new_page_node then finds the corresponding structure and
    allocates the proper target page.

    What is the problem with the current implementation and why to change
    it? Apart from being quite ugly it also doesn't cope with unexpected
    pages showing up on the migration list inside migrate_pages path. That
    doesn't happen currently but the follow up patch would like to make the
    thp migration code more clear and that would need to split a THP into
    the list for some cases.

    How does the new implementation work? Well, instead of batching into a
    fixed size array we simply batch all pages that should be migrated to
    the same node and isolate all of them into a linked list which doesn't
    require any additional storage. This should work reasonably well
    because page migration usually migrates larger ranges of memory to a
    specific node. So the common case should work equally well as the
    current implementation. Even if somebody constructs an input where the
    target numa nodes would be interleaved we shouldn't see a large
    performance impact because page migration alone doesn't really benefit
    from batching. mmap_sem batching for the lookup is quite questionable
    and isolate_lru_page which would benefit from batching is not using it
    even in the current implementation.

    Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Anshuman Khandual
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Reale
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The pointer swap_avail_heads is local to the source and does not need to
    be in global scope, so make it static.

    Cleans up sparse warning:

    mm/swapfile.c:88:19: warning: symbol 'swap_avail_heads' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180206215836.12366-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • syzbot has triggered a NULL ptr dereference when allocation fault
    injection enforces a failure and alloc_mem_cgroup_per_node_info
    initializes memcg->nodeinfo only half way through.

    But __mem_cgroup_free still tries to free all per-node data and
    dereferences pn->lruvec_stat_cpu unconditionally even if the specific
    per-node data hasn't been initialized.

    The bug is quite unlikely to hit because small allocations do not fail
    and we would need quite some numa nodes to make struct
    mem_cgroup_per_node large enough to cross the costly order.

    Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
    Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrey Ryabinin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Calling swapon() on a zero length swap file on SSD can lead to a
    divide-by-zero.

    Although creating such files isn't possible with mkswap and they would be
    considered invalid, it would be better for the swapon code to be more
    robust and handle this condition gracefully (return -EINVAL).
    Especially since the fix is small and straightforward.

    To help with wear leveling on SSD, the swapon syscall calculates a
    random position in the swap file using modulo p->highest_bit, which is
    set to maxpages - 1 in read_swap_header.

    If the swap file is zero length, read_swap_header sets maxpages=1 and
    last_page=0, resulting in p->highest_bit=0, and we divide by zero when we
    take the modulo by p->highest_bit in the swapon syscall.

    This can be prevented by having read_swap_header return zero if
    last_page is zero.
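
    A hedged sketch of the guard (the exact message and placement inside
    read_swap_header() may differ from the real patch):

        if (!last_page) {
                pr_warn("Empty swap-file\n");
                return 0;
        }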

    Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
    Signed-off-by: Thomas Abraham
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Abraham
     
  • Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") added per-cpu drift to all memory cgroup stats
    and events shown in memory.stat and memory.events.

    For memory.stat this is acceptable. But memory.events issues file
    notifications, and somebody polling the file for changes will be
    confused when the counters in it are unchanged after a wakeup.

    Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
    MEMCG_OOM - are sufficiently rare and high-level that we don't need
    per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
    frequent, but they're counting invocations of reclaim, which is a
    complex operation that touches many shared cachelines.

    This splits memory.events from the generic VM events and tracks them in
    their own, unbuffered atomic counters. That's also cleaner, as it
    eliminates the ugly enum nesting of VM and cgroup events.

    [hannes@cmpxchg.org: "array subscript is above array bounds"]
    Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
    Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When using KSM with use_zero_pages, we replace anonymous pages
    containing only zeroes with actual zero pages, which are not anonymous.
    We need to do proper accounting of the mm counters, otherwise we will
    get wrong values in /proc and a BUG message in dmesg when tearing down
    the mm.

    Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
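
    A hedged sketch of the accounting fix described above, assuming the
    replacement happens in KSM's page replacement path; surrounding code is
    elided:

        if (!is_zero_pfn(page_to_pfn(kpage))) {
                get_page(kpage);
                page_add_anon_rmap(kpage, vma, addr, false);
        } else {
                /*
                 * kpage is the shared zero page, which is not anonymous:
                 * drop MM_ANONPAGES here so the mm counters stay balanced
                 * when the old anonymous page is unmapped.
                 */
                dec_mm_counter(mm, MM_ANONPAGES);
        }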
     
  • We have a perfectly good macro to determine whether the gfp flags allow
    you to sleep or not; use it instead of trying to infer it.

    Link: http://lkml.kernel.org/r/20180408062206.GC16007@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
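
    The macro in question is gfpflags_allow_blocking() from <linux/gfp.h>; a
    sketch of the before/after idiom (the "before" line is illustrative, not a
    quote of the old code):

        /* before: open-coded guess at whether the caller may sleep */
        can_sleep = !(gfp & __GFP_ATOMIC);

        /* after: the dedicated helper, which tests __GFP_DIRECT_RECLAIM */
        can_sleep = gfpflags_allow_blocking(gfp);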
     
    In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
    released on the error path taken when pool->compact_wq, which holds the
    return value of create_singlethread_workqueue(), is NULL. This results in
    a memory leak (a sketch of the unwinding follows this entry).

    [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
    Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
    Signed-off-by: Xidong Wang
    Reviewed-by: Andrew Morton
    Cc: Vitaly Wool
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xidong Wang
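
    A hedged sketch of the goto-based unwinding described above; the label
    names and the exact allocation sequence are assumptions:

        pool = kzalloc(sizeof(struct z3fold_pool), gfp);
        if (!pool)
                goto out;
        pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS,
                                         __alignof__(struct list_head));
        if (!pool->unbuddied)
                goto out_pool;
        pool->compact_wq = create_singlethread_workqueue(name);
        if (!pool->compact_wq)
                goto out_unbuddied;             /* previously leaked the percpu area */
        /* ... remaining initialization ... */
        return pool;

        out_unbuddied:
                free_percpu(pool->unbuddied);   /* the missing release */
        out_pool:
                kfree(pool);
        out:
                return NULL;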
     
  • A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm,
    thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
    We previously used an explicit __GFP_NORETRY, which ruled out the OOM
    killer automatically.

    The memcg charge path should be semantically compliant with the
    allocation path: if we do not trigger the OOM killer for costly orders
    there, we should do the same in the memcg charge path as well. Otherwise
    we are forcing callers to distinguish the two and use different gfp
    masks, which is both non-intuitive and bug prone. As soon as we get a
    costly high-order kmalloc user, we do not even have any means to pass a
    memcg-specific gfp mask to prevent the OOM, because the charging happens
    deep within the guts of the slab allocator.

    The unexpected memcg OOM on THP has already been fixed upstream by
    9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp"), but that is a
    one-off fix rather than a generic solution. Teach mem_cgroup_oom to bail
    out on costly-order requests to fix the THP issue as well as any other
    costly OOM-eligible allocations added in the future (a sketch of the
    check follows this entry).

    Also revert 9d3c3354bb85, because the special gfp mask for THP is no
    longer needed.

    Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
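
    A minimal sketch of the bail-out described above; treat the function
    signature and placement as assumptions based on the 4.16-era code:

        static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
        {
                /*
                 * The page allocator does not invoke the OOM killer for
                 * costly orders (order > PAGE_ALLOC_COSTLY_ORDER, i.e. > 3);
                 * mirror that policy here and simply fail the charge.
                 */
                if (order > PAGE_ALLOC_COSTLY_ORDER)
                        return;

                /* ... existing memcg OOM handling ... */
        }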
     
    Use of pte_write(pte) is only valid for a present pte; the common code
    which sets the migration entry can be reached for both a valid present
    pte and a special swap entry (for device memory). Fix the code to use
    the mpfn value, which properly handles both cases (a sketch follows this
    entry).

    On x86 this did not have any bad side effect, because the pte write bit
    is below PAGE_BIT_GLOBAL and thus a special swap entry has it set to 0,
    which in turn means we were always creating read-only special migration
    entries.

    So once migration finished we always write-protected the CPU page table
    entry (moreover this is only an issue when migrating from device memory
    to system memory). The end effect is that a CPU write access would fault
    again and restore write permission.

    This behaviour isn't too bad; it just burns CPU cycles by forcing the
    CPU to take a second fault on write access, i.e. double-faulting the
    same address. There is no corruption or incorrect state (it behaves like
    a COWed page from a fork with a mapcount of 1).

    Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
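
    A hedged sketch of deriving writability from the migrate pfn flags rather
    than from pte_write(); the surrounding collection code is elided and the
    variable names are assumptions:

        /* mpfn was built earlier, with MIGRATE_PFN_WRITE set when appropriate */
        bool writable = mpfn & MIGRATE_PFN_WRITE;

        /* was: make_migration_entry(page, pte_write(pte)), which is
         * undefined for a non-present (device) special swap entry */
        entry = make_migration_entry(page, writable);
        swp_pte = swp_entry_to_pte(entry);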
     
    change_pte_range is called from task work context to mark PTEs for
    receiving NUMA faulting hints. If the marked pages are dirty then
    migration may fail. Some filesystems cannot migrate dirty pages without
    blocking, so such pages are skipped in MIGRATE_ASYNC mode, which just
    wastes CPU. Even when they can, it can be a waste of cycles when the
    pages are shared, forcing higher scan rates. This patch avoids marking
    shared dirty pages for hinting faults and will also skip a migration if
    the page was dirtied after the scanner updated a clean page (a sketch of
    the skip follows this entry).

    This is most noticeable running the NASA Parallel Benchmark when backed
    by btrfs, the default root filesystem for some distributions, but also
    noticeable when using XFS.

    The following are results from a 4-socket machine running a 4.16-rc4
    kernel with some scheduler patches that are pending for the next merge
    window.

                          4.16.0-rc4           4.16.0-rc4
                   schedtip-20180309           nodirty-v1
    Time cg.D       459.07 (  0.00%)     444.21 (  3.24%)
    Time ep.D        76.96 (  0.00%)      77.69 ( -0.95%)
    Time is.D        25.55 (  0.00%)      27.85 ( -9.00%)
    Time lu.D       601.58 (  0.00%)     596.87 (  0.78%)
    Time mg.D       107.73 (  0.00%)     108.22 ( -0.45%)

    is.D regresses slightly in terms of absolute time but note that that
    particular load varies quite a bit from run to run. The more relevant
    observation is the total system CPU usage.

                       4.16.0-rc4     4.16.0-rc4
                schedtip-20180309     nodirty-v1
    User                 71471.91       70627.04
    System               11078.96        8256.13
    Elapsed                661.66         632.74

    That is a substantial drop in system CPU usage and overall the workload
    completes faster. The NUMA balancing statistics are also interesting

    NUMA base PTE updates        111407972    139848884
    NUMA huge PMD updates           206506       264869
    NUMA page range updates      217139044    275461812
    NUMA hint faults               4300924      3719784
    NUMA hint local faults         3012539      3416618
    NUMA hint local percent             70           91
    NUMA pages migrated            1517487      1358420

    While more PTEs are scanned due to changes in what faults are gathered,
    it's clear that a far higher percentage of faults are local as the bulk
    of the remote hits were dirty pages that, in this case with btrfs, had
    no chance of migrating.

    The following is a comparison when using XFS, as that is a more
    realistic filesystem choice for a data partition:

                          4.16.0-rc4           4.16.0-rc4
                   schedtip-20180309        nodirty-v1r47
    Time cg.D       485.28 (  0.00%)     442.62 (  8.79%)
    Time ep.D        77.68 (  0.00%)      77.54 (  0.18%)
    Time is.D        26.44 (  0.00%)      24.79 (  6.24%)
    Time lu.D       597.46 (  0.00%)     597.11 (  0.06%)
    Time mg.D       142.65 (  0.00%)     105.83 ( 25.81%)

    That is a reasonable gain on two relatively long-lived workloads. While
    not presented, there is also a substantial drop in system CPU usage, and
    the NUMA balancing stats show similar improvements in locality as with
    btrfs.

    Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
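
    A hedged sketch of the skip in change_pte_range()'s prot_numa path; the
    exact predicate upstream may differ, and page_is_file_cache()/PageDirty()
    are the assumed tests:

        if (prot_numa) {
                struct page *page = vm_normal_page(vma, addr, oldpte);

                if (!page || PageKsm(page))
                        continue;

                /*
                 * Skip dirty file-backed pages: migrating them can block
                 * (or fail in MIGRATE_ASYNC mode), so a hinting fault on
                 * them is likely wasted work.
                 */
                if (page_is_file_cache(page) && PageDirty(page))
                        continue;

                /* ... remaining checks; the pte is then made PROT_NONE ... */
        }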
     
  • hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
    actually uses the RCU protection. The only caller is
    hmm_devmem_pages_create() which already grabs the mutex and does
    superfluous rcu_read_lock/unlock() around the function.

    This doesn't add anything and just adds to confusion. Remove the RCU
    protection and open-code the radix tree lookup. If this needs to become
    more sophisticated in the future, it can be added back when necessary.

    Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reviewed-by: Jérôme Glisse
    Cc: Paul E. McKenney
    Cc: Benjamin LaHaise
    Cc: Al Viro
    Cc: Kent Overstreet
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
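
    A hedged sketch of the open-coded lookup described above; the mutex and
    radix tree names follow the HMM code but should be treated as
    assumptions:

        /* caller already holds hmm_devmem_lock; no RCU required */
        lockdep_assert_held(&hmm_devmem_lock);

        devmem = radix_tree_lookup(&hmm_devmem_radix,
                                   resource->start >> PAGE_SHIFT);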
     
    Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
    and a pfn shift value, allowing them to define their own encoding for
    the HMM pfns that are filled into the pfns array of the hmm_range
    struct. With this, a device driver can get pfns that match its own
    private encoding out of HMM without having to do any conversion (a
    sketch follows this entry).

    [rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
    Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
    [rcampbell@nvidia.com: clarify fault logic for device private memory]
    Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Ralph Campbell
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
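
    A heavily hedged sketch of what a driver-defined encoding could look
    like; only range->pfns is named in the changelog, so the flag values,
    table indices and field names below are illustrative assumptions:

        /* hypothetical driver-private pfn encoding */
        enum {
                MYDRV_PFN_VALID = 1ULL << 0,
                MYDRV_PFN_WRITE = 1ULL << 1,
                MYDRV_PFN_SHIFT = 2,            /* pfn stored above the flag bits */
        };

        /* table translating HMM's generic flags into the driver's bits;
         * the index names are assumed, not quoted from hmm.h */
        static const uint64_t mydrv_hmm_flags[] = {
                [HMM_PFN_FLAG_VALID] = MYDRV_PFN_VALID,
                [HMM_PFN_FLAG_WRITE] = MYDRV_PFN_WRITE,
        };

        uint64_t mydrv_pfns[64];                /* driver-owned pfns array */

        struct hmm_range range = {
                .pfns      = mydrv_pfns,        /* filled by HMM in driver encoding */
                .flags     = mydrv_hmm_flags,   /* assumed field name */
                .pfn_shift = MYDRV_PFN_SHIFT,   /* assumed field name */
        };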
     
    This changes hmm_vma_fault() to not take a global write fault flag for a
    range, but instead rely on the caller to populate the HMM pfns array
    with the proper fault flags, i.e. HMM_PFN_VALID if the driver wants a
    read fault for that address, or HMM_PFN_VALID and HMM_PFN_WRITE for a
    write fault.

    Moreover, by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask
    for device private memory to be migrated back to system memory through a
    page fault.

    This is a more flexible API and it better reflects how devices handle
    and report faults (a sketch follows this entry).

    Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
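
    A minimal sketch of a caller requesting faults per address;
    HMM_PFN_VALID, HMM_PFN_WRITE and HMM_PFN_DEVICE_PRIVATE are the flag
    names given above, while the drv_wants_*() helpers and the
    hmm_vma_fault() argument list are assumptions:

        for (i = 0; i < npages; i++) {
                /* read fault wanted for every address in the range */
                range->pfns[i] = HMM_PFN_VALID;
                /* write fault where the device intends to write */
                if (drv_wants_write(i))
                        range->pfns[i] |= HMM_PFN_WRITE;
                /* migrate device private memory back to system memory */
                if (drv_wants_migration(i))
                        range->pfns[i] |= HMM_PFN_DEVICE_PRIVATE;
        }
        ret = hmm_vma_fault(range, true);       /* argument list is an assumption */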
     
  • No functional change, just create one function to handle pmd and one to
    handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).

    Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    Move hmm_pfns_clear() closer to where it is used, to make it clear that
    it is not used by page table walkers.

    Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    Make naming consistent across the code; DEVICE_PRIVATE is the name used
    outside the HMM code, so use that one.

    Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse