21 Jan, 2016

4 commits

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform the following syscalls with the credentials of a user, it still
    passed ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).
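
    As a purely illustrative sketch (an invented example, not code from the
    fix), such a helper might look like this: a root-privileged program that
    switches only its effective UID to the calling user before dumping a
    user-supplied file, while keeping a real UID of 0 and root's permitted
    capabilities:

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            if (argc != 2)
                    return 1;

            /* Real UID becomes 0, only the effective UID becomes the
             * caller's: the process "looks like" the user for file access,
             * but not for the old real-UID based ptrace check. */
            if (setreuid(0, getuid()) != 0) {
                    perror("setreuid");
                    return 1;
            }

            FILE *f = fopen(argv[1], "r");
            if (!f) {
                    perror("fopen");
                    return 1;
            }

            int c;
            while ((c = fgetc(f)) != EOF)
                    putchar(c);
            fclose(f);
            return 0;
    }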

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
    record_obj() in migrate_zspage() does not preserve the handle's
    HANDLE_PIN_BIT, set by find_alloced_obj()->trypin_tag(), and implicitly
    (accidentally) un-pins the handle, while migrate_zspage() still performs
    an explicit unpin_tag() on that handle. This additional explicit
    unpin_tag() introduces a race condition with zs_free(), which can pin
    that handle by this time, so the handle ends up un-pinned again.

    Schematically, it goes like this:

    CPU0                                            CPU1
    migrate_zspage
      find_alloced_obj
        trypin_tag
          set HANDLE_PIN_BIT                        zs_free()
                                                      pin_tag()
      obj_malloc() -- new object, no tag
      record_obj() -- remove HANDLE_PIN_BIT           set HANDLE_PIN_BIT
    unpin_tag() -- remove zs_free's HANDLE_PIN_BIT

    The race condition may result in a NULL pointer dereference:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
    PC is at get_zspage_mapping+0x0/0x24
    LR is at obj_free.isra.22+0x64/0x128
    Call trace:
    get_zspage_mapping+0x0/0x24
    zs_free+0x88/0x114
    zram_free_page+0x64/0xcc
    zram_slot_free_notify+0x90/0x108
    swap_entry_free+0x278/0x294
    free_swap_and_cache+0x38/0x11c
    unmap_single_vma+0x480/0x5c8
    unmap_vmas+0x44/0x60
    exit_mmap+0x50/0x110
    mmput+0x58/0xe0
    do_exit+0x320/0x8dc
    do_group_exit+0x44/0xa8
    get_signal+0x538/0x580
    do_signal+0x98/0x4b8
    do_notify_resume+0x14/0x5c

    This patch keeps the lock bit set in the migration path and updates the
    handle's value atomically.
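
    As a rough illustration of the idea in ordinary C (invented names; this
    is not the zsmalloc code), the new value is published with the pin bit
    carried along instead of being overwritten:

    #include <stdatomic.h>
    #include <stdint.h>

    #define PIN_BIT 0x1UL   /* stand-in for HANDLE_PIN_BIT */

    static _Atomic uintptr_t handle_word;

    /* Buggy pattern: a plain store of the new value silently drops a pin
     * bit that a concurrent path (zs_free() in the commit) depends on. */
    static void record_obj_racy(uintptr_t new_obj)
    {
            atomic_store(&handle_word, new_obj);
    }

    /* Fixed idea: the migration path still holds the pin, so publish the
     * new value with the pin bit kept set, atomically. */
    static void record_obj_keep_pin(uintptr_t new_obj)
    {
            atomic_store(&handle_word, new_obj | PIN_BIT);
    }

    int main(void)
    {
            record_obj_racy(0x1000);
            record_obj_keep_pin(0x2000);
            /* exit code 0 means the pin bit survived the update */
            return (atomic_load(&handle_word) & PIN_BIT) ? 0 : 1;
    }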

    Signed-off-by: Junil Lee
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junil Lee
     
  • split_queue_lock can be taken from interrupt context in some cases, but
    I forgot to convert locking in split_huge_page() to interrupt-safe
    primitives.

    Let's fix this.

    lockdep output:

    ======================================================
    [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
    4.4.0+ #259 Tainted: G W
    ------------------------------------------------------
    syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
    (split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436

    and this task is already holding:
    (slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
    (slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
    which would create a new lock dependency:
    (slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
    (slock-AF_INET){+.-...}
    ... which became SOFTIRQ-irq-safe at:
    mark_irqflags kernel/locking/lockdep.c:2799
    __lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
    flush_stack+0x50/0x330 net/ipv6/udp.c:799
    __udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
    __udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
    udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
    ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:498
    ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
    NF_HOOK_THRESH include/linux/netfilter.h:226
    NF_HOOK include/linux/netfilter.h:249
    ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
    __netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
    __netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
    netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
    napi_skb_finish net/core/dev.c:4542
    napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
    e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
    e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
    napi_poll net/core/dev.c:5074
    net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
    __do_softirq+0x26a/0x920 kernel/softirq.c:273
    invoke_softirq kernel/softirq.c:350
    irq_exit+0x18f/0x1d0 kernel/softirq.c:391
    exiting_irq ./arch/x86/include/asm/apic.h:659
    do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
    ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
    arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
    default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
    arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
    default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
    cpuidle_idle_call kernel/sched/idle.c:156
    cpu_idle_loop kernel/sched/idle.c:252
    cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
    rest_init+0x192/0x1a0 init/main.c:412
    start_kernel+0x678/0x69e init/main.c:683
    x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
    x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184

    to a SOFTIRQ-irq-unsafe lock:
    (split_queue_lock){+.+...}
    which became SOFTIRQ-irq-unsafe at:
    mark_irqflags kernel/locking/lockdep.c:2817
    __lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
    lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
    __raw_spin_lock include/linux/spinlock_api_smp.h:144
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:302
    split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
    split_huge_page include/linux/huge_mm.h:99
    queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
    walk_pmd_range mm/pagewalk.c:50
    walk_pud_range mm/pagewalk.c:90
    walk_pgd_range mm/pagewalk.c:116
    __walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
    walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
    queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
    migrate_to_node mm/mempolicy.c:1004
    do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
    SYSC_migrate_pages mm/mempolicy.c:1453
    SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
    entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185

    other info that might help us debug this:

    Possible interrupt unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(split_queue_lock);
                                        local_irq_disable();
                                        lock(slock-AF_INET);
                                        lock(split_queue_lock);
    <Interrupt>
      lock(slock-AF_INET);

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Acked-by: David Rientjes
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • A newly added tracepoint in the hugepage code uses a variable in the
    error handling that is not initialized at that point:

    include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    The result is relatively harmless, as the trace data will in rare
    cases contain incorrect data.

    This works around the problem by adding an explicit initialization.
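
    A condensed, standalone illustration of the warning class and the
    workaround (not the kernel code; names are made up for the example):

    #include <stdio.h>

    static int do_work(int fail, int *out)
    {
            if (fail)
                    return -1;      /* error path: *out is never written */
            *out = 42;
            return 0;
    }

    int main(void)
    {
            int isolated = 0;       /* the explicit initialization: without
                                       it, the printf below may read an
                                       indeterminate value on the error path
                                       and gcc warns -Wmaybe-uninitialized */
            int err = do_work(1, &isolated);

            printf("err=%d isolated=%d\n", err, isolated);  /* trace-style use */
            return 0;
    }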

    Signed-off-by: Arnd Bergmann
    Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages")
    Reviewed-by: Ebru Akagunduz
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

19 Jan, 2016

1 commit

  • Pull virtio barrier rework+fixes from Michael Tsirkin:
    "This adds a new kind of barrier, and reworks virtio and xen to use it.

    Plus some fixes here and there"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (44 commits)
    checkpatch: add virt barriers
    checkpatch: check for __smp outside barrier.h
    checkpatch.pl: add missing memory barriers
    virtio: make find_vqs() checkpatch.pl-friendly
    virtio_balloon: fix race between migration and ballooning
    virtio_balloon: fix race by fill and leak
    s390: more efficient smp barriers
    s390: use generic memory barriers
    xen/events: use virt_xxx barriers
    xen/io: use virt_xxx barriers
    xenbus: use virt_xxx barriers
    virtio_ring: use virt_store_mb
    sh: move xchg_cmpxchg to a header by itself
    sh: support 1 and 2 byte xchg
    virtio_ring: update weak barriers to use virt_xxx
    Revert "virtio_ring: Update weak barriers to use dma_wmb/rmb"
    asm-generic: implement virt_xxx memory barriers
    x86: define __smp_xxx
    xtensa: define __smp_xxx
    tile: define __smp_xxx
    ...

    Linus Torvalds
     

18 Jan, 2016

1 commit

  • Commit b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when
    MADV_FREE syscall is called") introduced this new function, but got the
    error handling for when pmd_trans_huge_lock() fails wrong. In the
    failure case, the lock has not been taken, and we should not unlock on
    the way out.
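
    The general rule, shown as a userspace analogue (illustrative only, not
    the kernel code): only unlock on the path that actually took the lock:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void work(void)
    {
    }

    /* Returns 1 if the work ran, 0 if the lock was unavailable. */
    static int run_if_lock_available(void)
    {
            if (pthread_mutex_trylock(&lock) != 0)
                    return 0;               /* lock NOT taken: do not unlock */

            work();
            pthread_mutex_unlock(&lock);    /* unlock only where it was taken */
            return 1;
    }

    int main(void)
    {
            return run_if_lock_available() ? 0 : 1;
    }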

    Cc: Minchan Kim
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Jan, 2016

34 commits

  • A spare array holding mem cgroup threshold events is kept around to make
    sure we can always safely deregister an event and have an array to store
    the new set of events in.

    In the scenario where we're going from 1 to 0 registered events, the
    pointer to the primary array containing 1 event is copied to the spare
    slot, and then the spare slot is freed because no events are left.
    However, it is freed before calling synchronize_rcu(), which means
    readers may still be accessing threshold->primary after it is freed.

    This is fixed by only freeing the spare array after synchronize_rcu().

    Signed-off-by: Martijn Coenen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martijn Coenen
     
    Currently memory_failure() doesn't handle the non-anonymous THP case,
    because we can hardly expect the error handling to be successful, and it
    can just hit some corner case which results in a BUG_ON or something
    severe like that. This is also the case for the soft offline code, so
    let's make it behave the same way.

    The original code has a MF_COUNT_INCREASED check before
    put_hwpoison_page(), but it's unnecessary because get_any_page() is
    already called when running this code, and it takes a refcount on the
    target page regardless of the flag. So this patch also removes it.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    soft_offline_page() has some deeply indented code, which is a sign that
    it needs cleanup. So let's do that. No functional change.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Both s390 and powerpc have hit the issue of swapoff hanging, when
    CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
    quite as x86_64 had them. I think it would be much clearer if
    HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
    determine whether the MEM_SOFT_DIRTY option should be offered, and the
    actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.

    But I won't embark on that change myself: instead, make swapoff more
    robust by using pte_swp_clear_soft_dirty() on each pte it encounters,
    without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY. That is a no-op
    (whether the bit in question is defined as 0 or the asm-generic fallback
    is used) unless soft dirty is fully turned on.

    Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().

    Signed-off-by: Hugh Dickins
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Cyrill Gorcunov
    Cc: Laurent Dufour
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Dmitry Vyukov has reported[1] possible deadlock (triggered by his
    syzkaller fuzzer):

    Possible unsafe locking scenario:

    CPU0                                        CPU1
    ----                                        ----
    lock(&hugetlbfs_i_mmap_rwsem_key);
                                                lock(&mapping->i_mmap_rwsem);
                                                lock(&hugetlbfs_i_mmap_rwsem_key);
    lock(&mapping->i_mmap_rwsem);

    Both traces point to mm_take_all_locks() as the source of the problem.
    It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
    mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.

    huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
    and the allocator can take i_mmap_rwsem if it hits reclaim. So we need
    to take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem
    from the rest of the VMAs.

    The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.

    [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
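
    In a userspace analogue (purely illustrative; pthread mutexes standing
    in for the rwsems), the fix corresponds to taking every lock of the
    hugetlb class before any lock of the regular class, so the two classes
    are never acquired in the opposite order:

    #include <pthread.h>

    #define N 4

    static pthread_mutex_t hugetlb_locks[N];    /* hugetlb i_mmap_rwsem class */
    static pthread_mutex_t regular_locks[N];    /* other mappings */

    static void take_all_locks(void)
    {
            int i;

            /* Phase 1: all hugetlb-class locks first ... */
            for (i = 0; i < N; i++)
                    pthread_mutex_lock(&hugetlb_locks[i]);
            /* Phase 2: ... only then the regular-class locks. */
            for (i = 0; i < N; i++)
                    pthread_mutex_lock(&regular_locks[i]);
    }

    static void drop_all_locks(void)
    {
            int i;

            for (i = 0; i < N; i++)
                    pthread_mutex_unlock(&regular_locks[i]);
            for (i = 0; i < N; i++)
                    pthread_mutex_unlock(&hugetlb_locks[i]);
    }

    int main(void)
    {
            int i;

            for (i = 0; i < N; i++) {
                    pthread_mutex_init(&hugetlb_locks[i], NULL);
                    pthread_mutex_init(&regular_locks[i], NULL);
            }
            take_all_locks();
            drop_all_locks();
            return 0;
    }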

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm:
    mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
    it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
    VM_HUGETLB VMAs, and avoid useless overhead of minor faults.

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen
     
  • Remove unused struct zone *z variable which appeared in 86051ca5eaf5
    ("mm: fix usemap initialization").

    Signed-off-by: Alexander Kuleshov
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    Since can_do_mlock() only returns 1 or 0, make it a boolean.

    No functional change.

    [akpm@linux-foundation.org: update declaration in mm.h]
    Signed-off-by: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang
     
  • Just cleanup, no functional change.

    Signed-off-by: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang
     
    In earlier versions, mem_cgroup_css_from_page() could return a non-root
    css on a legacy hierarchy; that css can go away, so rcu locking was
    required. However, the eventual version simply returns the root cgroup
    if memcg is on a legacy hierarchy and thus doesn't need rcu locking
    around or in it. Remove the spurious rcu locking.

    Signed-off-by: Tejun Heo
    Reported-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
    Use IS_ALIGNED() to check the alignment, rather than open-coding the
    check.
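
    For illustration, the open-coded form and the macro form side by side
    (standalone example; the macro definition mirrors the kernel's
    IS_ALIGNED()):

    #include <stdio.h>

    #define IS_ALIGNED(x, a)  (((x) & ((typeof(x))(a) - 1)) == 0)

    int main(void)
    {
            unsigned long addr = 0x21000;
            unsigned long align = 0x1000;

            /* open-coded check */
            printf("%d\n", (addr & (align - 1)) == 0);
            /* the macro states the intent directly */
            printf("%d\n", IS_ALIGNED(addr, align));
            return 0;
    }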

    Signed-off-by: Wang Xiaoqiang
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang
     
  • During Jason's work with postcopy migration support for s390 a problem
    regarding gmap faults was discovered.

    The gmap code will call fixup_user_fault, which always ends up in
    handle_mm_fault. Until now we never cared about retries, but since the
    userfaultfd code kind of relies on them, this needs a fix.

    This patchset does not take care of the futex code. I will now look
    closer at this.

    This patch (of 2):

    With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
    to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
    faulting we ever unlocked mmap_sem.

    This patch brings in the logic to handle retries and cleans up the
    current documentation. fixup_user_fault did not have the same semantics
    as filemap_fault: it never indicated whether a retry happened, so a
    caller wasn't able to handle that case. We now change the behaviour to
    always retry a locked mmap_sem.

    Signed-off-by: Dominik Dingel
    Reviewed-by: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Christian Borntraeger
    Cc: "Jason J. Herne"
    Cc: David Rientjes
    Cc: Eric B Munson
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: Dominik Dingel
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel
     
  • A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
    has established a devm_memremap_pages() mapping, i.e. when the pfn_t
    return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
    encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
    struct dev_pagemap instance to keep the result of pfn_to_page() valid
    until put_page().

    Signed-off-by: Dan Williams
    Tested-by: Logan Gunthorpe
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
    A dax-huge-page mapping, while it uses some thp helpers, is ultimately
    not a transparent huge page. The distinction is especially important in
    the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
    from pmd_huge() and pmd_trans_huge(), which have slightly different
    semantics.

    Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
    pmd_devmap() checks.

    [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
    Similar to the conversion of vm_insert_mixed(), use pfn_t in
    vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVMAP when the
    pfn is backed by a devm_memremap_pages() mapping.

    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
    evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers
    _PAGE_DEVMAP to be set in the resulting pte.

    There are no functional changes to the gpu drivers as a result of this
    conversion.

    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: David Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array. The default vmemmap_populate()
    allocates page table storage area from the page allocator. Given
    persistent memory capacities relative to DRAM it may not be feasible to
    store the memmap in 'System Memory'. Instead vmem_altmap represents
    pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
    requests.

    Signed-off-by: Dan Williams
    Reported-by: kbuild test robot
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prior to this change DAX PMD mappings that were made read-only were
    never able to be made writable again. This is because the code in
    insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
    these calls if the PMD already existed in the page table.

    Instead, if we are doing a write always mark the PMD entry as dirty and
    writeable. Without this code we can get into a condition where we mark
    the PMD as read-only, and then on a subsequent write fault we get into
    an infinite loop of PMD faults where we try unsuccessfully to make the
    PMD writeable.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Reported-by: Jeff Moyer
    Reported-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
    Sasha Levin has reported a KASAN out-of-bounds bug[1]. It points to "if
    (!is_swap_pte(pte[i]))" in unfreeze_page_vma() as the problematic
    access.

    The cause is that split_huge_page() doesn't handle a THP correctly if it
    is not aligned to a PMD boundary. This can happen after mremap().

    Test case (it does not always trigger the bug):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024UL*1024)
    #define SIZE (2*MB)
    #define BASE ((void *)0x400000000000)

    int main(void)
    {
            char *p;

            p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                     -1, 0);
            if (p == MAP_FAILED)
                    perror("mmap"), exit(1);
            p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
                       BASE + SIZE + 8192);
            if (p == MAP_FAILED)
                    perror("mremap"), exit(1);
            system("echo 1 > /sys/kernel/debug/split_huge_pages");
            return 0;
    }

    The patch fixes freeze and unfreeze paths to handle page table boundary
    crossing.

    It also makes the mapcount vs. count check in split_huge_page_to_list()
    stricter:
    - after freeze we don't expect any subpage to be mapped, as we remove
    them from rmap when setting up migration entries;
    - count must be 1, meaning only the caller has a reference to the page.

    [1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We don't need to split a THP page when the MADV_FREE syscall is called
    if [start, len] is aligned to the THP size. The split can be done later,
    when the VM decides to free it in the reclaim path under memory
    pressure. With that, we avoid an unnecessary THP split.

    For this feature, this patch changes the pte dirtiness marking logic of
    THP. Currently, splitting marks every pte of the page dirty
    unconditionally, which makes MADV_FREE void. So, instead, this patch
    propagates pmd dirtiness to all pages via PG_dirty and restores pte
    dirtiness from PG_dirty. With this, if the pmd is clean (i.e.,
    MADV_FREEed) when the split happens (e.g., in shrink_page_list), all of
    the pages are clean too, so we can discard them.

    Signed-off-by: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttilä
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The MADV_FREE patchset changes page reclaim to simply free a clean
    anonymous page with no dirty ptes, instead of swapping it out; but KSM
    uses clean write-protected ptes to reference the stable ksm page. So be
    sure to mark that page dirty, so it's never mistakenly discarded.

    [hughd@google.com: adjusted comments]
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttilä
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    MADV_FREE is a hint that it's okay to discard pages if there is memory
    pressure, and we use reclaimers (i.e., kswapd and direct reclaim) to
    free them, so there is no value in keeping them on the active anonymous
    LRU; this patch moves them to the head of the inactive LRU list instead.

    This means that MADV_FREE-ed pages which were living on the inactive
    list are reclaimed first, because they are more likely to be cold than
    recently active pages.

    An arguable issue with the approach is whether we should put the page at
    the head or the tail of the inactive list. I chose the head because the
    kernel cannot be sure whether a page is really cold or warm for every
    MADV_FREE usecase, but at least we know it's not *hot*, so landing at
    the head of the inactive list is a compromise between the various
    usecases.

    This fixes the suboptimal behavior of MADV_FREE where pages living on
    the active list would sit there for a long time even under memory
    pressure while the inactive list was reclaimed heavily. That basically
    breaks the whole purpose of using MADV_FREE to help the system free
    memory which might not be used.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Kerrisk
    Cc: Mika Penttilä
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    When I test the below piece of code with 12 processes (i.e., 512M * 12
    = 6G consumed) on my machine (3G ram + 12 cpu + 8G swap), madvise_free
    is significantly slower (about 2x) than madvise_dontneed.

    loop = 5;
    mmap(512M);
    while (loop--) {
    memset(512M);
    madvise(MADV_FREE or MADV_DONTNEED);
    }
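
    For reference, a runnable version of the loop above might look like the
    following (a sketch only; the mapping size, iteration count and the
    MADV_FREE fallback definition are assumptions of the example):

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8     /* asm-generic value; needs a kernel with MADV_FREE */
    #endif

    #define SIZE (512UL * 1024 * 1024)

    int main(int argc, char **argv)
    {
            /* pass any argument to use MADV_FREE instead of MADV_DONTNEED */
            int advice = argc > 1 ? MADV_FREE : MADV_DONTNEED;
            int loop = 5;
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            while (loop--) {
                    memset(p, 1, SIZE);             /* dirty every page */
                    madvise(p, SIZE, advice);       /* hand the range back */
            }
            return 0;
    }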

    The reason is lots of swapin.

    1) dontneed: 1,612 swapin
    2) madvfree: 879,585 swapin

    If we find that hinted pages were already swapped out when the syscall
    is called, it's pointless to keep the swap entries in the ptes. Instead,
    let's free such cold pages, because swapin is more expensive than (alloc
    page + zeroing).

    With this patch, swapins dropped from 879,585 to 1,878, and the elapsed
    times are:

    1) dontneed: 6.10user 233.50system 0:50.44elapsed
    2) madvfree: 6.03user 401.17system 1:30.67elapsed
    3) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed

    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Mika Penttilä
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Linux doesn't have the ability to free pages lazily, while other OSes
    already support it, via madvise(MADV_FREE).

    The gain is clear: the kernel can discard freed pages rather than
    swapping them out or hitting OOM when memory pressure happens.

    Without memory pressure, freed pages would be reused by userspace
    without additional overhead (e.g., page fault + allocation + zeroing).

    Jason Evans said:

    : Facebook has been using MAP_UNINITIALIZED
    : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
    : several years, but there are operational costs to maintaining this
    : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
    : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
    : increased throughput for much of our workload by ~5%, and although the
    : benefit has decreased using newer hardware and kernels, there is still
    : enough benefit that we cannot reasonably retire it without a replacement.
    :
    : Aside from Facebook operations, there are numerous broadly used
    : applications that would benefit from MADV_FREE. The ones that immediately
    : come to mind are redis, varnish, and MariaDB. I don't have much insight
    : into Android internals and development process, but I would hope to see
    : MADV_FREE support eventually end up there as well to benefit applications
    : linked with the integrated jemalloc.
    :
    : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
    : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
    : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
    : (and AIX, but I'm not sure it even compiles on AIX). The lack of
    : MADV_FREE on Linux forced me down a long series of increasingly
    : sophisticated heuristics for madvise() volume reduction, and even so this
    : remains a common performance issue for people using jemalloc on Linux.
    : Please integrate MADV_FREE; many people will benefit substantially.

    How it works:

    When the madvise syscall is called, the VM clears the dirty bit in the
    ptes of the range. If memory pressure happens, the VM checks the dirty
    bit in the page table, and if it is still "clean", the page is a
    "lazyfree" page, so the VM can discard it instead of swapping it out.
    If there was a store to the page before the VM picked it for reclaim,
    the dirty bit is set, so the VM swaps the page out instead of discarding
    it.

    One thing we should notice is that, basically, MADV_FREE relies on the
    dirty bit in the page table entry to decide whether the VM is allowed to
    discard the page or not. IOW, if the page table entry has the dirty bit
    set, the VM shouldn't discard the page.

    However, as an example, if a swap-in happens via a read fault, the page
    table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
    discard the page.

    To avoid that problem, MADV_FREE does additional checks on PageDirty and
    PageSwapCache. This works out because a swapped-in page lives in the
    swap cache, and once it is evicted from the swap cache, the page has the
    PG_dirty flag. So both page-flag checks effectively prevent wrong
    discarding by MADV_FREE.

    However, a problem with the above logic is that a swapped-in page still
    has PG_dirty after it is removed from the swap cache, so the VM cannot
    consider the page freeable any more, even if madvise_free is called on
    it later.

    Look at the example below for details.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page is swapped in and could be removed from the
                   swapcache. Then the page table entry is not marked
                   dirty, but the page descriptor has PG_dirty.
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. At this point, the VM cannot discard the page because the page
    .. still has *PG_dirty* set.

    To solve the problem, this patch clears PG_dirty only if the page is
    owned exclusively by the current process when madvise is called, because
    PG_dirty represents the ptes' dirtiness across several processes, so we
    can clear it only if we own the page exclusively.

    The first heavy users would be general allocators (e.g., jemalloc,
    tcmalloc, and hopefully glibc will support it too), and
    jemalloc/tcmalloc already support the feature on other OSes (e.g.,
    FreeBSD).

    barrios@blaptop:~/benchmark/ebizzy$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 12
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 2
    Stepping: 3
    CPU MHz: 3200.185
    BogoMIPS: 6400.53
    Virtualization: VT-x
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 4096K
    NUMA node0 CPU(s): 0-11
    ebizzy benchmark(./ebizzy -S 10 -n 512)

    Higher avg is better.

                  vanilla-jemalloc          MADV_free-jemalloc

    1 thread
    records:      10                        10
    avg:          2961.90                   12069.70
    std:          71.96(2.43%)              186.68(1.55%)
    max:          3070.00                   12385.00
    min:          2796.00                   11746.00

    2 thread
    records:      10                        10
    avg:          5020.00                   17827.00
    std:          264.87(5.28%)             358.52(2.01%)
    max:          5244.00                   18760.00
    min:          4251.00                   17382.00

    4 thread
    records:      10                        10
    avg:          8988.80                   27930.80
    std:          1175.33(13.08%)           3317.33(11.88%)
    max:          9508.00                   30879.00
    min:          5477.00                   21024.00

    8 thread
    records:      10                        10
    avg:          13036.50                  33739.40
    std:          170.67(1.31%)             5146.22(15.25%)
    max:          13371.00                  40572.00
    min:          12785.00                  24088.00

    16 thread
    records:      10                        10
    avg:          11092.40                  31424.20
    std:          710.60(6.41%)             3763.89(11.98%)
    max:          12446.00                  36635.00
    min:          9949.00                   25669.00

    32 thread
    records:      10                        10
    avg:          11067.00                  34495.80
    std:          971.06(8.77%)             2721.36(7.89%)
    max:          12010.00                  38598.00
    min:          9002.00                   30636.00

    In summary, MADV_FREE is much faster than MADV_DONTNEED.

    This patch (of 12):

    Add core MADV_FREE implementation.

    [akpm@linux-foundation.org: small cleanups]
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: Mika Penttilä
    Cc: Michael Kerrisk
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Jason Evans
    Cc: Daniel Micay
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: "Shaohua Li"
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
    code for looking up pte of a (possibly transhuge) page. Move this code
    to a new helper function, page_check_address_transhuge(), and make the
    above mentioned functions use it.

    This is just a cleanup, no functional changes are intended.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    During freeze_page(), we remove the page from rmap. That munlocks the
    page if it was mlocked. clear_page_mlock() uses the lru cache, which
    temporarily pins the page.

    Let's drain the lru cache before checking the page's count vs. mapcount.
    The change makes a mlocked page split on the first attempt, if it was
    not pinned by somebody else.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Writing 1 into 'split_huge_pages' will try to find and split all huge
    pages in the system. This is useful for debugging.

    [akpm@linux-foundation.org: fix printk text, per Vlastimil]
    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Both page_referenced() and page_idle_clear_pte_refs_one() assume that
    THP can only be mapped with a PMD, so there's no reason to look at PTEs
    for PageTransHuge() pages. That's not true anymore: THP can be mapped
    with PTEs too.

    The patch removes the PageTransHuge() test from the functions and
    open-codes the page table check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Before the THP refcounting rework, THP was not allowed to cross a VMA
    boundary. So, if we have a THP and we split it, PG_mlocked can be safely
    transferred to the small pages.

    With the new THP refcounting and a naive approach to mlocking we can end
    up with this scenario:
    1. we have a mlocked THP, which belongs to one VM_LOCKED VMA.
    2. the process does munlock() on *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - the huge PMD is split into a PTE table;
    - the THP is still mlocked;
    3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether
    they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages in split_huge_page(), but I
    think we already have an accounting issue at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on the normal lru lists and will be
    split under memory pressure by vmscan. After the split, vmscan will
    detect the unevictable small pages and mlock them.

    With this approach we shouldn't hit the situation described above.
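
    As a purely illustrative userspace sketch of the partial-munlock pattern
    from step 2 above (not part of the patch):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define MB (1024UL * 1024)

    int main(void)
    {
            /* A 2M anonymous region that THP may back with a huge page. */
            char *p = mmap(NULL, 2 * MB, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            mlock(p, 2 * MB);       /* lock the whole (possibly huge) page */
            munlock(p + MB, MB);    /* unlock only the second half: the VMA
                                       is split and the huge PMD is split */
            return 0;
    }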

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    All parts of THP with the new refcounting are now in place. We can now
    allow THP to be enabled.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Currently we don't split huge pages on partial unmap. That's not an
    ideal situation. It can lead to memory overhead.

    Fortunately, we can detect partial unmap in page_remove_rmap(). But we
    cannot call split_huge_page() from there due to the locking context.

    It's also counterproductive to do it directly from the munmap()
    codepath: in many cases we will hit this from exit(2), and splitting the
    huge page just to free it up in small pages is not what we really want.

    The patch introduces deferred_split_huge_page(), which puts the huge
    page into a queue for splitting. The splitting itself will happen when
    we get memory pressure via the shrinker interface. The page will be
    dropped from the list on freeing, through the compound page destructor.
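
    A generic userspace analogue of the mechanism (invented names,
    illustrative only; not the kernel code): items are queued cheaply where
    the expensive work cannot be done, and the queue is drained later from a
    pressure-driven callback:

    #include <stdlib.h>
    #include <pthread.h>

    struct deferred_item {
            struct deferred_item *next;
            void (*process)(struct deferred_item *);  /* the expensive part */
    };

    static struct deferred_item *queue_head;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Cheap: called from the context that cannot do the real work. */
    static void deferred_queue(struct deferred_item *item)
    {
            pthread_mutex_lock(&queue_lock);
            item->next = queue_head;
            queue_head = item;
            pthread_mutex_unlock(&queue_lock);
    }

    /* Called later (a shrinker callback in the commit): drain the queue and
     * do the expensive processing outside the original context. */
    static void deferred_shrink(void)
    {
            struct deferred_item *item;

            pthread_mutex_lock(&queue_lock);
            item = queue_head;
            queue_head = NULL;
            pthread_mutex_unlock(&queue_lock);

            while (item) {
                    struct deferred_item *next = item->next;

                    item->process(item);
                    item = next;
            }
    }

    static void free_item(struct deferred_item *item)
    {
            free(item);
    }

    int main(void)
    {
            struct deferred_item *it = malloc(sizeof(*it));

            if (!it)
                    return 1;
            it->process = free_item;
            deferred_queue(it);
            deferred_shrink();
            return 0;
    }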

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We are not able to migrate THPs. That means it's not enough to split
    only the PMD on migration; we need to split the compound page under it
    too.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    This patch adds an implementation of split_huge_page() for the new
    refcounting.

    Unlike the previous implementation, the new split_huge_page() can fail
    if somebody holds a GUP pin on the page. It also means that a pin on a
    page will prevent it from being split under you. It makes the situation
    in many places much cleaner.

    The basic scheme of split_huge_page():

    - Check that the sum of mapcounts of all subpages is equal to
    page_count() plus one (the caller's pin). Fail with -EBUSY otherwise.
    This way we can avoid useless PMD-splits.

    - Freeze the page counters by splitting all PMDs and setting up
    migration PTEs.

    - Re-check the sum of mapcounts against page_count(). The page's counts
    are stable now. -EBUSY if the page is pinned.

    - Split the compound page.

    - Unfreeze the page by removing migration entries.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
    changes in thp refcounting rule. This patch fixes them up.

    In the new refcounting, we no longer use tail->_mapcount to keep tail's
    refcount, and thereby we can simplify get/put_hwpoison_page().

    And another change is that tail's refcount is not transferred to the raw
    page during thp split (more precisely, in new rule we don't take
    refcount on tail page any more.) So when we need thp split, we have to
    transfer the refcount properly to the 4kB soft-offlined page before
    migration.

    thp split code goes into core code only when precheck
    (total_mapcount(head) == page_count(head) - 1) passes to avoid useless
    split, where we assume that one refcount is held by the caller of thp
    split and the others are taken via mapping. To meet this assumption,
    this patch moves thp split part in soft_offline_page() after
    get_any_page().

    [akpm@linux-foundation.org: remove unneeded #define, per Kirill]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi