
07 Feb, 2019

4 commits

  • commit e0a352fabce61f730341d119fbedf71ffdb8663f upstream.

    We had a race in the old balloon compaction code, before b1123ea6d3b3
    ("mm: balloon: use general non-lru movable page feature") refactored it,
    that became visible after backporting 195a8c43e93d ("virtio-balloon:
    deflate via a page list") without the refactoring.

    The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction:
    redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon:
    use general non-lru movable page feature"). d6d86c0a7f8d
    ("mm/balloon_compaction: redesign ballooned pages management") was
    backported to 3.12, so the broken kernels are stable kernels [3.12 -
    4.7].

    There was a subtle race between dropping the page lock of the newpage in
    __unmap_and_move() and checking for __is_movable_balloon_page(newpage).

    Just after dropping this page lock, virtio-balloon could go ahead and
    deflate the newpage, effectively dequeueing it and clearing PageBalloon,
    in turn making __is_movable_balloon_page(newpage) fail.

    This resulted in dropping the reference of the newpage via
    putback_lru_page(newpage) instead of put_page(newpage), leading to
    page->lru getting modified and a !LRU page ending up in the LRU lists.
    With 195a8c43e93d ("virtio-balloon: deflate via a page list")
    backported, one would suddenly get corrupted lists in
    release_pages_balloon():

    - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
    - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100

    Nowadays this race is no longer possible, but it is hidden behind very
    ugly handling of __ClearPageMovable() and __PageMovable().

    __ClearPageMovable() will not make __PageMovable() fail, only
    PageMovable(). So the new check (__PageMovable(newpage)) will still
    hold even after newpage was dequeued by virtio-balloon.

    If anybody ever changes that special handling, the BUG would be
    introduced again. So instead, make it explicit and use the information
    from the original isolated page before migration.

    This patch can be backported fairly easily to stable kernels (in contrast
    to the refactoring).

    Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
    Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management")
    Signed-off-by: David Hildenbrand
    Reported-by: Vratislav Bendel
    Acked-by: Michal Hocko
    Acked-by: Rafael Aquini
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Dominik Brodowski
    Cc: Matthew Wilcox
    Cc: Vratislav Bendel
    Cc: Rafael Aquini
    Cc: Konstantin Khlebnikov
    Cc: Minchan Kim
    Cc: [3.12 - 4.7]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit 6376360ecbe525a9c17b3d081dfd88ba3e4ed65b upstream.

    Currently memory_failure() is racy against process's exiting, which
    results in kernel crash by null pointer dereference.

    The root cause is that memory_failure() uses force_sig() to forcibly
    kill asynchronous (meaning not in the current context) processes. As
    discussed in the thread https://lkml.org/lkml/2010/6/8/236 years ago for
    the OOM fixes, this is not the right thing to do. OOM solves this issue
    by using do_send_sig_info(), as done in commit d2d393099de2 ("signal:
    oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
    patch suggests doing the same for hwpoison. do_send_sig_info() properly
    accesses siglock with lock_task_sighand(), so it is free from the
    reported race.

    I confirmed that the reported bug reproduces after inserting some delay
    in kill_procs(), and that it never reproduces with this patch.

    Note that memory_failure() can send another type of signal using
    force_sig_mceerr(), and the reported race shouldn't happen with it
    because force_sig_mceerr() is called only for synchronous processes (i.e.
    BUS_MCEERR_AR happens only when some process accesses the corrupted
    memory).

    Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Jane Chu
    Reviewed-by: Dan Williams
    Reviewed-by: William Kucharski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit cefc7ef3c87d02fc9307835868ff721ea12cc597 upstream.

    A syzbot instance running on the upstream kernel found a use-after-free
    bug in oom_kill_process(). On further inspection it seems like the
    process selected to be oom-killed has exited even before reaching
    read_lock(&tasklist_lock) in oom_kill_process(). More specifically,
    tsk->usage is 1, which is due to the get_task_struct() in
    oom_evaluate_task(), and the put_task_struct() within for_each_thread()
    frees the tsk, after which for_each_thread() tries to access the tsk.
    The easiest fix is to do get/put across the for_each_thread() on the
    selected task.

    Now the next question is: should we continue with the oom-kill if the
    previously selected task has exited? However, before adding more
    complexity and heuristics, let's answer why we even look at the children
    of the oom-kill selected task. select_bad_process() has already selected
    the worst process in the system/memcg. Due to a race, the selected
    process might not be the worst at the kill time, but does that matter?
    Userspace can use the oom_score_adj interface to prefer children to be
    killed before the parent. I looked at the history, but it seems like this
    behavior predates git history.

    Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
    Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
    Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     
  • commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.

    Arkadiusz reported that enabling memcg's group oom killing causes strange
    memcg statistics where there is no task in a memcg although its reported
    number of tasks is not 0. It turned out that there is a bug in
    wake_oom_reaper() which allows enqueuing the same task twice, which makes
    it impossible to decrease the number of tasks in that memcg due to a
    refcount leak.

    This bug has existed since the OOM reaper became invokable from the
    task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

    T1@P1      |T2@P1      |T3@P1      |OOM reaper
    -----------+-----------+-----------+------------
                                        # Processing an OOM victim in a different memcg domain.
                            try_charge()
                              mem_cgroup_out_of_memory()
                                mutex_lock(&oom_lock)
                try_charge()
                  mem_cgroup_out_of_memory()
                    mutex_lock(&oom_lock)
    try_charge()
      mem_cgroup_out_of_memory()
        mutex_lock(&oom_lock)
                                out_of_memory()
                                  oom_kill_process(P1)
                                    do_send_sig_info(SIGKILL, @P1)
                                    mark_oom_victim(T1@P1)
                                    wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                mutex_unlock(&oom_lock)
                    out_of_memory()
                      mark_oom_victim(T2@P1)
                      wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                    mutex_unlock(&oom_lock)
        out_of_memory()
          mark_oom_victim(T1@P1)
          wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
        mutex_unlock(&oom_lock)
                                        # Completed processing an OOM victim in a different memcg domain.
                                        spin_lock(&oom_reaper_lock)
                                        # T1@P1 is dequeued.
                                        spin_unlock(&oom_reaper_lock)

    but memcg's group oom killing made it easier to trigger this bug by
    calling wake_oom_reaper() on the same task from one out_of_memory()
    request.

    Fix this bug using the approach used by commit 855b018325737f76 ("oom,
    oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side
    effect, this patch also avoids enqueuing multiple threads sharing memory
    via the task_will_free_mem(current) path.
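
    The double-enqueue condition called out in the trace can be modelled in
    isolation. The following stand-alone C sketch is purely illustrative (the
    task struct, wake_oom_reaper() and its "already queued?" check are
    simplified stand-ins mirroring the condition quoted above, not the kernel
    code): a task whose next pointer is NULL and which is no longer the list
    head passes the check and gets enqueued a second time.

        #include <stdio.h>
        #include <stddef.h>

        struct task {
                const char *name;
                struct task *oom_reaper_list;   /* next pointer in the reaper queue */
        };

        static struct task *oom_reaper_list;    /* queue head */

        static void wake_oom_reaper(struct task *tsk)
        {
                /* skip only if tsk is the head or already has a successor */
                if (tsk == oom_reaper_list || tsk->oom_reaper_list)
                        return;
                tsk->oom_reaper_list = oom_reaper_list;
                oom_reaper_list = tsk;
                printf("enqueued %s\n", tsk->name);
        }

        int main(void)
        {
                struct task t1 = { "T1@P1", NULL };
                struct task t2 = { "T2@P1", NULL };

                wake_oom_reaper(&t1);   /* head: T1 */
                wake_oom_reaper(&t2);   /* head: T2 -> T1 */
                wake_oom_reaper(&t1);   /* T1 != head, T1->next == NULL: enqueued again */
                return 0;
        }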

    Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
    Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
    Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
    Signed-off-by: Tetsuo Handa
    Reported-by: Arkadiusz Miskiewicz
    Tested-by: Arkadiusz Miskiewicz
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Tejun Heo
    Cc: Aleksa Sarai
    Cc: Jay Kamat
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

26 Jan, 2019

2 commits

  • [ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ]

    Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
    the avail_lists field of swap_info_struct has been an array with
    MAX_NUMNODES elements. This increased the size of swap_info_struct to
    40KiB, which needs an order-4 page to hold it.

    This is not optimal in that:
    1. most systems have far fewer than MAX_NUMNODES (1024) nodes, so the
       array is a waste of memory;
    2. it could cause swapon failure if a swap device is swapped on after
       the system has been running for a while, because no order-4 page is
       available, as pointed out by Vasily Averin.

    Solve the above two issues by using nr_node_ids (the actual number of
    possible nodes on the running system) for avail_lists instead of
    MAX_NUMNODES.

    nr_node_ids is unknown at compile time, so it can't be used directly when
    declaring this array. What I did here is to declare avail_lists as a
    zero-length array and allocate space for it when allocating space for
    swap_info_struct. The reason for keeping an array rather than a pointer
    is that plist_for_each_entry() needs the field to be part of the struct,
    so a pointer will not work.
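
    The allocation pattern described above can be sketched in plain
    user-space C (swap_info_like, avail_entry and the node count are
    illustrative stand-ins, not the kernel's swap_info_struct): the flexible
    array member goes last and the runtime node count is folded into the
    allocation size.

        #include <stdio.h>
        #include <stdlib.h>

        struct avail_entry {
                int prio;                           /* stand-in for a plist node */
        };

        struct swap_info_like {
                int flags;                          /* fixed-size members come first */
                struct avail_entry avail_lists[];   /* sized at runtime, not compile time */
        };

        int main(void)
        {
                int nr_node_ids = 4;                /* stand-in for the runtime node count */
                struct swap_info_like *p;

                p = calloc(1, sizeof(*p) + nr_node_ids * sizeof(p->avail_lists[0]));
                if (!p)
                        return 1;
                for (int i = 0; i < nr_node_ids; i++)
                        p->avail_lists[i].prio = i; /* the array is part of the struct */
                printf("allocated %zu bytes for %d nodes\n",
                       sizeof(*p) + nr_node_ids * sizeof(p->avail_lists[0]),
                       nr_node_ids);
                free(p);
                return 0;
        }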

    This patch is on top of Vasily Averin's fix commit. I think the use of
    kvzalloc for swap_info_struct is still needed in case nr_node_ids is
    really big on some systems.

    Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
    Signed-off-by: Aaron Lu
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vasily Averin
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Aaron Lu
     
  • [ Upstream commit 3fa750dcf29e8606e3969d13d8e188cc1c0f511d ]

    write_cache_pages() is used in both background and integrity writeback
    scenarios by various filesystems. Background writeback is mostly
    concerned with cleaning a certain number of dirty pages based on various
    mm heuristics. It may not write the full set of dirty pages or wait for
    I/O to complete. Integrity writeback is responsible for persisting a set
    of dirty pages before the writeback job completes. For example, an
    fsync() call must perform integrity writeback to ensure data is on disk
    before the call returns.

    write_cache_pages() unconditionally breaks out of its processing loop in
    the event of a ->writepage() error. This is fine for background
    writeback, which has no strict requirements and will eventually come
    around again. It can, however, cause problems for integrity writeback on
    filesystems that might need to clean up state associated with failed page
    writeouts. For example, XFS performs internal delayed allocation
    accounting before returning a ->writepage() error, where applicable. If
    the current writeback happens to be associated with an unmount and
    write_cache_pages() completes the writeback prematurely due to error, the
    filesystem is unmounted in an inconsistent state if dirty+delalloc pages
    still exist.

    To handle this problem, update write_cache_pages() to always process the
    full set of pages for integrity writeback regardless of ->writepage()
    errors. Save the first encountered error and return it to the caller once
    complete. This facilitates XFS (or any other fs that expects integrity
    writeback to process the entire set of dirty pages) to clean up its
    internal state completely in the event of persistent mapping errors.
    Background writeback continues to exit on the first error encountered.
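
    The behaviour described above can be sketched with a self-contained
    user-space loop (purely illustrative; fake_writepage() and
    writeback_range() are made-up stand-ins, not the real write_cache_pages()
    interface):

        #include <stdbool.h>
        #include <stdio.h>

        /* pretend page 2 fails to write, like a persistent mapping error */
        static int fake_writepage(int pageno)
        {
                return (pageno == 2) ? -5 /* -EIO */ : 0;
        }

        static int writeback_range(int nr_pages, bool integrity)
        {
                int ret = 0;

                for (int i = 0; i < nr_pages; i++) {
                        int err = fake_writepage(i);

                        if (err) {
                                if (!ret)
                                        ret = err;      /* remember the first error */
                                if (!integrity)
                                        break;          /* background writeback still bails out */
                        }
                        /* integrity writeback keeps going over the whole range */
                }
                return ret;     /* first error (if any) is reported once complete */
        }

        int main(void)
        {
                printf("background: %d\n", writeback_range(5, false));
                printf("integrity:  %d\n", writeback_range(5, true));
                return 0;
        }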

    [akpm@linux-foundation.org: fix typo in comment]
    Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Brian Foster
     

17 Jan, 2019

3 commits

  • commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

    Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
    ext4 writeback

    task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

    He adds:
    "task1 is waiting for the PageWriteback bit of the page that task2 has
    collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
    bit of the page which task1 has locked"

    More precisely task1 is handling a page fault and it has a page locked
    while it charges a new page table to a memcg. That in turn hits a
    memory limit reclaim and the memcg reclaim for legacy controller is
    waiting on the writeback but that is never going to finish because the
    writeback itself is waiting for the page locked in the #PF path. So this
    is essentially an ABBA deadlock:

    task1 (#PF path)                    task2 (ext4 writeback)
                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
    lock_page(B)
                                        lock_page(B)
    pte_alloc_pne
      shrink_page_list
        wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

    This accumulation of more pages to flush is used by several filesystems
    to generate more optimal IO patterns.

    Waiting for the writeback in the legacy memcg controller is a workaround
    for premature OOM killer invocations because there is no dirty IO
    throttling available for the controller. There is no easy way around
    that, unfortunately. Therefore fix this specific issue by pre-allocating
    the page table outside of the page lock. We already have handy
    infrastructure for that, so simply reuse the fault-around pattern which
    already does this.

    There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
    made under a locked fs page, but they should be really rare. I am not
    aware of a better solution, unfortunately.

    [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: enhance comment, per Johannes]
    Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Signed-off-by: Michal Hocko
    Reported-by: Liu Bo
    Debugged-by: Liu Bo
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Reviewed-by: Liu Bo
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 8ab88c7169b7fba98812ead6524b9d05bc76cf00 upstream.

    LTP proc01 testcase has been observed to rarely trigger crashes
    on arm64:
    page_mapped+0x78/0xb4
    stable_page_flags+0x27c/0x338
    kpageflags_read+0xfc/0x164
    proc_reg_read+0x7c/0xb8
    __vfs_read+0x58/0x178
    vfs_read+0x90/0x14c
    SyS_read+0x60/0xc0

    The issue is that page_mapped() assumes that if a compound page is not
    huge, then it must be a THP. But if this is a 'normal' compound page
    (COMPOUND_PAGE_DTOR), then the following loop can keep running (for
    HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
    mapped and triggers a panic:

    for (i = 0; i < hpage_nr_pages(page); i++) {
            if (atomic_read(&page[i]._mapcount) >= 0)
                    return true;
    }

    I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only with a
    custom kernel module [1] which:
    - allocates a compound page (PAGEC) of order 1
    - allocates 2 normal pages (COPY), which are initialized to 0xff (to
      satisfy _mapcount >= 0)
    - copies the 2 PAGEC page structs to the address of the first COPY page
    - marks the second page of COPY as not present
    - calls page_mapped(COPY), which now triggers a fault on access to the
      2nd COPY page at offset 0x30 (_mapcount)

    [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c

    Fix the loop to iterate over "1 << compound_order" pages.

    Kirill said: "IIRC, sound subsystem can produce custom mapped compound
    pages".

    Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com
    Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
    Signed-off-by: Jan Stancek
    Debugged-by: Laszlo Ersek
    Suggested-by: "Kirill A. Shutemov"
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: David Hildenbrand
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Stancek
     
  • commit 09c2e76ed734a1d36470d257a778aaba28e86531 upstream.

    Callers of __alloc_alien() check for NULL. We must do the same check in
    __alloc_alien_cache to avoid NULL pointer dereferences on allocation
    failures.
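
    A minimal sketch of the pattern being fixed, in plain user-space C
    (alien_cache_like and its fields are illustrative stand-ins, not the slab
    internals): initialize through the pointer only after the NULL check, so
    an allocation failure propagates to the caller's existing check.

        #include <stdio.h>
        #include <stdlib.h>

        struct alien_cache_like { int entries; int batch; };

        static struct alien_cache_like *alloc_alien_cache_like(int entries, int batch)
        {
                struct alien_cache_like *alc = malloc(sizeof(*alc));

                if (alc) {                      /* don't dereference a failed allocation */
                        alc->entries = entries;
                        alc->batch = batch;
                }
                return alc;                     /* caller already checks for NULL */
        }

        int main(void)
        {
                struct alien_cache_like *alc = alloc_alien_cache_like(8, 4);

                printf("%s\n", alc ? "allocated" : "allocation failed");
                free(alc);
                return 0;
        }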

    Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e98974906aa-000000@email.amazonses.com
    Fixes: 49dfc304ba241 ("slab: use the lock on alien_cache, instead of the lock on array_cache")
    Fixes: c8522a3a5832b ("Slab: introduce alloc_alien")
    Signed-off-by: Christoph Lameter
    Reported-by: syzbot+d6ed4ec679652b4fd4e4@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christoph Lameter
     

13 Jan, 2019

4 commits

  • commit 7af7a8e19f0c5425ff639b0f0d2d244c2a647724 upstream.

    KSM pages may be mapped to multiple VMAs that cannot all be reached from
    one anon_vma. So during swapin, a new copy of the page needs to be
    generated if a different anon_vma is needed; please refer to the comments
    of ksm_might_need_to_copy() for details.

    During swapoff, unuse_vma() uses the anon_vma (if available) to locate
    the VMA and the virtual address mapped to the page, so not all mappings
    of a swapped-out KSM page can be found. So in try_to_unuse(), even if the
    swap count of a swap entry isn't zero, the page needs to be deleted from
    the swap cache, so that in the next round a new page can be allocated and
    swapped in for the other mappings of the swapped-out KSM page.

    But this contradicts the THP swap support, where the THP can be deleted
    from the swap cache only after the swap count of every swap entry in the
    huge swap cluster backing the THP has reached 0. So try_to_unuse() was
    changed in commit e07098294adf ("mm, THP, swap: support to reclaim swap
    space for THP swapped out") to check that before deleting a page from the
    swap cache, but this broke KSM swapoff too.

    Fortunately, KSM is for normal pages only, so the original behavior for
    KSM pages can be restored easily by checking PageTransCompound(). That is
    how this patch works.

    The bug was introduced by e07098294adf ("mm, THP, swap: support to
    reclaim swap space for THP swapped out"), which was merged in v4.14-rc1.
    So I think we should backport the fix to 4.14 onward. But Hugh thinks it
    may be rare for KSM pages to be in the swap device when swapoff is run,
    so nobody has reported the bug so far.

    Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
    Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
    Signed-off-by: "Huang, Ying"
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Daniel Jordan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     
  • commit 02917e9f8676207a4c577d4d94eae12bf348e9d7 upstream.

    At Maintainer Summit, Greg brought up a topic I proposed around
    EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
    EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
    step of reclassifying an existing export. Specifically, I wanted to make
    the case that although the line is fuzzy and hard to specify in abstract
    terms, it is nonetheless clear that devm_memremap_pages() and HMM
    (Heterogeneous Memory Management) have crossed it. The
    devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
    beginning, and HMM as a derivative of that functionality should have
    naturally picked up that designation as well.

    Contrary to typical rules, the HMM infrastructure was merged upstream with
    zero in-tree consumers. There was a promise at the time that those users
    would be merged "soon", but it has been over a year with no drivers
    arriving. While the Nouveau driver is about to belatedly make good on
    that promise it is clear that HMM was targeted first and foremost at an
    out-of-tree consumer.

    HMM is derived from devm_memremap_pages(), a facility Christoph and I
    spearheaded to support persistent memory. It combines a device lifetime
    model with a dynamically created 'struct page' / memmap array for any
    physical address range. It enables coordination and control of the many
    code paths in the kernel built to interact with memory via 'struct page'
    objects. With HMM the integration goes even deeper by allowing device
    drivers to hook and manipulate page fault and page free events.

    One interpretation of when EXPORT_SYMBOL is suitable is when it is
    exporting stable and generic leaf functionality. The
    devm_memremap_pages() facility continues to see expanding use cases,
    peer-to-peer DMA being the most recent, with no clear end date when it
    will stop attracting reworks and semantic changes. It is not suitable to
    export devm_memremap_pages() as a stable 3rd party driver API due to the
    fact that it is still changing and manipulates core behavior. Moreover,
    it is not in the best interest of the long term development of the core
    memory management subsystem to permit any external driver to effectively
    define its own system-wide memory management policies with no
    encouragement to engage with upstream.

    I am also concerned that HMM was designed in a way to minimize further
    engagement with the core-MM. That, with these hooks in place,
    device-drivers are free to implement their own policies without much
    consideration for whether and how the core-MM could grow to meet that
    need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
    core-MM should be allowed the opportunity and stimulus to change and
    address these new use cases as first class functionality.

    Original changelog:

    hmm_devmem_add() and hmm_devmem_add_resource() duplicated
    devm_memremap_pages() and are now simple wrappers around the core
    facility to inject a dev_pagemap instance into the global pgmap_radix and
    hook page-idle events. The devm_memremap_pages() interface is base
    infrastructure for HMM. HMM has more and deeper ties into the kernel
    memory management implementation than base ZONE_DEVICE which is itself a
    EXPORT_SYMBOL_GPL facility.

    Originally, the HMM page structure creation routines copied the
    devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the
    implementations was discussed during the initial review:
    http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
    extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
    this cleanup to move forward.

    In addition to the integration with devm_memremap_pages() HMM depends on
    other GPL-only symbols:

    mmu_notifier_unregister_no_release
    percpu_ref
    region_intersects
    __class_create

    It goes further to consume / indirectly expose functionality that is not
    exported to any other driver:

    alloc_pages_vma
    walk_page_range

    HMM is derived from devm_memremap_pages(), and extends deep core-kernel
    fundamentals. Similar to devm_memremap_pages(), mark its entry points
    EXPORT_SYMBOL_GPL().

    [logang@deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
    Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
    Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Logan Gunthorpe
    Reviewed-by: Christoph Hellwig
    Cc: Logan Gunthorpe
    Cc: "Jérôme Glisse"
    Cc: Balbir Singh ,
    Cc: Michal Hocko
    Cc: Benjamin Herrenschmidt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 58ef15b765af0d2cbe6799ec564f1dc485010ab8 upstream.

    devm semantics arrange for resources to be torn down when
    device-driver-probe fails or when device-driver-release completes.
    Similar to devm_memremap_pages() there is no need to support an explicit
    remove operation when the users properly adhere to devm semantics.

    Note that devm_kzalloc() automatically handles allocating node-local
    memory.

    Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jérôme Glisse
    Cc: "Jérôme Glisse"
    Cc: Logan Gunthorpe
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit b15c87263a69272423771118c653e9a1d0672caa upstream.

    We have received a bug report that an injected MCE about faulty memory
    prevents memory offline from succeeding on a 4.4-based kernel. The
    underlying reason was that the HWPoison page has an elevated reference
    count and the migration keeps failing. There are two problems with that.
    First of all it is dubious to migrate the poisoned page because we know
    that accessing that memory may fail. Secondly it doesn't make any sense
    to migrate potentially broken content and preserve the memory corruption
    over to a new location.

    Oscar has found out that 4.4 and the current upstream kernels behave
    slightly differently with his simple testcase:

    ===

    int main(void)
    {
            int ret;
            int i;
            int fd;
            char *array = malloc(4096);
            char *array_locked = malloc(4096);

            fd = open("/tmp/data", O_RDONLY);
            read(fd, array, 4095);

            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
            if (ret)
                    perror("mlock");

            sleep (20);

            ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
            if (ret)
                    perror("madvise");

            for (i = 0; i < 4096; i++)
                    array_locked[i] = 'd';

            return 0;
    }
    ===

    + offline this memory.

    In 4.4 kernels he saw the hwpoisoned page being returned to the LRU
    list:
    kernel: [] dump_trace+0x59/0x340
    kernel: [] show_stack_log_lvl+0xea/0x170
    kernel: [] show_stack+0x21/0x40
    kernel: [] dump_stack+0x5c/0x7c
    kernel: [] warn_slowpath_common+0x81/0xb0
    kernel: [] __pagevec_lru_add_fn+0x14c/0x160
    kernel: [] pagevec_lru_move_fn+0xad/0x100
    kernel: [] __lru_cache_add+0x6c/0xb0
    kernel: [] add_to_page_cache_lru+0x46/0x70
    kernel: [] extent_readpages+0xc3/0x1a0 [btrfs]
    kernel: [] __do_page_cache_readahead+0x177/0x200
    kernel: [] ondemand_readahead+0x168/0x2a0
    kernel: [] generic_file_read_iter+0x41f/0x660
    kernel: [] __vfs_read+0xcd/0x140
    kernel: [] vfs_read+0x7a/0x120
    kernel: [] kernel_read+0x3b/0x50
    kernel: [] do_execveat_common.isra.29+0x490/0x6f0
    kernel: [] do_execve+0x28/0x30
    kernel: [] call_usermodehelper_exec_async+0xfb/0x130
    kernel: [] ret_from_fork+0x55/0x80

    And the latter confuses the hotremove path because an LRU page is
    attempted to be migrated and that fails due to an elevated reference
    count. It is quite possible that the reuse of the HWPoisoned page is some
    kind of fixed race condition but I am not really sure about that.

    With the upstream kernel the failure is slightly different. The page
    doesn't seem to have the LRU bit set, but isolate_movable_page() simply
    fails and do_migrate_range() puts all the isolated pages back on the LRU,
    and therefore no progress is made and scan_movable_pages() finds the same
    set of pages over and over again.

    Fix both cases by explicitly checking for HWPoisoned pages before we even
    try to get a reference on the page, and try to unmap it if it is still
    mapped. As explained by Naoya:

    : Hwpoison code never unmapped those for no big reason because
    : Ksm pages never dominate memory, so we simply didn't have strong
    : motivation to save the pages.

    Also put a WARN_ON(PageLRU) in case there is a race and we can hit LRU
    HWPoison pages, which shouldn't happen but I couldn't convince myself
    about that. Naoya has noted the following:

    : Theoretically no such guarantee, because try_to_unmap() doesn't have a
    : guarantee of success and then memory_failure() returns immediately
    : when hwpoison_user_mappings fails.
    : Or the following code (comes after hwpoison_user_mappings block) also
    : implies that the target page can still have PageLRU flag.
    :
    :         /*
    :          * Torn down by someone else?
    :          */
    :         if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
    :                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
    :                 res = -EBUSY;
    :                 goto out;
    :         }
    :
    : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
    : current version of your patch.

    Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Debugged-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

29 Dec, 2018

1 commit

  • commit 68600f623d69da428c6163275f97ca126e1a8ec5 upstream.

    I've noticed that dying memory cgroups are often pinned in memory by a
    single pagecache page. Even under moderate memory pressure they sometimes
    stayed in such a state for a long time. That looked strange.

    My investigation showed that the problem is caused by applying the LRU
    pressure balancing math:

    scan = div64_u64(scan * fraction[lru], denominator),

    where

    denominator = fraction[anon] + fraction[file] + 1.

    Because fraction[lru] is always less than denominator, if the initial scan
    size is 1, the result is always 0.

    This means the last page is not scanned and has no chance of being
    reclaimed.

    Fix this by rounding up the result of the division.

    In practice this change significantly improves the speed of dying cgroups
    reclaim.
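
    To see the arithmetic concretely, here is a tiny self-contained demo
    (div64_u64_round_up() is a local stand-in for the rounding helper, and
    the fractions are made-up numbers, not values taken from the kernel):

        #include <stdio.h>

        static unsigned long long div64_u64_round_up(unsigned long long n,
                                                     unsigned long long d)
        {
                return (n + d - 1) / d;
        }

        int main(void)
        {
                unsigned long long scan = 1;            /* one last pagecache page */
                unsigned long long fraction_lru = 60;   /* made-up pressure numbers */
                unsigned long long denominator = 60 + 40 + 1;

                /* old math: 1 * 60 / 101 == 0, the page is never scanned */
                printf("round-down: %llu\n", scan * fraction_lru / denominator);
                /* fixed math: rounds up to 1, the page still gets scanned */
                printf("round-up:   %llu\n",
                       div64_u64_round_up(scan * fraction_lru, denominator));
                return 0;
        }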

    [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
    Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
    Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

17 Dec, 2018

1 commit

  • [ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]

    init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
    'zone_idx(zone) + 1' unconditionally. This is correct in the normal case,
    but not exact in the hot-plug situation.

    This function is used in two places:

    * free_area_init_core()
    * move_pfn_range_to_zone()

    In the first case, we are sure the zone index increases monotonically.
    In the second one, this is under user control.

    One way to reproduce this is:
    ----------------------------

    1. create a virtual machine with empty node1

       -m 4G,slots=32,maxmem=32G \
       -smp 4,maxcpus=8 \
       -numa node,nodeid=0,mem=4G,cpus=0-3 \
       -numa node,nodeid=1,mem=0G,cpus=4-7

    2. hot-add cpu 3-7

       cpu-add [3-7]

    3. hot-add memory to node1

       object_add memory-backend-ram,id=ram0,size=1G
       device_add pc-dimm,id=dimm0,memdev=ram0,node=1

    4. online the memory in the following order

       echo online_movable > memory47/state
       echo online > memory40/state

    After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
    instead of (ZONE_MOVABLE + 1).
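
    The adjustment being fixed can be modelled with a tiny user-space sketch.
    init_zone_old() mirrors the unconditional assignment described above,
    while init_zone_new() models one way to avoid the problem (only ever
    growing nr_zones); the zone names and helpers are illustrative stand-ins,
    not the exact kernel patch.

        #include <stdio.h>

        enum { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE };

        static int nr_zones;

        /* old behaviour: set unconditionally, so a later lower zone shrinks it */
        static void init_zone_old(int zone_idx) { nr_zones = zone_idx + 1; }

        /* sketched fix: only ever grow nr_zones */
        static void init_zone_new(int zone_idx)
        {
                if (zone_idx + 1 > nr_zones)
                        nr_zones = zone_idx + 1;
        }

        int main(void)
        {
                /* online ZONE_MOVABLE first, then ZONE_NORMAL, as in the reproducer */
                nr_zones = 0;
                init_zone_old(ZONE_MOVABLE);
                init_zone_old(ZONE_NORMAL);
                printf("old: nr_zones = %d (expected %d)\n", nr_zones, ZONE_MOVABLE + 1);

                nr_zones = 0;
                init_zone_new(ZONE_MOVABLE);
                init_zone_new(ZONE_NORMAL);
                printf("new: nr_zones = %d\n", nr_zones);
                return 0;
        }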

    Michal said:
    "Having an incorrect nr_zones might result in all sorts of problems
    which would be quite hard to debug (e.g. reclaim not considering the
    movable zone). I do not expect many users would suffer from this it
    but still this is trivial and obviously right thing to do so
    backporting to the stable tree shouldn't be harmful (last famous
    words)"

    Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Sasha Levin

    Wei Yang
     

13 Dec, 2018

1 commit

  • [ Upstream commit 400e22499dd92613821374c8c6c88c7225359980 ]

    Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
    long") was a great step toward reducing the possibility of silent hang-up
    problems caused by memory allocation stalls. But this commit reverts it,
    for it is possible to trigger an OOM lockup and/or soft lockups when many
    threads concurrently call warn_alloc() (in order to warn about memory
    allocation stalls) due to the current implementation of printk(), and it
    is difficult to obtain useful information due to the limitation of the
    synchronous warning approach.

    The current printk() implementation flushes all pending logs using the
    context of a thread which called console_unlock(). printk() should be
    able to flush all pending logs eventually unless somebody continues
    appending to the printk() buffer.

    Since warn_alloc() started appending to the printk() buffer while waiting
    for oom_kill_process() to make forward progress when oom_kill_process()
    is processing pending logs, it became possible for warn_alloc() to force
    oom_kill_process() to loop inside printk(). As a result, warn_alloc()
    significantly increased the possibility of preventing oom_kill_process()
    from making forward progress.

    ---------- Pseudo code start ----------
    Before warn_alloc() was introduced:

    retry:
      if (mutex_trylock(&oom_lock)) {
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_lock)
      }
      goto retry;

    After warn_alloc() was introduced:

    retry:
      if (mutex_trylock(&oom_lock)) {
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_lock)
      } else if (waited_for_10seconds()) {
        atomic_inc(&printk_pending_logs);
      }
      goto retry;
    ---------- Pseudo code end ----------

    Although waited_for_10seconds() becomes true once per 10 seconds, an
    unbounded number of threads can call waited_for_10seconds() at the same
    time. Also, since threads doing waited_for_10seconds() keep spinning in
    an almost-busy loop, the thread doing print_one_log() gets little CPU
    time. Therefore, this situation can be simplified like

    ---------- Pseudo code start ----------
    retry:
      if (mutex_trylock(&oom_lock)) {
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_lock)
      } else {
        atomic_inc(&printk_pending_logs);
      }
      goto retry;
    ---------- Pseudo code end ----------

    when printk() is called faster than print_one_log() can process a log.

    One possible mitigation would be to introduce a new lock in order to make
    sure that no other series of printk() (either oom_kill_process() or
    warn_alloc()) can append to the printk() buffer when one series of
    printk() (either oom_kill_process() or warn_alloc()) is already in
    progress.

    Such serialization will also help with obtaining kernel messages in
    readable form.

    ---------- Pseudo code start ----------
    retry:
      if (mutex_trylock(&oom_lock)) {
        mutex_lock(&oom_printk_lock);
        while (atomic_read(&printk_pending_logs) > 0) {
          atomic_dec(&printk_pending_logs);
          print_one_log();
        }
        // Send SIGKILL here.
        mutex_unlock(&oom_printk_lock);
        mutex_unlock(&oom_lock)
      } else {
        if (mutex_trylock(&oom_printk_lock)) {
          atomic_inc(&printk_pending_logs);
          mutex_unlock(&oom_printk_lock);
        }
      }
      goto retry;
    ---------- Pseudo code end ----------

    But this commit does not go in that direction, for we don't want to
    introduce a new lock dependency, and we are unlikely to be able to obtain
    useful information even if we serialized oom_kill_process() and
    warn_alloc().

    The synchronous approach is prone to unexpected results (e.g. too late
    [1], too frequent [2], overlooked [3]). As far as I know, warn_alloc()
    never helped with providing information other than "something is going
    wrong". I want to consider an asynchronous approach which can obtain
    information during stalls from possibly relevant threads (e.g. the owner
    of oom_lock and kswapd-like threads) and serve as a trigger for actions
    (e.g. turn on/off tracepoints, ask the libvirt daemon to take a memory
    dump of a stalling KVM guest for diagnostic purposes).

    This commit temporarily loses the ability to report e.g. an OOM lockup
    caused by being unable to invoke the OOM killer due to a !__GFP_FS
    allocation request. But an asynchronous approach will be able to detect
    such a situation and emit a warning. Thus, let's remove warn_alloc().

    [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
    [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
    [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))

    Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reported-by: Cong Wang
    Reported-by: yuwang.yuwang
    Reported-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Sergey Senozhatsky
    Cc: Petr Mladek
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Signed-off-by: Sasha Levin

    Tetsuo Handa
     

08 Dec, 2018

7 commits

  • [fixed differently upstream, this is a work-around to resolve it for 4.14.y]

    Yongqin reported that /proc/zoneinfo format is broken in 4.14
    due to commit 7aaf77272358 ("mm: don't show nr_indirectly_reclaimable
    in /proc/vmstat")

    Node 0, zone DMA
    per-node stats
    nr_inactive_anon 403
    nr_active_anon 89123
    nr_inactive_file 128887
    nr_active_file 47377
    nr_unevictable 2053
    nr_slab_reclaimable 7510
    nr_slab_unreclaimable 10775
    nr_isolated_anon 0
    nr_isolated_file 0

    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_dirtied 6022
    nr_written 5985
    74240
    ^^^^^^^^^^
    pages free 131656

    The problem is caused by the nr_indirectly_reclaimable counter, which is
    hidden from /proc/vmstat but not from /proc/zoneinfo. Let's fix this
    inconsistency and hide the counter from /proc/zoneinfo exactly as it is
    hidden from /proc/vmstat.

    BTW, in 4.19+ the counter has been renamed and exported by commit
    b29940c1abd7 ("mm: rename and change semantics of
    nr_indirectly_reclaimable_bytes"), so there is no such problem anymore.

    Cc: # 4.14.x-4.18.x
    Fixes: 7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in /proc/vmstat")
    Reported-by: Yongqin Liu
    Signed-off-by: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit 6ff38bd40230af35e446239396e5fc8ebd6a5248 upstream.

    If all pages are deleted from the mapping by memory reclaim and also
    moved to the cleancache:

    __delete_from_page_cache()          (no shadow case)
      unaccount_page_cache_page()
        cleancache_put_page()
      page_cache_delete()
        mapping->nrpages -= nr          (nrpages becomes 0)

    We don't clean the cleancache for an inode after final file truncation
    (removal):

    truncate_inode_pages_final()
      check (nrpages || nrexceptional) is false
        no truncate_inode_pages()
          no cleancache_invalidate_inode(mapping)

    This way, when reading a new file created with the same inode, we may get
    these stale leftover pages from the cleancache and see wrong data instead
    of the contents of the new file.

    Fix it by always doing truncate_inode_pages(), which is already prepared
    for the nrpages == 0 && nrexceptional == 0 case and just invalidates the
    inode.

    [akpm@linux-foundation.org: add comment, per Jan]
    Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com
    Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache")
    Signed-off-by: Pavel Tikhomirov
    Reviewed-by: Vasily Averin
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Vasily Averin
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tikhomirov
     
  • commit 29ec90660d68bbdd69507c1c8b4e33aa299278b1 upstream.

    After the VMA to register the uffd onto is found, check that it has
    VM_MAYWRITE set before allowing registration. This way we inherit all
    common code checks before allowing to fill file holes in shmem and
    hugetlbfs with UFFDIO_COPY.

    The userfaultfd memory model is not applicable for readonly files unless
    it's a MAP_PRIVATE.

    Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
    Fixes: ff62a3421044 ("hugetlb: implement memfd sealing")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit dcf7fe9d89763a28e0f43975b422ff141fe79e43 upstream.

    Set the page dirty if VM_WRITE is not set, because in such a case the pte
    won't be marked dirty and the page would be reclaimed without writepage
    (i.e. swapout in the shmem case).

    This was found by source review. Most apps (certainly including QEMU)
    only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings, or the app can't
    modify the memory in the first place. This is for correctness and it
    could help the non-cooperative use case to avoid unexpected data loss.

    Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com
    Reviewed-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Reported-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Jann Horn
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit e2a50c1f64145a04959df2442305d57307e5395a upstream.

    With MAP_SHARED: recheck the i_size after taking the PT lock, to
    serialize against truncate with the PT lock. Delete the page from the
    pagecache if the i_size_read check fails.

    With MAP_PRIVATE: check the i_size after the PT lock before mapping
    anonymous memory or zeropages into the MAP_PRIVATE shmem mapping.

    A mostly irrelevant cleanup: like we do the delete_from_page_cache()
    pagecache removal after dropping the PT lock, the PT lock is a spinlock
    so drop it before the sleepable page lock.

    Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit 5b51072e97d587186c2f5390c8c9c1fb7e179505 upstream.

    Userfaultfd did not create private memory when UFFDIO_COPY was invoked
    on a MAP_PRIVATE shmem mapping. Instead it wrote to the shmem file,
    even when that had not been opened for writing. Though, fortunately,
    that could only happen where there was a hole in the file.

    Fix the shmem-backed implementation of UFFDIO_COPY to create private
    memory for MAP_PRIVATE mappings. The hugetlbfs-backed implementation
    was already correct.

    This change is visible to userland, if userfaultfd has been used in
    unintended ways: so it introduces a small risk of incompatibility, but
    is necessary in order to respect file permissions.

    An app that uses UFFDIO_COPY for anything like postcopy live migration
    won't notice the difference, and in fact it'll run faster because there
    will be no copy-on-write and memory waste in the tmpfs pagecache
    anymore.

    Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like
    before.

    The real zeropage can also be built on a MAP_PRIVATE shmem mapping
    through UFFDIO_ZEROPAGE and that's safe because the zeropage pte is
    never dirty, in turn even an mprotect upgrading the vma permission from
    PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte writable.

    Link: http://lkml.kernel.org/r/20181126173452.26955-3-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Jann Horn
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit 9e368259ad988356c4c95150fafd1a06af095d98 upstream.

    Patch series "userfaultfd shmem updates".

    Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
    lack of the VM_MAYWRITE check and the lack of i_size checks.

    Then looking into the above we also fixed the MAP_PRIVATE case.

    Hugh by source review also found a data loss source if UFFDIO_COPY is
    used on shmem MAP_SHARED PROT_READ mappings (the production usages
    incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
    happen in those production usages like with QEMU).

    The whole patchset is marked for stable.

    We verified that QEMU postcopy live migration with the guest running on
    shmem MAP_PRIVATE runs as well as before, after the fix of shmem
    MAP_PRIVATE. Regardless of whether it's shmem or hugetlbfs, or
    MAP_PRIVATE or MAP_SHARED, QEMU unconditionally invokes a punch hole if
    the guest mapping is file-backed, and a MADV_DONTNEED too (needed to get
    rid of the MAP_PRIVATE COWs and for the anon backend).

    This patch (of 5):

    We internally used EFAULT to communicate with the caller, switch to
    ENOENT, so EFAULT can be used as a non internal retval.

    Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc: Mike Kravetz
    Cc: Jann Horn
    Cc: Peter Xu
    Cc: "Dr. David Alan Gilbert"
    Cc:
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

06 Dec, 2018

11 commits

  • commit c1cb20d43728aa9b5393bd8d489bc85c142949b2 upstream.

    We changed the key of swap cache tree from swp_entry_t.val to
    swp_offset. We need to do so in shmem_replace_page() as well.

    Hugh said:
    "shmem_replace_page() has been wrong since the day I wrote it: good
    enough to work on swap "type" 0, which is all most people ever use
    (especially those few who need shmem_replace_page() at all), but
    broken once there are any non-0 swp_type bits set in the higher order
    bits"

    Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com
    Fixes: f6ab1f7f6b2d ("mm, swap: use offset of swap entry as key of swap cache")
    Signed-off-by: Yu Zhao
    Reviewed-by: Matthew Wilcox
    Acked-by: Hugh Dickins
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yu Zhao
     
  • commit 06a5e1268a5fb9c2b346a3da6b97e85f2eba0f07 upstream.

    collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
    it holds page lock of the first page, racing truncation then extension
    might conceivably have inserted a hugepage there already. Fail with the
    SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
    otherwise mishandling the unexpected hugepage - though later we might
    code up a more constructive way of handling it, with SCAN_SUCCESS.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 87c460a0bded56195b5eb497d44709777ef7b415 upstream.

    khugepaged's collapse_shmem() does almost all of its work, to assemble
    the huge new_page from 512 scattered old pages, with the new_page's
    refcount frozen to 0 (and refcounts of all old pages so far also frozen
    to 0). Including shmem_getpage() to read in any which were out on swap,
    memory reclaim if necessary to allocate their intermediate pages, and
    copying over all the data from old to new.

    Imagine the frozen refcount as a spinlock held, but without any lock
    debugging to highlight the abuse: it's not good, and under serious load
    heads into lockups - speculative getters of the page are not expecting
    to spin while khugepaged is rescheduled.

    One can get a little further under load by hacking around elsewhere; but
    fortunately, freezing the new_page turns out to have been entirely
    unnecessary, with no hacks needed elsewhere.

    The huge new_page lock is already held throughout, and guards all its
    subpages as they are brought one by one into the page cache tree; and
    anything reading the data in that page, without the lock, before it has
    been marked PageUptodate, would already be in the wrong. So simply
    eliminate the freezing of the new_page.

    Each of the old pages remains frozen with refcount 0 after it has been
    replaced by a new_page subpage in the page cache tree, until they are
    all unfrozen on success or failure: just as before. They could be
    unfrozen sooner, but cause no problem once no longer visible to
    find_get_entry(), filemap_map_pages() and other speculative lookups.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 042a30824871fa3149b0127009074b75cc25863c upstream.

    Several cleanups in collapse_shmem(): most of which probably do not
    really matter, beyond doing things in a more familiar and reassuring
    order. Simplify the failure gotos in the main loop, and on success
    update stats while interrupts still disabled from the last iteration.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 2af8ff291848cc4b1cce24b6c943394eb2c761e8 upstream.

    Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
    flags khugepaged uses to allocate a huge page - in all common cases it
    would just be a waste of effort - so collapse_shmem() must remember to
    clear out any holes that it instantiates.

    The obvious place to do so, where they are put into the page cache tree,
    is not a good choice: because interrupts are disabled there. Leave it
    until further down, once success is assured, where the other pages are
    copied (before setting PageUptodate).

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
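
    A minimal sketch of the copy-or-clear step described above, assuming
    illustrative names and subpage arithmetic; it is not the actual
    khugepaged code, only the idea that holes must be zeroed before
    PageUptodate is set, because the huge page was not allocated with
    __GFP_ZERO.

        #include <linux/highmem.h>
        #include <linux/mm.h>

        static void sketch_fill_huge_page(struct page *new_page,
                                          struct page **old, int nr)
        {
                int i;

                for (i = 0; i < nr; i++) {
                        if (old[i])
                                copy_highpage(new_page + i, old[i]);
                        else
                                /* hole: zero it explicitly */
                                clear_highpage(new_page + i);
                }
                /* all data (and zeroes) in place before anyone may read */
                SetPageUptodate(new_page);
        }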
     
  • commit aaa52e340073b7f4593b3c4ddafcafa70cf838b5 upstream.

    Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
    hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
    clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
    closed and unlinked.

    khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
    on the rollback path, after it had added holes which it then needed to
    undo.

    There is indeed an irritating asymmetry between shmem_charge(), whose
    callers want it to increment nrpages after successfully accounting
    blocks, and shmem_uncharge(), which is called after
    __delete_from_page_cache() has already decremented nrpages itself: oh
    well, just add a comment on that to them both.

    And shmem_recalc_inode() is supposed to be called when the accounting is
    expected to be in balance (so it can deduce from imbalance that reclaim
    discarded some pages): so change shmem_charge() to update nrpages
    earlier (though it's rare for the difference to matter at all).

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
    Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
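
    A sketch of the accounting idea only (not the actual patch), with a
    hypothetical helper name: if nr_none holes were inserted into the page
    cache and the collapse is then rolled back, mapping->nrpages has to be
    wound back by the same amount, under the page cache tree lock (elided
    here).

        #include <linux/fs.h>
        #include <linux/mm.h>

        static void sketch_rollback_hole_accounting(struct address_space *mapping,
                                                    unsigned long nr_none)
        {
                if (nr_none)
                        mapping->nrpages -= nr_none;    /* undo the holes */
        }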
     
  • commit 701270fa193aadf00bdcf607738f64997275d4c7 upstream.

    Huge tmpfs testing showed that although collapse_shmem() recognizes a
    concurrently truncated or hole-punched page correctly, its handling of
    holes was liable to refill an emptied extent. Add a check to stop that.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Matthew Wilcox
    Cc: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 006d3ff27e884f80bd7d306b041afc415f63598f upstream.

    Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
    __split_huge_page() was using i_size_read() while holding the irq-safe
    lru_lock and page tree lock, but the 32-bit i_size_read() uses an
    irq-unsafe seqlock which should not be nested inside them.

    Instead, read the i_size earlier in split_huge_page_to_list(), and pass
    the end offset down to __split_huge_page(): all while holding the head page
    lock, which is enough to prevent truncation of that extent before the
    page tree lock has been taken.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
    Fixes: baa355fd33142 ("thp: file pages support for split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
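
    A simplified sketch of the lock ordering described above, with a
    hypothetical function name; it is not the actual
    split_huge_page_to_list(), only the idea of reading i_size under the
    head page lock and passing the end offset down.

        #include <linux/fs.h>
        #include <linux/kernel.h>
        #include <linux/mm.h>

        static void sketch_read_end_before_locks(struct page *head, pgoff_t *end)
        {
                struct address_space *mapping = head->mapping;

                /* On 32-bit, i_size_read() takes an irq-unsafe seqlock, so
                 * read it here, under only the head page lock, never inside
                 * the irq-safe lru_lock / page tree lock. */
                *end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);

                /* ... take lru_lock and the tree lock, pass *end down ... */
        }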
     
  • commit 173d9d9fd3ddae84c110fea8aedf1f26af6be9ec upstream.

    Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
    VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).

    Move the setting of mapping and index up before the page_ref_unfreeze()
    in __split_huge_page_tail() to fix this: so that a page cache lookup
    cannot get a reference while the tail's mapping and index are unstable.

    In fact, might as well move them up before the smp_wmb(): I don't see an
    actual need for that, but if I'm missing something, this way round is
    safer than the other, and no less efficient.

    You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is
    misplaced, and should be left until after the trylock_page(); but left as
    it is, it has not crashed since, and it gives more stringent assurance.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Requires: 605ca5ede764 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Jerome Glisse
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
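
    A simplified sketch of the publication order described above, with a
    hypothetical function name; it is not the actual
    __split_huge_page_tail().

        #include <asm/barrier.h>
        #include <linux/mm.h>
        #include <linux/page_ref.h>

        static void sketch_publish_tail(struct page *head, struct page *tail,
                                        int refcount)
        {
                /* make mapping and index stable first ... */
                tail->mapping = head->mapping;
                tail->index = head->index + (tail - head);

                /* ... order those stores before the refcount store ... */
                smp_wmb();

                /* ... and only now let speculative lookups take a reference */
                page_ref_unfreeze(tail, refcount);
        }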
     
  • commit 605ca5ede7643a01f4c4a15913f9714ac297f8a6 upstream.

    THP split makes a non-atomic change of the tail page flags. This is
    almost OK because the tail pages are locked and isolated, but it breaks
    recent changes in page locking: the non-atomic operation could clear
    the PG_waiters bit.

    As a result the concurrent sequence get_page_unless_zero() ->
    lock_page() might block forever, especially if this page was truncated
    later.

    The fix is trivial: clone the flags before unfreezing the page
    reference counter.

    This race has existed since commit 62906027091f ("mm: add PageWaiters
    indicating tasks are waiting for a page bit"), while the unsafe
    unfreeze itself was added in commit 8df651c7059e ("thp: cleanup
    split_huge_page()").

    clear_compound_head() must also be called before unfreezing the page
    reference, because a successful get_page_unless_zero() may be followed
    by put_page(), which needs a correct compound_head().

    And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(),
    which is made especially for this and has the semantics of
    smp_store_release().

    Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Konstantin Khlebnikov
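
    For reference, a sketch of what "has the semantics of
    smp_store_release()" means for the unfreeze; the real helper is
    page_ref_unfreeze() in include/linux/page_ref.h, this is only an
    illustrative reimplementation under a different name.

        #include <linux/atomic.h>
        #include <linux/mm_types.h>

        static inline void sketch_page_ref_unfreeze(struct page *page, int count)
        {
                /* All earlier stores (cloned flags, cleared compound_head,
                 * ...) must be visible before the non-zero refcount is, so
                 * a get_page_unless_zero() caller never sees a half-built
                 * page. */
                atomic_set_release(&page->_refcount, count);
        }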
     
  • commit 906f9cdfc2a0800f13683f9e4ebdfd08c12ee81b upstream.

    The term "freeze" is used in several ways in the kernel, and in mm it
    has the particular meaning of forcing page refcount temporarily to 0.
    freeze_page() is just too confusing a name for a function that unmaps a
    page: rename it unmap_page(), and rename unfreeze_page() remap_page().

    I went to change the mention of freeze_page() added later in mm/rmap.c,
    but found it to be incorrect: ordinary page reclaim reaches there too.
    The substance of the comment still seems correct, though, so edit it
    down.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

01 Dec, 2018

4 commits

  • [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]

    Konstantin has noticed that kvmalloc might trigger the following
    warning:

    WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
    [...]
    Call Trace:
    fragmentation_index+0x76/0x90
    compaction_suitable+0x4f/0xf0
    shrink_node+0x295/0x310
    node_reclaim+0x205/0x250
    get_page_from_freelist+0x649/0xad0
    __alloc_pages_nodemask+0x12a/0x2a0
    kmalloc_large_node+0x47/0x90
    __kmalloc_node+0x22b/0x2e0
    kvmalloc_node+0x3e/0x70
    xt_alloc_table_info+0x3a/0x80 [x_tables]
    do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
    nf_setsockopt+0x44/0x60
    SyS_setsockopt+0x6f/0xc0
    do_syscall_64+0x67/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The problem is that we only check for an out-of-bounds order in the
    slow path, while node reclaim might already happen from the fast path.
    This is fixable by making sure that kvmalloc never uses kmalloc for
    requests larger than KMALLOC_MAX_SIZE, but it also shows that the code
    is rather fragile. A recent UBSAN report underlines that:

    UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
    shift exponent 51 is too large for 32-bit type 'int'
    CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xd2/0x148 lib/dump_stack.c:113
    ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
    __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
    __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
    zone_watermark_fast mm/page_alloc.c:3216 [inline]
    get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
    __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
    alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
    alloc_pages include/linux/gfp.h:509 [inline]
    __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
    dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
    raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
    raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
    fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
    fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
    __blkdev_driver_ioctl block/ioctl.c:303 [inline]
    blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
    block_ioctl+0x105/0x150 fs/block_dev.c:1883
    vfs_ioctl fs/ioctl.c:46 [inline]
    do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
    ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
    __do_sys_ioctl fs/ioctl.c:709 [inline]
    __se_sys_ioctl fs/ioctl.c:707 [inline]
    __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
    do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Note that this is not a kvmalloc path. It is just that the fast path
    really depends on having a sanitized order as well. Therefore move the
    order check to the fast path.

    Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Konstantin Khlebnikov
    Reported-by: Kyungtae Kim
    Acked-by: Vlastimil Babka
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Byoungyoung Lee
    Cc: "Dae R. Jeong"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
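
    A sketch of the kind of early sanity check the fix moves into the
    allocator fast path; simplified, with a hypothetical function name, and
    not the exact upstream diff.

        #include <linux/gfp.h>
        #include <linux/mm.h>

        static struct page *sketch_alloc_pages(gfp_t gfp_mask, unsigned int order)
        {
                /* Watermark checks, the fragmentation index and other
                 * helpers reachable from the fast path assume a sane order,
                 * so bail out before any of them can see a bogus value. */
                if (unlikely(order >= MAX_ORDER)) {
                        WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
                        return NULL;
                }

                /* ... normal fast-path freelist allocation continues ... */
                return NULL;
        }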
     
  • [ Upstream commit 1a413646931cb14442065cfc17561e50f5b5bb44 ]

    Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
    lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.

    man 2 lseek says

    : EINVAL whence is not valid. Or: the resulting file offset would be
    : negative, or beyond the end of a seekable device.
    :
    : ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
    : the end of the file.

    Make tmpfs return ENXIO under these circumstances as well. After this,
    tmpfs also passes xfstests's generic/448.

    [akpm@linux-foundation.org: rewrite changelog]
    Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
    Signed-off-by: Yufen Yu
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Yufen Yu
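
    A small userspace demo of the semantics described above; the file path
    is an assumption (it presumes /tmp is a tmpfs mount), and on kernels
    without the fix the reported errno will differ.

        #define _GNU_SOURCE             /* for SEEK_DATA */
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/tmp/lseek-enxio-test", O_RDWR | O_CREAT, 0600);

                if (fd < 0)
                        return 1;

                /* with the fix, tmpfs reports ENXIO here, like ext4/f2fs/ubifs */
                if (lseek(fd, -1, SEEK_DATA) == (off_t)-1)
                        printf("SEEK_DATA at offset -1: %s\n", strerror(errno));

                unlink("/tmp/lseek-enxio-test");
                close(fd);
                return 0;
        }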
     
  • [ Upstream commit ca0246bb97c23da9d267c2107c07fb77e38205c9 ]

    Reclaim and free can race on an object, which is basically fine, but in
    order for reclaim to be able to map a "freed" object we need to encode
    the object length in the handle. handle_to_chunks() is then introduced
    to extract the object length from a handle for use during mapping.

    Moreover, to avoid racing on a z3fold "headless" page release, we should
    not try to free that page in z3fold_free() if the reclaim bit is set.
    Also, in the unlikely case of trying to reclaim a page being freed, we
    should not proceed with that page.

    While at it, fix the page accounting in reclaim function.

    This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".

    Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com
    Signed-off-by: Vitaly Wool
    Signed-off-by: Jongseok Kim
    Reported-by: Jongseok Kim
    Reviewed-by: Snild Dolkow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vitaly Wool
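
    An illustration of the general bit-packing idea (recording a small
    length, in chunks, in the low bits of an aligned handle so a "freed"
    object can still be mapped for its recorded size during reclaim); the
    chunk size and layout here are made up and are not the actual z3fold
    handle format.

        #include <stdint.h>

        #define CHUNK_SHIFT     6       /* pretend chunks are 64 bytes */
        #define CHUNK_MASK      ((1UL << CHUNK_SHIFT) - 1)

        /* the address is chunk aligned, so its low bits are free */
        static inline uintptr_t encode_handle(void *aligned_addr,
                                              unsigned int nchunks)
        {
                return (uintptr_t)aligned_addr | (nchunks & CHUNK_MASK);
        }

        static inline void *handle_to_addr(uintptr_t handle)
        {
                return (void *)(handle & ~CHUNK_MASK);
        }

        static inline unsigned int handle_to_chunks(uintptr_t handle)
        {
                return handle & CHUNK_MASK;
        }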
     
  • commit ff09d7ec9786be4ad7589aa987d7dc66e2dd9160 upstream.

    We clear the pte temporarily during a read/modify/write update of the
    pte. If we take a page fault while the pte is cleared, the application
    can get SIGBUS. One such case is a remap_pfn_range() mapping without a
    backing vm_ops->fault callback: do_fault() will return SIGBUS in that
    case.

    CPU 0                                  CPU 1
    mprotect()
      ptep_modify_prot_start() / pte cleared
      .                                    page fault
      .                                    sees pte_none -> SIGBUS
      ptep_modify_prot_commit()

    Fix this by taking the page table lock and rechecking for pte_none.

    [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
    Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
    Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Willem de Bruijn
    Cc: Eric Dumazet
    Cc: Ido Schimmel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
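
    A simplified sketch of the "recheck under the page table lock" pattern
    the fix describes, with a hypothetical function name; it is not the
    actual do_fault() change.

        #include <asm/pgtable.h>
        #include <linux/mm.h>

        static int sketch_fault_without_handler(struct vm_fault *vmf)
        {
                int ret = VM_FAULT_SIGBUS;

                /* The pte may only look empty because another CPU cleared
                 * it transiently, e.g. in ptep_modify_prot_start() during
                 * mprotect(); recheck with the page table lock held before
                 * declaring SIGBUS. */
                vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
                                               vmf->address, &vmf->ptl);
                if (!pte_none(*vmf->pte))
                        ret = VM_FAULT_NOPAGE;  /* raced with a real pte */
                pte_unmap_unlock(vmf->pte, vmf->ptl);

                return ret;
        }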