22 Jul, 2018

1 commit

  • commit 3ee7e8697d5860b173132606d80a9cd35e7113ee upstream.

    syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
    wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
    WB_shutting_down after wb->bdi->dev became NULL. This indicates that
    unregister_bdi() failed to call wb_shutdown() on one of wb objects.

    The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
    drops bdi's reference to wb structures before going through the list of
    wbs again and calling wb_shutdown() on each of them. This way the loop
    iterating through all wbs can easily miss a wb if that wb has already
    passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
    from cgwb_release_workfn() and as a result fully shutdown bdi although
    wb_workfn() for this wb structure is still running. In fact there are
    also other ways cgwb_bdi_unregister() can race with
    cgwb_release_workfn() leading e.g. to use-after-free issues:

    CPU1                                            CPU2
                                                    cgwb_bdi_unregister()
                                                      cgwb_kill(*slot);

    cgwb_release()
      queue_work(cgwb_release_wq, &wb->release_work);
    cgwb_release_workfn()
                                                      wb = list_first_entry(&bdi->wb_list, ...)
                                                      spin_unlock_irq(&cgwb_lock);
      wb_shutdown(wb);
    ...
      kfree_rcu(wb, rcu);
                                                      wb_shutdown(wb); -> oops use-after-free

    We solve these issues by synchronizing writeback structure shutdown from
    cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
    way we also no longer need synchronization using WB_shutting_down as the
    mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
    CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
    bdi_unregister().
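
    A minimal sketch of the serialization described above (the mutex name and
    its placement are illustrative assumptions, not the literal patch): both
    teardown paths take one mutex around wb_shutdown() and the list removal,
    so neither path can observe a wb that the other is halfway through
    tearing down.

    /* hedged sketch, not the actual patch */
    static DEFINE_MUTEX(cgwb_shutdown_mutex);        /* hypothetical name */

    static void release_side_teardown(struct bdi_writeback *wb)
    {
            mutex_lock(&cgwb_shutdown_mutex);
            wb_shutdown(wb);                 /* serialized with unregister */
            list_del_rcu(&wb->bdi_node);     /* cgwb_remove_from_bdi_list() */
            mutex_unlock(&cgwb_shutdown_mutex);
    }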

    Reported-by: syzbot
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

17 Jul, 2018

2 commits

  • commit bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.

    syzbot has noticed that a specially crafted library can easily hit
    VM_BUG_ON in __mm_populate

    kernel BUG at mm/gup.c:1242!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
    RIP: 0010:__mm_populate+0x1e2/0x1f0
    Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
    Call Trace:
    vm_brk_flags+0xc3/0x100
    vm_brk+0x1f/0x30
    load_elf_library+0x281/0x2e0
    __ia32_sys_uselib+0x170/0x1e0
    do_fast_syscall_32+0xca/0x420
    entry_SYSENTER_compat+0x70/0x7f

    The reason is that the length of the new brk is not page aligned when we
    try to populate it. There is no reason to BUG on that though.
    do_brk_flags already aligns the length properly so the mapping is
    expanded as it should. All we need is to tell mm_populate about it.
    Besides that, there is absolutely no reason to BUG_ON in the first
    place. The worst thing that could happen is that the last page wouldn't
    get populated, and that is far from putting the system into an
    inconsistent state.

    Fix the issue by moving the length sanitization code from do_brk_flags
    up to vm_brk_flags. The only other caller of do_brk_flags is the brk
    syscall entry, and it makes sure to provide a proper length, so there
    is no need for sanitization there and we can use do_brk_flags without it.

    Also remove the bogus BUG_ONs.
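
    A minimal sketch of the sanitization after the move (function and
    variable names follow the text above; the body is an assumption, not the
    exact diff):

    /* hedged sketch: align once in vm_brk_flags(), before both
     * do_brk_flags() and mm_populate() see the length
     */
    static int vm_brk_flags_sketch(unsigned long addr, unsigned long request,
                                   unsigned long flags)
    {
            unsigned long len = PAGE_ALIGN(request);

            if (!len)
                    return 0;        /* zero-length request: nothing to do */

            /* ... do_brk_flags(addr, len, flags), then
             *     mm_populate(addr, len) when VM_LOCKED is set ... */
            return 0;
    }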

    [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
    Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: syzbot
    Tested-by: Tetsuo Handa
    Reviewed-by: Oscar Salvador
    Cc: Zi Yan
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Michael S. Tsirkin
    Cc: Al Viro
    Cc: "Huang, Ying"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit bce73e4842390f7b7309c8e253e139db71288ac3 upstream.

    KVM guests on s390 can notify the host of unused pages. This can result
    in pte_unused callbacks to be true for KVM guest memory.

    If a page is unused (checked with pte_unused) we might drop this page
    instead of paging it out. This can have side effects on userfaultfd, when
    the page in question was already migrated:

    The next access of that page will trigger a fault and a user fault
    instead of faulting in a new and empty zero page. As QEMU does not
    expect a userfault on an already migrated page this migration will fail.

    The most straightforward solution is to ignore the pte_unused hint if a
    userfault context is active for this VMA.
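
    A minimal sketch of that check (placement and helper name are
    assumptions; pte_unused() is named above, and userfaultfd_armed() is the
    usual way to test for an armed userfaultfd context on a VMA):

    /* hedged sketch: honour the pte_unused() hint only when no userfaultfd
     * context is armed on this VMA
     */
    static bool can_discard_unused_pte(pte_t pteval, struct vm_area_struct *vma)
    {
            return pte_unused(pteval) && !userfaultfd_armed(vma);
    }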

    Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
    Signed-off-by: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Janosch Frank
    Cc: David Hildenbrand
    Cc: Cornelia Huck
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     

11 Jul, 2018

3 commits

  • commit 28557cc106e6d2aa8b8c5c7687ea9f8055ff3911 upstream.

    Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption
    BUG"). Steven saw a "using smp_processor_id() in preemptible" message
    and added a preempt_disable() section around it to keep it quiet. This
    is not the right thing to do, and it does not fix the real problem.

    vmstat_update() is invoked by a kworker on a specific CPU. This worker
    is bound to that CPU. The name of the worker was "kworker/1:1" so it
    should have been a worker which was bound to CPU1. A worker which can
    run on any CPU would have a `u' before the first digit.

    smp_processor_id() can be used in a preempt-enabled region as long as
    the task is bound to a single CPU, which is the case here. If it could
    run on an arbitrary CPU then this is the problem we have and should seek
    to resolve.

    Not only must this smp_processor_id() caller not be migrated to another
    CPU, but also refresh_cpu_vm_stats(), which might otherwise access the
    wrong per-CPU variables. Not to mention that other code relies on the
    fact that such a worker runs on one specific CPU only.

    Therefore revert that commit and instead look at what broke the
    affinity mask of the kworker.

    Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Steven J. Hill
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     
  • commit 31286a8484a85e8b4e91ddb0f5415aee8a416827 upstream.

    Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
    a 1GB hugepage. This happens because get_user_pages_fast() is not aware
    of a migration entry on pud that was created in the 1st madvise() event.

    I think that the conversion to a pud-aligned migration entry is working,
    but other MM code walking over page tables isn't prepared for it. We
    need some time and effort to make all of this work properly, so this
    patch avoids the reported bug by just disabling error handling for 1GB
    hugepages.

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 520495fe96d74e05db585fc748351e0504d8f40d upstream.

    When booting with very large numbers of gigantic (i.e. 1G) pages, the
    operations in the loop of gather_bootmem_prealloc, and specifically
    prep_compound_gigantic_page, take a very long time and can cause a
    softlockup if enough pages are requested at boot.

    For example, booting with 3844 1G pages requires prepping
    (set_compound_head, init the count) over 1 billion 4K tail pages, which
    takes considerable time.

    Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to
    prevent this lockup.

    Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and
    no softlockup is reported, and the hugepages are reported as
    successfully setup.
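
    A minimal sketch of the loop after the change (the loop shape follows
    gather_bootmem_prealloc(); details are elided):

    /* hedged sketch: yield between gigantic pages so prepping ~1 billion
     * tail pages cannot trip the softlockup detector
     * (m is the struct huge_bootmem_page cursor of that loop)
     */
    list_for_each_entry(m, &huge_boot_pages, list) {
            /* ... prep_compound_gigantic_page() etc. for this entry ... */
            cond_resched();
    }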

    Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com
    Signed-off-by: Cannon Matthews
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Cc: Greg Thelen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Cannon Matthews
     

03 Jul, 2018

3 commits

  • commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0 upstream.

    In kernel 4.17 I removed some code from dm-bufio that did slab cache
    merging (commit 21bb13276768: "dm bufio: remove code that merges slab
    caches") - both slab and slub support merging caches with identical
    attributes, so dm-bufio now just calls kmem_cache_create and relies on
    implicit merging.

    This uncovered a bug in the slub subsystem - if we delete a cache and
    immediately create another cache with the same attributes, it fails
    because of a duplicate filename in /sys/kernel/slab/. The slub subsystem
    offloads freeing the cache to a workqueue - and if we create the new
    cache before the workqueue runs, it complains because of duplicate
    filename in sysfs.

    This patch fixes the bug by moving the call of kobject_del from
    sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
    while we hold slab_mutex - so that the sysfs entry is deleted before a
    cache with the same attributes could be created.

    Running device-mapper-test-suite with:

    dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

    triggered:

    Buffer I/O error on dev dm-0, logical block 1572848, async page read
    device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
    device-mapper: thin: 253:1: aborting current metadata transaction
    sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
    CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
    Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
    Workqueue: dm-thin do_worker [dm_thin_pool]
    Call Trace:
    dump_stack+0x5a/0x73
    sysfs_warn_dup+0x58/0x70
    sysfs_create_dir_ns+0x77/0x80
    kobject_add_internal+0xba/0x2e0
    kobject_init_and_add+0x70/0xb0
    sysfs_slab_add+0xb1/0x250
    __kmem_cache_create+0x116/0x150
    create_cache+0xd9/0x1f0
    kmem_cache_create_usercopy+0x1c1/0x250
    kmem_cache_create+0x18/0x20
    dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
    dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
    __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
    dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
    metadata_operation_failed+0x59/0x100 [dm_thin_pool]
    alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
    process_cell+0x2a3/0x550 [dm_thin_pool]
    do_worker+0x28d/0x8f0 [dm_thin_pool]
    process_one_work+0x171/0x370
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
    kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
    kmem_cache_create(dm_bufio_buffer-16) failed with error -17

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Reported-by: Mike Snitzer
    Tested-by: Mike Snitzer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 1105a2fc022f3c7482e32faf516e8bc44095f778 upstream.

    In our armv8a server(QDF2400), I noticed lots of WARN_ON caused by
    PAGE_SIZE unaligned for rmap_item->address under memory pressure
    tests(start 20 guests and run memhog in the host).

    WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
    CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8
    Call trace:
    kvm_age_hva_handler+0xc0/0xc8
    handle_hva_to_gpa+0xa8/0xe0
    kvm_age_hva+0x4c/0xe8
    kvm_mmu_notifier_clear_flush_young+0x54/0x98
    __mmu_notifier_clear_flush_young+0x6c/0xa0
    page_referenced_one+0x154/0x1d8
    rmap_walk_ksm+0x12c/0x1d0
    rmap_walk+0x94/0xa0
    page_referenced+0x194/0x1b0
    shrink_page_list+0x674/0xc28
    shrink_inactive_list+0x26c/0x5b8
    shrink_node_memcg+0x35c/0x620
    shrink_node+0x100/0x430
    do_try_to_free_pages+0xe0/0x3a8
    try_to_free_pages+0xe4/0x230
    __alloc_pages_nodemask+0x564/0xdc0
    alloc_pages_vma+0x90/0x228
    do_anonymous_page+0xc8/0x4d0
    __handle_mm_fault+0x4a0/0x508
    handle_mm_fault+0xf8/0x1b0
    do_page_fault+0x218/0x4b8
    do_translation_fault+0x90/0xa0
    do_mem_abort+0x68/0xf0
    el0_da+0x24/0x28

    In rmap_walk_ksm, the rmap_item->address might still have the
    STABLE_FLAG, then the start and end in handle_hva_to_gpa might not be
    PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa
    on arm64.

    This patch fixes it by ignoring (not removing) the low bits of address
    when doing rmap_walk_ksm.
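
    A minimal sketch of the idea (PAGE_MASK is used here for illustration;
    the exact mask in the patch may differ): compute a page-aligned copy of
    the address for the walk while leaving rmap_item->address, and its
    STABLE_FLAG, untouched.

    /* hedged sketch, not the literal diff */
    unsigned long addr = rmap_item->address & PAGE_MASK;
    /* pass addr, not rmap_item->address, to the rmap callbacks */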

    IMO, it should be backported to the stable tree: the storm of WARN_ONs
    is very easy for me to reproduce. More than that, I watched a panic (not
    reproducible) as follows:

    page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
    flags: 0x1fffc00000000000()
    raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
    raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
    page dumped because: nonzero _refcount
    CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
    Hardware name:
    Call trace:
    dump_backtrace+0x0/0x22c
    show_stack+0x24/0x2c
    dump_stack+0x8c/0xb0
    bad_page+0xf4/0x154
    free_pages_check_bad+0x90/0x9c
    free_pcppages_bulk+0x464/0x518
    free_hot_cold_page+0x22c/0x300
    __put_page+0x54/0x60
    unmap_stage2_range+0x170/0x2b4
    kvm_unmap_hva_handler+0x30/0x40
    handle_hva_to_gpa+0xb0/0xec
    kvm_unmap_hva_range+0x5c/0xd0

    I even injected a fault on purpose in kvm_unmap_hva_range by setting
    size=size-0x200; the call trace is similar to the above. So I think the
    panic is caused by the same root cause as the WARN_ON.

    Andrea said:

    : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
    : zap those bits and hide the misalignment caused by the low metadata
    : bits being erroneously left set in the address, but the arm code
    : notices when that's the last page in the memslot and the hva_end is
    : getting aligned and the size is below one page.
    :
    : I think the problem triggers in the addr += PAGE_SIZE of
    : unmap_stage2_ptes that never matches end because end is aligned but
    : addr is not.
    :
    : } while (pte++, addr += PAGE_SIZE, addr != end);
    :
    : x86 again only works on hva_start/hva_end after converting it to
    : gfn_start/end and that being in pfn units the bits are zapped before
    : they risk to cause trouble.

    Jia He said:

    : I've tested by myself on an arm64 server (QDF2400, 46 cpus, 96G mem).
    : Without this patch, the WARN_ON is very easy to reproduce. After this
    : patch, I have run the same benchmark for a whole day without any WARN_ONs

    Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Andrea Arcangeli
    Tested-by: Jia He
    Cc: Suzuki K Poulose
    Cc: Minchan Kim
    Cc: Claudio Imbrenda
    Cc: Arvind Yadav
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jia He
     
  • commit a9b6de77b1a3ff729f7bfc54b2e17711776a416c upstream.

    get_user_pages_fast() for device pages is missing the typical validation
    that all page references have been taken while the mapping was valid.
    Without this validation truncate operations can not reliably coordinate
    against new page reference events like O_DIRECT.

    Cc:
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Reported-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

26 Jun, 2018

2 commits

  • commit 7810e6781e0fcbca78b91cf65053f895bf59e85f upstream.

    In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
    allocations that can ignore memory policies. The zonelist is obtained
    from current CPU's node. This is a problem for __GFP_THISNODE
    allocations that want to allocate on a different node, e.g. because the
    allocating thread has been migrated to a different CPU.

    This has been observed to break SLAB in our 4.4-based kernel, because
    there it relies on __GFP_THISNODE working as intended. If a slab page
    is put on wrong node's list, then further list manipulations may corrupt
    the list because page_to_nid() is used to determine which node's
    list_lock should be locked and thus we may take a wrong lock and race.

    Current SLAB implementation seems to be immune by luck thanks to commit
    511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
    arbitrary node") but there may be others assuming that __GFP_THISNODE
    works as promised.

    We can fix it by simply removing the zonelist reset completely. There
    is actually no reason to reset it, because memory policies and cpusets
    don't affect the zonelist choice in the first place. This was different
    when commit 183f6371aac2 ("mm: ignore mempolicies when using
    ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
    own restricted zonelists.

    We might consider this for 4.17 although I don't know if there's
    anything currently broken.

    SLAB is currently not affected, but in kernels older than 4.7 that don't
    yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
    allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
    ones I'll have to check.

    So stable backports should be more important, but will have to be
    reviewed carefully, as the code went through many changes. BTW I think
    that also the ac->preferred_zoneref reset is currently useless if we
    don't also reset ac->nodemask from a mempolicy to NULL first (which we
    probably should for the OOM victims etc?), but I would leave that for a
    separate patch.

    Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit f183464684190bacbfb14623bd3e4e51b7575b4c upstream.

    cgwb_release() punts the actual release to cgwb_release_workfn() on
    system_wq. Depending on the number of cgroups or block devices, there
    can be a lot of cgwb_release_workfn() in flight at the same time.

    We're periodically seeing close to 256 kworkers getting stuck with the
    following stack trace, and over time the entire system gets stuck.

    [] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
    [] synchronize_rcu_expedited+0x24/0x30
    [] bdi_unregister+0x53/0x290
    [] release_bdi+0x89/0xc0
    [] wb_exit+0x85/0xa0
    [] cgwb_release_workfn+0x54/0xb0
    [] process_one_work+0x150/0x410
    [] worker_thread+0x6d/0x520
    [] kthread+0x12c/0x160
    [] ret_from_fork+0x29/0x40
    [] 0xffffffffffffffff

    The events leading to the lockup are...

    1. A lot of cgwb_release_workfn() is queued at the same time and all
    system_wq kworkers are assigned to execute them.

    2. They all end up calling synchronize_rcu_expedited(). One of them
    wins and tries to perform the expedited synchronization.

    3. However, that involves queueing rcu_exp_work to system_wq and
    waiting for it. Because #1 is holding all available kworkers on
    system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
    is waiting for synchronize_rcu_expedited() which in turn is waiting
    for cgwb_release_workfn() to free up some of the kworkers.

    We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
    same time. There's nothing to be gained from that. This patch
    updates cgwb release path to use a dedicated percpu workqueue with
    @max_active of 1.

    While this resolves the problem at hand, it might be a good idea to
    isolate rcu_exp_work to its own workqueue too, as it can be used from
    various paths and is prone to this sort of indirect A-A deadlock.
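
    A minimal sketch of the dedicated queue (initcall placement and function
    name are assumptions; the @max_active of 1 is the point):

    /* hedged sketch: at most one release work item runs at a time, so
     * system_wq kworkers can no longer be exhausted by cgwb teardown
     */
    static struct workqueue_struct *cgwb_release_wq;

    static int __init cgwb_init_sketch(void)
    {
            cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
            if (!cgwb_release_wq)
                    return -ENOMEM;
            return 0;
    }
    subsys_initcall(cgwb_init_sketch);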

    Signed-off-by: Tejun Heo
    Cc: "Paul E. McKenney"
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

21 Jun, 2018

1 commit

  • [ Upstream commit c892fd82cc0632d425ae011a4dd75eb59e9f84ee ]

    If there is heavy memory pressure, page allocation with __GFP_NOWAIT
    fails easily even though it's an order-0 request. I got the warning
    below 9 times during a normal boot.

    : page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
    .. snip ..
    Call trace:
    dump_backtrace+0x0/0x4
    dump_stack+0xa4/0xc0
    warn_alloc+0xd4/0x15c
    __alloc_pages_nodemask+0xf88/0x10fc
    alloc_slab_page+0x40/0x18c
    new_slab+0x2b8/0x2e0
    ___slab_alloc+0x25c/0x464
    __kmalloc+0x394/0x498
    memcg_kmem_get_cache+0x114/0x2b8
    kmem_cache_alloc+0x98/0x3e8
    mmap_region+0x3bc/0x8c0
    do_mmap+0x40c/0x43c
    vm_mmap_pgoff+0x15c/0x1e4
    sys_mmap+0xb0/0xc8
    el0_svc_naked+0x24/0x28
    Mem-Info:
    active_anon:17124 inactive_anon:193 isolated_anon:0
    active_file:7898 inactive_file:712955 isolated_file:55
    unevictable:0 dirty:27 writeback:18 unstable:0
    slab_reclaimable:12250 slab_unreclaimable:23334
    mapped:19310 shmem:212 pagetables:816 bounce:0
    free:36561 free_pcp:1205 free_cma:35615
    Node 0 active_anon:68496kB inactive_anon:772kB active_file:31592kB inactive_file:2851820kB unevictable:0kB isolated(anon):0kB isolated(file):220kB mapped:77240kB dirty:108kB writeback:72kB shmem:848kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:142188kB min:3056kB low:3820kB high:4584kB active_anon:10052kB inactive_anon:12kB active_file:312kB inactive_file:1412620kB unevictable:0kB writepending:0kB present:1781412kB managed:1604728kB mlocked:0kB slab_reclaimable:3592kB slab_unreclaimable:876kB kernel_stack:400kB pagetables:52kB bounce:0kB free_pcp:1436kB local_pcp:124kB free_cma:142492kB
    lowmem_reserve[]: 0 1842 1842
    Normal free:4056kB min:4172kB low:5212kB high:6252kB active_anon:58376kB inactive_anon:760kB active_file:31348kB inactive_file:1439040kB unevictable:0kB writepending:180kB present:2000636kB managed:1923688kB mlocked:0kB slab_reclaimable:45408kB slab_unreclaimable:92460kB kernel_stack:9680kB pagetables:3212kB bounce:0kB free_pcp:3392kB local_pcp:688kB free_cma:0kB
    lowmem_reserve[]: 0 0 0
    DMA: 0*4kB 0*8kB 1*16kB (C) 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB (C) 0*1024kB 1*2048kB (C) 34*4096kB (C) = 142096kB
    Normal: 228*4kB (UMEH) 172*8kB (UMH) 23*16kB (UH) 24*32kB (H) 5*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3872kB
    721350 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    945512 pages RAM
    0 pages HighMem/MovableOnly
    63408 pages reserved
    51200 pages cma reserved

    __memcg_schedule_kmem_cache_create() tries to create a shadow slab cache
    and the worker allocation failure is not really critical because we will
    retry on the next kmem charge. We might miss some charges but that
    shouldn't be critical. The excessive allocation failure report is not
    very helpful.
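
    A minimal sketch of the kind of change this implies (not necessarily the
    exact diff; "cw" stands for the create-work item allocated in
    __memcg_schedule_kmem_cache_create() and is assumed here):

    /* hedged sketch: this allocation is best-effort, so silence the
     * warning; a failure is handled by retrying on the next kmem charge
     */
    cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
    if (!cw)
            return;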

    [mhocko@kernel.org: changelog update]
    Link: http://lkml.kernel.org/r/20180418022912.248417-1-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

12 Jun, 2018

2 commits

  • commit 423913ad4ae5b3e8fb8983f70969fb522261ba26 upstream.

    Commit be83bbf80682 ("mmap: introduce sane default mmap limits") was
    introduced to catch problems in various ad-hoc character device drivers
    doing mmap and getting the size limits wrong. In the process, it used
    "known good" limits for the normal cases of mapping regular files and
    block device drivers.

    It turns out that the "s_maxbytes" limit was less "known good" than I
    thought. In particular, /proc doesn't set it, but exposes one regular
    file to mmap: /proc/vmcore. As a result, that file got limited to the
    default MAX_INT s_maxbytes value.

    This went unnoticed for a while, because apparently the only thing that
    needs it is the s390 kernel zfcpdump, but there might be other tools
    that use this too.

    Vasily suggested just changing s_maxbytes for all of /proc, which isn't
    wrong, but makes me nervous at this stage. So instead, just make the
    new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
    affect anything else. It wasn't the regular file case I was worried
    about.

    I'd really prefer for maxsize to have been per-inode, but that is not
    how things are today.

    Fixes: be83bbf80682 ("mmap: introduce sane default mmap limits")
    Reported-by: Vasily Gorbik
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit be83bbf806822b1b89e0a0f23cd87cddc409e429 upstream.

    The internal VM "mmap()" interfaces are based on the mmap target doing
    everything using page indexes rather than byte offsets, because
    traditionally (ie 32-bit) we had the situation that the byte offset
    didn't fit in a register. So while the mmap virtual address was limited
    by the word size of the architecture, the backing store was not.

    So we're basically passing "pgoff" around as a page index, in order to
    be able to describe backing store locations that are much bigger than
    the word size (think files larger than 4GB etc).

    But while this all makes a ton of sense conceptually, we've been dogged
    by various drivers that don't really understand this, and internally
    work with byte offsets, and then try to work with the page index by
    turning it into a byte offset with "pgoff << PAGE_SHIFT".

    Which obviously can overflow.

    Adding the size of the mapping to it to get the byte offset of the end
    of the backing store just exacerbates the problem, and if you then use
    this overflow-prone value to check various limits of your device driver
    mmap capability, you're just setting yourself up for problems.

    The correct thing for drivers to do is to do their limit math in page
    indices, the way the interface is designed. Because the generic mmap
    code _does_ test that the index doesn't overflow, since that's what the
    mmap code really cares about.

    HOWEVER.

    Finding and fixing various random drivers is a sisyphean task, so let's
    just see if we can just make the core mmap() code do the limiting for
    us. Realistically, the only "big" backing stores we need to care about
    are regular files and block devices, both of which are known to do this
    properly, and which have nice well-defined limits for how much data they
    can access.

    So let's special-case just those two known cases, and then limit other
    random mmap users to a backing store that still fits in "unsigned long".
    Realistically, that's not much of a limit at all on 64-bit, and on
    32-bit architectures the only worry might be the GPU drivers, which can
    have big physical address spaces.

    To make it possible for drivers like that to say that they are 64-bit
    clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
    file flags to allow drivers to mark their file descriptors as safe in
    the full 64-bit mmap address space.
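
    A worked illustration of the overflow (plain userspace C, assuming 4K
    pages and 32-bit unsigned arithmetic; not kernel code):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12

    int main(void)
    {
            uint32_t pgoff = 0x00200000;   /* 8 GiB into the backing store, in pages */
            uint32_t len   = 1u << 20;     /* 1 MiB mapping */

            uint32_t end_bytes = (pgoff << PAGE_SHIFT) + len;           /* wraps */
            uint64_t end_pages = (uint64_t)pgoff + (len >> PAGE_SHIFT); /* exact */

            printf("byte math: end = 0x%x (wrapped past 4 GiB)\n",
                   (unsigned)end_bytes);
            printf("page math: end = 0x%llx pages (no overflow)\n",
                   (unsigned long long)end_pages);
            return 0;
    }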

    [ The timing for doing this is less than optimal, and this should really
    go in a merge window. But realistically, this needs wide testing more
    than it needs anything else, and being main-line is the only way to do
    that.

    So the earlier the better, even if it's outside the proper development
    cycle - Linus ]

    Cc: Kees Cook
    Cc: Dan Carpenter
    Cc: Al Viro
    Cc: Willy Tarreau
    Cc: Dave Airlie
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

05 Jun, 2018

2 commits

  • commit 2d077d4b59924acd1f5180c6fb73b57f4771fde6 upstream.

    Swapping load on huge=always tmpfs (with khugepaged tuned up to be very
    eager, but I'm not sure that is relevant) soon hung uninterruptibly,
    waiting for page lock in shmem_getpage_gfp()'s find_lock_entry(), most
    often when "cp -a" was trying to write to a smallish file. Debug showed
    that the page in question was not locked, and page->mapping NULL by now,
    but page->index consistent with having been in a huge page before.

    Reproduced in minutes on a 4.15 kernel, even with 4.17's 605ca5ede764
    ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()") added
    in; but took hours to reproduce on a 4.17 kernel (no idea why).

    The culprit proved to be the __ClearPageDirty() on tails beyond i_size in
    __split_huge_page(): the non-atomic bit operation may have been safe when
    4.8's baa355fd3314 ("thp: file pages support for split_huge_page()")
    introduced it, but is liable to erase PageWaiters after 4.10's 62906027091f
    ("mm: add PageWaiters indicating tasks are waiting for a page bit").

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805291841070.3197@eggly.anvils
    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Nicholas Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 145e1a71e090575c74969e3daa8136d1e5b99fc8 upstream.

    George Boole would have noticed a slight error in 4.16 commit
    69d763fc6d3a ("mm: pin address_space before dereferencing it while
    isolating an LRU page"). Fix it, to match both the comment above it,
    and the original behaviour.

    Although anonymous pages are not marked PageDirty at first, we have an
    old habit of calling SetPageDirty when a page is removed from swap
    cache: so there's a category of ex-swap pages that are easily
    migratable, but were inadvertently excluded from compaction's async
    migration in 4.16.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
    Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
    Signed-off-by: Hugh Dickins
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Reported-by: Ivan Kalvachev
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

30 May, 2018

14 commits

  • [ Upstream commit f0849ac0b8e072073ec5fcc7fadd05a77434364e ]

    For a PTE-mapped THP, the compound THP has not been split into normal 4K
    pages yet, so the whole THP is considered referenced if any one of its
    sub-pages is referenced.

    When walking a PTE-mapped THP via pvmw, all relevant PTEs will be checked
    to retrieve the referenced bit. But the current code just returns the
    result for the last PTE: if the last PTE has not been referenced, the
    referenced flag will be cleared.

    Just set referenced when ptep{pmdp}_clear_young_notify() returns true.

    Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reported-by: Gang Deng
    Suggested-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • [ Upstream commit e92bb4dd9673945179b1fc738c9817dd91bfb629 ]

    When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at the
    same time. This may lead to the following race.

    CPU1                                                 CPU2

    truncate(inode)                                      shrink_active_list()
      ...                                                  page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                          mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

                                                         mapping_unevictable(mapping)
                                                           test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable()
    too.

    The address_space in the inode and swap cache will be freed after an RCU
    grace period. So the races are fixed by enclosing the page_mapping()
    and address_space usage in rcu_read_lock()/rcu_read_unlock(). Some
    comments are added in the code to make it clear what is protected by the
    RCU read lock.
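
    A minimal sketch of the pattern on the page_evictable() side (helper
    names follow the race shown above; the exact placement is an assumption):

    /* hedged sketch: keep the mapping pinned by RCU for as long as it is
     * dereferenced
     */
    rcu_read_lock();
    mapping = page_mapping(page);
    evictable = !mapping || !mapping_unevictable(mapping);
    rcu_read_unlock();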

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     
  • [ Upstream commit 77da2ba0648a4fd52e5ff97b8b2b8dd312aec4b0 ]

    This patch fixes a corner case for KSM. When two pages belong or
    belonged to the same transparent hugepage, and they should be merged,
    KSM fails to split the page, and therefore no merging happens.

    This bug can be reproduced by:
    * making sure ksm is running (disabling ksmtuned if necessary)
    * enabling transparent hugepages
    * allocating a THP-aligned, 1-THP-sized buffer,
    e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
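
    A hedged userspace repro sketch along the lines of the steps above; the
    memset() and madvise(MADV_MERGEABLE) calls are assumptions about the
    usual way to make the buffer mergeable for KSM, not steps quoted from the
    patch description.

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t thp = 2UL << 20;          /* 2 MiB THP on amd64 */
            void *p;

            if (posix_memalign(&p, thp, thp))
                    return 1;
            memset(p, 0x42, thp);            /* identical, mergeable content */
            madvise(p, thp, MADV_MERGEABLE); /* let ksmd scan this range */
            pause();                         /* give ksmd time to run */
            return 0;
    }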
    Co-authored-by: Gerald Schaefer
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Claudio Imbrenda
     
  • [ Upstream commit 1ec6995d1290bfb87cc3a51f0836c889e857cef9 ]

    In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
    released on the error path where pool->compact_wq, which holds the
    return value of create_singlethread_workqueue(), is NULL. This results
    in a memory leak.

    [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
    Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
    Signed-off-by: Xidong Wang
    Reviewed-by: Andrew Morton
    Cc: Vitaly Wool
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Xidong Wang
     
  • [ Upstream commit a06ad633a37c64a0cd4c229fc605cee8725d376e ]

    Calling swapon() on a zero length swap file on SSD can lead to a
    divide-by-zero.

    Although creating such files isn't possible with mkswap and they would be
    considered invalid, it would be better for the swapon code to be more
    robust and handle this condition gracefully (return -EINVAL).
    Especially since the fix is small and straightforward.

    To help with wear leveling on SSD, the swapon syscall calculates a
    random position in the swap file using modulo p->highest_bit, which is
    set to maxpages - 1 in read_swap_header.

    If the swap file is zero length, read_swap_header sets maxpages=1 and
    last_page=0, resulting in p->highest_bit=0 and we divide-by-zero when we
    modulo p->highest_bit in swapon syscall.

    This can be prevented by having read_swap_header return zero if
    last_page is zero.
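
    A minimal sketch of the guard in read_swap_header() (the warning text is
    an assumption):

    /* hedged sketch: reject an empty swap file before highest_bit can ever
     * be computed as zero
     */
    if (!last_page) {
            pr_warn("Empty swap-file\n");
            return 0;
    }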

    Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
    Signed-off-by: Thomas Abraham
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tom Abraham
     
  • [ Upstream commit 914b6dfff790544d9b77dfd1723adb3745ec9700 ]

    A crash is observed when kmemleak_scan accesses the object->pointer,
    likely due to the following race.

    TASK A                    TASK B                       TASK C
    kmemleak_write
     (with "scan" and
      NOT "scan=on")
    kmemleak_scan()
                              create_object
                              kmem_cache_alloc fails
                              kmemleak_disable
                              kmemleak_do_cleanup
                              kmemleak_free_enabled = 0
                                                           kfree
                                                           kmemleak_free bails out
                                                            (kmemleak_free_enabled is 0)
                                                           slub frees object->pointer
    update_checksum
     crash - object->pointer
      freed (DEBUG_PAGEALLOC)

    kmemleak_do_cleanup waits for the scan thread to complete, but not for
    direct call to kmemleak_scan via kmemleak_write. So add a wait for
    kmemleak_scan completion before disabling kmemleak_free, and while at it
    fix the comment on stop_scan_thread.

    [vinmenon@codeaurora.org: fix stop_scan_thread comment]
    Link: http://lkml.kernel.org/r/1522219972-22809-1-git-send-email-vinmenon@codeaurora.org
    Link: http://lkml.kernel.org/r/1522063429-18992-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vinayak Menon
     
  • [ Upstream commit c7f26ccfb2c31eb1bf810ba13d044fcf583232db ]

    Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
    cause vmstat_update() to report a BUG due to preemption not being
    disabled around smp_processor_id().

    Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

    BUG: using smp_processor_id() in preemptible [00000000] code:
    kworker/1:1/269
    caller is vmstat_update+0x50/0xa0
    CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
    4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
    Workqueue: mm_percpu_wq vmstat_update
    Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

    Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
    Signed-off-by: Steven J. Hill
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Steven J. Hill
     
  • [ Upstream commit 299815a4fba9f3c7a81434dba0072148f1690608 ]

    This patch fixes commit 5f48f0bd4e36 ("mm, page_owner: skip unnecessary
    stack_trace entries").

    Because if we skip the first two entries then the logic of checking the
    count value as 2 for recursion is broken and the code will go one level
    deep into recursion.

    So we need to check only one call of _RET_IP(__set_page_owner) while
    checking for recursion.

    Current Backtrace while checking for recursion:-

    (save_stack) from (__set_page_owner) // (But recursion returns true here)
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack) // recursion should return true here
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask+)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack)
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)

    Correct Backtrace with fix:

    (save_stack) from (__set_page_owner) // recursion returned true here
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask+)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack)
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)

    Link: http://lkml.kernel.org/r/1521607043-34670-1-git-send-email-maninder1.s@samsung.com
    Fixes: 5f48f0bd4e36 ("mm, page_owner: skip unnecessary stack_trace entries")
    Signed-off-by: Maninder Singh
    Signed-off-by: Vaneet Narang
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Greg Kroah-Hartman
    Cc: Ayush Mittal
    Cc: Prakash Gupta
    Cc: Vinayak Menon
    Cc: Vasyl Gomonovych
    Cc: Amit Sahrawat
    Cc:
    Cc: Vaneet Narang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Maninder Singh
     
  • [ Upstream commit 880cd276dff17ea29e9a8404275c9502b265afa7 ]

    All the root caches are linked into slab_root_caches which was
    introduced by the commit 510ded33e075 ("slab: implement slab_root_caches
    list") but it missed to add the SLAB's kmem_cache.

    While experimenting with opt-in/opt-out kmem accounting, I noticed
    system crashes due to a NULL dereference inside cache_from_memcg_idx()
    while dereferencing kmem_cache.memcg_params.memcg_caches. The upstream
    clean kernel will not see these crashes, but SLAB should be consistent
    with SLUB, which does link its boot caches (kmem_cache_node and
    kmem_cache) into slab_root_caches.

    Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com
    Fixes: 510ded33e075c ("slab: implement slab_root_caches list")
    Signed-off-by: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     
  • [ Upstream commit 9d3c3354bb85bab4d865fe95039443f09a4c8394 ]

    Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
    madvised allocations") changed the page allocator to no longer detect
    thp allocations based on __GFP_NORETRY.

    It did not, however, modify the mem cgroup try_charge() path to avoid
    oom kill for either khugepaged collapsing or thp faulting. It is never
    expected to oom kill a process to allocate a hugepage for thp; reclaim
    is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
    (and charging) should fallback instead of oom killing processes.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • [ Upstream commit 8970a63e965b43288c4f5f40efbc2bbf80de7f16 ]

    Alexander reported a use of uninitialized memory in __mpol_equal(),
    which is caused by incorrect use of preferred_node.

    When a mempolicy is in mode MPOL_PREFERRED with the MPOL_F_LOCAL flag,
    it uses numa_node_id() instead of preferred_node; however, __mpol_equal()
    uses preferred_node without checking whether the policy is MPOL_F_LOCAL
    or not.
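
    A minimal sketch of the corrected comparison (the helper name is an
    assumption; the field names follow struct mempolicy as discussed here):

    /* hedged sketch, not the literal diff */
    static bool preferred_node_equal(const struct mempolicy *a,
                                     const struct mempolicy *b)
    {
            if (a->flags & MPOL_F_LOCAL)     /* local: preferred_node unused */
                    return true;
            return a->v.preferred_node == b->v.preferred_node;
    }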

    [akpm@linux-foundation.org: slight comment tweak]
    Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
    Fixes: fc36b8d3d819 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
    Signed-off-by: Yisheng Xie
    Reported-by: Alexander Potapenko
    Tested-by: Alexander Potapenko
    Reviewed-by: Andrew Morton
    Cc: Dmitriy Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     
  • commit 3f1959721558a976aaf9c2024d5bc884e6411bf7 upstream.

    Using module_init() is wrong. E.g. ACPI adds and onlines memory before
    our memory notifier gets registered.

    This makes sure that ACPI memory detected during boot up will not result
    in a kernel crash.

    Easily reproducible with QEMU, just specify a DIMM when starting up.

    Link: http://lkml.kernel.org/r/20180522100756.18478-3-david@redhat.com
    Fixes: 786a8959912e ("kasan: disable memory hotplug")
    Signed-off-by: David Hildenbrand
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit ed1596f9ab958dd156a66c9ff1029d3761c1786a upstream.

    We have to free memory again when we cancel onlining, otherwise a later
    onlining attempt will fail.

    Link: http://lkml.kernel.org/r/20180522100756.18478-2-david@redhat.com
    Fixes: fa69b5989bb0 ("mm/kasan: add support for memory hotplug")
    Signed-off-by: David Hildenbrand
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit 0f901dcbc31f88ae41a2aaa365f7802b5d520a28 upstream.

    KASAN uses different routines to map shadow for hot-added memory and for
    memory obtained during the boot process. An attempt to offline memory
    onlined by the normal boot process leads to this:

    Trying to vfree() nonexistent vm area (000000005d3b34b9)
    WARNING: CPU: 2 PID: 13215 at mm/vmalloc.c:1525 __vunmap+0x147/0x190

    Call Trace:
    kasan_mem_notifier+0xad/0xb9
    notifier_call_chain+0x166/0x260
    __blocking_notifier_call_chain+0xdb/0x140
    __offline_pages+0x96a/0xb10
    memory_subsys_offline+0x76/0xc0
    device_offline+0xb8/0x120
    store_mem_state+0xfa/0x120
    kernfs_fop_write+0x1d5/0x320
    __vfs_write+0xd4/0x530
    vfs_write+0x105/0x340
    SyS_write+0xb0/0x140

    Obviously we can't call vfree() to free memory that wasn't allocated via
    vmalloc(). Use find_vm_area() to see if we can call vfree().

    Unfortunately it's a bit tricky to properly unmap and free shadow
    allocated during boot, so we'll have to keep it. If the memory comes
    online again, that shadow will be reused.

    Matthew asked: how can you call vfree() on something that isn't a
    vmalloc address?

    vfree() is able to free any address returned by
    __vmalloc_node_range(). And __vmalloc_node_range() gives you any
    address you ask. It doesn't have to be an address in [VMALLOC_START,
    VMALLOC_END] range.

    That's also how the module_alloc()/module_memfree() works on
    architectures that have designated area for modules.
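
    A minimal sketch of the MEM_OFFLINE handling described above
    (shadow_start is an illustrative variable for the shadow address of the
    range going offline):

    /* hedged sketch: only vfree() shadow that vmalloc knows about;
     * early-boot shadow is kept for a possible later re-onlining
     */
    struct vm_struct *vm = find_vm_area((void *)shadow_start);

    if (vm)
            vfree((void *)shadow_start);
    /* else: shadow was mapped during early boot, leave it in place */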

    [aryabinin@virtuozzo.com: improve comments]
    Link: http://lkml.kernel.org/r/dabee6ab-3a7a-51cd-3b86-5468718e0390@virtuozzo.com
    [akpm@linux-foundation.org: fix typos, reflow comment]
    Link: http://lkml.kernel.org/r/20180201163349.8700-1-aryabinin@virtuozzo.com
    Fixes: fa69b5989bb0 ("mm/kasan: add support for memory hotplug")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Paul Menzel
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     

23 May, 2018

1 commit

  • commit ab1e8d8960b68f54af42b6484b5950bd13a4054b upstream.

    It is unsafe to do virtual to physical translations before mm_init() is
    called if struct page is needed in order to determine the memory section
    number (see SECTION_IN_PAGE_FLAGS). This is because only in mm_init()
    we initialize struct pages for all the allocated memory when deferred
    struct pages are used.

    My recent fix in commit c9e97a1997 ("mm: initialize pages on demand
    during boot") exposed this problem, because it greatly reduced number of
    pages that are initialized before mm_init(), but the problem existed
    even before my fix, as Fengguang Wu found.

    Below is a more detailed explanation of the problem.

    We initialize struct pages in four places:

    1. Early in boot a small set of struct pages is initialized to fill the
    first section, and lower zones.

    2. During mm_init() we initialize "struct pages" for all the memory that
    is allocated, i.e reserved in memblock.

    3. Using on-demand logic when pages are allocated after mm_init call
    (when memblock is finished)

    4. After smp_init(), when the rest of the free deferred pages are initialized.

    The problem occurs if we try to do a va to phys translation of memory
    between steps 1 and 2. Because we have not yet initialized struct pages
    for all the reserved pages, it is inherently unsafe to do va to phys if
    the translation itself requires access to "struct page", as in the case
    of this combination: CONFIG_SPARSE && !CONFIG_SPARSE_VMEMMAP

    The following path exposes the problem:

    start_kernel()
    trap_init()
    setup_cpu_entry_areas()
    setup_cpu_entry_area(cpu)
    get_cpu_gdt_paddr(cpu)
    per_cpu_ptr_to_phys(addr)
    pcpu_addr_to_page(addr)
    virt_to_page(addr)
    pfn_to_page(__pa(addr) >> PAGE_SHIFT)

    We disable this path by not allowing NEED_PER_CPU_KM with the deferred
    struct pages feature.

    The problems are discussed in these threads:
    http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.com
    http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.com
    http://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com

    Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
    Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: Mel Gorman
    Cc: Fengguang Wu
    Cc: Dennis Zhou
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     

19 May, 2018

1 commit

  • commit 7f7ccc2ccc2e70c6054685f5e3522efa81556830 upstream.

    proc_pid_cmdline_read() and environ_read() directly access the target
    process' VM to retrieve the command line and environment. If this
    process remaps these areas onto a file via mmap(), the requesting
    process may experience various issues such as extra delays if the
    underlying device is slow to respond.

    Let's simply refuse to access file-backed areas in these functions.
    For this we add a new FOLL_ANON gup flag that is passed to all calls
    to access_remote_vm(). The code already takes care of such failures
    (including unmapped areas). Accesses via /proc/pid/mem were not
    changed though.
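
    A minimal sketch of the flag's effect on the gup side (the exact
    placement of the check is an assumption):

    /* hedged sketch: with FOLL_ANON, refuse any VMA that is not purely
     * anonymous; the callers already treat this like an unmapped area
     */
    if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
            return -EFAULT;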

    This was assigned CVE-2018-1120.

    Note for stable backports: the patch may apply to kernels prior to 4.11
    but silently miss one location; it must be checked that no call to
    access_remote_vm() keeps zero as the last argument.

    Reported-by: Qualys Security Advisory
    Cc: Linus Torvalds
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Signed-off-by: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Willy Tarreau
     

16 May, 2018

5 commits

  • commit 27ae357fa82be5ab73b2ef8d39dcb8ca2563483a upstream.

    Since exit_mmap() is done without the protection of mm->mmap_sem, it is
    possible for the oom reaper to concurrently operate on an mm until
    MMF_OOM_SKIP is set.

    This allows munlock_vma_pages_all() to concurrently run while the oom
    reaper is operating on a vma. Since munlock_vma_pages_range() depends
    on clearing VM_LOCKED from vm_flags before actually doing the munlock to
    determine if any other vmas are locking the same memory, the check for
    VM_LOCKED in the oom reaper is racy.

    This is especially noticeable on architectures such as powerpc where
    clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
    is zapped by the oom reaper during follow_page_mask() after the check
    for pmd_none() is bypassed, this ends up dereferencing a NULL ptl or
    causing a kernel oops.

    Fix this by manually freeing all possible memory from the mm before
    doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not
    run on the mm anymore so the munlock is safe to do in exit_mmap(). It
    also matches the logic that the oom reaper currently uses for
    determining when to set MMF_OOM_SKIP itself, so there's no new risk of
    excessive oom killing.

    This issue fixes CVE-2018-1000200.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: David Rientjes
    Suggested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit 27227c733852f71008e9bf165950bb2edaed3a90 upstream.

    Memory hotplug and hotremove operate with per-block granularity. If the
    machine has a large amount of memory (more than 64G), the size of a
    memory block can span multiple sections. By mistake, during hotremove
    we set only the first section to offline state.

    The bug was discovered because a kernel selftest started to fail after
    commit "mm/memory_hotplug: optimize probe routine":
    https://lkml.kernel.org/r/20180423011247.GK5563@yexl-desktop

    The bug itself is older than that commit; the optimization merely added
    a check that sections are in the proper state during hotplug operations,
    which is what exposed the problem.
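
    For reference, a minimal sketch of the per-section bookkeeping the fix
    restores, modelled on offline_mem_sections() in mm/sparse.c: every
    section spanned by the block's pfn range has to be marked offline, not
    just the first one (an illustration, not the literal diff):

    void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long pfn;

            /* Walk the whole range one section at a time. */
            for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
                    unsigned long section_nr = pfn_to_section_nr(pfn);
                    struct mem_section *ms;

                    if (WARN_ON(!valid_section_nr(section_nr)))
                            continue;

                    ms = __nr_to_section(section_nr);
                    ms->section_mem_map &= ~SECTION_IS_ONLINE;
            }
    }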

    Link: http://lkml.kernel.org/r/20180427145257.15222-1-pasha.tatashin@oracle.com
    Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit 6098d7e136692f9c6e23ae362c62ec822343e4d5 upstream.

    Do not try to optimize in-page object layout while the page is under
    reclaim. This fixes lock-ups on reclaim and improves reclaim
    performance at the same time.
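
    The changelog does not spell out the mechanism, but the idea amounts to
    a simple guard: reclaim marks the page before working on it, and the
    compaction path refuses to reshuffle objects while the mark is set. A
    sketch with hypothetical names (the "under reclaim" bit and
    do_compact_page() are placeholders, not the actual z3fold symbols):

    /* Illustrative only: UNDER_RECLAIM_BIT and do_compact_page() are
     * hypothetical names standing in for the real z3fold flag/helper. */
    static void maybe_compact(struct page *page, struct z3fold_header *zhdr)
    {
            if (test_bit(UNDER_RECLAIM_BIT, &page->private))
                    return; /* reclaim owns this page; leave the layout alone */

            do_compact_page(zhdr);
    }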

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20180430125800.444cae9706489f412ad12621@gmail.com
    Signed-off-by: Vitaly Wool
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Cc:
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     
  • commit 8236b0ae31c837d2b3a2565c5f8d77f637e824cc upstream.

    syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
    wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
    Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
    wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).

    Introduce a helper function clear_and_wake_up_bit() and use it, in order
    to avoid similar errors in the future.

    [1] https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e
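
    The helper simply pairs the clear with the wake-up (and the memory
    barrier wake_up_bit() relies on); a sketch of what it looks like,
    assuming the usual wait_bit primitives:

    static inline void clear_and_wake_up_bit(int bit, void *word)
    {
            clear_bit_unlock(bit, word);
            /* Order the clear before the waiter check inside wake_up_bit(). */
            smp_mb__after_atomic();
            wake_up_bit(word, bit);
    }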

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()")
    Cc: Tejun Heo
    Reviewed-by: Jan Kara
    Suggested-by: Linus Torvalds
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     
  • commit 4eaf431f6f71bbed40a4c733ffe93a7e8cedf9d9 upstream.

    syzbot has triggered a NULL ptr dereference when allocation fault
    injection enforces a failure and alloc_mem_cgroup_per_node_info
    initializes memcg->nodeinfo only halfway through.

    But __mem_cgroup_free still tries to free all per-node data and
    dereferences pn->lruvec_stat_cpu unconditionally, even if the
    corresponding per-node data has not been initialized.

    The bug is quite unlikely to hit because small allocations do not fail,
    and it would take quite a few NUMA nodes to make struct
    mem_cgroup_per_node large enough to cross the costly-order threshold.
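
    The straightforward hardening is to make the per-node teardown tolerate
    a NULL node entry; roughly (names as in mm/memcontrol.c, exact shape
    illustrative):

    static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
    {
            struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];

            /* The allocation may have failed halfway through; nothing to free. */
            if (!pn)
                    return;

            free_percpu(pn->lruvec_stat_cpu);
            kfree(pn);
    }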

    Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
    Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrey Ryabinin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

09 May, 2018

1 commit

  • commit 71546d100422bcc2c543dadeb9328728997cd23a upstream.

    The microblaze build broke due to a missing declaration for the recently
    added cond_resched() invocation. Let's include linux/sched.h explicitly.

    Signed-off-by: Tejun Heo
    Reported-by: kbuild test robot
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

26 Apr, 2018

2 commits

  • [ Upstream commit a7ab400d6fe73d0119fdc234e9982a6f80faea9f ]

    During our recent testing with fadvise(FADV_DONTNEED), we found that if
    the given offset/length is not page-aligned, the last page will not be
    discarded. The tool we used is vmtouch (https://hoytech.com/vmtouch/):
    we map a 10KB file into memory and then run the tool to evict the whole
    file mapping, but the last page always stays resident in memory:

    $./vmtouch -e test_10K
    Files: 1
    Directories: 0
    Evicted Pages: 3 (12K)
    Elapsed: 2.1e-05 seconds

    $./vmtouch test_10K
    Files: 1
    Directories: 0
    Resident Pages: 1/3 4K/12K 33.3%
    Elapsed: 5.5e-05 seconds

    However, when we test with an older kernel, say 3.10, the problem is
    gone, so we suspect this is a regression:

    $./vmtouch -e test_10K
    Files: 1
    Directories: 0
    Evicted Pages: 3 (12K)
    Elapsed: 8.2e-05 seconds

    $./vmtouch test_10K
    Files: 1
    Directories: 0
    Resident Pages: 0/3 0/12K 0%
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(int argc, char **argv)
    {
            int i, fd, ret, len;
            struct stat buf;
            void *addr;
            unsigned char *vec;
            char *strbuf;
            ssize_t pagesize = getpagesize();
            ssize_t filesize;

            /* create a file of the requested size, filled with a constant */
            fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
            if (fd < 0)
                    return -1;
            filesize = strtoul(argv[2], NULL, 10);

            strbuf = malloc(filesize);
            memset(strbuf, 42, filesize);
            write(fd, strbuf, filesize);
            free(strbuf);
            fsync(fd);

            len = (filesize + pagesize - 1) / pagesize;
            printf("length of pages: %d\n", len);

            addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED)
                    return -1;

            /* ask the kernel to drop the whole file from the page cache */
            ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
            if (ret < 0)
                    return -1;

            /* report which pages of the mapping are still resident */
            vec = malloc(len);
            ret = mincore(addr, filesize, (void *)vec);
            if (ret < 0)
                    return -1;

            for (i = 0; i < len; i++)
                    printf("pages[%d]: %x\n", i, vec[i] & 0x1);

            free(vec);
            close(fd);

            return 0;
    }
    ==============================

    Test 1: running on a kernel with commit 18aba41cbf reverted:

    [root@caspar ~]# uname -r
    4.15.0-rc6.revert+
    [root@caspar ~]# ./test_fadvise file1 1024
    length of pages: 1
    pages[0]: 0 #
    Signed-off-by: Caspar Zhang
    Reviewed-by: Oliver Yang
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    shidao.ytt
     
  • [ Upstream commit 69d763fc6d3aee787a3e8c8c35092b4f4960fa5d ]

    Minchan Kim asked the following question -- what lock protects the
    address_space from being destroyed when a race happens between inode
    truncation and __isolate_lru_page()? Jan Kara clarified by describing
    the race as follows:

    CPU1                                          CPU2

    truncate(inode)                               __isolate_lru_page()
      ...
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                               mapping = page_mapping(page);
                page->mapping = NULL;
              ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

                                                  if (mapping && !mapping->a_ops->migratepage)
    - we've dereferenced mapping which is potentially already free.

    The race is theoretically possible but unlikely. Before
    delete_from_page_cache() runs, truncate_cleanup_page() is called, so the
    page is likely to be !PageDirty or PageWriteback, and such pages are
    skipped by the only caller that checks the mapping in
    __isolate_lru_page(). Even if the race occurs, a substantial amount of
    work has to happen during a tiny window with no preemption, but it could
    potentially be forced by using a virtual machine to artificially slow
    one CPU or halt it during the critical window.

    This patch should eliminate the race with truncation by try-locking the
    page before dereferencing the mapping and aborting if the lock was not
    acquired. There was a suggestion from Huang Ying to use RCU as a
    side-effect to prevent the mapping from being freed. However, I do not
    like that solution as it is an unconventional means of preserving a
    mapping and it is not a context where rcu_read_lock() is obviously
    protecting RCU data.
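
    Sketched against __isolate_lru_page() in mm/vmscan.c (illustrative; ret
    and mapping come from the surrounding function, other checks elided):

    bool migrate_dirty;

    /*
     * Only pages without mappings or with a ->migratepage callback can
     * migrate without blocking, but page->mapping can be freed under us
     * by a racing truncate. Trylock the page to stabilise the mapping:
     * truncation holds the page lock until the page has been removed
     * from the page cache.
     */
    if (!trylock_page(page))
            return ret;

    mapping = page_mapping(page);
    migrate_dirty = !mapping || mapping->a_ops->migratepage;
    unlock_page(page);
    if (!migrate_dirty)
            return ret;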

    Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
    Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again")
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman