05 Mar, 2020

4 commits

  • commit f42f25526502d851d0e3ca1e46297da8aafce8a7 upstream.

    If thp defrag setting "defer" is used and a newline is *not* used when
    writing to the sysfs file, this is interpreted as the "defer+madvise"
    option.

    This is because we do prefix matching and if five characters are written
    without a newline, the current code ends up comparing to the first five
    bytes of the "defer+madvise" option and using that instead.

    Use the more appropriate sysfs_streq() that handles the trailing newline
    for us. Since this doubles as a nice cleanup, do it in enabled_store()
    as well.

    The current implementation relies on prefix matching: the number of
    bytes compared is either the number of bytes written or the length of
    the option being compared. With a newline, "defer\n" does not match
    "defer+"madvise"; without a newline, however, "defer" is considered to
    match "defer+madvise" (prefix matching is only comparing the first five
    bytes). End result is that writing "defer" is broken unless it has an
    additional trailing character.

    This means that writing "madv" in the past would match and set
    "madvise". With strict checking, that no longer is the case but it is
    unlikely anybody is currently doing this.
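
    As a rough illustration of the two matching strategies, here is a userspace
    sketch; prefix_match() mimics the old prefix comparison and streq_newline()
    only approximates the semantics of the kernel's sysfs_streq(), it is not
    the kernel implementation:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Approximation of the old prefix match: compare only the bytes written. */
    static bool prefix_match(const char *buf, size_t count, const char *opt)
    {
        return strncmp(buf, opt, count) == 0;
    }

    /* Rough model of sysfs_streq(): exact match, tolerating one trailing '\n'. */
    static bool streq_newline(const char *s1, const char *s2)
    {
        size_t n = strlen(s1);

        if (n && s1[n - 1] == '\n')
            n--;
        return strncmp(s1, s2, n) == 0 && s2[n] == '\0';
    }

    int main(void)
    {
        /* "defer" written without a newline: 5 bytes. */
        printf("prefix vs defer+madvise: %d\n",
               prefix_match("defer", 5, "defer+madvise")); /* 1 -> wrong option */
        printf("streq  vs defer+madvise: %d\n",
               streq_newline("defer", "defer+madvise"));   /* 0 */
        printf("streq  vs defer:         %d\n",
               streq_newline("defer\n", "defer"));         /* 1 */
        return 0;
    }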

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001171411020.56385@chino.kir.corp.google.com
    Fixes: 21440d7eb904 ("mm, thp: add new defer+madvise defrag option")
    Signed-off-by: David Rientjes
    Suggested-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit cb829624867b5ab10bc6a7036d183b1b82bfe9f8 upstream.

    The page could be a tail page, if this is the case, this BUG_ON will
    never be triggered.

    Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com
    Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")

    Signed-off-by: Wei Yang
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit f4000fdf435b8301a11cf85237c561047f8c4c72 upstream.

    Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
    only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
    This, combined with the fact that get_user_pages_fast() falls back to
    "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
    if you need FOLL_FORCE, you cannot call get_user_pages_fast().

    There does not appear to be any reason for filtering out FOLL_FORCE.
    There is nothing in the _fast() implementation that requires that we
    avoid writing to the pages. So it appears to have been an oversight.

    Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
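
    A minimal sketch of how a caller might now pin a page through the fast
    path with FOLL_FORCE; pin_one_page_forced() is an illustrative name, not
    an existing kernel helper:

    #include <linux/errno.h>
    #include <linux/mm.h>

    /* Illustrative helper, not from the kernel tree. */
    static int pin_one_page_forced(unsigned long uaddr, struct page **page)
    {
        int ret;

        /* FOLL_FORCE is now accepted here as well as in the slow path. */
        ret = get_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_FORCE, page);
        if (ret < 0)
            return ret;         /* hard error */
        if (ret == 0)
            return -EFAULT;     /* nothing was pinned */
        return 0;               /* caller must put_page(*page) when done */
    }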

    Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
    Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
    Signed-off-by: John Hubbard
    Reviewed-by: Leon Romanovsky
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Björn Töpel
    Cc: Daniel Vetter
    Cc: Dan Williams
    Cc: Hans Verkuil
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jerome Glisse
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    John Hubbard
     
  • commit 5b57b8f22709f07c0ab5921c94fd66e8c59c3e11 upstream.

    Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    inadvertently removed printing of page flags for pages that are neither
    anon nor ksm nor have a mapping. Fix that.

    Using pr_cont() again would be a solution, but the commit explicitly
    removed its use. Avoiding the danger of mixing up split lines from
    multiple CPUs might be beneficial for near-panic dumps like this, so fix
    this without reintroducing pr_cont().

    Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
    Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    Signed-off-by: Vlastimil Babka
    Reported-by: Anshuman Khandual
    Reported-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Qian Cai
    Cc: Oscar Salvador
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Ralph Campbell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

29 Feb, 2020

4 commits

  • commit dcde237319e626d1ec3c9d8b7613032f0fd4663a upstream.

    Currently the arm64 kernel ignores the top address byte passed to brk(),
    mmap() and mremap(). When the user is not aware of the 56-bit address
    limit or relies on the kernel to return an error, untagging such
    pointers has the potential to create address aliases in user-space.
    Passing a tagged address to munmap() or madvise() is permitted since the
    tagged pointer is expected to be inside an existing mapping.

    The current behaviour breaks the existing glibc malloc() implementation
    which relies on brk() with an address beyond 56-bit to be rejected by
    the kernel.

    Remove untagging in the above functions by partially reverting commit
    ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
    addition, update the arm64 tagged-address-abi.rst document accordingly.
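
    A small userspace sketch of the expectation described above (arm64-specific;
    the tag value and the check are illustrative, and the raw brk syscall simply
    returns the unchanged break when the request is rejected):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        uintptr_t cur = (uintptr_t)syscall(SYS_brk, 0);   /* current break */
        uintptr_t tagged = cur | (0x5fUL << 56);          /* set the top byte */

        /* With the revert, the kernel no longer untags this address, so the
         * break must stay where it was; glibc malloc() treats that as ENOMEM. */
        uintptr_t ret = (uintptr_t)syscall(SYS_brk, tagged);
        printf("brk(tagged address) was %s\n",
               ret == cur ? "rejected, as expected" : "accepted");
        return 0;
    }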

    Link: https://bugzilla.redhat.com/1797052
    Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk")
    Cc: # 5.4.x-
    Cc: Florian Weimer
    Reviewed-by: Andrew Morton
    Reported-by: Victor Stinner
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     
  • commit 18e19f195cd888f65643a77a0c6aee8f5be6439a upstream.

    When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
    doesn't work before sparse_init_one_section() is called.

    This leads to a crash when hotplugging memory:

    BUG: unable to handle page fault for address: 0000000006400000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:__memset+0x24/0x30
    Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
    RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
    RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
    RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
    RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
    R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
    R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
    FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    sparse_add_section+0x1c9/0x26a
    __add_pages+0xbf/0x150
    add_pages+0x12/0x60
    add_memory_resource+0xc8/0x210
    __add_memory+0x62/0xb0
    acpi_memory_device_add+0x13f/0x300
    acpi_bus_attach+0xf6/0x200
    acpi_bus_scan+0x43/0x90
    acpi_device_hotplug+0x275/0x3d0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x1a7/0x370
    worker_thread+0x30/0x380
    kthread+0x112/0x130
    ret_from_fork+0x35/0x40

    We should use the memmap directly, as the code did before.

    On x86 the impact is limited to x86_32 builds, or x86_64 configurations
    that override the default setting for SPARSEMEM_VMEMMAP.

    Other memory hotplug archs (arm64, ia64, and ppc) also default to
    SPARSEMEM_VMEMMAP=y.

    [dan.j.williams@intel.com: changelog update]
    [rppt@linux.ibm.com: changelog update]
    Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
    Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Wei Yang
    Signed-off-by: Baoquan He
    Acked-by: David Hildenbrand
    Reviewed-by: Baoquan He
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit 76073c646f5f4999d763f471df9e38a5a912d70d upstream.

    Commit 68600f623d69 ("mm: don't miss the last page because of round-off
    error") makes the scan size round up to @denominator regardless of the
    memory cgroup's state, online or offline. This affects the overall
    reclaiming behavior: the corresponding LRU list is eligible for
    reclaiming only when its size logically right shifted by @sc->priority
    is bigger than zero in the former formula.

    For example, the inactive anonymous LRU list should have at least 0x4000
    pages to be eligible for reclaiming when we have 60/12 for
    swappiness/priority and without taking scan/rotation ratio into account.

    After the roundup is applied, the inactive anonymous LRU list becomes
    eligible for reclaiming when its size is bigger than or equal to 0x1000
    in the same condition.

    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
    ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1
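
    The arithmetic above can be checked directly; this is only the example
    calculation from the changelog, not the kernel's get_scan_count():

    #include <stdio.h>

    int main(void)
    {
        unsigned long prio = 12, swappiness = 60, denom = 60 + 140 + 1;

        /* without the round-up: 0x4000 pages are needed for a non-zero scan */
        printf("%lu\n", (0x4000UL >> prio) * swappiness / denom);               /* 1 */
        printf("%lu\n", (0x1000UL >> prio) * swappiness / denom);               /* 0 */
        /* with the unconditional round-up (+ denom - 1), 0x1000 is already enough */
        printf("%lu\n", ((0x1000UL >> prio) * swappiness + denom - 1) / denom); /* 1 */
        return 0;
    }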

    aarch64 has 512MB huge page size when the base page size is 64KB. The
    memory cgroup that has a huge page is always eligible for reclaiming in
    that case.

    The reclaiming is likely to stop after the huge page is reclaimed,
    meaning the further iteration on @sc->priority and the sibling and child
    memory cgroups will be skipped. The overall behaviour has been changed.
    This fixes the issue by applying the roundup to offlined memory cgroups
    only, to give more preference to reclaiming memory from offlined memory
    cgroups. It sounds reasonable as that memory is unlikely to be used by
    anyone.

    The issue was found by starting up 8 VMs on an Ampere Mustang machine,
    which has 8 CPUs and 16 GB of memory. Each VM is given 2 vCPUs and 2GB
    of memory. It took 264 seconds for all VMs to be completely up, and
    784MB of swap was consumed after that. With this patch applied, it took
    236 seconds and 60MB of swap to do the same thing. So there is a 10%
    performance improvement for my case. Note that KSM is disabled while THP
    is enabled in the testing.

              total     used     free   shared  buff/cache  available
    Mem:      16196    10065     2049       16        4081       3749
    Swap:      8175      784     7391

              total     used     free   shared  buff/cache  available
    Mem:      16196    11324     3656       24        1215       2936
    Swap:      8175       60     8115

    Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
    Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
    Signed-off-by: Gavin Shan
    Acked-by: Roman Gushchin
    Cc: [4.20+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     
  • commit 75866af62b439859d5146b7093ceb6b482852683 upstream.

    for_each_mem_cgroup() increases css reference counter for memory cgroup
    and requires to use mem_cgroup_iter_break() if the walk is cancelled.

    Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
    Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
    Signed-off-by: Vasily Averin
    Acked-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     

11 Feb, 2020

7 commits

  • commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream.

    Architectures for which we have hardware walkers of the Linux page table
    should flush the TLB on mmu gather batch allocation failures and batch flush.
    Some architectures like POWER support multiple translation modes (hash
    and radix), and in the case of POWER only the radix translation mode needs
    the above TLBI. This is because for hash translation mode the kernel wants
    to avoid this extra flush since there are no hardware walkers of the Linux
    page table. With radix translation, the hardware also walks the Linux page
    table and, with that, the kernel needs to make sure to TLB-invalidate the
    page walk cache before page table pages are freed.

    More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
    TLB caches for RCU_TABLE_FREE")

    The changes to sparc are to make sure we keep the old behavior since we
    are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
    tlb_needs_table_invalidate is to always force an invalidate and sparc can
    avoid the table invalidate. Hence we define tlb_needs_table_invalidate to
    false for sparc architecture.

    Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
    Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman [powerpc]
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit e822969cab48b786b64246aad1a3ba2a774f5d23 upstream.

    Patch series "mm: fix max_pfn not falling on section boundary", v2.

    Playing with different memory sizes for an x86-64 guest, I discovered that
    some memmaps (highest section if max_mem does not fall on the section
    boundary) are marked as being valid and online, but contain garbage. We
    have to properly initialize these memmaps.

    Looking at /proc/kpageflags and friends, I found some more issues,
    partially related to this.

    This patch (of 3):

    If max_pfn is not aligned to a section boundary, we can easily run into
    BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
    memory size that is not a multiple of 128MB (e.g., 4097MB, but also
    4160MB). I was told that on real HW, we can easily have this scenario
    (esp., one of the main reasons sub-section hotadd of devmem was added).

    The issue is, that we have a valid memmap (pfn_valid()) for the whole
    section, and the whole section will be marked "online".
    pfn_to_online_page() will succeed, but the memmap contains garbage.

    E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
    4160M" - (see tools/vm/page-types.c):

    [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 200.477500] #PF: supervisor read access in kernel mode
    [ 200.478334] #PF: error_code(0x0000) - not-present page
    [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
    [ 200.479557] Oops: 0000 [#4] SMP NOPTI
    [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
    [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
    [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
    [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
    [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
    [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
    [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
    [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
    [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
    [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
    [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
    [ 200.488897] Call Trace:
    [ 200.489115] kpageflags_read+0xe9/0x140
    [ 200.489447] proc_reg_read+0x3c/0x60
    [ 200.489755] vfs_read+0xc2/0x170
    [ 200.490037] ksys_pread64+0x65/0xa0
    [ 200.490352] do_syscall_64+0x5c/0xa0
    [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
    after cold/hot plugging a DIMM to such a system:

    [root@localhost ~]# cat /proc/kpageflags > /dev/null
    [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 111.517907] #PF: supervisor read access in kernel mode
    [ 111.518333] #PF: error_code(0x0000) - not-present page
    [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

    This patch fixes that by at least zero-ing out that memmap (so e.g.,
    page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining
    unavailable struct pages") tried to fix a similar issue, but forgot to
    consider this special case.

    After this patch, there are still problems to solve. E.g., not all of
    these pages falling into a memory hole will actually get initialized later
    and set PageReserved - they are only zeroed out - but at least the
    immediate crashes are gone. A follow-up patch will take care of this.

    Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
    Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
    Signed-off-by: David Hildenbrand
    Tested-by: Daniel Jordan
    Cc: Naoya Horiguchi
    Cc: Pavel Tatashin
    Cc: Andrew Morton
    Cc: Steven Sistare
    Cc: Michal Hocko
    Cc: Daniel Jordan
    Cc: Bob Picco
    Cc: Oscar Salvador
    Cc: Alexey Dobriyan
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Stephen Rothwell
    Cc: [4.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit 5984fabb6e82d9ab4e6305cb99694c85d46de8ae upstream.

    Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the
    semantic of move_pages() has changed to return the number of
    non-migrated pages if they were the result of non-fatal reasons (usually
    a busy page).

    This was an unintentional change that hasn't been noticed except for LTP
    tests which checked for the documented behavior.

    There are two ways to go around this change. We can even get back to
    the original behavior and return -EAGAIN whenever migrate_pages is not
    able to migrate pages due to non-fatal reasons. Another option would be
    to simply continue with the changed semantic and extend move_pages
    documentation to clarify that -errno is returned on an invalid input or
    when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the
    number of pages that couldn't have been migrated due to ephemeral
    reasons (e.g. page is pinned or locked for other reasons).

    This patch implements the second option because this behavior is in
    place for some time without anybody complaining and possibly new users
    depending on it. Also it allows slightly easier error handling, as the
    caller knows that it is worth retrying when err > 0.

    But since, under the new semantic, the operation is aborted immediately
    when migration fails due to ephemeral reasons, we need to include the
    number of non-attempted pages in the return value too.
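
    A hedged userspace sketch of how a caller can act on that semantic
    (migrate_with_retry() is an illustrative name; link with -lnuma, whose
    wrapper returns -1 with errno set on hard errors):

    #include <numaif.h>
    #include <stdio.h>

    /* ret > 0: that many pages were left behind for ephemeral reasons, so
     * inspect status[] and retry; ret == 0: everything migrated; ret < 0
     * (or -1 with errno from the libnuma wrapper): the call itself failed. */
    static long migrate_with_retry(unsigned long count, void **pages,
                                   const int *nodes, int *status, int retries)
    {
        long ret;

        do {
            ret = move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);
            if (ret <= 0)
                return ret;
            fprintf(stderr, "%ld pages not migrated yet, retrying\n", ret);
        } while (retries--);

        return ret;     /* still > 0: caller can inspect status[] */
    }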

    Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Reviewed-by: Wei Yang
    Cc: [4.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • commit fac0516b5534897bf4c4a88daa06a8cfa5611b23 upstream.

    If compound is true, this means it is a PMD-mapped THP, which implies
    the page is not linked to any defer list. So the first code chunk will
    not be executed.

    Also with this reason, it would not be proper to add this page to a
    defer list. So the second code chunk is not correct.

    Based on this, we should remove the defer list related code.

    [yang.shi@linux.alibaba.com: better patch title]
    Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
    Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
    Signed-off-by: Wei Yang
    Suggested-by: Kirill A. Shutemov
    Acked-by: Yang Shi
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: [5.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     
  • commit f1037ec0cc8ac1a450974ad9754e991f72884f48 upstream.

    The daxctl unit test for the dax_kmem driver currently triggers the
    (false positive) lockdep splat below. It results from the fact that
    remove_memory_block_devices() is invoked under the mem_hotplug_lock()
    causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
    active state tracking). It is a false positive because the sysfs
    attribute path triggering the memory remove is not the same attribute
    path associated with memory-block device.

    sysfs_break_active_protection() is not applicable since there is no real
    deadlock conflict, instead move memory-block device removal outside the
    lock. The mem_hotplug_lock() is not needed to synchronize the
    memory-block device removal vs the page online state, that is already
    handled by lock_device_hotplug(). Specifically, lock_device_hotplug()
    is sufficient to allow try_remove_memory() to check the offline state of
    the memblocks and be assured that any in progress online attempts are
    flushed / blocked by kernfs_drain() / attribute removal.

    The add_memory() path safely creates memblock devices under the
    mem_hotplug_lock(). There is no kernfs active state synchronization in
    the memblock device_register() path, so nothing to fix there.

    This change is only possible thanks to the recent change that refactored
    memory block device removal out of arch_remove_memory() (commit
    4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before
    arch_remove_memory()"), and David's due diligence tracking down the
    guarantees afforded by kernfs_drain(). Not flagged for -stable since
    this only impacts ongoing development and lockdep validation, not a
    runtime issue.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.5.0-rc3+ #230 Tainted: G OE
    ------------------------------------------------------
    lt-daxctl/6459 is trying to acquire lock:
    ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80

    but task is already holding lock:
    ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (mem_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    get_online_mems+0x3e/0xb0
    kmem_cache_create_usercopy+0x2e/0x260
    kmem_cache_create+0x12/0x20
    ptlock_cache_init+0x20/0x28
    start_kernel+0x243/0x547
    secondary_startup_64+0xb6/0xc0

    -> #1 (cpu_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    cpus_read_lock+0x3e/0xb0
    online_pages+0x37/0x300
    memory_subsys_online+0x17d/0x1c0
    device_online+0x60/0x80
    state_store+0x65/0xd0
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#241){++++}:
    check_prev_add+0x98/0xa40
    validate_chain+0x576/0x860
    __lock_acquire+0x39c/0x790
    lock_acquire+0xa2/0x1b0
    __kernfs_remove+0x25f/0x2e0
    kernfs_remove_by_name_ns+0x41/0x80
    remove_files.isra.0+0x30/0x70
    sysfs_remove_group+0x3d/0x80
    sysfs_remove_groups+0x29/0x40
    device_remove_attrs+0x39/0x70
    device_del+0x16a/0x3f0
    device_unregister+0x16/0x60
    remove_memory_block_devices+0x82/0xb0
    try_remove_memory+0xb5/0x130
    remove_memory+0x26/0x40
    dev_dax_kmem_remove+0x44/0x6a [kmem]
    device_release_driver_internal+0xe4/0x1c0
    unbind_store+0xef/0x120
    kernfs_fop_write+0xcf/0x1c0
    vfs_write+0xdb/0x1d0
    ksys_write+0x65/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
    kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem

    Possible unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(mem_hotplug_lock.rw_sem);
                                        lock(cpu_hotplug_lock.rw_sem);
                                        lock(mem_hotplug_lock.rw_sem);
    lock(kn->count#241);

    *** DEADLOCK ***

    No fixes tag as this has been a long standing issue that predated the
    addition of kernfs lockdep annotations.

    Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Vishal Verma
    Cc: Pavel Tatashin
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 1f503443e7df8dc8366608b4d810ce2d6669827c upstream.

    After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
    when a mem section is fully deactivated, section_mem_map still records
    the section's start pfn, which is not used any more and will be
    reassigned during re-addition.

    In analogy with the alloc/free pattern, it is better to clear all fields
    of section_mem_map.

    Besides this, it breaks the user space tool "makedumpfile" [1], which
    makes the assumption that a hot-removed section has mem_map as NULL,
    instead of checking directly against the SECTION_MARKED_PRESENT bit.
    (It would be better for makedumpfile to change that assumption, which
    needs a patch.)

    The bug can be reproduced on IBM PowerVM by running "drmgr -c mem -r -q 5",
    triggering a crash, and saving the vmcore with makedumpfile.

    [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

    Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Qian Cai
    Cc: Kazuhito Hagio
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pingfan Liu
     
  • commit 68f23b89067fdf187763e75a56087550624fdbee upstream.

    Without memcg, there is a one-to-one mapping between the bdi and
    bdi_writeback structures. In this world, things are fairly
    straightforward; the first thing bdi_unregister() does is to shutdown
    the bdi_writeback structure (or wb), and part of that writeback ensures
    that no other work queued against the wb, and that the wb is fully
    drained.

    With memcg, however, there is a one-to-many relationship between the bdi
    and bdi_writeback structures; that is, there are multiple wb objects
    which can all point to a single bdi. There is a refcount which prevents
    the bdi object from being released (and hence, unregistered). So in
    theory, the bdi_unregister() *should* only get called once its refcount
    goes to zero (bdi_put will drop the refcount, and when it is zero,
    release_bdi gets called, which calls bdi_unregister).

    Unfortunately, del_gendisk() in block/genhd.c never got the memo about
    the Brave New memcg World, and calls bdi_unregister directly. It does
    this without informing the file system, or the memcg code, or anything
    else. This causes the root wb associated with the bdi to be
    unregistered, but none of the memcg-specific wb's are shutdown. So when
    one of these wb's are woken up to do delayed work, they try to
    dereference their wb->bdi->dev to fetch the device name, but
    unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
    called by del_gendisk(). As a result, *boom*.

    Fortunately, it looks like the rest of the writeback path is perfectly
    happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
    create a bdi_dev_name() function which can handle bdi->dev being NULL.
    This also allows us to bulletproof the writeback tracepoints to prevent
    them from dereferencing a NULL pointer and crashing the kernel if one is
    tracing with memcg's enabled, and an iSCSI device dies or a USB storage
    stick is pulled.
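
    A hedged sketch of the NULL-tolerant pattern; the actual bdi_dev_name()
    added by the patch may differ in name and placement:

    #include <linux/backing-dev.h>
    #include <linux/device.h>

    /* Illustrative only: never dereference bdi->dev blindly when printing. */
    static const char *wb_bdi_name(struct backing_dev_info *bdi)
    {
        if (!bdi || !bdi->dev)
            return "(unknown)";
        return dev_name(bdi->dev);
    }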

    The most common way of triggering this will be hotremoval of a device
    while writeback with memcg enabled is going on. It was triggering
    several times a day in a heavily loaded production environment.

    Google Bug Id: 145475544

    Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
    Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
    Signed-off-by: Theodore Ts'o
    Cc: Chris Mason
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

06 Feb, 2020

2 commits

  • [ Upstream commit dfe9aa23cab7880a794db9eb2d176c00ed064eb6 ]

    If we get here after successfully adding the page to the list, err would
    be 1 to indicate the page is queued in the list.

    The current code has two problems:

    * on success, 0 is not returned
    * on error, if add_page_for_migration() returns 1, and the following err1
      from do_move_pages_to_node() is set, err1 is not returned since err
      is 1

    And these behaviors break the user interface.

    Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
    Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node")
    Signed-off-by: Wei Yang
    Acked-by: Yang Shi
    Cc: John Hubbard
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Wei Yang
     
  • commit c7a91bc7c2e17e0a9c8b9745a2cb118891218fd1 upstream.

    What we are trying to do is change the '=' character to a NUL terminator
    and then at the end of the function we restore it back to an '='. The
    problem is there are two error paths where we jump to the end of the
    function before we have replaced the '=' with NUL.

    We end up putting the '=' in the wrong place (possibly one element
    before the start of the buffer).
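
    A generic sketch of this bug class and its fix, not the mempolicy parser
    itself: the temporary NUL must only be undone on paths that actually
    installed it.

    #include <string.h>

    static int parse_key_value(char *str)
    {
        char *eq = strchr(str, '=');
        int err = -1;

        if (!eq)
            goto out;       /* early exit before '=' was replaced */
        *eq = '\0';         /* split "key=value" in place */

        /* ... parse str (key) and eq + 1 (value), set err ... */
        err = 0;
    out:
        if (eq)             /* restore only if we actually replaced it */
            *eq = '=';
        return err;
    }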

    Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
    Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Signed-off-by: Dan Carpenter
    Acked-by: Vlastimil Babka
    Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Dan Carpenter
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

23 Jan, 2020

7 commits

  • commit 6d9e8c651dd979aa666bee15f086745f3ea9c4b3 upstream.

    Patch series "use div64_ul() instead of div_u64() if the divisor is
    unsigned long".

    We were first inspired by commit b0ab99e7736a ("sched: Fix possible divide
    by zero in avg_atom () calculation"), then refer to the recently analyzed
    mm code, we found this suspicious place.

    201 if (min) {
    202 min *= this_bw;
    203 do_div(min, tot_bw);
    204 }

    And we also disassembled and confirmed it:

    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
    0xffffffff811c37da : xor %r10d,%r10d
    0xffffffff811c37dd : test %rax,%rax
    0xffffffff811c37e0 : je 0xffffffff811c3800
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
    0xffffffff811c37e2 : imul %r8,%rax
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
    0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
    0xffffffff811c37e9 : xor %edx,%edx
    0xffffffff811c37eb : div %r10
    0xffffffff811c37ee : imul %rbx,%rax
    0xffffffff811c37f2 : shr $0x2,%rax
    0xffffffff811c37f6 : mul %rcx
    0xffffffff811c37f9 : shr $0x2,%rdx
    0xffffffff811c37fd : mov %rdx,%r10

    This series uses div64_ul() instead of div_u64() if the divisor is
    unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

    This patch (of 3):

    The variables 'min' and 'max' are unsigned long and do_div truncates
    them to 32 bits, which means it can test non-zero and be truncated to
    zero for division. Fix this issue by using div64_ul() instead.
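
    A hedged kernel-style sketch of the truncation (the values are made up;
    div_u64() takes a u32 divisor, div64_ul() an unsigned long one):

    #include <linux/math64.h>

    static void divisor_truncation_demo(void)
    {
        u64 dividend = 100;
        unsigned long divisor = 0x100000001UL;    /* low 32 bits == 1 */

        u64 wrong = div_u64(dividend, divisor);   /* divisor truncated: divides by 1 */
        u64 right = div64_ul(dividend, divisor);  /* full-width divisor: result is 0 */

        (void)wrong;
        (void)right;
    }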

    Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
    Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wen Yang
     
  • commit 8068df3b60373c390198f660574ea14c8098de57 upstream.

    When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (especially, the
    memmap is freed). When we re-add that section, the usage map is reused,
    however, it is no longer an early section. When removing that section
    again, we try to kfree() a usage map that was allocated during early
    boot - bad.

    Let's check against PageReserved() to see if we are dealing with a
    usage map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
    cleaner.

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but the same is
    not true for e.g. ppc64 and s390, where the kernel would not boot with
    debug_pagealloc=on as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
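
    A hedged sketch of the two-stage pattern described above; the names are
    illustrative, not the exact symbols added by the patch:

    #include <linux/cache.h>
    #include <linux/jump_label.h>
    #include <linux/types.h>

    static bool feature_enabled_early __read_mostly;
    static DEFINE_STATIC_KEY_FALSE(feature_key);

    /* Safe at any time, including before jump_label_init(). */
    static inline bool feature_enabled(void)
    {
        return feature_enabled_early;
    }

    /* Fast-path variant, valid only after the key has been set up. */
    static inline bool feature_enabled_static(void)
    {
        return static_branch_unlikely(&feature_key);
    }

    /* Called from a point where jump_label_init() is guaranteed to have run. */
    static void feature_init_late(void)
    {
        if (feature_enabled_early)
            static_branch_enable(&feature_key);
    }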

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 2fe20210fc5f5e62644678b8f927c49f2c6f42a7 upstream.

    When booting with amd_iommu=off, the following WARNING message
    appears:

    AMD-Vi: AMD IOMMU disabled on kernel command-line
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
    Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
    RIP: 0010:flush_workqueue+0x42e/0x450
    Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
    Call Trace:
    kmem_cache_destroy+0x69/0x260
    iommu_go_to_state+0x40c/0x5ab
    amd_iommu_prepare+0x16/0x2a
    irq_remapping_prepare+0x36/0x5f
    enable_IR_x2apic+0x21/0x172
    default_setup_apic_routing+0x12/0x6f
    apic_intr_mode_init+0x1a1/0x1f1
    x86_late_time_init+0x17/0x1c
    start_kernel+0x480/0x53f
    secondary_startup_64+0xb6/0xc0
    ---[ end trace 30894107c3749449 ]---
    x2apic: IRQ remapping doesn't support X2APIC mode
    x2apic disabled

    The warning is caused by the calling of 'kmem_cache_destroy()'
    in free_iommu_resources(). Here is the call path:

    free_iommu_resources
    kmem_cache_destroy
    flush_memcg_workqueue
    flush_workqueue

    The root cause is that the IOMMU subsystem runs before the workqueue
    subsystem, so the variable 'wq_online' is still 'false'. This causes the
    check 'if (WARN_ON(!wq_online))' in flush_workqueue() to trigger.

    Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
    point, it is unnecessary to call flush_memcg_workqueue(). Skipping it
    prevents the WARNING message triggered by flush_workqueue().

    Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
    Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
    Signed-off-by: Adrian Huang
    Reported-by: Xiaochun Lee
    Reviewed-by: Shakeel Butt
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Adrian Huang
     
  • commit 4a87e2a25dc27131c3cce5e94421622193305638 upstream.

    Currently slab percpu vmstats are flushed twice: during the memcg
    offlining and just before freeing the memcg structure. Each time percpu
    counters are summed, added to the atomic counterparts and propagated up
    by the cgroup tree.

    The second flushing is required due to how recursive vmstats are
    implemented: counters are batched in percpu variables on a local level,
    and once a percpu value is crossing some predefined threshold, it spills
    over to atomic values on the local and each ascendant levels. It means
    that without flushing, some numbers cached in percpu variables will be
    dropped on the floor each time a cgroup is destroyed. And with uptime the
    error on upper levels might become noticeable.

    The first flushing aims to make counters on ancestor levels more
    precise. Dying cgroups may remain in the dying state for a long time.
    After kmem_cache reparenting which is performed during the offlining
    slab counters of the dying cgroup don't have any chances to be updated,
    because any slab operations will be performed on the parent level. It
    means that the inaccuracy caused by percpu batching will not decrease up
    to the final destruction of the cgroup. By the original idea flushing
    slab counters during the offlining should minimize the visible
    inaccuracy of slab counters on the parent level.

    The problem is that percpu counters are not zeroed after the first
    flushing. So every cached percpu value is summed twice. It creates a
    small error (up to 32 pages per cpu, but usually less) which accumulates
    on parent cgroup level. After creating and destroying of thousands of
    child cgroups, slab counter on parent level can be way off the real
    value.

    For now, let's just stop flushing slab counters on memcg offlining. It
    can't be done correctly without scheduling a work on each cpu: reading
    and zeroing it during css offlining can race with an asynchronous
    update, which doesn't expect values to be changed underneath.

    With this change, slab counters on parent level will become eventually
    consistent. Once all dying children are gone, values are correct. And
    if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

    It's not perfect, as slabs are reparented, so any updates after the
    reparenting will happen on the parent level. It means that if a slab
    page was allocated and a counter on the child level was bumped, then the
    page was reparented and freed, the annihilation of positive and negative
    counter values will not happen until the child cgroup is released. This
    makes slab counters different from others, and it might push us to
    implement flushing in a correct form again. But it's also a question of
    performance: scheduling a work on each cpu isn't free, and it's an open
    question whether the benefit of having more accurate counters is worth it.

    We might also consider flushing all counters on offlining, not only slab
    counters.

    So let's fix the main problem now: make the slab counters eventually
    consistent, so at least the error won't grow with uptime (or more
    precisely the number of created and destroyed cgroups). And think about
    the accuracy of counters separately.

    Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
    Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit 97d3d0f9a1cf132c63c0b8b8bd497b8a56283dd9 upstream.

    Patch series "Fix two above-47bit hint address vs. THP bugs".

    The two get_unmapped_area() implementations have to be fixed to provide
    THP-friendly mappings if above-47bit hint address is specified.

    This patch (of 2):

    Filesystems use thp_get_unmapped_area() to provide THP-friendly
    mappings. For DAX in particular.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks thp_get_unmapped_area(): the function
    would not try to allocate a PMD-aligned area if *any* hint address is
    specified.

    Modify the routine to handle it correctly:

    - Try to allocate the space at the specified hint address with length
    padding required for PMD alignment.
    - If failed, retry without length padding (but with the same hint
    address);
    - If the returned address matches the hint address return it.
    - Otherwise, align the address as required for THP and return.

    The user specified hint address is passed down to get_unmapped_area() so
    above-47bit hint address will be taken into account without breaking
    alignment requirements.
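
    A hedged pseudo-kernel sketch of the four steps listed above (this is not
    the actual thp_get_unmapped_area(); the padding amount and the alignment
    math are illustrative):

    #include <linux/err.h>
    #include <linux/mm.h>
    #include <linux/sched.h>

    static unsigned long thp_area_sketch(struct file *filp, unsigned long addr,
                                         unsigned long len, unsigned long pgoff,
                                         unsigned long flags)
    {
        loff_t off = (loff_t)pgoff << PAGE_SHIFT;
        unsigned long pad = len + PMD_SIZE - PAGE_SIZE;  /* room to align */
        unsigned long ret;

        /* 1) try at the hint with padding for PMD alignment */
        ret = current->mm->get_unmapped_area(filp, addr, pad, pgoff, flags);
        if (IS_ERR_VALUE(ret))
            /* 2) retry without padding, keeping the same hint */
            return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);

        /* 3) the hint itself was satisfied: return it untouched */
        if (ret == addr)
            return ret;

        /* 4) otherwise shift the result up to the next PMD-aligned offset */
        ret += (off - ret) & (PMD_SIZE - 1);
        return ret;
    }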

    Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Thomas Willhalm
    Tested-by: Dan Williams
    Cc: "Aneesh Kumar K . V"
    Cc: "Bruggeman, Otto G"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 991589974d9c9ecb24ee3799ec8c415c730598a2 upstream.

    Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
    enabled. But it doesn't work well with above-47bit hint address.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
    shmem_get_unmapped_area() would not try to allocate a PMD-aligned area if
    *any* hint address is specified.

    This can be fixed by requesting the aligned area if we failed to
    allocate at the user-specified hint address. The request with inflated
    length will also take the user-specified hint address. This way we will
    not lose an allocation request from the full address space.

    [kirill@shutemov.name: fold in a fixup]
    Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
    Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Cc: "Willhalm, Thomas"
    Cc: Dan Williams
    Cc: "Bruggeman, Otto G"
    Cc: "Aneesh Kumar K . V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

18 Jan, 2020

1 commit

  • commit 1d1585ca0f48fe7ed95c3571f3e4a82b2b5045dc upstream.

    Commit 3d7081822f7f ("uaccess: Add non-pagefault user-space read functions")
    missed adding a probe write function; therefore, factor out a probe_write_common()
    helper with most of the logic of probe_kernel_write() except setting KERNEL_DS,
    and add a new probe_user_write() helper so it can be used from the BPF side.

    Again, on some archs, the user address space and kernel address space can
    co-exist and be overlapping, so in such a case, setting KERNEL_DS would mean
    that the given address is treated as being in kernel address space.
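
    A hedged sketch of a caller; the signature assumed here is
    probe_user_write(dst, src, size) returning 0 on success, and
    poke_user_u32() is an illustrative name:

    #include <linux/uaccess.h>

    /* Poke a u32 into user memory without risking a page fault. */
    static long poke_user_u32(void __user *uaddr, u32 val)
    {
        return probe_user_write(uaddr, &val, sizeof(val));
    }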

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Cc: Masami Hiramatsu
    Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649915.git.daniel@iogearbox.net
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

09 Jan, 2020

10 commits

  • [ Upstream commit c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d ]

    The following lockdep splat was observed when a certain hugetlbfs test
    was run:

    ================================
    WARNING: inconsistent lock state
    4.18.0-159.el8.x86_64+debug #1 Tainted: G W --------- - -
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
    {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    __nr_hugepages_store_common+0x11b/0xb30
    hugetlb_sysctl_handler_common+0x209/0x2d0
    proc_sys_call_handler+0x37f/0x450
    vfs_write+0x157/0x460
    ksys_write+0xb8/0x170
    do_syscall_64+0xa5/0x4d0
    entry_SYSCALL_64_after_hwframe+0x6a/0xdf
    irq event stamp: 691296
    hardirqs last enabled at (691296): [] _raw_spin_unlock_irqrestore+0x4b/0x60
    hardirqs last disabled at (691295): [] _raw_spin_lock_irqsave+0x22/0x81
    softirqs last enabled at (691284): [] irq_enter+0xc3/0xe0
    softirqs last disabled at (691285): [] irq_exit+0x23e/0x2b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(hugetlb_lock);
    <Interrupt>
      lock(hugetlb_lock);

    *** DEADLOCK ***
    :
    Call Trace:

    __lock_acquire+0x146b/0x48c0
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    free_huge_page+0x36f/0xaa0
    bio_check_pages_dirty+0x2fc/0x5c0
    clone_endio+0x17f/0x670 [dm_mod]
    blk_update_request+0x276/0xe50
    scsi_end_request+0x7b/0x6a0
    scsi_io_completion+0x1c6/0x1570
    blk_done_softirq+0x22e/0x350
    __do_softirq+0x23d/0xad8
    irq_exit+0x23e/0x2b0
    do_IRQ+0x11a/0x200
    common_interrupt+0xf/0xf

    Both the hugetlb_lock and the subpool lock can be acquired in
    free_huge_page(). One way to solve the problem is to make both locks
    irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is
    held for a linear scan of ALL hugetlb pages during a cgroup reparenting
    operation. So it is just too long to have irqs disabled unless we can
    break hugetlb_lock down into finer-grained locks with shorter lock hold
    times.

    Another alternative is to defer the freeing to a workqueue job. This
    patch implements the deferred freeing by adding a free_hpage_workfn()
    work function to do the actual freeing. The free_huge_page() call in a
    non-task context saves the page to be freed in the hpage_freelist linked
    list in a lockless manner using the llist APIs.

    The generic workqueue is used to process the work, but a dedicated
    workqueue can be used instead if it is desirable to have the huge page
    freed ASAP.
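
    A hedged sketch of the lockless llist + workqueue pattern described above;
    the names are illustrative, not the hugetlb symbols:

    #include <linux/llist.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>

    struct deferred_item {
        struct llist_node node;
        /* ... payload ... */
    };

    static LLIST_HEAD(deferred_list);

    static void deferred_free_workfn(struct work_struct *work)
    {
        struct llist_node *head = llist_del_all(&deferred_list);
        struct deferred_item *item, *next;

        llist_for_each_entry_safe(item, next, head, node)
            kfree(item);    /* the real freeing, now in task context */
    }
    static DECLARE_WORK(deferred_free_work, deferred_free_workfn);

    static void queue_deferred_free(struct deferred_item *item)
    {
        /* Lockless add; safe from softirq / non-task context. */
        if (llist_add(&item->node, &deferred_list))
            schedule_work(&deferred_free_work);
    }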

    Thanks to Kirill Tkhai for suggesting the use of
    llist APIs which simplify the code.

    Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Mike Kravetz
    Acked-by: Davidlohr Bueso
    Acked-by: Michal Hocko
    Reviewed-by: Kirill Tkhai
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     
  • [ Upstream commit 030eab4f9ffb469344c10a46bc02c5149db0a2a9 ]

    Building the kernel on s390 with -Og produces the following warning:

    WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
    The function populate_section_memmap() references
    the function __meminit __populate_section_memmap().
    This is often because populate_section_memmap lacks a __meminit
    annotation or the annotation of __populate_section_memmap is wrong.

    While -Og is not supported, in theory this might still happen with
    another compiler or on another architecture. So fix this by using the
    correct section annotations.

    [iii@linux.ibm.com: v2]
    Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
    Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
    Signed-off-by: Ilya Leoshkevich
    Acked-by: David Hildenbrand
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Ilya Leoshkevich
     
  • commit 24cecc37746393432d994c0dbc251fb9ac7c5d72 upstream.

    The ARMv8 64-bit architecture supports execute-only user permissions by
    clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
    privileged mapping but one from which user code running at EL0 can still
    execute.

    The downside, however, is that the kernel at EL1 inadvertently reading
    such mapping would not trip over the PAN (privileged access never)
    protection.

    Revert the relevant bits from commit cab15ce604e5 ("arm64: Introduce
    execute-only page access permissions") so that PROT_EXEC implies
    PROT_READ (and therefore PTE_USER) until the architecture gains proper
    support for execute-only user mappings.

    Fixes: cab15ce604e5 ("arm64: Introduce execute-only page access permissions")
    Cc: # 4.9.x-
    Acked-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     
  • commit a7c46c0c0e3d62f2764cd08b90934cd2aaaf8545 upstream.

    In the implementation of __gup_benchmark_ioctl() the allocated pages
    should be released before returning in case of an invalid cmd. Release
    pages via kvfree().

    [akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
    Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
    Fixes: 714a3a1ebafe ("mm/gup_benchmark.c: add additional pinning methods")
    Signed-off-by: Navid Emamdoost
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Keith Busch
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Navid Emamdoost
     
  • commit 941f762bcb276259a78e7931674668874ccbda59 upstream.

    pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
    As everything else is printed in kB, I chose to fix the value rather than
    the string.

    Before:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1878] 1000 1878 217253 151144 1269760 0 0 python
    ...
    Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0

    After:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1436] 1000 1436 217253 151890 1294336 0 0 python
    ...
    Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
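
    The fix is a plain bytes-to-kB conversion; a tiny stand-alone demo using
    the pgtables_bytes value from the report above (illustrative only):

    #include <stdio.h>

    int main(void)
    {
            unsigned long pgtables_bytes = 1294336;  /* from the oom report */

            /* before: the raw byte count was printed with a kB suffix */
            printf("pgtables:%lukB (wrong)\n", pgtables_bytes);

            /* after: scale bytes to kB before printing */
            printf("pgtables:%lukB (right)\n", pgtables_bytes / 1024);
            return 0;
    }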

    Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
    Fixes: 70cb6d267790 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Edward Chron
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     
  • commit e0153fc2c7606f101392b682e720a7a456d6c766 upstream.

    Felix Abecassis reports that move_pages() returns random status values
    if the pages are already on the target node, as demonstrated by the test
    program below:

    int main(void)
    {
            const long node_id = 1;
            const long page_size = sysconf(_SC_PAGESIZE);
            const int64_t num_pages = 8;

            unsigned long nodemask = 1 << node_id;
            long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
            if (ret < 0)
                    return (EXIT_FAILURE);

            void **pages = malloc(sizeof(void*) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
                                    MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
                                    -1, 0);
                    if (pages[i] == MAP_FAILED)
                            return (EXIT_FAILURE);
            }

            ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
            if (ret < 0)
                    return (EXIT_FAILURE);

            int *nodes = malloc(sizeof(int) * num_pages);
            int *status = malloc(sizeof(int) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    nodes[i] = node_id;
                    status[i] = 0xd0;       /* simulate garbage values */
            }

            ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
            printf("move_pages: %ld\n", ret);
            for (int i = 0; i < num_pages; ++i)
                    printf("status[%d] = %d\n", i, status[i]);
    }

    Running the program then prints nonsense status values:

    $ ./move_pages_bug
    move_pages: 0
    status[0] = 208
    status[1] = 208
    status[2] = 208
    status[3] = 208
    status[4] = 208
    status[5] = 208
    status[6] = 208
    status[7] = 208

    This is because the status is not set if the page is already on the
    target node, but move_pages() should return a valid status as long as it
    succeeds. The valid status may be an errno or a node id.

    We can't simply initialize the status array to zero, since the pages may
    not be on node 0. Fix it by updating the status with the node id that the
    page is already on, as sketched below.
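
    A hedged sketch of the idea in do_pages_move() (helper names as in
    mm/migrate.c, exact flow approximated): when the page already sits on the
    requested node, report that node through the status array instead of
    leaving the entry untouched:

    err = add_page_for_migration(mm, addr, current_node, &pagelist,
                                 flags & MPOL_MF_MOVE_ALL);
    if (err > 0) {
            /* the page was queued for migration */
            continue;
    }

    /* already on the target node (err == 0): report the node id,
     * otherwise report the error */
    err = store_status(status, i, err ? err : current_node, 1);
    if (err)
            goto out_flush;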

    Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi
    Reported-by: Felix Abecassis
    Tested-by: Felix Abecassis
    Suggested-by: Michal Hocko
    Reviewed-by: John Hubbard
    Acked-by: Christoph Lameter
    Acked-by: Michal Hocko
    Reviewed-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: [4.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • commit ac8f05da5174c560de122c499ce5dfb5d0dfbee5 upstream.

    When a zspage is migrated to another zone, the zone page state should be
    updated as well; otherwise the NR_ZSPAGES count for each zone shows wrong
    values, visible for example in /proc/zoneinfo in practice.
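
    A hedged, kernel-style fragment of the described fix (not the literal
    diff): when migration moves the zspage contents to a page in another
    zone, the per-zone counter has to follow:

    /* old page leaves its zone, new page joins the destination zone */
    dec_zone_page_state(page, NR_ZSPAGES);
    inc_zone_page_state(newpage, NR_ZSPAGES);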

    Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com
    Fixes: 91537fee0013 ("mm: add NR_ZSMALLOC to vmstat")
    Signed-off-by: Chanho Min
    Signed-off-by: Jinsuk Choi
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chanho Min
     
  • commit feee6b2989165631b17ac6d4ccdbf6759254e85a upstream.

    We currently try to shrink a single zone when removing memory. We use
    the zone of the first page of the memory we are removing. If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __remove_pages+0x4b/0x640
    arch_remove_memory+0x63/0x8d
    try_remove_memory+0xdb/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x70/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x227/0x3a0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x221/0x550
    worker_thread+0x50/0x3b0
    kthread+0x105/0x140
    ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that. We now
    properly shrink the zones, even if we have DIMMs whereby

    - Some memory blocks fall into no zone (never onlined)

    - Some memory blocks fall into multiple zones (offlined+re-onlined)

    - Multiple memory blocks fall into different zones

    Drop the zone parameter (with a potential dubious value) from
    __remove_pages() and __remove_section().
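
    A hedged sketch of the resulting interface (prototypes approximated from
    the description above; treat them as illustrative): the zone shrinking is
    done explicitly via the new helper, and __remove_pages() no longer takes
    a zone:

    /* shrink zone spans when offlining or when onlining failed */
    void remove_pfn_range_from_zone(struct zone *zone,
                                    unsigned long start_pfn,
                                    unsigned long nr_pages);

    /* the (potentially bogus) zone argument is gone */
    void __remove_pages(unsigned long pfn, unsigned long nr_pages,
                        struct vmem_altmap *altmap);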

    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • [ Upstream commit 89b15332af7c0312a41e50846819ca6613b58b4c ]

    One of our services is observing hanging ps/top/etc under heavy write
    IO, and the task states show this is an mmap_sem priority inversion:

    A write fault is holding the mmap_sem in read-mode and waiting for
    (heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

    Somebody tries to change the address space, contending for the mmap_sem in
    write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The waiting writer locks out all subsequent readers to avoid lock
    starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    To fix this, do what we do for cache read faults already: drop the
    mmap_sem before calling into anything IO bound, in this case the
    balance_dirty_pages() function, and return VM_FAULT_RETRY.
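
    A hedged, kernel-style sketch of the pattern (helper name as used by the
    cache read fault path; the actual patch reworks fault_dirty_shared_page()):
    release mmap_sem before the IO-bound throttling and tell the caller to
    retry the fault:

    struct file *fpin;

    /* drop mmap_sem (if allowed) before the IO-bound throttling;
     * mapping is the vma's file mapping */
    fpin = maybe_unlock_mmap_for_io(vmf, NULL);
    balance_dirty_pages_ratelimited(mapping);
    if (fpin) {
            fput(fpin);
            return VM_FAULT_RETRY;
    }
    return 0;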

    Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Kirill A. Shutemov
    Cc: Josef Bacik
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     
  • [ Upstream commit 8897c1b1a1795cab23d5ac13e4e23bf0b5f4e0c6 ]

    syzbot found the following crash:

    BUG: KASAN: use-after-free in perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
    Read of size 8 at addr ffff8880a5cf2c50 by task syz-executor.0/26173

    CPU: 0 PID: 26173 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #146
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    perf_trace_lock_acquire+0x401/0x530 include/trace/events/lock.h:13
    trace_lock_acquire include/trace/events/lock.h:13 [inline]
    lock_acquire+0x2de/0x410 kernel/locking/lockdep.c:4411
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:338 [inline]
    shmem_fault+0x5ec/0x7b0 mm/shmem.c:2034
    __do_fault+0x111/0x540 mm/memory.c:3083
    do_shared_fault mm/memory.c:3535 [inline]
    do_fault mm/memory.c:3613 [inline]
    handle_pte_fault mm/memory.c:3840 [inline]
    __handle_mm_fault+0x2adf/0x3f20 mm/memory.c:3964
    handle_mm_fault+0x1b5/0x6b0 mm/memory.c:4001
    do_user_addr_fault arch/x86/mm/fault.c:1441 [inline]
    __do_page_fault+0x536/0xdd0 arch/x86/mm/fault.c:1506
    do_page_fault+0x38/0x590 arch/x86/mm/fault.c:1530
    page_fault+0x39/0x40 arch/x86/entry/entry_64.S:1202

    It happens if the VMA gets unmapped under us while we have dropped
    mmap_sem and the inode gets freed.

    Pinning the file if we drop mmap_sem fixes the issue.
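
    A hedged fragment of the idea (not the literal diff): take a reference on
    the file backing the VMA before dropping mmap_sem, so the inode cannot be
    freed while shmem_fault() sleeps:

    /* pin the file so the inode stays alive across the wait */
    struct file *fpin = get_file(vma->vm_file);

    up_read(&vma->vm_mm->mmap_sem);
    /* ... wait for the hole-punch to finish ... */
    fput(fpin);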

    Link: http://lkml.kernel.org/r/20190927083908.rhifa4mmaxefc24r@box
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+03ee87124ee05af991bd@syzkaller.appspotmail.com
    Acked-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Josef Bacik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     

31 Dec, 2019

1 commit

  • commit 42a9a53bb394a1de2247ef78f0b802ae86798122 upstream.

    Since commit 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on
    memcg kmem"), the shrinkers' idr is protected by CONFIG_MEMCG instead of
    CONFIG_MEMCG_KMEM, so it makes no sense to guard the shrinker idr replace
    path with CONFIG_MEMCG_KMEM.

    And in the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only
    one shrinker, deferred_split_shrinker, but it is never actually called,
    since idr_replace() is never compiled in due to the wrong #ifdef.
    The deferred_split_shrinker stays in a half-registered state the whole
    time and is never called for subordinate memory cgroups.
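
    A hedged fragment showing the nature of the fix (not the literal diff):
    the idr_replace() call has to be guarded by the same config option that
    guards shrinker_idr itself:

    #ifdef CONFIG_MEMCG                     /* was: CONFIG_MEMCG_KMEM */
            idr_replace(&shrinker_idr, shrinker, shrinker->id);
    #endif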

    Link: http://lkml.kernel.org/r/1575486978-45249-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on memcg kmem")
    Signed-off-by: Yang Shi
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Roman Gushchin
    Cc: [5.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     

18 Dec, 2019

3 commits

  • commit aa71ecd8d86500da6081a72da6b0b524007e0627 upstream.

    On a 64-bit system, sb->s_maxbytes of a shmem filesystem is
    MAX_LFS_FILESIZE, which equals LLONG_MAX.

    If offset > LLONG_MAX - PAGE_SIZE but offset + len is still below
    LLONG_MAX, the call to shmem_fallocate() passes the check in
    vfs_fallocate():

    /* Check for wrap through zero too */
    if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
            return -EFBIG;

    However, loff_t unmap_start = round_up(offset, PAGE_SIZE) in
    shmem_fallocate() then causes an overflow.

    Syzkaller reports an overflow problem in mm/shmem:

    UBSAN: Undefined behaviour in mm/shmem.c:2014:10
    signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int'
    CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1
    Hardware name: linux, dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100
    show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238
    __dump_stack lib/dump_stack.c:15 [inline]
    ubsan_epilogue+0x18/0x70 lib/ubsan.c:164
    handle_overflow+0x158/0x1b0 lib/ubsan.c:195
    shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104
    vfs_fallocate+0x238/0x428 fs/open.c:312
    SYSC_fallocate fs/open.c:335 [inline]
    SyS_fallocate+0x54/0xc8 fs/open.c:239

    Because unmap_start has overflowed into a negative value, its sign bit
    is replicated when shmem_falloc.start is calculated:

    shmem_falloc.start = unmap_start >> PAGE_SHIFT;

    Fix it by casting unmap_start to u64 before the right shift.
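
    A tiny stand-alone demo of why the cast matters (values chosen to mimic
    the wrapped result; this is not the kernel code):

    #include <stdio.h>

    int main(void)
    {
            /* pretend round_up() already wrapped into a negative value */
            long long unmap_start = -4096;

            /* buggy: arithmetic shift keeps replicating the sign bit */
            printf("signed shift:   %lld\n", unmap_start >> 12);

            /* fixed: shift as an unsigned 64-bit value (u64), so the
             * resulting start index is well defined */
            printf("unsigned shift: %llu\n",
                   (unsigned long long)unmap_start >> 12);
            return 0;
    }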

    This bug was found in LTS Linux 4.1 and also seems to exist in mainline.

    Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com
    Signed-off-by: Chen Jun
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Qian Cai
    Cc: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chen Jun
     
  • commit a264df74df38855096393447f1b8f386069a94b9 upstream.

    Christian reported a warning like the following, obtained while running
    some KVM-related tests on s390:

    WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
    Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
    CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
    Hardware name: IBM 2964 NC9 712 (LPAR)
    Workqueue: events sysfs_slab_remove_workfn
    Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
    0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
    0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
    0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
    Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
    0000001529746848: 47000700 bc 0,1792
    #000000152974684c: a7f40001 brc 15,152974684e
    >0000001529746850: a7f4fff2 brc 15,1529746834
    0000001529746854: 0707 bcr 0,%r7
    0000001529746856: 0707 bcr 0,%r7
    0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
    000000152974685e: a738ffff lhi %r3,-1
    Call Trace:
    ([] 0x3e009263d00)
    [] slab_kmem_cache_release+0x3a/0x70
    [] kobject_put+0xaa/0xe8
    [] process_one_work+0x1e8/0x428
    [] worker_thread+0x48/0x460
    [] kthread+0x126/0x160
    [] ret_from_fork+0x28/0x30
    [] kernel_thread_starter+0x0/0x10
    Last Breaking-Event-Address:
    [] percpu_ref_exit+0x4c/0x58
    ---[ end trace b035e7da5788eb09 ]---

    The problem occurs because kmem_cache_destroy() is called immediately
    after the deletion of a memcg, so it races with the memcg kmem_cache
    deactivation.

    flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
    supposed to guarantee that all deactivation processes are finished, but
    failed to do so. It waits for an rcu grace period, after which all
    children kmem_caches should be deactivated. During the deactivation
    percpu_ref_kill() is called for non root kmem_cache refcounters, but it
    requires yet another rcu grace period to finish the transition to the
    atomic (dead) state.

    So in a rare case when not all children kmem_caches are destroyed at the
    moment when the root kmem_cache is about to be gone, we need to wait
    another rcu grace period before destroying the root kmem_cache.

    This issue can be triggered only with dynamically created kmem_caches
    which are used with memcg accounting. In this case per-memcg child
    kmem_caches are created. They are deactivated from the cgroup removing
    path. If the destruction of the root kmem_cache is racing with the
    removal of the cgroup (both are quite complicated multi-stage
    processes), the described issue can occur. The only known way to
    trigger it in real life is to unload a kernel module which creates a
    dedicated kmem_cache used from different memory cgroups with the
    GFP_ACCOUNT flag. If the unloading happens immediately after calling
    rmdir on the corresponding cgroup, there is some chance of triggering
    the issue.

    Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com
    Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
    Signed-off-by: Roman Gushchin
    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit 05d351102dbe4e103d6bdac18b1122cd3cd04925 upstream.

    F_SEAL_FUTURE_WRITE has unexpected behavior when used with MAP_PRIVATE:
    a private mapping created after the memfd file is sealed with
    F_SEAL_FUTURE_WRITE loses its copy-on-write-at-fork behavior, meaning
    children and parent share the same memory, even though the mapping is
    private.

    The reason is the code below:

    static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
    {
            struct shmem_inode_info *info = SHMEM_I(file_inode(file));

            if (info->seals & F_SEAL_FUTURE_WRITE) {
                    /*
                     * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
                     * "future write" seal active.
                     */
                    if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
                            return -EPERM;

                    /*
                     * Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED
                     * read-only mapping, take care to not allow mprotect to revert
                     * protections.
                     */
                    vma->vm_flags &= ~(VM_MAYWRITE);
            }
            ...
    }

    And for the mm to know if a mapping is copy-on-write:

    static inline bool is_cow_mapping(vm_flags_t flags)
    {
            return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }

    The patch fixes the issue by making the mprotect revert protection
    happen only for shared mappings. For private mappings, using mprotect
    will have no effect on the seal behavior.
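
    A hedged, kernel-style sketch of the reworked check in shmem_mmap() (not
    the literal diff): the VM_MAYWRITE revert is applied only to shared
    mappings, so private mappings keep their copy-on-write semantics:

    if (info->seals & F_SEAL_FUTURE_WRITE) {
            /* deny new shared writable mappings outright */
            if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
                    return -EPERM;

            /* only shared read-only mappings need the mprotect() guard */
            if (vma->vm_flags & VM_SHARED)
                    vma->vm_flags &= ~(VM_MAYWRITE);
    }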

    The F_SEAL_FUTURE_WRITE feature was introduced in v5.1 so v5.3.x stable
    kernels would need a backport.

    [akpm@linux-foundation.org: reflow comment, per Christoph]
    Link: http://lkml.kernel.org/r/20191107195355.80608-1-joel@joelfernandes.org
    Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
    Signed-off-by: Nicolas Geoffray
    Signed-off-by: Joel Fernandes (Google)
    Cc: Hugh Dickins
    Cc: Shuah Khan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Geoffray
     

23 Nov, 2019

1 commit

  • It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
    remove_stable_node() when it races with __mmput() and squeezes in
    between ksm_exit() and exit_mmap().

    WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150

    Call Trace:
    remove_all_stable_nodes+0x12b/0x330
    run_store+0x4ef/0x7b0
    kernfs_fop_write+0x200/0x420
    vfs_write+0x154/0x450
    ksys_write+0xf9/0x1d0
    do_syscall_64+0x99/0x510
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Remove the warning as there is nothing scary going on.

    Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
    Fixes: cbf86cfe04a6 ("ksm: remove old stable nodes more thoroughly")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin