08 Oct, 2020

1 commit

  • * tag 'v5.4.70': (3051 commits)
    Linux 5.4.70
    netfilter: ctnetlink: add a range check for l3/l4 protonum
    ep_create_wakeup_source(): dentry name can change under you...
    ...

    Conflicts:
    arch/arm/mach-imx/pm-imx6.c
    arch/arm64/boot/dts/freescale/imx8mm-evk.dts
    arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts
    drivers/crypto/caam/caamalg.c
    drivers/gpu/drm/imx/dw_hdmi-imx.c
    drivers/gpu/drm/imx/imx-ldb.c
    drivers/gpu/drm/imx/ipuv3/ipuv3-crtc.c
    drivers/mmc/host/sdhci-esdhc-imx.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/net/ethernet/freescale/enetc/enetc.c
    drivers/net/ethernet/freescale/enetc/enetc_pf.c
    drivers/thermal/imx_thermal.c
    drivers/usb/cdns3/ep0.c
    drivers/xen/swiotlb-xen.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c

    Signed-off-by: Jason Liu

    Jason Liu
     

07 Oct, 2020

2 commits

  • commit f85086f95fa36194eb0db5cd5c12e56801b98523 upstream.

    In register_mem_sect_under_node() the system_state value is checked to
    detect whether the call is made during boot time or during a hot-plug
    operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
    because regular memory is registered at the SYSTEM_SCHEDULING state. In
    addition, a memory hot-plug operation can be triggered at this system
    state by ACPI [1]. So checking against the system state is not enough.

    The consequence is that on a system with interleaved node ranges like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    This can be seen on a PowerPC LPAR after multiple memory hot-plug and
    hot-unplug operations have been done. At the next reboot the node's
    memory ranges can be interleaved, and since the call to
    link_mem_sections() is made in topology_init() while the system is in
    the SYSTEM_SCHEDULING state, the node id is not checked, so the sections
    get registered to multiple nodes:

    $ ls -l /sys/devices/system/memory/memory21/node*
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2

    In that case, the system is able to boot, but if one of these memory
    blocks is later hot-unplugged and then hot-plugged, the sysfs
    inconsistency is detected and this triggers a BUG_ON():

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This patch addresses the root cause by not relying on the system_state
    value to detect whether the call is due to a hot-plug operation. An
    extra parameter is added to link_mem_sections() indicating whether the
    operation is a hot-plug one.
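
    For illustration only, a rough sketch of the shape of such an interface
    change (names follow the description above; the actual upstream diff may
    differ in detail):

    /* Sketch: pass the context explicitly instead of guessing from system_state. */
    enum meminit_context { MEMINIT_EARLY, MEMINIT_HOTPLUG };

    int link_mem_sections(int nid, unsigned long start_pfn,
                          unsigned long end_pfn, enum meminit_context context);

    /* boot-time caller (topology_init): */
    link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_EARLY);

    /* hot-plug caller (add_memory_resource): */
    link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_HOTPLUG);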

    [1] According to Oscar Salvador, using this qemu command line, ACPI
    memory hotplug operations are raised at SYSTEM_SCHEDULING state:

    $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
    -m size=$MEM,slots=255,maxmem=4294967296k \
    -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
    -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
    -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
    -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
    -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
    -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
    -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
    -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

    Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Fenghua Yu
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     
  • commit c1d0da83358a2316d9be7f229f26126dbaa07468 upstream.

    Patch series "mm: fix memory to node bad links in sysfs", v3.

    Sometimes, firmware may expose interleaved memory layout like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    In that case, we can see memory blocks assigned to multiple nodes in
    sysfs:

    $ ls -l /sys/devices/system/memory/memory21
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
    drwxr-xr-x 2 root root 0 Aug 24 05:27 power
    -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
    lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
    -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
    -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

    The same applies to the node directories, with a memory21 link present
    in both the node1 and node2 directories.

    This is wrong but doesn't prevent the system from running. However, when
    one of these memory blocks is later hot-unplugged and then hot-plugged,
    the system detects an inconsistency in the sysfs layout and a
    BUG_ON() is raised:

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This has been seen on PowerPC LPAR.

    The root cause of this issue is that when a node's memory is registered,
    the range used can overlap another node's range, and thus the memory
    block gets registered to multiple nodes in sysfs.

    There are two issues here:

    (a) The sysfs memory and node's layouts are broken due to these
    multiple links

    (b) The link errors in link_mem_sections() should not lead to a system
    panic.

    To address (a), register_mem_sect_under_node should not rely on the
    system state to detect whether the link operation is triggered by a
    hot-plug operation or not. This is addressed by patches 1 and 2 of this
    series.

    Issue (b) will be addressed separately.

    This patch (of 2):

    The memmap_context enum is used to detect whether a memory operation is
    due to a hot-add operation or is happening at boot time.

    Make it general to the hotplug operation and rename it to
    meminit_context.

    There is no functional change introduced by this patch.
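
    Roughly, the renamed enum looks like this (a sketch based on the
    description above, not a verbatim quote of the patch):

    /* was: enum memmap_context { MEMMAP_EARLY, MEMMAP_HOTPLUG }; */
    enum meminit_context {
            MEMINIT_EARLY,
            MEMINIT_HOTPLUG,
    };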

    Suggested-by: David Hildenbrand
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J . Wysocki"
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
    Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

01 Oct, 2020

15 commits

  • commit d3f7b1bb204099f2f7306318896223e8599bb6a2 upstream.

    Currently, to make sure that every page table entry is read just once,
    gup_fast walks perform READ_ONCE() and pass the pXd value down to the
    next gup_pXd_range function by value, e.g.:

    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    ...
            pudp = pud_offset(&p4d, addr);

    This function passes a reference to that local value copy to pXd_offset,
    and might get the very same pointer in return. This happens when the
    level is folded (on most arches), and that pointer should not be
    iterated.

    On s390, because each task might use 5-, 4- or 3-level address
    translation, and hence have different levels folded, the logic is more
    complex, and a non-iterable pointer to a local copy leads to severe
    problems.

    Here is an example of what happens with gup_fast on s390, for a task
    with 3-level paging, crossing a 2 GB pud boundary:

    // addr = 0x1007ffff000, end = 0x10080001000
    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    {
            unsigned long next;
            pud_t *pudp;

            // pud_offset returns &p4d itself (a pointer to a value on stack)
            pudp = pud_offset(&p4d, addr);
            do {
                    // on second iteration reading "random" stack value
                    pud_t pud = READ_ONCE(*pudp);

                    // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                    next = pud_addr_end(addr, end);
                    ...
            } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

            return 1;
    }

    This happens since s390 moved to common gup code with commit
    d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
    commit 1a42010cdc26 ("s390/mm: convert to the generic
    get_user_pages_fast code").

    s390 tried to mimic static level folding by changing the pXd_offset
    primitives to always calculate the top level page table offset in
    pgd_offset and just return the value passed in when pXd_offset has to
    act as folded.

    What is crucial for gup_fast, and what has been overlooked, is that
    PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
    And the latter is not possible with dynamic folding.

    To fix the issue, in addition to the pXd values, pass the original pXdp
    pointers down to the gup_pXd_range functions, and introduce
    pXd_offset_lockless helpers, which take an additional pXd entry value
    parameter. This has already been discussed in

    https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
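
    A rough sketch of what such a lockless helper and its use in gup_fast
    could look like (generic fallback shown; the real patch covers all
    levels plus the s390 implementation, so details differ):

    /* Levels that are not dynamically folded just ignore the extra argument. */
    #ifndef pud_offset_lockless
    #define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
    #endif

    static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
                             unsigned long end, unsigned int flags,
                             struct page **pages, int *nr)
    {
            pud_t *pudp;

            /* iterate over the real page table, not the on-stack copy */
            pudp = pud_offset_lockless(p4dp, p4d, addr);
            ...
    }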

    Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Andrew Morton
    Reviewed-by: Gerald Schaefer
    Reviewed-by: Alexander Gordeev
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Mike Rapoport
    Reviewed-by: John Hubbard
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Andrey Ryabinin
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Claudio Imbrenda
    Cc: [5.2+]
    Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     
  • commit 41663430588c737dd735bad5a0d1ba325dcabd59 upstream.

    SWP_FS is used to make swap_{read,write}page() go through the
    filesystem, and it's only used for swap files over NFS. So !SWP_FS
    means non-NFS for now; it could be either file-backed or device-backed.
    Something similar goes for the legacy SWP_FILE.

    So in order to achieve the goal of the original patch, SWP_BLKDEV should
    be used instead.
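
    Roughly, this implies a check of the following shape in the swap slot
    allocation path (a sketch, not the literal diff):

    /* was: IS_ENABLED(CONFIG_THP_SWAP) && !(si->flags & SWP_FS) && ... */
    if (IS_ENABLED(CONFIG_THP_SWAP) && (si->flags & SWP_BLKDEV) &&
        nr_pages == SWAPFILE_CLUSTER) {
            /* only device-backed swap may allocate a huge cluster */
            ...
    }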

    FS corruption can be observed with SSD device + XFS + fragmented
    swapfile due to CONFIG_THP_SWAP=y.

    I reproduced the issue with the following details:

    Environment:

    QEMU + upstream kernel + buildroot + NVMe (2 GB)

    Kernel config:

    CONFIG_BLK_DEV_NVME=y
    CONFIG_THP_SWAP=y

    Some reproducible steps:

    mkfs.xfs -f /dev/nvme0n1
    mkdir /tmp/mnt
    mount /dev/nvme0n1 /tmp/mnt
    bs="32k"
    sz="1024m" # doesn't matter too much, I also tried 16m
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

    mkswap /tmp/mnt/sw
    swapon /tmp/mnt/sw

    stress --vm 2 --vm-bytes 600M # doesn't matter too much as well

    Symptoms:
    - FS corruption (e.g. checksum failure)
    - memory corruption at: 0xd2808010
    - segfault

    Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
    Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Reviewed-by: Yang Shi
    Acked-by: Rafael Aquini
    Cc: Matthew Wilcox
    Cc: Carlos Maiolino
    Cc: Eric Sandeen
    Cc: Dave Chinner
    Cc:
    Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gao Xiang
     
  • [ Upstream commit ce2684254bd4818ca3995c0d021fb62c4cf10a19 ]

    syzbot reported the following KASAN splat:

    general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
    CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
    Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
    RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
    Call Trace:
    lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:354 [inline]
    madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
    walk_pmd_range mm/pagewalk.c:89 [inline]
    walk_pud_range mm/pagewalk.c:160 [inline]
    walk_p4d_range mm/pagewalk.c:193 [inline]
    walk_pgd_range mm/pagewalk.c:229 [inline]
    __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
    walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
    madvise_pageout_page_range mm/madvise.c:521 [inline]
    madvise_pageout mm/madvise.c:557 [inline]
    madvise_vma mm/madvise.c:946 [inline]
    do_madvise+0x12d0/0x2090 mm/madvise.c:1145
    __do_sys_madvise mm/madvise.c:1171 [inline]
    __se_sys_madvise mm/madvise.c:1169 [inline]
    __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The backing vma was shmem.

    When splitting a page of a file-backed THP, madvise zaps the pmd instead
    of remapping the sub-pages. So we need to check pmd validity after the
    split.
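
    A rough sketch of the added check (placement per the description above;
    the actual hunk is in madvise_cold_or_pageout_pte_range()):

    regular_page:
            /* The huge pmd may have been zapped by the split above. */
            if (pmd_trans_unstable(pmd))
                    return 0;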

    Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
    Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
    Signed-off-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Minchan Kim
     
  • [ Upstream commit abb242f57196dbaa108271575353a0453f6834ef ]

    The move_lock is a per-memcg lock, but the VM accounting code that needs
    to acquire it comes from the page and follows page->mem_cgroup under RCU
    protection. That means that the page becomes unlocked not when we drop
    the move_lock, but when we update page->mem_cgroup. And that assignment
    doesn't imply any memory ordering. If that pointer write gets reordered
    against the reads of the page state - page_mapped, PageDirty etc. - the
    state may change while we rely on it being stable, and we can end up
    corrupting the counters.

    Place an SMP memory barrier to make sure we're done with all page state by
    the time the new page->mem_cgroup becomes visible.

    Also replace the open-coded move_lock with a lock_page_memcg() to make it
    more obvious what we're serializing against.
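
    A rough sketch of the ordering this describes (simplified, not the
    literal diff):

    lock_page_memcg(page);          /* replaces the open-coded move_lock */

    /* ... read and transfer page state: page_mapped(), PageDirty(), ... */

    /*
     * Make sure all the page state reads above are done before the new
     * page->mem_cgroup value becomes visible to RCU readers.
     */
    smp_mb();
    page->mem_cgroup = to;

    unlock_page_memcg(page);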

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-3-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     
  • [ Upstream commit d6c1f098f2a7ba62627c9bc17cda28f534ef9e4a ]

    "prev_offset" is a static variable in swapin_nr_pages() that can be
    accessed concurrently with only mmap_sem held in read mode as noticed by
    KCSAN,

    BUG: KCSAN: data-race in swap_cluster_readahead / swap_cluster_readahead

    write to 0xffffffff92763830 of 8 bytes by task 14795 on cpu 17:
    swap_cluster_readahead+0x2a6/0x5e0
    swapin_readahead+0x92/0x8dc
    do_swap_page+0x49b/0xf20
    __handle_mm_fault+0xcfb/0xd70
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x715
    page_fault+0x34/0x40

    1 lock held by (dnf)/14795:
    #0: ffff897bd2e98858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715
    do_user_addr_fault at arch/x86/mm/fault.c:1405
    (inlined by) do_page_fault at arch/x86/mm/fault.c:1535
    irq event stamp: 83493
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x365/0x589
    irq_exit+0xa2/0xc0

    read to 0xffffffff92763830 of 8 bytes by task 1 on cpu 22:
    swap_cluster_readahead+0xfd/0x5e0
    swapin_readahead+0x92/0x8dc
    do_swap_page+0x49b/0xf20
    __handle_mm_fault+0xcfb/0xd70
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x715
    page_fault+0x34/0x40

    1 lock held by systemd/1:
    #0: ffff897c38f14858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715
    irq event stamp: 43530289
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x365/0x589
    irq_exit+0xa2/0xc0

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200402213748.2237-1-cai@lca.pw
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit cbfc35a48609ceac978791e3ab9dde0c01f8cb20 ]

    In a couple of places in the slub memory allocator, the code uses
    "s->offset" as a check to see if the free pointer is put right after the
    object. That check is no longer true with commit 3202fa62fb43 ("slub:
    relocate freelist pointer to middle of object").

    As a result, echoing "1" into the validate sysfs file, e.g. of dentry,
    may cause a bunch of "Freepointer corrupt" error reports like the
    following to appear, with the system panicking afterwards.

    =============================================================================
    BUG dentry(666:pmcd.service) (Tainted: G B): Freepointer corrupt
    -----------------------------------------------------------------------------

    To fix it, use the check "s->offset == s->inuse" in the new helper
    function freeptr_outside_object() instead. Also add another helper
    function get_info_end() to return the end of info block (inuse + free
    pointer if not overlapping with object).
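
    Roughly, the helpers described above look like this (sketch):

    static inline bool freeptr_outside_object(struct kmem_cache *s)
    {
            return s->offset >= s->inuse;
    }

    /*
     * Return the offset of the end of the info block: inuse plus the free
     * pointer when it does not overlap with the object.
     */
    static inline unsigned int get_info_end(struct kmem_cache *s)
    {
            if (freeptr_outside_object(s))
                    return s->inuse + sizeof(void *);
            return s->inuse;
    }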

    Fixes: 3202fa62fb43 ("slub: relocate freelist pointer to middle of object")
    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Kees Cook
    Acked-by: Rafael Aquini
    Cc: Christoph Lameter
    Cc: Vitaly Nikolenko
    Cc: Silvio Cesare
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Markus Elfring
    Cc: Changbin Du
    Link: http://lkml.kernel.org/r/20200429135328.26976-1-longman@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     
  • [ Upstream commit 09ef5283fd96ac424ef0e569626f359bf9ab86c9 ]

    When passing requirements to vm_unmapped_area, arch_get_unmapped_area
    and arch_get_unmapped_area_topdown did not set align_offset. Internally,
    in both unmapped_area and unmapped_area_topdown, if info->align_mask is
    0, then info->align_offset is meaningless.

    But commit df529cabb7a2 ("mm: mmap: add trace point of
    vm_unmapped_area") always prints info->align_offset even though it is
    uninitialized.

    Fix this uninitialized value issue by setting it to 0 explicitly.

    Before:
    vm_unmapped_area: addr=0x755b155000 err=0 total_vm=0x15aaf0 flags=0x1 len=0x109000 lo=0x8000 hi=0x75eed48000 mask=0x0 ofs=0x4022

    After:
    vm_unmapped_area: addr=0x74a4ca1000 err=0 total_vm=0x168ab1 flags=0x1 len=0x9000 lo=0x8000 hi=0x753d94b000 mask=0x0 ofs=0x0
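
    For illustration, a rough sketch of the explicit initialization in a
    generic arch_get_unmapped_area() (field names as in struct
    vm_unmapped_area_info; per-arch code differs):

    struct vm_unmapped_area_info info;

    info.flags = 0;
    info.length = len;
    info.low_limit = mm->mmap_base;
    info.high_limit = TASK_SIZE;
    info.align_mask = 0;
    info.align_offset = 0;          /* previously left uninitialized */
    return vm_unmapped_area(&info);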

    Signed-off-by: Jaewon Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox (Oracle)
    Cc: Michel Lespinasse
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/20200409094035.19457-1-jaewon31.kim@samsung.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jaewon Kim
     
  • [ Upstream commit 5644e1fbbfe15ad06785502bbfe5751223e5841d ]

    pgdat->kswapd_classzone_idx could be accessed concurrently in
    wakeup_kswapd(). Plain writes and reads without any lock protection
    result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as
    well as saving a branch (compilers might well optimize the original code
    in an unintentional way anyway). While at it, also take care of
    pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim(). The
    data races were reported by KCSAN,

    BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd

    write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13:
    wakeup_kswapd+0xf1/0x400
    wakeup_kswapd at mm/vmscan.c:3967
    wake_all_kswapds+0x59/0xc0
    wake_all_kswapds at mm/page_alloc.c:4241
    __alloc_pages_slowpath+0xdcc/0x1290
    __alloc_pages_slowpath at mm/page_alloc.c:4512
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    1 lock held by mtest01/7454:
    #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at:
    do_page_fault+0x143/0x6f9
    do_user_addr_fault at arch/x86/mm/fault.c:1405
    (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
    irq event stamp: 6944085
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x34c/0x57c
    irq_exit+0xa2/0xc0

    read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38:
    wakeup_kswapd+0xc8/0x400
    wake_all_kswapds+0x59/0xc0
    __alloc_pages_slowpath+0xdcc/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    1 lock held by mtest01/7472:
    #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at:
    do_page_fault+0x143/0x6f9
    irq event stamp: 6793561
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x34c/0x57c
    irq_exit+0xa2/0xc0

    BUG: KCSAN: data-race in kswapd / wakeup_kswapd

    write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6:
    kswapd+0x27c/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0:
    wakeup_kswapd+0xf3/0x450
    wake_all_kswapds+0x59/0xc0
    __alloc_pages_slowpath+0xdcc/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Marco Elver
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 218209487c3da2f6d861b236c11226b6eca7b7b7 ]

    si->inuse_pages could be accessed concurrently as noticed by KCSAN,

    write to 0xffff98b00ebd04dc of 4 bytes by task 82262 on cpu 92:
    swap_range_free+0xbe/0x230
    swap_range_free at mm/swapfile.c:719
    swapcache_free_entries+0x1be/0x250
    free_swap_slot+0x1c8/0x220
    __swap_entry_free.constprop.19+0xa3/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7e0/0x1ce0
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0xe7/0x240
    do_exit+0x598/0xfd0
    do_group_exit+0x8b/0x180
    get_signal+0x293/0x13d0
    do_signal+0x37/0x5d0
    prepare_exit_to_usermode+0x1b7/0x2c0
    ret_from_intr+0x32/0x42

    read to 0xffff98b00ebd04dc of 4 bytes by task 82499 on cpu 46:
    try_to_unuse+0x86b/0xc80
    try_to_unuse at mm/swapfile.c:2185
    __x64_sys_swapoff+0x372/0xd40
    do_syscall_64+0x91/0xb05
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The plain reads in try_to_unuse() are outside the si->lock critical
    section, which results in data races that could be dangerous when used
    in a loop. Fix them by adding READ_ONCE().
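
    A rough sketch of the annotated read in the try_to_unuse() loop
    condition (simplified):

    while (READ_ONCE(si->inuse_pages) &&
           !signal_pending(current) &&
           (i = find_next_to_unuse(si, i, frontswap)) != 0) {
            ...
    }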

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1582578903-29294-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit faffdfa04fa11ccf048cebdde73db41ede0679e0 ]

    A mount failure happens in the following scenario: an application forks
    dozens of threads to mount the same number of cramfs images separately
    in docker, and several mounts fail with high probability. The mount
    fails because the page (read from the superblock of the loop device) is
    not uptodate after wait_on_page_locked(page) returns in cramfs_read():

    wait_on_page_locked(page);
    if (!PageUptodate(page)) {
            ...
    }

    The reason the page is not uptodate: systemd-udevd reads the loopX
    device before the mount. Because the status of loopX is Lo_unbound at
    this time, loop_make_request directly triggers the io_end handler
    end_buffer_async_read, which calls SetPageError(page). As a result, the
    page cannot be set uptodate in end_buffer_async_read:

    if (page_uptodate && !PageError(page)) {
            SetPageUptodate(page);
    }

    Then the mount operation is performed using the same page that was just
    accessed by systemd-udevd above. Because this page is not uptodate, it
    launches an actual read via submit_bh and then waits on this page by
    calling wait_on_page_locked(page). When the I/O of the page is done, the
    io_end handler end_buffer_async_read is called. Because no one cleared
    the page error (during the whole read path of the mount) that was caused
    by the systemd-udevd read, this page is still in the "PageError" state,
    so it cannot be set uptodate in end_buffer_async_read, and the mount
    fails.

    But sometimes the mount succeeds even though systemd-udevd read the
    loopX dev just before. The reason is that systemd-udevd launched another
    loopX read just between steps 3.1 and 3.2; the steps are as below:

    1, loopX dev default status is Lo_unbound;
    2, systemd-udevd reads loopX dev (page is set to PageError);
    3, mount operation
       1) set loopX status to Lo_bound;
          ==> systemd-udevd reads loopX dev
       2) mount reads loopX dev via a_ops->readpage(filp, page);

    here, mapping->a_ops->readpage() is blkdev_readpage. In the latest
    kernel, some function names have changed; the call trace is as below:

    blkdev_read_iter
    generic_file_read_iter
    generic_file_buffered_read:
    /*
     * A previous I/O error may have been due to temporary
     * failures, eg. multipath errors.
     * PG_error will be set again if readpage fails.
     */
    ClearPageError(page);
    /* Start the actual read. The read will unlock the page. */
    error = mapping->a_ops->readpage(filp, page);

    We can see that ClearPageError(page) is called before the actual read,
    so the read in step 3.2 succeeds.

    This patch adds the call to ClearPageError just before the actual read
    in the read path of a cramfs mount. Without the patch, the call trace
    when performing a cramfs mount is as below:

    do_mount
    cramfs_read
    cramfs_blkdev_read
    read_cache_page
    do_read_cache_page:
    filler(data, page);
    or
    mapping->a_ops->readpage(data, page);

    With the patch, the call trace when performing a mount is as below:

    do_mount
    cramfs_read
    cramfs_blkdev_read
    read_cache_page:
    do_read_cache_page:
    ClearPageError(page); a_ops->readpage(data, page);

    With the patch, the mount operation triggers the call to
    ClearPageError(page) before the actual read, so the page has no error if
    no additional page error happens when the I/O is done.

    Signed-off-by: Xianting Tian
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Jan Kara
    Cc:
    Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Xianting Tian
     
  • [ Upstream commit b0d14fc43d39203ae025f20ef4d5d25d9ccf4be1 ]

    Clang warns:

    mm/kmemleak.c:1955:28: warning: array comparison always evaluates to a constant [-Wtautological-compare]
    if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
    ^
    mm/kmemleak.c:1955:60: warning: array comparison always evaluates to a constant [-Wtautological-compare]
    if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)

    These are not true arrays; they are linker-defined symbols, which are
    just addresses. Using the address-of operator silences the warning and
    does not change the resulting assembly with either clang/ld.lld or
    gcc/ld (tested with diff + objdump -Dr).
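
    The change amounts to comparing the symbols' addresses rather than the
    array names, roughly:

    /* was: if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) */
    if (&__start_ro_after_init < &_sdata || &__end_ro_after_init > &_edata)
            ...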

    Suggested-by: Nick Desaulniers
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Andrew Morton
    Acked-by: Catalin Marinas
    Link: https://github.com/ClangBuiltLinux/linux/issues/895
    Link: http://lkml.kernel.org/r/20200220051551.44000-1-natechancellor@gmail.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Nathan Chancellor
     
  • [ Upstream commit c3e5ea6ee574ae5e845a40ac8198de1fb63bb3ab ]

    Jeff Moyer has reported that one of the xfstests triggers a warning when
    run on a DAX-enabled filesystem:

    WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
    ...
    wp_page_copy+0x98c/0xd50 (unreliable)
    do_wp_page+0xd8/0xad0
    __handle_mm_fault+0x748/0x1b90
    handle_mm_fault+0x120/0x1f0
    __do_page_fault+0x240/0xd70
    do_page_fault+0x38/0xd0
    handle_page_fault+0x10/0x30

    The warning happens on failed __copy_from_user_inatomic() which tries to
    copy data into a CoW page.

    This happens because of a race between MADV_DONTNEED and the CoW page
    fault:

    CPU0                                    CPU1
    handle_mm_fault()
      do_wp_page()
        wp_page_copy()
          do_wp_page()
                                            madvise(MADV_DONTNEED)
                                              zap_page_range()
                                                zap_pte_range()
                                                  ptep_get_and_clear_full()

          __copy_from_user_inatomic()
          sees empty PTE and fails
          WARN_ON_ONCE(1)
          clear_page()

    The solution is to re-try __copy_from_user_inatomic() under the PTL
    after checking that the PTE matches orig_pte.

    The second copy attempt can still fail, e.g. due to a non-readable PTE,
    but there's nothing reasonable we can do about that, except clearing the
    CoW page.
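
    A rough sketch of the retry logic in cow_user_page() (heavily
    simplified; locking and error paths omitted):

    if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
            /* Re-validate under the PTL before retrying. */
            vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
            if (!pte_same(*vmf->pte, vmf->orig_pte)) {
                    ret = false;    /* PTE changed, e.g. MADV_DONTNEED raced */
                    goto pte_unlock;
            }
            /* Retry; if this still fails, fall back to clearing the page. */
            if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                    clear_page(kaddr);
    }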

    Reported-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Jeff Moyer
    Cc:
    Cc: Justin He
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     
  • [ Upstream commit c02a98753e0a36ba65a05818626fa6adeb4e7c97 ]

    If walk_pte_range() is called with an 'end' argument that is beyond the
    last page of memory (e.g. ~0UL), then the comparison between 'addr' and
    'end' will always fail and the loop will be infinite. Instead, change
    the comparison to >= while accounting for overflow.
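
    A rough sketch of the reworked loop (the test is done before advancing,
    so addr + PAGE_SIZE never has to wrap):

    for (;;) {
            err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
            if (err)
                    break;
            if (addr >= end - PAGE_SIZE)
                    break;          /* last page reached without overflowing addr */
            addr += PAGE_SIZE;
            pte++;
    }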

    Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Steven Price
     
  • [ Upstream commit 10c8d69f314d557d94d74ec492575ae6a4f1eb1c ]

    If a seq_file .next function does not change the position index, a read
    after some lseek can generate unexpected output.

    In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
    simplify seq_file iteration code and interface") "Some ->next functions
    do not increment *pos when they return NULL... Note that such ->next
    functions are buggy and should be fixed. A simple demonstration is

    dd if=/proc/swaps bs=1000 skip=1

    Choose any block size larger than the size of /proc/swaps. This will
    always show the whole last line of /proc/swaps"

    The described problem is still present. If you lseek into the middle of
    the last output line, the following read will output the end of the last
    line and then the whole last line once again.

    $ dd if=/proc/swaps bs=1 # usual output
    Filename Type Size Used Priority
    /dev/dm-0 partition 4194812 97536 -2
    104+0 records in
    104+0 records out
    104 bytes copied

    $ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice
    dd: /proc/swaps: cannot skip to specified offset
    v/dm-0 partition 4194812 97536 -2
    /dev/dm-0 partition 4194812 97536 -2
    3+1 records in
    3+1 records out
    131 bytes copied

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
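
    A minimal illustration of the rule (a hypothetical iterator, not the
    /proc/swaps code): the .next callback must advance *pos even when it
    has no further record to return.

    /* Hypothetical seq_file .next implementation. */
    static void *example_next(struct seq_file *m, void *v, loff_t *pos)
    {
            ++(*pos);                      /* always advance the position index */
            return example_lookup(*pos);   /* hypothetical lookup; NULL at end */
    }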

    Link: http://lkml.kernel.org/r/bd8cfd7b-ac95-9b91-f9e7-e8438bd5047d@virtuozzo.com
    Signed-off-by: Vasily Averin
    Reviewed-by: Andrew Morton
    Cc: Jann Horn
    Cc: Alexander Viro
    Cc: Kees Cook
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit 83d116c53058d505ddef051e90ab27f57015b025 ]

    When we tested the pmdk unit test [1] vmmalloc_fork TEST3 on an arm64
    guest, there was a double page fault in __copy_from_user_inatomic of
    cow_user_page.

    To reproduce the bug, the cmd is as follows after you have deployed
    everything:
    make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

    The call trace below is from arm64 do_page_fault, for debugging purposes:
    [ 110.016195] Call trace:
    [ 110.016826] do_page_fault+0x5a4/0x690
    [ 110.017812] do_mem_abort+0x50/0xb0
    [ 110.018726] el1_da+0x20/0xc4
    [ 110.019492] __arch_copy_from_user+0x180/0x280
    [ 110.020646] do_wp_page+0xb0/0x860
    [ 110.021517] __handle_mm_fault+0x994/0x1338
    [ 110.022606] handle_mm_fault+0xe8/0x180
    [ 110.023584] do_page_fault+0x240/0x690
    [ 110.024535] do_mem_abort+0x50/0xb0
    [ 110.025423] el0_da+0x20/0x24

    The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
    [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
    pmd=000000023d4b3003, pte=360000298607bd3

    As told by Catalin: "On arm64 without hardware Access Flag, copying from
    user will fail because the pte is old and cannot be marked young. So we
    always end up with a zeroed page after fork() + CoW for pfn mappings. We
    don't always have a hardware-managed access flag on arm64."

    This patch fixes it by calling pte_mkyoung. Also, the parameter is
    changed because vmf should be passed to cow_user_page().

    Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
    in case there is some obscure use-case (by Kirill).
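
    A rough sketch of the idea in cow_user_page() (simplified from the
    description above):

    if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
            pte_t entry = pte_mkyoung(vmf->orig_pte);

            /* Mark the pte young so the in-kernel copy cannot fault forever. */
            if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
                    update_mmu_cache(vma, addr, vmf->pte);
    }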

    [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

    Signed-off-by: Jia He
    Reported-by: Yibo Cai
    Reviewed-by: Catalin Marinas
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Sasha Levin

    Jia He
     

27 Sep, 2020

2 commits

  • commit e3336cab2579012b1e72b5265adf98e2d6e244ad upstream.

    We've met a softlockup with "CONFIG_PREEMPT_NONE=y" when the target
    memcg doesn't have any reclaimable memory.

    It can be easily reproduced as below:

    watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
    CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
    Call Trace:
    shrink_lruvec+0x49f/0x640
    shrink_node+0x2a6/0x6f0
    do_try_to_free_pages+0xe9/0x3e0
    try_to_free_mem_cgroup_pages+0xef/0x1f0
    try_charge+0x2c1/0x750
    mem_cgroup_charge+0xd7/0x240
    __add_to_page_cache_locked+0x2fd/0x370
    add_to_page_cache_lru+0x4a/0xc0
    pagecache_get_page+0x10b/0x2f0
    filemap_fault+0x661/0xad0
    ext4_filemap_fault+0x2c/0x40
    __do_fault+0x4d/0xf9
    handle_mm_fault+0x1080/0x1790

    It only happens on our 1-vcpu instances, because there's no chance for
    oom reaper to run to reclaim the to-be-killed process.

    Add a cond_resched() in the upper shrink_node_memcgs() to solve this
    issue; this means that we get a scheduling point for each memcg in the
    reclaimed hierarchy, without any dependency on the reclaimable memory in
    that memcg, thus making it more predictable.
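
    A rough sketch of where the scheduling point lands (the memcg iteration
    loop in shrink_node_memcgs(), simplified):

    memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
    do {
            /*
             * This loop can become CPU-bound when the target memcgs are not
             * eligible for reclaim; guarantee a scheduling point per memcg.
             */
            cond_resched();

            /* ... shrink_lruvec(), shrink_slab() ... */
    } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));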

    Suggested-by: Michal Hocko
    Signed-off-by: Xunlei Pang
    Signed-off-by: Andrew Morton
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alibaba.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Julius Hemanth Pitti
    Signed-off-by: Greg Kroah-Hartman

    Xunlei Pang
     
  • [ Upstream commit ec0abae6dcdf7ef88607c869bf35a4b63ce1b370 ]

    A migrating transparent huge page has to already be unmapped. Otherwise,
    the page could be modified while it is being copied to a new page and data
    could be lost. The function __split_huge_pmd() checks for a PMD migration
    entry before calling __split_huge_pmd_locked() leading one to think that
    __split_huge_pmd_locked() can handle splitting a migrating PMD.

    However, the code always increments the page->_mapcount and adjusts the
    memory control group accounting assuming the page is mapped.

    Also, if the PMD entry is a migration PMD entry, the call to
    is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
    of migration_entry_to_pfn(pmd_to_swp_entry(pmd)). Fix these problems by
    checking for a PMD migration entry.
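
    A rough sketch of the check inside __split_huge_pmd_locked()
    (simplified):

    pmd_migration = is_pmd_migration_entry(old_pmd);
    if (unlikely(pmd_migration)) {
            swp_entry_t entry = pmd_to_swp_entry(old_pmd);

            /*
             * A migrating PMD is already unmapped: take the page from the
             * swap entry and skip the rmap/memcg adjustments.
             */
            page = pfn_to_page(swp_offset(entry));
    } else {
            page = pmd_page(old_pmd);
            /* only here: is_huge_zero_pmd(), page_ref_add(), rmap updates */
    }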

    Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Reviewed-by: Zi Yan
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Alistair Popple
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Bharata B Rao
    Cc: Ben Skeggs
    Cc: Shuah Khan
    Cc: [4.14+]
    Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Ralph Campbell
     

23 Sep, 2020

2 commits

  • commit 9683182612214aa5f5e709fad49444b847cd866a upstream.

    There is a race during page offline that can lead to infinite loop:
    a page never ends up on a buddy list and __offline_pages() keeps
    retrying infinitely or until a termination signal is received.

    Thread#1 - a new process:

    load_elf_binary
    begin_new_exec
    exec_mmap
    mmput
    exit_mmap
    tlb_finish_mmu
    tlb_flush_mmu
    release_pages
    free_unref_page_list
    free_unref_page_prepare
    set_pcppage_migratetype(page, migratetype);
    // Set page->index migration type below MIGRATE_PCPTYPES

    Thread#2 - hot-removes memory
    __offline_pages
    start_isolate_page_range
    set_migratetype_isolate
    set_pageblock_migratetype(page, MIGRATE_ISOLATE);
    Set migration type to MIGRATE_ISOLATE-> set
    drain_all_pages(zone);
    // drain per-cpu page lists to buddy allocator.

    Thread#1 - continue
    free_unref_page_commit
    migratetype = get_pcppage_migratetype(page);
    // get old migration type
    list_add(&page->lru, &pcp->lists[migratetype]);
    // add new page to already drained pcp list

    Thread#2
    Never drains pcp again, and therefore gets stuck in the loop.

    The fix is to try to drain per-cpu lists again after
    check_pages_isolated_cb() fails.
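
    A rough sketch of the retry loop with the extra drain (simplified;
    check_pages_isolated() stands in for the walk over
    check_pages_isolated_cb()):

    do {
            ret = check_pages_isolated(start_pfn, end_pfn);
            if (ret) {
                    /*
                     * Pages freed with a stale migratetype may still sit on
                     * per-cpu lists; drain them again before re-checking.
                     */
                    drain_all_pages(zone);
            }
    } while (ret && !fatal_signal_pending(current));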

    Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Wei Yang
    Cc:
    Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
    Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
    Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit b3b33d3c43bbe0177d70653f4e889c78cc37f097 upstream.

    The variable populated, which is a member of struct pcpu_chunk, is sized
    in units of unsigned long.
    However, its size is miscounted. So, fix this minor part.

    Fixes: 8ab16c43ea79 ("percpu: change the number of pages marked in the first_chunk pop bitmap")
    Cc: # 4.14+
    Signed-off-by: Sunghyun Jin
    Signed-off-by: Dennis Zhou
    Signed-off-by: Greg Kroah-Hartman

    Sunghyun Jin
     

22 Sep, 2020

1 commit

    If a driver invokes these functions and is built as a module, the
    functions need to be exported. Without this patch, the errors below are
    seen when building our driver module:

    ERROR: "cma_for_each_area" [drivers/misc/mic/imx-host/imx_mic_host.ko] undefined!
    ERROR: "cma_get_name" [drivers/misc/mic/imx-host/imx_mic_host.ko] undefined!
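
    The change this asks for is, roughly, exporting the two symbols from
    mm/cma.c (whether EXPORT_SYMBOL or EXPORT_SYMBOL_GPL is used is a choice
    of the patch, not confirmed here):

    EXPORT_SYMBOL_GPL(cma_get_name);
    EXPORT_SYMBOL_GPL(cma_for_each_area);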

    Signed-off-by: Joakim Zhang
    Signed-off-by: Sherry Sun
    Reviewed-by: Frank Li
    Reviewed-by: Fugang Duan

    Sherry Sun
     

10 Sep, 2020

4 commits

  • commit e5a59d308f52bb0052af5790c22173651b187465 upstream.

    collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to
    be read to page_cache_sync_readahead(). The intent was probably to read
    a single page. Fix it to use the number of pages to the end of the
    window instead.
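
    A rough sketch of the corrected call (the last argument is a number of
    pages, not a byte count):

    /* was: page_cache_sync_readahead(mapping, &file->f_ra, file, index, PAGE_SIZE); */
    page_cache_sync_readahead(mapping, &file->f_ra, file, index,
                              end - index);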

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Song Liu
    Acked-by: Yang Shi
    Acked-by: Pankaj Gupta
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit 17743798d81238ab13050e8e2833699b54e15467 upstream.

    There is a race between the assignment of `table->data` and the write of
    the value through that pointer in __do_proc_doulongvec_minmax() on
    another thread.

    CPU0:                                        CPU1:
                                                 proc_sys_write
    hugetlb_sysctl_handler                         proc_sys_call_handler
      hugetlb_sysctl_handler_common                  hugetlb_sysctl_handler
        table->data = &tmp;                            hugetlb_sysctl_handler_common
                                                         table->data = &tmp;
        proc_doulongvec_minmax
          do_proc_doulongvec_minmax                    sysctl_head_finish
            __do_proc_doulongvec_minmax                  unuse_table
              i = table->data;
              *i = val; // corrupt CPU1's stack

    Fix this by duplicating the `table` and only updating the duplicate.
    Also introduce a helper, proc_hugetlb_doulongvec_minmax(), to simplify
    the code.
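
    A rough sketch of the duplication described above (simplified):

    struct ctl_table dup_table;

    /*
     * Work on a copy so that a concurrent __do_proc_doulongvec_minmax()
     * can never observe table->data pointing at this stack variable.
     */
    dup_table = *table;
    dup_table.data = &tmp;

    ret = proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);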

    The following oops was seen:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    Code: Bad RIP value.
    ...
    Call Trace:
    ? set_max_huge_pages+0x3da/0x4f0
    ? alloc_pool_huge_page+0x150/0x150
    ? proc_doulongvec_minmax+0x46/0x60
    ? hugetlb_sysctl_handler_common+0x1c7/0x200
    ? nr_hugepages_store+0x20/0x20
    ? copy_fd_bitmaps+0x170/0x170
    ? hugetlb_sysctl_handler+0x1e/0x20
    ? proc_sys_call_handler+0x2f1/0x300
    ? unregister_sysctl_table+0xb0/0xb0
    ? __fd_install+0x78/0x100
    ? proc_sys_write+0x14/0x20
    ? __vfs_write+0x4d/0x90
    ? vfs_write+0xef/0x240
    ? ksys_write+0xc0/0x160
    ? __ia32_sys_read+0x50/0x50
    ? __close_fd+0x129/0x150
    ? __x64_sys_write+0x43/0x50
    ? do_syscall_64+0x6c/0x200
    ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     
  • commit 7867fd7cc44e63c6673cd0f8fea155456d34d0de upstream.

    The syzbot reported the below use-after-free:

    BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline]
    BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline]
    BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
    Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996

    CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x18f/0x20d lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
    madvise_willneed mm/madvise.c:293 [inline]
    madvise_vma mm/madvise.c:942 [inline]
    do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
    do_madvise mm/madvise.c:1169 [inline]
    __do_sys_madvise mm/madvise.c:1171 [inline]
    __se_sys_madvise mm/madvise.c:1169 [inline]
    __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Allocated by task 9992:
    kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482
    vm_area_alloc+0x1c/0x110 kernel/fork.c:347
    mmap_region+0x8e5/0x1780 mm/mmap.c:1743
    do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
    vm_mmap_pgoff+0x195/0x200 mm/util.c:506
    ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 9992:
    kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693
    remove_vma+0x132/0x170 mm/mmap.c:184
    remove_vma_list mm/mmap.c:2613 [inline]
    __do_munmap+0x743/0x1170 mm/mmap.c:2869
    do_munmap mm/mmap.c:2877 [inline]
    mmap_region+0x257/0x1780 mm/mmap.c:1716
    do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
    vm_mmap_pgoff+0x195/0x200 mm/util.c:506
    ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This happens because the vma is accessed after releasing mmap_lock, but
    someone else has acquired the mmap_lock and the vma is gone.

    Releasing mmap_lock only after accessing the vma should fix the problem.
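
    A rough sketch of the reordering in madvise_willneed() (simplified,
    using the 5.4 mmap_sem naming):

    *prev = NULL;   /* tell the caller we dropped the mmap lock */
    get_file(file);

    /* Read everything needed from the vma before dropping the lock. */
    offset = (loff_t)(start - vma->vm_start)
                    + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

    up_read(&current->mm->mmap_sem);   /* only now may the vma go away */
    vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
    fput(file);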

    Fixes: 692fe62433d4c ("mm: Handle MADV_WILLNEED through vfs_fadvise()")
    Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: [5.4+]
    Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     
  • commit dc07a728d49cf025f5da2c31add438d839d076c0 upstream.

    Commit 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in
    deactivate_slab()") suffered an update when picked up from LKML [1].

    Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()'
    created a no-op statement. Fix it by sticking to the behavior intended
    in the original patch [1]. In addition, make freelist_corrupted()
    immune to passing NULL instead of &freelist.
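
    A rough sketch of the intended behavior (simplified; not the verbatim
    upstream function):

    static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
                                   void **freelist, void *nextfree)
    {
            if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
                !check_valid_pointer(s, page, nextfree) && freelist) {
                    object_err(s, page, *freelist, "Freechain corrupt");
                    /* write through the pointer so the caller's freelist is cleared */
                    *freelist = NULL;
                    slab_fix(s, "Isolate corrupted freechain");
                    return true;
            }
            return false;
    }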

    The issue has been spotted via static analysis and code review.

    [1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/

    Fixes: 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in deactivate_slab()")
    Signed-off-by: Eugeniu Rosca
    Signed-off-by: Andrew Morton
    Cc: Dongli Zhang
    Cc: Joe Jin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eugeniu Rosca
     

03 Sep, 2020

5 commits

  • [ Upstream commit e47110e90584a22e9980510b00d0dfad3a83354e ]

    Like zap_pte_range, add cond_resched so that we can avoid softlockups as
    reported below. On a non-preemptible kernel with a large I/O map region
    (like the one we get when using persistent memory in sector mode), an
    unmap of the namespace can report the softlockups below.

    [22724.027334] watchdog: BUG: soft lockup - CPU#49 stuck for 23s! [ndctl:50777]
    NIP [c0000000000dc224] plpar_hcall+0x38/0x58
    LR [c0000000000d8898] pSeries_lpar_hpte_invalidate+0x68/0xb0
    Call Trace:
    flush_hash_page+0x114/0x200
    hpte_need_flush+0x2dc/0x540
    vunmap_page_range+0x538/0x6f0
    free_unmap_vmap_area+0x30/0x70
    remove_vm_area+0xfc/0x140
    __vunmap+0x68/0x270
    __iounmap.part.0+0x34/0x60
    memunmap+0x54/0x70
    release_nodes+0x28c/0x300
    device_release_driver_internal+0x16c/0x280
    unbind_store+0x124/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xd8/0x260
    ksys_write+0xdc/0x130
    system_call+0x5c/0x70

    Reported-by: Harish Sriram
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc:
    Link: http://lkml.kernel.org/r/20200807075933.310240-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Sasha Levin
     
  • [ Upstream commit 3a5139f1c5bb76d69756fb8f13fffa173e261153 ]

    The routine cma_init_reserved_areas is designed to activate all
    reserved cma areas. It quits when it first encounters an error.
    This can leave some areas in a state where they are reserved but
    not activated. There is no feedback to code which performed the
    reservation. Attempting to allocate memory from areas in such a
    state will result in a BUG.

    Modify cma_init_reserved_areas to always attempt to activate all
    areas. The called routine, cma_activate_area, is responsible for
    leaving the area in a valid state. No one is making active use
    of the returned error codes, so change the routine to void.

    How to reproduce: This example uses kernelcore, hugetlb and cma
    as an easy way to reproduce. However, this is a more general cma
    issue.

    Two node x86 VM 16GB total, 8GB per node
    Kernel command line parameters, kernelcore=4G hugetlb_cma=8G
    Related boot time messages,
    hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
    cma: Reserved 4096 MiB at 0x0000000100000000
    hugetlb_cma: reserved 4096 MiB on node 0
    cma: Reserved 4096 MiB at 0x0000000300000000
    hugetlb_cma: reserved 4096 MiB on node 1
    cma: CMA area hugetlb could not be activated

    # echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    ...
    Call Trace:
    bitmap_find_next_zero_area_off+0x51/0x90
    cma_alloc+0x1a5/0x310
    alloc_fresh_huge_page+0x78/0x1a0
    alloc_pool_huge_page+0x6f/0xf0
    set_max_huge_pages+0x10c/0x250
    nr_hugepages_store_common+0x92/0x120
    ? __kmalloc+0x171/0x270
    kernfs_fop_write+0xc1/0x1a0
    vfs_write+0xc7/0x1f0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x4d/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Acked-by: Barry Song
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Kyungmin Park
    Cc: Joonsoo Kim
    Cc:
    Link: http://lkml.kernel.org/r/20200730163123.6451-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Mike Kravetz
     
  • [ Upstream commit 2184f9928ab52f26c2ae5e9ba37faf29c78f50b8 ]

    kzalloc() is used for cma bitmap allocation in cma_activate_area(),
    switch to bitmap_zalloc() for clarity.
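
    The change amounts to, roughly (sketch):

    /* was: kzalloc(BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long), GFP_KERNEL) */
    cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL);
    ...
    bitmap_free(cma->bitmap);       /* paired with bitmap_zalloc() */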

    Link: http://lkml.kernel.org/r/895d4627-f115-c77a-d454-c0a196116426@huawei.com
    Signed-off-by: Yunfeng Ye
    Reviewed-by: Andrew Morton
    Cc: Mike Rapoport
    Cc: Yue Hu
    Cc: Peng Fan
    Cc: Andrey Ryabinin
    Cc: Ryohei Suzuki
    Cc: Andrey Konovalov
    Cc: Doug Berger
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Yunfeng Ye
     
  • [ Upstream commit 38cf307c1f2011d413750c5acb725456f47d9172 ]

    For SMP systems using IPI based TLB invalidation, looking at
    current->active_mm is entirely reasonable. This then presents the
    following race condition:

    CPU0                                  CPU1

    flush_tlb_mm(mm)                      use_mm(mm)
                                            tsk->active_mm = mm;

                                            // IPI for mm lands on CPU1 here
                                            if (tsk->active_mm == mm)
                                                // flush TLBs

                                            switch_mm(old_mm, mm, tsk);

    Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
    because the IPI lands before we actually switched.

    Avoid this by disabling IRQs across changing ->active_mm and
    switch_mm().
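
    A minimal sketch of the fixed sequence described above, modeled on a
    use_mm()-style helper; whether the backport uses switch_mm_irqs_off()
    or plain switch_mm(), and the surrounding bookkeeping, are assumptions:

    task_lock(tsk);
    /* Hold off TLB flush IPIs while changing ->mm and ->active_mm */
    local_irq_disable();
    active_mm = tsk->active_mm;
    if (active_mm != mm) {
            mmgrab(mm);
            tsk->active_mm = mm;
    }
    tsk->mm = mm;
    switch_mm_irqs_off(active_mm, mm, tsk);
    local_irq_enable();
    task_unlock(tsk);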

    Of the (SMP) architectures that have IPI based TLB invalidate:

    Alpha - checks active_mm
    ARC - ASID specific
    IA64 - checks active_mm
    MIPS - ASID specific flush
    OpenRISC - shoots down world
    PARISC - shoots down world
    SH - ASID specific
    SPARC - ASID specific
    x86 - N/A
    xtensa - checks active_mm

    So at the very least Alpha, IA64 and Xtensa are suspect.

    On top of this, for scheduler consistency we need at least preemption
    disabled across changing tsk->mm and doing switch_mm(), which is
    currently provided by task_lock(), but that's not sufficient for
    PREEMPT_RT.

    [akpm@linux-foundation.org: add comment]

    Reported-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Cc: Nicholas Piggin
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Jann Horn
    Cc: Will Deacon
    Cc: Christoph Hellwig
    Cc: Mathieu Desnoyers
    Cc:
    Link: http://lkml.kernel.org/r/20200721154106.GE10769@hirez.programming.kicks-ass.net
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit 4a93025cbe4a0b19d1a25a2d763a3d2018bad0d9 ]

    Especially with memory hotplug, we can have offline sections (with a
    garbage memmap) and overlapping zones. We have to make sure to only touch
    initialized memmaps (online sections managed by the buddy) and that the
    zone matches, to not move pages between zones.

    To test if this can actually happen, I added a simple

    BUG_ON(page_zone(page_i) != page_zone(page_j));

    right before the swap. When hotplugging a 256M DIMM to a 4G x86-64 VM and
    onlining the first memory block "online_movable" and the second memory
    block "online_kernel", it will trigger the BUG, as both zones (NORMAL and
    MOVABLE) overlap.

    This might result in all kinds of weird situations (e.g., double
    allocations, list corruptions, unmovable allocations ending up in the
    movable zone).
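
    A hedged sketch of the kind of validation this implies before touching a
    randomly chosen pfn in mm/shuffle.c; the exact helper names and function
    signature are assumptions:

    static struct page *shuffle_valid_page(struct zone *zone,
                                           unsigned long pfn, int order)
    {
            struct page *page = pfn_to_online_page(pfn);

            /* Only touch initialized memmaps: offline sections are garbage */
            if (!page)
                    return NULL;

            /* Never move pages between zones (zones can overlap after hotplug) */
            if (page_zone(page) != zone)
                    return NULL;

            /* Only a free page of the expected order sits on the target list */
            if (!PageBuddy(page) || page_order(page) != order)
                    return NULL;

            return page;
    }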

    Fixes: e900a918b098 ("mm: shuffle initial free memory to improve memory-side-cache utilization")
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Huang Ying
    Cc: Wei Yang
    Cc: Mel Gorman
    Cc: [5.2+]
    Link: http://lkml.kernel.org/r/20200624094741.9918-2-david@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    David Hildenbrand
     

26 Aug, 2020

5 commits

  • commit 75802ca66354a39ab8e35822747cd08b3384a99a upstream.

    This is found by code observation only.

    Firstly, the worst case scenario should assume the whole range was covered
    by pmd sharing. The old algorithm might not work as expected for ranges
    like (1g-2m, 1g+2m): it only adjusts the range to (0, 1g+2m), while the
    expected range is (0, 2g).

    While at it, remove the loop, since it should not be required. With that,
    the new code should also be faster when the invalidated range is huge.
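
    A minimal sketch of the loop-free, worst-case rounding described above,
    rounding the range out to PUD_SIZE boundaries; the exact guard condition
    is an assumption:

    void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
                                              unsigned long *start,
                                              unsigned long *end)
    {
            unsigned long v_start = ALIGN(vma->vm_start, PUD_SIZE);
            unsigned long v_end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);

            /*
             * Sharing is only possible if the vma spans at least one full
             * PUD-aligned area and the range overlaps that area.
             */
            if (!(vma->vm_start <= v_start && v_end <= vma->vm_end) ||
                *end <= v_start || *start >= v_end)
                    return;

            /* Assume the whole range may be covered by pmd sharing */
            if (*start > v_start)
                    *start = ALIGN_DOWN(*start, PUD_SIZE);
            if (*end < v_end)
                    *end = ALIGN(*end, PUD_SIZE);
    }

    For the (1g-2m, 1g+2m) range inside a (0, 2g) vma this yields v_start = 0
    and v_end = 2g, so the range rounds out to (0, 2g) as expected.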

    Mike said:

    : With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
    : adjust to (0, 1g+2m) which is incorrect.
    :
    : We should cc stable. The original reason for adjusting the range was to
    : prevent data corruption (getting wrong page). Since the range is not
    : always adjusted correctly, the potential for corruption still exists.
    :
    : However, I am fairly confident that adjust_range_if_pmd_sharing_possible
    : is only going to be called in two cases:
    :
    : 1) for a single page
    : 2) for range == entire vma
    :
    : In those cases, the current code should produce the correct results.
    :
    : To be safe, let's just cc stable.

    Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages")
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Cc:
    Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mike Kravetz
    Signed-off-by: Greg Kroah-Hartman

    Peter Xu
     
  • commit 88e8ac11d2ea3acc003cf01bb5a38c8aa76c3cfd upstream.

    The following race is observed with the repeated online, offline and a
    delay between two successive online of memory blocks of movable zone.

    P1: Online the first memory block in the movable zone. The pcp
        struct values are initialized to the defaults, i.e.,
        pcp->high = 0 and pcp->batch = 1.

    P2: Allocate pages from the movable zone.

    P1: Try to online the second memory block in the movable zone; it
        has entered online_pages() but has not yet called
        zone_pcp_update().

    P2: This process enters the exit path and tries to release its
        order-0 pages to the pcp lists through free_unref_page_commit().
        As pcp->high = 0 and pcp->count = 1, it proceeds to call
        free_pcppages_bulk().

    P1: Update the pcp values; the new values are, say, pcp->high = 378
        and pcp->batch = 63.

    P2: Read the pcp's batch value using READ_ONCE() and pass it to
        free_pcppages_bulk(); the values passed there are batch = 63,
        count = 1.

        Since the number of pages in the pcp lists is less than ->batch,
        it gets stuck in the while (list_empty(list)) loop with
        interrupts disabled, hanging the core.

    Avoid this by ensuring that free_pcppages_bulk() is called with a proper
    count of pcp list pages.
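
    A minimal sketch of the clamp this implies at the top of
    free_pcppages_bulk() in mm/page_alloc.c (the exact placement and the
    comment wording are assumptions):

    /*
     * Never ask for more pages than are actually sitting on the pcp lists,
     * otherwise the while (list_empty(list)) batching loop below can spin
     * forever with interrupts disabled.
     */
    count = min(pcp->count, count);

    In the race above this clamps the request from batch = 63 down to the
    single page that is really on the lists, so the loop terminates.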

    The mentioned race is somewhat easy to reproduce without [1] because the
    pcp values are not updated when the first memory block is onlined, and
    thus there is a wide enough race window for P2 between the alloc+free and
    the pcp struct values update through the onlining of the second memory
    block.

    With [1], the race still exists, but it is very narrow, as we update the
    pcp struct values already when the first memory block is onlined.

    This is not limited to the movable zone; it could also happen in cases
    with the normal zone (e.g., hotplug to a node that only has DMA memory,
    or no other memory yet).

    [1]: https://patchwork.kernel.org/patch/11696389/

    Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Vinayak Menon
    Cc: [2.6+]
    Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Charan Teja Reddy
     
  • commit e08d3fdfe2dafa0331843f70ce1ff6c1c4900bf4 upstream.

    The lowmem_reserve arrays provide a means of applying pressure against
    allocations from lower zones that were targeted at higher zones. Its
    values are a function of the number of pages managed by higher zones and
    are assigned by a call to the setup_per_zone_lowmem_reserve() function.

    The function is initially called at boot time by the function
    init_per_zone_wmark_min() and may be called later by accesses of the
    /proc/sys/vm/lowmem_reserve_ratio sysctl file.

    The function init_per_zone_wmark_min() was moved up from a module_init to
    a core_initcall to resolve a sequencing issue with khugepaged.
    Unfortunately this created a sequencing issue with CMA page accounting.

    The CMA pages are added to the managed page count of a zone when
    cma_init_reserved_areas() is called at boot also as a core_initcall. This
    makes it uncertain whether the CMA pages will be added to the managed page
    counts of their zones before or after the call to
    init_per_zone_wmark_min() as it becomes dependent on link order. With the
    current link order the pages are added to the managed count after the
    lowmem_reserve arrays are initialized at boot.

    This means the lowmem_reserve values at boot may be lower than the values
    used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
    ratio values are unchanged.

    In many cases the difference is not significant, but for example
    an ARM platform with 1GB of memory and the following memory layout

    cma: Reserved 256 MiB at 0x0000000030000000
    Zone ranges:
    DMA [mem 0x0000000000000000-0x000000002fffffff]
    Normal empty
    HighMem [mem 0x0000000030000000-0x000000003fffffff]

    would result in 0 lowmem_reserve for the DMA zone. This would allow
    userspace to deplete the DMA zone easily.

    Funnily enough,

    $ cat /proc/sys/vm/lowmem_reserve_ratio

    would fix up the situation, because as a side effect it forces a call to
    setup_per_zone_lowmem_reserve().

    This commit breaks the link order dependency by invoking
    init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
    have the chance to be properly accounted in their zone(s) and allowing
    the lowmem_reserve arrays to receive consistent values.
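
    A minimal sketch of the ordering change this describes, assuming the
    initcall registration lives at the bottom of mm/page_alloc.c:

    /*
     * postcore_initcall() runs after all core_initcall()s, so
     * cma_init_reserved_areas() (a core_initcall) has already added the CMA
     * pages to their zones' managed counts by the time the watermarks and
     * lowmem_reserve arrays are computed, regardless of link order.
     */
    postcore_initcall(init_per_zone_wmark_min);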

    Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
    Signed-off-by: Doug Berger
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Jason Baron
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc:
    Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Doug Berger
     
  • [ Upstream commit f3f99d63a8156c7a4a6b20aac22b53c5579c7dc1 ]

    syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
    __khugepaged_enter(): yes, when one thread is about to dump core, has set
    core_state, and is waiting for others, another might do something calling
    __khugepaged_enter(), which now crashes because I lumped the core_state
    test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
    think it's best to lump them together, so just in this exceptional case,
    check mm->mm_users directly instead of khugepaged_test_exit().
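
    A hedged sketch of the resulting check in __khugepaged_enter()
    (mm/khugepaged.c); the comment wording is an assumption:

    /* __khugepaged_exit() must not run from under us */
    VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);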

    Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
    Reported-by: syzbot
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Song Liu
    Cc: Mike Kravetz
    Cc: Eric Dumazet
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • [ Upstream commit bbe98f9cadff58cdd6a4acaeba0efa8565dabe65 ]

    Move collapse_huge_page()'s mmget_still_valid() check into
    khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP
    only, and earned its mmget_still_valid() check because it inserts a huge
    pmd entry in place of the page table's pmd entry; whereas
    collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
    merely clears the page table's pmd entry. But core dumping without mmap
    lock must have been as open to mistaking a racily cleared pmd entry for a
    page table at physical page 0, as exit_mmap() was. And we certainly have
    no interest in mapping as a THP once dumping core.
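
    A hedged sketch of the combined helper in mm/khugepaged.c after this
    change, assuming mmget_still_valid() remains the core_state test:

    static bool khugepaged_test_exit(struct mm_struct *mm)
    {
            /* mm is exiting, or a core dump is serializing against us */
            return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
    }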

    Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Song Liu
    Cc: Mike Kravetz
    Cc: Kirill A. Shutemov
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

21 Aug, 2020

3 commits

  • commit 18e77600f7a1ed69f8ce46c9e11cad0985712dfa upstream.

    Only once have I seen this scenario (and forgot even to notice what forced
    the eventual crash): a sequence of "BUG: Bad page map" alerts from
    vm_normal_page(), from zap_pte_range() servicing exit_mmap();
    pmd:00000000, pte values corresponding to data in physical page 0.

    The pte mappings being zapped in this case were supposed to be from a huge
    page of ext4 text (but could as well have been shmem): my belief is that
    it was racing with collapse_file()'s retract_page_tables(), found *pmd
    pointing to a page table, locked it, but *pmd had become 0 by the time
    start_pte was decided.

    In most cases, that possibility is excluded by holding mmap lock; but
    exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
    checks khugepaged_test_exit() after acquiring mmap lock:
    khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
    for example. But retract_page_tables() did not: fix that.

    The fix is for retract_page_tables() to check khugepaged_test_exit(),
    after acquiring mmap lock, before doing anything to the page table.
    Getting the mmap lock serializes with __mmput(), which briefly takes and
    drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
    mm_users makes sure we don't touch the page table once exit_mmap() might
    reach it, since exit_mmap() will be proceeding without mmap lock, not
    expecting anyone to be racing with it.
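
    A hedged sketch of the resulting ordering in retract_page_tables()
    (mm/khugepaged.c); the lock name follows the 5.4 mmap_sem convention and
    the page-table teardown details are illustrative assumptions:

    if (down_write_trylock(&mm->mmap_sem)) {
            /*
             * Re-check under the lock: if exit_mmap() may already be
             * running (mm_users gone), leave the page table alone.
             */
            if (!khugepaged_test_exit(mm)) {
                    spinlock_t *ptl = pmd_lock(mm, pmd);
                    pmd_t _pmd = pmdp_collapse_flush(vma, addr, pmd);

                    spin_unlock(ptl);
                    mm_dec_nr_ptes(mm);
                    pte_free(mm, pmd_pgtable(_pmd));
            }
            up_write(&mm->mmap_sem);
    }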

    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Song Liu
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit b4223a510e2ab1bf0f971d50af7c1431014b25ad upstream.

    When check_memblock_offlined_cb() returns a failing rc (e.g. because the
    memblock is still online at that time), mem_hotplug_begin/done ends up
    unpaired.

    Therefore a warning is triggered:
    Call Trace:
    percpu_up_write+0x33/0x40
    try_remove_memory+0x66/0x120
    ? _cond_resched+0x19/0x30
    remove_memory+0x2b/0x40
    dev_dax_kmem_remove+0x36/0x72 [kmem]
    device_release_driver_internal+0xf0/0x1c0
    device_release_driver+0x12/0x20
    bus_remove_device+0xe1/0x150
    device_del+0x17b/0x3e0
    unregister_dev_dax+0x29/0x60
    devm_action_release+0x15/0x20
    release_nodes+0x19a/0x1e0
    devres_release_all+0x3f/0x50
    device_release_driver_internal+0x100/0x1c0
    driver_detach+0x4c/0x8f
    bus_remove_driver+0x5c/0xd0
    driver_unregister+0x31/0x50
    dax_pmem_exit+0x10/0xfe0 [dax_pmem]
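
    A hedged sketch of one way to guarantee the pairing, consistent with the
    description above (the exact upstream ordering in try_remove_memory() may
    differ):

    /*
     * All memory blocks must be offlined before removing memory.  Do the
     * check before mem_hotplug_begin() so an early error return cannot
     * leave the hotplug lock held without a matching mem_hotplug_done().
     */
    rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
    if (rc)
            return rc;

    mem_hotplug_begin();
    /* ... actual memory removal ... */
    mem_hotplug_done();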

    Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
    Signed-off-by: Jia He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: [5.6+]
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chuhong Yuan
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Kaly Xin
    Cc: Logan Gunthorpe
    Cc: Masahiro Yamada
    Cc: Mike Rapoport
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vishal Verma
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jia He
     
  • commit a6f23d14ec7d7d02220ad8bb2774be3322b9aeec upstream.

    When a workload runs in cgroups that aren't directly below the root
    cgroup and their parent specifies reclaim protection, that protection may
    end up ineffective.

    The reason is that propagate_protected_usage() is not called all the way
    up the hierarchy. All the protected usage is incorrectly accumulated in
    the workload's parent. This means that siblings_low_usage is
    overestimated and the effective protection underestimated. Even though
    this is a transitional phenomenon (the uncharge path does correct
    propagation and fixes the wrong children_low_usage), it can undermine the
    intended protection unexpectedly.

    We have noticed this problem while seeing a swap out in a descendant of a
    protected memcg (intermediate node) while the parent was conveniently
    under its protection limit and the memory pressure was external to that
    hierarchy. Michal has pinpointed this down to the wrong
    siblings_low_usage which led to the unwanted reclaim.

    The fix is simply updating children_low_usage in respective ancestors also
    in the charging path.
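
    A hedged sketch of the charge path in mm/page_counter.c after the fix:
    protected usage is propagated for each level 'c', not just for the
    counter the charge started from (the watermark update present in the
    real function is omitted here):

    void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
    {
            struct page_counter *c;

            for (c = counter; c; c = c->parent) {
                    long new;

                    new = atomic_long_add_return(nr_pages, &c->usage);
                    propagate_protected_usage(c, new);   /* was: counter */
            }
    }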

    Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
    Signed-off-by: Michal Koutný
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: [4.18+]
    Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Koutný