01 Oct, 2020

8 commits


29 Sep, 2020

3 commits

  • The async buffered reads feature does not work when readahead is
    turned off. There are two problems:

    - when retrying in io_read, not only the IOCB_WAITQ flag but also
    the IOCB_NOWAIT flag is still set, which makes it go to the
    would_block phase in generic_file_buffered_read() and return -EAGAIN.
    After that, the io-wq thread work is queued, and the async read is
    later done the old way.

    - even if we remove IOCB_NOWAIT when retrying, the feature still
    does not work properly, since generic_file_buffered_read() goes to
    lock_page_killable() after calling mapping->a_ops->readpage() to do
    the IO, which causes the process to sleep.
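The two conditions above can be modelled in userspace. This is a minimal sketch, not the kernel code: the flag values, the `read_outcome` enum, and the helpers `buffered_read_step()` and `prepare_retry()` are hypothetical names invented for illustration. It only shows the decision logic the message describes: a retry that keeps IOCB_NOWAIT bails out with -EAGAIN, and a retry without IOCB_WAITQ handling would end in a sleeping page lock.

```c
#include <errno.h>

/* Hypothetical flag values; the kernel's IOCB_* constants differ. */
#define IOCB_NOWAIT (1u << 0)
#define IOCB_WAITQ  (1u << 1)

enum read_outcome { READ_DONE, READ_ASYNC_WAIT, READ_WOULD_SLEEP };

/*
 * Model of the step taken when the requested page is not yet up to date.
 * Returns 0 and sets *outcome, or -EAGAIN when the request has to be
 * punted to a worker thread.
 */
static int buffered_read_step(unsigned int ki_flags, enum read_outcome *outcome)
{
    if (ki_flags & IOCB_NOWAIT)
        return -EAGAIN;               /* the would_block path */
    if (ki_flags & IOCB_WAITQ)
        *outcome = READ_ASYNC_WAIT;   /* wait through a callback, no sleep */
    else
        *outcome = READ_WOULD_SLEEP;  /* blocking page lock */
    return 0;
}

/* Retry setup: arm the wait queue and drop NOWAIT so the retry can proceed. */
static unsigned int prepare_retry(unsigned int ki_flags)
{
    return (ki_flags | IOCB_WAITQ) & ~IOCB_NOWAIT;
}
```

With both flags set the step returns -EAGAIN, which corresponds to the first problem; after `prepare_retry()` the step reports `READ_ASYNC_WAIT` instead of a sleeping lock, which corresponds to the second.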

    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Fixes: 3b2a4439e0ae ("io_uring: get rid of kiocb_wait_page_queue_init()")
    Signed-off-by: Hao Xu
    Signed-off-by: Jens Axboe

    Hao Xu
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    - NFSv4.2: copy_file_range needs to invalidate caches on success

    - NFSv4.2: Fix security label length not being reset

    - pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
    on read

    - pNFS/flexfiles: Fix signed/unsigned type issues with mirror
    indices"

    * tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    pNFS/flexfiles: Be consistent about mirror index types
    pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
    NFSv4.2: fix client's attribute cache management for copy_file_range
    nfs: Fix security label length not being reset

    Linus Torvalds
     
  • It seems likely this block was pasted from internal_get_user_pages_fast,
    which is not passed an mm struct and therefore uses current's. But
    __get_user_pages_locked is passed an explicit mm, and current->mm is not
    always valid. This was hit when being called from i915, which uses:

    pin_user_pages_remote->
    __get_user_pages_remote->
    __gup_longterm_locked->
    __get_user_pages_locked

    Before, this would lead to an OOPS:

    BUG: kernel NULL pointer dereference, address: 0000000000000064
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
    Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
    Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
    RIP: 0010:__get_user_pages_remote+0xd7/0x310
    Call Trace:
    __i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
    process_one_work+0x1ca/0x390
    worker_thread+0x48/0x3c0
    kthread+0x114/0x130
    ret_from_fork+0x1f/0x30
    CR2: 0000000000000064

    This commit fixes the problem by using the mm pointer passed to the
    function rather than the bogus one in current.
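A minimal userspace sketch of the shape of this fix, assuming hypothetical stand-in types (`struct mm`, `struct task`) and a hypothetical helper `pin_pages_remote()`. The point it illustrates is only that a worker thread may have no address space of its own, so the function has to use the mm it was handed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's mm_struct and task_struct. */
struct mm { int has_pinned; };
struct task { struct mm *mm; };

/* A kernel worker thread has no address space of its own. */
static struct task worker = { .mm = NULL };
static struct task *current_task = &worker;

/*
 * Fixed shape: mark the mm the caller asked us to operate on.
 * The buggy shape would have written current_task->mm->has_pinned,
 * which dereferences NULL when called from a worker.
 */
static int pin_pages_remote(struct mm *mm)
{
    if (mm == NULL)
        return -1;
    mm->has_pinned = 1;
    return 0;
}
```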

    Fixes: 008cfe4418b3 ("mm: Introduce mm_struct.has_pinned")
    Tested-by: Chris Wilson
    Reported-by: Harald Arnesen
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Peter Xu
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: Linus Torvalds

    Jason A. Donenfeld
     

28 Sep, 2020

10 commits

  • syzbot reports a potential lock deadlock between the normal IO path and
    ->show_fdinfo():

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-rc6-syzkaller #0 Not tainted
    ------------------------------------------------------
    syz-executor.2/19710 is trying to acquire lock:
    ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296

    but task is already holding lock:
    ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (&ctx->uring_lock){+.+.}-{3:3}:
    __mutex_lock_common kernel/locking/mutex.c:956 [inline]
    __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
    __io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
    io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
    seq_show+0x4a8/0x700 fs/proc/fd.c:65
    seq_read+0x432/0x1070 fs/seq_file.c:208
    do_loop_readv_writev fs/read_write.c:734 [inline]
    do_loop_readv_writev fs/read_write.c:721 [inline]
    do_iter_read+0x48e/0x6e0 fs/read_write.c:955
    vfs_readv+0xe5/0x150 fs/read_write.c:1073
    kernel_readv fs/splice.c:355 [inline]
    default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
    do_splice_to+0x137/0x170 fs/splice.c:871
    splice_direct_to_actor+0x307/0x980 fs/splice.c:950
    do_splice_direct+0x1b3/0x280 fs/splice.c:1059
    do_sendfile+0x55f/0xd40 fs/read_write.c:1540
    __do_sys_sendfile64 fs/read_write.c:1601 [inline]
    __se_sys_sendfile64 fs/read_write.c:1587 [inline]
    __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&p->lock){+.+.}-{3:3}:
    __mutex_lock_common kernel/locking/mutex.c:956 [inline]
    __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
    seq_read+0x61/0x1070 fs/seq_file.c:155
    pde_read fs/proc/inode.c:306 [inline]
    proc_reg_read+0x221/0x300 fs/proc/inode.c:318
    do_loop_readv_writev fs/read_write.c:734 [inline]
    do_loop_readv_writev fs/read_write.c:721 [inline]
    do_iter_read+0x48e/0x6e0 fs/read_write.c:955
    vfs_readv+0xe5/0x150 fs/read_write.c:1073
    kernel_readv fs/splice.c:355 [inline]
    default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
    do_splice_to+0x137/0x170 fs/splice.c:871
    splice_direct_to_actor+0x307/0x980 fs/splice.c:950
    do_splice_direct+0x1b3/0x280 fs/splice.c:1059
    do_sendfile+0x55f/0xd40 fs/read_write.c:1540
    __do_sys_sendfile64 fs/read_write.c:1601 [inline]
    __se_sys_sendfile64 fs/read_write.c:1587 [inline]
    __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (sb_writers#4){.+.+}-{0:0}:
    check_prev_add kernel/locking/lockdep.c:2496 [inline]
    check_prevs_add kernel/locking/lockdep.c:2601 [inline]
    validate_chain kernel/locking/lockdep.c:3218 [inline]
    __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
    lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
    percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
    __sb_start_write+0x228/0x450 fs/super.c:1672
    io_write+0x6b5/0xb30 fs/io_uring.c:3296
    io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
    __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
    io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
    io_submit_sqe fs/io_uring.c:6324 [inline]
    io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
    __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    other info that might help us debug this:

    Chain exists of:
    sb_writers#4 --> &p->lock --> &ctx->uring_lock

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&ctx->uring_lock);
    lock(&p->lock);
    lock(&ctx->uring_lock);
    lock(sb_writers#4);

    *** DEADLOCK ***

    1 lock held by syz-executor.2/19710:
    #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

    stack backtrace:
    CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x198/0x1fd lib/dump_stack.c:118
    check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
    check_prev_add kernel/locking/lockdep.c:2496 [inline]
    check_prevs_add kernel/locking/lockdep.c:2601 [inline]
    validate_chain kernel/locking/lockdep.c:3218 [inline]
    __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
    lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
    percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
    __sb_start_write+0x228/0x450 fs/super.c:1672
    io_write+0x6b5/0xb30 fs/io_uring.c:3296
    io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
    __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
    io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
    io_submit_sqe fs/io_uring.c:6324 [inline]
    io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
    __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x45e179
    Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
    RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
    RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
    R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c

    Fix this by just not diving into details if we fail to trylock the
    io_uring mutex. We know the ctx isn't going away during this operation,
    but we cannot safely iterate buffers/files/personalities if we don't
    hold the io_uring mutex.
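The trylock approach can be sketched in userspace with a C11 `atomic_flag`. This is an illustration under stated assumptions, not the io_uring code: `struct ring_ctx`, `ctx_trylock()`, `ctx_unlock()` and `show_info()` are hypothetical names. The reporting path never waits for the lock, so it cannot take part in a lock cycle; it simply skips the details when the lock is busy.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical context: a lock plus state that is only safe to walk under it. */
struct ring_ctx {
    atomic_flag lock;
    int nr_files;
};

static bool ctx_trylock(struct ring_ctx *ctx)
{
    /* test_and_set returns the previous value: false means we got the lock. */
    return !atomic_flag_test_and_set(&ctx->lock);
}

static void ctx_unlock(struct ring_ctx *ctx)
{
    atomic_flag_clear(&ctx->lock);
}

/* Writes a summary into buf; returns true if the details were included. */
static bool show_info(struct ring_ctx *ctx, char *buf, size_t len)
{
    if (!ctx_trylock(ctx)) {
        snprintf(buf, len, "busy\n");  /* never block here */
        return false;
    }
    snprintf(buf, len, "files: %d\n", ctx->nr_files);
    ctx_unlock(ctx);
    return true;
}
```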

    Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • syzbot reports a crash with tty polling, which is using the double poll
    handling:

    general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
    CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
    RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
    Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
    RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
    RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
    RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
    R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
    R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
    FS: 0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __wake_up_common+0x147/0x650 kernel/sched/wait.c:93
    __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
    tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
    __tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
    __tty_hangup drivers/tty/tty_io.c:575 [inline]
    tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
    pty_close+0x3f5/0x550 drivers/tty/pty.c:79
    tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
    __fput+0x285/0x920 fs/file_table.c:281
    task_work_run+0xdd/0x190 kernel/task_work.c:141
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
    exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
    syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x401210

    which is due to a failure in removing the double poll wait entry if we
    hit a wakeup match. This can cause multiple invocations of the wakeup,
    which isn't safe.

    Cc: stable@vger.kernel.org # v5.8
    Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Linus Torvalds
     
  • …/masahiroy/linux-kbuild

    Pull Kbuild fixes from Masahiro Yamada:

    - ignore compiler stubs for PPC to fix builds

    - fix the usage of --target mentioned in the LLVM document

    * tag 'kbuild-fixes-v5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    Documentation/llvm: Fix clang target examples
    scripts/kallsyms: skip ppc compiler stub *.long_branch.* / *.plt_branch.*

    Linus Torvalds
     
  • Pull x86 fixes from Thomas Gleixner:
    "Two fixes for the x86 interrupt code:

    - Unbreak the magic 'search the timer interrupt' logic in IO/APIC
    code which got wrecked when the core interrupt code made the
    state tracking logic stricter.

    That caused the interrupt line to stay masked after switching from
    IO/APIC to PIC delivery mode, which obviously prevents interrupts
    from being delivered.

    - Make run_on_irqstack_cond() typesafe. The function argument is a
    void pointer which is then cast to 'void (*fun)(void *)'.

    This breaks Control Flow Integrity checking in clang. Use proper
    helper functions for the three variants required"

    * tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/ioapic: Unbreak check_timer()
    x86/irq: Make run_on_irqstack_cond() typesafe
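The typesafety point can be illustrated with a small sketch. All names here (`struct regs`, `struct desc`, the `run_*` helpers and the callbacks) are hypothetical and are not the kernel's helpers. The idea is one helper per callback shape, each taking an exactly typed function pointer, so no call goes through a pointer that was cast from `void *`.

```c
/* Hypothetical stand-ins; the kernel's real types and helpers differ. */
struct regs { int vector; };
struct desc { int irq; };

static int handled;  /* records which callbacks ran */

/* One helper per callback shape, each taking an exactly typed pointer. */
static void run_plain(void (*func)(void))
{
    func();
}

static void run_with_regs(void (*func)(struct regs *), struct regs *regs)
{
    func(regs);
}

static void run_with_desc(void (*func)(struct desc *), struct desc *desc)
{
    func(desc);
}

/* Example callbacks whose signatures match the helpers above. */
static void do_plain_work(void) { handled += 1; }
static void do_sysvec(struct regs *regs) { handled += regs->vector; }
static void do_irq(struct desc *desc) { handled += desc->irq; }
```

Passing `do_irq` to `run_with_regs()` draws a compiler diagnostic for the incompatible pointer type, whereas a `void *` parameter would accept it silently.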

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "A set of clocksource/clockevents updates:

    - Reset the TI/DM timer before enabling it instead of doing it the
    other way round.

    - Initialize the reload value for the GX6605s timer correctly so the
    hardware counter starts at 0 again after overrun.

    - Make error return value negative in the h8300 timer init function"

    * tag 'timers-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource/drivers/timer-gx6605s: Fixup counter reload
    clocksource/drivers/timer-ti-dm: Do reset before enable
    clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()

    Linus Torvalds
     
  • Pinned pages shouldn't be write-protected when fork() happens, because
    follow up copy-on-write on these pages could cause the pinned pages to
    be replaced by random newly allocated pages.

    For huge PMDs, we split the huge pmd if pinning is detected, so that
    future handling will be done at the PTE level (with our latest changes,
    each of the small pages will be copied). We can achieve this by letting
    copy_huge_pmd() return -EAGAIN for pinned pages, so that we'll
    fall through in copy_pmd_range() and finally land in the next
    copy_pte_range() call.

    Huge PUDs are even more special: so far they do not support
    anonymous pages. But they can be handled the same way as huge PMDs,
    even though splitting a huge PUD means erasing the PUD entry. This
    guarantees that follow-up faults will remap the same pages in either
    parent or child later.

    This might not be the most efficient way, but it should be easy and
    clean enough. It should be fine, since we're tackling a very rare
    case just to make sure userspace programs that pinned some THPs will
    still work even without MADV_DONTFORK and after they fork().
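The -EAGAIN fall-through can be sketched as plain control flow. This is a simplified model with hypothetical types and names (`struct huge_entry`, `copy_huge()`, `copy_range_model()`), not the mm code: the huge-level copy refuses pinned mappings, and the caller treats that refusal as a request to take the per-PTE path rather than as an error.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of one mapping being copied at fork. */
struct huge_entry { bool is_huge; bool pinned; };

enum copy_path { COPIED_HUGE, COPIED_PER_PTE };

/* Models the huge-level copy: refuse to share a pinned huge mapping. */
static int copy_huge(const struct huge_entry *e)
{
    if (e->pinned)
        return -EAGAIN;  /* caller falls back to the PTE level */
    return 0;
}

/* Models the caller: -EAGAIN selects the PTE path, other errors propagate. */
static int copy_range_model(const struct huge_entry *e, enum copy_path *path)
{
    if (e->is_huge) {
        int err = copy_huge(e);

        if (err == 0) {
            *path = COPIED_HUGE;
            return 0;
        }
        if (err != -EAGAIN)
            return err;
        /* fall through to the per-PTE copy */
    }
    *path = COPIED_PER_PTE;
    return 0;
}
```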

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • This allows copy_pte_range() to do early cow if the pages were pinned on
    the source mm.

    Currently we don't have an accurate way to know whether a page is pinned
    or not. The only thing we have is page_maybe_dma_pinned(). However,
    that's good enough for now, especially with the newly added
    mm->has_pinned flag making sure we won't affect processes that never
    pinned any pages.

    It would be easier if we could do a GFP_KERNEL allocation within
    copy_one_pte(). Unfortunately, we can't, because the page table
    locks are held for both the parent and child processes. So the page
    allocation needs to be done outside copy_one_pte().

    There is some trickery in copy_present_pte(), mainly the wrprotect trick
    to block concurrent fast-gup. The comments in the function explain it
    in more detail.

    Oleg Nesterov reported a (probably harmless) bug during review: we
    didn't reset entry.val properly in copy_pte_range(), so there is
    potentially a chance of calling add_swap_count_continuation() multiple
    times on the same swp entry. However, that should be harmless, since even
    if it happens, add_swap_count_continuation() will return directly on
    noticing that there is enough space for the swp counter. So
    instead of a standalone stable patch, it is touched up in this patch
    directly.
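The "allocate outside the lock, then retry" pattern described above can be sketched in userspace. Everything here is hypothetical (`struct copy_state`, `copy_one()`, `copy_entries()`, the 64-byte allocation standing in for a page, and the `lock_held` field standing in for the page table locks). The step that runs under the lock never allocates; it returns -EAGAIN, and the caller drops the lock, allocates, and resumes at the same entry.

```c
#include <errno.h>
#include <stdlib.h>

struct copy_state {
    int lock_held;     /* stand-in for the page table locks */
    void *prealloc;    /* page prepared while unlocked */
    int copies_done;   /* number of early copies performed */
};

/* Runs with the lock held, so it must not allocate. */
static int copy_one(struct copy_state *s, int pinned)
{
    if (!pinned)
        return 0;           /* share the page, nothing to allocate */
    if (s->prealloc == NULL)
        return -EAGAIN;     /* ask the caller for a page */
    free(s->prealloc);      /* stand-in for consuming the page */
    s->prealloc = NULL;
    s->copies_done++;
    return 0;
}

static int copy_entries(struct copy_state *s, const int *pinned, int n)
{
    int i = 0;

    while (i < n) {
        int ret = 0;

        s->lock_held = 1;
        for (; i < n; i++) {
            ret = copy_one(s, pinned[i]);
            if (ret)
                break;      /* i still names the entry to retry */
        }
        s->lock_held = 0;

        if (ret == -EAGAIN) {
            s->prealloc = malloc(64);  /* allocation happens unlocked */
            if (s->prealloc == NULL)
                return -ENOMEM;
        }
    }
    return 0;
}
```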

    Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • This prepares for the future work to trigger early cow on pinned pages
    during fork().

    No functional change intended.

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • (Commit message mostly collected from Jason Gunthorpe)

    Reduce the chance of false positives from page_maybe_dma_pinned() by
    keeping track of whether the mm_struct has ever been used with
    pin_user_pages(). This allows cases that might drive up the page
    ref_count to avoid any penalty from handling dma_pinned pages.

    Future work is planned, to provide a more sophisticated solution, likely
    to turn it into a real counter. For now, make it atomic_t but use it as
    a boolean for simplicity.
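A minimal sketch of the idea using C11 atomics, assuming a hypothetical `struct mm`, a hypothetical refcount threshold, and hypothetical helper names. The per-mm flag acts as a cheap first filter: a process that never pinned anything skips the refcount heuristic entirely, so a high reference count alone cannot produce a false positive for it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm_struct. */
struct mm { atomic_int has_pinned; };

/* Called on the pin path: an atomic integer used as a boolean. */
static void note_pin(struct mm *mm)
{
    atomic_store(&mm->has_pinned, 1);
}

/* Hypothetical heuristic: a large refcount may indicate a pin. */
static bool page_maybe_pinned(int refcount)
{
    return refcount >= 1024;
}

static bool needs_pin_handling(struct mm *mm, int refcount)
{
    if (!atomic_load(&mm->has_pinned))
        return false;   /* this mm never pinned: skip the heuristic */
    return page_maybe_pinned(refcount);
}
```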

    Suggested-by: Jason Gunthorpe
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

27 Sep, 2020

17 commits

  • Pull clocksource/clockevent fixes from Daniel Lezcano:

    - Fix wrong signed return value when checking of_iomap in the probe
    function for the h8300 timer (Tianjia Zhang)

    - Fix reset sequence when setting up the timer on the dm_timer (Tony
    Lindgren)

    - Fix counter reload when the interrupt fires on gx6605s (Guo Ren)

    Thomas Gleixner
     
  • Pull SCSI fixes from James Bottomley:
    "Three fixes: one in drivers (lpfc) and two for zoned block devices.

    The latter also impinges on the block layer but only to introduce a
    new block API for setting the zone model rather than fiddling with the
    queue directly in the zoned block driver"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: sd: sd_zbc: Fix ZBC disk initialization
    scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
    scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported

    Linus Torvalds
     
  • Pull io_uring fixes from Jens Axboe:
    "Two fixes for regressions in this cycle, and one that goes to 5.8
    stable:

    - fix leak of getname() retrieved filename

    - remove plug->nowait assignment, fixing a regression with btrfs

    - fix for async buffered retry"

    * tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
    io_uring: ensure async buffered read-retry is setup properly
    io_uring: don't unconditionally set plug->nowait = true
    io_uring: ensure open/openat2 name is cleaned on cancelation

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "NVMe pull request from Christoph, and removal of a dead define.

    - fix error during controller probe that causes double free irqs
    (Keith Busch)

    - FC connection establishment fix (James Smart)

    - properly handle completions for invalid tags (Xianting Tian)

    - pass the correct nsid to the command effects and supported log
    (Chaitanya Kulkarni)"

    * tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
    block: remove unused BLK_QC_T_EAGAIN flag
    nvme-core: don't use NVME_NSID_ALL for command effects and supported log
    nvme-fc: fail new connections to a deleted host or remote port
    nvme-pci: fix NULL req in completion handler
    nvme: return errors for hwmon init

    Linus Torvalds
     
  • Pull s390 fix from Vasily Gorbik:
    "Fix truncated ZCRYPT_PERDEV_REQCNT ioctl result. Copy entire reqcnt
    list"

    * tag 's390-5.9-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/zcrypt: Fix ZCRYPT_PERDEV_REQCNT ioctl

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "9 patches.

    Subsystems affected by this patch series: mm (thp, memcg, gup,
    migration, memory-hotplug), lib, and x86"

    * emailed patches from Andrew Morton:
    mm: don't rely on system state to detect hot-plug operations
    mm: replace memmap_context by meminit_context
    arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
    lib/memregion.c: include memregion.h
    lib/string.c: implement stpcpy
    mm/migrate: correct thp migration stats
    mm/gup: fix gup_fast with dynamic page table folding
    mm: memcontrol: fix missing suffix of workingset_restore
    mm, THP, swap: fix allocating cluster for swapfile by mistake

    Linus Torvalds
     
  • syzbot reported the following KASAN splat:

    general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
    CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
    Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
    RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
    Call Trace:
    lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:354 [inline]
    madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
    walk_pmd_range mm/pagewalk.c:89 [inline]
    walk_pud_range mm/pagewalk.c:160 [inline]
    walk_p4d_range mm/pagewalk.c:193 [inline]
    walk_pgd_range mm/pagewalk.c:229 [inline]
    __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
    walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
    madvise_pageout_page_range mm/madvise.c:521 [inline]
    madvise_pageout mm/madvise.c:557 [inline]
    madvise_vma mm/madvise.c:946 [inline]
    do_madvise+0x12d0/0x2090 mm/madvise.c:1145
    __do_sys_madvise mm/madvise.c:1171 [inline]
    __se_sys_madvise mm/madvise.c:1169 [inline]
    __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The backing vma was shmem.

    When the page of a file-backed THP is split, madvise zaps the pmd
    instead of remapping the sub-pages. So we need to check pmd validity
    after the split.

    Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
    Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
    Signed-off-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In register_mem_sect_under_node() the system_state's value is checked to
    detect whether the call is made during boot time or during a hot-plug
    operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
    because regular memory is registered at SYSTEM_SCHEDULING state. In
    addition, memory hot-plug operation can be triggered at this system
    state by the ACPI [1]. So checking against the system state is not
    enough.

    The consequence shows up on a system with interleaved node ranges like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    This can be seen on PowerPC LPAR after multiple memory hot-plug and
    hot-unplug operations are done. At the next reboot the node's memory
    ranges can be interleaved and since the call to link_mem_sections() is
    made in topology_init() while the system is in the SYSTEM_SCHEDULING
    state, the node's id is not checked, and the sections are registered to
    multiple nodes:

    $ ls -l /sys/devices/system/memory/memory21/node*
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2

    In that case, the system is able to boot, but if one of these
    memory blocks is later hot-unplugged and then hot-plugged, the sysfs
    inconsistency is detected and triggers a BUG_ON():

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This patch addresses the root cause by not relying on the system_state
    value to detect whether the call is due to a hot-plug operation. An
    extra parameter is added to link_mem_sections() detailing whether the
    operation is due to a hot-plug operation.
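The change of approach can be sketched as follows. This is a simplified model, not the kernel function: `struct section` and `should_link()` are hypothetical, and the enumerator names are assumed here for illustration. The caller states explicitly whether it is a boot-time or hot-plug call, instead of the callee guessing from a global system state that is ambiguous during SYSTEM_SCHEDULING.

```c
#include <stdbool.h>

/* Explicit context passed by the caller (enumerator names assumed). */
enum meminit_context { MEMINIT_EARLY, MEMINIT_HOTPLUG };

/* Hypothetical section record: the node that really owns the memory. */
struct section { int nid; };

/* Returns true if the section should be linked under node `nid`. */
static bool should_link(const struct section *sec, int nid,
                        enum meminit_context ctx)
{
    if (ctx == MEMINIT_HOTPLUG)
        return true;          /* hot-added memory: trust the caller's nid */
    return sec->nid == nid;   /* boot: ranges may interleave, so check */
}
```

With the explicit argument, a boot-time call for a block that spans interleaved ranges links each section to its owning node only.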

    [1] According to Oscar Salvador, using this qemu command line, ACPI
    memory hotplug operations are raised at SYSTEM_SCHEDULING state:

    $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
    -m size=$MEM,slots=255,maxmem=4294967296k \
    -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
    -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
    -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
    -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
    -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
    -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
    -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
    -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

    Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Fenghua Yu
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Patch series "mm: fix memory to node bad links in sysfs", v3.

    Sometimes, firmware may expose interleaved memory layout like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    In that case, we can see memory blocks assigned to multiple nodes in
    sysfs:

    $ ls -l /sys/devices/system/memory/memory21
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
    drwxr-xr-x 2 root root 0 Aug 24 05:27 power
    -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
    lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
    -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
    -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

    The same applies in the node's directory with a memory21 link in both
    the node1 and node2's directory.

    This is wrong but doesn't prevent the system from running. However, when
    one of these memory blocks is later hot-unplugged and then hot-plugged,
    the system detects an inconsistency in the sysfs layout and a
    BUG_ON() is raised:

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This has been seen on PowerPC LPAR.

    The root cause of this issue is that when node's memory is registered,
    the range used can overlap another node's range, thus the memory block
    is registered to multiple nodes in sysfs.

    There are two issues here:

    (a) The sysfs memory and node's layouts are broken due to these
    multiple links

    (b) The link errors in link_mem_sections() should not lead to a system
    panic.

    To address (a), register_mem_sect_under_node() should not rely on the
    system state to detect whether the link operation is triggered by a
    hot-plug operation or not. This is addressed by patches 1 and 2 of this
    series.

    Issue (b) will be addressed separately.

    This patch (of 2):

    The memmap_context enum is used to detect whether a memory operation is
    due to a hot-add operation or happening at boot time.

    Make it general to the hotplug operation and rename it as
    meminit_context.

    There is no functional change introduced by this patch.

    Suggested-by: David Hildenbrand
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J . Wysocki"
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
    Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
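
A minimal sketch of the idea behind the rename in the entry above: the caller states explicitly whether memory is being initialised at boot or by a hotplug operation, instead of the callee inferring it from system state. The enumerator names `MEMINIT_EARLY` and `MEMINIT_HOTPLUG` and the helper `is_hotplug_operation()` are assumptions made for illustration; the text above only names the enum itself.

```c
#include <stdbool.h>

/* Assumed enumerator names; the commit text only names the enum. */
enum meminit_context {
    MEMINIT_EARLY,    /* memory initialised at boot time */
    MEMINIT_HOTPLUG,  /* memory added by a hotplug operation */
};

/*
 * Hypothetical helper: the answer depends only on the explicit argument,
 * never on global system state.
 */
static bool is_hotplug_operation(enum meminit_context context)
{
    return context == MEMINIT_HOTPLUG;
}
```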
     
  • If we copy less than 8 bytes and if the destination crosses a cache
    line, __copy_user_flushcache would invalidate only the first cache line.

    This patch makes it invalidate the second cache line as well.

    Fixes: 0aed55af88345b ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations")
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Andrew Morton
    Reviewed-by: Dan Williams
    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Toshi Kani
    Cc: "H. Peter Anvin"
    Cc: Al Viro
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Ingo Molnar
    Cc:
    Link: https://lkml.kernel.org/r/alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
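
To illustrate why a copy of fewer than 8 bytes can still need two cache lines handled, the following sketch counts the lines a destination range touches. The 64-byte line size and the helper name `cache_lines_touched()` are assumptions for illustration and are not taken from `__copy_user_flushcache`.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size for this sketch */

/* Number of cache lines covered by [dest, dest + size). */
static unsigned int cache_lines_touched(uintptr_t dest, size_t size)
{
    if (size == 0)
        return 0;

    uintptr_t first = dest / CACHE_LINE_SIZE;
    uintptr_t last = (dest + size - 1) / CACHE_LINE_SIZE;

    return (unsigned int)(last - first + 1);
}
```

A 7-byte copy starting at offset 60 covers bytes 60 to 66 and therefore spans two lines, which is the case the entry above describes.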
     
  • This addresses the following sparse warnings:

    lib/memregion.c:8:5: warning: symbol 'memregion_alloc' was not declared. Should it be static?
    lib/memregion.c:14:6: warning: symbol 'memregion_free' was not declared. Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200921142852.875312-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     
  • LLVM recently implemented a "libcall optimization" that lowers calls to
    `sprintf(dest, "%s", str)`, when the return value is used, to
    `stpcpy(dest, str) - dest`.

    This generally avoids the machinery involved in parsing format strings.
    `stpcpy` is just like `strcpy` except it returns the pointer to the new
    tail of `dest`. This optimization was introduced into clang-12.

    Implement this so that we don't observe linkage failures due to missing
    symbol definitions for `stpcpy`.

    Similar to last year's fire drill with: commit 5f074f3e192f
    ("lib/string.c: implement a basic bcmp")

    The kernel is somewhere between a "freestanding" environment (no full
    libc) and "hosted" environment (many symbols from libc exist with the
    same type, function signature, and semantics).

    As Peter Anvin notes, there's not really a great way to inform the
    compiler that you're targeting a freestanding environment but would like
    to opt-in to some libcall optimizations (see pr/47280 below), rather
    than opt-out.

    Arvind notes that -fno-builtin-* behaves slightly differently between GCC
    and Clang, and Clang is missing many __builtin_* definitions, which I
    consider a bug in Clang and am working on fixing.

    Masahiro summarizes the subtle distinction between compilers justly:
    To prevent transformation from foo() into bar(), there are two ways in
    Clang to do that; -fno-builtin-foo, and -fno-builtin-bar. There is
    only one in GCC; -fno-builtin-foo.

    (Any difference in that behavior in Clang is likely a bug from a missing
    __builtin_* definition.)

    Masahiro also notes:
    We want to disable optimization from foo() to bar(),
    but we may still benefit from the optimization from
    foo() into something else. If GCC implements the same transform, we
    would run into a problem because it is not -fno-builtin-bar, but
    -fno-builtin-foo that disables that optimization.

    In this regard, -fno-builtin-foo would be more future-proof than
    -fno-builtin-bar, but -fno-builtin-foo is still potentially overkill. We
    may want to prevent calls from foo() being optimized into calls to
    bar(), but we still may want other optimization on calls to foo().

    It seems that compilers today don't quite provide the fine-grained
    control over libcall optimizations that pseudo-freestanding environments
    would prefer.

    Finally, Kees notes that this interface is unsafe, so we should not
    encourage its use. As such, I've removed the declaration from any
    header, but it still needs to be exported to avoid linkage errors in
    modules.

    Reported-by: Sami Tolvanen
    Suggested-by: Andy Lavr
    Suggested-by: Arvind Sankar
    Suggested-by: Joe Perches
    Suggested-by: Kees Cook
    Suggested-by: Masahiro Yamada
    Suggested-by: Rasmus Villemoes
    Signed-off-by: Nick Desaulniers
    Signed-off-by: Andrew Morton
    Tested-by: Nathan Chancellor
    Cc:
    Link: https://lkml.kernel.org/r/20200914161643.938408-1-ndesaulniers@google.com
    Link: https://bugs.llvm.org/show_bug.cgi?id=47162
    Link: https://bugs.llvm.org/show_bug.cgi?id=47280
    Link: https://github.com/ClangBuiltLinux/linux/issues/1126
    Link: https://man7.org/linux/man-pages/man3/stpcpy.3.html
    Link: https://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html
    Link: https://reviews.llvm.org/D85963
    Signed-off-by: Linus Torvalds

    Nick Desaulniers
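
A minimal user-space sketch of the semantics described in the entry above: copy a string and return a pointer to the new terminating NUL, so that subtracting `dest` gives the same length `sprintf(dest, "%s", str)` would return. The names `sketch_stpcpy()` and `copy_and_measure()` are hypothetical and chosen to avoid clashing with a libc `stpcpy`; this is not the kernel's implementation.

```c
#include <stddef.h>

/* Like strcpy, but returns a pointer to the terminating NUL in dest. */
static char *sketch_stpcpy(char *dest, const char *src)
{
    while ((*dest = *src) != '\0') {
        dest++;
        src++;
    }
    return dest;
}

/* Number of characters copied, excluding the terminating NUL. */
static size_t copy_and_measure(char *dest, const char *src)
{
    return (size_t)(sketch_stpcpy(dest, src) - dest);
}
```

As with `strcpy`, nothing here bounds the write to the size of `dest`, which is the safety concern the entry mentions.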
     
  • PageTransHuge returns true for both thp and hugetlb, so the thp stats were
    counting both thp and hugetlb migrations. Exclude hugetlb migration by
    setting the is_thp variable correctly.

    Clean up the thp handling code too while we are there.

    Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split")
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Anshuman Khandual
    Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
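
The distinction in the entry above can be sketched with a small user-space model. The struct `page_model` and both helpers are hypothetical stand-ins, not kernel APIs; the only point shown is that a "huge" test that is true for both kinds of page has to be combined with a hugetlb test before counting THP events.

```c
#include <stdbool.h>

/* Hypothetical model of the two properties the entry talks about. */
struct page_model {
    bool trans_huge_test;  /* true for both THP and hugetlb pages */
    bool hugetlb;          /* true only for hugetlb pages */
};

/* Counting on the broad test alone also counts hugetlb pages. */
static bool counted_as_thp_before(const struct page_model *page)
{
    return page->trans_huge_test;
}

/* Excluding hugetlb leaves only real THP pages. */
static bool counted_as_thp_after(const struct page_model *page)
{
    return page->trans_huge_test && !page->hugetlb;
}
```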
     
  • Currently, to make sure that every page table entry is read just once,
    gup_fast walks perform READ_ONCE and pass the pXd value down to the next
    gup_pXd_range function by value, e.g.:

    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    ...
            pudp = pud_offset(&p4d, addr);

    This function passes a reference to that local value copy to pXd_offset,
    and might get the very same pointer in return. This happens when the
    level is folded (on most arches), and that pointer should not be
    iterated.

    On s390, because each task might have 5-, 4- or 3-level address
    translation and hence different levels folded, the logic is more
    complex, and a non-iterable pointer to a local copy leads to severe
    problems.

    Here is an example of what happens with gup_fast on s390, for a task
    with 3-level paging, crossing a 2 GB pud boundary:

    // addr = 0x1007ffff000, end = 0x10080001000
    static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                             unsigned int flags, struct page **pages, int *nr)
    {
            unsigned long next;
            pud_t *pudp;

            // pud_offset returns &p4d itself (a pointer to a value on stack)
            pudp = pud_offset(&p4d, addr);
            do {
                    // on second iteration reading "random" stack value
                    pud_t pud = READ_ONCE(*pudp);

                    // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                    next = pud_addr_end(addr, end);
                    ...
            } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

            return 1;
    }

    This happens since s390 moved to common gup code with commit
    d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
    commit 1a42010cdc26 ("s390/mm: convert to the generic
    get_user_pages_fast code").

    s390 tried to mimic static level folding by changing pXd_offset
    primitives to always calculate top level page table offset in pgd_offset
    and just return the value passed when pXd_offset has to act as folded.

    What is crucial for gup_fast and what has been overlooked is that
    PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
    And the latter is not possible with dynamic folding.

    To fix the issue, in addition to pXd values, pass the original pXdp
    pointers down to the gup_pXd_range functions, and introduce
    pXd_offset_lockless helpers, which take an additional pXd entry value
    parameter. This has already been discussed in

    https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

    Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Andrew Morton
    Reviewed-by: Gerald Schaefer
    Reviewed-by: Alexander Gordeev
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Mike Rapoport
    Reviewed-by: John Hubbard
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Andrey Ryabinin
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Claudio Imbrenda
    Cc: [5.2+]
    Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
    Signed-off-by: Linus Torvalds

    Vasily Gorbik
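
The fix in the entry above can be sketched in user space as follows. All names here (`entry_t`, `folded_offset_lockless()`, `walk_folded()`) are hypothetical; the sketch only shows that when a folded level hands back the pointer it was given, that pointer must be the original table pointer and not the address of an on-stack copy, so that stepping to the next entry stays inside the table.

```c
/* Hypothetical stand-in for a page table entry. */
typedef struct {
    unsigned long val;
} entry_t;

/*
 * Folded level: there is no lower table, so the helper returns the parent
 * pointer it was given. It receives the original pointer (parentp) as well
 * as the value that was read once (parent), mirroring the extra parameter
 * described above.
 */
static entry_t *folded_offset_lockless(entry_t *parentp, entry_t parent)
{
    (void)parent;  /* a non-folded level would derive the next table from this value */
    return parentp;
}

/* Walk n consecutive entries starting at the returned pointer. */
static unsigned long walk_folded(entry_t *parentp, entry_t parent, int n)
{
    entry_t *p = folded_offset_lockless(parentp, parent);
    unsigned long sum = 0;

    for (int i = 0; i < n; i++)
        sum += p[i].val;  /* safe: p points into the real table */

    return sum;
}
```

Had the helper been given the address of the local `parent` copy instead, the second loop iteration would read whatever follows that copy on the stack, which is the failure the s390 example shows.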
     
  • We forgot to add the suffix to the workingset_restore string, so fix it.

    And also update the documentation of cgroup-v2.rst.

    Fixes: 170b04b7ae49 ("mm/workingset: prepare the workingset detection infrastructure for anon LRU")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Randy Dunlap
    Link: https://lkml.kernel.org/r/20200916100030.71698-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • SWP_FS is used to make swap_{read,write}page() go through the
    filesystem, and it's only used for swap files over NFS. So, for now,
    !SWP_FS only means non-NFS; the swap area could be either file backed or
    device backed. Something similar goes for the legacy SWP_FILE.

    So in order to achieve the goal of the original patch, SWP_BLKDEV should
    be used instead.

    FS corruption can be observed with SSD device + XFS + fragmented
    swapfile due to CONFIG_THP_SWAP=y.

    I reproduced the issue with the following details:

    Environment:

    QEMU + upstream kernel + buildroot + NVMe (2 GB)

    Kernel config:

    CONFIG_BLK_DEV_NVME=y
    CONFIG_THP_SWAP=y

    Steps to reproduce:

    mkfs.xfs -f /dev/nvme0n1
    mkdir /tmp/mnt
    mount /dev/nvme0n1 /tmp/mnt
    bs="32k"
    sz="1024m" # doesn't matter too much, I also tried 16m
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

    mkswap /tmp/mnt/sw
    swapon /tmp/mnt/sw

    stress --vm 2 --vm-bytes 600M # doesn't matter too much as well

    Symptoms:
    - FS corruption (e.g. checksum failure)
    - memory corruption at: 0xd2808010
    - segfault

    Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
    Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Reviewed-by: Yang Shi
    Acked-by: Rafael Aquini
    Cc: Matthew Wilcox
    Cc: Carlos Maiolino
    Cc: Eric Sandeen
    Cc: Dave Chinner
    Cc:
    Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
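
The reasoning in the entry above can be sketched as two predicates over swap flags. The flag values and function names below are hypothetical and do not match the kernel's definitions; the sketch only shows why "not SWP_FS" is a weaker condition than "is SWP_BLKDEV" for a swap file that sits on a local filesystem.

```c
#include <stdbool.h>

/* Hypothetical flag values for this sketch only. */
#define SKETCH_SWP_BLKDEV (1u << 0)  /* swap area is a block device */
#define SKETCH_SWP_FS     (1u << 1)  /* swap I/O goes through the filesystem (NFS) */

/* Old test: anything that is not NFS-style swap is treated as eligible. */
static bool huge_cluster_allowed_old(unsigned int flags)
{
    return !(flags & SKETCH_SWP_FS);
}

/* New test: only a real block device is eligible. */
static bool huge_cluster_allowed_new(unsigned int flags)
{
    return (flags & SKETCH_SWP_BLKDEV) != 0;
}
```

A swap file on XFS carries neither flag in this model, so the old test accepts it and the new one rejects it.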
     
  • With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
    kmem_caches for all allocations"), it becomes possible to call kfree()
    from the slabs_destroy().

    The functions cache_flusharray() and do_drain() call slabs_destroy() on
    the array_cache of the local CPU without updating the size of the
    array_cache. This enables the kfree() call from slabs_destroy() to
    recursively call cache_flusharray(), which can potentially call
    free_block() on the same elements of the array_cache of the local CPU,
    causing a double free and memory corruption.

    To fix the issue, simply update the local CPU array_cache before
    calling slabs_destroy().

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Tested-by: Ming Lei
    Reported-by: kernel test robot
    Cc: Andrew Morton
    Cc: Ted Ts'o
    Signed-off-by: Linus Torvalds

    Shakeel Butt
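
The ordering problem in the entry above can be modelled in user space. Everything below is a hypothetical stand-in (the struct layout, `flush()`, `destroy_slabs()` and the `update_first` switch are not kernel code): a flush frees a batch of cached objects, and the destroy step may re-enter the flush once. If the cache bookkeeping is updated only after the destroy step, the re-entered flush sees the same entries again.

```c
#include <string.h>

#define AC_MAX 8
#define NOBJ   16

struct array_cache {
    int avail;
    int entry[AC_MAX];  /* object ids held in the per-CPU cache */
};

static int free_count[NOBJ];  /* how many times each object was freed */

static void flush(struct array_cache *ac, int batch, int depth, int update_first);

/* Stand-in for slabs_destroy(): its internal free re-enters the flush once. */
static void destroy_slabs(struct array_cache *ac, int depth, int update_first)
{
    if (depth == 0 && ac->avail > 0)
        flush(ac, ac->avail < 2 ? ac->avail : 2, 1, update_first);
}

/* Drop the first `batch` entries from the cache. */
static void consume(struct array_cache *ac, int batch)
{
    ac->avail -= batch;
    memmove(ac->entry, ac->entry + batch, sizeof(int) * (size_t)ac->avail);
}

static void flush(struct array_cache *ac, int batch, int depth, int update_first)
{
    for (int i = 0; i < batch; i++)
        free_count[ac->entry[i]]++;  /* stand-in for free_block() */

    if (update_first) {
        consume(ac, batch);          /* fixed order: update, then destroy */
        destroy_slabs(ac, depth, update_first);
    } else {
        destroy_slabs(ac, depth, update_first);
        consume(ac, batch);          /* old order: destroy sees stale entries */
    }
}

/* Returns the highest per-object free count after one top-level flush. */
static int max_free_count_after_flush(int update_first)
{
    struct array_cache ac = { .avail = 4, .entry = { 0, 1, 2, 3 } };
    int max = 0;

    memset(free_count, 0, sizeof(free_count));
    flush(&ac, 2, 0, update_first);

    for (int i = 0; i < NOBJ; i++)
        if (free_count[i] > max)
            max = free_count[i];
    return max;
}
```

In this model the fixed ordering frees every object at most once, while the old ordering frees objects 0 and 1 twice.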
     

26 Sep, 2020

2 commits

  • clang --target= is how we can specify a particular toolchain
    triple to be used; fix the two occurrences in the documentation.

    Fixes: fcf1b6a35c16 ("Documentation/llvm: add documentation on building w/ Clang/LLVM")
    Signed-off-by: Florian Fainelli
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Nathan Chancellor
    Signed-off-by: Masahiro Yamada

    Florian Fainelli
     
  • Pull more kvm fixes from Paolo Bonzini:
    "Five small fixes.

    The nested migration bug will be fixed with a better API in 5.10 or
    5.11, for now this is a fix that works with existing userspace but
    keeps the current ugly API"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: SVM: Add a dedicated INVD intercept routine
    KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
    KVM: x86: fix MSR_IA32_TSC read for nested migration
    selftests: kvm: Fix assert failure in single-step test
    KVM: x86: VMX: Make smaller physical guest address space support user-configurable

    Linus Torvalds