07 Dec, 2020

1 commit

  • On success, mmap should return the start address of the newly mapped
    area, but the patch "mm: mmap: merge vma after call_mmap() if possible"
    sets the return value addr from the vm_start of the newly merged vma.
    Users of mmap therefore get a wrong address if the vma is merged after
    call_mmap(). Fix this by moving the assignment to addr before the vma
    is merged.

    We have a driver that changes vm_flags, and this bug was found by our
    test cases.

    Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
    Signed-off-by: Liu Zixian
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: David Hildenbrand
    Cc: Miaohe Lin
    Cc: Hongxiang Lou
    Cc: Hu Shiyuan
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
    Signed-off-by: Linus Torvalds

    Liu Zixian
     

19 Oct, 2020

2 commits

  • There are two locations that have a block of code for munmapping a vma
    range. Change those two locations to use a function and add meaningful
    comments about what happens to the arguments, which was unclear in the
    previous code.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There are three places that need the next vma, all using the same
    block of code. Replace the block with a function and add comments on
    what happens when NULL is encountered.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     

17 Oct, 2020

2 commits

  • The preceding patches have ensured that core dumping properly takes the
    mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
    its users.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Commit 1da177e4c3f4 ("Linux-2.6.12-rc2") introduced the helper
    put_write_access() to wrap the atomic_dec of the i_writecount field,
    but __vma_link_file() and dup_mmap() still open-code the decrement
    instead of using the helper.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200924115235.5111-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

14 Oct, 2020

9 commits

  • Replace do_brk with do_brk_flags in the comment of insert_vm_struct(),
    since do_brk was removed by the commit referenced below.

    Fixes: bb177a732c4369 ("mm: do not bug_on on incorrect length in __mm_populate()")
    Signed-off-by: Liao Pingfang
    Signed-off-by: Yi Wang
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/1600650778-43230-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Linus Torvalds

    Liao Pingfang
     
  • Commit 1da177e4c3f4 ("Linux-2.6.12-rc2") introduced the helper
    allow_write_access() to wrap the atomic_inc of the i_writecount field,
    but __remove_shared_vm_struct() still open-codes the increment instead
    of using the helper.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200921115814.39680-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Commit 4bb5f5d9395b ("mm: allow drivers to prevent new writable
    mappings") changed i_mmap_writable from unsigned int to atomic_t and
    added the helper mapping_allow_writable() to increment it, but forgot
    to use this helper in dup_mmap() and __vma_link_file().

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christian Brauner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Christian Kellner
    Cc: Suren Baghdasaryan
    Cc: Adrian Reber
    Cc: Shakeel Butt
    Cc: Aleksa Sarai
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200917112736.7789-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • In __vma_adjust(), the check on *root* decides whether to adjust the
    address_space. It is more meaningful to check *file* itself: we adjust
    this data precisely because the vma is file-backed.

    Since the address_space is assumed to be valid for a file-backed vma,
    just replace *root* with *file* here.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200913133631.37781-2-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • *root*, of type struct rb_root_cached, is a member of *mapping*, of
    type struct address_space. This implies that a valid *root* is always
    part of a valid *mapping*.

    So the two checks can be merged, making the code easier to read and
    saving a few CPU cycles.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200913133631.37781-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Instead of converting adjust_next between bytes and page numbers,
    just store the virtual address into adjust_next.

    Also, this patch fixes one typo in the comment of vma_adjust_trans_huge().

    [vbabka@suse.cz: changelog tweak]

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mike Kravetz
    Link: http://lkml.kernel.org/r/20200828081031.11306-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • These two functions share the same logic except ignore a different vma.

    Let's reuse the code.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200809232057.23477-2-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • __vma_unlink_common() and __vma_unlink() are counterparts. Since there
    is no function named __vma_unlink(), rename __vma_unlink_common() to
    __vma_unlink() to make the code more self-explanatory and easier for
    readers to understand.

    Otherwise one might expect several variants of vma_unlink() to exist,
    with __vma_unlink_common() shared by them.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200809232057.23477-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests,
    an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

12 Oct, 2020

1 commit

  • syzbot reported the general protection fault below:

    general protection fault, probably for non-canonical address
    0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN
    KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df]
    CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0
    RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164
    Call Trace:
    free_pgtables+0x1b3/0x2f0 mm/memory.c:415
    exit_mmap+0x2c0/0x530 mm/mmap.c:3184
    __mmput+0x122/0x470 kernel/fork.c:1076
    mmput+0x53/0x60 kernel/fork.c:1097
    exit_mm kernel/exit.c:483 [inline]
    do_exit+0xa8b/0x29f0 kernel/exit.c:793
    do_group_exit+0x125/0x310 kernel/exit.c:903
    get_signal+0x428/0x1f00 kernel/signal.c:2757
    arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811
    exit_to_user_mode_loop kernel/entry/common.c:136 [inline]
    exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167
    syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This happens because the ->mmap() callback can change vma->vm_file and
    fput() the original file. But commit d70cec898324 ("mm: mmap: merge
    vma after call_mmap() if possible") failed to catch this case and
    always fput()s the original file, hence the extra fput().

    [ Thanks Hillf for pointing this extra fput() out. ]

    Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
    Reported-by: syzbot+c5d5a51dcbb558ca0cb5@syzkaller.appspotmail.com
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christian König
    Cc: Hongxiang Lou
    Cc: Chris Wilson
    Cc: Dave Airlie
    Cc: Daniel Vetter
    Cc: Sumit Semwal
    Cc: Matthew Wilcox (Oracle)
    Cc: John Hubbard
    Link: https://lkml.kernel.org/r/20200916090733.31427-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

25 Sep, 2020

1 commit


04 Sep, 2020

1 commit

  • Similarly to arch_validate_prot() called from do_mprotect_pkey(), an
    architecture may need to sanity-check the new vm_flags.

    Define a dummy function always returning true. In addition to
    do_mprotect_pkey(), also invoke it from mmap_region() prior to updating
    vma->vm_page_prot to allow the architecture code to veto potentially
    inconsistent vm_flags.

    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Catalin Marinas
     

08 Aug, 2020

3 commits

  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     
  • The vm_flags may be changed after call_mmap() because drivers may set
    some flags for their own purposes. As a result, we fail to merge the
    adjacent vmas due to the differing vm_flags, since userspace cannot
    pass in the matching flags. Try to merge the vma after call_mmap() to
    fix this issue.

    Signed-off-by: Hongxiang Lou
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1594954065-23733-1-git-send-email-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Look at the pseudo code below. The judgement
    "!is_file_hugepages(file)" at 3) duplicates the one at 1), so an
    "else if" can avoid it. And the assignment "retval = -EINVAL" at 2) is
    only needed by branch 3), because "retval" is overwritten at 4).

    No functional change, but it reduces the code size:

    Before:
       text    data  bss    dec   hex  filename
      28733    1590    1  30324  7674  mm/mmap.o

    After:
       text    data  bss    dec   hex  filename
      28701    1590    1  30292  7654  mm/mmap.o

    ====pseudo code====:
    if (!(flags & MAP_ANONYMOUS)) {
            ...
    1)      if (is_file_hugepages(file))
                    len = ALIGN(len, huge_page_size(hstate_file(file)));
    2)      retval = -EINVAL;
    3)      if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
                    goto out_fput;
    } else if (flags & MAP_HUGETLB) {
            ...
    }
    ...

    4) retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    out_fput:
    ...
    return retval;

    Signed-off-by: Zhen Lei
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com
    Signed-off-by: Linus Torvalds

    Zhen Lei
     

31 Jul, 2020

1 commit


25 Jul, 2020

1 commit

  • VMAs with the VM_GROWSDOWN or VM_GROWSUP flag set can change their
    size under mmap_read_lock(). This can lead to a race with
    __do_munmap():

    Thread A                        Thread B
    __do_munmap()
      detach_vmas_to_be_unmapped()
      mmap_write_downgrade()
                                    expand_downwards()
                                      vma->vm_start = address;
                                      // The VMA now overlaps with
                                      // VMAs detached by Thread A
                                    // page fault populates expanded
                                    // part of the VMA
      unmap_region()
        // Zaps page tables partly
        // populated by Thread B

    Similar race exists for expand_upwards().

    The fix is to avoid downgrading mmap_lock in __do_munmap() if detached
    VMAs are next to VM_GROWSDOWN or VM_GROWSUP VMA.

    [akpm@linux-foundation.org: s/mmap_sem/mmap_lock/ in comment]

    Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
    Reported-by: Jann Horn
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Oleg Nesterov
    Cc: Matthew Wilcox
    Cc: [4.20+]
    Link: http://lkml.kernel.org/r/20200709105309.42495-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

30 Jun, 2020

1 commit

  • A large process running on a heavily loaded system can encounter the
    following RCU CPU stall warning:

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
    (t=21013 jiffies g=1005461 q=132576)
    NMI backtrace for cpu 3
    CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
    Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
    Call Trace:

    dump_stack+0x46/0x60
    nmi_cpu_backtrace.cold.3+0x13/0x50
    ? lapic_can_unplug_cpu.cold.27+0x34/0x34
    nmi_trigger_cpumask_backtrace+0xba/0xca
    rcu_dump_cpu_stacks+0x99/0xc7
    rcu_sched_clock_irq.cold.87+0x1aa/0x397
    ? tick_sched_do_timer+0x60/0x60
    update_process_times+0x28/0x60
    tick_sched_timer+0x37/0x70
    __hrtimer_run_queues+0xfe/0x270
    hrtimer_interrupt+0xf4/0x210
    smp_apic_timer_interrupt+0x5e/0x120
    apic_timer_interrupt+0xf/0x20

    RIP: 0010:kmem_cache_free+0x223/0x300
    Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
    RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
    RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
    RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
    R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
    R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
    ? remove_vma+0x4f/0x60
    remove_vma+0x4f/0x60
    exit_mmap+0xd6/0x160
    mmput+0x4a/0x110
    do_exit+0x278/0xae0
    ? syscall_trace_enter+0x1d3/0x2b0
    ? handle_mm_fault+0xaa/0x1c0
    do_group_exit+0x3a/0xa0
    __x64_sys_exit_group+0x14/0x20
    do_syscall_64+0x42/0x100
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
    for a very long time given a large process. This commit therefore adds
    a cond_resched() to this loop, providing RCU any needed quiescent states.

    Cc: Andrew Morton
    Cc:
    Reviewed-by: Shakeel Butt
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

10 Jun, 2020

4 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Rename the mmap_sem field to mmap_lock. Any new uses of this lock should
    now go through the new mmap locking api. The mmap_lock is still
    implemented as a rwsem, though this could change in the future.

    [akpm@linux-foundation.org: fix it for mm-gup-might_lock_readmmap_sem-in-get_user_pages_fast.patch]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-11-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

05 Jun, 2020

1 commit


11 Apr, 2020

2 commits

  • There are many places where all basic VMA access flags (read, write,
    exec) are initialized or checked against as a group. One such example
    is during a page fault. The existing vma_is_accessible() wrapper
    already creates the notion of VMA accessibility as group access
    permissions.

    Hence let's just create VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC),
    which will not only reduce code duplication but also extend the VMA
    accessibility concept in general.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Mark Salter
    Cc: Nick Hu
    Cc: Ley Foon Tan
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Guan Xuetao
    Cc: Dave Hansen
    Cc: Thomas Gleixner
    Cc: Rob Springer
    Cc: Greg Kroah-Hartman
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • When passing requirements to vm_unmapped_area(),
    arch_get_unmapped_area() and arch_get_unmapped_area_topdown() did not
    set align_offset. Internally, in both unmapped_area() and
    unmapped_area_topdown(), info->align_offset is meaningless when
    info->align_mask is 0.

    But commit df529cabb7a2 ("mm: mmap: add trace point of
    vm_unmapped_area") always prints info->align_offset even though it is
    uninitialized.

    Fix this uninitialized value issue by setting it to 0 explicitly.

    Before:
    vm_unmapped_area: addr=0x755b155000 err=0 total_vm=0x15aaf0 flags=0x1 len=0x109000 lo=0x8000 hi=0x75eed48000 mask=0x0 ofs=0x4022

    After:
    vm_unmapped_area: addr=0x74a4ca1000 err=0 total_vm=0x168ab1 flags=0x1 len=0x9000 lo=0x8000 hi=0x753d94b000 mask=0x0 ofs=0x0

    Signed-off-by: Jaewon Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox (Oracle)
    Cc: Michel Lespinasse
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/20200409094035.19457-1-jaewon31.kim@samsung.com
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     

08 Apr, 2020

2 commits

  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Let's move the vma_is_accessible() helper to include/linux/mm.h, which
    makes it available for general use. While at it, replace all remaining
    open encodings of the VMA access check with vma_is_accessible().

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Geert Uytterhoeven
    Acked-by: Guo Ren
    Acked-by: Vlastimil Babka
    Cc: Guo Ren
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: "Aneesh Kumar K.V"
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnd Bergmann
    Cc: Nick Piggin
    Cc: Paul Mackerras
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Apr, 2020

2 commits

  • Even on a 64-bit kernel, mmap failures can happen for a 32-bit task.
    A virtual address space shortage of a task on mmap is reported to
    userspace as -ENOMEM, which can be confused with a physical memory
    shortage of the overall system.

    vm_unmapped_area() can be called by drivers or by core kernel code
    such as filesystems. On my platform, the GPU driver calls
    vm_unmapped_area() and returns -ENOMEM even for a GPU-side shortage.
    It can be hard to distinguish which code layer returned the -ENOMEM.

    Create mmap trace file and add trace point of vm_unmapped_area.

    i.e.)
    277.156599: vm_unmapped_area: addr=77e0d03000 err=0 total_vm=0x17014b flags=0x1 len=0x400000 lo=0x8000 hi=0x7878c27000 mask=0x0 ofs=0x1
    342.838740: vm_unmapped_area: addr=0 err=-12 total_vm=0xffb08 flags=0x0 len=0x100000 lo=0x40000000 hi=0xfffff000 mask=0x0 ofs=0x22

    [akpm@linux-foundation.org: prefix address printk with 0x, per Matthew]
    Signed-off-by: Jaewon Kim
    Signed-off-by: Andrew Morton
    Cc: Borislav Petkov
    Cc: Matthew Wilcox (Oracle)
    Cc: Michel Lespinasse
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200320055823.27089-3-jaewon31.kim@samsung.com
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     
  • Patch series "mm: mmap: add mmap trace point", v3.

    Create mmap trace file and add trace point of vm_unmapped_area().

    This patch (of 2):

    In preparation for the next patch, remove the inline of
    vm_unmapped_area() and move the code to mmap.c. There is no logical
    change.

    Also remove unmapped_area[_topdown] from mm.h; there is no code
    calling them.

    Signed-off-by: Jaewon Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Matthew Wilcox (Oracle)
    Cc: Michel Lespinasse
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/20200320055823.27089-2-jaewon31.kim@samsung.com
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     

20 Feb, 2020

1 commit

  • Currently the arm64 kernel ignores the top address byte passed to brk(),
    mmap() and mremap(). When the user is not aware of the 56-bit address
    limit or relies on the kernel to return an error, untagging such
    pointers has the potential to create address aliases in user-space.
    Passing a tagged address to munmap() or madvise() is still permitted,
    since the tagged pointer is expected to be inside an existing mapping.

    The current behaviour breaks the existing glibc malloc() implementation,
    which relies on brk() with an address beyond the 56-bit range being
    rejected by the kernel.

    Remove untagging in the above functions by partially reverting commit
    ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
    addition, update the arm64 tagged-address-abi.rst document accordingly.

    Link: https://bugzilla.redhat.com/1797052
    Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk")
    Cc: # 5.4.x-
    Cc: Florian Weimer
    Reviewed-by: Andrew Morton
    Reported-by: Victor Stinner
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas
     

01 Feb, 2020

1 commit

    The jump labels try_prev and none are not really needed in
    find_mergeable_anon_vma(); eliminate them to improve readability.

    Link: http://lkml.kernel.org/r/1574079844-17493-1-git-send-email-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: David Hildenbrand
    Reviewed-by: John Hubbard
    Reviewed-by: Wei Yang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

28 Jan, 2020

1 commit

  • Pull timer updates from Thomas Gleixner:
    "The timekeeping and timers department provides:

    - Time namespace support:

    If a container migrates from one host to another then it expects
    that clocks based on MONOTONIC and BOOTTIME are not subject to
    disruption. Due to different boot time and non-suspended runtime
    these clocks can differ significantly on two hosts, in the worst
    case time goes backwards which is a violation of the POSIX
    requirements.

    The time namespace addresses this problem. It allows setting offsets
    for clock MONOTONIC and BOOTTIME once, after creation and before
    tasks are associated with the namespace. These offsets are taken
    into account by timers and timekeeping, including the VDSO.

    Offsets for wall clock based clocks (REALTIME/TAI) are not provided
    by this mechanism. While in theory possible, the overhead and code
    complexity would be immense and not justified by the esoteric
    potential use cases which were discussed at Plumbers '18.

    The overhead for tasks in the root namespace (i.e. where host time
    offsets = 0) is in the noise, and great effort was made to ensure
    that, especially in the VDSO. If the time namespace is disabled in
    the kernel configuration, the code is compiled out.

    Kudos to Andrei Vagin and Dmitry Safonov, who implemented this
    feature and kept at it for more than a year, addressing review
    comments and finding better solutions. A pleasant experience.

    - Overhaul of the alarmtimer device dependency handling to ensure
    that the init/suspend/resume ordering is correct.

    - A new clocksource/event driver for Microchip PIT64

    - Suspend/resume support for the Hyper-V clocksource

    - The usual pile of fixes, updates and improvements mostly in the
    driver code"

    * tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
    alarmtimer: Use wakeup source from alarmtimer platform device
    alarmtimer: Make alarmtimer platform device child of RTC device
    alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
    hrtimer: Add missing sparse annotation for __run_timer()
    lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
    MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
    clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
    clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
    clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
    clocksource/drivers/exynos_mct: Rename Exynos to lowercase
    clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
    clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
    clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
    clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
    clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
    clocksource/drivers/bcm2835_timer: Fix memory leak of timer
    clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
    clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
    clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
    ...

    Linus Torvalds
     

14 Jan, 2020

1 commit

  • If a task belongs to a time namespace then the VVAR page which contains
    the system wide VDSO data is replaced with a namespace specific page
    which has the same layout as the VVAR page.

    Co-developed-by: Andrei Vagin
    Signed-off-by: Andrei Vagin
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20191112012724.250792-25-dima@arista.com

    Dmitry Safonov
     

07 Jan, 2020

1 commit

    The ARMv8 64-bit architecture supports execute-only user permissions by
    clearing the PTE_USER and PTE_UXN bits, effectively making it a mostly
    privileged mapping from which a user running at EL0 can still execute.

    The downside, however, is that the kernel at EL1 inadvertently reading
    such a mapping would not trip over the PAN (privileged access never)
    protection.

    Revert the relevant bits from commit cab15ce604e5 ("arm64: Introduce
    execute-only page access permissions") so that PROT_EXEC implies
    PROT_READ (and therefore PTE_USER) until the architecture gains proper
    support for execute-only user mappings.

    Fixes: cab15ce604e5 ("arm64: Introduce execute-only page access permissions")
    Cc: # 4.9.x-
    Acked-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Linus Torvalds

    Catalin Marinas