16 Nov, 2020

9 commits

  • …scm/linux/kernel/git/gregkh/char-misc") into android-mainline

    Steps on the way to 5.10-rc4

    Resolves conflict in:
    arch/arm64/kvm/sys_regs.c

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Id188ccbec038cf7e30c204e9f5b7866f72b6640d

    Greg Kroah-Hartman
     
  • Pull char/misc driver fixes from Greg KH:
    "Here are some small char/misc/whatever driver fixes for 5.10-rc4.

    Nothing huge, lots of small fixes for reported issues:

    - habanalabs driver fixes

    - speakup driver fixes

    - uio driver fixes

    - virtio driver fix

    - other tiny driver fixes

    Full details are in the shortlog.

    All of these have been in linux-next for a full week with no reported
    issues"

    * tag 'char-misc-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
    uio: Fix use-after-free in uio_unregister_device()
    firmware: xilinx: fix out-of-bounds access
    nitro_enclaves: Fixup type and simplify logic of the poll mask setup
    speakup ttyio: Do not schedule() in ttyio_in_nowait
    speakup: Fix clearing selection in safe context
    speakup: Fix var_id_t values and thus keymap
    virtio: virtio_console: fix DMA memory allocation for rproc serial
    habanalabs/gaudi: mask WDT error in QMAN
    habanalabs/gaudi: move coresight mmu config
    habanalabs: fix kernel pointer type
    mei: protect mei_cl_mtu from null dereference

    Linus Torvalds
     
  • Pull USB and Thunderbolt fixes from Greg KH:
    "Here are some small Thunderbolt and USB driver fixes for 5.10-rc4 to
    solve some reported issues.

    Nothing huge in here, just small things:

    - thunderbolt memory leaks fixed and new device ids added

    - revert of problem patch for the musb driver

    - new quirks added for USB devices

    - typec power supply fixes to resolve many reported problems with
    charging notifications no longer working

    All except the cdc-acm driver quirk addition have been in linux-next
    with no reported issues (the quirk patch was applied on Friday, and is
    self-contained)"

    * tag 'usb-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
    usb: cdc-acm: Add DISABLE_ECHO for Renesas USB Download mode
    MAINTAINERS: add usb raw gadget entry
    usb: typec: ucsi: Report power supply changes
    xhci: hisilicon: fix refercence leak in xhci_histb_probe
    Revert "usb: musb: convert to devm_platform_ioremap_resource_byname"
    thunderbolt: Add support for Intel Tiger Lake-H
    thunderbolt: Only configure USB4 wake for lane 0 adapters
    thunderbolt: Add uaccess dependency to debugfs interface
    thunderbolt: Fix memory leak if ida_simple_get() fails in enumerate_services()
    thunderbolt: Add the missed ida_simple_remove() in ring_request_msix()

    Linus Torvalds
     
  • Pull kvm fixes from Paolo Bonzini:
    "Fixes for ARM and x86, the latter especially for old processors
    without two-dimensional paging (EPT/NPT)"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use
    KVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guests
    KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch
    KVM: x86: clflushopt should be treated as a no-op by emulation
    KVM: arm64: Handle SCXTNUM_ELx traps
    KVM: arm64: Unify trap handlers injecting an UNDEF
    KVM: arm64: Allow setting of ID_AA64PFR0_EL1.CSV2 from userspace

    Linus Torvalds
     
  • Pull x86 fixes from Thomas Gleixner:
    "A small set of fixes for x86:

    - Cure the fallout from the MSI irqdomain overhaul which missed that
    the Intel IOMMU does not register virtual function devices and
    therefore never reaches the point where the MSI interrupt domain is
    assigned. This made the VF devices use the non-remapped MSI domain
    which is trapped by the IOMMU/remap unit

    - Remove an extra space in the SGI_UV architecture type procfs output
    for UV5

    - Remove an unused function which was missed when removing the UV BAU
    TLB shootdown handler"

    * tag 'x86-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    iommu/vt-d: Cure VF irqdomain hickup
    x86/platform/uv: Fix copied UV5 output archtype
    x86/platform/uv: Drop last traces of uv_flush_tlb_others

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A set of fixes for perf:

    - A set of commits which reduce the stack usage of various perf
    event handling functions which allocated large data structs on
    stack causing stack overflows in the worst case

    - Use the proper mechanism for detecting soft interrupts in the
    recursion protection

    - Make the recursion protection simpler and more robust

    - Simplify the scheduling of event groups to make the code more
    robust and prepare for fixing the issues vs. scheduling of
    exclusive event groups

    - Prevent event multiplexing and rotation for exclusive event groups

    - Correct the perf event attribute exclusive semantics to take
    pinned events, e.g. the PMU watchdog, into account

    - Make the anythread filtering conditional for Intel's generic PMU
    counters as it is no longer guaranteed to be supported on newer
    CPUs. Check the corresponding CPUID leaf to make sure

    - Fixup a duplicate initialization in an array which was probably
    caused by the usual 'copy & paste - forgot to edit' mishap"

    * tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/uncore: Fix Add BW copypasta
    perf/x86/intel: Make anythread filter support conditional
    perf: Tweak perf_event_attr::exclusive semantics
    perf: Fix event multiplexing for exclusive groups
    perf: Simplify group_sched_in()
    perf: Simplify group_sched_out()
    perf/x86: Make dummy_iregs static
    perf/arch: Remove perf_sample_data::regs_user_copy
    perf: Optimize get_recursion_context()
    perf: Fix get_recursion_context()
    perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
    perf: Reduce stack usage of perf_output_begin()

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "A set of scheduler fixes:

    - Address a load balancer regression by making the load balancer use
    the same logic as the wakeup path to spread tasks in the LLC domain

    - Prefer the CPU on which a task ran last over the local CPU in the
    fast wakeup path for asymmetric CPU capacity systems to align with
    the symmetric case. This ensures more locality and prevents massive
    migration overhead on those asymmetric systems

    - Fix a memory corruption bug in the scheduler debug code caused by
    handing a modified buffer pointer to kfree()"

    * tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/debug: Fix memory corruption caused by multiple small reads of flags
    sched/fair: Prefer prev cpu in asymmetric wakeup path
    sched/fair: Ensure tasks spreading in LLC during LB

    Linus Torvalds
     
  • Pull locking fixes from Thomas Gleixner:
    "Two fixes for the locking subsystem:

    - Prevent an unconditional interrupt enable in a futex helper
    function which can be called from contexts which expect interrupts
    to stay disabled across the call

    - Don't modify lockdep chain keys in the validation process as that
    causes chain inconsistency"

    * tag 'locking-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    lockdep: Avoid to modify chain keys in validate_chain()
    futex: Don't enable IRQs unconditionally in put_pi_state()

    Linus Torvalds
     
  • Pull percpu fix and cleanup from Dennis Zhou:
    "A fix for a Wshadow warning in the asm-generic percpu macros came in
    and then I tacked on the removal of flexible array initializers in the
    percpu allocator"

    * 'for-5.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu: convert flexible array initializers to use struct_size()
    asm-generic: percpu: avoid Wshadow warning

    Linus Torvalds
     

15 Nov, 2020

23 commits

  • In some cases where shadow paging is in use, the root page will
    be either mmu->pae_root or vcpu->arch.mmu->lm_root. Then it will
    not have an associated struct kvm_mmu_page, because it is allocated
    with alloc_page instead of kvm_mmu_alloc_page.

    Just return false quickly from is_tdp_mmu_root if the TDP MMU is
    not in use, which also includes the case where shadow paging is
    enabled.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Steps on the way to 5.10-rc4

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: Id86ac2ef339902d7ea3689767ef52744c6fa0d9e

    Greg Kroah-Hartman
     
  • Merge fixes from Andrew Morton:
    "14 patches.

    Subsystems affected by this patch series: mm (migration, vmscan, slub,
    gup, memcg, hugetlbfs), mailmap, kbuild, reboot, watchdog, panic, and
    ocfs2"

    * emailed patches from Andrew Morton:
    ocfs2: initialize ip_next_orphan
    panic: don't dump stack twice on warn
    hugetlbfs: fix anon huge page migration race
    mm: memcontrol: fix missing wakeup polling thread
    kernel/watchdog: fix watchdog_allowed_mask not used warning
    reboot: fix overflow parsing reboot cpu number
    Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
    compiler.h: fix barrier_data() on clang
    mm/gup: use unpin_user_pages() in __gup_longterm_locked()
    mm/slub: fix panic in slab_alloc_node()
    mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov
    mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
    mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
    mm/compaction: count pages and stop correctly during page isolation

    Linus Torvalds
     
  • Pull clk fixes from Stephen Boyd:
    "Two small clk driver fixes:

    - Make to_clk_regmap() inline to avoid compiler annoyance

    - Fix critical clks on i.MX imx8m SoCs"

    * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
    clk: imx8m: fix bus critical clk registration
    clk: define to_clk_regmap() as inline function

    Linus Torvalds
     
  • …/groeck/linux-staging

    Pull hwmon fixes from Guenter Roeck:

    - Fix potential buffer overflow in pmbus/max20730 driver

    - Fix locking issue in pmbus core

    - Fix regression causing timeouts in applesmc driver

    - Fix RPM calculation in pwm-fan driver

    - Restrict counter visibility in amd_energy driver

    * tag 'hwmon-for-v5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
    hwmon: (amd_energy) modify the visibility of the counters
    hwmon: (applesmc) Re-work SMC comms
    hwmon: (pwm-fan) Fix RPM calculation
    hwmon: (pmbus) Add mutex locking for sysfs reads
    hwmon: (pmbus/max20730) use scnprintf() instead of snprintf()

    Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "Three small fixes, all in the embedded ufs driver subsystem"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: ufshcd: Fix missing destroy_workqueue()
    scsi: ufs: Try to save power mode change and UIC cmd completion timeout
    scsi: ufs: Fix unbalanced scsi_block_reqs_cnt caused by ufshcd_hold()

    Linus Torvalds
     
  • Pull selinux fix from Paul Moore:
    "One small SELinux patch to make sure we return an error code when an
    allocation fails. It passes all of our tests, but given the nature of
    the patch that isn't surprising"

    * tag 'selinux-pr-20201113' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    selinux: Fix error return code in sel_ib_pkey_sid_slow()

    Linus Torvalds
     
  • Pull uml fix from Richard Weinberger:
    "Call PMD destructor in __pmd_free_tlb()"

    * tag 'for-linus-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
    um: Call pgtable_pmd_page_dtor() in __pmd_free_tlb()

    Linus Torvalds
     
  • When afs_write_end() is called with copied == 0, it tries to set the
    dirty region, but there's no way to actually encode a 0-length region in
    the encoding in page->private.

    "0,0", for example, indicates a 1-byte region at offset 0. The maths
    miscalculates this and sets it incorrectly.

    Fix it to just do nothing but unlock and put the page in this case. We
    don't actually need to mark the page dirty as nothing presumably
    changed.
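
    The limitation can be illustrated with a small sketch. The packing
    below is hypothetical (it is not the actual afs layout of
    page->private, and all demo_* names are invented for illustration); it
    only assumes, as the text says, that a stored pair names the first and
    last dirty byte, so "0,0" decodes to a length of 1 and a 0-length
    region cannot be represented. The guard then simply skips the encoding
    when nothing was copied.

```c
#include <stdint.h>

/* Hypothetical packing: low 16 bits = first dirty byte, high 16 bits = last dirty byte. */
uint32_t demo_encode(uint16_t first, uint16_t last)
{
    return ((uint32_t)last << 16) | first;
}

/* Length of the encoded region; the smallest possible result is 1. */
uint32_t demo_len(uint32_t priv)
{
    uint32_t first = priv & 0xffff;
    uint32_t last = priv >> 16;

    return last - first + 1;
}

/*
 * Guarded write-end step: with nothing copied there is no region to
 * record, so leave priv untouched. Returns 1 if priv was updated, 0 if not.
 */
int demo_write_end(uint32_t *priv, uint16_t from, uint16_t copied)
{
    if (copied == 0)
        return 0;
    *priv = demo_encode(from, (uint16_t)(from + copied - 1));
    return 1;
}
```

    With this packing, demo_len(demo_encode(0, 0)) is 1, which is why the
    copied == 0 case has to be handled before any encoding is attempted.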

    Fixes: 65dd2d6072d3 ("afs: Alter dirty range encoding in page->private")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Though the problem was found on an older 4.1.12 kernel, I think upstream
    has the same issue.

    In one node in the cluster, there is the following callback trace:

    # cat /proc/21473/stack
    __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
    ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
    ocfs2_evict_inode+0x152/0x820 [ocfs2]
    evict+0xae/0x1a0
    iput+0x1c6/0x230
    ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
    ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
    ocfs2_dir_foreach+0x29/0x30 [ocfs2]
    ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
    ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
    process_one_work+0x169/0x4a0
    worker_thread+0x5b/0x560
    kthread+0xcb/0xf0
    ret_from_fork+0x61/0x90

    The above stack is not reasonable: the final iput shouldn't happen in
    the ocfs2_orphan_filldir() function. Looking at the code,

    2067 /* Skip inodes which are already added to recover list, since dio may
    2068 * happen concurrently with unlink/rename */
    2069 if (OCFS2_I(iter)->ip_next_orphan) {
    2070 iput(iter);
    2071 return 0;
    2072 }
    2073

    The logic assumes the inode is already in the recover list when it sees
    that ip_next_orphan is non-NULL, so it skips this inode after dropping
    the reference that was taken in ocfs2_iget().

    However, if the inode were already in the recover list, it would hold
    another reference and the iput() at line 2070 would not be the final iput
    (dropping the last reference). So I don't think the inode is really in
    the recover list (no vmcore to confirm).

    Note that ocfs2_queue_orphans(), though not shown in the call trace,
    holds the cluster lock on the orphan directory while looking up
    unlinked inodes. Evicting the on-disk inode can involve a lot of
    IO, which may take a long time to finish. That means this node could hold
    the cluster lock for a very long time, which can cause lock requests
    (from other nodes) on the orphan directory to hang for a long time.

    Looking further at ip_next_orphan, I found that it's not initialized when
    allocating a new ocfs2_inode_info structure.

    This causes the reflink operations from some nodes to hang for a very long
    time waiting for the cluster lock on the orphan directory.

    Fix: initialize ip_next_orphan as NULL.

    Signed-off-by: Wengang Wang
    Signed-off-by: Andrew Morton
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Cc:
    Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
    Signed-off-by: Linus Torvalds

    Wengang Wang
     
  • Before commit 3f388f28639f ("panic: dump registers on panic_on_warn"),
    __warn() was calling show_regs() when regs was not NULL, and show_stack()
    otherwise.

    After that commit, show_stack() is called regardless of whether
    show_regs() has been called or not, leading to duplicated Call Trace:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 1 at arch/powerpc/mm/nohash/8xx.c:186 mmu_mark_initmem_nx+0x24/0x94
    CPU: 0 PID: 1 Comm: swapper Not tainted 5.10.0-rc2-s3k-dev-01375-gf46ec0d3ecbd-dirty #4092
    NIP: c00128b4 LR: c0010228 CTR: 00000000
    REGS: c9023e40 TRAP: 0700 Not tainted (5.10.0-rc2-s3k-dev-01375-gf46ec0d3ecbd-dirty)
    MSR: 00029032 CR: 24000424 XER: 00000000

    GPR00: c0010228 c9023ef8 c2100000 0074c000 ffffffff 00000000 c2151000 c07b3880
    GPR08: ff000900 0074c000 c8000000 c33b53a8 24000822 00000000 c0003a20 00000000
    GPR16: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    GPR24: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00800000
    NIP [c00128b4] mmu_mark_initmem_nx+0x24/0x94
    LR [c0010228] free_initmem+0x20/0x58
    Call Trace:
    free_initmem+0x20/0x58
    kernel_init+0x1c/0x114
    ret_from_kernel_thread+0x14/0x1c
    Instruction dump:
    7d291850 7d234b78 4e800020 9421ffe0 7c0802a6 bfc10018 3fe0c060 3bff0000
    3fff4080 3bffffff 90010024 57ff0010 392001cd 7c3e0b78 953e0008
    CPU: 0 PID: 1 Comm: swapper Not tainted 5.10.0-rc2-s3k-dev-01375-gf46ec0d3ecbd-dirty #4092
    Call Trace:
    __warn+0x8c/0xd8 (unreliable)
    report_bug+0x11c/0x154
    program_check_exception+0x1dc/0x6e0
    ret_from_except_full+0x0/0x4
    --- interrupt: 700 at mmu_mark_initmem_nx+0x24/0x94
    LR = free_initmem+0x20/0x58
    free_initmem+0x20/0x58
    kernel_init+0x1c/0x114
    ret_from_kernel_thread+0x14/0x1c
    ---[ end trace 31702cd2a9570752 ]---

    Only call show_stack() when regs is NULL.

    Fixes: 3f388f28639f ("panic: dump registers on panic_on_warn")
    Signed-off-by: Christophe Leroy
    Signed-off-by: Andrew Morton
    Cc: Alexey Kardashevskiy
    Cc: Kefeng Wang
    Link: https://lkml.kernel.org/r/e8c055458b080707f1bc1a98ff8bea79d0cec445.1604748361.git.christophe.leroy@csgroup.eu
    Signed-off-by: Linus Torvalds

    Christophe Leroy
     
  • Qian Cai reported the following BUG in [1]

    LTP: starting move_pages12
    BUG: unable to handle page fault for address: ffffffffffffffe0
    ...
    RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
    Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Hugh Dickins diagnosed this as a migration bug caused by code introduced
    to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
    routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
    flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
    anon pages as the anon_vma_lock should be held in this case. Further
    analysis suggested that i_mmap_rwsem was not required to be held at all
    when calling try_to_unmap for anon pages as an anon page could never be
    part of a shared pmd mapping.

    Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
    to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
    keep mapping valid while dropping page lock.

    This patch does the following:

    - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
    calling try_to_unmap.

    - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
    will now simply do a 'trylock' while still holding the page lock. If
    the trylock fails, it will return NULL. This could impact the
    callers:

    - migration calling code will receive -EAGAIN and retry up to the
    hard coded limit (10).

    - memory error code will treat the page as BUSY. This will force
    killing (SIGKILL) any mapping tasks instead of sending SIGBUS.

    Do note that this change in behavior only happens when there is a
    race. None of the standard kernel testing suites actually hit this
    race, but it is possible.

    [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
    [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Reported-by: Qian Cai
    Suggested-by: Hugh Dickins
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc:
    Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When polling swap.events, a waiter can miss the wakeup when a swap
    event occurs, because no notification was sent.

    Fixes: f3a53a3a1e5b ("mm, memcontrol: implement memory.swap.events")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Michal Hocko
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Tejun Heo
    Link: https://lkml.kernel.org/r/20201105161936.98312-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • Define watchdog_allowed_mask only when SOFTLOCKUP_DETECTOR is enabled.

    Fixes: 7feeb9cd4f5b ("watchdog/sysctl: Clean up sysctl variable name space")
    Signed-off-by: Santosh Sivaraj
    Signed-off-by: Andrew Morton
    Reviewed-by: Petr Mladek
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20201106015025.1281561-1-santosh@fossix.org
    Signed-off-by: Linus Torvalds

    Santosh Sivaraj
     
  • Limit the CPU number to num_possible_cpus(), because setting it to a
    value lower than INT_MAX but higher than NR_CPUS produces the following
    error on reboot and shutdown:

    BUG: unable to handle page fault for address: ffffffff90ab1bb0
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1c09067 P4D 1c09067 PUD 1c0a063 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 1 Comm: systemd-shutdow Not tainted 5.9.0-rc8-kvm #110
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    RIP: 0010:migrate_to_reboot_cpu+0xe/0x60
    Code: ea ea 00 48 89 fa 48 c7 c7 30 57 f1 81 e9 fa ef ff ff 66 2e 0f 1f 84 00 00 00 00 00 53 8b 1d d5 ea ea 00 e8 14 33 fe ff 89 da 0f a3 15 ea fc bd 00 48 89 d0 73 29 89 c2 c1 e8 06 65 48 8b 3c
    RSP: 0018:ffffc90000013e08 EFLAGS: 00010246
    RAX: ffff88801f0a0000 RBX: 0000000077359400 RCX: 0000000000000000
    RDX: 0000000077359400 RSI: 0000000000000002 RDI: ffffffff81c199e0
    RBP: ffffffff81c1e3c0 R08: ffff88801f41f000 R09: ffffffff81c1e348
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 00007f32bedf8830 R14: 00000000fee1dead R15: 0000000000000000
    FS: 00007f32bedf8980(0000) GS:ffff88801f480000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffff90ab1bb0 CR3: 000000001d057000 CR4: 00000000000006a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __do_sys_reboot.cold+0x34/0x5b
    do_syscall_64+0x2d/0x40
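
    A minimal userspace sketch of the bound check described above. The
    DEMO_POSSIBLE_CPUS constant is a hypothetical stand-in for
    num_possible_cpus(), and set_reboot_cpu() is an invented helper, not the
    kernel function. The RBX/RDX value in the trace, 0x77359400, is
    2000000000 in decimal, which is below INT_MAX but far above any valid
    CPU index.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's num_possible_cpus(). */
#define DEMO_POSSIBLE_CPUS 8

/*
 * Accept a parsed CPU number only if it indexes a CPU that can exist;
 * otherwise keep the previous choice and report the bad value.
 */
int set_reboot_cpu(int current_cpu, int requested)
{
    if (requested < 0 || requested >= DEMO_POSSIBLE_CPUS) {
        fprintf(stderr, "reboot CPU %d out of range, ignored\n", requested);
        return current_cpu;
    }
    return requested;
}
```

    Rejecting the value at parse time means the later per-CPU mask lookup
    never sees an index past the end of the mask.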

    Fixes: 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic kernel")
    Signed-off-by: Matteo Croce
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Fabian Frederick
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: Kees Cook
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Petr Mladek
    Cc: Robin Holt
    Cc:
    Link: https://lkml.kernel.org/r/20201103214025.116799-3-mcroce@linux.microsoft.com
    Signed-off-by: Linus Torvalds

    Matteo Croce
     
  • Patch series "fix parsing of reboot= cmdline", v3.

    The parsing of the reboot= cmdline has two major errors:

    - a missing bound check can crash the system on reboot

    - parsing of the cpu number only works if specified last

    Fix both.

    This patch (of 2):

    This reverts commit 616feab753972b97.

    kstrtoint() and simple_strtoul() have a subtle difference which makes
    them non-interchangeable: if a non-digit character is found during
    parsing, the former will return an error, while the latter will just
    stop parsing, e.g. simple_strtoul("123xyx") = 123.

    The kernel cmdline reboot= argument lets you specify the CPU used for
    rebooting, with the syntax `s####` among the other flags, e.g.
    "reboot=warm,s31,force", so if this flag is not the last given, it's
    silently ignored as well as the subsequent ones.
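
    The difference can be reproduced in userspace with the C library. This
    is only an analogy: parse_lenient() and parse_strict() are hypothetical
    helpers built on strtoul()/strtol(), written in the spirit of
    simple_strtoul() and kstrtoint() respectively, not the kernel
    implementations.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Lenient parse, in the spirit of simple_strtoul(): consume leading
 * digits and stop at the first non-digit without reporting an error.
 */
unsigned long parse_lenient(const char *s)
{
    return strtoul(s, NULL, 10);
}

/*
 * Strict parse, in the spirit of kstrtoint(): any trailing character
 * makes the whole conversion fail. Returns 0 on success, -EINVAL otherwise.
 */
int parse_strict(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || errno == ERANGE ||
        v < INT_MIN || v > INT_MAX)
        return -EINVAL;
    *out = (int)v;
    return 0;
}
```

    Given the tail of "reboot=warm,s31,force" after the 's', the lenient
    parser yields 31 from "31,force", while the strict one rejects the same
    string because of the trailing ",force".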

    Fixes: 616feab75397 ("kernel/reboot.c: convert simple_strtoul to kstrtoint")
    Signed-off-by: Matteo Croce
    Signed-off-by: Andrew Morton
    Cc: Guenter Roeck
    Cc: Petr Mladek
    Cc: Arnd Bergmann
    Cc: Mike Rapoport
    Cc: Kees Cook
    Cc: Pavel Tatashin
    Cc: Robin Holt
    Cc: Fabian Frederick
    Cc: Greg Kroah-Hartman
    Cc:
    Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com
    Signed-off-by: Linus Torvalds

    Matteo Croce
     
  • Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h
    mutually exclusive") neglected to copy barrier_data() from
    compiler-gcc.h into compiler-clang.h.

    The definition in compiler-gcc.h was really to work around clang's more
    aggressive optimization, so this broke barrier_data() on clang, and
    consequently memzero_explicit() as well.

    For example, this results in at least the memzero_explicit() call in
    lib/crypto/sha256.c:sha256_transform() being optimized away by clang.

    Fix this by moving the definition of barrier_data() into compiler.h.

    Also move the gcc/clang definition of barrier() into compiler.h;
    __memory_barrier() is icc-specific (and barrier() is already defined
    using it in compiler-intel.h) and doesn't belong in compiler.h.
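
    As a rough userspace illustration of why the barrier matters, the
    sketch below pairs memset() with an empty asm statement that takes the
    buffer pointer as an input and clobbers memory. The demo_* names are
    invented here, and the code assumes a GCC- or Clang-compatible compiler
    with extended asm support.

```c
#include <stddef.h>
#include <string.h>

/*
 * Compiler barrier that names the buffer as an input, so GCC and Clang
 * must assume the asm reads the bytes and cannot drop the preceding stores.
 */
#define demo_barrier_data(ptr) \
    __asm__ __volatile__("" : : "r"(ptr) : "memory")

/* Zero a buffer in a way the optimizer is not allowed to elide. */
void demo_memzero_explicit(void *s, size_t count)
{
    memset(s, 0, count);
    demo_barrier_data(s);
}
```

    Without the barrier, a compiler that sees the buffer is never read
    again may treat the memset() as a dead store and remove it, which is
    the failure mode described for sha256_transform().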

    [rdunlap@infradead.org: fix ALPHA builds when SMP is not enabled]

    Link: https://lkml.kernel.org/r/20201101231835.4589-1-rdunlap@infradead.org
    Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
    Signed-off-by: Arvind Sankar
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Tested-by: Nick Desaulniers
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Kees Cook
    Cc:
    Link: https://lkml.kernel.org/r/20201014212631.207844-1-nivedita@alum.mit.edu
    Signed-off-by: Linus Torvalds

    Arvind Sankar
     
  • When FOLL_PIN is passed to __get_user_pages() the page list must be put
    back using unpin_user_pages() otherwise the page pin reference persists
    in a corrupted state.

    There are two places in the unwind of __gup_longterm_locked() that put
    the pages back without checking. Normally on error this function would
    return the partial page list making this the caller's responsibility,
    but in these two cases the caller is not allowed to see these pages at
    all.
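
    A toy model of why the release call has to match the acquire call. It
    assumes, for illustration only, that a pin adds a fixed bias to the
    page reference count while an ordinary put drops a single reference;
    the bias value and every demo_* name are assumptions of this sketch,
    not kernel definitions.

```c
#define DEMO_PIN_BIAS 1024  /* assumed per-pin bias, for illustration only */

/* Hypothetical, heavily reduced stand-in for a page with a refcount. */
struct demo_gup_page {
    int refcount;
};

/* Take a pin: adds the whole bias. */
void demo_pin(struct demo_gup_page *p)
{
    p->refcount += DEMO_PIN_BIAS;
}

/* Matching release for demo_pin(): removes the whole bias. */
void demo_unpin(struct demo_gup_page *p)
{
    p->refcount -= DEMO_PIN_BIAS;
}

/* Ordinary reference drop: removes a single reference. */
void demo_put(struct demo_gup_page *p)
{
    p->refcount -= 1;
}

/* A page looks pinned while at least one bias is still accounted. */
int demo_is_pinned(const struct demo_gup_page *p)
{
    return p->refcount >= DEMO_PIN_BIAS;
}
```

    In this model, pin followed by unpin restores the original count, but
    pin followed by an ordinary put leaves most of the bias behind, so the
    page still looks pinned although nobody holds the pin.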

    Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
    Reported-by: Ira Weiny
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Aneesh Kumar K.V
    Cc: Dan Williams
    Cc:
    Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
     
  • While doing a memory hot-unplug operation on a PowerPC VM running 1024 CPUs
    with 11TB of RAM, I hit the following panic:

    BUG: Kernel NULL pointer dereference on read at 0x00000007
    Faulting instruction address: 0xc000000000456048
    Oops: Kernel access of bad area, sig: 11 [#2]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp
    CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1
    NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
    REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0)
    MSR: 8000000000009033 CR: 24004228 XER: 00000000
    CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
    GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
    GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
    GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
    GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
    GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
    GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
    GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
    GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
    NIP [c000000000456048] __kmalloc_node+0x108/0x790
    LR [c000000000455fd4] __kmalloc_node+0x94/0x790
    Call Trace:
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x10c/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2c4/0x470
    cgroup_mkdir+0x408/0x5f0
    kernfs_iop_mkdir+0x90/0x100
    vfs_mkdir+0x138/0x250
    do_mkdirat+0x154/0x1c0
    system_call_exception+0xf8/0x200
    system_call_common+0xf0/0x27c
    Instruction dump:
    e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
    2fbc0000 419e0018 41920230 e9270010 7f994800 419e0220 7ee6bb78

    This points to the following code:

    mm/slub.c:2851
    if (unlikely(!object || !node_match(page, node))) {
    c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0
    c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054
    node_match():
    mm/slub.c:2491
    if (node != NUMA_NO_NODE && page_to_nid(page) != node)
    c000000000456040: 30 02 92 41 beq cr4,c000000000456270
    page_to_nid():
    include/linux/mm.h:1294
    c000000000456044: 10 00 27 e9 ld r9,16(r7)
    c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL
    node_match():
    mm/slub.c:2491
    c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9
    c000000000456050: 20 02 9e 41 beq cr7,c000000000456270

    The panic occurred in slab_alloc_node() when checking for the page's node:

    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node))) {
    object = __slab_alloc(s, gfpflags, node, addr, c);
    stat(s, ALLOC_SLOWPATH);

    The issue is that object is not NULL while page is NULL which is odd but
    may happen if the cache flush happened after loading object but before
    loading page. Thus checking for the page pointer is required too.

    The cache flush is done through an inter processor interrupt when a
    piece of memory is off-lined. That interrupt is triggered when a memory
    hot-unplug operation is initiated and offline_pages() is calling the
    slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback()
    which is calling flush_cpu_slab(). If that interrupt is caught between
    the reading of c->freelist and the reading of c->page, this could lead
    to such a situation. That situation is expected and the later call to
    this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
    the whole operation.

    In commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in
    node_match()") the check on the page pointer was removed on the assumption
    that page is always valid when it is called. It happens that this is not
    true in that particular case, so check for page before calling
    node_match() here.
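
    The resulting fast-path condition can be sketched with reduced types.
    struct demo_page and the demo_* functions are hypothetical stand-ins
    for the slub structures and helpers; only the shape of the check is
    taken from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NUMA_NO_NODE (-1)

/* Hypothetical, heavily reduced stand-in for struct page. */
struct demo_page {
    int nid;
};

/* Like node_match() after the earlier commit: page is dereferenced unchecked. */
bool demo_node_match(const struct demo_page *page, int node)
{
    return node == DEMO_NUMA_NO_NODE || page->nid == node;
}

/*
 * Fast-path test with the extra page check: a NULL page (per-CPU slab
 * flushed between the two loads) sends the allocation to the slow path
 * instead of dereferencing NULL in demo_node_match().
 */
bool demo_needs_slowpath(const void *object, const struct demo_page *page,
                         int node)
{
    return !object || !page || !demo_node_match(page, node);
}
```

    Short-circuit evaluation is what makes this safe: demo_node_match() is
    only reached once both object and page are known to be non-NULL.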

    Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Cc: Wei Yang
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Change back surname to new (old) one. Dmitry Baryshkov -> Dmitry
    Eremin-Solenikov -> Dmitry Baryshkov. Map several odd entries to main
    identity.

    Signed-off-by: Dmitry Baryshkov
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201103005158.1181426-1-dmitry.baryshkov@linaro.org
    Signed-off-by: Linus Torvalds

    Dmitry Baryshkov
     
  • Previously, the negated unsigned long would be cast back to signed long,
    which would have the correct negative value. After commit 730ec8c01a2b
    ("mm/vmscan.c: change prototype for shrink_page_list"), the large
    unsigned int converts to a large positive signed long.

    Symptoms include CMA allocations hanging forever holding the cma_mutex
    due to alloc_contig_range->...->isolate_migratepages_block waiting
    forever in "while (unlikely(too_many_isolated(pgdat)))".

    [akpm@linux-foundation.org: fix -stat.nr_lazyfree_fail as well, per Michal]

    Fixes: 730ec8c01a2b ("mm/vmscan.c: change prototype for shrink_page_list")
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vaneet Narang
    Cc: Maninder Singh
    Cc: Amit Sahrawat
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc:
    Link: https://lkml.kernel.org/r/20201029032320.1448441-1-npiggin@gmail.com
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
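
The conversion problem described above can be reproduced in a few lines of standard C. This is a sketch assuming a 64-bit `long` (LP64); `delta_buggy()` and `delta_fixed()` are hypothetical helper names, and `nr_reclaimed` stands in for whatever unsigned int count is being negated.

```c
/*
 * Negation is performed in unsigned int, giving 2^32 - n.
 * Widening that value to a 64-bit long keeps it positive.
 */
static long delta_buggy(unsigned int nr_reclaimed)
{
    return -nr_reclaimed;
}

/* Widen to a signed type first, then negate: the result is -n as intended. */
static long delta_fixed(unsigned int nr_reclaimed)
{
    return -(long)nr_reclaimed;
}
```

For an input of 5, the first helper yields 4294967291 on LP64 while the second yields -5, so a counter adjusted by the first would grow instead of shrinking.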
     
  • In isolate_migratepages_block, if we have too many isolated pages and
    nr_migratepages is not zero, we should try to migrate what we have
    without wasting time on isolating.

    In theory it's possible that multiple parallel compactions will cause
    too_many_isolated() to become true even if each has isolated less than
    COMPACT_CLUSTER_MAX, and loop forever in the while loop. Bailing
    immediately prevents that.

    [vbabka@suse.cz: changelog addition]

    Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
    Suggested-by: Vlastimil Babka
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Cc:
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Yang Shi
    Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
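
The bail-out rule from the commit above can be written as a small decision function. This is only a sketch of the logic as the message states it; the enum and `throttle_decision()` are hypothetical names, not kernel identifiers.

```c
#include <stdbool.h>

/* Hypothetical outcomes for one pass through the throttling check. */
enum isolate_action {
    ISOLATE_PROCEED,   /* not throttled: keep isolating pages */
    ISOLATE_WAIT,      /* throttled with nothing isolated yet: wait and retry */
    ISOLATE_BAIL       /* throttled but holding pages: stop and migrate them */
};

static enum isolate_action throttle_decision(bool too_many_isolated,
                                             unsigned long nr_migratepages)
{
    if (!too_many_isolated)
        return ISOLATE_PROCEED;
    if (nr_migratepages)
        return ISOLATE_BAIL;   /* migrate what we already have instead of looping */
    return ISOLATE_WAIT;
}
```

Bailing when pages are already held lets parallel compactions release their isolated pages, which is what breaks the potential endless wait.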
     
  • In isolate_migratepages_block, when cc->alloc_contig is true, we are
    able to isolate compound pages. But nr_migratepages and nr_isolated did
    not count compound pages correctly, causing us to isolate more pages
    than we thought.

    So count compound pages as the number of base pages they contain.
    Otherwise, we might be trapped in the too_many_isolated while loop, since
    the actual number of isolated pages can go up to
    COMPACT_CLUSTER_MAX*512=16384, where COMPACT_CLUSTER_MAX is 32, because
    we stop isolation only after cc->nr_migratepages reaches
    COMPACT_CLUSTER_MAX.

    In addition, after we fix the issue above, cc->nr_migratepages could
    never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
    thus page isolation could not stop as we intended. Change the isolation
    stop condition to '>='.

    The issue can be triggered as follows:

    In a system with 16GB memory and an 8GB CMA region reserved by
    hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs
    are allocated in the CMA region and mlocked), reserving six 1GB hugetlb
    pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will
    get stuck (looping in the too_many_isolated function) until we kill either
    task. With the patch applied, oom will kill the application with 10GB
    THPs and let hugetlb page reservation finish.

    [ziy@nvidia.com: v3]

    Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
    Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
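
The accounting change and the '>=' stop condition from the commit above can be illustrated as follows. This is a sketch under stated assumptions: `isolate_sketch`, `account_isolated()` and the two `should_stop_*()` helpers are hypothetical, and a THP is assumed to be an order-9 compound page (512 base pages), matching the 512 factor in the message.

```c
#define COMPACT_CLUSTER_MAX 32UL   /* value quoted in the commit message */

/* Hypothetical stand-in for the compaction control counters. */
struct isolate_sketch {
    unsigned long nr_migratepages;
};

/* Count a compound page of the given order as 1 << order base pages. */
static void account_isolated(struct isolate_sketch *cc, unsigned int order)
{
    cc->nr_migratepages += 1UL << order;
}

/* Old stop condition: can be skipped over once counts jump by 512. */
static int should_stop_eq(const struct isolate_sketch *cc)
{
    return cc->nr_migratepages == COMPACT_CLUSTER_MAX;
}

/* New stop condition: still fires after the counter jumps past the limit. */
static int should_stop_ge(const struct isolate_sketch *cc)
{
    return cc->nr_migratepages >= COMPACT_CLUSTER_MAX;
}
```

After 31 base pages and one order-9 page the counter is 31 + 512 = 543, so the equality test never matches while the '>=' test stops isolation.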
     

14 Nov, 2020

8 commits

  • …m/fs/xfs/xfs-linux") into android-mainline

    Steps on the way to 5.10-rc4

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Iba36d2244c4229d44e3d391ed23dc25c6022f917

    Greg Kroah-Hartman
     
  • Pull fs freeze fix and cleanups from Darrick Wong:
    "A single vfs fix for 5.10, along with two subsequent cleanups.

    A very long time ago, a hack was added to the vfs fs freeze protection
    code to work around lockdep complaints about XFS, which would try to
    run a transaction (which requires intwrite protection) to finalize an
    xfs freeze (by which time the vfs had already taken intwrite).

    Fast forward a few years, and XFS fixed the recursive intwrite problem
    on its own, and the hack became unnecessary. Fast forward almost a
    decade, and latent bugs in the code converting this hack from freeze
    flags to freeze locks combine with lockdep bugs to make this reproduce
    frequently enough to notice page faults racing with freeze.

    Since the hack is unnecessary and causes thread race errors, just get
    rid of it completely. Making this kind of vfs change midway through a
    cycle makes me nervous, but a large enough number of the usual
    VFS/ext4/XFS/btrfs suspects have said this looks good and solves a
    real problem vector.

    And once that removal is done, __sb_start_write is now simple enough
    that it becomes possible to refactor the function into smaller,
    simpler static inline helpers in linux/fs.h. The cleanup is
    straightforward.

    Summary:

    - Finally remove the "convert to trylock" weirdness in the fs freezer
    code. It was necessary 10 years ago to deal with nested
    transactions in XFS, but we've long since removed that; and now
    this is causing subtle race conditions when lockdep goes offline
    and sb_start_* aren't prepared to retry a trylock failure.

    - Minor cleanups of the sb_start_* fs freeze helpers"

    * tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move __sb_{start,end}_write* to fs.h
    vfs: separate __sb_start_write into blocking and non-blocking helpers
    vfs: remove lockdep bogosity in __sb_start_write

    Linus Torvalds
     
  • Pull xfs fixes from Darrick Wong:

    - Fix a fairly serious problem where the reverse mapping btree key
    comparison functions were silently ignoring parts of the keyspace
    when doing comparisons

    - Fix a thinko in the online refcount scrubber

    - Fix a missing unlock in the pnfs code

    * tag 'xfs-5.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: fix a missing unlock on error in xfs_fs_map_blocks
    xfs: fix brainos in the refcount scrubber's rmap fragment processor
    xfs: fix rmap key and record comparison functions
    xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextents
    xfs: fix flags argument to rmap lookup when converting shared file rmaps

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "A few small fixes:

    - NVMe pull request from Christoph:
    - don't clear the read-only bit on a revalidate (Sagi Grimberg)

    - nbd error case refcount leak (Christoph)

    - loop/generic uevent fix (Christoph, Petr)"

    * tag 'block-5.10-2020-11-13' of git://git.kernel.dk/linux-block:
    loop: Fix occasional uevent drop
    block: add a return value to set_capacity_revalidate_and_notify
    nbd: fix a block_device refcount leak in nbd_release
    nvme: fix incorrect behavior when BLKROSET is called by the user

    Linus Torvalds
     
  • Pull io_uring fix from Jens Axboe:
    "A single fix in here, for a missed rounding case at setup time, which
    caused an otherwise legitimate setup case to return -EINVAL if used
    with unaligned ring size values"

    * tag 'io_uring-5.10-2020-11-13' of git://git.kernel.dk/linux-block:
    io_uring: round-up cq size before comparing with rounded sq size

    Linus Torvalds
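
The rounding case described in the io_uring pull above can be sketched in plain C. This is an illustration, not the io_uring code: `roundup_pow2()` and both `check_sizes_*()` functions are hypothetical helpers, and the only thing taken from the text is that the cq size should be rounded up before it is compared with the rounded sq size.

```c
#include <errno.h>

/* Round n up to the next power of two; assumes 1 <= n <= 1u << 31. */
static unsigned int roundup_pow2(unsigned int n)
{
    unsigned int p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Compares the raw cq size with the rounded sq size: rejects valid setups. */
static int check_sizes_buggy(unsigned int sq, unsigned int cq)
{
    sq = roundup_pow2(sq);
    if (cq < sq)
        return -EINVAL;
    return 0;
}

/* Rounds both sizes first, so the comparison uses the final ring sizes. */
static int check_sizes_fixed(unsigned int sq, unsigned int cq)
{
    sq = roundup_pow2(sq);
    cq = roundup_pow2(cq);
    if (cq < sq)
        return -EINVAL;
    return 0;
}
```

With sq = 5 and cq = 6, both rings end up with 8 entries, yet the first check compares 6 against 8 and fails; the second check accepts the request.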
     
  • Add a vendor hook to print epoch values when the system enters suspend
    and when it exits through resume. These epoch values are useful for
    knowing how long the device was in the suspend state. They can be used
    to synchronize various subsystem timestamps and to have a unique
    timestamp to correlate between various subsystems.

    Bug: 172945021
    Change-Id: I82a01e348d05a46c9c3921869cc9d2fc0fd28867
    Signed-off-by: Murali Nalajala

    Murali Nalajala
     
  • Pull devicetree fixes from Rob Herring:

    - fix Flexcan binding schema errors introduced in rc3

    - fix an of_node ref counting error in of_dma_is_coherent

    * tag 'devicetree-fixes-for-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
    dt-bindings: clock: imx5: fix example
    dt-bindings: can: fsl,flexcan.yaml: fix compatible for i.MX35 and i.MX53
    dt-bindings: can: fsl,flexcan.yaml: fix fsl,stop-mode
    of/address: Fix of_node memory leak in of_dma_is_coherent

    Linus Torvalds
     
  • Step 8 of:
    https://android.googlesource.com/platform/prebuilts/clang/host/linux-x86/+/master/BINUTILS_KERNEL_DEPRECATION.md

    Bug: 141693040
    Signed-off-by: Nick Desaulniers
    Change-Id: I9d1621f6484c0402a7518ffb12a3f8f3815f43a9

    Nick Desaulniers