11 Dec, 2020

2 commits

  • Prarit reported that depending on the affinity setting the

    ' irq $N: Affinity broken due to vector space exhaustion.'

    message is showing up in dmesg, but the vector space on the CPUs in the
    affinity mask is definitely not exhausted.

    Shung-Hsi provided traces and analysis which pinpoints the problem:

    The ordering of trying to assign an interrupt vector in
    assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
    valid node assigned. It does:

    1) Try the intersection of affinity mask and node mask
    2) Try the node mask
    3) Try the full affinity mask
    4) Try the full online mask

    Obviously #2 and #3 are in the wrong order as the requested affinity
    mask has to take precedence.

    In the observed cases #1 failed because the affinity mask did not contain
    CPUs from node 0. That made it allocate a vector from node 0, thereby
    breaking affinity and emitting the misleading message.

    Swap the order of #2 and #3 so the full affinity mask without the node
    intersection is tried before affinity is actually broken.

    If no node is assigned, then only the full affinity mask is tried and,
    if that fails, the full online mask.
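
    The corrected ordering can be sketched as a small model. This is not
    the kernel code: CPU masks are plain Python sets, and candidate_masks(),
    pick_cpu() and has_free_vector are hypothetical names used only for
    illustration.

```python
# Hypothetical model of the fixed search order; masks are sets of CPU numbers.
def candidate_masks(affinity, node_mask, online):
    """Return the masks in the order the fixed code would try them."""
    if node_mask is None:
        # No node assigned: full affinity mask, then full online mask.
        return [affinity, online]
    # Node assigned: the requested affinity mask now precedes the node mask.
    return [affinity & node_mask, affinity, node_mask, online]


def pick_cpu(affinity, node_mask, online, has_free_vector):
    """Return the first CPU with a free vector, honoring the mask order."""
    for mask in candidate_masks(affinity, node_mask, online):
        for cpu in sorted(mask):
            if has_free_vector(cpu):
                return cpu
    return None
```

    With an affinity mask of CPUs 4-5 and a node mask of CPUs 0-3, the
    intersection is empty, so the model falls through to the affinity mask
    and picks CPU 4 instead of a CPU on node 0.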

    Fixes: d6ffc6ac83b1 ("x86/vector: Respect affinity mask in irq descriptor")
    Reported-by: Prarit Bhargava
    Reported-by: Shung-Hsi Yu
    Signed-off-by: Thomas Gleixner
    Tested-by: Shung-Hsi Yu
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de

    Thomas Gleixner
     
  • The MBA software controller (mba_sc) is a feedback loop which
    periodically reads MBM counters and tries to restrict the bandwidth
    below a user-specified value. It tags along the MBM counter overflow
    handler to do the updates with 1s interval in mbm_update() and
    update_mba_bw().

    The purpose of mbm_update() is to periodically read the MBM counters to
    make sure that the hardware counter doesn't wrap around more than once
    between user samplings. mbm_update() calls __mon_event_count() for local
    bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
    instead when mba_sc is enabled. In that case __mon_event_count() is not
    called for local bandwidth updating in the MBM counter overflow handler,
    but it is still called when reading the MBM local bandwidth counter file
    'mbm_local_bytes'. The call path is:

    rdtgroup_mondata_show()
      mon_event_read()
        mon_event_count()
          __mon_event_count()

    In __mon_event_count(), m->chunks is updated by delta chunks which is
    calculated from previous MSR value (m->prev_msr) and current MSR value.
    When mba_sc is enabled, m->chunks is also updated in mbm_update() by
    mistake by the delta chunks which is calculated from m->prev_bw_msr
    instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
    the mba_sc feedback loop.

    When reading the MBM local bandwidth counter file, m->chunks has been
    changed unexpectedly by mbm_bw_count(). As a result, an incorrect local
    bandwidth counter, calculated from the incorrect m->chunks, is shown to
    the user.

    Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
    MBM counter overflow handler, and always calling __mon_event_count() in
    mbm_update() to make sure that the hardware local bandwidth counter
    doesn't wrap around.
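
    The double accounting can be illustrated with a toy model. It is a
    sketch under simplifying assumptions: counter wraparound and the
    chunk-to-byte scaling are ignored, and MbmState and simulate() are
    hypothetical names. Only the field names m->chunks, m->prev_msr and
    m->prev_bw_msr come from the description above.

```python
class MbmState:
    """Toy stand-in for the per-group MBM state."""
    def __init__(self):
        self.chunks = 0
        self.prev_msr = 0
        self.prev_bw_msr = 0


def mon_event_count(m, msr):
    """Model of __mon_event_count(): add the delta since prev_msr."""
    m.chunks += msr - m.prev_msr
    m.prev_msr = msr
    return m.chunks


def mbm_bw_count(m, msr, buggy):
    """Model of mbm_bw_count(): delta since prev_bw_msr for the feedback loop."""
    delta = msr - m.prev_bw_msr
    m.prev_bw_msr = msr
    if buggy:
        m.chunks += delta  # the stray update that the fix removes
    return delta


def simulate(buggy):
    """Return the per-interval delta a user would compute from two reads."""
    m = MbmState()
    reads = []
    for msr in (100, 200, 300):  # hardware counter grows by 100 per interval
        mbm_bw_count(m, msr, buggy)             # periodic overflow handler
        reads.append(mon_event_count(m, msr))   # user reads mbm_local_bytes
    return reads[-1] - reads[-2]
```

    In this model the buggy variant reports twice the real per-interval
    delta, which is in line with the roughly doubled bandwidth shown in the
    test results of this commit.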

    Test steps:
    # Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
    git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat && make
    ./tools/membw/membw -c 0 -b 10000 --read

    # Enable MBA software controller
    mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl

    # Create control group c1
    mkdir /sys/fs/resctrl/c1

    # Set MB throttle to 6 GB/s
    echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata

    # Write PID of the workload to tasks file
    echo `pidof membw` > /sys/fs/resctrl/c1/tasks

    # Read the local bytes counter twice, 1s apart. Before the fix, the
    # calculated local bandwidth is not the expected value (close to 6 GB/s):
    local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
    sleep 1
    local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
    echo "local b/w (bytes/s):" `expr $local_2 - $local_1`

    Before fix:
    local b/w (bytes/s): 11076796416

    After fix:
    local b/w (bytes/s): 5465014272

    Fixes: ba0f26d8529c (x86/intel_rdt/mba_sc: Prepare for feedback loop)
    Signed-off-by: Xiaochen Shen
    Signed-off-by: Borislav Petkov
    Reviewed-by: Tony Luck
    Cc:
    Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com

    Xiaochen Shen
     

10 Dec, 2020

1 commit

  • The PAT bit is in different locations for 4k and 2M/1G page table
    entries.

    Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
    caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
    and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
    index for write-protected pages.

    Fixes: 6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place")
    Signed-off-by: Arvind Sankar
    Signed-off-by: Borislav Petkov
    Tested-by: Tom Lendacky
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu

    Arvind Sankar
     

09 Dec, 2020

4 commits

  • membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
    syncing the core on all sibling threads but not necessarily the calling
    thread. This behavior is fundamentally buggy and cannot be used safely.

    Suppose a user program has two threads. Thread A is on CPU 0 and thread B
    is on CPU 1. Thread A modifies some text and calls
    membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).

    Then thread B executes the modified code. At any point after
    membarrier() decides which CPUs to target, thread A could be preempted and
    replaced by thread B on CPU 0. This could even happen on exit from the
    membarrier() syscall. If this happens, thread B will end up running on CPU
    0 without having synced.

    In principle, this could be fixed by arranging for the scheduler to issue
    sync_core_before_usermode() whenever switching between two threads in the
    same mm if there is any possibility of a concurrent membarrier() call, but
    this would have considerable overhead. Instead, make membarrier() sync the
    calling CPU as well.

    As an optimization, this avoids an extra smp_mb() in the default
    barrier-only mode and an extra rseq preempt on the caller.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org

    Andy Lutomirski
     
  • membarrier() does not explicitly sync_core() remote CPUs; instead, it
    relies on the assumption that an IPI will result in a core sync. On x86,
    this may be true in practice, but it's not architecturally reliable. In
    particular, the SDM and APM do not appear to guarantee that interrupt
    delivery is serializing. While IRET does serialize, IPI return can
    schedule, thereby switching to another task in the same mm that was
    sleeping in a syscall. The new task could then SYSRET back to usermode
    without ever executing IRET.

    Make this more robust by explicitly calling sync_core_before_usermode()
    on remote cores. (This also helps people who search the kernel tree for
    instances of sync_core() and sync_core_before_usermode() -- one might be
    surprised that the core membarrier code doesn't currently show up in
    such a search.)

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org

    Andy Lutomirski
     
  • It seems that most RSEQ membarrier users will expect any stores done before
    the membarrier() syscall to be visible to the target task(s). While this
    is extremely likely to be true in practice, nothing actually guarantees it
    by a strict reading of the x86 manuals. Rather than providing this
    guarantee by accident and potentially causing a problem down the road, just
    add an explicit barrier.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org

    Andy Lutomirski
     
  • sync_core_before_usermode() had an incorrect optimization. If the kernel
    returns from an interrupt, it can get to usermode without IRET. It just has
    to schedule to a different task in the same mm and do SYSRET. Fortunately,
    there were no callers of sync_core_before_usermode() that could have had
    in_irq() or in_nmi() equal to true, because it's only ever called from the
    scheduler.

    While at it, clarify a related comment.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org

    Andy Lutomirski
     

07 Dec, 2020

22 commits

  • Linus Torvalds
     
  • Pull char/misc driver fixes from Greg KH:
    "Here are some small driver fixes, and one "large" revert, for
    5.10-rc7.

    They include:

    - revert mei patch from 5.10-rc1 that was using a reserved userspace
    value. It will be resubmitted once the proper id has been assigned
    by the virtio people.

    - habanalabs fixes found by the fall-through audit from Gustavo

    - speakup driver fixes for reported issues

    - fpga config build fix for reported issue.

    All of these except the revert have been in linux-next with no
    reported issues. The revert is "clean" and just removes a
    previously-added driver, so no real issue there"

    * tag 'char-misc-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
    Revert "mei: virtio: virtualization frontend driver"
    fpga: Specify HAS_IOMEM dependency for FPGA_DFL
    habanalabs: put devices before driver removal
    habanalabs: free host huge va_range if not used
    speakup: Reject setting the speakup line discipline outside of speakup

    Linus Torvalds
     
  • Pull tty fixes from Greg KH:
    "Here are two tty core fixes for 5.10-rc7.

    They resolve some reported locking issues in the tty core. While they
    have not been in a released linux-next yet, they have passed all of
    the 0-day bot testing as well as the submitter's testing"

    * tag 'tty-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
    tty: Fix ->session locking
    tty: Fix ->pgrp locking in tiocspgrp()

    Linus Torvalds
     
  • Pull USB fixes from Greg KH:
    "Here are some small USB fixes for 5.10-rc7 that resolve a number of
    reported issues, and add some new device ids.

    Nothing major here, but these solve some problems that people were
    having with the 5.10-rc tree:

    - reverts for USB storage dma settings that broke working devices

    - thunderbolt use-after-free fix

    - cdns3 driver fixes

    - gadget driver userspace copy fix

    - new device ids

    All of these except for the reverts have been in linux-next with no
    reported issues. The reverts are "clean" and were tested by Hans, as
    well as passing the 0-day tests"

    * tag 'usb-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
    usb: gadget: f_fs: Use local copy of descriptors for userspace copy
    usb: ohci-omap: Fix descriptor conversion
    Revert "usb-storage: fix sdev->host->dma_dev"
    Revert "uas: fix sdev->host->dma_dev"
    Revert "uas: bump hw_max_sectors to 2048 blocks for SS or faster drives"
    USB: serial: kl5kusb105: fix memleak on open
    USB: serial: ch341: sort device-id entries
    USB: serial: ch341: add new Product ID for CH341A
    USB: serial: option: fix Quectel BG96 matching
    usb: cdns3: core: fix goto label for error path
    usb: cdns3: gadget: clear trb->length as zero after preparing every trb
    usb: cdns3: Fix hardware based role switch
    USB: serial: option: add support for Thales Cinterion EXS82
    USB: serial: option: add Fibocom NL668 variants
    thunderbolt: Fix use-after-free in remove_unplugged_switch()

    Linus Torvalds
     
  • Pull x86 fixes from Thomas Gleixner:
    "A set of fixes for x86:

    - Make the AMD L3 QoS code and data prioritization enable/disable
    mechanism work correctly.

    The control bit was only set/cleared on one of the CPUs in a L3
    domain, but it has to be modified on all CPUs in the domain. The
    initial documentation was not clear about this, but the updated one
    from Oct 2020 spells it out.

    - Fix an off by one in the UV platform detection code which causes
    the UV hubs to be identified wrongly.

    The chip revisions start at 1 not at 0.

    - Fix a long standing bug in the evaluation of prefixes in the
    uprobes code which fails to handle repeated prefixes properly.

    The aggregate size of the prefixes can be larger than the bytes
    array but the code blindly iterated over the aggregate size beyond
    the array boundary. Add a macro to handle this case properly and
    use it at the affected places"
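
    The bounded iteration described in the last item can be sketched as
    follows. This is a Python model rather than the kernel macro. It
    assumes a four-slot prefixes.bytes array and a loop that stops at the
    array end or at the first zero byte, and it never consults the
    aggregate size.

```python
PREFIX_SLOTS = 4  # assumed capacity of the prefixes.bytes array


def for_each_insn_prefix(prefix_bytes):
    """Yield (index, prefix) for each stored prefix byte.

    The loop is bounded by the array itself and ends at the first zero
    byte, so an aggregate prefix size larger than the array cannot push
    the index out of bounds.
    """
    for idx, prefix in enumerate(prefix_bytes[:PREFIX_SLOTS]):
        if prefix == 0:
            break
        yield idx, prefix
```

    A naive loop over range(nbytes) with nbytes larger than the array
    length would index past the end, which is the failure mode the macro
    is meant to prevent.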

    * tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
    x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
    x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
    x86/platform/uv: Fix UV4 hub revision adjustment
    x86/resctrl: Fix AMD L3 QOS CDP enable/disable

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "Two fixes for performance monitoring on X86:

    - Add recursion protection to another callchain invoked from
    x86_pmu_stop() which can recurse back into x86_pmu_stop(). The
    first attempt to fix this missed this extra code path.

    - Use the already filtered status variable to check for PEBS counter
    overflow bits and not the unfiltered full status read from
    IA32_PERF_GLOBAL_STATUS which can have unrelated bits set which
    would be evaluated incorrectly"

    * tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel: Check PEBS status correctly
    perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "A set of updates for the interrupt subsystem:

    - Make multiqueue devices which use the managed interrupt affinity
    infrastructure work on PowerPC/Pseries. PowerPC does not use the
    generic infrastructure for setting up PCI/MSI interrupts and the
    multiqueue changes failed to update the legacy PCI/MSI
    infrastructure. Make this work by passing the affinity setup
    information down to the mapping and allocation functions.

    - Move Jason Cooper from MAINTAINERS to CREDITS as his mail is
    bouncing and he's not reachable. We hope all is well with him and
    say thanks for his work over the years"

    * tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    powerpc/pseries: Pass MSI affinity to irq_create_mapping()
    genirq/irqdomain: Add an irq_create_mapping_affinity() function
    MAINTAINERS: Move Jason Cooper to CREDITS

    Linus Torvalds
     
  • Pull intel_idle build fix from Thomas Gleixner:
    "A tiny build fix for a recent change in the intel_idle driver which
    missed a CONFIG dependency and broke the build for certain
    configurations"

    * tag 'locking-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    intel_idle: Build fix

    Linus Torvalds
     
  • …t/masahiroy/linux-kbuild

    Pull Kbuild fixes from Masahiro Yamada:

    - Move -Wcast-align to W=3, which tends to be false-positive and there
    is no tree-wide solution.

    - Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a
    preprocessor option and makes sense for .S files as well.

    - Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings.

    - Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings.

    - Fix undesirable line breaks in *.mod files.

    * tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: avoid split lines in .mod files
    kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1
    kbuild: Hoist '--orphan-handling' into Kconfig
    Kbuild: do not emit debug info for assembly with LLVM_IAS=1
    kbuild: use -fmacro-prefix-map for .S sources
    Makefile.extrawarn: move -Wcast-align to W=3

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "12 patches.

    Subsystems affected by this patch series: mm (memcg, zsmalloc, swap,
    mailmap, selftests, pagecache, hugetlb, pagemap), lib, and coredump"

    * emailed patches from Andrew Morton:
    mm/mmap.c: fix mmap return value when vma is merged after call_mmap()
    hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
    mm/filemap: add static for function __add_to_page_cache_locked
    userfaultfd: selftests: fix SIGSEGV if huge mmap fails
    tools/testing/selftests/vm: fix build error
    mailmap: add two more addresses of Uwe Kleine-König
    mm/swapfile: do not sleep with a spin lock held
    mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
    mm: list_lru: set shrinker map bit when child nr_items is not zero
    mm: memcg/slab: fix obj_cgroup_charge() return value handling
    coredump: fix core_pattern parse error
    zlib: export S390 symbols for zlib modules

    Linus Torvalds
     
    On success, mmap should return the start address of the newly mapped
    area, but patch "mm: mmap: merge vma after call_mmap() if possible" set
    the return value addr to vm_start of the newly merged vma. Users of mmap
    will get the wrong address if the vma is merged after call_mmap(). Fix
    this by moving the assignment to addr before merging the vma.
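
    A minimal model of the ordering problem, assuming a merge simply
    extends the previous vma when the two ranges are adjacent. Vma and
    mmap_region() here are hypothetical stand-ins, not the kernel
    structures.

```python
from dataclasses import dataclass


@dataclass
class Vma:
    start: int
    end: int


def mmap_region(prev, new, fixed):
    """Return the address handed back to the caller of mmap."""
    addr = None
    if fixed:
        addr = new.start  # record the start of the new mapping before merging
    # Merge with the previous vma when the ranges are adjacent.
    merged = Vma(prev.start, new.end) if prev.end == new.start else new
    if not fixed:
        addr = merged.start  # buggy order: start of the merged vma
    return addr
```

    With prev covering 0x1000-0x2000 and the new mapping at 0x2000-0x3000,
    the buggy order returns 0x1000 while the fixed order returns 0x2000.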

    We have a driver which changes vm_flags, and this bug was found by our
    testcases.

    Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
    Signed-off-by: Liu Zixian
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: David Hildenbrand
    Cc: Miaohe Lin
    Cc: Hongxiang Lou
    Cc: Hu Shiyuan
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
    Signed-off-by: Linus Torvalds

    Liu Zixian
     
  • Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
    using hugetlbfs. In this environment the issue is reproduced by:

    - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

    - Start a DPDK app. It doesn't need to run successfully (as in transfer
    packets) nor interact with real hardware. It seems just initializing
    the EAL layer (which handles hugepage reservation and locking) is
    enough to trigger the issue

    - Delete the Pod (or let it "Complete").

    This would result in a kworker thread going into a tight loop (top output):

    1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy

    'perf top -g' reports:

    -   63.28%     0.01%  [kernel]  [k] worker_thread
       - 49.97% worker_thread
          - 52.64% process_one_work
             - 62.08% css_killed_work_fn
                - hugetlb_cgroup_css_offline
                     41.52% _raw_spin_lock
                   - 2.82% _cond_resched
                        rcu_all_qs
                     2.66% PageHuge
          - 0.57% schedule
             - 0.57% __schedule

    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning. Little else can be done on the system as the
    cgroup_mutex can not be acquired.

    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.

    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup. This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts. What to do with reservation counts
    when reparenting was discussed here:

    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

    The decision was made to leave a zombie cgroup behind when reservation
    counts remain. Unfortunately, the code checking reservation counts was
    incorrectly added to hugetlb_cgroup_have_usage.

    To fix the issue, simply remove the check for reservation counts. While
    fixing this issue, a related bug in hugetlb_cgroup_css_offline was
    noticed. The hstate index is not reinitialized each time through the
    do-while loop. Fix this as well.
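
    The loop behaviour can be modelled in a few lines. This is a sketch
    with hypothetical names. css_offline() stands in for
    hugetlb_cgroup_css_offline, per-hstate counts are plain lists, and
    max_passes exists only so the model can report the endless spin instead
    of hanging.

```python
def css_offline(usage, reservations, check_reservations, max_passes=1000):
    """Return the number of passes needed, or None if the loop never ends."""
    usage = list(usage)
    parent = [0] * len(usage)

    def have_usage():
        if any(usage):
            return True
        # The check that was added by mistake: reservations are never moved,
        # so this can stay true forever.
        return check_reservations and any(reservations)

    passes = 0
    while True:  # models the do-while loop
        for idx in range(len(usage)):  # hstate index restarts on every pass
            parent[idx] += usage[idx]  # only 'usage' counts are moved
            usage[idx] = 0
        passes += 1
        if not have_usage():
            return passes
        if passes >= max_passes:
            return None
```

    With a non-zero reservation count the model only terminates when the
    reservation check is dropped, which matches the fix described above.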

    Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Adrian Moreno
    Reviewed-by: Shakeel Butt
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shuah Khan
    Cc:
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]

    Signed-off-by: Alex Shi
    Signed-off-by: Andrew Morton
    Cc: Souptick Joarder
    Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • The error handling in hugetlb_allocate_area() was incorrect for the
    hugetlb_shared test case.

    Previously the behavior was:

    - mmap a hugetlb area
    - If this fails, set the pointer to NULL, and carry on
    - mmap an alias of the same hugetlb fd
    - If this fails, munmap the original area

    If the original mmap failed, it's likely the second one did too. If
    both failed, we'd blindly try to munmap a NULL pointer, causing a
    SIGSEGV. Instead, "goto fail" so we return before trying to mmap the
    alias.

    This issue can be hit "in real life" by forgetting to set
    /proc/sys/vm/nr_hugepages (leaving it at 0), and then trying to run the
    hugetlb_shared test.

    Another small improvement is, when the original mmap fails, don't just
    print "it failed": perror(), so we can see *why*. :)

    Signed-off-by: Axel Rasmussen
    Signed-off-by: Andrew Morton
    Cc: Shuah Khan
    Cc: Peter Xu
    Cc: Joe Perches
    Cc: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: David Alan Gilbert
    Link: https://lkml.kernel.org/r/20201204203443.2714693-1-axelrasmussen@google.com
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Only x86 and PowerPC implement the pkey-xxx.h, and an error was reported
    when compiling protection_keys.c.

    Add an architecture check to the Makefile so that "protection_keys" is
    only compiled on architectures that support it.

    If another architecture implements this, add its name to the Makefile,
    e.g.:
    ifneq (,$(findstring $(ARCH),powerpc mips ... ))

    This fixes the following build errors:

    pkey-helpers.h:93:2: error: #error Architecture not supported
    #error Architecture not supported
    pkey-helpers.h:96:20: error: `PKEY_DISABLE_ACCESS' undeclared
    #define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
    ^
    protection_keys.c:218:45: error: `PKEY_DISABLE_WRITE' undeclared
    pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
    ^

    Signed-off-by: Xingxing Su
    Signed-off-by: Andrew Morton
    Cc: Shuah Khan
    Cc: Sandipan Das
    Cc: John Hubbard
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Brian Geffon
    Cc: Mina Almasry
    Link: https://lkml.kernel.org/r/1606826876-30656-1-git-send-email-suxingxing@loongson.cn
    Signed-off-by: Linus Torvalds

    Xingxing Su
     
  • This fixes attribution for the commits (among others)

    - d4097456cd1d ("video/framebuffer: move the probe func into
    .devinit.text in Blackfin LCD driver")

    - 0312e024d6cd ("mfd: mc13xxx: Add support for mc34708")

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201127213358.3440830-1-u.kleine-koenig@pengutronix.de
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.
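
    The deferral pattern, sketched in userspace terms. A threading.Lock
    stands in for the spin lock and the free callback stands in for
    kvfree(). SwapInfoTable and its fields are hypothetical.

```python
import threading


class SwapInfoTable:
    """Toy table whose entries must not be freed while the lock is held."""

    def __init__(self):
        self.lock = threading.Lock()
        self.entries = {}

    def remove(self, key, free):
        with self.lock:
            # Only unlink under the lock; remember what needs freeing.
            victim = self.entries.pop(key, None)
        if victim is not None:
            free(victim)  # potentially sleeping work runs after unlock
```

    The point of the pattern is that the critical section only detaches
    the object, and the call that may sleep happens once the lock has been
    dropped.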

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • While I was doing zram testing, I found that decompression sometimes
    failed because the compression buffer was corrupted. On investigation, I
    found that the commit below calls cond_resched() unconditionally, which
    can cause problems in atomic context if the task is rescheduled.

    BUG: sleeping function called from invalid context at mm/vmalloc.c:108
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
    3 locks held by memhog/946:
    #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
    #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
    #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
    CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

    We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
    contains the offending calling code) from zsmalloc.

    Even though this option showed some improvement (e.g., 30%) on some
    arm32 platforms, it has been a headache to maintain since it abuses
    APIs[1] (e.g., unmap_kernel_range in atomic context).

    Since 32-bit machines are heading towards deprecation, the config option
    has been available only for builtin builds since v5.8, and it is not the
    default option in zsmalloc, it's time to drop the option for better
    maintenance.

    [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

    Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Cc: Tony Lindgren
    Cc: Christoph Hellwig
    Cc: Harish Sriram
    Cc: Uladzislau Rezki
    Cc:
    Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When investigating a slab cache bloat problem, a significant amount of
    negative dentry cache was seen, but confusingly it was neither shrunk by
    the reclaimer (the host has very tight memory) nor by dropping caches.
    The vmcore shows there are over 14M negative dentry objects on the lru,
    but the tracing result shows they were not even scanned.

    Further investigation shows the memcg's vfs shrinker_map bit is not set,
    so the reclaimer and cache dropping just skip calling the vfs shrinker.
    We have to reboot the hosts to get the memory back.

    I didn't manage to come up with a reproducer in a test environment, and
    the problem can't be reproduced after rebooting. But code inspection
    suggests there is a race between clearing the shrinker map bit and
    reparenting. The hypothesis is elaborated below.

    The memcg hierarchy on our production environment looks like:

             root
            /    \
       system    user

    The main workloads run under the user slice's children, which create
    and remove memcgs frequently. So reparenting happens very often under
    the user slice, but no task is under the user slice directly.

    So with the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:

    CPU A                                CPU B
    reparent
      dst->nr_items == 0
                                         shrinker:
                                           total_objects == 0
      add src->nr_items to dst
      set_bit
                                           return SHRINK_EMPTY
                                           clear_bit
    child memcg offline
      replace child's kmemcg_id with
      parent's (in memcg_offline_kmem())
                                         list_lru_del() between shrinker runs
                                           see parent's kmemcg_id
                                           dec dst->nr_items
    reparent again
      dst->nr_items may go negative
      due to concurrent list_lru_del()

                                         The second run of shrinker:
                                           read nr_items without any
                                           synchronization, so it may
                                           see intermediate negative
                                           nr_items then total_objects
                                           may return 0 coincidentally

                                           keep the bit cleared
      dst->nr_items != 0
      skip set_bit
      add src->nr_items to dst

    After this point dst->nr_items may never go to zero, so reparenting will
    not set the shrinker_map bit anymore. And since there is no task under
    the user slice directly, no new object will be added to its lru to set
    the shrinker map bit either. That bit is kept cleared forever.

    How does list_lru_del() race with reparenting? Reparenting replaces the
    children's kmemcg_id with the parent's without holding nlru->lock, so
    list_lru_del() may see the parent's kmemcg_id while actually deleting
    items from the child's lru. It then decrements the parent's nr_items, so
    the parent's nr_items may go negative, as commit 2788cf0c401c ("memcg:
    reparent list_lrus and free kmemcg_id on css offline") says.

    Since it is impossible for dst->nr_items to go negative and
    src->nr_items to go to zero at the same time, it seems we could set the
    shrinker map bit iff src->nr_items != 0. We could synchronize
    list_lru_count_one() and reparenting with nlru->lock, but checking
    src->nr_items in reparenting seems the simplest approach and avoids lock
    contention.
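
    The change in the set-bit condition can be sketched as below. This is
    a model only. It assumes the old condition was "dst was empty and src
    is not", as the race description implies, and reparent() is a
    hypothetical function operating on plain integers.

```python
def reparent(src_items, dst_items, bit, fixed):
    """Return (new dst->nr_items, new shrinker map bit state)."""
    if fixed:
        # New rule: set the bit iff the child brings any items along.
        set_bit = src_items != 0
    else:
        # Assumed old rule: only when dst was empty before the move.
        set_bit = dst_items == 0 and src_items != 0
    return dst_items + src_items, bit or set_bit
```

    Starting from a cleared bit and a stale non-zero dst count, the old
    rule leaves the bit cleared even though items were just added, while
    the new rule sets it.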

    Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill Tkhai
    Cc: Vladimir Davydov
    Cc: [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating that one of the memory.max limits was exceeded), the
    allocation should fail, instead of falling back to non-accounted mode.

    To make the code more readable, move the conditions for calling
    memcg_slab_pre_alloc_hook() and memcg_slab_post_alloc_hook() into the
    bodies of these hooks.
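
    A rough model of the intended control flow, assuming a much simplified
    hook: slab_pre_alloc_hook(), alloc() and the charge callback are
    hypothetical names for illustration and do not mirror the kernel
    signatures.

```python
from typing import Callable, Optional


def slab_pre_alloc_hook(accounted: bool, charge: Callable[[], int]) -> bool:
    """Return True if the allocation may proceed (hypothetical model)."""
    # The "should this be accounted at all?" condition lives inside the hook,
    # so callers no longer need to test it before calling.
    if not accounted:
        return True
    # A non-zero return from the charge step means a limit was exceeded:
    # the allocation has to fail rather than continue unaccounted.
    return charge() == 0


def alloc(accounted: bool, charge: Callable[[], int]) -> Optional[object]:
    """Allocate a dummy object, or return None when charging fails."""
    if not slab_pre_alloc_hook(accounted, charge):
        return None
    return object()
```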

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    'format_corename()' splits 'core_pattern' on spaces when it is in
    pipe mode, and takes helper_argv[0] as the path to the usermode
    executable. It works fine in most cases.

    However, if there is a space between '|' and '/file/path', such as
    '| /usr/lib/systemd/systemd-coredump %P %u %g', then helper_argv[0] will
    be parsed as '', and users will get a 'Core dump to | disabled' message.

    It is not friendly to users, as the pattern above was valid previously.
    Fix this by ignoring the spaces between '|' and '/file/path'.
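
    The parsing problem can be reproduced with a simplified model. Both
    functions below are hypothetical illustrations of the splitting step
    only. They ignore template expansion and are not the kernel
    implementation.

```python
from typing import List

PATTERN = "| /usr/lib/systemd/systemd-coredump %P %u %g"


def split_naive(core_pattern: str) -> List[str]:
    """Split everything after '|' on single spaces, keeping empty fields."""
    return core_pattern[1:].split(" ")


def split_skipping_leading_spaces(core_pattern: str) -> List[str]:
    """Drop the spaces between '|' and the path before splitting."""
    return core_pattern[1:].lstrip(" ").split(" ")


print(split_naive(PATTERN)[0] == "")              # prints: True
print(split_skipping_leading_spaces(PATTERN)[0])  # prints: /usr/lib/systemd/systemd-coredump
```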

    Fixes: 315c69261dd3 ("coredump: split pipe command whitespace before expanding template")
    Signed-off-by: Menglong Dong
    Signed-off-by: Andrew Morton
    Cc: Paul Wise
    Cc: Jakub Wilk [https://bugs.debian.org/924398]
    Cc: Neil Horman
    Link: https://lkml.kernel.org/r/5fb62870.1c69fb81.8ef5d.af76@mx.google.com
    Signed-off-by: Linus Torvalds

    Menglong Dong
     
  • Fix build errors when ZLIB_INFLATE=m and ZLIB_DEFLATE=m and ZLIB_DFLTCC=y
    by exporting the 2 needed symbols in dfltcc_inflate.c.

    Fixes these build errors:

    ERROR: modpost: "dfltcc_inflate" [lib/zlib_inflate/zlib_inflate.ko] undefined!
    ERROR: modpost: "dfltcc_can_inflate" [lib/zlib_inflate/zlib_inflate.ko] undefined!

    Fixes: 126196100063 ("lib/zlib: add s390 hardware support for kernel zlib_inflate")
    Reported-by: kernel test robot
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Acked-by: Ilya Leoshkevich
    Cc: Mikhail Zaslonko
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Link: https://lkml.kernel.org/r/20201123191712.4882-1-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

06 Dec, 2020

11 commits

  • "xargs echo" is not a safe way to remove line breaks because the input
    may exceed the command line limit and xargs may break it up into
    multiple invocations of echo. This should never happen because
    scripts/gen_autoksyms.sh expects all undefined symbols to be placed on
    the second line of .mod files.

    One possible way is to replace "xargs echo" with
    "sed ':x;N;$!bx;s/\n/ /g'" or something, but I rewrote the code by
    using awk because it is more readable.

    This issue was reported by Sami Tolvanen; in his Clang LTO patch set,
    $(multi-used-m) is no longer an ELF object, but a thin archive that
    contains LLVM bitcode files. llvm-nm prints out symbols for each
    archive member separately, which results in a lot of duplications, in
    some places beyond the system-defined limit.

    This problem must be fixed irrespective of LTO, and we must ensure
    zero possibility of having this issue.
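
    The failure mode can be illustrated with a toy emulation. The function
    xargs_echo() below is a hypothetical simplification that counts
    characters against a made-up limit. Real xargs applies system-dependent
    command line size limits.

```python
from typing import List


def xargs_echo(tokens: List[str], limit: int) -> List[str]:
    """Emulate `xargs echo`: start a new echo once a line would exceed limit."""
    lines: List[str] = []
    current = ""
    for tok in tokens:
        candidate = tok if not current else current + " " + tok
        if current and len(candidate) > limit:
            lines.append(current)  # this echo invocation is full
            current = tok
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def join_all(tokens: List[str]) -> List[str]:
    """Join every token into exactly one output line, whatever its length."""
    return [" ".join(tokens)]


symbols = ["sym_a", "sym_b", "sym_c", "sym_d"]
print(len(xargs_echo(symbols, limit=12)))  # prints: 2
print(len(join_all(symbols)))              # prints: 1
```

    A consumer that reads all symbols from one fixed line only sees part of
    the input once the first variant wraps, which is why a join with no
    length limit is needed.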

    Link: https://lkml.org/lkml/2020/12/1/1658
    Reported-by: Sami Tolvanen
    Signed-off-by: Masahiro Yamada
    Reviewed-by: Sami Tolvanen

    Masahiro Yamada
     
  • This reverts commit d162219c655c8cf8003128a13840d6c1e183fb80.

    The device uses a VIRTIO device ID out of a not-for-production range.
    Releasing Linux using an ID out of this range will make it conflict with
    development setups. An official request to reserve an ID for an MEI
    device is yet to be submitted to the virtio TC, thus there's no chance
    it will be reserved and fixed in time before the next release.

    Once requested it usually takes 2-3 weeks to land in the spec, which
    means the device can be supported with the official ID in the next Linux
    version if contributors act quickly.

    Signed-off-by: Michael S. Tsirkin
    Cc: Tomas Winkler
    Cc: Alexander Usyskin
    Cc: Wang Yu
    Cc: Liu Shuo
    Link: https://lore.kernel.org/r/20201205193625.469773-1-mst@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
     
  • Since insn.prefixes.nbytes can be bigger than the size of
    insn.prefixes.bytes[] when a prefix is repeated, the proper
    check must be:

    insn.prefixes.bytes[i] != 0 and i < 4

    instead of using insn.prefixes.nbytes. Use the new
    for_each_insn_prefix() macro which does it correctly.

    Debugged by Kees Cook.

    [ bp: Massage commit message. ]

    Fixes: 25189d08e516 ("x86/sev-es: Add support for handling IOIO exceptions")
    Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/160697106089.3146288.2052422845039649176.stgit@devnote2

    Masami Hiramatsu
     
  • Since insn.prefixes.nbytes can be bigger than the size of
    insn.prefixes.bytes[] when a prefix is repeated, the proper check must
    be

    insn.prefixes.bytes[i] != 0 and i < 4

    instead of using insn.prefixes.nbytes. Use the new
    for_each_insn_prefix() macro which does it correctly.

    Debugged by Kees Cook.

    [ bp: Massage commit message. ]

    Fixes: 32d0b95300db ("x86/insn-eval: Add utility functions to get segment selector")
    Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/160697104969.3146288.16329307586428270032.stgit@devnote2

    Masami Hiramatsu
     
  • Since insn.prefixes.nbytes can be bigger than the size of
    insn.prefixes.bytes[] when a prefix is repeated, the proper check must
    be

    insn.prefixes.bytes[i] != 0 and i < 4

    instead of using insn.prefixes.nbytes.

    Introduce a for_each_insn_prefix() macro for this purpose. Debugged by
    Kees Cook.
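
    The loop condition stated above can be modeled as follows. This is a
    sketch only: iter_prefixes() and MAX_PREFIX_SLOTS are hypothetical names,
    and the sample byte values are made up for illustration.

```python
from typing import Iterator, Sequence

MAX_PREFIX_SLOTS = 4  # the "i < 4" bound from the description above


def iter_prefixes(prefix_bytes: Sequence[int]) -> Iterator[int]:
    """Yield stored prefixes while i < 4 and prefix_bytes[i] != 0."""
    for i in range(min(len(prefix_bytes), MAX_PREFIX_SLOTS)):
        if prefix_bytes[i] == 0:
            break  # a zero slot marks the end of the stored prefixes
        yield prefix_bytes[i]


# A repeated prefix can push the reported count past the array size, so the
# count must not be used as the loop bound.
stored = [0x66, 0, 0, 0]
reported_nbytes = 5
print([hex(p) for p in iter_prefixes(stored)])  # prints: ['0x66']
```

    Looping over range(reported_nbytes) instead would index past the end of
    the four-slot array in this example.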

    [ bp: Massage commit message, sync with the respective header in tools/
    and drop "we". ]

    Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
    Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Reviewed-by: Srikar Dronamraju
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2

    Masami Hiramatsu
     
  • Pull input fixes from Dmitry Torokhov:
    "A fix for 'RETRIGEN' handling in Atmel touch controllers that was
    causing lost interrupts on systems using edge-triggered interrupts, a
    quirk for i8042 driver, and a couple more fixes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: atmel_mxt_ts - fix lost interrupts
    Input: xpad - support Ardwiino Controllers
    Input: i8042 - add ByteSpeed touchpad to noloop table
    Input: i8042 - fix error return code in i8042_setup_aux()
    Input: soc_button_array - add missing include

    Linus Torvalds
     
  • Pull i2c fixes from Wolfram Sang:
    "Some more I2C driver updates. IMX updates are a tad bigger, but not
    exceptionally big"

    * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
    i2c: mlxbf: Fix the return check of devm_ioremap and ioremap
    i2c: mlxbf: select CONFIG_I2C_SLAVE
    i2c: imx: Don't generate STOP condition if arbitration has been lost
    i2c: imx: Check for I2SR_IAL after every byte
    i2c: imx: Fix reset of I2SR_IAL flag
    i2c: qcom: Fix IRQ error misassignement
    i2c: qup: Fix error return code in qup_i2c_bam_schedule_desc()

    Linus Torvalds
     
  • Pull block fix from Jens Axboe:
    "Single fix for an issue with chunk_sectors and stacked devices"

    * tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
    block: use gcd() to fix chunk_sectors limit stacking

    Linus Torvalds
     
  • Pull io_uring fix from Jens Axboe:
    "Just a small fix this time, for an issue with 32-bit compat apps and
    buffer selection with recvmsg"

    * tag 'io_uring-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
    io_uring: fix recvmsg setup with compat buf-select

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:
    "Some more powerpc fixes for 5.10:

    - Three commits fixing possible missed TLB invalidations for
    multi-threaded processes when CPUs are hotplugged in and out.

    - A fix for a host crash triggerable by host userspace (qemu) in KVM
    on Power9.

    - A fix for a host crash in machine check handling when running HPT
    guests on a HPT host.

    - One commit fixing potential missed TLB invalidations when using the
    hash MMU on Power9 or later.

    - A regression fix for machines with CPUs on node 0 but no memory.

    Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
    Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
    Dronamraju"

    * tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
    KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
    powerpc/numa: Fix a regression on memoryless node 0
    powerpc/64s: Trim offlined CPUs from mm_cpumasks
    kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
    powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
    powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation

    Linus Torvalds
     
  • Pull cifs fixes from Steve French:
    "Three smb3 fixes (two for stable) fixing

    - a null pointer issue in a DFS error path

    - a problem with excessive padding when mounted with "idsfromsid"
    causing owner fields to get corrupted

    - a more recent problem with compounded reparse point query found in
    testing to the Linux kernel server"

    * tag '5.10-rc6-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: refactor create_sd_buf() and and avoid corrupting the buffer
    cifs: add NULL check for ses->tcon_ipc
    smb3: set COMPOUND_FID to FileID field of subsequent compound request

    Linus Torvalds