18 Oct, 2022

4 commits

  • A consumer of the kernel crypto API, after allocating
    the transformation (tfm), sets the following based on
    the type of key it is using:
    - the flag 'is_hbk'
    - the structure 'struct hw_bound_key_info hbk_info'

    This information influences the core processing logic
    for the encapsulated algorithm. The consumer sets it
    after allocating the tfm and before calling
    crypto_xxx_setkey().
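
    For illustration only, a minimal consumer sketch follows. It assumes the
    downstream patch adds the 'is_hbk' and 'hbk_info' members to struct
    crypto_tfm; the algorithm name, key sizes and function name are
    placeholders.

        #include <crypto/skcipher.h>
        #include <linux/err.h>
        #include <linux/types.h>

        static int example_use_hw_bound_key(const u8 *key_blob,
                                            unsigned int blob_len,
                                            unsigned int plain_key_sz)
        {
                struct crypto_skcipher *tfm;
                int err;

                tfm = crypto_alloc_skcipher("cbc(aes)", 0, 0);
                if (IS_ERR(tfm))
                        return PTR_ERR(tfm);

                /* Mark the key as hardware-bound before setkey(). */
                crypto_skcipher_tfm(tfm)->is_hbk = true;                   /* assumed field */
                crypto_skcipher_tfm(tfm)->hbk_info.key_sz = plain_key_sz;  /* assumed field */

                err = crypto_skcipher_setkey(tfm, key_blob, blob_len);
                if (err)
                        crypto_free_skcipher(tfm);
                return err;
        }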

    Signed-off-by: Pankaj Gupta
    Reviewed-by: Gaurav Jain
    Reviewed-by: Kshitiz Varshney

    Pankaj Gupta
     
  • A hardware-bound key buffer carries additional information
    that is accessed through this new structure.

    Its members are:
    - flags: hardware-specific information flags.
    - key_sz: size of the plain key.
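
    A sketch of what such a structure might look like, derived from the
    description above rather than from the patch itself:

        /* Extra information carried with a hardware-bound key buffer
         * (illustrative layout; the downstream patch may differ). */
        struct hw_bound_key_info {
                unsigned int flags;     /* hardware-specific information */
                unsigned int key_sz;    /* size of the plain key, in bytes */
        };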

    Signed-off-by: Pankaj Gupta
    Reviewed-by: Gaurav Jain
    Reviewed-by: Kshitiz Varshney

    Pankaj Gupta
     
  • - added ele_trng_init to register the hardware random number
      generator driver.
    - added ele_get_random to perform the random number generation
      operation; the ele hwrng driver uses this to read random numbers
      from the Sentinel.
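
    For context, a hwrng registration of this kind would look roughly like
    the sketch below; the ele_trng_init() signature, the device name string
    and the assumption that ele_get_random() matches the hwrng ->read()
    prototype are illustrative, not taken from the patch.

        #include <linux/device.h>
        #include <linux/errno.h>
        #include <linux/hw_random.h>
        #include <linux/slab.h>

        /* Assumed to match the hwrng ->read() prototype. */
        int ele_get_random(struct hwrng *rng, void *data, size_t max, bool wait);

        static int ele_trng_init(struct device *dev)
        {
                struct hwrng *rng;

                rng = devm_kzalloc(dev, sizeof(*rng), GFP_KERNEL);
                if (!rng)
                        return -ENOMEM;

                rng->name = "ele-trng";         /* assumed name */
                rng->read = ele_get_random;

                return devm_hwrng_register(dev, rng);
        }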

    Signed-off-by: Gaurav Jain

    Gaurav Jain
     
  • - added ele_get_trng_state to read the TRNG state; the ele hwrng
      driver uses this to check whether the TRNG entropy is valid and
      ready to be read.
    - added ele_start_rng to start initialization of the Sentinel RNG;
      the ele hwrng driver uses this to start the Sentinel RNG when the
      TRNG state is not valid.

    Signed-off-by: Gaurav Jain

    Gaurav Jain
     

17 Oct, 2022

5 commits

  • This will provide a way for SMMU drivers to retrieve StreamIDs
    associated with IORT RMR nodes and use that to set bypass settings
    for those IDs.

    Tested-by: Steven Price
    Tested-by: Laurentiu Tudor
    Tested-by: Hanjun Guo
    Reviewed-by: Hanjun Guo
    Signed-off-by: Shameer Kolothum
    Acked-by: Robin Murphy
    Link: https://lore.kernel.org/r/20220615101044.1972-6-shameerali.kolothum.thodi@huawei.com
    Signed-off-by: Joerg Roedel

    Shameer Kolothum
     
  • Parse through the IORT RMR nodes and populate the reserve region list
    corresponding to a given IOMMU and device (optional). Also, go through
    the ID mappings of the RMR node and retrieve all the SIDs associated
    with it.

    Reviewed-by: Lorenzo Pieralisi
    Tested-by: Steven Price
    Tested-by: Laurentiu Tudor
    Tested-by: Hanjun Guo
    Reviewed-by: Hanjun Guo
    Signed-off-by: Shameer Kolothum
    Acked-by: Robin Murphy
    Link: https://lore.kernel.org/r/20220615101044.1972-5-shameerali.kolothum.thodi@huawei.com
    Signed-off-by: Joerg Roedel

    Shameer Kolothum
     
  • Currently IORT provides a helper to retrieve HW MSI reserve regions.
    Change this to a generic helper to retrieve any IORT related reserve
    regions. This will be useful when we add support for RMR nodes in
    subsequent patches.

    [Lorenzo: For ACPI IORT]

    Reviewed-by: Lorenzo Pieralisi
    Reviewed-by: Christoph Hellwig
    Tested-by: Steven Price
    Tested-by: Laurentiu Tudor
    Tested-by: Hanjun Guo
    Reviewed-by: Hanjun Guo
    Signed-off-by: Shameer Kolothum
    Acked-by: Robin Murphy
    Link: https://lore.kernel.org/r/20220615101044.1972-4-shameerali.kolothum.thodi@huawei.com
    Signed-off-by: Joerg Roedel

    Shameer Kolothum
     
  • At present iort_iommu_msi_get_resv_regions() returns the number of
    MSI reserved regions on success and there are no users for this.
    The reserved region list will get populated anyway for platforms
    that require the HW MSI region reservation. Hence, change the
    function to return void instead.

    Reviewed-by: Christoph Hellwig
    Tested-by: Steven Price
    Tested-by: Laurentiu Tudor
    Reviewed-by: Hanjun Guo
    Signed-off-by: Shameer Kolothum
    Acked-by: Robin Murphy
    Link: https://lore.kernel.org/r/20220615101044.1972-3-shameerali.kolothum.thodi@huawei.com
    Signed-off-by: Joerg Roedel

    Shameer Kolothum
     
  • A callback is introduced to struct iommu_resv_region to free memory
    allocations associated with the reserved region. This will be useful
    when we introduce support for IORT RMR based reserved regions.
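
    For reference, the shape of the change is roughly the following
    (surrounding members as in the existing struct iommu_resv_region in
    <linux/iommu.h>; treat it as a sketch):

        struct iommu_resv_region {
                struct list_head        list;
                phys_addr_t             start;
                size_t                  length;
                int                     prot;
                enum iommu_resv_type    type;
                /* New: release any memory backing this region. */
                void (*free)(struct device *dev, struct iommu_resv_region *region);
        };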

    Reviewed-by: Christoph Hellwig
    Tested-by: Steven Price
    Tested-by: Laurentiu Tudor
    Tested-by: Hanjun Guo
    Signed-off-by: Shameer Kolothum
    Acked-by: Robin Murphy
    Link: https://lore.kernel.org/r/20220615101044.1972-2-shameerali.kolothum.thodi@huawei.com
    Signed-off-by: Joerg Roedel

    Shameer Kolothum
     

30 Sep, 2022

1 commit

  • This is the 5.15.71 stable release

    * tag 'v5.15.71': (144 commits)
    Linux 5.15.71
    ext4: use locality group preallocation for small closed files
    ext4: avoid unnecessary spreading of allocations among groups
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/net/phy/aquantia_main.c
    drivers/tty/serial/fsl_lpuart.c

    Jason Liu
     

28 Sep, 2022

2 commits

  • commit e77cab77f2cb3a1ca2ba8df4af45bb35617ac16d upstream.

    A very common pattern in the drivers is to advance xmit tail
    index and do bookkeeping of Tx'ed characters. Create
    uart_xmit_advance() to handle it.
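
    A sketch of what such a helper looks like, following the pattern it
    factors out (circular-buffer tail advance plus Tx accounting); treat it
    as illustrative rather than the exact definition:

        static inline void uart_xmit_advance(struct uart_port *up, unsigned int chars)
        {
                struct circ_buf *xmit = &up->state->xmit;

                /* Advance the xmit tail and account for the Tx'ed characters. */
                xmit->tail = (xmit->tail + chars) & (UART_XMIT_SIZE - 1);
                up->icount.tx += chars;
        }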

    Reviewed-by: Andy Shevchenko
    Cc: stable
    Signed-off-by: Ilpo Järvinen
    Link: https://lore.kernel.org/r/20220901143934.8850-2-ilpo.jarvinen@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Ilpo Järvinen
     
  • commit d7f06bdd6ee87fbefa05af5f57361d85e7715b11 upstream.

    As PAGE_SIZE is unsigned long, -1 > PAGE_SIZE when NR_CPUS <= 3.

    Cc: "Rafael J. Wysocki"
    Cc: Yury Norov
    Cc: stable@vger.kernel.org
    Cc: feng xiangjun
    Reported-by: feng xiangjun
    Signed-off-by: Phil Auld
    Signed-off-by: Yury Norov
    Link: https://lore.kernel.org/r/20220906203542.1796629-1-pauld@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Phil Auld
     

27 Sep, 2022

1 commit

  • This is the 5.15.70 stable release

    * tag 'v5.15.70': (2444 commits)
    Linux 5.15.70
    ALSA: hda/sigmatel: Fix unused variable warning for beep power change
    cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6ul.dtsi
    arch/arm/mm/mmu.c
    arch/arm64/boot/dts/freescale/imx8mp-evk.dts
    drivers/gpu/drm/imx/dcss/dcss-kms.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.h
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/soc/fsl/Kconfig
    drivers/soc/imx/gpcv2.c
    drivers/usb/dwc3/host.c
    net/dsa/slave.c
    sound/soc/fsl/imx-card.c

    Jason Liu
     

23 Sep, 2022

2 commits

  • commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.

    Flush the CPU caches when memory is reclaimed from an SEV guest (where
    reclaim also includes it being unmapped from KVM's memslots). Due to lack
    of coherency for SEV encrypted memory, failure to flush results in silent
    data corruption if userspace is malicious/broken and doesn't ensure SEV
    guest memory is properly pinned and unpinned.

    Cache coherency is not enforced across the VM boundary in SEV (AMD APM
    vol.2 Section 15.34.7). Confidential cachelines generated by confidential
    VM guests have to be explicitly flushed on the host side. If a memory page
    containing dirty confidential cachelines is released by the VM and
    reallocated to another user, the cachelines may corrupt the new user at a
    later time.

    KVM takes a shortcut by assuming all confidential memory remains pinned
    until the end of the VM lifetime. Therefore, KVM does not flush caches at
    mmu_notifier invalidation events. Because of this incorrect assumption and
    the lack of cache flushing, malicious userspace can crash the host kernel
    by creating a malicious VM and continuously allocating/releasing unpinned
    confidential memory pages while the VM is running.

    Add cache flush operations to the mmu_notifier operations to ensure that
    any physical memory leaving the guest VM gets flushed. In particular, hook
    the mmu_notifier_invalidate_range_start and mmu_notifier_release events
    and flush caches accordingly. The hooks run after releasing the mmu lock
    to avoid contention with other vCPUs.
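
    Conceptually, the reclaim hook boils down to the following simplified
    sketch (not the upstream diff; sev_guest() and wbinvd_on_all_cpus() are
    existing helpers):

        /* Called from the mmu_notifier invalidate/release paths, after the
         * mmu lock has been dropped. */
        void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
        {
                /* SEV guest memory is not cache-coherent with the host. */
                if (sev_guest(kvm))
                        wbinvd_on_all_cpus();
        }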

    Cc: stable@vger.kernel.org
    Suggested-by: Sean Christopherson
    Reported-by: Mingwei Zhang
    Signed-off-by: Mingwei Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini
    [OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied
    kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()]
    Signed-off-by: Ovidiu Panait
    Signed-off-by: Greg Kroah-Hartman

    Mingwei Zhang
     
  • commit 40bfe7a86d84cf08ac6a8fe2f0c8bf7a43edd110 upstream.

    Since the stub version of of_dma_configure_id() was added in commit
    a081bd4af4ce ("of/device: Add input id to of_dma_configure()"), it has
    not matched the signature of the full function, leading to build failure
    reports when code using this function is built on !OF configurations.
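
    The fix amounts to making the !CONFIG_OF stub take the same arguments
    as the real function, roughly along these lines (sketch):

        #ifdef CONFIG_OF
        int of_dma_configure_id(struct device *dev, struct device_node *np,
                                bool force_dma, const u32 *id);
        #else
        static inline int of_dma_configure_id(struct device *dev,
                                              struct device_node *np,
                                              bool force_dma, const u32 *id)
        {
                return 0;
        }
        #endif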

    Fixes: a081bd4af4ce ("of/device: Add input id to of_dma_configure()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Thierry Reding
    Reviewed-by: Frank Rowand
    Acked-by: Lorenzo Pieralisi
    Link: https://lore.kernel.org/r/20220824153256.1437483-1-thierry.reding@gmail.com
    Signed-off-by: Rob Herring
    Signed-off-by: Greg Kroah-Hartman

    Thierry Reding
     

20 Sep, 2022

4 commits

  • [ Upstream commit 0c5f6c0d8201a809a6585b07b6263e9db2c874a3 ]

    The translation table copying code for kdump kernels is currently based
    on the extended root/context entry formats of ECS mode defined in older
    VT-d v2.5, and doesn't handle the scalable mode formats. This causes
    the kexec capture kernel to fail to boot with DMAR faults if the IOMMU
    was enabled in scalable mode by the previous kernel.

    The ECS mode has been deprecated by the VT-d spec since v3.0, and the
    Intel IOMMU driver doesn't support it as there's no real hardware
    implementation. Hence, convert the ECS checking in the table copying
    code to scalable mode.

    The existing copying code consumes a bit in the context entry as a mark
    of a copied entry. It needs to work for the old format as well as for
    the extended context entries. As it's hard to find such a common bit
    for both legacy and scalable mode context entries, replace it with a
    per-IOMMU bitmap.

    Fixes: 7373a8cc38197 ("iommu/vt-d: Setup context and enable RID2PASID support")
    Cc: stable@vger.kernel.org
    Reported-by: Jerry Snitselaar
    Tested-by: Wen Jin
    Signed-off-by: Lu Baolu
    Link: https://lore.kernel.org/r/20220817011035.3250131-1-baolu.lu@linux.intel.com
    Signed-off-by: Joerg Roedel
    Signed-off-by: Sasha Levin

    Lu Baolu
     
  • [ Upstream commit e87f4152e542610d0b4c6c8548964a68a59d2040 ]

    Force-inline two stack helpers to fix the following objtool warnings:

    vmlinux.o: warning: objtool: in_task_stack()+0xc: call to task_stack_page() leaves .noinstr.text section
    vmlinux.o: warning: objtool: in_entry_stack()+0x10: call to cpu_entry_stack() leaves .noinstr.text section
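
    The change is essentially to force-inline the two helpers named in the
    warnings so they cannot become out-of-line calls from noinstr code,
    e.g. (sketch):

        /* Was 'static inline'; __always_inline keeps it inside the
         * .noinstr.text callers such as in_task_stack(). */
        static __always_inline void *task_stack_page(const struct task_struct *task)
        {
                return task->stack;
        }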

    Signed-off-by: Borislav Petkov
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lore.kernel.org/r/20220324183607.31717-2-bp@alien8.de
    Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip")
    Signed-off-by: Sasha Levin

    Borislav Petkov
     
  • [ Upstream commit 8b023accc8df70e72f7704d29fead7ca914d6837 ]

    While looking into a bug related to the compiler's handling of addresses
    of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep.
    Drive by cleanup.

    -Wunused-parameter:
    kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip'
    kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip'
    kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip'

    Signed-off-by: Nick Desaulniers
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com
    Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip")
    Signed-off-by: Sasha Levin

    Nick Desaulniers
     
  • commit 0ebeebcf59601bcfa0284f4bb7abdec051eb856d upstream.

    Fixes the following WARN_ON
    WARNING: CPU: 2 PID: 18678 at fs/nfs/inode.c:123 nfs_clear_inode+0x3b/0x50 [nfs]
    ...
    Call Trace:
    nfs4_evict_inode+0x57/0x70 [nfsv4]
    evict+0xd1/0x180
    dispose_list+0x48/0x60
    evict_inodes+0x156/0x190
    generic_shutdown_super+0x37/0x110
    nfs_kill_super+0x1d/0x40 [nfs]
    deactivate_locked_super+0x36/0xa0

    Signed-off-by: Dave Wysochanski
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Dave Wysochanski
     

15 Sep, 2022

8 commits

  • [ Upstream commit 3261400639463a853ba2b3be8bd009c2a8089775 ]

    We got a recent syzbot report [1] showing a possible misuse
    of pfmemalloc page status in TCP zerocopy paths.

    Indeed, for pages coming from user space or other layers,
    using page_is_pfmemalloc() is moot, and possibly could give
    false positives.

    There have been attempts to make page_is_pfmemalloc() more robust,
    but not using it in the first place in this context is probably better,
    and also saves cpu cycles.

    Note to stable teams :

    You need to backport 84ce071e38a6 ("net: introduce
    __skb_fill_page_desc_noacc") as a prereq.

    Race is more probable after commit c07aea3ef4d4
    ("mm: add a signature in struct page") because page_is_pfmemalloc()
    is now using low order bit from page->lru.next, which can change
    more often than page->index.

    Low order bit should never be set for lru.next (when used as an anchor
    in LRU list), so KCSAN report is mostly a false positive.

    Backporting to older kernel versions seems not necessary.

    [1]
    BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag

    write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0:
    __list_add include/linux/list.h:73 [inline]
    list_add include/linux/list.h:88 [inline]
    lruvec_add_folio include/linux/mm_inline.h:105 [inline]
    lru_add_fn+0x440/0x520 mm/swap.c:228
    folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246
    folio_batch_add_and_move mm/swap.c:263 [inline]
    folio_add_lru+0xf1/0x140 mm/swap.c:490
    filemap_add_folio+0xf8/0x150 mm/filemap.c:948
    __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981
    pagecache_get_page+0x26/0x190 mm/folio-compat.c:104
    grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116
    ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988
    generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738
    ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270
    ext4_file_write_iter+0x2e3/0x1210
    call_write_iter include/linux/fs.h:2187 [inline]
    new_sync_write fs/read_write.c:491 [inline]
    vfs_write+0x468/0x760 fs/read_write.c:578
    ksys_write+0xe8/0x1a0 fs/read_write.c:631
    __do_sys_write fs/read_write.c:643 [inline]
    __se_sys_write fs/read_write.c:640 [inline]
    __x64_sys_write+0x3e/0x50 fs/read_write.c:640
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1:
    page_is_pfmemalloc include/linux/mm.h:1740 [inline]
    __skb_fill_page_desc include/linux/skbuff.h:2422 [inline]
    skb_fill_page_desc include/linux/skbuff.h:2443 [inline]
    tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018
    do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075
    tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline]
    tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150
    inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833
    kernel_sendpage+0x184/0x300 net/socket.c:3561
    sock_sendpage+0x5a/0x70 net/socket.c:1054
    pipe_to_sendpage+0x128/0x160 fs/splice.c:361
    splice_from_pipe_feed fs/splice.c:415 [inline]
    __splice_from_pipe+0x222/0x4d0 fs/splice.c:559
    splice_from_pipe fs/splice.c:594 [inline]
    generic_splice_sendpage+0x89/0xc0 fs/splice.c:743
    do_splice_from fs/splice.c:764 [inline]
    direct_splice_actor+0x80/0xa0 fs/splice.c:931
    splice_direct_to_actor+0x305/0x620 fs/splice.c:886
    do_splice_direct+0xfb/0x180 fs/splice.c:974
    do_sendfile+0x3bf/0x910 fs/read_write.c:1249
    __do_sys_sendfile64 fs/read_write.c:1317 [inline]
    __se_sys_sendfile64 fs/read_write.c:1303 [inline]
    __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    value changed: 0x0000000000000000 -> 0xffffea0004a1d288

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b5d05-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022

    Fixes: c07aea3ef4d4 ("mm: add a signature in struct page")
    Reported-by: syzbot
    Signed-off-by: Eric Dumazet
    Cc: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • [ Upstream commit 84ce071e38a6e25ea3ea91188e5482ac1f17b3af ]

    Managed pages contain pinned userspace pages and are controlled by upper
    layers; there is no need to track skb->pfmemalloc for them. Introduce a
    helper for filling frags that ignores page tracking; it'll be needed
    later.
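
    A sketch of what such a helper looks like: it fills the frag exactly
    like __skb_fill_page_desc() but skips the skb->pfmemalloc propagation
    (body illustrative):

        static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
                                                      int i, struct page *page,
                                                      int off, int size)
        {
                skb_frag_t *frag = &shinfo->frags[i];

                /* Fill the fragment without touching skb->pfmemalloc. */
                frag->bv_page = page;
                frag->bv_offset = off;
                skb_frag_size_set(frag, size);
        }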

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit ac56a0b48da86fd1b4389632fb7c4c8a5d86eefa ]

    Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing
    it to siphon off UDP packets early in the handling of received UDP packets
    thereby avoiding the packet going through the UDP receive queue, it doesn't
    get ICMP packets through the UDP ->sk_error_report() callback. In fact, it
    doesn't appear that there's any usable option for getting hold of ICMP
    packets.

    Fix this by adding a new UDP encap hook to distribute error messages for
    UDP tunnels. If the hook is set, then the tunnel driver will be able to
    see ICMP packets. The hook provides the offset into the packet of the UDP
    header of the original packet that caused the notification.

    An alternative would be to call the ->error_handler() hook - but that
    requires that the skbuff be cloned (as ip_icmp_error() or ipv6_icmp_error()
    do, though that isn't really necessary or desirable in rxrpc's case as we
    want to parse them there and then, not queue them).
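
    A rough sketch of how a tunnel driver might wire this up, based on the
    description above; the encap_err_rcv field name and the callback
    signature (being handed the UDP header offset) are assumptions here,
    not verbatim API:

        #include <net/udp_tunnel.h>

        /* Assumed hook shape: told where the original packet's UDP header
         * sits, so the error can be parsed there and then. */
        static void example_encap_err_rcv(struct sock *sk, struct sk_buff *skb,
                                          unsigned int udp_offset)
        {
                /* Parse the ICMP/ICMPv6 notification in place; nothing is queued. */
        }

        static void example_setup_tunnel(struct net *net, struct socket *sock)
        {
                struct udp_tunnel_sock_cfg cfg = {
                        .encap_type    = 1,
                        .encap_err_rcv = example_encap_err_rcv, /* assumed field */
                };

                setup_udp_tunnel_sock(net, sock, &cfg);
        }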

    Changes
    =======
    ver #3)
    - Fixed an uninitialised variable.

    ver #2)
    - Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals.

    Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook")
    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin

    David Howells
     
  • [ Upstream commit 67f4b5dc49913abcdb5cc736e73674e2f352f81d ]

    Currently, when the writeback code detects a server reboot, it redirties
    any pages that were not committed to disk, and it sets the flag
    NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
    that dirtied the file. While this allows the file descriptor in question
    to redrive its own writes, it violates the fsync() requirement that we
    should be synchronising all writes to disk.
    While the problem is infrequent, we do see corner cases where an untimely
    server reboot causes the fsync() call to abandon its attempt to sync data
    to disk, causing data corruption issues due to missed error conditions or
    similar.

    In order to tighten up the client's ability to deal with this situation
    without introducing livelocks, add a counter that records the number of
    times pages are redirtied due to a server reboot-like condition, and use
    that in fsync() to redrive the sync to disk.

    Fixes: 2197e9b06c22 ("NFS: Fix up fsync() when the server rebooted")
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit e591b298d7ecb851e200f65946e3d53fe78a3c4f ]

    Save some space in the nfs_inode by setting up an anonymous union with
    the fields that are peculiar to a specific type of filesystem object.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit ff81dfb5d721fff87bd516c558847f6effb70031 ]

    If a user is doing 'ls -l', we have a heuristic in GETATTR that tells
    the readdir code to try to use READDIRPLUS in order to refresh the inode
    attributes. In certain circumstances, we also try to invalidate the
    remaining directory entries in order to ensure this refresh.

    If there are multiple readers of the directory, we probably should avoid
    invalidating the page cache, since the heuristic breaks down in that
    situation anyway.

    Signed-off-by: Trond Myklebust
    Tested-by: Benjamin Coddington
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • commit dec9b2f1e0455a151a7293c367da22ab973f713e upstream.

    There is a very common pattern of using
    debugfs_remove(debugfs_lookup(..)), which leaks the dentry that was
    looked up. Instead of having to open-code the correct pattern of
    calling dput() on the dentry, create debugfs_lookup_and_remove() to
    handle this pattern automatically and properly, without any memory
    leaks.
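
    In essence the new helper is the open-coded pattern done correctly,
    roughly (sketch):

        void debugfs_lookup_and_remove(const char *name, struct dentry *parent)
        {
                struct dentry *dentry;

                dentry = debugfs_lookup(name, parent);
                if (!dentry)
                        return;

                debugfs_remove(dentry);
                dput(dentry);   /* drop the reference debugfs_lookup() took */
        }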

    Cc: stable
    Reported-by: Kuyo Chang
    Tested-by: Kuyo Chang
    Link: https://lore.kernel.org/r/YxIaQ8cSinDR881k@kroah.com
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • commit 2f79cdfe58c13949bbbb65ba5926abfe9561d0ec upstream.

    Commit d4252071b97d ("add barriers to buffer_uptodate and
    set_buffer_uptodate") added proper memory barriers to the buffer head
    BH_Uptodate bit, so that anybody who tests a buffer for being up-to-date
    will be guaranteed to actually see initialized state.

    However, that commit didn't _just_ add the memory barrier, it also ended
    up dropping the "was it already set" logic that the BUFFER_FNS() macro
    had.

    That's conceptually the right thing for a generic "this is a memory
    barrier" operation, but in the case of the buffer contents, we really
    only care about the memory barrier for the _first_ time we set the bit,
    in that the only memory ordering protection we need is to avoid anybody
    seeing uninitialized memory contents.

    Any other access ordering wouldn't be about the BH_Uptodate bit anyway,
    and would require some other proper lock (typically BH_Lock or the folio
    lock). A reader that races with somebody invalidating the buffer head
    isn't an issue wrt the memory ordering, it's a serialization issue.

    Now, you'd think that the buffer head operations don't matter in this
    day and age (and I certainly thought so), but apparently some loads
    still end up being heavy users of buffer heads. In particular, the
    kernel test robot reported that not having this bit access optimization
    in place caused a noticeable direct IO performance regression on ext4:

    fxmark.ssd_ext4_no_jnl_DWTL_54_directio.works/sec -26.5% regression

    although you presumably need a fast disk and a lot of cores to actually
    notice.
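
    The restored fast path looks roughly like this (a sketch of the idea;
    the real definition lives in the BUFFER_FNS machinery in
    <linux/buffer_head.h>):

        static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
        {
                /*
                 * If somebody else already set this uptodate, they will
                 * have done the memory barrier, and a reader will thus
                 * see *some* valid buffer state.
                 */
                if (test_bit(BH_Uptodate, &bh->b_state))
                        return;

                /* Order the buffer contents before the first uptodate set. */
                smp_mb__before_atomic();
                set_bit(BH_Uptodate, &bh->b_state);
        }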

    Link: https://lore.kernel.org/all/Yw8L7HTZ%2FdE2%2Fo9C@xsang-OptiPlex-9020/
    Reported-by: kernel test robot
    Tested-by: Fengwei Yin
    Cc: Mikulas Patocka
    Cc: Matthew Wilcox (Oracle)
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

08 Sep, 2022

3 commits

  • commit 9c6d778800b921bde3bff3cff5003d1650f942d1 upstream.

    Automatic kernel fuzzing revealed a recursive locking violation in
    usb-storage:

    ============================================
    WARNING: possible recursive locking detected
    5.18.0 #3 Not tainted
    --------------------------------------------
    kworker/1:3/1205 is trying to acquire lock:
    ffff888018638db8 (&us_interface_key[i]){+.+.}-{3:3}, at:
    usb_stor_pre_reset+0x35/0x40 drivers/usb/storage/usb.c:230

    but task is already holding lock:
    ffff888018638db8 (&us_interface_key[i]){+.+.}-{3:3}, at:
    usb_stor_pre_reset+0x35/0x40 drivers/usb/storage/usb.c:230

    ...

    stack backtrace:
    CPU: 1 PID: 1205 Comm: kworker/1:3 Not tainted 5.18.0 #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    1.13.0-1ubuntu1.1 04/01/2014
    Workqueue: usb_hub_wq hub_event
    Call Trace:

    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
    print_deadlock_bug kernel/locking/lockdep.c:2988 [inline]
    check_deadlock kernel/locking/lockdep.c:3031 [inline]
    validate_chain kernel/locking/lockdep.c:3816 [inline]
    __lock_acquire.cold+0x152/0x3ca kernel/locking/lockdep.c:5053
    lock_acquire kernel/locking/lockdep.c:5665 [inline]
    lock_acquire+0x1ab/0x520 kernel/locking/lockdep.c:5630
    __mutex_lock_common kernel/locking/mutex.c:603 [inline]
    __mutex_lock+0x14f/0x1610 kernel/locking/mutex.c:747
    usb_stor_pre_reset+0x35/0x40 drivers/usb/storage/usb.c:230
    usb_reset_device+0x37d/0x9a0 drivers/usb/core/hub.c:6109
    r871xu_dev_remove+0x21a/0x270 drivers/staging/rtl8712/usb_intf.c:622
    usb_unbind_interface+0x1bd/0x890 drivers/usb/core/driver.c:458
    device_remove drivers/base/dd.c:545 [inline]
    device_remove+0x11f/0x170 drivers/base/dd.c:537
    __device_release_driver drivers/base/dd.c:1222 [inline]
    device_release_driver_internal+0x1a7/0x2f0 drivers/base/dd.c:1248
    usb_driver_release_interface+0x102/0x180 drivers/usb/core/driver.c:627
    usb_forced_unbind_intf+0x4d/0xa0 drivers/usb/core/driver.c:1118
    usb_reset_device+0x39b/0x9a0 drivers/usb/core/hub.c:6114

    This turned out not to be an error in usb-storage but rather a nested
    device reset attempt. That is, as the rtl8712 driver was being
    unbound from a composite device in preparation for an unrelated USB
    reset (that driver does not have pre_reset or post_reset callbacks),
    its ->remove routine called usb_reset_device() -- thus nesting one
    reset call within another.

    Performing a reset as part of disconnect processing is a questionable
    practice at best. However, the bug report points out that the USB
    core does not have any protection against nested resets. Adding a
    reset_in_progress flag and testing it will prevent such errors in the
    future.
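
    Illustrative only (not the actual hub.c change): the guard amounts to
    refusing to start a second reset of a device whose reset is already
    underway.

        struct example_usb_device {
                bool reset_in_progress;
        };

        static int example_reset_device(struct example_usb_device *udev)
        {
                if (udev->reset_in_progress)
                        return 0;       /* nested reset attempt: do nothing */

                udev->reset_in_progress = true;
                /* ... unbind drivers, reset the port, rebind drivers ... */
                udev->reset_in_progress = false;
                return 0;
        }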

    Link: https://lore.kernel.org/all/CAB7eexKUpvX-JNiLzhXBDWgfg2T9e9_0Tw4HQ6keN==voRbP0g@mail.gmail.com/
    Cc: stable@vger.kernel.org
    Reported-and-tested-by: Rondreis
    Signed-off-by: Alan Stern
    Link: https://lore.kernel.org/r/YwkflDxvg0KWqyZK@rowland.harvard.edu
    Signed-off-by: Greg Kroah-Hartman

    Alan Stern
     
  • commit c1e5c2f0cb8a22ec2e14af92afc7006491bebabb upstream.

    Fix incorrect pin assignment values when connecting to a monitor with
    Type-C receptacle instead of a plug.

    According to the specification, a UFP_D receptacle's pin assignment
    should come from the UFP_D pin assignments field (bits 23:16), while
    a UFP_D plug's assignments are described in the DFP_D pin assignments
    field (bits 15:8) during Mode Discovery.

    For example, the LG 27 UL850-W is a monitor with a Type-C receptacle.
    The monitor responds to the MODE DISCOVERY command with the following
    DisplayPort Capability flag:

    dp->alt->vdo=0x140045

    The existing logic only takes care of the UFP_D plug case,
    and would take bits 15:8 for this 0x140045 case.

    This results in a non-existent pin assignment of 0x0 in
    dp_altmode_configure.

    To fix this problem, a new set of macros is introduced
    to take plug/receptacle differences into consideration.
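
    Conceptually, the new macros pick which VDO field to use depending on
    whether the UFP_D partner is a plug or a receptacle, roughly as below
    (macro names here are illustrative):

        /* DisplayPort capability VDO fields, per the layout described above. */
        #define EXAMPLE_DP_CAP_DFP_D_PIN_ASSIGN(vdo)    (((vdo) >> 8) & 0xff)   /* bits 15:8  */
        #define EXAMPLE_DP_CAP_UFP_D_PIN_ASSIGN(vdo)    (((vdo) >> 16) & 0xff)  /* bits 23:16 */

        /* Pin assignments offered by a UFP_D partner: receptacle vs. plug. */
        #define EXAMPLE_DP_CAP_PIN_ASSIGN_UFP_D(vdo, is_receptacle)             \
                ((is_receptacle) ? EXAMPLE_DP_CAP_UFP_D_PIN_ASSIGN(vdo)         \
                                 : EXAMPLE_DP_CAP_DFP_D_PIN_ASSIGN(vdo))

    For the 0x140045 VDO above, the receptacle path yields 0x14 instead of
    the non-existent 0x0.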

    Fixes: 0e3bb7d6894d ("usb: typec: Add driver for DisplayPort alternate mode")
    Cc: stable@vger.kernel.org
    Co-developed-by: Pablo Sun
    Co-developed-by: Macpaul Lin
    Reviewed-by: Guillaume Ranquet
    Reviewed-by: Heikki Krogerus
    Signed-off-by: Pablo Sun
    Signed-off-by: Macpaul Lin
    Link: https://lore.kernel.org/r/20220804034803.19486-1-macpaul.lin@mediatek.com
    Signed-off-by: Greg Kroah-Hartman

    Pablo Sun
     
  • [ Upstream commit 0a90ed8d0cfa29735a221eba14d9cb6c735d35b6 ]

    On Intel hardware the SLP_TYPx bitfield occupies bits 10-12 as per ACPI
    specification (see Table 4.13 "PM1 Control Registers Fixed Hardware
    Feature Control Bits" for the details).

    Fix the mask and other related definitions accordingly.

    Fixes: 93e5eadd1f6e ("x86/platform: New Intel Atom SOC power management controller driver")
    Signed-off-by: Andy Shevchenko
    Link: https://lore.kernel.org/r/20220801113734.36131-1-andriy.shevchenko@linux.intel.com
    Reviewed-by: Hans de Goede
    Signed-off-by: Hans de Goede
    Signed-off-by: Sasha Levin

    Andy Shevchenko
     

06 Sep, 2022

1 commit

  • When a frame is dequeued by the CEETM, the effective packet size is a
    function of the actual packet size, the configured Overhead Accounting
    Length (OAL), and the configured Minimum Packet Size (MPS). This
    effective packet size is used to update the CEETM shaper credits. The
    MPS value should be applied to the actual packet size before the OAL
    value is applied. However, the current behaviour applies the OAL value
    first and the MPS afterwards, which results in an incorrect value being
    subtracted from the CEETM credit for that particular Logical Network
    Interface (LNI) shaper.

    To avoid an erroneous shaper credit count, do not configure both OAL and
    MPS on an LNI shaper.
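
    The ordering difference can be seen in a small illustrative model
    (names assumed, not driver code):

        static unsigned int effective_len_intended(unsigned int pkt,
                                                   unsigned int mps,
                                                   unsigned int oal)
        {
                return (pkt > mps ? pkt : mps) + oal;   /* MPS floor, then OAL */
        }

        static unsigned int effective_len_observed(unsigned int pkt,
                                                   unsigned int mps,
                                                   unsigned int oal)
        {
                unsigned int with_oal = pkt + oal;      /* OAL applied first */

                return with_oal > mps ? with_oal : mps; /* MPS floor applied last */
        }

    For example, with pkt = 40, mps = 64 and oal = 24, the intended result
    is 64 + 24 = 88 while the observed one is 64, so the shaper is debited
    too little for small frames.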

    Signed-off-by: Camelia Groza

    Camelia Groza
     

05 Sep, 2022

3 commits

  • commit 2555283eb40df89945557273121e9393ef9b542b upstream.

    anon_vma->degree tracks the combined number of child anon_vmas and VMAs
    that use the anon_vma as their ->anon_vma.

    anon_vma_clone() then assumes that for any anon_vma attached to
    src->anon_vma_chain other than src->anon_vma, it is impossible for it to
    be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
    elevated by 1 because of a child anon_vma, meaning that if ->degree
    equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.

    This assumption is wrong because the ->degree optimization leads to leaf
    nodes being abandoned on anon_vma_clone() - an existing anon_vma is
    reused and no new parent-child relationship is created. So it is
    possible to reuse an anon_vma for one VMA while it is still tied to
    another VMA.

    This is an issue because is_mergeable_anon_vma() and its callers assume
    that if two VMAs have the same ->anon_vma, the list of anon_vmas
    attached to the VMAs is guaranteed to be the same. When this assumption
    is violated, vma_merge() can merge pages into a VMA that is not attached
    to the corresponding anon_vma, leading to dangling page->mapping
    pointers that will be dereferenced during rmap walks.

    Fix it by separately tracking the number of child anon_vmas and the
    number of VMAs using the anon_vma as their ->anon_vma.
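
    Schematically, the fix splits the single counter in two (a sketch;
    member names per the upstream fix, surrounding fields abridged):

        struct anon_vma {
                struct anon_vma *root;
                struct rw_semaphore rwsem;
                atomic_t refcount;

                /* Previously one 'degree' counter covered both of these. */
                unsigned long num_children;     /* child anon_vmas */
                unsigned long num_active_vmas;  /* VMAs whose ->anon_vma is this one */

                struct anon_vma *parent;
                struct rb_root_cached rb_root;
        };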

    Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
    Cc: stable@kernel.org
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit fd1894224407c484f652ad456e1ce423e89bb3eb upstream.

    Syzbot found an issue [1]: fq_codel_drop() tries to drop a flow without
    any skbs, that is, flow->head is null.
    The root cause, as [2] explains, is that bpf_prog_test_run_skb() runs a
    bpf prog which redirects empty skbs.
    So we should determine whether the length of a packet modified by a bpf
    prog, or by others like bpf_prog_test, is valid before forwarding it
    directly.

    LINK: [1] https://syzkaller.appspot.com/bug?id=0b84da80c2917757915afa89f7738a9d16ec96c5
    LINK: [2] https://www.spinics.net/lists/netdev/msg777503.html

    Reported-by: syzbot+7a12909485b94426aceb@syzkaller.appspotmail.com
    Signed-off-by: Zhengchao Shao
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/r/20220715115559.139691-1-shaozhengchao@huawei.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Zhengchao Shao
     
  • commit 2a0133723f9ebeb751cfce19f74ec07e108bef1f upstream.

    Syzkaller reports refcount bug as follows:
    ------------[ cut here ]------------
    refcount_t: saturated; leaking memory.
    WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
    Modules linked in:
    CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0

    __refcount_add_not_zero include/linux/refcount.h:163 [inline]
    __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
    refcount_inc_not_zero include/linux/refcount.h:245 [inline]
    sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
    tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
    tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
    tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
    tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
    tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
    sk_backlog_rcv include/net/sock.h:1061 [inline]
    __release_sock+0x134/0x3b0 net/core/sock.c:2849
    release_sock+0x54/0x1b0 net/core/sock.c:3404
    inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
    __sys_shutdown_sock net/socket.c:2331 [inline]
    __sys_shutdown_sock net/socket.c:2325 [inline]
    __sys_shutdown+0xf1/0x1b0 net/socket.c:2343
    __do_sys_shutdown net/socket.c:2351 [inline]
    __se_sys_shutdown net/socket.c:2349 [inline]
    __x64_sys_shutdown+0x50/0x70 net/socket.c:2349
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

    During the SMC fallback process in the connect syscall, the kernel
    replaces TCP with SMC. In order to forward wakeups to the smc socket
    waitqueue after fallback, the kernel sets clcsk->sk_user_data to the
    original smc socket in smc_fback_replace_callbacks().

    Later, in the shutdown syscall, the kernel calls sk_psock_get(), which
    treats clcsk->sk_user_data as a psock, triggering the refcnt warning.

    So the root cause is that both smc and psock use the sk_user_data
    field, and they can easily mismatch it.

    This patch solves it by using another bit (defined as
    SK_USER_DATA_PSOCK) in PTRMASK to mark whether sk_user_data points to
    a psock object or not.
    This patch depends on the PTRMASK introduced in commit f1ff5ce2cd5e
    ("net, sk_msg: Clear sk_user_data pointer on clone if tagged").

    Since there will possibly be more flags in the sk_user_data field,
    this patch also refactors the sk_user_data flags code to be more
    generic, to improve its maintainability.
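
    Schematically, the tag bits kept in the low bits of sk_user_data end up
    looking like this (a sketch; the exact values live in sock.h):

        #define SK_USER_DATA_NOCOPY     1UL     /* do not copy on clone */
        #define SK_USER_DATA_BPF        2UL     /* managed/owned by BPF */
        #define SK_USER_DATA_PSOCK      4UL     /* pointer is a psock */
        #define SK_USER_DATA_PTRMASK    ~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF | \
                                          SK_USER_DATA_PSOCK)

    A reader such as sk_psock() then checks SK_USER_DATA_PSOCK before
    treating the pointer as a psock; if the bit is clear, the field belongs
    to someone else (e.g. SMC) and is left alone.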

    Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
    Suggested-by: Jakub Kicinski
    Acked-by: Wen Gu
    Signed-off-by: Hawkins Jiawei
    Reviewed-by: Jakub Sitnicki
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Hawkins Jiawei
     

31 Aug, 2022

3 commits

  • commit dbb16df6443c59e8a1ef21c2272fcf387d600ddf upstream.

    This reverts commit 96e51ccf1af33e82f429a0d6baebba29c6448d0f.

    Recently we started running the kernel with the rstat infrastructure on
    production traffic and began to see negative memcg stat values.
    In particular, the 'sock' stat is the one which we observed having a
    negative value.

    $ grep "sock " /mnt/memory/job/memory.stat
    sock 253952
    total_sock 18446744073708724224

    Re-run after couple of seconds

    $ grep "sock " /mnt/memory/job/memory.stat
    sock 253952
    total_sock 53248

    For now we are only seeing this issue on large machines (256 CPUs) and
    only with the 'sock' stat. I think the networking stack increases the
    stat on one cpu and decreases it on another cpu much more often. So this
    negative sock value is due to the rstat flusher flushing the stats on
    the CPU that has seen the decrement of sock but missing the CPU that
    has the increments. A typical race condition.

    For an easy stable backport, a revert is the simplest solution. For a
    long-term solution, I am thinking of two directions. The first is to
    reduce the race window by optimizing the rstat flusher. The second is,
    if the reader sees a negative stat value, to force a flush and restart
    the stat collection; basically a retry, but limited.

    Link: https://lkml.kernel.org/r/20220817172139.3141101-1-shakeelb@google.com
    Fixes: 96e51ccf1af33e8 ("memcg: cleanup racy sum avoidance code")
    Signed-off-by: Shakeel Butt
    Cc: "Michal Koutný"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Muchun Song
    Cc: David Hildenbrand
    Cc: Yosry Ahmed
    Cc: Greg Thelen
    Cc: [5.15]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     
  • [ Upstream commit a5612ca10d1aa05624ebe72633e0c8c792970833 ]

    While reading sysctl_devconf_inherit_init_net, it can be changed
    concurrently. Thus, we need to add READ_ONCE() to its readers.
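
    The change is mechanical: readers of the sysctl move from a plain load
    to READ_ONCE(), e.g. (illustrative; the threshold check is made up):

        extern int sysctl_devconf_inherit_init_net;

        static bool example_inherit_devconf(void)
        {
                /* The value can change concurrently via sysctl, so annotate
                 * the read; the writer side uses WRITE_ONCE(). */
                return READ_ONCE(sysctl_devconf_inherit_init_net) > 0;
        }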

    Fixes: 856c395cfa63 ("net: introduce a knob to control whether to inherit devconf config")
    Signed-off-by: Kuniyuki Iwashima
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Kuniyuki Iwashima
     
  • [ Upstream commit af67508ea6cbf0e4ea27f8120056fa2efce127dd ]

    While reading sysctl_fb_tunnels_only_for_init_net, it can be changed
    concurrently. Thus, we need to add READ_ONCE() to its readers.

    Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces")
    Signed-off-by: Kuniyuki Iwashima
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Kuniyuki Iwashima