08 Apr, 2022

1 commit

  • commit bddac7c1e02ba47f0570e494c9289acea3062cc1 upstream.

    This reverts commit aa6f8dcbab473f3a3c7454b74caa46d36cdc5d13.

    It turns out this breaks at least the ath9k wireless driver, and
    possibly others.

    What the ath9k driver does on packet receive is to set up the DMA
    transfer with:

    int ath_rx_init(..)
    ..
            bf->bf_buf_addr = dma_map_single(sc->dev, skb->data,
                                             common->rx_bufsize,
                                             DMA_FROM_DEVICE);

    and then the receive logic (through ath_rx_tasklet()) will fetch
    incoming packets

    static bool ath_edma_get_buffers(..)
    ..
            dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
                                    common->rx_bufsize, DMA_FROM_DEVICE);

            ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
            if (ret == -EINPROGRESS) {
                    /* let device gain the buffer again */
                    dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
                                               common->rx_bufsize, DMA_FROM_DEVICE);
                    return false;
            }

    and it's worth noting how that first DMA sync:

    dma_sync_single_for_cpu(..DMA_FROM_DEVICE);

    is there to make sure the CPU can read the DMA buffer (possibly by
    copying it from the bounce buffer area, or by doing some cache flush).
    The iommu correctly turns that into a "copy from bounce buffer" so that
    the driver can look at the state of the packets.

    In the meantime, the device may continue to write to the DMA buffer, but
    we at least have a snapshot of the state due to that first DMA sync.

    But that _second_ DMA sync:

    dma_sync_single_for_device(..DMA_FROM_DEVICE);

    is telling the DMA mapping that the CPU wasn't interested in the area
    because the packet wasn't there. In the case of a DMA bounce buffer,
    that is a no-op.

    Note how it's not a sync for the CPU (the "for_device()" part), and it's
    not a sync for data written by the CPU (the "DMA_FROM_DEVICE" part).

    Or rather, it _should_ be a no-op. That's what commit aa6f8dcbab47
    broke: it made the code bounce the buffer unconditionally, and changed
    the DMA_FROM_DEVICE to just unconditionally and illogically be
    DMA_TO_DEVICE.

    [ Side note: purely within the confines of the swiotlb driver it wasn't
    entirely illogical: The reason it did that odd DMA_FROM_DEVICE ->
    DMA_TO_DEVICE conversion thing is because inside the swiotlb driver,
    it uses just a swiotlb_bounce() helper that doesn't care about the
    whole distinction of who the sync is for - only which direction to
    bounce.

    So it took the "sync for device" to mean that the CPU must have been
    the one writing, and thought it meant DMA_TO_DEVICE. ]

    Also note how the commentary in that commit was wrong, probably due to
    that whole confusion, claiming that the commit makes the swiotlb code

    "bounce unconditionally (that is, also
    when dir == DMA_TO_DEVICE) in order to avoid synchronising back stale
    data from the swiotlb buffer"

    which is nonsensical for two reasons:

    - that "also when dir == DMA_TO_DEVICE" is nonsensical, as that was
    exactly when it always did - and should do - the bounce.

    - since this is a sync for the device (not for the CPU), we're clearly
    fundamentally not copying back stale data from the bounce buffers at
    all, because we'd be copying *to* the bounce buffers.

    So that commit was just very confused. It confused the direction of the
    synchronization (to the device, not the cpu) with the direction of the
    DMA (from the device).

    Reported-and-bisected-by: Oleksandr Natalenko
    Reported-by: Olha Cherevyk
    Cc: Halil Pasic
    Cc: Christoph Hellwig
    Cc: Kalle Valo
    Cc: Robin Murphy
    Cc: Toke Høiland-Jørgensen
    Cc: Maxime Bizon
    Cc: Johannes Berg
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

28 Mar, 2022

1 commit

  • commit 92ee3c60ec9fe64404dc035e7c41277d74aa26cb upstream.

    Currently we have neither proper check nor protection against the
    concurrent calls of PCM hw_params and hw_free ioctls, which may result
    in a UAF. Since the existing PCM stream lock can't be used for
    protecting the whole ioctl operations, we need a new mutex to protect
    those racy calls.

    This patch introduces a new mutex, runtime->buffer_mutex, and applies
    it to both the hw_params and hw_free ioctl code paths. Along with it,
    both functions are slightly modified (the mmap_count check is moved
    into the state-check block) for code simplicity.
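
    The resulting locking shape is roughly the following (a sketch only;
    the real code in sound/core/pcm_native.c has more states and error
    paths):

    mutex_lock(&runtime->buffer_mutex); /* serialize hw_params vs hw_free */
    /* ... state checks, including the mmap_count check ... */
    /* ... buffer (re)allocation or release ... */
    mutex_unlock(&runtime->buffer_mutex);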

    Reported-by: Hu Jiahui
    Cc:
    Reviewed-by: Jaroslav Kysela
    Link: https://lore.kernel.org/r/20220322170720.3529-2-tiwai@suse.de
    Signed-off-by: Takashi Iwai
    Signed-off-by: Greg Kroah-Hartman

    Takashi Iwai
     

23 Mar, 2022

2 commits

  • [ Upstream commit 4ee06de7729d795773145692e246a06448b1eb7a ]

    This kind of interface doesn't have a mac header. This patch fixes
    bpf_redirect() to a PIM interface.

    Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
    Signed-off-by: Nicolas Dichtel
    Link: https://lore.kernel.org/r/20220315092008.31423-1-nicolas.dichtel@6wind.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Nicolas Dichtel
     
  • [ Upstream commit 8e6ed963763fe21429eabfc76c69ce2b0163a3dd ]

    When iterating over sockets using vsock_for_each_connected_socket, make
    sure that a transport filters out sockets that don't belong to the
    transport.
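
    In sketch form (names from net/vmw_vsock/af_vsock.c; simplified):

    /* skip sockets owned by a different transport */
    list_for_each_entry(vsk, &vsock_connected_table[i], connected_table) {
            if (vsk->transport != transport)
                    continue;
            fn(sk_vsock(vsk));
    }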

    There actually was an issue caused by this: in a nested VM
    configuration, destroying the nested VM (which often involves the
    closing of /dev/vhost-vsock if there were h2g connections to the nested
    VM) kills not only the h2g connections, but also all existing g2h
    connections to the (outermost) host, which are totally unrelated.

    Tested: Executed the following steps on Cuttlefish (Android running on a
    VM) [1]: (1) Enter into an `adb shell` session - to have a g2h
    connection inside the VM, (2) open and then close /dev/vhost-vsock by
    `exec 3< /dev/vhost-vsock && exec 3<&-`.
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jiyong Park
    Link: https://lore.kernel.org/r/20220311020017.1509316-1-jiyong@google.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Jiyong Park
     

19 Mar, 2022

1 commit

  • [ Upstream commit c1aca3080e382886e2e58e809787441984a2f89b ]

    This patch enables distinguishing SAs and SPs based on if_id during
    the xfrm_migrate flow. This ensures support for xfrm interfaces
    throughout the SA/SP lifecycle.

    When there are multiple existing SPs with the same direction,
    the same xfrm_selector and different endpoint addresses,
    xfrm_migrate might fail with ENODATA.

    Specifically, the code path for performing xfrm_migrate is:
    Stage 1: find policy to migrate with
             xfrm_migrate_policy_find(sel, dir, type, net)
    Stage 2: find and update state(s) with
             xfrm_migrate_state_find(mp, net)
    Stage 3: update endpoint address(es) of template(s) with
             xfrm_policy_migrate(pol, m, num_migrate)

    Currently "Stage 1" always returns the first xfrm_policy that
    matches, and "Stage 3" looks for the xfrm_tmpl that matches the
    old endpoint address. Thus if there are multiple xfrm_policy
    with the same selector, direction, type and net, "Stage 1" might
    return the wrong xfrm_policy and "Stage 3" will fail with ENODATA
    because it cannot find an xfrm_tmpl with the matching endpoint
    address.

    The fix is to allow userspace to pass an if_id and add if_id
    to the matching rule in Stage 1 and Stage 2 since if_id is a
    unique ID for xfrm_policy and xfrm_state. For compatibility,
    if_id will only be checked if the attribute is set.
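
    In sketch form, the added matching rule looks like this (simplified;
    the real checks live in xfrm_migrate_policy_find() and
    xfrm_migrate_state_find()):

    /* if_id == 0 means the attribute was not set: keep old behavior */
    if (if_id != 0 && pol->if_id != if_id)
            continue;       /* not the policy/state we were asked about */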

    Tested with additions to Android's kernel unit test suite:
    https://android-review.googlesource.com/c/kernel/tests/+/1668886

    Signed-off-by: Yan Yan
    Signed-off-by: Steffen Klassert
    Signed-off-by: Sasha Levin

    Yan Yan
     

16 Mar, 2022

9 commits

  • This reverts commit 2566a89b9e163b2fcd104d6005e0149f197b8a48 which is
    commit a2614140dc0f467a83aa3bb4b6ee2d6480a76202 upstream.

    The above change depends on upstream commit 0faf890fc519 ("net: dsa:
    drop rtnl_lock from dsa_slave_switchdev_event_work"), which is not
    present in linux-5.15.y. Without that change, waiting for the switchdev
    workqueue causes deadlocks on the rtnl_mutex.

    Backporting the dependency commit isn't trivial/desirable, since it
    requires that the following dependencies of the dependency are also
    backported:

    df405910ab9f net: dsa: sja1105: wait for dynamic config command completion on writes too
    eb016afd83a9 net: dsa: sja1105: serialize access to the dynamic config interface
    2468346c5677 net: mscc: ocelot: serialize access to the MAC table
    f7eb4a1c0864 net: dsa: b53: serialize access to the ARL table
    cf231b436f7c net: dsa: lantiq_gswip: serialize access to the PCE registers
    338a3a4745aa net: dsa: introduce locking for the address lists on CPU and DSA ports

    and then this bugfix on top:

    8940e6b669ca ("net: dsa: avoid call to __dev_set_promiscuity() while rtnl_mutex isn't held")

    Reported-by: Daniel Suchy
    Signed-off-by: Vladimir Oltean
    Signed-off-by: Greg Kroah-Hartman

    Vladimir Oltean
     
  • commit b81e0c2372e65e5627864ba034433b64b2fc73f5 upstream.

    Drop various includes not actually used in genhd.h itself, and
    move the remaining includes closer together.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
    Signed-off-by: Jens Axboe
    Reported-by: Sudip Mukherjee
    Reported-by: "H. Nikolaus Schaller"
    Reported-by: Guenter Roeck
    Cc: "Maciej W. Rozycki"
    [ resolves MIPS build failure by luck, root cause needs to be fixed in
    Linus's tree properly, but this is needed for now to fix the build - gregkh ]
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit c993ee0f9f81caf5767a50d1faeba39a0dc82af2 upstream.

    In watch_queue_set_filter(), there are a couple of places where we check
    that the filter type value does not exceed what the type_filter bitmap
    can hold. One place calculates the number of bits by:

    if (tf[i].type >= sizeof(wfilter->type_filter) * 8)

    which is fine, but the second does:

    if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)

    which is not. This can lead to a couple of out-of-bounds writes due to
    a too-large type:

    (1) __set_bit() on wfilter->type_filter
    (2) Writing more elements in wfilter->filters[] than we allocated.

    Fix this by just using the proper WATCH_TYPE__NR instead, which is the
    number of types we actually know about.
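
    In sketch form, the fixed check (identifiers as quoted above):

    /* bound by the number of watch types we actually know about,
     * instead of the 8x-too-large BITS_PER_LONG-based bound */
    if (tf[i].type >= WATCH_TYPE__NR)
            continue;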

    The bug may cause an oops looking something like:

    BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
    Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
    ...
    Call Trace:

    dump_stack_lvl+0x45/0x59
    print_address_description.constprop.0+0x1f/0x150
    ...
    kasan_report.cold+0x7f/0x11b
    ...
    watch_queue_set_filter+0x659/0x740
    ...
    __x64_sys_ioctl+0x127/0x190
    do_syscall_64+0x43/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Allocated by task 611:
    kasan_save_stack+0x1e/0x40
    __kasan_kmalloc+0x81/0xa0
    watch_queue_set_filter+0x23a/0x740
    __x64_sys_ioctl+0x127/0x190
    do_syscall_64+0x43/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    The buggy address belongs to the object at ffff88800d2c66a0
    which belongs to the cache kmalloc-32 of size 32
    The buggy address is located 28 bytes inside of
    32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)

    Fixes: c73be61cede5 ("pipe: Add general notification queue support")
    Reported-by: Jann Horn
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit 4fa59ede95195f267101a1b8916992cf3f245cdb upstream.

    The feature negotiation was designed in a way that
    makes it possible for devices to know which config
    fields will be accessed by drivers.

    This is broken since commit 404123c2db79 ("virtio: allow drivers to
    validate features") with fallout in at least block and net. We have a
    partial work-around in commit 2f9a174f918e ("virtio: write back
    F_VERSION_1 before validate") which at least lets devices find out
    which format the config space should have, but this is a partial fix:
    guests should not access config space without acknowledging features,
    since otherwise we'll never be able to change the config space format.

    To fix, split finalize_features from virtio_finalize_features and
    call finalize_features with all feature bits before validation,
    and then - if validation changed any bits - once again after.

    Since virtio_finalize_features no longer writes out features
    rename it to virtio_features_ok - since that is what it does:
    checks that features are ok with the device.

    As a side effect, this also reduces the amount of hypervisor accesses -
    we now only acknowledge features once unless we are clearing any
    features when validating (which is uncommon).

    IIRC this was more or less always the intent in the spec, but
    unfortunately the way the spec is worded does not say this explicitly; I
    plan to address this at the spec level, too.
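
    The resulting probe-time flow looks roughly like this (a simplified
    sketch, not the verbatim drivers/virtio/virtio.c code):

    vdev->config->finalize_features(vdev);  /* ack all driver features */

    if (drv->validate) {
            u64 features = vdev->features;

            err = drv->validate(vdev);      /* may clear some bits */
            if (err)
                    goto err;
            if (features != vdev->features) /* validation changed bits? */
                    vdev->config->finalize_features(vdev); /* write again */
    }

    err = virtio_features_ok(vdev);         /* checks FEATURES_OK */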

    Acked-by: Jason Wang
    Cc: stable@vger.kernel.org
    Fixes: 404123c2db79 ("virtio: allow drivers to validate features")
    Fixes: 2f9a174f918e ("virtio: write back F_VERSION_1 before validate")
    Cc: "Halil Pasic"
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
     
  • commit 838d6d3461db0fdbf33fc5f8a69c27b50b4a46da upstream.

    virtio_finalize_features is only used internally within virtio.
    No reason to export it.

    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Cornelia Huck
    Acked-by: Jason Wang
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
     
  • commit aa6f8dcbab473f3a3c7454b74caa46d36cdc5d13 upstream.

    Unfortunately, we ended up merging an old version of the patch "fix info
    leak with DMA_FROM_DEVICE" instead of merging the latest one. After I
    pointed out the mix-up to Christoph (the swiotlb maintainer) and asked
    him for guidance, he asked me to create an incremental fix. So here we
    go.

    The main differences between what we got and what was agreed are:
    * swiotlb_sync_single_for_device is also required to do an extra bounce
    * We decided not to introduce DMA_ATTR_OVERWRITE until we have exploiters
    * The implementation of DMA_ATTR_OVERWRITE is flawed: DMA_ATTR_OVERWRITE
      must take precedence over DMA_ATTR_SKIP_CPU_SYNC

    Thus this patch removes DMA_ATTR_OVERWRITE, and makes
    swiotlb_sync_single_for_device() bounce unconditionally (that is, also
    when dir == DMA_TO_DEVICE) in order to avoid synchronising back stale
    data from the swiotlb buffer.

    Let me note that if the size used with the dma_sync_* API is less than
    the size used with dma_[un]map_*, under certain circumstances we may
    still end up with swiotlb not being transparent. In that sense, this is
    not a perfect fix either.

    To make this bulletproof, we would have to bounce the entire
    mapping/bounce buffer. For that we would have to figure out the starting
    address and the size of the mapping in
    swiotlb_sync_single_for_device(). While this does seem possible, there
    seems to be no firm consensus on how things are supposed to work.
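
    In sketch form, the agreed behaviour is (simplified from
    kernel/dma/swiotlb.c, not verbatim):

    void swiotlb_sync_single_for_device(struct device *dev,
                    phys_addr_t tlb_addr, size_t size,
                    enum dma_data_direction dir)
    {
            /* bounce unconditionally so a later sync_for_cpu or unmap
             * cannot copy stale bounce-buffer data back to the caller */
            swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
    }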

    Signed-off-by: Halil Pasic
    Fixes: ddbd89deb7d3 ("swiotlb: fix info leak with DMA_FROM_DEVICE")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Halil Pasic
     
  • [ Upstream commit ddbd89deb7d32b1fbb879f48d68fda1a8ac58e8e ]

    The problem I'm addressing was discovered by the LTP test covering
    cve-2018-1000204.

    A short description of what happens follows:
    1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
    interface with: dxfer_len == 524288, dxfer_dir == SG_DXFER_FROM_DEV
    and a corresponding dxferp. The peculiar thing about this is that TUR
    is not reading from the device.
    2) In sg_start_req() the invocation of blk_rq_map_user() effectively
    bounces the user-space buffer. As if the device was to transfer into
    it. Since commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in
    sg_build_indirect()") we make sure this first bounce buffer is
    allocated with GFP_ZERO.
    3) For the rest of the story we keep ignoring that we have a TUR, so the
    device won't touch the buffer we prepare, as if we had a
    DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
    and the buffer allocated by SG is mapped by the function
    virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
    scatter-gather and not scsi generics). This mapping involves bouncing
    via the swiotlb (we need swiotlb to do virtio in protected guest like
    s390 Secure Execution, or AMD SEV).
    4) When the SCSI TUR is done, we first copy back the content of the second
    (that is swiotlb) bounce buffer (which most likely contains some
    previous IO data), to the first bounce buffer, which contains all
    zeros. Then we copy back the content of the first bounce buffer to
    the user-space buffer.
    5) The test case detects that the buffer, which it zero-initialized,
    ain't all zeros and fails.

    One can argue that this is an swiotlb problem, because without swiotlb
    we leak all zeros, and the swiotlb should be transparent in a sense that
    it does not affect the outcome (if all other participants are well
    behaved).

    Copying the content of the original buffer into the swiotlb buffer is
    the only way I can think of to make swiotlb transparent in such
    scenarios. So let's do just that if in doubt, but allow the driver
    to tell us that the whole mapped buffer is going to be overwritten,
    in which case we can preserve the old behavior and avoid the performance
    impact of the extra bounce.

    Signed-off-by: Halil Pasic
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Halil Pasic
     
  • [ Upstream commit ac77998b7ac3044f0509b097da9637184598980d ]

    According to HW spec the field "size" should be 16 bits
    in bufferx register.

    Fixes: e281682bf294 ("net/mlx5_core: HW data structs/types definitions cleanup")
    Signed-off-by: Mohammad Kabat
    Reviewed-by: Moshe Shemesh
    Signed-off-by: Saeed Mahameed
    Signed-off-by: Sasha Levin

    Mohammad Kabat
     
  • [ Upstream commit ebe48d368e97d007bfeb76fcb065d6cfc4c96645 ]

    The maximum message size that can be sent is bigger than
    the maximum size that skb_page_frag_refill can allocate.
    So it is possible to write beyond the allocated buffer.

    Fix this by doing a fallback to COW in that case.

    v2:

    Avoid get_order() costs as suggested by Linus Torvalds.

    Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
    Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
    Reported-by: valis
    Signed-off-by: Steffen Klassert
    Signed-off-by: Sasha Levin

    Steffen Klassert
     

11 Mar, 2022

5 commits

  • Commit 42baefac638f06314298087394b982ead9ec444b upstream.

    gnttab_end_foreign_access() is used to free a grant reference and
    optionally to free the associated page. In case the grant is still in
    use by the other side processing is being deferred. This leads to a
    problem in case no page to be freed is specified by the caller: the
    caller doesn't know that the page is still mapped by the other side
    and thus should not be used for other purposes.

    The correct way to handle this situation is to take an additional
    reference to the granted page in case handling is being deferred and
    to drop that reference when the grant reference could be freed
    finally.

    This requires that there are no users of gnttab_end_foreign_access()
    left directly repurposing the granted page after the call, as this
    might result in clobbered data or information leaks via the not yet
    freed grant reference.

    This is part of CVE-2022-23041 / XSA-396.

    Reported-by: Simon Gaiser
    Signed-off-by: Juergen Gross
    Reviewed-by: Jan Beulich
    Signed-off-by: Greg Kroah-Hartman

    Juergen Gross
     
  • Commit 1dbd11ca75fe664d3e54607547771d021f531f59 upstream.

    Remove gnttab_query_foreign_access(), as it is unused and unsafe to
    use.

    All previous use cases assumed a grant would not be in use after
    gnttab_query_foreign_access() returned 0. This information is useless
    in the best case, as it only refers to a situation in the past, which
    could have changed already.

    Signed-off-by: Juergen Gross
    Reviewed-by: Jan Beulich
    Signed-off-by: Greg Kroah-Hartman

    Juergen Gross
     
  • Commit 6b1775f26a2da2b05a6dc8ec2b5d14e9a4701a1a upstream.

    Add a new grant table function gnttab_try_end_foreign_access(), which
    will remove and free a grant if it is not in use.

    Its main use case is to either free a grant if it is no longer in use,
    or to take some other action if it is still in use. This other action
    can be an error exit, or (e.g. in the case of blkfront persistent grant
    feature) some special handling.
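
    Usage follows this pattern (a sketch; the function returns nonzero
    once the grant could be ended and freed):

    if (gnttab_try_end_foreign_access(ref)) {
            /* grant removed and freed: the page may safely be reused */
    } else {
            /* still in use by the other side: error out, or handle
             * specially (e.g. blkfront persistent grants) */
    }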

    This is CVE-2022-23036, CVE-2022-23038 / part of XSA-396.

    Reported-by: Demi Marie Obenour
    Signed-off-by: Juergen Gross
    Reviewed-by: Jan Beulich
    Signed-off-by: Greg Kroah-Hartman

    Juergen Gross
     
  • commit ba2689234be92024e5635d30fe744f4853ad97db upstream.

    Some CPUs affected by Spectre-BHB need a sequence of branches, or a
    firmware call to be run before any indirect branch. This needs to go
    in the vectors. No CPU needs both.

    While this can be patched in, it would run on all CPUs as there is a
    single set of vectors. If only one part of a big/little combination is
    affected, the unaffected CPUs have to run the mitigation too.

    Create extra vectors that include the sequence. Subsequent patches will
    allow affected CPUs to select this set of vectors. Later patches will
    modify the loop count to match what the CPU requires.

    Reviewed-by: Catalin Marinas
    Signed-off-by: James Morse
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     
  • commit 44a3918c8245ab10c6c9719dd12e7a8d291980d8 upstream.

    With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
    to Spectre v2 BHB-based attacks.

    When both are enabled, print a warning message and report it in the
    'spectre_v2' sysfs vulnerabilities file.

    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Borislav Petkov
    Reviewed-by: Thomas Gleixner
    [fllinden@amazon.com: backported to 5.15]
    Signed-off-by: Frank van der Linden
    Signed-off-by: Greg Kroah-Hartman

    Josh Poimboeuf
     

09 Mar, 2022

16 commits

  • commit a6d95c5a628a09be129f25d5663a7e9db8261f51 upstream.

    This reverts commit b515d2637276a3810d6595e10ab02c13bfd0b63a.

    Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm: xfrm_state_mtu
    should return at least 1280 for ipv6") in v5.14 breaks the TCP MSS
    calculation in ipsec transport mode, resulting in complete stalls of TCP
    connections. This happens when the (P)MTU is 1280 or slightly larger.

    The desired formula for the MSS is:
    MSS = (MTU - ESP_overhead) - IP header - TCP header

    However, the above commit clamps the (MTU - ESP_overhead) to a
    minimum of 1280, turning the formula into
    MSS = max(MTU - ESP overhead, 1280) - IP header - TCP header

    With the (P)MTU near 1280, the calculated MSS is too large and the
    resulting TCP packets never make it to the destination because they
    are over the actual PMTU.
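
    For illustration, take an IPv6 (P)MTU of 1280 and assume an ESP
    overhead of 73 bytes (the exact overhead depends on the algorithms):

    correct: MSS = (1280 - 73) - 40 - 20 = 1147
    clamped: MSS = max(1280 - 73, 1280) - 40 - 20 = 1220

    A full-sized segment then occupies 1220 + 20 + 40 + 73 = 1353 bytes
    on the wire, exceeding the 1280-byte PMTU.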

    The above commit also causes suboptimal double fragmentation in
    xfrm tunnel mode, as described in
    https://lore.kernel.org/netdev/20210429202529.codhwpc7w6kbudug@dwarf.suse.cz/

    The original problem the above commit was trying to fix is now fixed
    by commit 6596a0229541270fb8d38d989f91b78838e5e9da ("xfrm: fix MTU
    regression").

    Signed-off-by: Jiri Bohac
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Jiri Bohac
     
  • commit 327b89f0acc4c20a06ed59e4d9af7f6d804dc2e2 upstream.

    This patch adds a new key definition for KEY_ALL_APPLICATIONS
    and aliases KEY_DASHBOARD to it.

    It also maps the 0x0c/0x2a2 usage code to KEY_ALL_APPLICATIONS.

    Signed-off-by: William Mahon
    Acked-by: Benjamin Tissoires
    Link: https://lore.kernel.org/r/20220303035618.1.I3a7746ad05d270161a18334ae06e3b6db1a1d339@changeid
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Greg Kroah-Hartman

    William Mahon
     
  • commit bfa26ba343c727e055223be04e08f2ebdd43c293 upstream.

    Numerous keyboards are adding dictate keys, which allow text
    messages to be dictated via a microphone.

    This patch adds a new key definition KEY_DICTATE and maps 0x0c/0x0d8
    usage code to this new keycode. Additionally hid-debug is adjusted to
    recognize this new usage code as well.

    Signed-off-by: William Mahon
    Acked-by: Benjamin Tissoires
    Link: https://lore.kernel.org/r/20220303021501.1.I5dbf50eb1a7a6734ee727bda4a8573358c6d3ec0@changeid
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Greg Kroah-Hartman

    William Mahon
     
  • commit b1e8206582f9d680cff7d04828708c8b6ab32957 upstream.

    Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
    invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
    race vs syscalls by not placing the task on the runqueue before it
    gets exposed through the pidhash.

    Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
    trying to fix a single instance of this, instead fix the whole class
    of issues, effectively reverting this commit.

    Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Tadeusz Struk
    Tested-by: Zhang Qiao
    Tested-by: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit c3873070247d9e3c7a6b0cf9bf9b45e8018427b1 upstream.

    Eric Dumazet says:
    The sock_hold() side seems suspect, because there is no guarantee
    that sk_refcnt is not already 0.

    On failure, we cannot queue the packet and need to indicate an
    error. The packet will be dropped by the caller.
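
    A sketch of the shape of the fix (the refcount helper is real, the
    surrounding labels are illustrative):

    /* take a reference only if the socket is not already being freed */
    if (!refcount_inc_not_zero(&sk->sk_refcnt))
            goto err_out;   /* caller sees the error and drops the skb */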

    v2: split skb prefetch hunk into separate change

    Fixes: 271b72c7fa82c ("udp: RCU handling for Unicast packets.")
    Reported-by: Eric Dumazet
    Reviewed-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     
  • commit 7c76ecd9c99b6e9a771d813ab1aa7fa428b3ade1 upstream.

    struct xfrm_user_offload has a flags variable that receives user input,
    but the kernel didn't check whether only valid bits were provided. This
    caused a situation where unsanitized input was forwarded directly to
    the drivers.

    For example, the exposed XFRM_OFFLOAD_IPV6 define was used by
    strongswan, but not implemented in the kernel at all.

    As a solution, check and sanitize input flags to forward
    XFRM_OFFLOAD_INBOUND to the drivers.
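
    In sketch form (the check sits in the device offload setup path):

    /* reject anything beyond the known flag bits */
    if (xuo->flags & ~(XFRM_OFFLOAD_IPV6 | XFRM_OFFLOAD_INBOUND))
            return -EINVAL;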

    Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API")
    Signed-off-by: Leon Romanovsky
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Leon Romanovsky
     
  • [ Upstream commit 8b017fbe0bbb98dd71fb4850f6b9cc0e136a26b8 ]

    Moving the of_net code from drivers/of/ to net/core means we
    no longer stub out the helpers when networking is disabled,
    which leads to a randconfig build failure with at least one
    ARM platform that calls this from non-networking code:

    arm-linux-gnueabi-ld: arch/arm/mach-mvebu/kirkwood.o: in function `kirkwood_dt_eth_fixup':
    kirkwood.c:(.init.text+0x54): undefined reference to `of_get_mac_address'

    Restore the way this worked before by changing that #ifdef
    check back to testing for both CONFIG_OF and CONFIG_NET.
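
    The restored guard looks roughly like this (a sketch of the
    include/linux/of_net.h pattern):

    #if defined(CONFIG_OF) && defined(CONFIG_NET)
    int of_get_mac_address(struct device_node *np, u8 *addr);
    #else
    static inline int of_get_mac_address(struct device_node *np, u8 *addr)
    {
            return -ENODEV;
    }
    #endif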

    Fixes: e330fb14590c ("of: net: move of_net under net/")
    Signed-off-by: Arnd Bergmann
    Link: https://lore.kernel.org/r/20211014090055.2058949-1-arnd@kernel.org
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Arnd Bergmann
     
  • [ Upstream commit e330fb14590c5c80f7195c3d8c9b4bcf79e1a5cd ]

    Rob suggests moving of_net.c from under drivers/of/ to somewhere
    in the networking code.

    Suggested-by: Rob Herring
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Rob Herring
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Jakub Kicinski
     
  • [ Upstream commit 61a0abaee2092eee69e44fe60336aa2f5b578938 ]

    Commit 316580b69d0a ("u64_stats: provide u64_stats_t type")
    fixed possible load/store tearing on 64bit arches.

    For instance the following C code

    stats->nsecs += sched_clock() - start;

    Could be rightfully implemented like this by a compiler,
    confusing concurrent readers a lot:

    stats->nsecs += sched_clock();
    // arbitrary delay
    stats->nsecs -= start;
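
    The u64_stats_t accessors avoid that; a sketch of the pattern this
    change applies (see <linux/u64_stats_sync.h>):

    u64_stats_t nsecs;      /* was: u64 nsecs; */

    /* writer: one tear-free update */
    u64_stats_add(&stats->nsecs, sched_clock() - start);

    /* reader */
    u64 n = u64_stats_read(&stats->nsecs);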

    Signed-off-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20211026214133.3114279-4-eric.dumazet@gmail.com
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • [ Upstream commit e2f08207c558bc0bc8abaa557cdb29bad776ac7b ]

    The link extended sub-states are assigned as an enum, which has integer
    size, but read from a union as u8. This works for small values on
    little endian systems, but on big endian it always gives 0. Fix the
    variable in the union to match the enum size.
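
    A self-contained illustration of the failure mode (not the kernel
    struct; the names here are made up):

    #include <stdio.h>

    union demo {
            int as_enum;            /* written as a full-width enum/int */
            unsigned char as_u8;    /* read back through a u8 member */
    };

    int main(void)
    {
            union demo d = { .as_enum = 2 };
            /* prints 2 on little endian, 0 on big endian: the u8 member
             * aliases the first byte, which is the MSB on big endian */
            printf("%u\n", d.as_u8);
            return 0;
    }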

    Fixes: ecc31c60240b ("ethtool: Add link extended state")
    Signed-off-by: Moshe Tal
    Reviewed-by: Ido Schimmel
    Tested-by: Ido Schimmel
    Reviewed-by: Gal Pressman
    Reviewed-by: Amit Cohen
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Moshe Tal
     
  • [ Upstream commit 60115fa54ad7b913b7cb5844e6b7ffeb842d55f2 ]

    Yongqiang reports a kmemleak panic on module insmod/rmmod with KASAN
    enabled (without KASAN_VMALLOC) on x86 [1].

    When the module area allocates memory, its kmemleak_object is created
    successfully, but the KASAN shadow memory of the module allocation is
    not ready yet, so when kmemleak scans the module's pointer, it panics
    because the KASAN check finds no shadow memory.

    module_alloc
      __vmalloc_node_range
        kmemleak_vmalloc
                kmemleak_scan
                  update_checksum
      kasan_module_alloc
        kmemleak_ignore

    Note, there is no problem if KASAN_VMALLOC is enabled, as the module
    area's entire shadow memory is preallocated. Thus, the bug only exists
    on architectures which dynamically allocate module-area shadow memory
    per module load; for now, only x86/arm64/s390 are involved.

    To fix this issue, add a VM_DEFER_KMEMLEAK flag and defer the kmemleak
    registration of the vmalloc'ed object in module_alloc().

    [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/

    [wangkefeng.wang@huawei.com: fix build]
    Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
    [akpm@linux-foundation.org: simplify ifdefs, per Andrey]
    Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com

    Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
    Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules")
    Fixes: 39d114ddc682 ("arm64: add KASAN support")
    Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables")
    Signed-off-by: Kefeng Wang
    Reported-by: Yongqiang Liu
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: Alexander Gordeev
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Alexander Potapenko
    Cc: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kefeng Wang
     
  • [ Upstream commit 16720861675393a35974532b3c837d9fd7bfe08c ]

    Avoid potentially hazardous memory copying and the needless use of
    "%pIS" -- in the kernel, an RPC service listener is always bound to
    ANYADDR. Having the network namespace is helpful when recording
    errors, though.

    Fixes: a0469f46faab ("SUNRPC: Replace dprintk call sites in TCP state change callouts")
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit dc6c6fb3d639756a532bcc47d4a9bf9f3965881b ]

    While testing, I got an unexpected KASAN splat:

    Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_svc_xprt_create_err+0x190/0x210 [sunrpc]
    Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: Read of size 28 at addr ffffc9000008f728 by task mount.nfs/4628

    The memcpy() in the TP_fast_assign section of this trace point
    copies exactly as many bytes as the destination buffer holds, so
    that the destination won't be overrun.

    In other similar trace points, the source buffer for this memcpy is
    a "struct sockaddr_storage" so the actual length of the source
    buffer is always long enough to prevent the memcpy from reading
    uninitialized or unallocated memory.

    However, for this trace point, the source buffer can be as small as
    a "struct sockaddr_in". For AF_INET sockaddrs, the memcpy() reads
    memory that follows the source buffer, which is not always valid
    memory.

    To avoid copying past the end of the passed-in sockaddr, make the
    source address's length available to the memcpy(). It would be a
    little nicer if the tracing infrastructure was more friendly about
    storing socket addresses that are not AF_INET, but I could not find
    a way to make printk("%pIS") work with a dynamic array.
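
    In sketch form (bounding the copy by the passed-in source length is
    the point; exact field names may differ):

    memcpy(&__entry->sa, sa, min(salen, sizeof(__entry->sa)));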

    Reported-by: KASAN
    Fixes: 4b8f380e46e4 ("SUNRPC: Tracepoint to record errors in svc_xpo_create()")
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit dae9a6cab8009e526570e7477ce858dcdfeb256e ]

    Refactor.

    Now that the NFSv2 and NFSv3 XDR decoders have been converted to
    use xdr_streams, the WRITE decoder functions can use
    xdr_stream_subsegment() to extract the WRITE payload into its own
    xdr_buf, just as the NFSv4 WRITE XDR decoder currently does.

    That makes it possible to pass the first kvec, pages array + length,
    page_base, and total payload length via a single function parameter.

    The payload's page_base is not yet assigned or used, but will be in
    subsequent patches.

    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit 2d3916f3189172d5c69d33065c3c21119fe539fc ]

    While investigating why a synchronize_net() has been added recently
    in ipv6_mc_down(), I found that igmp6_event_query() and igmp6_event_report()
    might drop skbs in some cases.

    Discussion about removing synchronize_net() from ipv6_mc_down()
    will happen in a different thread.

    Fixes: f185de28d9ae ("mld: add new workqueues for process mld events")
    Signed-off-by: Eric Dumazet
    Cc: Taehee Yoo
    Cc: Cong Wang
    Cc: David Ahern
    Link: https://lore.kernel.org/r/20220303173728.937869-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • [ Upstream commit e85c81ba8859a4c839bcd69c5d83b32954133a5b ]

    Consider the following scenario:
    1. jbd starts committing transaction n
    2. task A gets a new handle for transaction n+1
    3. task A does some ineligible actions and marks FC_INELIGIBLE
    4. jbd completes transaction n and clears FC_INELIGIBLE
    5. task A calls fsync

    In this case fast commit will not fall back to a full commit, and
    transaction n+1 is also not handled by jbd.

    Make ext4_fc_mark_ineligible() also record the transaction tid of the
    latest ineligible case. When ext4_fc_cleanup() is called, check the
    current transaction tid; if it is smaller than the latest ineligible
    tid, do not clear EXT4_MF_FC_INELIGIBLE.
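
    A sketch of the cleanup-side check (simplified; s_fc_ineligible_tid
    is the field recorded by ext4_fc_mark_ineligible()):

    /* a commit older than the latest ineligible marking must not clear
     * the flag, or the next fsync would wrongly take the fast path */
    if (tid < sbi->s_fc_ineligible_tid)
            return;
    ext4_clear_mount_flag(sb, EXT4_MF_FC_INELIGIBLE);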

    Reported-by: kernel test robot
    Reported-by: Dan Carpenter
    Reported-by: Ritesh Harjani
    Suggested-by: Harshad Shirwadkar
    Signed-off-by: Xin Yin
    Link: https://lore.kernel.org/r/20220117093655.35160-2-yinxin.x@bytedance.com
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Sasha Levin

    Xin Yin
     

02 Mar, 2022

5 commits

  • commit f6c052afe6f802d87c74153b7a57c43b2e9faf07 upstream.

    The wp-gpios property can be used on NVMEM nodes, and the same property
    can also be used on MTD NAND nodes. In case the wp-gpios property is
    defined at the NAND node level, the GPIO management is done at the NAND
    driver level. Write protect is disabled when the driver is probed or
    resumed and is enabled when the driver is released or suspended.

    When no partitions are defined in the NAND DT node, then the NAND DT node
    will be passed to NVMEM framework. If wp-gpios property is defined in
    this node, the GPIO resource is taken twice and the NAND controller
    driver fails to probe.

    It would be possible to set config->wp_gpio at the MTD level before
    calling the nvmem_register function, but the NVMEM framework would then
    toggle this GPIO on each write, while this GPIO should only be
    controlled by the NAND level driver to ensure that Write Protect has
    not been enabled.

    A way to fix this conflict is to add a new boolean flag in nvmem_config
    named ignore_wp. In case ignore_wp is set, the GPIO resource will
    be managed by the provider.
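
    A provider that manages the pin itself then opts out roughly like
    this (a sketch; other nvmem_config fields omitted):

    struct nvmem_config config = {
            .dev = dev,
            .ignore_wp = true,  /* WP GPIO stays owned by the NAND driver */
            /* ... */
    };
    nvmem = nvmem_register(&config);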

    Fixes: 2a127da461a9 ("nvmem: add support for the write-protect pin")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Kerello
    Signed-off-by: Srinivas Kandagatla
    Link: https://lore.kernel.org/r/20220220151432.16605-2-srinivas.kandagatla@linaro.org
    Signed-off-by: Greg Kroah-Hartman

    Christophe Kerello
     
  • [ Upstream commit a1cdec57e03a1352e92fbbe7974039dda4efcec0 ]

    UDP sendmsg() can be lockless, which is causing all kinds
    of data races.

    This patch converts sk->sk_tskey to remove one of these races.

    BUG: KCSAN: data-race in __ip_append_data / __ip_append_data

    read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
    __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
    ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
    udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
    inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
    sock_sendmsg_nosec net/socket.c:705 [inline]
    sock_sendmsg net/socket.c:725 [inline]
    ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
    ___sys_sendmsg net/socket.c:2467 [inline]
    __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
    __do_sys_sendmmsg net/socket.c:2582 [inline]
    __se_sys_sendmmsg net/socket.c:2579 [inline]
    __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
    __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
    ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
    udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
    inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
    sock_sendmsg_nosec net/socket.c:705 [inline]
    sock_sendmsg net/socket.c:725 [inline]
    ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
    ___sys_sendmsg net/socket.c:2467 [inline]
    __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
    __do_sys_sendmmsg net/socket.c:2582 [inline]
    __se_sys_sendmmsg net/socket.c:2579 [inline]
    __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x0000054d -> 0x0000054e

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • commit 5486f5bf790b5c664913076c3194b8f916a5c7ad upstream.

    All functions defined as static inline in net/checksum.h are
    meant to be inlined for performance reasons.

    But since commit ac7c3e4ff401 ("compiler: enable
    CONFIG_OPTIMIZE_INLINING forcibly") the compiler is allowed to
    uninline functions when it wants.

    Fair enough in the general case, but for tiny performance critical
    checksum helpers that's counter-productive.

    The problem mainly arises when selecting CONFIG_CC_OPTIMIZE_FOR_SIZE:
    those helpers being 'static inline' in header files, you suddenly find
    them duplicated many times in the resulting vmlinux.

    Here is a typical example when building powerpc pmac32_defconfig
    with CONFIG_CC_OPTIMIZE_FOR_SIZE. csum_sub() appears 4 times:

    c04a23cc <csum_sub>:
    c04a23cc: 7c 84 20 f8 not r4,r4
    c04a23d0: 7c 63 20 14 addc r3,r3,r4
    c04a23d4: 7c 63 01 94 addze r3,r3
    c04a23d8: 4e 80 00 20 blr
    ...
    c04a2ce8: 4b ff f6 e5 bl c04a23cc
    ...
    c04a2d2c: 4b ff f6 a1 bl c04a23cc
    ...
    c04a2d54: 4b ff f6 79 bl c04a23cc
    ...
    c04a754c <csum_sub>:
    c04a754c: 7c 84 20 f8 not r4,r4
    c04a7550: 7c 63 20 14 addc r3,r3,r4
    c04a7554: 7c 63 01 94 addze r3,r3
    c04a7558: 4e 80 00 20 blr
    ...
    c04ac930: 4b ff ac 1d bl c04a754c
    ...
    c04ad264: 4b ff a2 e9 bl c04a754c
    ...
    c04e3b08 <csum_sub>:
    c04e3b08: 7c 84 20 f8 not r4,r4
    c04e3b0c: 7c 63 20 14 addc r3,r3,r4
    c04e3b10: 7c 63 01 94 addze r3,r3
    c04e3b14: 4e 80 00 20 blr
    ...
    c04e5788: 4b ff e3 81 bl c04e3b08
    ...
    c04e65c8: 4b ff d5 41 bl c04e3b08
    ...
    c0512d34 <csum_sub>:
    c0512d34: 7c 84 20 f8 not r4,r4
    c0512d38: 7c 63 20 14 addc r3,r3,r4
    c0512d3c: 7c 63 01 94 addze r3,r3
    c0512d40: 4e 80 00 20 blr
    ...
    c0512dfc: 4b ff ff 39 bl c0512d34
    ...
    c05138bc: 4b ff f4 79 bl c0512d34
    ...

    Restore the expected behaviour by using __always_inline for all
    functions defined in net/checksum.h
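
    For example, csum_sub() itself becomes:

    static __always_inline __wsum csum_sub(__wsum csum, __wsum addend)
    {
            return csum_add(csum, ~addend);
    }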

    vmlinux size is even reduced by 256 bytes with this patch:

       text    data    bss     dec    hex filename
    6980022 2515362 194384 9689768 93daa8 vmlinux.before
    6979862 2515266 194384 9689512 93d9a8 vmlinux.now

    Fixes: ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly")
    Cc: Masahiro Yamada
    Cc: Nick Desaulniers
    Cc: Andrew Morton
    Signed-off-by: Christophe Leroy
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit d9b5ae5c1b241b91480aa30408be12fe91af834a upstream.

    Ipv6 ttl, label and tos fields are modified without first
    pulling/pushing the ipv6 header, which would have updated
    the hw csum (if available). This might cause csum validation
    failures when sending the packet to the stack, as can be seen in
    the trace below.

    Fix this by updating skb->csum if available.
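
    A minimal sketch of the idea (not the exact helpers the patch uses;
    old_word/new_word stand for the 32-bit words holding the field
    before and after the change):

    if (skb->ip_summed == CHECKSUM_COMPLETE)
            skb->csum = csum_add(csum_sub(skb->csum, old_word), new_word);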

    Trace resulting from an ipv6 ttl decrement followed by sending the
    packet to conntrack [actions: set(ipv6(hlimit=63)),ct(zone=99)]:
    [295241.900063] s_pf0vf2: hw csum failure
    [295241.923191] Call Trace:
    [295241.925728]
    [295241.927836] dump_stack+0x5c/0x80
    [295241.931240] __skb_checksum_complete+0xac/0xc0
    [295241.935778] nf_conntrack_tcp_packet+0x398/0xba0 [nf_conntrack]
    [295241.953030] nf_conntrack_in+0x498/0x5e0 [nf_conntrack]
    [295241.958344] __ovs_ct_lookup+0xac/0x860 [openvswitch]
    [295241.968532] ovs_ct_execute+0x4a7/0x7c0 [openvswitch]
    [295241.979167] do_execute_actions+0x54a/0xaa0 [openvswitch]
    [295242.001482] ovs_execute_actions+0x48/0x100 [openvswitch]
    [295242.006966] ovs_dp_process_packet+0x96/0x1d0 [openvswitch]
    [295242.012626] ovs_vport_receive+0x6c/0xc0 [openvswitch]
    [295242.028763] netdev_frame_hook+0xc0/0x180 [openvswitch]
    [295242.034074] __netif_receive_skb_core+0x2ca/0xcb0
    [295242.047498] netif_receive_skb_internal+0x3e/0xc0
    [295242.052291] napi_gro_receive+0xba/0xe0
    [295242.056231] mlx5e_handle_rx_cqe_mpwrq_rep+0x12b/0x250 [mlx5_core]
    [295242.062513] mlx5e_poll_rx_cq+0xa0f/0xa30 [mlx5_core]
    [295242.067669] mlx5e_napi_poll+0xe1/0x6b0 [mlx5_core]
    [295242.077958] net_rx_action+0x149/0x3b0
    [295242.086762] __do_softirq+0xd7/0x2d6
    [295242.090427] irq_exit+0xf7/0x100
    [295242.093748] do_IRQ+0x7f/0xd0
    [295242.096806] common_interrupt+0xf/0xf
    [295242.100559]
    [295242.102750] RIP: 0033:0x7f9022e88cbd
    [295242.125246] RSP: 002b:00007f9022282b20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda
    [295242.132900] RAX: 0000000000000005 RBX: 0000000000000010 RCX: 0000000000000000
    [295242.140120] RDX: 00007f9022282ba8 RSI: 00007f9022282a30 RDI: 00007f9014005c30
    [295242.147337] RBP: 00007f9014014d60 R08: 0000000000000020 R09: 00007f90254a8340
    [295242.154557] R10: 00007f9022282a28 R11: 0000000000000246 R12: 0000000000000000
    [295242.161775] R13: 00007f902308c000 R14: 000000000000002b R15: 00007f9022b71f40

    Fixes: 3fdbd1ce11e5 ("openvswitch: add ipv6 'set' action")
    Signed-off-by: Paul Blakey
    Link: https://lore.kernel.org/r/20220223163416.24096-1-paulb@nvidia.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Paul Blakey
     
  • commit 5eaed6eedbe9612f642ad2b880f961d1c6c8ec2b upstream.

    The patch in [1] intends to fix a bpf_timer related issue,
    but the fix caused the existing 'timer' selftest to fail with
    a hang or some random errors. After some debugging, I found
    an issue with check_and_init_map_value() in hashtab.c.
    More specifically, in hashtab.c, we have code
            l_new = bpf_map_kmalloc_node(&htab->map, ...)
            check_and_init_map_value(&htab->map, l_new...)
    Note that bpf_map_kmalloc_node() does not do initialization,
    so l_new contains a random value.

    The function check_and_init_map_value() intends to zero the
    bpf_spin_lock and bpf_timer if they exist in the map.
    But I found bpf_spin_lock is zero'ed but bpf_timer is not zero'ed.
    With [1], later copy_map_value() skips copying of
    bpf_spin_lock and bpf_timer. The non-zero bpf_timer caused
    random failures for 'timer' selftest.
    Without [1], in both the bpf_spin_lock and bpf_timer cases,
    bpf_timer will be zero'ed, so the 'timer' selftest is okay.

    So why does check_and_init_map_value() zero bpf_spin_lock
    properly but not bpf_timer? In the bpf uapi header, we have
    struct bpf_spin_lock {
            __u32 val;
    };
    struct bpf_timer {
            __u64 :64;
            __u64 :64;
    } __attribute__((aligned(8)));

    The initialization code:
            *(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
                    (struct bpf_spin_lock){};
            *(struct bpf_timer *)(dst + map->timer_off) =
                    (struct bpf_timer){};
    It appears the compiler has no obligation to initialize anonymous fields.
    For example, let us use clang with bpf target as below:
    $ cat t.c
    struct bpf_timer {
            unsigned long long :64;
    };
    struct bpf_timer2 {
            unsigned long long a;
    };

    void test(struct bpf_timer *t) {
            *t = (struct bpf_timer){};
    }
    void test2(struct bpf_timer2 *t) {
            *t = (struct bpf_timer2){};
    }
    $ clang -target bpf -O2 -c -g t.c
    $ llvm-objdump -d t.o
    ...
    0000000000000000 <test>:
           0: 95 00 00 00 00 00 00 00  exit
    0000000000000008 <test2>:
           1: b7 02 00 00 00 00 00 00  r2 = 0
           2: 7b 21 00 00 00 00 00 00  *(u64 *)(r1 + 0) = r2
           3: 95 00 00 00 00 00 00 00  exit

    gcc 11.2 does not have the above issue. But from
    INTERNATIONAL STANDARD ISO/IEC 9899:201x
    Programming languages — C
    (http://www.open-std.org/Jtc1/sc22/wg14/www/docs/n1547.pdf, page 157):
    Except where explicitly stated otherwise, for the purposes of
    this subclause unnamed members of objects of structure and union
    type do not participate in initialization. Unnamed members of
    structure objects have indeterminate value even after initialization.

    To fix the problem, let us use memset for the bpf_timer case in
    check_and_init_map_value(). For consistency, memset is also
    used for the bpf_spin_lock case.

    [1] https://lore.kernel.org/bpf/20220209070324.1093182-2-memxor@gmail.com/

    Fixes: 68134668c17f3 ("bpf: Add map side support for bpf timers.")
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20220211194953.3142152-1-yhs@fb.com
    Signed-off-by: Greg Kroah-Hartman

    Yonghong Song