24 Nov, 2020

1 commit

  • In the patchset merged by commit b9fcf0a0d826
    ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
    did not have header_ops were given one for the purpose of protocol parsing
    on af_packet transmit path.

    That change made af_packet receive path regard these devices as having a
    visible L3 header and therefore aligned incoming skb->data to point to the
    skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
    reset their mac_header prior to ingress and therefore their incoming
    packets became malformed.

    Ideally these devices would reset their mac headers, or af_packet would be
    able to rely on dev->hard_header_len being 0 for such cases, but it seems
    this is not the case.

    Fix by changing af_packet RX ll visibility criteria to include the
    existence of a '.create()' header operation, which is used when creating
    a device hard header - via dev_hard_header() - by upper layers, and does
    not exist in these L3 devices.

    As this predicate may be useful in other situations, add it as a common
    dev_has_header() helper in netdevice.h.

    Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
    Signed-off-by: Eyal Birger
    Acked-by: Jason A. Donenfeld
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com
    Signed-off-by: Jakub Kicinski

    Eyal Birger
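
The predicate described in this commit can be sketched in userspace C. This is a minimal illustration, not the kernel source: the two structures are heavily simplified stand-ins, and the stub functions and the `eth_like_ops` / `l3_ops` tables are hypothetical names introduced only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct header_ops {
	/* Builds a link-layer header; absent (NULL) on L3 devices. */
	int (*create)(void *buf, size_t len);
	/* Parses the protocol; may exist even when create does not. */
	int (*parse_protocol)(const void *buf);
};

struct net_device {
	const struct header_ops *header_ops;
};

/* Stub callbacks, used only to populate the example tables below. */
static int stub_create(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return 0;
}

static int stub_parse(const void *buf)
{
	(void)buf;
	return 0;
}

/* A device with a visible link-layer header (Ethernet-like). */
static const struct header_ops eth_like_ops = {
	.create = stub_create,
	.parse_protocol = stub_parse,
};

/* An L3 device that was given header_ops only for protocol parsing. */
static const struct header_ops l3_ops = {
	.create = NULL,
	.parse_protocol = stub_parse,
};

/* True only when the device is able to build a link-layer header. */
static bool dev_has_header(const struct net_device *dev)
{
	assert(dev != NULL);
	return dev->header_ops && dev->header_ops->create;
}
```

With this check, a device that has header_ops but no create operation is treated the same as a device with no header_ops at all.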
     

20 Sep, 2020

1 commit

  • skb->nh.raw was renamed to skb->network_header in 2007, in
    commit b0e380b1d8a8 ("[SK_BUFF]: unions of just one member don't get
    anything done, kill them")

    So here we change it to the new name.

    Cc: Willem de Bruijn
    Signed-off-by: Xie He
    Signed-off-by: David S. Miller

    Xie He
     

18 Sep, 2020

1 commit

  • 1. Change all "dev->hard_header" to "dev->header_ops"

    2. On receiving incoming frames when header_ops == NULL:

    The comment only says what is wrong, but doesn't say what is right.
    This patch changes the comment to make it clear what is right.

    3. On transmitting and receiving outgoing frames when header_ops == NULL:

    The comment explains that the LL header will be later added by the driver.

    However, I think it's better to simply say that the LL header is invisible
    to us. This phrasing is better from a software engineering perspective,
    because this makes it clear that what happens in the driver should be
    hidden from us and we should not care about what happens internally in the
    driver.

    4. On resuming the LL header (for RAW frames) when header_ops == NULL:

    The comment says we are "unlikely" to restore the LL header.

    However, we should say that we are "unable" to restore it.
    It's not possible (rather than not likely) to restore it, because:

    1) There is no way for us to restore because the LL header internally
    processed by the driver should be invisible to us.

    2) In function packet_rcv and tpacket_rcv, the code only tries to restore
    the LL header when header_ops != NULL.

    Cc: Willem de Bruijn
    Signed-off-by: Xie He
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Xie He
     

15 Sep, 2020

1 commit

  • This comment is outdated and no longer reflects the actual implementation
    of af_packet.c.

    Reasons for the new comment:

    1.

    In af_packet.c, the function packet_snd first reserves a headroom of
    length (dev->hard_header_len + dev->needed_headroom).
    Then if the socket is a SOCK_DGRAM socket, it calls dev_hard_header,
    which calls dev->header_ops->create, to create the link layer header.
    If the socket is a SOCK_RAW socket, it "un-reserves" a headroom of
    length (dev->hard_header_len), and checks if the user has provided a
    header sized between (dev->min_header_len) and (dev->hard_header_len)
    (in dev_validate_header).
    This shows the developers of af_packet.c expect hard_header_len to
    be consistent with header_ops.

    2.

    In af_packet.c, the function packet_sendmsg_spkt has a FIXME comment.
    That comment states that prepending an LL header internally in a driver
    is considered a bug. I believe this bug can be fixed by setting
    hard_header_len to 0, making the internal header completely invisible
    to af_packet.c (and requesting the headroom in needed_headroom instead).

    3.

    There is a commit for a WiFi driver:
    commit 9454f7a895b8 ("mwifiex: set needed_headroom, not hard_header_len")
    According to the discussion about it at:
    https://patchwork.kernel.org/patch/11407493/
    The author tried to set the WiFi driver's hard_header_len to the Ethernet
    header length, and request additional header space internally needed by
    setting needed_headroom.
    This means this usage is already adopted by driver developers.

    Cc: Willem de Bruijn
    Cc: Eric Dumazet
    Cc: Brian Norris
    Cc: Cong Wang
    Signed-off-by: Xie He
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Xie He
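
The headroom arithmetic from point 1 can be sketched as follows. This is a simplified userspace model under stated assumptions: `struct dev_hdr_info` and both helper names are hypothetical, and the header validation step is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the device fields discussed above. */
struct dev_hdr_info {
	unsigned int hard_header_len;  /* LL header length visible to af_packet */
	unsigned int needed_headroom;  /* extra space the driver needs internally */
};

/* Headroom reserved up front, before any header or payload is placed. */
static unsigned int reserved_headroom(const struct dev_hdr_info *dev)
{
	assert(dev != NULL);
	return dev->hard_header_len + dev->needed_headroom;
}

/*
 * Offset at which the data copied from user space starts.
 * SOCK_RAW "un-reserves" hard_header_len because the user supplies the
 * LL header; SOCK_DGRAM keeps it so the header can be created in place.
 */
static unsigned int user_data_offset(const struct dev_hdr_info *dev,
				     bool sock_raw)
{
	unsigned int reserve = reserved_headroom(dev);

	return sock_raw ? reserve - dev->hard_header_len : reserve;
}
```

When hard_header_len is 0 the two socket types start at the same offset, which is the sense in which an internally added header is invisible to af_packet.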
     

07 Sep, 2020

1 commit


05 Sep, 2020

1 commit

  • Using tp_reserve to calculate netoff can overflow as
    tp_reserve is unsigned int and netoff is unsigned short.

    This may lead to macoff receiving a smaller value than
    sizeof(struct virtio_net_hdr), and if po->has_vnet_hdr
    is set, an out-of-bounds write will occur when
    calling virtio_net_hdr_from_skb.

    The bug is fixed by converting netoff to unsigned int
    and checking if it exceeds USHRT_MAX.

    This addresses CVE-2020-14386

    Fixes: 8913336a7e8d ("packet: add PACKET_RESERVE sockopt")
    Signed-off-by: Or Cohen
    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds

    Or Cohen
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

14 Aug, 2020

1 commit

  • After @blk_fill_in_prog_lock is acquired there is an early out vnet
    situation that can occur. In that case, the rwlock needs to be
    released.

    Also, since @blk_fill_in_prog_lock is only acquired when @tp_version
    is exactly TPACKET_V3, only release it on that exact condition as
    well.

    And finally, add sparse annotation so that it is clearer that
    prb_fill_curr_block() and prb_clear_blk_fill_status() are acquiring
    and releasing @blk_fill_in_prog_lock, respectively. sparse is still
    unable to understand the balance, but the warnings are now on a
    higher level that make more sense.

    Fixes: 632ca50f2cbd ("af_packet: TPACKET_V3: replace busy-wait loop")
    Signed-off-by: John Ogness
    Reported-by: kernel test robot
    Signed-off-by: David S. Miller

    John Ogness
     

25 Jul, 2020

2 commits


20 Jul, 2020

2 commits


17 Jul, 2020

1 commit

  • A busy-wait loop is used to implement waiting for bits to be copied
    from the skb to the kernel buffer before retiring a block. This is
    a problem on PREEMPT_RT because the copying task could be preempted
    by the busy-waiting task and thus live lock in the busy-wait loop.

    Replace the busy-wait logic with an rwlock_t. This provides lockdep
    coverage and makes the code RT ready.

    Signed-off-by: John Ogness
    Signed-off-by: Jakub Kicinski

    John Ogness
     

02 Jul, 2020

1 commit


14 Jun, 2020

1 commit

  • Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
    '---help---'"), the number of '---help---' has been gradually
    decreasing, but there are still more than 2400 instances.

    This commit finishes the conversion. While I touched the lines,
    I also fixed the indentation.

    There are a variety of indentation styles found.

    a) 4 spaces + '---help---'
    b) 7 spaces + '---help---'
    c) 8 spaces + '---help---'
    d) 1 space + 1 tab + '---help---'
    e) 1 tab + '---help---' (correct indentation)
    f) 1 tab + 1 space + '---help---'
    g) 1 tab + 2 spaces + '---help---'

    In order to convert all of them to 1 tab + 'help', I ran the
    following command:

    $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

15 Mar, 2020

1 commit

  • PACKET_RX_RING can cause multiple writers to access the same slot if a
    fast writer wraps the ring while a slow writer is still copying. This
    is particularly likely with few, large, slots (e.g., GSO packets).

    Synchronize kernel thread ownership of rx ring slots with a bitmap.

    Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
    while holding the sk receive queue lock. They release this lock before
    copying and set tp_status to TP_STATUS_USER to release to userspace
    when done. During copying, another writer may take the lock, also see
    TP_STATUS_KERNEL, and start writing to the same slot.

    Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
    slot, test and set with the lock held. To release race-free, update
    tp_status and owner bit as a transaction, so take the lock again.

    This is one of a variety of discussed options (see Link below):

    * instead of a shadow ring, embed the data in the slot itself, such as
    in tp_padding. But any test for this field may match a value left by
    userspace, causing deadlock.

    * avoid the lock on release. This leaves a small race if releasing the
    shadow slot before setting TP_STATUS_USER. The below reproducer showed
    that this race is not academic. If releasing the slot after tp_status,
    the race is more subtle. See the first link for details.

    * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
    of two fields. But, legacy applications may interpret all non-zero
    tp_status as owned by the user. As libpcap does. So this is possible
    only opt-in by newer processes. It can be added as an optional mode.

    * embed the struct at the tail of pg_vec to avoid extra allocation.
    The implementation proved no less complex than a separate field.

    The additional locking cost on release adds contention, no different
    than scaling on multicore or multiqueue h/w. In practice, neither the below
    reproducer nor small packet tcpdump showed a noticeable change in
    perf report in cycles spent in spinlock. Where contention is
    problematic, packet sockets support mitigation through PACKET_FANOUT.
    And we can consider adding opt-in state TP_KERNEL_OWNED.

    Easy to reproduce by running multiple netperf or similar TCP_STREAM
    flows concurrently with `tcpdump -B 129 -n greater 60000`.

    Based on an earlier patchset by Jon Rosen. See links below.

    I believe this issue goes back to the introduction of tpacket_rcv,
    which predates git history.

    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
    Suggested-by: Jon Rosen
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Jon Rosen
    Signed-off-by: David S. Miller

    Willem de Bruijn
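
The acquire and release rules described in this commit can be modeled in a few lines. This is a single-threaded sketch under stated assumptions: the ring size, the `SLOT_*` constants, and the function names are hypothetical, and the receive queue lock is not modeled, only noted in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS  8u  /* hypothetical ring size */
#define SLOT_KERNEL 0u  /* stand-in for TP_STATUS_KERNEL */
#define SLOT_USER   1u  /* stand-in for TP_STATUS_USER */

struct rx_ring {
	uint32_t status[RING_SLOTS]; /* per-slot status, visible to user space */
	uint32_t owner_map;          /* one bit per slot, private to the kernel */
};

/* Acquire a slot. The caller is assumed to hold the receive queue lock. */
static bool slot_acquire(struct rx_ring *r, unsigned int idx)
{
	assert(idx < RING_SLOTS);
	if (r->status[idx] != SLOT_KERNEL)
		return false; /* slot has not been returned by user space */
	if (r->owner_map & (1u << idx))
		return false; /* another writer is still copying into it */
	r->owner_map |= 1u << idx;
	return true;
}

/*
 * Release a slot. The caller is assumed to take the lock again so that
 * the status update and the owner bit change appear as one transaction.
 */
static void slot_release(struct rx_ring *r, unsigned int idx)
{
	assert(idx < RING_SLOTS);
	r->status[idx] = SLOT_USER;
	r->owner_map &= ~(1u << idx);
}
```

The second acquire attempt on a slot fails even though its status still reads as kernel-owned, which is the case the status field alone could not detect.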
     

12 Mar, 2020

1 commit

  • In one error case, tpacket_rcv drops packets after incrementing the
    ring producer index.

    If this happens, it does not update tp_status to TP_STATUS_USER and
    thus the reader is stalled for an iteration of the ring, causing out
    of order arrival.

    The only such error path is when virtio_net_hdr_from_skb fails due
    to encountering an unknown GSO type.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

30 Jan, 2020

1 commit

  • …/kernel/git/arnd/playground

    Pull y2038 updates from Arnd Bergmann:
    "Core, driver and file system changes

    These are updates to device drivers and file systems that for some
    reason or another were not included in the kernel in the previous
    y2038 series.

    I've gone through all users of time_t again to make sure the kernel is
    in a long-term maintainable state, replacing all remaining references
    to time_t with safe alternatives.

    Some related parts of the series were picked up into the nfsd, xfs,
    alsa and v4l2 trees. A final set of patches in linux-mm removes the
    now unused time_t/timeval/timespec types and helper functions after
    all five branches are merged for linux-5.6, ensuring that no new users
    get merged.

    As a result, linux-5.6, or my backport of the patches to 5.4 [1],
    should be the first release that can serve as a base for a 32-bit
    system designed to run beyond year 2038, with a few remaining caveats:

    - All user space must be compiled with a 64-bit time_t, which will be
    supported in the coming musl-1.2 and glibc-2.32 releases, along
    with installed kernel headers from linux-5.6 or higher.

    - Applications that use the system call interfaces directly need to
    be ported to use the time64 syscalls added in linux-5.1 in place of
    the existing system calls. This impacts most users of futex() and
    seccomp() as well as programming languages that have their own
    runtime environment not based on libc.

    - Applications that use a private copy of kernel uapi header files or
    their contents may need to update to the linux-5.6 version, in
    particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
    linux/elfcore.h, linux/sockios.h, linux/timex.h and
    linux/can/bcm.h.

    - A few remaining interfaces cannot be changed to pass a 64-bit
    time_t in a compatible way, so they must be configured to use
    CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
    timestamps. Most importantly this impacts all users of 'struct
    input_event'.

    - All y2038 problems that are present on 64-bit machines also apply
    to 32-bit machines. In particular this affects file systems with
    on-disk timestamps using signed 32-bit seconds: ext4 with
    ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"

    [1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame

    * tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
    Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
    y2038: sh: remove timeval/timespec usage from headers
    y2038: sparc: remove use of struct timex
    y2038: rename itimerval to __kernel_old_itimerval
    y2038: remove obsolete jiffies conversion functions
    nfs: fscache: use timespec64 in inode auxdata
    nfs: fix timstamp debug prints
    nfs: use time64_t internally
    sunrpc: convert to time64_t for expiry
    drm/etnaviv: avoid deprecated timespec
    drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
    drm/msm: avoid using 'timespec'
    hfs/hfsplus: use 64-bit inode timestamps
    hostfs: pass 64-bit timestamps to/from user space
    packet: clarify timestamp overflow
    tsacct: add 64-bit btime field
    acct: stop using get_seconds()
    um: ubd: use 64-bit time_t where possible
    xtensa: ISS: avoid struct timeval
    dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
    ...

    Linus Torvalds
     

27 Dec, 2019

1 commit


19 Dec, 2019

1 commit

  • The memory mapped packet socket data structures in versions 1 through 3
    all contain 32-bit second values for the packet time stamps, which makes
    them suffer from the overflow of time_t in y2038 or y2106 (depending
    on whether user space interprets the value as signed or unsigned).

    The implementation uses the deprecated getnstimeofday() function.

    In order to get rid of that, this changes the code to use
    ktime_get_real_ts64() as a replacement, documenting the nature of the
    overflow. As long as the user applications treat the timestamps as
    unsigned, or only use the difference between timestamps, they are
    fine, and changing the timestamps to 64-bit would require a more
    invasive user space API change.

    Note: a lot of other APIs suffer from incompatible structures when
    time_t gets redefined to 64-bit in 32-bit user space, but this one
    does not.

    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/lkml/CAF=yD-Jomr-gWSR-EBNKnSpFL46UeG564FLfqTCMNEm-prEaXA@mail.gmail.com/T/#u
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
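
The claim about unsigned interpretation and timestamp differences can be checked with plain integer arithmetic. This is a minimal sketch; the helper names are hypothetical and are not part of the packet socket API.

```c
#include <assert.h>
#include <stdint.h>

/* Store 64-bit seconds in a 32-bit field; the value wraps modulo 2^32. */
static uint32_t ts_sec_to_ring(int64_t sec)
{
	assert(sec >= 0);
	return (uint32_t)sec;
}

/*
 * Elapsed seconds between two stored values. Unsigned subtraction also
 * wraps modulo 2^32, so the result stays correct across a wrap as long
 * as the real interval is shorter than 2^32 seconds.
 */
static uint32_t ts_sec_delta(uint32_t later, uint32_t earlier)
{
	return later - earlier;
}
```

A reader that treats the field as unsigned sees correct absolute values past the 2038 boundary, and differences remain usable even after the field wraps.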
     

10 Dec, 2019

1 commit

  • There is a soft lockup when using TPACKET_V3:
    ...
    NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms!
    (__irq_svc) from [] (_raw_spin_unlock_irqrestore+0x44/0x54)
    (_raw_spin_unlock_irqrestore) from [] (mod_timer+0x210/0x25c)
    (mod_timer) from []
    (prb_retire_rx_blk_timer_expired+0x68/0x11c)
    (prb_retire_rx_blk_timer_expired) from []
    (call_timer_fn+0x90/0x17c)
    (call_timer_fn) from [] (run_timer_softirq+0x2d4/0x2fc)
    (run_timer_softirq) from [] (__do_softirq+0x218/0x318)
    (__do_softirq) from [] (irq_exit+0x88/0xac)
    (irq_exit) from [] (msa_irq_exit+0x11c/0x1d4)
    (msa_irq_exit) from [] (handle_IPI+0x650/0x7f4)
    (handle_IPI) from [] (gic_handle_irq+0x108/0x118)
    (gic_handle_irq) from [] (__irq_usr+0x44/0x5c)
    ...

    If __ethtool_get_link_ksettings() fails in
    prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies
    is zero and the timer expiry for retire_blk_timer becomes
    mod_timer(&pkc->retire_blk_timer, jiffies + 0),
    which drives softirq CPU usage to 100%.

    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Tested-by: Xiao Jiangfeng
    Signed-off-by: Mao Wenan
    Signed-off-by: David S. Miller

    Mao Wenan
     

09 Nov, 2019

1 commit

  • KCSAN reported the following data-race [1]

    Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.

    Since the report hinted about multiple cpus using the history
    concurrently, I added a test avoiding writing on it if the
    victim slot already contains the desired value.

    [1]

    BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover

    read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1:
    fanout_flow_is_huge net/packet/af_packet.c:1303 [inline]
    fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353
    packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
    deliver_skb net/core/dev.c:1888 [inline]
    dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
    xmit_one net/core/dev.c:3195 [inline]
    dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
    __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0:
    fanout_flow_is_huge net/packet/af_packet.c:1306 [inline]
    fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353
    packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
    deliver_skb net/core/dev.c:1888 [inline]
    dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
    xmit_one net/core/dev.c:3195 [inline]
    dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
    __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 3b3a5b0aab5b ("packet: rollover huge flows before small flows")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
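
The pattern of marked accesses plus a write-avoiding test can be sketched in userspace. This is an approximation under stated assumptions: relaxed C11 atomics stand in for READ_ONCE()/WRITE_ONCE(), `HISTORY_LEN` is a hypothetical size, and the victim index is passed in by the caller so that the example is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_LEN 16u /* hypothetical history size */

/* Flow hash history shared by all CPUs handling the socket. */
static _Atomic uint32_t history[HISTORY_LEN];

/* Returns true when rxhash already fills more than half of the history. */
static bool flow_is_huge(uint32_t rxhash, unsigned int victim)
{
	unsigned int i, count = 0;

	assert(victim < HISTORY_LEN);

	/* Marked loads: concurrent writers may change entries at any time. */
	for (i = 0; i < HISTORY_LEN; i++)
		if (atomic_load_explicit(&history[i],
					 memory_order_relaxed) == rxhash)
			count++;

	/* Skip the store when the victim slot already holds the value. */
	if (atomic_load_explicit(&history[victim],
				 memory_order_relaxed) != rxhash)
		atomic_store_explicit(&history[victim], rxhash,
				      memory_order_relaxed);

	return count > HISTORY_LEN / 2;
}
```

Skipping the redundant store keeps the slot from being rewritten when several CPUs record the same flow.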
     

02 Oct, 2019

1 commit

  • commit 174e23810cd31
    ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
    recycle always drop skb extensions. The additional skb_ext_del() that is
    performed via nf_reset on napi skb recycle is not needed anymore.

    Most nf_reset() calls in the stack are there so queued skb won't block
    'rmmod nf_conntrack' indefinitely.

    This removes the skb_ext_del from nf_reset, and renames it to a more
    fitting nf_reset_ct().

    In a few selected places, add a call to skb_ext_reset to make sure that
    no active extensions remain.

    I am submitting this for "net", because we're still early in the release
    cycle. The patch applies to net-next too, but I think the rename causes
    needless divergence between those trees.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

16 Aug, 2019

1 commit

  • packet_sendmsg() checks tx_ring.pg_vec to decide
    if it must call tpacket_snd().

    The problem is that the check is lockless, meaning another thread
    can issue a concurrent setsockopt(PACKET_TX_RING) to flip
    tx_ring.pg_vec back to NULL.

    Given that tpacket_snd() grabs pg_vec_lock mutex, we can
    perform the check again to solve the race.

    syzbot reported:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 11429 Comm: syz-executor394 Not tainted 5.3.0-rc4+ #101
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:packet_lookup_frame+0x8d/0x270 net/packet/af_packet.c:474
    Code: c1 ee 03 f7 73 0c 80 3c 0e 00 0f 85 cb 01 00 00 48 8b 0b 89 c0 4c 8d 24 c1 48 b8 00 00 00 00 00 fc ff df 4c 89 e1 48 c1 e9 03 3c 01 00 0f 85 94 01 00 00 48 8d 7b 10 4d 8b 3c 24 48 b8 00 00
    RSP: 0018:ffff88809f82f7b8 EFLAGS: 00010246
    RAX: dffffc0000000000 RBX: ffff8880a45c7030 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 1ffff110148b8e06 RDI: ffff8880a45c703c
    RBP: ffff88809f82f7e8 R08: ffff888087aea200 R09: fffffbfff134ae50
    R10: fffffbfff134ae4f R11: ffffffff89a5727f R12: 0000000000000000
    R13: 0000000000000001 R14: ffff8880a45c6ac0 R15: 0000000000000000
    FS: 00007fa04716f700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa04716edb8 CR3: 0000000091eb4000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    packet_current_frame net/packet/af_packet.c:487 [inline]
    tpacket_snd net/packet/af_packet.c:2667 [inline]
    packet_sendmsg+0x590/0x6250 net/packet/af_packet.c:2975
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:657
    ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
    __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
    do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jun, 2019

1 commit

  • The new route handling in ip_mc_finish_output() from 'net' overlapped
    with the new support for returning congestion notifications from BPF
    programs.

    In order to handle this I had to take the dev_loopback_xmit() calls
    out of the switch statement.

    The aquantia driver conflicts were simple overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Jun, 2019

1 commit

  • When an application is run that:
    a) Sets its scheduler to be SCHED_FIFO
    and
    b) Opens a memory mapped AF_PACKET socket, and sends frames with the
    MSG_DONTWAIT flag cleared, it's possible for the application to hang
    forever in the kernel. This occurs because when waiting, the code in
    tpacket_snd calls schedule, which under normal circumstances allows
    other tasks to run, including ksoftirqd, which in some cases is
    responsible for freeing the transmitted skb (which in AF_PACKET calls a
    destructor that flips the status bit of the transmitted frame back to
    available, allowing the transmitting task to complete).

    However, when the calling application is SCHED_FIFO, its priority is
    such that the schedule call immediately places the task back on the cpu,
    preventing ksoftirqd from freeing the skb, which in turn prevents the
    transmitting task from detecting that the transmission is complete.

    We can fix this by converting the schedule call to a completion
    mechanism. By using a completion queue, we force the calling task, when
    it detects there are no more frames to send, to schedule itself off the
    cpu until such time as the last transmitted skb is freed, allowing
    forward progress to be made.

    Tested by myself and the reporter, with good results

    Change Notes:

    V1->V2:
    Enhance the sleep logic to support being interruptible and
    to honor SK_SNDTIMEO (Willem de Bruijn)

    V2->V3:
    Rearrange the point at which we wait for the completion queue, to
    avoid needing to check for ph/skb being null at the end of the loop.
    Also move the complete call to the skb destructor to avoid needing to
    modify __packet_set_status. Also gate calling complete on
    packet_read_pending returning zero to avoid multiple calls to complete.
    (Willem de Bruijn)

    Move timeo computation within loop, to re-fetch the socket
    timeout since we also use the timeo variable to record the return code
    from the wait_for_complete call (Neil Horman)

    V3->V4:
    Willem has requested that the control flow be restored to the
    previous state. Doing so lets us eliminate the need for the
    po->wait_on_complete flag variable, and lets us get rid of the
    packet_next_frame function, but introduces another complexity.
    Specifically, by using the packet pending count, if an
    application calls sendmsg multiple times with MSG_DONTWAIT set, each
    set of transmitted frames, when complete, will cause
    tpacket_destruct_skb to issue a complete call, for which there will
    never be a wait_on_completion call. This imbalance will lead any
    future call to wait_for_completion here to return early, when the frames
    it sent may not have completed. To correct this, we need to re-init
    the completion queue on every call to tpacket_snd before we enter the
    loop so as to ensure we wait properly for the frames we send in this
    iteration.

    Change the timeout and interrupted gotos to out_put rather than
    out_status so that we don't try to free a non-existent skb.
    Clean up some extra newlines (Willem de Bruijn)

    Reviewed-by: Willem de Bruijn
    Signed-off-by: Neil Horman
    Reported-by: Matteo Croce
    Signed-off-by: David S. Miller

    Neil Horman
     

24 Jun, 2019

1 commit


15 Jun, 2019

8 commits


12 Jun, 2019

1 commit


08 Jun, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.

    2) Read SFP eeprom in max 16 byte increments to avoid problems with
    some SFP modules, from Russell King.

    3) Fix UDP socket lookup wrt. VRF, from Tim Beale.

    4) Handle route invalidation properly in s390 qeth driver, from Julian
    Wiedmann.

    5) Memory leak on unload in RDS, from Zhu Yanjun.

    6) sctp_process_init leak, from Neil Horman.

    7) Fix fib_rules rule insertion semantic change that broke Android,
    from Hangbin Liu.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
    pktgen: do not sleep with the thread lock held.
    net: mvpp2: Use strscpy to handle stat strings
    net: rds: fix memory leak in rds_ib_flush_mr_pool
    ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
    ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
    Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
    net: aquantia: fix wol configuration not applied sometimes
    ethtool: fix potential userspace buffer overflow
    Fix memory leak in sctp_process_init
    net: rds: fix memory leak when unload rds_rdma
    ipv6: fix the check before getting the cookie in rt6_get_cookie
    ipv4: not do cache for local delivery if bc_forwarding is enabled
    s390/qeth: handle error when updating TX queue count
    s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
    s390/qeth: check dst entry before use
    s390/qeth: handle limited IPv4 broadcast in L3 TX path
    net: fix indirect calls helpers for ptype list hooks.
    net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
    udp: only choose unbound UDP socket for multicast when not in a VRF
    net/tls: replace the sleeping lock around RX resync with a bit lock
    ...

    Linus Torvalds
     

03 Jun, 2019

1 commit

  • Rollover used to use a complex RCU mechanism for assignment, which had
    a race condition. The below patch fixed the bug and greatly simplified
    the logic.

    The feature depends on fanout, but the state is private to the socket.
    Fanout_release returns f only when the last member leaves and the
    fanout struct is to be freed.

    Destroy rollover unconditionally, regardless of fanout state.

    Fixes: 57f015f5eccf2 ("packet: fix crash in fanout_demux_rollover()")
    Reported-by: syzbot
    Diagnosed-by: Dmitry Vyukov
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

1 commit