10 Sep, 2020

1 commit

  • [ Upstream commit acf69c946233259ab4d64f8869d4037a198c7f06 ]

    Using tp_reserve to calculate netoff can overflow as
    tp_reserve is unsigned int and netoff is unsigned short.

    This may lead to macoff receiving a smaller value than
    sizeof(struct virtio_net_hdr), and if po->has_vnet_hdr
    is set, an out-of-bounds write will occur when
    calling virtio_net_hdr_from_skb.

    The bug is fixed by converting netoff to unsigned int
    and checking if it exceeds USHRT_MAX.
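    The arithmetic can be sketched in plain C (a standalone illustration,
    not the kernel code; function names are made up for the sketch):

```c
#include <limits.h>

/* Before the fix: tp_reserve is unsigned int but netoff was unsigned
 * short, so the sum could wrap to a value smaller than
 * sizeof(struct virtio_net_hdr). */
unsigned short old_netoff(unsigned int hdrlen, unsigned int tp_reserve)
{
    return (unsigned short)(hdrlen + tp_reserve); /* silently wraps */
}

/* After the fix: compute in unsigned int and reject anything above
 * USHRT_MAX (the kernel returns -EINVAL on that path). */
int new_netoff(unsigned int hdrlen, unsigned int tp_reserve,
               unsigned short *out)
{
    unsigned int netoff = hdrlen + tp_reserve;

    if (netoff > USHRT_MAX)
        return -1; /* would overflow the 16-bit ring offset */
    *out = (unsigned short)netoff;
    return 0;
}
```

    With hdrlen 100 and tp_reserve 65500 the old sum 65600 truncates to 64,
    which is what made the later macoff computation unsafe.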

    This addresses CVE-2020-14386

    Fixes: 8913336a7e8d ("packet: add PACKET_RESERVE sockopt")
    Signed-off-by: Or Cohen
    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Or Cohen
     

19 Aug, 2020

1 commit

  • [ Upstream commit 88fd1cb80daa20af063bce81e1fad14e945a8dc4 ]

    After @blk_fill_in_prog_lock is acquired, there is an early-out vnet
    path that can be taken. In that case, the rwlock needs to be
    released.

    Also, since @blk_fill_in_prog_lock is only acquired when @tp_version
    is exactly TPACKET_V3, only release it on that exact condition as
    well.

    And finally, add sparse annotation so that it is clearer that
    prb_fill_curr_block() and prb_clear_blk_fill_status() are acquiring
    and releasing @blk_fill_in_prog_lock, respectively. sparse is still
    unable to understand the balance, but the warnings are now at a
    higher level and make more sense.
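    A standalone sketch of the pattern (pthread-free stand-ins; the real
    code uses read_lock/read_unlock on the per-ring rwlock, and the
    annotations only matter to sparse):

```c
/* Outside a sparse run, the lock annotations compile away to nothing. */
#ifndef __CHECKER__
#define __acquires(x)
#define __releases(x)
#endif

/* Stand-in for the read-hold count on @blk_fill_in_prog_lock. */
static int blk_fill_in_prog_readers;

static void prb_fill_curr_block_sketch(void)
    __acquires(&blk_fill_in_prog_lock)
{
    blk_fill_in_prog_readers++;   /* read_lock(&...->blk_fill_in_prog_lock) */
}

static void prb_clear_blk_fill_status_sketch(void)
    __releases(&blk_fill_in_prog_lock)
{
    blk_fill_in_prog_readers--;   /* read_unlock(...) */
}

#define TPACKET_V3 2   /* matches the kernel's enum tpacket_versions */

/* The fix: on the vnet early-out path, drop the lock, and only when
 * the ring version is exactly TPACKET_V3. */
static void early_out_sketch(int tp_version)
{
    if (tp_version == TPACKET_V3)
        prb_clear_blk_fill_status_sketch();
}
```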

    Fixes: 632ca50f2cbd ("af_packet: TPACKET_V3: replace busy-wait loop")
    Signed-off-by: John Ogness
    Reported-by: kernel test robot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    John Ogness
     

01 Apr, 2020

1 commit

  • [ Upstream commit 61fad6816fc10fb8793a925d5c1256d1c3db0cd2 ]

    PACKET_RX_RING can cause multiple writers to access the same slot if a
    fast writer wraps the ring while a slow writer is still copying. This
    is particularly likely with few, large, slots (e.g., GSO packets).

    Synchronize kernel thread ownership of rx ring slots with a bitmap.

    Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
    while holding the sk receive queue lock. They release this lock before
    copying and set tp_status to TP_STATUS_USER to release to userspace
    when done. During copying, another writer may take the lock, also see
    TP_STATUS_KERNEL, and start writing to the same slot.

    Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
    slot, test and set with the lock held. To release race-free, update
    tp_status and owner bit as a transaction, so take the lock again.
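    The acquire/release transactions can be modeled in userspace like
    this (locking reduced to comments; in the kernel both steps run
    under the sk receive queue spinlock):

```c
#include <stdbool.h>

#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   (1 << 0)

#define RING_SLOTS 8
static unsigned long tp_status_arr[RING_SLOTS];
static unsigned long rx_owner_map;   /* one ownership bit per slot */

/* Acquire a slot: test tp_status AND test-and-set the owner bit,
 * both while holding the receive queue lock (elided here). */
static bool slot_acquire(unsigned int idx)
{
    /* spin_lock(&sk->sk_receive_queue.lock); */
    if (tp_status_arr[idx] != TP_STATUS_KERNEL ||
        (rx_owner_map & (1UL << idx)))
        return false;              /* another writer owns this slot */
    rx_owner_map |= 1UL << idx;
    /* spin_unlock(...); the copy then happens outside the lock */
    return true;
}

/* Release race-free: flip tp_status to USER and clear the owner bit
 * as one transaction, so the lock is taken again. */
static void slot_release(unsigned int idx)
{
    /* spin_lock(&sk->sk_receive_queue.lock); */
    tp_status_arr[idx] = TP_STATUS_USER;
    rx_owner_map &= ~(1UL << idx);
    /* spin_unlock(...); */
}
```

    A second writer that sees TP_STATUS_KERNEL mid-copy now fails the
    owner-bit test instead of scribbling into the same slot.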

    This is one of a variety of options that were discussed (see Link below):

    * instead of a shadow ring, embed the data in the slot itself, such as
    in tp_padding. But any test for this field may match a value left by
    userspace, causing deadlock.

    * avoid the lock on release. This leaves a small race if the shadow
    slot is released before setting TP_STATUS_USER. The reproducer below
    showed that this race is not academic. If the slot is released after
    tp_status, the race is more subtle. See the first link for details.

    * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
    of two fields. But legacy applications may interpret any non-zero
    tp_status as owned by the user, as libpcap does. So this is possible
    only as an opt-in for newer processes. It can be added as an optional mode.

    * embed the struct at the tail of pg_vec to avoid extra allocation.
    The implementation proved no less complex than a separate field.

    The additional locking cost on release adds contention, no different
    from scaling on multicore or multiqueue h/w. In practice, neither the
    reproducer below nor a small-packet tcpdump run showed a noticeable
    change in cycles spent in spinlocks in perf report. Where contention
    is problematic, packet sockets support mitigation through
    PACKET_FANOUT. And we can consider adding the opt-in state
    TP_KERNEL_OWNED.

    Easy to reproduce by running multiple netperf or similar TCP_STREAM
    flows concurrently with `tcpdump -B 129 -n greater 60000`.

    Based on an earlier patchset by Jon Rosen. See links below.

    I believe this issue goes back to the introduction of tpacket_rcv,
    which predates git history.

    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
    Suggested-by: Jon Rosen
    Signed-off-by: Willem de Bruijn
    Signed-off-by: Jon Rosen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

18 Mar, 2020

1 commit

  • [ Upstream commit 46e4c421a053c36bf7a33dda2272481bcaf3eed3 ]

    In one error case, tpacket_rcv drops packets after incrementing the
    ring producer index.

    If this happens, it does not update tp_status to TP_STATUS_USER and
    thus the reader is stalled for an iteration of the ring, causing out
    of order arrival.

    The only such error path is when virtio_net_hdr_from_skb fails due
    to encountering an unknown GSO type.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

26 Jan, 2020

1 commit

  • [ Upstream commit b756ad928d98e5ef0b74af7546a6a31a8dadde00 ]

    KCSAN reported the following data-race [1]

    Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.

    Since the report hinted at multiple cpus using the history
    concurrently, I added a test that avoids writing to it if the
    victim slot already contains the desired value.
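    The pattern can be sketched standalone (the macros below are
    minimal userspace stand-ins; the array name is assumed from the
    trace, standing in for the rollover history):

```c
/* Minimal userspace stand-ins for the kernel's marked-access macros. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static unsigned int history[4];   /* stand-in for the rollover history */

/* Marked accesses silence KCSAN, and the store is skipped when the
 * slot already holds the value, so CPUs racing on the same slot stop
 * dirtying the cache line needlessly. */
static void record_flow(unsigned int slot, unsigned int flow_id)
{
    if (READ_ONCE(history[slot]) != flow_id)
        WRITE_ONCE(history[slot], flow_id);
}
```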

    [1]

    BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover

    read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1:
    fanout_flow_is_huge net/packet/af_packet.c:1303 [inline]
    fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353
    packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
    deliver_skb net/core/dev.c:1888 [inline]
    dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
    xmit_one net/core/dev.c:3195 [inline]
    dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
    __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0:
    fanout_flow_is_huge net/packet/af_packet.c:1306 [inline]
    fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353
    packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
    deliver_skb net/core/dev.c:1888 [inline]
    dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
    xmit_one net/core/dev.c:3195 [inline]
    dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
    __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 3b3a5b0aab5b ("packet: rollover huge flows before small flows")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     

31 Dec, 2019

1 commit

  • [ Upstream commit b43d1f9f7067c6759b1051e8ecb84e82cef569fe ]

    There is a softlockup when using TPACKET_V3:
    ...
    NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms!
    (__irq_svc) from [] (_raw_spin_unlock_irqrestore+0x44/0x54)
    (_raw_spin_unlock_irqrestore) from [] (mod_timer+0x210/0x25c)
    (mod_timer) from []
    (prb_retire_rx_blk_timer_expired+0x68/0x11c)
    (prb_retire_rx_blk_timer_expired) from []
    (call_timer_fn+0x90/0x17c)
    (call_timer_fn) from [] (run_timer_softirq+0x2d4/0x2fc)
    (run_timer_softirq) from [] (__do_softirq+0x218/0x318)
    (__do_softirq) from [] (irq_exit+0x88/0xac)
    (irq_exit) from [] (msa_irq_exit+0x11c/0x1d4)
    (msa_irq_exit) from [] (handle_IPI+0x650/0x7f4)
    (handle_IPI) from [] (gic_handle_irq+0x108/0x118)
    (gic_handle_irq) from [] (__irq_usr+0x44/0x5c)
    ...

    If __ethtool_get_link_ksettings() fails in prb_calc_retire_blk_tmo(),
    msec and tmo will be zero, so tov_in_jiffies is zero and the
    retire_blk_timer expiry becomes
    mod_timer(&pkc->retire_blk_timer, jiffies + 0),
    which drives the softirq cpu usage to 100%.
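    The shape of the fix can be sketched as a pure helper (standalone
    illustration; the kernel's fallback constant DEFAULT_PRB_RETIRE_TOV
    in af_packet.c is 8 msec):

```c
#define DEFAULT_PRB_RETIRE_TOV 8   /* msec; kernel fallback value */

/* Treat a failed ethtool query, or a computed timeout of zero, as
 * "use the default", so the retire timer is never armed with +0. */
static unsigned int retire_tmo_msec(int ethtool_err, unsigned int msec)
{
    if (ethtool_err < 0 || msec == 0)
        return DEFAULT_PRB_RETIRE_TOV;
    return msec;
}
```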

    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Tested-by: Xiao Jiangfeng
    Signed-off-by: Mao Wenan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Mao Wenan
     

02 Oct, 2019

1 commit

  • commit 174e23810cd31
    ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
    recycle always drop skb extensions. The additional skb_ext_del() that is
    performed via nf_reset on napi skb recycle is not needed anymore.

    Most nf_reset() calls in the stack are there so queued skb won't block
    'rmmod nf_conntrack' indefinitely.

    This removes the skb_ext_del from nf_reset, and renames it to a more
    fitting nf_reset_ct().

    In a few selected places, add a call to skb_ext_reset to make sure that
    no active extensions remain.

    I am submitting this for "net", because we're still early in the release
    cycle. The patch applies to net-next too, but I think the rename causes
    needless divergence between those trees.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

16 Aug, 2019

1 commit

  • packet_sendmsg() checks tx_ring.pg_vec to decide
    if it must call tpacket_snd().

    Problem is that the check is lockless, meaning another thread
    can issue a concurrent setsockopt(PACKET_TX_RING) to flip
    tx_ring.pg_vec back to NULL.

    Given that tpacket_snd() grabs pg_vec_lock mutex, we can
    perform the check again to solve the race.
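    The double-check pattern, reduced to a standalone sketch (locking
    shown as comments; the kernel returns -EBUSY when the ring has
    vanished under the mutex):

```c
#include <stddef.h>

static void *tx_ring_pg_vec;   /* flipped by setsockopt(PACKET_TX_RING) */

/* Repeat the lockless fast-path check after taking pg_vec_lock: a
 * concurrent setsockopt may have torn the ring down in between. */
static int tpacket_snd_sketch(void)
{
    int err;

    /* mutex_lock(&po->pg_vec_lock); */
    if (!tx_ring_pg_vec) {
        err = -16;               /* -EBUSY: ring vanished under us */
    } else {
        /* ... safe to look up and transmit ring frames here ... */
        err = 0;
    }
    /* mutex_unlock(&po->pg_vec_lock); */
    return err;
}
```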

    syzbot reported :

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 11429 Comm: syz-executor394 Not tainted 5.3.0-rc4+ #101
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:packet_lookup_frame+0x8d/0x270 net/packet/af_packet.c:474
    Code: c1 ee 03 f7 73 0c 80 3c 0e 00 0f 85 cb 01 00 00 48 8b 0b 89 c0 4c 8d 24 c1 48 b8 00 00 00 00 00 fc ff df 4c 89 e1 48 c1 e9 03 3c 01 00 0f 85 94 01 00 00 48 8d 7b 10 4d 8b 3c 24 48 b8 00 00
    RSP: 0018:ffff88809f82f7b8 EFLAGS: 00010246
    RAX: dffffc0000000000 RBX: ffff8880a45c7030 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 1ffff110148b8e06 RDI: ffff8880a45c703c
    RBP: ffff88809f82f7e8 R08: ffff888087aea200 R09: fffffbfff134ae50
    R10: fffffbfff134ae4f R11: ffffffff89a5727f R12: 0000000000000000
    R13: 0000000000000001 R14: ffff8880a45c6ac0 R15: 0000000000000000
    FS: 00007fa04716f700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa04716edb8 CR3: 0000000091eb4000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    packet_current_frame net/packet/af_packet.c:487 [inline]
    tpacket_snd net/packet/af_packet.c:2667 [inline]
    packet_sendmsg+0x590/0x6250 net/packet/af_packet.c:2975
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:657
    ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
    __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
    do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jun, 2019

1 commit

  • The new route handling in ip_mc_finish_output() from 'net' overlapped
    with the new support for returning congestion notifications from BPF
    programs.

    In order to handle this I had to take the dev_loopback_xmit() calls
    out of the switch statement.

    The aquantia driver conflicts were simple overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Jun, 2019

1 commit

  • When an application is run that:
    a) Sets its scheduler to be SCHED_FIFO
    and
    b) Opens a memory mapped AF_PACKET socket, and sends frames with the
    MSG_DONTWAIT flag cleared, it's possible for the application to hang
    forever in the kernel. This occurs because when waiting, the code in
    tpacket_snd calls schedule, which under normal circumstances allows
    other tasks to run, including ksoftirqd, which in some cases is
    responsible for freeing the transmitted skb (which in AF_PACKET calls a
    destructor that flips the status bit of the transmitted frame back to
    available, allowing the transmitting task to complete).

    However, when the calling application is SCHED_FIFO, its priority is
    such that the schedule call immediately places the task back on the cpu,
    preventing ksoftirqd from freeing the skb, which in turn prevents the
    transmitting task from detecting that the transmission is complete.

    We can fix this by converting the schedule call to a completion
    mechanism. By using a completion queue, we force the calling task, when
    it detects there are no more frames to send, to schedule itself off the
    cpu until such time as the last transmitted skb is freed, allowing
    forward progress to be made.
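    A userspace model of the mechanism (single-threaded for clarity;
    in the kernel the destructor runs from ksoftirqd and the sender
    blocks in wait_for_completion):

```c
#include <stdbool.h>

/* Model of the completion-based fix: instead of looping on schedule()
 * (which a SCHED_FIFO sender always wins, starving ksoftirqd), the
 * sender sleeps on a completion that the skb destructor signals when
 * the last transmitted frame is freed. */
struct completion { bool done; };

static struct completion skb_completion;
static int packet_pending;      /* outstanding transmitted frames */

static void init_completion(struct completion *c) { c->done = false; }

/* Called from the skb destructor; only the last frame completes. */
static void tpacket_destruct_skb_sketch(void)
{
    if (--packet_pending == 0)
        skb_completion.done = true;   /* complete(&po->skb_completion) */
}

static int tpacket_snd_sketch(int frames)
{
    /* Re-init on every call: a prior MSG_DONTWAIT send may have left
     * a stale completion behind (the V3->V4 imbalance fix below). */
    init_completion(&skb_completion);
    packet_pending = frames;
    while (frames--)
        tpacket_destruct_skb_sketch();   /* normally done by ksoftirqd */
    return skb_completion.done ? 0 : -1; /* wait_for_completion(...) */
}
```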

    Tested by myself and the reporter, with good results

    Change Notes:

    V1->V2:
    Enhance the sleep logic to support being interruptible and
    to honor SK_SNDTIMEO (Willem de Bruijn)

    V2->V3:
    Rearrange the point at which we wait for the completion queue, to
    avoid needing to check for ph/skb being null at the end of the loop.
    Also move the complete call to the skb destructor to avoid needing to
    modify __packet_set_status. Also gate calling complete on
    packet_read_pending returning zero to avoid multiple calls to complete.
    (Willem de Bruijn)

    Move timeo computation within loop, to re-fetch the socket
    timeout since we also use the timeo variable to record the return code
    from the wait_for_complete call (Neil Horman)

    V3->V4:
    Willem has requested that the control flow be restored to the
    previous state. Doing so lets us eliminate the need for the
    po->wait_on_complete flag variable, and lets us get rid of the
    packet_next_frame function, but introduces another complexity.
    Specifically, by using the packet pending count, if an application
    calls sendmsg multiple times with MSG_DONTWAIT set, each set of
    transmitted frames, when complete, will cause tpacket_destruct_skb
    to issue a complete call for which there will never be a matching
    wait_on_completion call. This imbalance would lead any future call
    to wait_for_completion here to return early, when the frames just
    sent may not have completed. To correct this, we need to re-init the
    completion queue on every call to tpacket_snd before we enter the
    loop, so as to ensure we wait properly for the frames we send in
    this iteration.

    Change the timeout and interrupted gotos to out_put rather than
    out_status so that we don't try to free a non-existent skb.
    Clean up some extra newlines (Willem de Bruijn)

    Reviewed-by: Willem de Bruijn
    Signed-off-by: Neil Horman
    Reported-by: Matteo Croce
    Signed-off-by: David S. Miller

    Neil Horman
     

24 Jun, 2019

1 commit


15 Jun, 2019

8 commits


12 Jun, 2019

1 commit


08 Jun, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.

    2) Read SFP eeprom in max 16 byte increments to avoid problems with
    some SFP modules, from Russell King.

    3) Fix UDP socket lookup wrt. VRF, from Tim Beale.

    4) Handle route invalidation properly in s390 qeth driver, from Julian
    Wiedmann.

    5) Memory leak on unload in RDS, from Zhu Yanjun.

    6) sctp_process_init leak, from Neil Horman.

    7) Fix fib_rules rule insertion semantic change that broke Android,
    from Hangbin Liu.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
    pktgen: do not sleep with the thread lock held.
    net: mvpp2: Use strscpy to handle stat strings
    net: rds: fix memory leak in rds_ib_flush_mr_pool
    ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
    ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
    Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
    net: aquantia: fix wol configuration not applied sometimes
    ethtool: fix potential userspace buffer overflow
    Fix memory leak in sctp_process_init
    net: rds: fix memory leak when unload rds_rdma
    ipv6: fix the check before getting the cookie in rt6_get_cookie
    ipv4: not do cache for local delivery if bc_forwarding is enabled
    s390/qeth: handle error when updating TX queue count
    s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
    s390/qeth: check dst entry before use
    s390/qeth: handle limited IPv4 broadcast in L3 TX path
    net: fix indirect calls helpers for ptype list hooks.
    net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
    udp: only choose unbound UDP socket for multicast when not in a VRF
    net/tls: replace the sleeping lock around RX resync with a bit lock
    ...

    Linus Torvalds
     

03 Jun, 2019

1 commit

  • Rollover used to use a complex RCU mechanism for assignment, which had
    a race condition. The below patch fixed the bug and greatly simplified
    the logic.

    The feature depends on fanout, but the state is private to the socket.
    Fanout_release returns f only when the last member leaves and the
    fanout struct is to be freed.

    Destroy rollover unconditionally, regardless of fanout state.

    Fixes: 57f015f5eccf2 ("packet: fix crash in fanout_demux_rollover()")
    Reported-by: syzbot
    Diagnosed-by: Dmitry Vyukov
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

2 commits


10 May, 2019

1 commit

  • kernel BUG at lib/list_debug.c:47!
    invalid opcode: 0000 [#1
    CPU: 0 PID: 12914 Comm: rmmod Tainted: G W 5.1.0+ #47
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:__list_del_entry_valid+0x53/0x90
    Code: 48 8b 32 48 39 fe 75 35 48 8b 50 08 48 39 f2 75 40 b8 01 00 00 00 5d c3 48
    89 fe 48 89 c2 48 c7 c7 18 75 fe 82 e8 cb 34 78 ff 0b 48 89 fe 48 c7 c7 50 75 fe 82 e8 ba 34 78 ff 0f 0b 48 89 f2
    RSP: 0018:ffffc90001c2fe40 EFLAGS: 00010286
    RAX: 000000000000004e RBX: ffffffffa0184000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff888237a17788 RDI: 00000000ffffffff
    RBP: ffffc90001c2fe40 R08: 0000000000000000 R09: 0000000000000000
    R10: ffffc90001c2fe10 R11: 0000000000000000 R12: 0000000000000000
    R13: ffffc90001c2fe50 R14: ffffffffa0184000 R15: 0000000000000000
    FS: 00007f3d83634540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000555c350ea818 CR3: 0000000231677000 CR4: 00000000000006f0
    Call Trace:
    unregister_pernet_operations+0x34/0x120
    unregister_pernet_subsys+0x1c/0x30
    packet_exit+0x1c/0x369 [af_packet
    __x64_sys_delete_module+0x156/0x260
    ? lockdep_hardirqs_on+0x133/0x1b0
    ? do_syscall_64+0x12/0x1f0
    do_syscall_64+0x6e/0x1f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    When af_packet is loaded with modprobe and register_pernet_subsys
    fails, the cleanup sets ops->list to LIST_POISON1, but the module
    init is still considered to have succeeded. Then, on rmmod, BUG()
    is triggered in __list_del_entry_valid, which is called from
    unregister_pernet_subsys. This patch fixes the error handling path
    in packet_init to avoid such issues when an error occurs.

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     

03 May, 2019

1 commit


01 May, 2019

2 commits

  • Packet sockets in datagram mode take a destination address. Verify its
    length before passing to dev_hard_header.

    Prior to 2.6.14-rc3, the send code ignored sll_halen. This is
    established behavior. Directly compare msg_namelen to dev->addr_len.
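    The length check can be sketched standalone (the struct below is a
    trimmed stand-in for struct sockaddr_ll from linux/if_packet.h):

```c
#include <stddef.h>

/* Trimmed stand-in for struct sockaddr_ll. */
struct sockaddr_ll_min {
    unsigned short sll_family;
    unsigned short sll_protocol;
    int            sll_ifindex;
    unsigned short sll_hatype;
    unsigned char  sll_pkttype;
    unsigned char  sll_halen;
    unsigned char  sll_addr[8];
};

/* The fix: compare msg_namelen against the device address length
 * directly, instead of trusting the user-supplied sll_halen. */
static int dest_addr_valid(size_t msg_namelen, unsigned int dev_addr_len)
{
    return msg_namelen >=
           offsetof(struct sockaddr_ll_min, sll_addr) + dev_addr_len;
}
```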

    Change v1->v2: initialize addr in all paths

    Fixes: 6b8d95f1795c4 ("packet: validate address length if non-zero")
    Suggested-by: David Laight
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Packet send checks that msg_name is at least sizeof sockaddr_ll.
    Packet recv must return at least this length, so that its output
    can be passed unmodified to packet send.

    This ceased to be true when support was added for lladdr longer than
    sll_addr. Since then, the return value uses the true address length.

    Always return at least sizeof sockaddr_ll, even if address length
    is shorter. Zero the padding bytes.
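    A standalone sketch of the copy_len logic (again with a trimmed
    stand-in for struct sockaddr_ll):

```c
#include <stddef.h>
#include <string.h>

/* Trimmed stand-in for struct sockaddr_ll. */
struct sockaddr_ll_min {
    unsigned short sll_family;
    unsigned short sll_protocol;
    int            sll_ifindex;
    unsigned short sll_hatype;
    unsigned char  sll_pkttype;
    unsigned char  sll_halen;
    unsigned char  sll_addr[8];
};

/* The fix: report at least sizeof(struct sockaddr_ll) back to
 * userspace, zeroing the padding beyond the real address bytes so no
 * stale memory leaks out. */
static size_t fill_recv_name(struct sockaddr_ll_min *out,
                             const unsigned char *addr, unsigned char halen)
{
    size_t copy_len = offsetof(struct sockaddr_ll_min, sll_addr) + halen;

    out->sll_halen = halen;
    memcpy(out->sll_addr, addr, halen);
    if (copy_len < sizeof(*out)) {
        memset((char *)out + copy_len, 0, sizeof(*out) - copy_len);
        copy_len = sizeof(*out);
    }
    return copy_len;
}
```

    The returned length then feeds msg_namelen, so recv output can be
    passed back to send unmodified.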

    Change v1->v2: do not overwrite zeroed padding again. use copy_len.

    Fixes: 0fb375fb9b93 ("[AF_PACKET]: Allow for > 8 byte hardware addresses.")
    Suggested-by: David Laight
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

28 Apr, 2019

1 commit

    Even though the NLA_F_NESTED flag was introduced more than 11 years ago,
    most netlink based interfaces (including recently added ones) still do
    not set it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)
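    The resulting attribute-type computation looks like this (a sketch
    with hypothetical helper names; only NLA_F_NESTED is the real value
    from include/uapi/linux/netlink.h):

```c
#define NLA_F_NESTED (1 << 15)   /* real flag from linux/netlink.h */

/* After the rename: the old behavior lives on under a _noflag name,
 * and the new nla_nest_start() ORs NLA_F_NESTED into the type. */
static int nest_type_noflag(int attrtype) { return attrtype; }

static int nest_type(int attrtype)
{
    return nest_type_noflag(attrtype | NLA_F_NESTED);
}
```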

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

20 Apr, 2019

1 commit

  • The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
    socket protocol handlers, and all of those end up calling the same
    sock_get_timestamp()/sock_get_timestampns() helper functions, which
    results in a lot of duplicate code.

    With the introduction of 64-bit time_t on 32-bit architectures, this
    gets worse, as we then need four different ioctl commands in each
    socket protocol implementation.

    To simplify that, let's add a new .gettstamp() operation in
    struct proto_ops, and move ioctl implementation into the common
    sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
    through.

    We can reuse the sock_get_timestamp() implementation, but generalize
    it so it can deal with both native and compat mode, as well as
    timeval and timespec structures.
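    The generalization can be modeled as one helper serving both output
    formats (names here are illustrative, not the kernel's; the real
    helper is the common sock-level timestamp function this patch adds):

```c
#include <stdint.h>

/* One code path produces either timespec-style (nanoseconds) or
 * timeval-style (microseconds) output from the socket timestamp. */
struct ts_out { int64_t sec; int64_t frac; };

static struct ts_out get_tstamp(int64_t ktime_ns, int want_timeval)
{
    struct ts_out out;

    out.sec  = ktime_ns / 1000000000;
    out.frac = ktime_ns % 1000000000;   /* nanoseconds */
    if (want_timeval)
        out.frac /= 1000;               /* timeval carries microseconds */
    return out;
}
```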

    Acked-by: Stefan Schmidt
    Acked-by: Neil Horman
    Acked-by: Marc Kleine-Budde
    Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
    Signed-off-by: Arnd Bergmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

28 Mar, 2019

1 commit


21 Mar, 2019

3 commits

  • After the previous patch, all the callers of ndo_select_queue()
    provide as a 'fallback' argument netdev_pick_tx.
    The only exceptions are nested calls to ndo_select_queue(),
    which pass down the 'fallback' available in the current scope
    - still netdev_pick_tx.

    We can drop that argument and replace the fallback() invocation with
    netdev_pick_tx(). This avoids an indirect call per xmit packet
    in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
    with device drivers implementing such an ndo. It also cleans up the
    code a bit.

    Tested with ixgbe and CONFIG_FCOE=m

    With pktgen using queue xmit:

        threads    vanilla (kpps)    patched (kpps)
        1          2334              2428
        2          4166              4278
        4          7895              8100

    v1 -> v2:
    - rebased after helper's name change

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Currently packet_pick_tx_queue() is the only caller of
    ndo_select_queue() using a fallback argument other than
    netdev_pick_tx.

    Leveraging the rx queue, we can obtain a similar queue selection
    behavior using core helpers. After this change, ndo_select_queue()
    is always invoked with netdev_pick_tx() as the fallback.
    We can change the ndo_select_queue() signature in a follow-up patch,
    dropping an indirect call per transmitted packet in some scenarios
    (e.g. TCP syn and XDP generic xmit).

    This slightly changes how af_packet queue selection happens when
    PACKET_QDISC_BYPASS is set. It is now more similar to plain
    dev_queue_xmit(), taking into account both XPS and TC mapping.

    v1 -> v2:
    - rebased after helper name change
    RFC -> v1:
    - initialize sender_cpu to the expected value

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
    Since commit fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
    one can now allocate packet ring buffers >= UINT_MAX. However, syzkaller
    found that this triggers a warning:

    [ 21.100000] WARNING: CPU: 2 PID: 2075 at mm/page_alloc.c:4584 __alloc_pages_nod0
    [ 21.101490] Modules linked in:
    [ 21.101921] CPU: 2 PID: 2075 Comm: syz-executor.0 Not tainted 5.0.0 #146
    [ 21.102784] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
    [ 21.103887] RIP: 0010:__alloc_pages_nodemask+0x2a0/0x630
    [ 21.104640] Code: fe ff ff 65 48 8b 04 25 c0 de 01 00 48 05 90 0f 00 00 41 bd 01 00 00 00 48 89 44 24 48 e9 9c fe 3
    [ 21.107121] RSP: 0018:ffff88805e1cf920 EFLAGS: 00010246
    [ 21.107819] RAX: 0000000000000000 RBX: ffffffff85a488a0 RCX: 0000000000000000
    [ 21.108753] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000000
    [ 21.109699] RBP: 1ffff1100bc39f28 R08: ffffed100bcefb67 R09: ffffed100bcefb67
    [ 21.110646] R10: 0000000000000001 R11: ffffed100bcefb66 R12: 000000000000000d
    [ 21.111623] R13: 0000000000000000 R14: ffff88805e77d888 R15: 000000000000000d
    [ 21.112552] FS: 00007f7c7de05700(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000
    [ 21.113612] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 21.114405] CR2: 000000000065c000 CR3: 000000005e58e006 CR4: 00000000001606e0
    [ 21.115367] Call Trace:
    [ 21.115705] ? __alloc_pages_slowpath+0x21c0/0x21c0
    [ 21.116362] alloc_pages_current+0xac/0x1e0
    [ 21.116923] kmalloc_order+0x18/0x70
    [ 21.117393] kmalloc_order_trace+0x18/0x110
    [ 21.117949] packet_set_ring+0x9d5/0x1770
    [ 21.118524] ? packet_rcv_spkt+0x440/0x440
    [ 21.119094] ? lock_downgrade+0x620/0x620
    [ 21.119646] ? __might_fault+0x177/0x1b0
    [ 21.120177] packet_setsockopt+0x981/0x2940
    [ 21.120753] ? __fget+0x2fb/0x4b0
    [ 21.121209] ? packet_release+0xab0/0xab0
    [ 21.121740] ? sock_has_perm+0x1cd/0x260
    [ 21.122297] ? selinux_secmark_relabel_packet+0xd0/0xd0
    [ 21.123013] ? __fget+0x324/0x4b0
    [ 21.123451] ? selinux_netlbl_socket_setsockopt+0x101/0x320
    [ 21.124186] ? selinux_netlbl_sock_rcv_skb+0x3a0/0x3a0
    [ 21.124908] ? __lock_acquire+0x529/0x3200
    [ 21.125453] ? selinux_socket_setsockopt+0x5d/0x70
    [ 21.126075] ? __sys_setsockopt+0x131/0x210
    [ 21.126533] ? packet_release+0xab0/0xab0
    [ 21.127004] __sys_setsockopt+0x131/0x210
    [ 21.127449] ? kernel_accept+0x2f0/0x2f0
    [ 21.127911] ? ret_from_fork+0x8/0x50
    [ 21.128313] ? do_raw_spin_lock+0x11b/0x280
    [ 21.128800] __x64_sys_setsockopt+0xba/0x150
    [ 21.129271] ? lockdep_hardirqs_on+0x37f/0x560
    [ 21.129769] do_syscall_64+0x9f/0x450
    [ 21.130182] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    We should allocate with __GFP_NOWARN to handle this.

    Cc: Kal Conley
    Cc: Andrey Konovalov
    Fixes: fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

19 Mar, 2019

2 commits

  • I am using "protocol ip" filters in TC to manipulate TC flower
    classifiers, which are only available with "protocol ip". However,
    I faced an issue where packets sent via raw sockets with ETH_P_ALL
    did not match the ip filters even though they satisfied the
    conditions (e.g., a DHCP offer from dhcpd).

    I have determined that the behavior was caused by an unexpected
    value stored in skb->protocol, namely, ETH_P_ALL instead of ETH_P_IP,
    when packets were sent via raw sockets with ETH_P_ALL set.

    IMHO, storing ETH_P_ALL in skb->protocol is not appropriate for
    packets sent via raw sockets, because ETH_P_ALL is not a real
    ethertype used on the wire, but a virtual one.

    This patch fixes the tx protocol selection in cases of transmission
    via raw sockets created with ETH_P_ALL so that it asks the driver to
    extract protocol from the Ethernet header.

    Fixes: 75c65772c3 ("net/packet: Ask driver for protocol if not provided by user")
    Signed-off-by: Yoshiki Komachi
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Yoshiki Komachi
     
  • When using fanouts with AF_PACKET, the demux functions such as
    fanout_demux_cpu will return an index in the fanout socket array, which
    corresponds to the selected socket.

    The ordering of this array depends on the order the sockets were added
    to a given fanout group, so for FANOUT_CPU this means sockets are bound
    to cpus in the order they are configured, which is OK.

    However, when the interface these sockets are bound to is stopped
    and then restarted, the sockets are reassigned to the fanout group
    in reverse order, because each one is inserted at the head of the
    interface's AF_PACKET socket list.

    This means that traffic that was directed to the first socket in the
    fanout group is now directed to the last one after an interface restart.

    In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
    socket that used to receive traffic from the last CPU after an interface
    restart.

    This commit introduces a helper to add a socket at the tail of a list,
    then uses it to register AF_PACKET sockets.

    Note that this changes the order in which sockets are listed in /proc and
    with sock_diag.

    Fixes: dc99f600698d ("packet: Add fanout support")
    Signed-off-by: Maxime Chevallier
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Maxime Chevallier
     

23 Feb, 2019

3 commits

  • c72219b75f introduced tpacket_set_protocol, which parses the
    Ethernet L2 header and sets skb->protocol if it is unset. It is no
    longer needed since the introduction of packet_parse_headers: in the
    case of SOCK_RAW with skb->protocol unset, packet_parse_headers asks
    the driver for the protocol number, and this is implemented for all
    Ethernet devices. As the old function supported only Ethernet, no
    functionality is lost.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Maxim Mikityanskiy
     
  • If a socket was created with socket(AF_PACKET, SOCK_RAW, 0), the
    protocol number is unavailable. Try to ask the driver to extract it from
    the L2 header in order for skb_try_probe_transport_header to succeed.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Maxim Mikityanskiy
     
  • If the socket was created with socket(AF_PACKET, SOCK_RAW, 0),
    skb->protocol will be unset, __skb_flow_dissect() will fail, and
    skb_probe_transport_header() will fall back to the offset_hint, making
    the resulting skb_transport_offset incorrect.

    If, however, there is no transport header in the packet,
    transport_header shouldn't be set to an arbitrary value.

    Fix it by leaving the transport offset unset if it couldn't be
    found, making the absence explicit rather than filling the field
    with a wrong value. This changes behavior, but any code that relied
    on the old behavior was already broken, since the old value was
    incorrect.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Maxim Mikityanskiy