26 Jan, 2020

1 commit

  • [ Upstream commit b756ad928d98e5ef0b74af7546a6a31a8dadde00 ]

    KCSAN reported the following data-race [1]

    Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.

    Since the report hinted about multiple cpus using the history
    concurrently, I added a test avoiding writing on it if the
    victim slot already contains the desired value.

    [1]

    BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover

    read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1:
    fanout_flow_is_huge net/packet/af_packet.c:1303 [inline]
    fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353
    packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
    deliver_skb net/core/dev.c:1888 [inline]
    dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
    xmit_one net/core/dev.c:3195 [inline]
    dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
    __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0:
    fanout_flow_is_huge net/packet/af_packet.c:1306 [inline]
    fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353
    packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
    deliver_skb net/core/dev.c:1888 [inline]
    dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
    xmit_one net/core/dev.c:3195 [inline]
    dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
    __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
    ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
    udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
    udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
    inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0x9f/0xc0 net/socket.c:657
    ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
    __sys_sendmmsg+0x123/0x350 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 3b3a5b0aab5b ("packet: rollover huge flows before small flows")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
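
The "skip the write if the victim slot already holds the value" idea described above can be sketched in user-space C. This is a minimal illustration, not the kernel code: `read_once_u32`, `write_once_u32`, `HISTORY_LEN`, and `flow_is_huge` are hypothetical names, and the volatile casts only approximate what READ_ONCE()/WRITE_ONCE() do in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_LEN 16 /* hypothetical history size, chosen for the sketch */

/* Stand-ins for READ_ONCE()/WRITE_ONCE(): force a single, untorn access. */
static inline uint32_t read_once_u32(const uint32_t *p)
{
    return *(const volatile uint32_t *)p;
}

static inline void write_once_u32(uint32_t *p, uint32_t v)
{
    *(volatile uint32_t *)p = v;
}

/*
 * Count how often rxhash appears in the shared history, then record it in
 * the victim slot. The store is skipped when the slot already holds rxhash,
 * so CPUs that see the same flow do not keep dirtying the cache line.
 */
static bool flow_is_huge(uint32_t *history, uint32_t rxhash, unsigned int victim)
{
    unsigned int i, count = 0;

    for (i = 0; i < HISTORY_LEN; i++)
        if (read_once_u32(&history[i]) == rxhash)
            count++;

    victim %= HISTORY_LEN;
    if (read_once_u32(&history[victim]) != rxhash)
        write_once_u32(&history[victim], rxhash);

    return count > (HISTORY_LEN >> 1);
}
```

In this sketch the caller supplies the victim index; a real implementation would presumably pick it at random.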
     

31 Dec, 2019

1 commit

  • [ Upstream commit b43d1f9f7067c6759b1051e8ecb84e82cef569fe ]

    A soft lockup occurs when using TPACKET_V3:
    ...
    NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms!
    (__irq_svc) from [] (_raw_spin_unlock_irqrestore+0x44/0x54)
    (_raw_spin_unlock_irqrestore) from [] (mod_timer+0x210/0x25c)
    (mod_timer) from []
    (prb_retire_rx_blk_timer_expired+0x68/0x11c)
    (prb_retire_rx_blk_timer_expired) from []
    (call_timer_fn+0x90/0x17c)
    (call_timer_fn) from [] (run_timer_softirq+0x2d4/0x2fc)
    (run_timer_softirq) from [] (__do_softirq+0x218/0x318)
    (__do_softirq) from [] (irq_exit+0x88/0xac)
    (irq_exit) from [] (msa_irq_exit+0x11c/0x1d4)
    (msa_irq_exit) from [] (handle_IPI+0x650/0x7f4)
    (handle_IPI) from [] (gic_handle_irq+0x108/0x118)
    (gic_handle_irq) from [] (__irq_usr+0x44/0x5c)
    ...

    If __ethtool_get_link_ksettings() fails in prb_calc_retire_blk_tmo(),
    msec and tmo are both zero, so tov_in_jiffies is zero and the
    retire_blk_timer is re-armed with
    mod_timer(&pkc->retire_blk_timer, jiffies + 0).
    The timer then fires continuously and softirq CPU usage reaches 100%.

    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Tested-by: Xiao Jiangfeng
    Signed-off-by: Mao Wenan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Mao Wenan
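
The failure mode above comes from a computed timeout of zero. A minimal sketch of a guard against it follows, assuming a hypothetical helper `retire_timeout_ms` and a hypothetical default `DEFAULT_RETIRE_TOV_MS`; it is not the kernel's implementation, only an illustration of falling back to a non-zero default.

```c
#include <stdint.h>

#define DEFAULT_RETIRE_TOV_MS 8u /* hypothetical fallback, in milliseconds */

/*
 * Estimate how long it takes to fill one block at line rate.
 * speed_mbps == 0 models a failed link-speed query. The function never
 * returns 0, so a timer armed with the result cannot fire back-to-back.
 */
static unsigned int retire_timeout_ms(unsigned int block_size_bytes,
                                      unsigned int speed_mbps)
{
    uint64_t bits, ms;

    if (speed_mbps == 0)
        return DEFAULT_RETIRE_TOV_MS;

    bits = (uint64_t)block_size_bytes * 8;
    /* 1 Mbit/s moves 1000 bits per millisecond. */
    ms = bits / ((uint64_t)speed_mbps * 1000);

    return ms ? (unsigned int)ms : DEFAULT_RETIRE_TOV_MS;
}
```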
     

02 Oct, 2019

1 commit

  • commit 174e23810cd31
    ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
    recycle always drop skb extensions. The additional skb_ext_del() that is
    performed via nf_reset on napi skb recycle is not needed anymore.

    Most nf_reset() calls in the stack are there so queued skb won't block
    'rmmod nf_conntrack' indefinitely.

    This removes the skb_ext_del from nf_reset, and renames it to a more
    fitting nf_reset_ct().

    In a few selected places, add a call to skb_ext_reset to make sure that
    no active extensions remain.

    I am submitting this for "net", because we're still early in the release
    cycle. The patch applies to net-next too, but I think the rename causes
    needless divergence between those trees.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

16 Aug, 2019

1 commit

  • packet_sendmsg() checks tx_ring.pg_vec to decide
    if it must call tpacket_snd().

    The problem is that the check is lockless, meaning another thread
    can issue a concurrent setsockopt(PACKET_TX_RING) to flip
    tx_ring.pg_vec back to NULL.

    Given that tpacket_snd() grabs pg_vec_lock mutex, we can
    perform the check again to solve the race.

    syzbot reported :

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 11429 Comm: syz-executor394 Not tainted 5.3.0-rc4+ #101
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:packet_lookup_frame+0x8d/0x270 net/packet/af_packet.c:474
    Code: c1 ee 03 f7 73 0c 80 3c 0e 00 0f 85 cb 01 00 00 48 8b 0b 89 c0 4c 8d 24 c1 48 b8 00 00 00 00 00 fc ff df 4c 89 e1 48 c1 e9 03 3c 01 00 0f 85 94 01 00 00 48 8d 7b 10 4d 8b 3c 24 48 b8 00 00
    RSP: 0018:ffff88809f82f7b8 EFLAGS: 00010246
    RAX: dffffc0000000000 RBX: ffff8880a45c7030 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 1ffff110148b8e06 RDI: ffff8880a45c703c
    RBP: ffff88809f82f7e8 R08: ffff888087aea200 R09: fffffbfff134ae50
    R10: fffffbfff134ae4f R11: ffffffff89a5727f R12: 0000000000000000
    R13: 0000000000000001 R14: ffff8880a45c6ac0 R15: 0000000000000000
    FS: 00007fa04716f700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa04716edb8 CR3: 0000000091eb4000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    packet_current_frame net/packet/af_packet.c:487 [inline]
    tpacket_snd net/packet/af_packet.c:2667 [inline]
    packet_sendmsg+0x590/0x6250 net/packet/af_packet.c:2975
    sock_sendmsg_nosec net/socket.c:637 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:657
    ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
    __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
    __do_sys_sendmmsg net/socket.c:2442 [inline]
    __se_sys_sendmmsg net/socket.c:2439 [inline]
    __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
    do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
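
The "check again under the lock" pattern described above can be sketched as follows. This is a user-space illustration under stated assumptions: `struct ring`, `ring_send`, and `send_msg` are hypothetical names, and a C11 `atomic_flag` spinlock stands in for the pg_vec_lock mutex.

```c
#include <stdatomic.h>
#include <stddef.h>

struct ring {
    atomic_flag lock;     /* stand-in for the pg_vec_lock mutex */
    void *_Atomic pg_vec; /* NULL when no ring is configured */
};

static void ring_lock(struct ring *r)
{
    while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
        ; /* spin */
}

static void ring_unlock(struct ring *r)
{
    atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Ring send path: the pointer is validated again once the lock is held. */
static int ring_send(struct ring *r)
{
    int ret = 0;

    ring_lock(r);
    if (atomic_load_explicit(&r->pg_vec, memory_order_relaxed) == NULL)
        ret = -1; /* ring was torn down after the lockless check */
    /* else: safe to walk the ring here, since teardown takes the same lock */
    ring_unlock(r);
    return ret;
}

/* Entry point: the lockless check is only a hint for choosing the path. */
static int send_msg(struct ring *r)
{
    if (atomic_load_explicit(&r->pg_vec, memory_order_relaxed) != NULL)
        return ring_send(r);
    return 1; /* regular (non-ring) send path */
}
```

The design choice is that the unlocked read stays cheap for the common case, while correctness rests entirely on the second check made with the lock held.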
     

28 Jun, 2019

1 commit

  • The new route handling in ip_mc_finish_output() from 'net' overlapped
    with the new support for returning congestion notifications from BPF
    programs.

    In order to handle this I had to take the dev_loopback_xmit() calls
    out of the switch statement.

    The aquantia driver conflicts were simple overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Jun, 2019

1 commit

  • When an application is run that:
    a) Sets its scheduler to be SCHED_FIFO
    and
    b) Opens a memory mapped AF_PACKET socket, and sends frames with the
    MSG_DONTWAIT flag cleared, it's possible for the application to hang
    forever in the kernel. This occurs because when waiting, the code in
    tpacket_snd calls schedule, which under normal circumstances allows
    other tasks to run, including ksoftirqd, which in some cases is
    responsible for freeing the transmitted skb (which in AF_PACKET calls a
    destructor that flips the status bit of the transmitted frame back to
    available, allowing the transmitting task to complete).

    However, when the calling application is SCHED_FIFO, its priority is
    such that the schedule call immediately places the task back on the cpu,
    preventing ksoftirqd from freeing the skb, which in turn prevents the
    transmitting task from detecting that the transmission is complete.

    We can fix this by converting the schedule call to a completion
    mechanism. By using a completion queue, we force the calling task, when
    it detects there are no more frames to send, to schedule itself off the
    cpu until such time as the last transmitted skb is freed, allowing
    forward progress to be made.

    Tested by myself and the reporter, with good results

    Change Notes:

    V1->V2:
    Enhance the sleep logic to support being interruptible and
    to honor SK_SNDTIMEO (Willem de Bruijn)

    V2->V3:
    Rearrange the point at which we wait for the completion queue, to
    avoid needing to check for ph/skb being null at the end of the loop.
    Also move the complete call to the skb destructor to avoid needing to
    modify __packet_set_status. Also gate calling complete on
    packet_read_pending returning zero to avoid multiple calls to complete.
    (Willem de Bruijn)

    Move timeo computation within loop, to re-fetch the socket
    timeout since we also use the timeo variable to record the return code
    from the wait_for_complete call (Neil Horman)

    V3->V4:
    Willem has requested that the control flow be restored to the
    previous state. Doing so lets us eliminate the need for the
    po->wait_on_complete flag variable, and lets us get rid of the
    packet_next_frame function, but introduces another complexity.
    Specifically, because we use the packet pending count, if an
    application calls sendmsg multiple times with MSG_DONTWAIT set, each
    set of transmitted frames, when complete, will cause
    tpacket_destruct_skb to issue a complete call for which there will
    never be a matching wait_for_completion call. This imbalance will cause
    any future call to wait_for_completion here to return early, when the
    frames it sent may not have completed. To correct this, we need to re-init
    the completion queue on every call to tpacket_snd before we enter the
    loop so as to ensure we wait properly for the frames we send in this
    iteration.

    Change the timeout and interrupted gotos to out_put rather than
    out_status so that we don't try to free a non-existent skb.
    Clean up some extra newlines (Willem de Bruijn)

    Reviewed-by: Willem de Bruijn
    Signed-off-by: Neil Horman
    Reported-by: Matteo Croce
    Signed-off-by: David S. Miller

    Neil Horman
     

24 Jun, 2019

1 commit


15 Jun, 2019

8 commits


12 Jun, 2019

1 commit


08 Jun, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.

    2) Read SFP eeprom in max 16 byte increments to avoid problems with
    some SFP modules, from Russell King.

    3) Fix UDP socket lookup wrt. VRF, from Tim Beale.

    4) Handle route invalidation properly in s390 qeth driver, from Julian
    Wiedmann.

    5) Memory leak on unload in RDS, from Zhu Yanjun.

    6) sctp_process_init leak, from Neil Horman.

    7) Fix fib_rules rule insertion semantic change that broke Android,
    from Hangbin Liu.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
    pktgen: do not sleep with the thread lock held.
    net: mvpp2: Use strscpy to handle stat strings
    net: rds: fix memory leak in rds_ib_flush_mr_pool
    ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
    ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
    Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
    net: aquantia: fix wol configuration not applied sometimes
    ethtool: fix potential userspace buffer overflow
    Fix memory leak in sctp_process_init
    net: rds: fix memory leak when unload rds_rdma
    ipv6: fix the check before getting the cookie in rt6_get_cookie
    ipv4: not do cache for local delivery if bc_forwarding is enabled
    s390/qeth: handle error when updating TX queue count
    s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
    s390/qeth: check dst entry before use
    s390/qeth: handle limited IPv4 broadcast in L3 TX path
    net: fix indirect calls helpers for ptype list hooks.
    net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
    udp: only choose unbound UDP socket for multicast when not in a VRF
    net/tls: replace the sleeping lock around RX resync with a bit lock
    ...

    Linus Torvalds
     

03 Jun, 2019

1 commit

  • Rollover used to use a complex RCU mechanism for assignment, which had
    a race condition. The below patch fixed the bug and greatly simplified
    the logic.

    The feature depends on fanout, but the state is private to the socket.
    Fanout_release returns f only when the last member leaves and the
    fanout struct is to be freed.

    Destroy rollover unconditionally, regardless of fanout state.

    Fixes: 57f015f5eccf2 ("packet: fix crash in fanout_demux_rollover()")
    Reported-by: syzbot
    Diagnosed-by: Dmitry Vyukov
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

2 commits


10 May, 2019

1 commit

  • kernel BUG at lib/list_debug.c:47!
    invalid opcode: 0000 [#1
    CPU: 0 PID: 12914 Comm: rmmod Tainted: G W 5.1.0+ #47
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:__list_del_entry_valid+0x53/0x90
    Code: 48 8b 32 48 39 fe 75 35 48 8b 50 08 48 39 f2 75 40 b8 01 00 00 00 5d c3 48
    89 fe 48 89 c2 48 c7 c7 18 75 fe 82 e8 cb 34 78 ff 0b 48 89 fe 48 c7 c7 50 75 fe 82 e8 ba 34 78 ff 0f 0b 48 89 f2
    RSP: 0018:ffffc90001c2fe40 EFLAGS: 00010286
    RAX: 000000000000004e RBX: ffffffffa0184000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff888237a17788 RDI: 00000000ffffffff
    RBP: ffffc90001c2fe40 R08: 0000000000000000 R09: 0000000000000000
    R10: ffffc90001c2fe10 R11: 0000000000000000 R12: 0000000000000000
    R13: ffffc90001c2fe50 R14: ffffffffa0184000 R15: 0000000000000000
    FS: 00007f3d83634540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000555c350ea818 CR3: 0000000231677000 CR4: 00000000000006f0
    Call Trace:
    unregister_pernet_operations+0x34/0x120
    unregister_pernet_subsys+0x1c/0x30
    packet_exit+0x1c/0x369 [af_packet]
    __x64_sys_delete_module+0x156/0x260
    ? lockdep_hardirqs_on+0x133/0x1b0
    ? do_syscall_64+0x12/0x1f0
    do_syscall_64+0x6e/0x1f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    When af_packet is loaded with modprobe and register_pernet_subsys()
    fails, the cleanup sets ops->list to LIST_POISON1, yet the module init
    is still treated as successful. A later rmmod then triggers BUG() in
    __list_del_entry_valid, which is called from
    unregister_pernet_subsys. This patch fixes the error handling path in
    packet_init to avoid possible issues when an error occurs.

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
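
An init path that checks every registration step and unwinds the earlier ones on failure can be sketched like this. The names `reg`, `unreg`, `fail_step`, and `module_init_sketch` are hypothetical; the sketch only shows the goto-based unwinding idiom, not the actual packet_init code.

```c
/* Test knob for the sketch: index of the step that should fail, or -1. */
static int fail_step = -1;
static int registered[3];

static int reg(int i)
{
    if (fail_step == i)
        return -1;
    registered[i] = 1;
    return 0;
}

static void unreg(int i)
{
    registered[i] = 0;
}

/*
 * Each step's return value is checked. On failure, the steps that already
 * succeeded are undone in reverse order and the error is returned, so the
 * caller never sees a half-initialized module reported as loaded.
 */
static int module_init_sketch(void)
{
    int rc;

    rc = reg(0);
    if (rc)
        goto out;
    rc = reg(1);
    if (rc)
        goto out_unreg0;
    rc = reg(2);
    if (rc)
        goto out_unreg1;
    return 0;

out_unreg1:
    unreg(1);
out_unreg0:
    unreg(0);
out:
    return rc;
}
```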
     

03 May, 2019

1 commit


01 May, 2019

2 commits

  • Packet sockets in datagram mode take a destination address. Verify its
    length before passing to dev_hard_header.

    Prior to 2.6.14-rc3, the send code ignored sll_halen. This is
    established behavior. Directly compare msg_namelen to dev->addr_len.

    Change v1->v2: initialize addr in all paths

    Fixes: 6b8d95f1795c4 ("packet: validate address length if non-zero")
    Suggested-by: David Laight
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
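
The length check described above can be sketched as a comparison of the supplied name length against the device address length. `struct ll_sketch` is a hypothetical local stand-in for a link-layer socket address (its field layout is an assumption made for the sketch), and `dest_addr_len_ok` is an illustrative helper, not a kernel function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a link-layer socket address. */
struct ll_sketch {
    unsigned short family;
    unsigned short protocol;
    int ifindex;
    unsigned short hatype;
    unsigned char pkttype;
    unsigned char halen;   /* deliberately ignored by the check below */
    unsigned char addr[8];
};

/*
 * The caller-supplied buffer must cover the fixed header plus a full
 * device address. The halen field is not consulted; only the buffer
 * length and the device's address length are compared.
 */
static bool dest_addr_len_ok(size_t msg_namelen, size_t dev_addr_len)
{
    return msg_namelen >= offsetof(struct ll_sketch, addr) + dev_addr_len;
}
```

With a 6-byte Ethernet address the full structure passes, while a 20-byte hardware address needs a buffer longer than the structure itself.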
     
  • Packet send checks that msg_name is at least sizeof sockaddr_ll.
    Packet recv must return at least this length, so that its output
    can be passed unmodified to packet send.

    This ceased to be true when support was added for lladdr longer than
    sll_addr. Since then, the return value uses the true address length.

    Always return at least sizeof sockaddr_ll, even if address length
    is shorter. Zero the padding bytes.

    Change v1->v2: do not overwrite zeroed padding again. use copy_len.

    Fixes: 0fb375fb9b93 ("[AF_PACKET]: Allow for > 8 byte hardware addresses.")
    Suggested-by: David Laight
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

28 Apr, 2019

1 commit

  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
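
Why a raw comparison of nla_type breaks once the nested flag is set can be shown with a small sketch. The `SKETCH_` constants mirror the bit positions used by the netlink UAPI flags but are defined locally here; `nest_type` and `attr_type` are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag bits, prefixed to avoid clashing with UAPI headers. */
#define SKETCH_NLA_F_NESTED        (1u << 15)
#define SKETCH_NLA_F_NET_BYTEORDER (1u << 14)
#define SKETCH_NLA_TYPE_MASK \
    ((uint16_t)~(SKETCH_NLA_F_NESTED | SKETCH_NLA_F_NET_BYTEORDER))

/* What a flag-adding nest-start wrapper would store in nla_type. */
static uint16_t nest_type(uint16_t attrtype)
{
    return (uint16_t)(attrtype | SKETCH_NLA_F_NESTED);
}

/* What a flag-aware parser does before comparing attribute types. */
static uint16_t attr_type(uint16_t nla_type)
{
    return (uint16_t)(nla_type & SKETCH_NLA_TYPE_MASK);
}

/* A parser comparing nla_type directly misses flagged attributes. */
static bool raw_match(uint16_t nla_type, uint16_t wanted)
{
    return nla_type == wanted;
}
```

This is the reason the flag cannot simply be added everywhere: `raw_match` stops matching as soon as the kernel starts setting the nested bit.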
     

20 Apr, 2019

1 commit

  • The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
    socket protocol handlers, and all of those end up calling the same
    sock_get_timestamp()/sock_get_timestampns() helper functions, which
    results in a lot of duplicate code.

    With the introduction of 64-bit time_t on 32-bit architectures, this
    gets worse, as we then need four different ioctl commands in each
    socket protocol implementation.

    To simplify that, let's add a new .gettstamp() operation in
    struct proto_ops, and move ioctl implementation into the common
    sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
    through.

    We can reuse the sock_get_timestamp() implementation, but generalize
    it so it can deal with both native and compat mode, as well as
    timeval and timespec structures.

    Acked-by: Stefan Schmidt
    Acked-by: Neil Horman
    Acked-by: Marc Kleine-Budde
    Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
    Signed-off-by: Arnd Bergmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

28 Mar, 2019

1 commit


21 Mar, 2019

3 commits

  • After the previous patch, all the callers of ndo_select_queue()
    provide as a 'fallback' argument netdev_pick_tx.
    The only exceptions are nested calls to ndo_select_queue(),
    which pass down the 'fallback' available in the current scope
    - still netdev_pick_tx.

    We can drop this argument and replace the fallback() invocation with
    netdev_pick_tx(). This avoids an indirect call per transmitted packet
    in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
    with device drivers implementing this ndo. It also cleans up the code
    a bit.

    Tested with ixgbe and CONFIG_FCOE=m

    With pktgen using queue xmit:
    threads    vanilla    patched
               (kpps)     (kpps)
    1          2334       2428
    2          4166       4278
    4          7895       8100

    v1 -> v2:
    - rebased after helper's name change

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Currently packet_pick_tx_queue() is the only caller of
    ndo_select_queue() using a fallback argument other than
    netdev_pick_tx.

    Leveraging rx queue, we can obtain a similar queue selection
    behavior using core helpers. After this change, ndo_select_queue()
    is always invoked with netdev_pick_tx() as fallback.
    We can change ndo_select_queue() signature in a followup patch,
    dropping an indirect call per transmitted packet in some scenarios
    (e.g. TCP syn and XDP generic xmit)

    This slightly changes how af_packet queue selection happens when
    PACKET_QDISC_BYPASS is set. It is now more similar to plain
    dev_queue_xmit(), taking into account both XPS and TC mapping.

    v1 -> v2:
    - rebased after helper name change
    RFC -> v1:
    - initialize sender_cpu to the expected value

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Since commit fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
    one can now allocate packet ring buffers >= UINT_MAX. However, syzkaller
    found that this triggers a warning:

    [ 21.100000] WARNING: CPU: 2 PID: 2075 at mm/page_alloc.c:4584 __alloc_pages_nod0
    [ 21.101490] Modules linked in:
    [ 21.101921] CPU: 2 PID: 2075 Comm: syz-executor.0 Not tainted 5.0.0 #146
    [ 21.102784] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
    [ 21.103887] RIP: 0010:__alloc_pages_nodemask+0x2a0/0x630
    [ 21.104640] Code: fe ff ff 65 48 8b 04 25 c0 de 01 00 48 05 90 0f 00 00 41 bd 01 00 00 00 48 89 44 24 48 e9 9c fe 3
    [ 21.107121] RSP: 0018:ffff88805e1cf920 EFLAGS: 00010246
    [ 21.107819] RAX: 0000000000000000 RBX: ffffffff85a488a0 RCX: 0000000000000000
    [ 21.108753] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000000
    [ 21.109699] RBP: 1ffff1100bc39f28 R08: ffffed100bcefb67 R09: ffffed100bcefb67
    [ 21.110646] R10: 0000000000000001 R11: ffffed100bcefb66 R12: 000000000000000d
    [ 21.111623] R13: 0000000000000000 R14: ffff88805e77d888 R15: 000000000000000d
    [ 21.112552] FS: 00007f7c7de05700(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000
    [ 21.113612] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 21.114405] CR2: 000000000065c000 CR3: 000000005e58e006 CR4: 00000000001606e0
    [ 21.115367] Call Trace:
    [ 21.115705] ? __alloc_pages_slowpath+0x21c0/0x21c0
    [ 21.116362] alloc_pages_current+0xac/0x1e0
    [ 21.116923] kmalloc_order+0x18/0x70
    [ 21.117393] kmalloc_order_trace+0x18/0x110
    [ 21.117949] packet_set_ring+0x9d5/0x1770
    [ 21.118524] ? packet_rcv_spkt+0x440/0x440
    [ 21.119094] ? lock_downgrade+0x620/0x620
    [ 21.119646] ? __might_fault+0x177/0x1b0
    [ 21.120177] packet_setsockopt+0x981/0x2940
    [ 21.120753] ? __fget+0x2fb/0x4b0
    [ 21.121209] ? packet_release+0xab0/0xab0
    [ 21.121740] ? sock_has_perm+0x1cd/0x260
    [ 21.122297] ? selinux_secmark_relabel_packet+0xd0/0xd0
    [ 21.123013] ? __fget+0x324/0x4b0
    [ 21.123451] ? selinux_netlbl_socket_setsockopt+0x101/0x320
    [ 21.124186] ? selinux_netlbl_sock_rcv_skb+0x3a0/0x3a0
    [ 21.124908] ? __lock_acquire+0x529/0x3200
    [ 21.125453] ? selinux_socket_setsockopt+0x5d/0x70
    [ 21.126075] ? __sys_setsockopt+0x131/0x210
    [ 21.126533] ? packet_release+0xab0/0xab0
    [ 21.127004] __sys_setsockopt+0x131/0x210
    [ 21.127449] ? kernel_accept+0x2f0/0x2f0
    [ 21.127911] ? ret_from_fork+0x8/0x50
    [ 21.128313] ? do_raw_spin_lock+0x11b/0x280
    [ 21.128800] __x64_sys_setsockopt+0xba/0x150
    [ 21.129271] ? lockdep_hardirqs_on+0x37f/0x560
    [ 21.129769] do_syscall_64+0x9f/0x450
    [ 21.130182] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    We should allocate with __GFP_NOWARN to handle this.

    Cc: Kal Conley
    Cc: Andrey Konovalov
    Fixes: fc62814d690c ("net/packet: fix 4gb buffer limit due to overflow check")
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

19 Mar, 2019

2 commits

  • I am using "protocol ip" filters in TC to manipulate TC flower
    classifiers, which are only available with "protocol ip". However,
    I faced an issue that packets sent via raw sockets with ETH_P_ALL
    did not match the ip filters even if they did satisfy the condition
    (e.g., DHCP offer from dhcpd).

    I have determined that the behavior was caused by an unexpected
    value stored in skb->protocol, namely, ETH_P_ALL instead of ETH_P_IP,
    when packets were sent via raw sockets with ETH_P_ALL set.

    IMHO, storing ETH_P_ALL in skb->protocol is not appropriate for
    packets sent via raw sockets because ETH_P_ALL is not a real ether
    type used on wire, but a virtual one.

    This patch fixes the tx protocol selection in cases of transmission
    via raw sockets created with ETH_P_ALL so that it asks the driver to
    extract protocol from the Ethernet header.

    Fixes: 75c65772c3 ("net/packet: Ask driver for protocol if not provided by user")
    Signed-off-by: Yoshiki Komachi
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Yoshiki Komachi
     
  • When using fanouts with AF_PACKET, the demux functions such as
    fanout_demux_cpu will return an index in the fanout socket array, which
    corresponds to the selected socket.

    The ordering of this array depends on the order the sockets were added
    to a given fanout group, so for FANOUT_CPU this means sockets are bound
    to cpus in the order they are configured, which is OK.

    However, when stopping then restarting the interface these sockets are
    bound to, the sockets are reassigned to the fanout group in the reverse
    order, due to the fact that they were inserted at the head of the
    interface's AF_PACKET socket list.

    This means that traffic that was directed to the first socket in the
    fanout group is now directed to the last one after an interface restart.

    In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
    socket that used to receive traffic from the last CPU after an interface
    restart.

    This commit introduces a helper to add a socket at the tail of a list,
    then uses it to register AF_PACKET sockets.

    Note that this changes the order in which sockets are listed in /proc and
    with sock_diag.

    Fixes: dc99f600698d ("packet: Add fanout support")
    Signed-off-by: Maxime Chevallier
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Maxime Chevallier
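
The ordering effect described above follows directly from head versus tail insertion. A minimal singly linked list sketch, with hypothetical names `struct node`, `struct list`, `add_head`, and `add_tail`, shows the difference; it is not the kernel's list helper.

```c
#include <stddef.h>

struct node {
    int id;
    struct node *next;
};

struct list {
    struct node *head;
    struct node *tail;
};

/* Head insertion: walking the list yields the reverse of insertion order. */
static void add_head(struct list *l, struct node *n)
{
    n->next = l->head;
    l->head = n;
    if (l->tail == NULL)
        l->tail = n;
}

/* Tail insertion: walking the list yields the insertion order. */
static void add_tail(struct list *l, struct node *n)
{
    n->next = NULL;
    if (l->tail != NULL)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
}
```

If members are re-registered by walking such a list, head insertion hands them back in reverse, which is the swap between first and last socket that the commit describes.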
     

23 Feb, 2019

3 commits

  • c72219b75f introduced tpacket_set_protocol that parses the Ethernet L2
    header and sets skb->protocol if it's unset. It is no longer needed
    since the introduction of packet_parse_headers. In case of SOCK_RAW and
    unset skb->protocol, packet_parse_headers asks the driver to tell the
    protocol number, and it's implemented for all Ethernet devices. As the
    old function supported only Ethernet, no functionality is lost.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Maxim Mikityanskiy
     
  • If a socket was created with socket(AF_PACKET, SOCK_RAW, 0), the
    protocol number is unavailable. Try to ask the driver to extract it from
    the L2 header in order for skb_try_probe_transport_header to succeed.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Maxim Mikityanskiy
     
  • If the socket was created with socket(AF_PACKET, SOCK_RAW, 0),
    skb->protocol will be unset, __skb_flow_dissect() will fail, and
    skb_probe_transport_header() will fall back to the offset_hint, making
    the resulting skb_transport_offset incorrect.

    If, however, there is no transport header in the packet,
    transport_header shouldn't be set to an arbitrary value.

    Fix it by leaving the transport offset unset if it couldn't be found, to
    be explicit rather than to fill it with some wrong value. It changes the
    behavior, but if some code relied on the old behavior, it would be
    broken anyway, as the old one is incorrect.

    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Maxim Mikityanskiy
     

13 Feb, 2019

1 commit

  • When calculating rb->frames_per_block * req->tp_block_nr the result
    can overflow. Check it for overflow without limiting the total buffer
    size to UINT_MAX.

    This change fixes support for packet ring buffers >= UINT_MAX.

    Fixes: 8f8d28e4d6d8 ("net/packet: fix overflow in check for tp_frame_nr")
    Signed-off-by: Kal Conley
    Signed-off-by: David S. Miller

    Kal Conley
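
An overflow check on the product that does not cap the total buffer size can be sketched with a division test before the multiplication. `frame_count_ok` and its parameter names are hypothetical; the sketch assumes all three quantities are unsigned int.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Verify frames_per_block * block_nr == frame_nr without letting the
 * multiplication wrap. Only the frame count is bounded by UINT_MAX here;
 * the total byte size of the ring is not limited by this check.
 */
static bool frame_count_ok(unsigned int frames_per_block,
                           unsigned int block_nr,
                           unsigned int frame_nr)
{
    if (block_nr == 0)
        return false;
    if (frames_per_block > UINT_MAX / block_nr)
        return false; /* the product would overflow */
    return frames_per_block * block_nr == frame_nr;
}
```

Without the division test, a wrapped product could equal a small frame_nr by accident and pass the comparison.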
     

18 Jan, 2019

1 commit

  • Since commit cb9f1b783850, scapy (which uses an AF_PACKET socket in
    SOCK_RAW mode) is unable to send a basic icmp packet over a sit tunnel:

    Here is an example of the setup:
    $ ip link set ntfp2 up
    $ ip addr add 10.125.0.1/24 dev ntfp2
    $ ip tunnel add tun1 mode sit ttl 64 local 10.125.0.1 remote 10.125.0.2 dev ntfp2
    $ ip addr add fd00:cafe:cafe::1/128 dev tun1
    $ ip link set dev tun1 up
    $ ip route add fd00:200::/64 dev tun1
    $ scapy
    >>> p = []
    >>> p += IPv6(src='fd00:100::1', dst='fd00:200::1')/ICMPv6EchoRequest()
    >>> send(p, count=1, inter=0.1)
    >>> quit()
    $ ip -s link ls dev tun1 | grep -A1 "TX.*errors"
    TX: bytes  packets  errors  dropped  carrier  collsns
    0          0        1       0        0        0

    The problem is that the network offset is set to the hard_header_len of the
    output device (tun1, ie 14 + 20) and in our case, because the packet is
    small (48 bytes) the pskb_inet_may_pull() fails (it tries to pull 40 bytes
    (ipv6 header) starting from the network offset).

    More generally, this problem affects devices with a variable hard header
    length. To avoid an overly intrusive patch in the current release, this
    patch proposes an (ugly) workaround. It has to be cleaned up in net-next.

    Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=993675a3100b1
    Link: http://patchwork.ozlabs.org/patch/1024489/
    Fixes: cb9f1b783850 ("ip: validate header length on virtual device xmit")
    CC: Willem de Bruijn
    CC: Maxim Mikityanskiy
    Signed-off-by: Nicolas Dichtel
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

09 Jan, 2019

1 commit

    'dev' is non-NULL when the addr_len check triggers, so it must goto a
    label that does the dev_put; otherwise dev will have a leaked refcount.

    This bug makes the ib_ipoib module impossible to unload when using
    systemd-network, as it triggers this check on InfiniBand links.

    Fixes: 99137b7888f4 ("packet: validate address length")
    Reported-by: Leon Romanovsky
    Signed-off-by: Jason Gunthorpe
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Jason Gunthorpe
     

23 Dec, 2018

1 commit