16 Oct, 2020

2 commits

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing to limit
    stack traversal in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multipath TCP (MPTCP) support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require many offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings; reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     
  • Minor conflicts in net/mptcp/protocol.h and
    tools/testing/selftests/net/Makefile.

    In both cases code was added on both sides in the same place
    so just keep both.

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

15 Oct, 2020

1 commit

  • As per RFC4443, the destination address field for ICMPv6 error messages
    is copied from the source address field of the invoking packet.

    In configurations with Virtual Routing and Forwarding tables, looking up
    which routing table to use for sending ICMPv6 error messages is
    currently done by using the destination net_device.

    If the source and destination interfaces are within separate VRFs, or
    one in the global routing table and the other in a VRF, looking up the
    source address of the invoking packet in the destination interface's
    routing table will fail if the destination interface's routing table
    contains no route to the invoking packet's source address.

    One observable effect of this issue is that traceroute6 does not work in
    the following cases:

    - Route leaking between global routing table and VRF
    - Route leaking between VRFs

    Use the source device routing table when sending ICMPv6 error
    messages.

    [ In the context of ipv4, it has been pointed out that a similar issue
    may exist with ICMP errors triggered when forwarding between network
    namespaces. It would be worthwhile to investigate whether ipv6 has
    similar issues, but is outside of the scope of this investigation. ]

    [ Testing shows that similar issues exist with ipv6 unreachable /
    fragmentation needed messages. However, investigation of this
    additional failure mode is beyond this investigation's scope. ]

    Link: https://tools.ietf.org/html/rfc4443
    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski

    Mathieu Desnoyers
     

19 Sep, 2020

1 commit


21 Aug, 2020

1 commit


14 Jul, 2020

1 commit


25 Feb, 2020

1 commit


05 Dec, 2019

1 commit

  • This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow,
    as some modules currently pass a net argument without a socket to
    ip6_dst_lookup. This is equivalent to commit 343d60aada5a ("ipv6: change
    ipv6_stub_impl.ipv6_dst_lookup to take net argument").

    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     

16 Nov, 2019

1 commit

  • Instead of generally passing NULL to NF_HOOK_COND() for input device,
    pass skb->dev which contains input device for routed skbs.

    Note that iptables (both legacy and nft) reject rules with input
    interface match from being added to POSTROUTING chains, but nftables
    allows this.

    Cc: Eric Garver
    Signed-off-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Phil Sutter
     

19 Oct, 2019

1 commit

  • Thomas found that some forwarded packets would be stuck
    in FQ packet scheduler because their skb->tstamp contained
    timestamps far in the future.

    We thought we addressed this point in commit 8203e2d844d3
    ("net: clear skb->tstamp in forwarding paths") but there
    is still an issue when/if a packet needs to be fragmented.

    In order to meet EDT requirements, we have to make sure all
    fragments get the original skb->tstamp.

    Note that this original skb->tstamp should be zero in
    forwarding path, but might have a non zero value in
    output path if user decided so.
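A minimal user-space sketch of the rule described above, assuming a simplified packet record. `struct pkt` and `split_keep_tstamp()` are hypothetical names used only for illustration; they are not the kernel's sk_buff or its fragmentation code. The point shown is that every fragment inherits the timestamp of the packet it was cut from, whether that value is zero (forwarding path) or was set by the sender (output path):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified packet: only the fields this sketch needs. */
struct pkt {
    uint64_t tstamp; /* departure time; 0 means "not set" */
    size_t len;      /* payload length in bytes */
};

/*
 * Split orig into fragments of at most mtu payload bytes and store them
 * in out[0..max-1]. Every fragment is given orig->tstamp.
 * Returns the number of fragments, or 0 if mtu is 0 or out is too small.
 */
static size_t split_keep_tstamp(const struct pkt *orig, size_t mtu,
                                struct pkt *out, size_t max)
{
    size_t left = orig->len;
    size_t n = 0;

    if (mtu == 0)
        return 0;

    while (left > 0) {
        size_t chunk = left < mtu ? left : mtu;

        if (n == max)
            return 0; /* caller did not provide enough room */

        out[n].len = chunk;
        out[n].tstamp = orig->tstamp; /* propagate the original value */
        n++;
        left -= chunk;
    }
    return n;
}
```

With a 2500-byte packet and a 1000-byte limit this yields three fragments that all carry the same timestamp, so a scheduler keyed on departure time sees them consistently.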

    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Eric Dumazet
    Reported-by: Thomas Bartschies
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2019

1 commit

  • Currently, ip6_xmit() sets skb->priority based on sk->sk_priority.

    This is not desirable for TCP since TCP shares the same ctl socket
    for a given netns. We want to be able to send RST or ACK packets
    with a non zero skb->priority.

    This patch has no functional change.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Sep, 2019

1 commit

  • Enable setting skb->mark for UDP and RAW sockets using cmsg.

    This is analogous to existing support for TOS, TTL, txtime, etc.

    Packet sockets already support this as of commit c7d39e32632e
    ("packet: support per-packet fwmark for af_packet sendmsg").

    Similar to other fields, implement by
    1. initialize the sockcm_cookie.mark from socket option sk_mark
    2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
    3. initialize inet_cork.mark from sockcm_cookie.mark
    4. initialize each (usually just one) skb->mark from inet_cork.mark

    Step 1 is handled in one location for most protocols by ipcm_init_sk
    as of commit 351782067b6b ("ipv4: ipcm_cookie initializers").

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

28 Jun, 2019

1 commit

  • The new route handling in ip_mc_finish_output() from 'net' overlapped
    with the new support for returning congestion notifications from BPF
    programs.

    In order to handle this I had to take the dev_loopback_xmit() calls
    out of the switch statement.

    The aquantia driver conflicts were simple overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Jun, 2019

1 commit

  • There is no functional change in this patch, it only prepares the next one.

    rt6_nexthop() will be used by ip6_dst_lookup_neigh(), which uses const
    variables.

    Signed-off-by: Nicolas Dichtel
    Reported-by: kbuild test robot
    Acked-by: Nick Desaulniers
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

18 Jun, 2019

1 commit


12 Jun, 2019

1 commit

  • The below patch fixes an incorrect zerocopy refcnt increment when
    appending with MSG_MORE to an existing zerocopy udp skb.

    send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt 1
    send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt still 1 (bar frags)

    But it missed that zerocopy need not be passed at the first send. The
    right test whether the uarg is newly allocated and thus has extra
    refcnt 1 is not !skb, but !skb_zcopy.

    send(.., MSG_MORE); //
    send(.., MSG_ZEROCOPY); // refcnt 1

    Fixes: 100f6d8e09905 ("net: correct zerocopy refcnt with udp MSG_MORE")
    Reported-by: syzbot
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

08 Jun, 2019

1 commit


04 Jun, 2019

1 commit

  • syzbot reported nasty use-after-free [1]

    Let's remove the frag_list field from structs ip_fraglist_iter
    and ip6_fraglist_iter. It seems not to be needed anyway.

    [1] :
    BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
    Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

    CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
    __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    kasan_report+0x12/0x20 mm/kasan/common.c:614
    __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
    kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
    ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
    __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
    ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
    dst_output include/net/dst.h:433 [inline]
    ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
    ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
    ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
    rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
    rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x44add9
    Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
    RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
    RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
    R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

    Allocated by task 8947:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_kmalloc mm/kasan/common.c:489 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
    slab_post_alloc_hook mm/slab.h:437 [inline]
    slab_alloc_node mm/slab.c:3269 [inline]
    kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
    __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
    alloc_skb include/linux/skbuff.h:1058 [inline]
    __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
    ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
    rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 8947:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
    __cache_free mm/slab.c:3432 [inline]
    kmem_cache_free+0x86/0x260 mm/slab.c:3698
    kfree_skbmem net/core/skbuff.c:625 [inline]
    kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
    __kfree_skb net/core/skbuff.c:682 [inline]
    kfree_skb net/core/skbuff.c:699 [inline]
    kfree_skb+0xf0/0x390 net/core/skbuff.c:693
    kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
    __dev_xmit_skb net/core/dev.c:3551 [inline]
    __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
    dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
    neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
    ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
    __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
    ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
    dst_output include/net/dst.h:433 [inline]
    ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
    ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
    ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
    rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
    rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff888085a3cbc0
    which belongs to the cache skbuff_head_cache of size 224
    The buggy address is located 0 bytes inside of
    224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
    The buggy address belongs to the page:
    page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
    flags: 0x1fffc0000000200(slab)
    raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
    raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
    >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ^
    ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

    Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
    Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
    Signed-off-by: Eric Dumazet
    Cc: Pablo Neira Ayuso
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jun, 2019

3 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2019-05-31

    The following pull-request contains BPF updates for your *net-next* tree.

    Lots of exciting new features in the first PR of this development cycle!
    The main changes are:

    1) misc verifier improvements, from Alexei.

    2) bpftool can now convert btf to valid C, from Andrii.

    3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
    This feature greatly improves BPF speed on 32-bit architectures. From Jiong.

    4) cgroups will now auto-detach bpf programs. This fixes the issue of
    thousands of bpf programs getting stuck in dying cgroups. From Roman.

    5) new bpf_send_signal() helper, from Yonghong.

    6) cgroup inet skb programs can signal CN to the stack, from Lawrence.

    7) miscellaneous cleanups, from many developers.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
    congestion notifications from the BPF programs.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    brakmo
     
  • The phylink conflict was between a bug fix by Russell King
    to make sure we have a consistent PHY interface mode, and
    a change in net-next to pull some code in phylink_resolve()
    into the helper functions phylink_mac_link_{up,down}()

    On the dp83867 side it's mostly overlapping changes, with
    the 'net' side removing a condition that was supposed to
    trigger for RGMII but because of how it was coded never
    actually could trigger.

    Signed-off-by: David S. Miller

    David S. Miller
     

31 May, 2019

5 commits

  • Pull yet more SPDX updates from Greg KH:
    "Here is another set of reviewed patches that adds SPDX tags to
    different kernel files, based on a set of rules that are being used to
    parse the comments to try to determine that the license of the file is
    "GPL-2.0-or-later" or "GPL-2.0-only". Only the "obvious" versions of
    these matches are included here, a number of "non-obvious" variants of
    text have been found but those have been postponed for later review
    and analysis.

    There is also a patch in here to add the proper SPDX header to a bunch
    of Kbuild files that we have missed in the past due to new files being
    added and forgetting that Kbuild uses two different file names for
    Makefiles. This issue was reported by the Kbuild maintainer.

    These patches have been out for review on the linux-spdx@vger mailing
    list, and while they were created by automatic tools, they were
    hand-verified by a bunch of different people, all of whose names are on
    the patches as reviewers"

    * tag 'spdx-5.2-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (82 commits)
    treewide: Add SPDX license identifier - Kbuild
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 225
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 224
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 223
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 222
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 221
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 220
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 218
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 217
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 216
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 214
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 213
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 211
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 210
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 209
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 207
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 203
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201
    ...

    Linus Torvalds
     
  • TCP zerocopy takes a uarg reference for every skb, plus one for the
    tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
    as it builds, sends and frees skbs inside its inner loop.

    UDP and RAW zerocopy do not send inside the inner loop so do not need
    the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
    52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
    extra_uref to pass the initial reference taken in sock_zerocopy_alloc
    to the first generated skb.

    But, sock_zerocopy_realloc takes this extra reference at the start of
    every call. With MSG_MORE, no new skb may be generated to attach the
    extra_uref to, so refcnt is incorrectly 2 with only one skb.

    Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
    Update extra_uref accordingly.

    This conditional assignment triggers a false positive "may be used
    uninitialized" warning, so extra_uref has to be initialized at its
    definition.

    Changes v1->v2: fix typo in Fixes SHA1

    Fixes: 52900d22288e7 ("udp: elide zerocopy operation in hot path")
    Reported-by: syzbot
    Diagnosed-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • This patch exposes a new API to refragment a skbuff. This allows you to
    split either a linear skbuff or to force the refragmentation of an
    existing fraglist using a different mtu. The API consists of:

    * ip6_frag_init(), that initializes the internal state of the transformer.
    * ip6_frag_next(), that allows you to fetch the next fragment. This function
    internally allocates the skbuff that represents the fragment, it pushes
    the IPv6 header, and it also copies the payload for each fragment.

    The ip6_frag_state object stores the internal state of the splitter.

    This code has been extracted from ip6_fragment(). Symbols are also
    exported to allow to reuse this iterator from the bridge codepath to
    build its own refragmentation routine by reusing the existing codebase.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • This patch adds the skbuff fraglist split iterator. This API provides an
    iterator to transform the fraglist into single skbuff objects, it
    consists of:

    * ip6_fraglist_init(), that initializes the internal state of the
    fraglist iterator.
    * ip6_fraglist_prepare(), that restores the IPv6 header on the fragment.
    * ip6_fraglist_next(), that retrieves the fragment from the fraglist and
    updates the internal state of the iterator to point to the next
    fragment in the fraglist.

    The ip6_fraglist_iter object stores the internal state of the iterator.

    This code has been extracted from ip6_fragment(). Symbols are also
    exported to allow to reuse this iterator from the bridge codepath to
    build its own refragmentation routine by reusing the existing codebase.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

09 Apr, 2019

1 commit

  • A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
    entry will exist in the v6 ndisc table and the cached header will contain
    the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
    use the v6 neighbor entry, neigh_output needs to skip the cached header
    and just use the output callback for the neigh entry.

    A future patchset can look at expanding the hh_cache to handle 2
    protocols. For now, IPv6 gateways with an IPv4 route will take the
    extra overhead of generating the header.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

04 Apr, 2019

1 commit

  • At the beginning of ip6_fragment func, the prevhdr pointer is
    obtained in the ip6_find_1stfragopt func.
    However, all the pointers pointing into skb header may change
    when calling skb_checksum_help func with
    skb->ip_summed = CHECKSUM_PARTIAL condition.
    The prevhdr pointer will be dangling if it is not reloaded after
    calling __skb_linearize func in skb_checksum_help func.

    Here, I add a variable, nexthdr_offset, to evaluate the offset,
    which does not change even after calling __skb_linearize func.
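The same idea can be shown in user space: remember an offset into a buffer rather than a pointer, and rebuild the pointer after any call that may move the buffer. This is a simplified sketch under that assumption. `struct buf`, `buf_grow()` and `set_byte_after_grow()` are hypothetical names, and `buf_grow()` merely stands in for an operation such as linearizing that can reallocate the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    unsigned char *data;
    size_t len;
};

/*
 * Hypothetical stand-in for a call that may relocate the data.
 * It always moves the data to model the worst case.
 */
static int buf_grow(struct buf *b, size_t new_len)
{
    unsigned char *p;

    if (new_len < b->len)
        return -1;
    p = malloc(new_len);
    if (!p)
        return -1;
    memcpy(p, b->data, b->len);
    memset(p + b->len, 0, new_len - b->len);
    free(b->data); /* any pointer into the old data is now dangling */
    b->data = p;
    b->len = new_len;
    return 0;
}

/*
 * Write value at offset off, even though the buffer is relocated first.
 * The offset stays valid across the move; a saved pointer would not.
 */
static int set_byte_after_grow(struct buf *b, size_t off, unsigned char value)
{
    if (off >= b->len)
        return -1;
    if (buf_grow(b, b->len * 2) != 0)
        return -1;
    b->data[off] = value; /* pointer recomputed from the new base */
    return 0;
}
```

Saving `b->data + off` before `buf_grow()` and writing through it afterwards would be the user-space analogue of the bug described above.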

    Fixes: 405c92f7a541 ("ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment")
    Signed-off-by: Junwei Hu
    Reported-by: Wenhao Zhang
    Reported-by: syzbot+e8ce541d095e486074fc@syzkaller.appspotmail.com
    Reviewed-by: Zhiqiang Liu
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Junwei Hu
     

04 Mar, 2019

1 commit

  • By default IPv6 socket with IPV6_ROUTER_ALERT socket option set will
    receive all IPv6 RA packets from all namespaces.
    IPV6_ROUTER_ALERT_ISOLATE socket option restricts packets received by
    the socket to be only from the socket's namespace.

    Signed-off-by: Maxim Martynov
    Signed-off-by: Francesco Ruggeri
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Francesco Ruggeri
     

21 Dec, 2018

1 commit


20 Dec, 2018

1 commit

  • This adds an optional extension infrastructure, with ipsec (xfrm) and
    bridge netfilter as first users.
    objdiff shows no changes if kernel is built without xfrm and br_netfilter
    support.

    The third (planned future) user is Multipath TCP which is still
    out-of-tree.
    MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
    numbers used by individual subflows.

    This DSS mapping is read/written from tcp option space on receive and
    written to tcp option space on transmitted tcp packets that are part of
    an MPTCP connection.

    Extending skb_shared_info or adding a private data field to skb fclones
    doesn't work for incoming skb, so a different DSS propagation method would
    be required for the receive side.

    mptcp has same requirements as secpath/bridge netfilter:

    1. extension memory is released when the sk_buff is free'd.
    2. data is shared after cloning an skb (clone inherits extension)
    3. adding extension to an skb will COW the extension buffer if needed.

    The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
    mapping for tx and rx processing.

    Two new members are added to sk_buff:
    1. 'active_extensions' byte (filling a hole), telling which extensions
    are available for this skb.
    This has two purposes.
    a) avoids the need to initialize the pointer.
    b) allows to "delete" an extension by clearing its bit
    value in ->active_extensions.

    While it would be possible to store the active_extensions byte
    in the extension struct instead of sk_buff, there is one problem
    with this:
    When an extension has to be disabled, we can always clear the
    bit in skb->active_extensions. But in case it would be stored in the
    extension buffer itself, we might have to COW it first, if
    we are dealing with a cloned skb. On kmalloc failure we would
    be unable to turn an extension off.

    2. extension pointer, located at the end of the sk_buff.
    If the active_extensions byte is 0, the pointer is undefined,
    it is not initialized on skb allocation.
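
    The interplay between the 'active_extensions' byte and the shared
    extension pointer can be illustrated with a small userspace model. This
    is only a sketch: every toy_* name is hypothetical, the refcount is a
    plain int (single-threaded), and the kernel's real data layout is not
    reproduced. It shows the three requirements listed earlier (release on
    free, sharing on clone, COW on add) and that deleting an extension only
    clears a bit, so it needs no allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical extension ids, for illustration only. */
enum toy_ext_id { TOY_EXT_A, TOY_EXT_B, TOY_EXT_NUM };

/* Shared, reference-counted extension area (plain int: single-threaded toy). */
struct toy_ext {
    int refcnt;
    int data[TOY_EXT_NUM];
};

struct toy_skb {
    uint8_t active_extensions;  /* bit i set => extension i is valid */
    struct toy_ext *ext;        /* meaningful only if active_extensions != 0 */
};

static void toy_ext_put(struct toy_ext *e)
{
    assert(e->refcnt > 0);
    if (--e->refcnt == 0)
        free(e);
}

static int toy_ext_exist(const struct toy_skb *skb, enum toy_ext_id id)
{
    return (skb->active_extensions >> id) & 1;
}

/* Add or overwrite an extension; copy the area first if it is shared (COW). */
static int toy_ext_add(struct toy_skb *skb, enum toy_ext_id id, int value)
{
    struct toy_ext *e;

    if (skb->active_extensions == 0) {
        e = calloc(1, sizeof(*e));
        if (!e)
            return -1;
        e->refcnt = 1;
    } else if (skb->ext->refcnt > 1) {
        e = malloc(sizeof(*e));
        if (!e)
            return -1;
        *e = *skb->ext;          /* private copy of the shared area */
        e->refcnt = 1;
        toy_ext_put(skb->ext);   /* drop our reference to the shared one */
    } else {
        e = skb->ext;
    }
    e->data[id] = value;
    skb->ext = e;
    skb->active_extensions |= (uint8_t)(1u << id);
    return 0;
}

/* Deleting only clears a bit in the skb itself, so it cannot fail. */
static void toy_ext_del(struct toy_skb *skb, enum toy_ext_id id)
{
    if (!toy_ext_exist(skb, id))
        return;
    skb->active_extensions &= (uint8_t)~(1u << id);
    if (skb->active_extensions == 0)
        toy_ext_put(skb->ext);
}

/* A clone inherits the extension area by taking a reference. */
static void toy_skb_clone(struct toy_skb *dst, const struct toy_skb *src)
{
    *dst = *src;
    if (dst->active_extensions)
        dst->ext->refcnt++;
}

static void toy_skb_free(struct toy_skb *skb)
{
    if (skb->active_extensions)
        toy_ext_put(skb->ext);
    skb->active_extensions = 0;
}
```

    In this model, deleting an extension on a clone never touches the
    shared area's contents, which mirrors the argument above for keeping
    the byte in the sk_buff rather than in the extension buffer.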

    This adds extra code to skb clone and free paths (to deal with
    refcount/free of extension area) but this replaces similar code that
    manages skb->nf_bridge and skb->sp structs in the followup patches of
    the series.

    It is possible to add support for extensions that are not preserved on
    clones/copies.

    To do this, it would be needed to define a bitmask of all extensions that
    need copy/cow semantics, and change __skb_ext_copy() to check
    ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
    ->active_extensions to 0 on the new clone.

    This isn't done here because all extensions that get added here
    need the copy/cow semantics.

    v2:
    Allocate entire extension space using kmem_cache.
    Upside is that this allows better tracking of used memory,
    downside is that we will allocate more space than strictly needed in
    most cases (it's unlikely that all extensions are active/needed at same
    time for same skb).
    The allocated memory (except the small extension header) is not cleared,
    so no additional overhead aside from memory usage.

    Avoid atomic_dec_and_test operation on skb_ext_put()
    by using similar trick as kfree_skbmem() does with fclone_ref:
    If refcount is 1, there is no concurrent user and we can free right away.
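
    The refcount shortcut described above can be sketched in userspace with
    C11 atomics. The names toy_ext and toy_ext_put are hypothetical and the
    memory orderings are an assumption of this sketch, not a statement
    about the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_ext {
    atomic_int refcnt;
};

/* Returns true when the caller held the last reference and must free. */
static bool toy_ext_put(struct toy_ext *e)
{
    /* Sole owner: no other holder exists that could take another
     * reference concurrently, so the atomic read-modify-write is skipped. */
    if (atomic_load_explicit(&e->refcnt, memory_order_acquire) == 1)
        return true;

    /* Shared: fall back to an atomic decrement. fetch_sub returns the
     * previous value, so 1 means this call dropped the last reference. */
    return atomic_fetch_sub_explicit(&e->refcnt, 1,
                                     memory_order_acq_rel) == 1;
}
```

    Note that the fast path leaves the counter at 1 rather than 0, which is
    harmless because the object is freed immediately afterwards.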

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

16 Dec, 2018

1 commit

  • Sergey reported that forwarding was no longer working
    if fq packet scheduler was used.

    This is caused by the recent switch to EDT model, since incoming
    packets might have been timestamped by __net_timestamp()

    __net_timestamp() uses ktime_get_real(), while fq expects packets
    using CLOCK_MONOTONIC base.

    The fix is to clear skb->tstamp in forwarding paths.
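
    A toy model of the clock mismatch, as a sketch only: the toy_* names
    are hypothetical, and the scheduler rule (a zero timestamp means "send
    now", anything else is a departure time on the monotonic clock) is a
    simplification assumed for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct toy_pkt {
    int64_t tstamp_ns;  /* 0 means "no departure time requested" */
};

/* Receive path: stamp the packet with wall-clock (realtime) nanoseconds. */
static void toy_rx_timestamp(struct toy_pkt *p, int64_t realtime_ns)
{
    p->tstamp_ns = realtime_ns;
}

/* Forwarding path: drop the receive timestamp so the scheduler does not
 * misread a wall-clock value as a monotonic departure time. */
static void toy_forward(struct toy_pkt *p)
{
    p->tstamp_ns = 0;
}

/* Scheduler: how long the packet would be held back, in nanoseconds. */
static int64_t toy_sched_delay(const struct toy_pkt *p, int64_t mono_now_ns)
{
    if (p->tstamp_ns == 0 || p->tstamp_ns <= mono_now_ns)
        return 0;
    return p->tstamp_ns - mono_now_ns;
}
```

    With a wall-clock stamp of roughly 1.5e18 ns and a monotonic clock of
    a few thousand seconds, the uncleared packet would be held for decades
    in this model; after toy_forward() the delay is zero.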

    Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Eric Dumazet
    Reported-by: Sergey Matyukevich
    Tested-by: Sergey Matyukevich
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Dec, 2018

1 commit

  • Several conflicts, seemingly all over the place.

    I used Stephen Rothwell's sample resolutions for many of these, if not
    just to double check my own work, so definitely the credit largely
    goes to him.

    The NFP conflict consisted of a bug fix (moving operations
    past the rhashtable operation) while changing the initial
    argument in the function call in the moved code.

    The net/dsa/master.c conflict had to do with a bug fix intermixing of
    making dsa_master_set_mtu() static with the fixing of the tagging
    attribute location.

    cls_flower had a conflict because the dup reject fix from Or
    overlapped with the addition of port range classification.

    __set_phy_supported()'s conflict was relatively easy to resolve
    because Andrew fixed it in both trees, so it was just a matter
    of taking the net-next copy. Or at least I think it was :-)

    Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
    intermixed with changes on how the sdif and caller_net are calculated
    in these code paths in net-next.

    The remaining BPF conflicts were largely about the addition of the
    __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
    to the relevant data structure where the MD pointer macros are used.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Dec, 2018

1 commit

  • extra_uref is used in __ip(6)_append_data only if uarg is set.

    Smatch sees that the variable is passed to sock_zerocopy_put_abort.
    This function accesses it only when uarg is set, but smatch cannot
    infer this.

    Make this dependency explicit.

    Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

08 Dec, 2018

1 commit

  • Even if we send an IPv6 packet without options, MAX_HEADER might not be
    enough to account for the additional headroom required by alignment of
    hardware headers.

    On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
    sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
    100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
    bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().

    Those would be enough to append our 14 bytes header, but we're going to
    align that to 16 bytes, and write 2 bytes out of the allocated slab in
    neigh_hh_output().
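
    The arithmetic behind the 2-byte overrun can be checked with a short
    sketch. The helper names are hypothetical; the only assumption taken
    from the text is that the hardware header length is rounded up to a
    multiple of 16 bytes before being written in front of the packet data.

```c
#include <assert.h>

/* Round a hardware header length up to the next multiple of 16 bytes. */
static unsigned toy_hh_align(unsigned len)
{
    return (len + 15u) & ~15u;
}

/* Bytes written before the start of the buffer, given the available
 * headroom and the unaligned hardware header length. */
static unsigned toy_overrun(unsigned headroom, unsigned hh_len)
{
    unsigned need = toy_hh_align(hh_len);

    return need > headroom ? need - headroom : 0;
}
```

    With 14 bytes of headroom and a 14-byte header, toy_hh_align(14) is 16
    and toy_overrun(14, 14) is 2, matching the numbers above.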

    KASan says:

    [ 264.967848] ==================================================================
    [ 264.967861] BUG: KASAN: slab-out-of-bounds in ip6_finish_output2+0x1aec/0x1c70
    [ 264.967866] Write of size 16 at addr 000000006af1c7fe by task netperf/6201
    [ 264.967870]
    [ 264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
    [ 264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
    [ 264.967887] Call Trace:
    [ 264.967896] ([] show_stack+0x56/0xa0)
    [ 264.967903] [] dump_stack+0x23c/0x290
    [ 264.967912] [] print_address_description+0xf4/0x290
    [ 264.967919] [] kasan_report+0x13c/0x240
    [ 264.967927] [] ip6_finish_output2+0x1aec/0x1c70
    [ 264.967935] [] ip6_finish_output+0x430/0x7f0
    [ 264.967943] [] ip6_output+0x1f4/0x580
    [ 264.967953] [] ip6_xmit+0xfea/0x1ce8
    [ 264.967963] [] inet6_csk_xmit+0x282/0x3f8
    [ 264.968033] [] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
    [ 264.968037] [] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
    [ 264.968041] [] dev_hard_start_xmit+0x268/0x928
    [ 264.968069] [] sch_direct_xmit+0x7ae/0x1350
    [ 264.968071] [] __dev_queue_xmit+0x2b7c/0x3478
    [ 264.968075] [] ip_finish_output2+0xce2/0x11a0
    [ 264.968078] [] ip_finish_output+0x56c/0x8c8
    [ 264.968081] [] ip_output+0x226/0x4c0
    [ 264.968083] [] __ip_queue_xmit+0x894/0x1938
    [ 264.968100] [] sctp_packet_transmit+0x29d4/0x3648 [sctp]
    [ 264.968116] [] sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
    [ 264.968131] [] sctp_outq_flush+0x22e/0x7d8 [sctp]
    [ 264.968146] [] sctp_cmd_interpreter.isra.16+0x530/0x6800 [sctp]
    [ 264.968161] [] sctp_do_sm+0x222/0x648 [sctp]
    [ 264.968177] [] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
    [ 264.968192] [] __sctp_connect+0x830/0xc20 [sctp]
    [ 264.968208] [] sctp_inet_connect+0x2e6/0x378 [sctp]
    [ 264.968212] [] __sys_connect+0x21a/0x450
    [ 264.968215] [] sys_socketcall+0x3d0/0xb08
    [ 264.968218] [] system_call+0x2a2/0x2c0

    [...]

    Just like ip_finish_output2() does for IPv4, check that we have enough
    headroom in ip6_xmit(), and reallocate it if we don't.
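
    The shape of the fix (check the headroom, reallocate when it is too
    small) can be sketched in userspace as follows. struct toy_buf and
    toy_ensure_headroom() are hypothetical stand-ins and do not reproduce
    the kernel's skb helpers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Payload lives at head + headroom and is len bytes long. */
struct toy_buf {
    unsigned char *head;
    size_t headroom;
    size_t len;
};

/* Make sure at least `need` bytes are free in front of the payload.
 * Returns 0 on success, -1 if the reallocation fails (buffer untouched). */
static int toy_ensure_headroom(struct toy_buf *b, size_t need)
{
    unsigned char *n;

    if (b->headroom >= need)
        return 0;               /* common case: nothing to do */

    n = malloc(need + b->len);
    if (!n)
        return -1;
    memcpy(n + need, b->head + b->headroom, b->len);
    free(b->head);
    b->head = n;
    b->headroom = need;
    return 0;
}
```

    The check is cheap in the common case, so only packets that arrive
    with too little headroom pay for the copy.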

    This issue is older than git history.

    Reported-by: Jianlin Shi
    Signed-off-by: Stefano Brivio
    Signed-off-by: David S. Miller

    Stefano Brivio
     

05 Dec, 2018

1 commit

  • Packets marked with 'offload_l3_fwd_mark' were already forwarded by a
    capable device and should not be forwarded again by the kernel.
    Therefore, have the kernel consume them.

    The check is performed in ip{,6}_forward_finish() in order to allow the
    kernel to process such packets in ip{,6}_forward() and generate required
    exceptions. For example, ICMP redirects.
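
    The placement of the check can be sketched with a toy model: the
    forward step still runs (so exceptions such as redirects are still
    generated), and only the finish step consumes marked packets. All
    toy_* names and the counters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pkt {
    bool offload_l3_fwd_mark;  /* already forwarded by the device */
    bool needs_redirect;       /* forwarding would trigger an ICMP redirect */
};

struct toy_stats {
    int icmp_redirects;
    int forwarded;
    int consumed;
};

/* Finish step: the only place where the mark is checked. */
static void toy_forward_finish(const struct toy_pkt *p, struct toy_stats *s)
{
    if (p->offload_l3_fwd_mark) {
        s->consumed++;         /* hardware already sent it: drop quietly */
        return;
    }
    s->forwarded++;
}

/* Forward step: exceptions are generated before the mark is looked at. */
static void toy_forward(const struct toy_pkt *p, struct toy_stats *s)
{
    if (p->needs_redirect)
        s->icmp_redirects++;
    toy_forward_finish(p, s);
}
```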

    Signed-off-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Ido Schimmel
     

04 Dec, 2018

2 commits

  • With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
    Release of its last reference triggers a completion notification.

    The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
    the skbs, because it can build, send and free skbs within its loop,
    possibly reaching refcount zero and freeing the ubuf_info too soon.

    The UDP stack currently also takes this extra ref, but does not need
    it as all skbs are sent after return from __ip(6)_append_data.

    Avoid the extra refcount_inc and refcount_dec_and_test, and generally
    the sock_zerocopy_put in the common path, by passing the initial
    reference to the first skb.

    This approach is taken instead of initializing the refcount to 0, as
    that would generate error "refcount_t: increment on 0" on the
    next skb_zcopy_set.
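
    The saving can be made concrete with a toy counter model. This is a
    sketch under stated assumptions: toy_uarg and its helpers are
    hypothetical, the refcount is a plain int, and "atomic_ops" merely
    counts the operations that would be atomic in a real implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_uarg {
    int refcnt;
    int atomic_ops;  /* number of refcount inc/dec operations performed */
};

static void toy_get(struct toy_uarg *u) { u->refcnt++; u->atomic_ops++; }
static void toy_put(struct toy_uarg *u) { u->refcnt--; u->atomic_ops++; }

/* Old scheme: the caller keeps its own reference while building skbs. */
static void toy_send_extra_ref(struct toy_uarg *u, int nskbs)
{
    int i;

    u->refcnt = 1;                 /* initial reference, held by caller */
    u->atomic_ops = 0;
    for (i = 0; i < nskbs; i++)
        toy_get(u);                /* one reference per skb */
    toy_put(u);                    /* caller drops its extra reference */
}

/* New scheme: the first skb adopts the initial reference. */
static void toy_send_pass_ref(struct toy_uarg *u, int nskbs)
{
    bool extra_uref = true;        /* initial reference not yet handed over */
    int i;

    u->refcnt = 1;
    u->atomic_ops = 0;
    for (i = 0; i < nskbs; i++) {
        if (extra_uref)
            extra_uref = false;    /* hand over without touching the count */
        else
            toy_get(u);
    }
    if (extra_uref)
        toy_put(u);                /* no skb was built: release it here */
}
```

    For the common single-skb case the old scheme performs two refcount
    operations in this model and the new one performs none, while both
    leave exactly one reference per skb.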

    Changes
    v3 -> v4
    - Move skb_zcopy_set below the only kfree_skb that might cause
    a premature uarg destroy before skb_zerocopy_put_abort
    - Move the entire skb_shinfo assignment block, to keep that
    cacheline access in one place

    Signed-off-by: Willem de Bruijn
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
    interpret flag MSG_ZEROCOPY.
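
    From user space, the feature would be used roughly as sketched below.
    Assumptions: a Linux kernel that includes this change, the fallback
    constant values (which apply to most but not all architectures), and a
    loopback destination on port 9 chosen purely for illustration. The
    helper name is hypothetical, and the sketch does not read the
    completion notification from the socket error queue, which a real
    sender has to do before reusing the buffer.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* assumed value; differs on a few arches */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Returns 0 if one datagram was queued with MSG_ZEROCOPY, -1 otherwise
 * (for example when the running kernel lacks UDP zerocopy support). */
static int toy_udp_zerocopy_send(const void *buf, size_t len)
{
    struct sockaddr_in dst;
    int one = 1;
    int ret = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9);
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* Opt in on the socket first, then request zerocopy per call. */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0 &&
        sendto(fd, buf, len, MSG_ZEROCOPY,
               (struct sockaddr *)&dst, sizeof(dst)) == (ssize_t)len)
        ret = 0;

    close(fd);
    return ret;
}
```

    The buffer passed in must stay unmodified until the kernel signals
    completion, so a long-lived buffer is the safer choice.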

    This patch was previously part of the zerocopy RFC patchsets. Zerocopy
    is not effective at small MTU. With segmentation offload building
    larger datagrams, the benefit of page flipping outweighs the cost of
    generating a completion notification.

    tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
    test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

    ipv4 udp -t 1
    tx=191312 (11938 MB) txc=0 zc=n
    rx=191312 (11938 MB)
    ipv4 udp -z -t 1
    tx=304507 (19002 MB) txc=304507 zc=y
    rx=304507 (19002 MB)
    ok
    ipv6 udp -t 1
    tx=174485 (10888 MB) txc=0 zc=n
    rx=174485 (10888 MB)
    ipv6 udp -z -t 1
    tx=294801 (18396 MB) txc=294801 zc=y
    rx=294801 (18396 MB)
    ok

    Changes
    v1 -> v2
    - Fixup reverse christmas tree violation
    v2 -> v3
    - Split refcount avoidance optimization into separate patch
    - Fix refcount leak on error in fragmented case
    (thanks to Paolo Abeni for pointing this one out!)
    - Fix refcount inc on zero
    - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
    This is needed since commit 5cf4a8532c99 ("tcp: really ignore
    MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

    Signed-off-by: Willem de Bruijn
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

25 Nov, 2018

1 commit

  • In ip packet generation, pagedlen is initialized for each skb at the
    start of the loop in __ip(6)_append_data, before label alloc_new_skb.

    Depending on compiler options, code can be generated that jumps to
    this label, triggering use of an uninitialized variable.

    In practice, at -O2, the generated code moves the initialization below
    the label. But the code should not rely on that for correctness.
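
    The hazard and the safe shape can be sketched with a small function
    that jumps to a label inside its loop, so any initialization placed
    above the label would be skipped on that path. The function and its
    parameters are hypothetical; only the variable name pagedlen and the
    label idea come from the description above.

```c
#include <assert.h>

/* Sum the bytes of each chunk that exceed linear_max ("paged" bytes).
 * The first iteration enters through the label, not the loop header. */
static unsigned toy_paged_total(const unsigned *chunks, int n,
                                unsigned linear_max)
{
    unsigned total = 0;
    unsigned pagedlen;
    int i = 0;

    if (n <= 0)
        return 0;
    goto alloc_new;             /* skips everything above the label */

    while (i < n) {
alloc_new:
        pagedlen = 0;           /* set below the label: seen on every path */
        if (chunks[i] > linear_max)
            pagedlen = chunks[i] - linear_max;
        total += pagedlen;
        i++;
    }
    return total;
}
```

    Had "pagedlen = 0" been placed at the top of the loop body, above the
    label, the first pass through "goto alloc_new" would have read it
    uninitialized whenever the chunk did not exceed linear_max.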

    Fixes: 15e36f5b8e98 ("udp: paged allocation with gso")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

19 Sep, 2018

1 commit