17 Jan, 2021

1 commit

  • [ Upstream commit bb4cc1a18856a73f0ff5137df0c2a31f4c50f6cf ]

    Conntrack reassembly records the largest fragment size seen in IPCB.
    However, when this gets forwarded/transmitted, fragmentation will only
    be forced if one of the fragmented packets had the DF bit set.

    In that case, a flag in IPCB will force fragmentation even if the
    MTU is large enough.

    This should work fine, but this breaks with ip tunnels.
    Consider a client that sends a UDP datagram of size X to another host.

    The client fragments the datagram, so two packets, of size y and z, are
    sent. DF bit is not set on any of these packets.

    Middlebox netfilter reassembles those packets back to single size-X
    packet, before routing decision.

    packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
    isn't set. At output time, ip refragmentation is skipped as well
    because X is still smaller than the mtu of the output device.

    If the transmit device is an ip tunnel, the packet size increases to
    X+overhead.

    Also, tunnel might be configured to force DF bit on outer header.

    In this case, packet will be dropped (exceeds MTU) and an ICMP error is
    generated back to sender.

    But the sender already respects the announced MTU: all the packets
    that it sent fit within it.

    Force refragmentation as per original sizes unconditionally so ip tunnel
    will encapsulate the fragments instead.

    The only other solution I see is to place ip refragmentation in
    the ip_tunnel code to handle this case.
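
    The decision described above can be modelled in user space. This is a
    minimal sketch, not the kernel code: struct pkt_model and the three
    helper names are hypothetical, and the fields only stand in for the
    state the message says is recorded in IPCB (largest fragment size seen,
    and whether any fragment carried DF).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-packet state the message describes;
 * field names are illustrative, not the kernel's layout. */
struct pkt_model {
	size_t len;           /* size of the reassembled packet (X) */
	size_t frag_max_size; /* largest fragment seen at reassembly, 0 if none */
	bool   df_seen;       /* true if any original fragment had DF set */
};

/* Behaviour described as the old one: refragment only when the packet
 * exceeds the MTU or a fragment carried DF. */
static bool refrag_before(const struct pkt_model *p, size_t mtu)
{
	return p->len > mtu || p->df_seen;
}

/* Behaviour described as the fix: refragment whenever the packet was
 * reassembled, regardless of DF. */
static bool refrag_after(const struct pkt_model *p, size_t mtu)
{
	return p->len > mtu || p->frag_max_size != 0;
}

/* Size limit used when splitting: never exceed the original fragment size. */
static size_t refrag_limit(const struct pkt_model *p, size_t mtu)
{
	if (p->frag_max_size != 0 && p->frag_max_size < mtu)
		return p->frag_max_size;
	return mtu;
}
```

    For a 1400-byte reassembled packet whose largest fragment was 800 bytes
    and an output MTU of 1500, refrag_before() is false while refrag_after()
    is true with a limit of 800, so a tunnel would encapsulate the original
    sized fragments rather than the full packet.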

    Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet")
    Reported-by: Christian Perle
    Signed-off-by: Florian Westphal
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     

16 Oct, 2020

1 commit

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing to limit
    stack traversal in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multi-path TCP (MPTCP) support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull copy_and_csum cleanups from Al Viro:
    "Saner calling conventions for csum_and_copy_..._user() and friends"

    [ Removing 800+ lines of code and cleaning stuff up is good - Linus ]

    * 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ppc: propagate the calling conventions change down to csum_partial_copy_generic()
    amd64: switch csum_partial_copy_generic() to new calling conventions
    sparc64: propagate the calling convention changes down to __csum_partial_copy_...()
    xtensa: propagate the calling conventions change down into csum_partial_copy_generic()
    mips: propagate the calling convention change down into __csum_partial_copy_..._user()
    mips: __csum_partial_copy_kernel() has no users left
    mips: csum_and_copy_{to,from}_user() are never called under KERNEL_DS
    sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()
    i386: propagate the calling conventions change down to csum_partial_copy_generic()
    sh: propage the calling conventions change down to csum_partial_copy_generic()
    m68k: get rid of zeroing destination on error in csum_and_copy_from_user()
    arm: propagate the calling convention changes down to csum_partial_copy_from_user()
    alpha: propagate the calling convention changes down to csum_partial_copy.c helpers
    saner calling conventions for csum_and_copy_..._user()
    csum_and_copy_..._user(): pass 0xffffffff instead of 0 as initial sum
    csum_partial_copy_nocheck(): drop the last argument
    unify generic instances of csum_partial_copy_nocheck()
    icmp_push_reply(): reorder adding the checksum up
    skb_copy_and_csum_bits(): don't bother with the last argument

    Linus Torvalds
     

23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing its
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

11 Sep, 2020

1 commit

  • This commit adds tos as a new passed in parameter to
    ip_build_and_send_pkt() which will be used in the later commit.
    This is a pure restructure and does not have any functional change.

    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

09 Sep, 2020

1 commit

  • Currently, in tcp_v4_reqsk_send_ack() and tcp_v4_send_reset(), we
    echo the TOS value of the received packets in the response.
    However, we do not want to echo the lower 2 ECN bits in accordance
    with RFC 3168 6.1.5 robustness principles.
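
    As a rough illustration of the masking described above, the sketch
    below clears the two ECN bits of a TOS byte before it is echoed. The
    names ECN_MASK and reply_tos() are hypothetical and chosen for this
    example; only the bit layout (ECN in the low two bits) comes from the
    text.

```c
#include <stdint.h>

/* The low two bits of the TOS byte carry ECN (RFC 3168). */
#define ECN_MASK 0x03u

/* Return the TOS value to echo in a reply: DSCP bits kept, ECN bits cleared. */
static uint8_t reply_tos(uint8_t rx_tos)
{
	return (uint8_t)(rx_tos & ~ECN_MASK);
}
```

    With this helper, a received TOS of 0xBB would be echoed as 0xB8.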

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

01 Sep, 2020

2 commits


25 Aug, 2020

1 commit


21 Aug, 2020

2 commits


11 Jul, 2020

1 commit


02 Jul, 2020

1 commit

  • When no full socket is available, skbs are sent over a per-netns
    control socket. Its sk_mark is temporarily adjusted to match that
    of the real (request or timewait) socket or to reflect an incoming
    skb, so that the outgoing skb inherits this in __ip_make_skb.

    Introduction of the socket cookie mark field broke this. Now the
    skb is set through the cookie and cork:

    # init sockc.mark from sk_mark or cmsg
    ip_append_data
    ip_setup_cork # convert sockc.mark to cork mark
    ip_push_pending_frames
    ip_finish_skb
    __ip_make_skb # set skb->mark to cork mark

    But I missed these special control sockets. Update all callers of
    __ip(6)_make_skb that were originally missed.

    For IPv6, the same two icmp(v6) paths are affected. The third
    case is not, as commit 92e55f412cff ("tcp: don't annotate
    mark on control socket from tcp_v6_send_response()") replaced
    the ctl_sk->sk_mark with passing the mark field directly as a
    function argument. That commit predates the commit that
    introduced the bug.

    Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg")
    Signed-off-by: Willem de Bruijn
    Reported-by: Martin KaFai Lau
    Reviewed-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

21 Jun, 2020

1 commit


30 Mar, 2020

1 commit

  • The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the
    offset of the GSO in skb cb. This patch fixes the typo.

    Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
    Signed-off-by: Cambda Zhu
    Signed-off-by: David S. Miller

    Cambda Zhu
     

13 Mar, 2020

1 commit

  • Convert the various uses of fallthrough comments to fallthrough;

    Done via script
    Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

    And by hand:

    net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block
    that causes gcc to emit a warning if converted in-place.

    So move the new fallthrough; inside the containing #ifdef/#endif too.
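
    For readers unfamiliar with the pseudo-keyword, here is a small
    user-space sketch of how a fallthrough; statement can replace a
    comment. The macro definition is an assumption modelled on the
    attribute-based approach; count_from() is a hypothetical function used
    only to show the statement in a switch.

```c
/* Use the compiler attribute when available, else fall back to a no-op. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0) /* fallthrough */
#endif

/* Count how many case bodies run when entering the switch at 'start'. */
static int count_from(int start)
{
	int n = 0;

	switch (start) {
	case 0:
		n++;
		fallthrough;	/* previously a fallthrough comment */
	case 1:
		n++;
		fallthrough;
	default:
		n++;
	}
	return n;
}
```

    Because the statement is real code rather than a comment, it has to
    sit inside the same #ifdef/#endif as the case it belongs to, which is
    the situation the hand-edited file ran into.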

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

15 Jan, 2020

1 commit


08 Dec, 2019

1 commit

  • syzbot was once again able to crash a host by setting a very small mtu
    on loopback device.

    Let's make inetdev_valid_mtu() available in include/net/ip.h,
    and use it in ip_setup_cork(), so that we protect both ip_append_page()
    and __ip_append_data()

    Also add a READ_ONCE() when the device mtu is read.

    Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
    even if other code paths might write over this field.

    Add a big comment in include/linux/netdevice.h about dev->mtu
    needing READ_ONCE()/WRITE_ONCE() annotations.

    Hopefully we will add the missing ones in followup patches.
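
    The check can be pictured with the user-space model below. It is a
    sketch under assumptions: the 68-byte minimum is the IPv4 minimum MTU,
    the names MODEL_IPV4_MIN_MTU, mtu_is_valid() and setup_cork_model() are
    hypothetical, the -1 error value is illustrative, and the
    READ_ONCE()/WRITE_ONCE() pairing is not modelled.

```c
#include <stdbool.h>

/* Assumed minimum IPv4 MTU, in bytes. */
#define MODEL_IPV4_MIN_MTU 68u

/* Model of an MTU validity test: reject anything below the IPv4 minimum. */
static bool mtu_is_valid(unsigned int mtu)
{
	return mtu >= MODEL_IPV4_MIN_MTU;
}

/* Model of a cork setup step: refuse to proceed with a too-small MTU so
 * that later append paths never see it. Returns 0 on success, -1 on error. */
static int setup_cork_model(unsigned int dev_mtu, unsigned int *fragsize)
{
	if (!mtu_is_valid(dev_mtu))
		return -1;
	*fragsize = dev_mtu;
	return 0;
}
```

    Placing the test in the shared setup step is what lets a single check
    cover both append paths named above.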

    [1]

    refcount_t: saturated; leaking memory.
    WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    panic+0x2e3/0x75c kernel/panic.c:221
    __warn.cold+0x2f/0x3e kernel/panic.c:582
    report_bug+0x289/0x300 lib/bug.c:195
    fixup_bug arch/x86/kernel/traps.c:174 [inline]
    fixup_bug arch/x86/kernel/traps.c:169 [inline]
    do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
    do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
    invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
    RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
    Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
    RSP: 0018:ffff88809689f550 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
    RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
    R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
    R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
    refcount_add include/linux/refcount.h:193 [inline]
    skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
    sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
    ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
    udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
    inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
    kernel_sendpage+0x92/0xf0 net/socket.c:3794
    sock_sendpage+0x8b/0xc0 net/socket.c:936
    pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
    splice_from_pipe_feed fs/splice.c:512 [inline]
    __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
    splice_from_pipe+0x108/0x170 fs/splice.c:671
    generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
    do_splice_from fs/splice.c:861 [inline]
    direct_splice_actor+0x123/0x190 fs/splice.c:1035
    splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
    do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
    do_sendfile+0x597/0xd00 fs/read_write.c:1464
    __do_sys_sendfile64 fs/read_write.c:1525 [inline]
    __se_sys_sendfile64 fs/read_write.c:1511 [inline]
    __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x441409
    Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
    RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
    RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
    R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
    R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
    Kernel Offset: disabled
    Rebooting in 86400 seconds..

    Fixes: 1470ddf7f8ce ("inet: Remove explicit write references to sk/inet in ip_append_data")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Nov, 2019

1 commit

  • Instead of generally passing NULL to NF_HOOK_COND() for input device,
    pass skb->dev which contains input device for routed skbs.

    Note that iptables (both legacy and nft) reject rules with input
    interface match from being added to POSTROUTING chains, but nftables
    allows this.

    Cc: Eric Garver
    Signed-off-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Phil Sutter
     

22 Oct, 2019

1 commit

  • This patch removes the iph field from the state structure, which is not
    properly initialized. Instead, add a new field to make the "do we want
    to set DF" be the state bit and move the code to set the DF flag from
    ip_frag_next().

    Joint work with Pablo and Linus.

    Fixes: 19c3401a917b ("net: ipv4: place control buffer handling away from fragmentation iterators")
    Reported-by: Patrick Schönthaler
    Signed-off-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Linus Torvalds
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2019

1 commit

  • Thomas found that some forwarded packets would be stuck
    in FQ packet scheduler because their skb->tstamp contained
    timestamps far in the future.

    We thought we addressed this point in commit 8203e2d844d3
    ("net: clear skb->tstamp in forwarding paths") but there
    is still an issue when/if a packet needs to be fragmented.

    In order to meet EDT requirements, we have to make sure all
    fragments get the original skb->tstamp.

    Note that this original skb->tstamp should be zero in
    forwarding path, but might have a non zero value in
    output path if user decided so.
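
    A simplified model of the requirement, assuming a hypothetical
    struct frag_skb with just a timestamp and a length: every fragment
    produced from a packet copies the original timestamp. Header sizes and
    the 8-byte fragment offset alignment are deliberately ignored here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for a packet buffer. */
struct frag_skb {
	uint64_t tstamp; /* departure time requested for this buffer */
	size_t   len;    /* payload length in bytes */
};

/* Split 'orig' into pieces of at most 'max_payload' bytes, storing up to
 * 'cap' fragments in 'out'. Each fragment inherits the original tstamp.
 * Returns the number of fragments written. */
static size_t fragment_model(const struct frag_skb *orig, size_t max_payload,
			     struct frag_skb *out, size_t cap)
{
	size_t left = orig->len;
	size_t n = 0;

	if (max_payload == 0)
		return 0;

	while (left > 0 && n < cap) {
		size_t chunk = left < max_payload ? left : max_payload;

		out[n].len = chunk;
		out[n].tstamp = orig->tstamp; /* same tstamp on every fragment */
		left -= chunk;
		n++;
	}
	return n;
}
```

    Splitting a 3000-byte buffer with a 1480-byte limit yields three
    fragments (1480, 1480 and 40 bytes), all carrying the timestamp of the
    original.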

    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Eric Dumazet
    Reported-by: Thomas Bartschies
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Sep, 2019

1 commit

  • ctl packets sent on behalf of TIME_WAIT sockets currently
    have a zero skb->priority, which can cause various problems.

    In this patch we :

    - add a tw_priority field in struct inet_timewait_sock.

    - populate it from sk->sk_priority when a TIME_WAIT is created.

    - For IPv4, change ip_send_unicast_reply() and its two
    callers to propagate tw_priority correctly.
    ip_send_unicast_reply() no longer changes sk->sk_priority.

    - For IPv6, make sure TIME_WAIT sockets pass their tw_priority
    field to tcp_v6_send_response() and tcp_v6_send_ack().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2019

1 commit

  • Julian noted that rt_uses_gateway has a more subtle use than 'is gateway
    set':
    https://lore.kernel.org/netdev/alpine.LFD.2.21.1909151104060.2546@ja.home.ssi.bg/

    Revert that part of the commit referenced in the Fixes tag.

    Currently, there are no u8 holes in 'struct rtable'. There is a 4-byte hole
    in the second cacheline which contains the gateway declaration. So move
    rt_gw_family down to the gateway declarations since they are always used
    together, and then re-use that u8 for rt_uses_gateway. End result is that
    rtable size is unchanged.

    Fixes: 1550c171935d ("ipv4: Prepare rtable for IPv6 gateway")
    Reported-by: Julian Anastasov
    Signed-off-by: David Ahern
    Reviewed-by: Julian Anastasov
    Signed-off-by: Jakub Kicinski

    David Ahern
     

14 Sep, 2019

1 commit

  • Enable setting skb->mark for UDP and RAW sockets using cmsg.

    This is analogous to existing support for TOS, TTL, txtime, etc.

    Packet sockets already support this as of commit c7d39e32632e
    ("packet: support per-packet fwmark for af_packet sendmsg").

    Similar to other fields, implement by
    1. initialize the sockcm_cookie.mark from socket option sk_mark
    2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
    3. initialize inet_cork.mark from sockcm_cookie.mark
    4. initialize each (usually just one) skb->mark from inet_cork.mark

    Step 1 is handled in one location for most protocols by ipcm_init_sk
    as of commit 351782067b6b ("ipv4: ipcm_cookie initializers").
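
    The four steps can be sketched as a chain of copies. All struct and
    function names below are hypothetical and reduced to the single mark
    field; the sketch only shows the order in which the value is handed
    on, not the real send path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-field stand-ins for the objects named in the steps. */
struct sock_model   { uint32_t sk_mark; };
struct cookie_model { uint32_t mark; };
struct cork_model   { uint32_t mark; };
struct mark_skb     { uint32_t mark; };

/* Follow one send: returns the mark that ends up on the outgoing buffer. */
static uint32_t send_one(const struct sock_model *sk, bool has_cmsg,
			 uint32_t cmsg_mark)
{
	/* 1. initialize the cookie mark from the socket option */
	struct cookie_model sockc = { .mark = sk->sk_mark };

	/* 2. optionally overwrite it from the control message */
	if (has_cmsg)
		sockc.mark = cmsg_mark;

	/* 3. initialize the cork mark from the cookie */
	struct cork_model cork = { .mark = sockc.mark };

	/* 4. initialize the buffer mark from the cork */
	struct mark_skb skb = { .mark = cork.mark };

	return skb.mark;
}
```

    Without a control message the socket's own mark is used; with one,
    the per-call value wins for that send only and the socket is left
    unchanged.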

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

28 Jun, 2019

1 commit

  • The new route handling in ip_mc_finish_output() from 'net' overlapped
    with the new support for returning congestion notifications from BPF
    programs.

    In order to handle this I had to take the dev_loopback_xmit() calls
    out of the switch statement.

    The aquantia driver conflicts were simple overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Jun, 2019

1 commit

  • Multicast or broadcast egress packets have rt_iif set to the oif. These
    packets might be recirculated back as input and lookup to the raw
    sockets may fail because they are bound to the incoming interface
    (skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function
    returns rt_iif instead of skb_iif. Hence, the lookup fails.
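
    The selection rule behind the failing lookup can be modelled as
    follows. This is a sketch with hypothetical names (struct iif_model,
    iif_for_lookup()); it only shows which interface index is chosen, not
    what the patch changes.

```c
/* Hypothetical reduced view of the two interface indexes involved. */
struct iif_model {
	int rt_iif;  /* index stored in the route, 0 if unset */
	int skb_iif; /* index of the interface the packet arrived on */
};

/* Prefer the route's index when it is non-zero, else the packet's own. */
static int iif_for_lookup(const struct iif_model *m)
{
	if (m->rt_iif != 0)
		return m->rt_iif;
	return m->skb_iif;
}
```

    With rt_iif holding the output interface (say 3) and the packet
    recirculated on interface 5, the rule yields 3, so a raw socket bound
    to interface 5 is not matched; with rt_iif at zero the rule yields 5.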

    v2: Make it non vrf specific (David Ahern). Reword the changelog to
    reflect it.
    Signed-off-by: Stephen Suryaputra
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Stephen Suryaputra
     

18 Jun, 2019

1 commit


15 Jun, 2019

1 commit

  • If we want to set an EDT time for the skb we want to send
    via ip_send_unicast_reply(), we have to pass a new parameter
    and initialize ipc.sockc.transmit_time with it.

    This fixes the EDT time for ACK/RST packets sent on behalf of
    a TIME_WAIT socket.

    Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jun, 2019

1 commit

  • The below patch fixes an incorrect zerocopy refcnt increment when
    appending with MSG_MORE to an existing zerocopy udp skb.

    send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt 1
    send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt still 1 (bar frags)

    But it missed that zerocopy need not be passed at the first send. The
    right test for whether the uarg is newly allocated, and thus holds the
    extra refcnt of 1, is not !skb, but !skb_zcopy.

    send(.., MSG_MORE); //
    send(.., MSG_ZEROCOPY); // refcnt 1

    Fixes: 100f6d8e09905 ("net: correct zerocopy refcnt with udp MSG_MORE")
    Reported-by: syzbot
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

04 Jun, 2019

1 commit

  • syzbot reported nasty use-after-free [1]

    Let's remove the frag_list field from structs ip_fraglist_iter
    and ip6_fraglist_iter. It does not seem to be needed anyway.

    [1] :
    BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
    Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

    CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
    __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    kasan_report+0x12/0x20 mm/kasan/common.c:614
    __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
    kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
    ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
    __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
    ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
    dst_output include/net/dst.h:433 [inline]
    ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
    ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
    ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
    rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
    rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x44add9
    Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
    RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
    RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
    R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

    Allocated by task 8947:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_kmalloc mm/kasan/common.c:489 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
    slab_post_alloc_hook mm/slab.h:437 [inline]
    slab_alloc_node mm/slab.c:3269 [inline]
    kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
    __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
    alloc_skb include/linux/skbuff.h:1058 [inline]
    __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
    ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
    rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 8947:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
    __cache_free mm/slab.c:3432 [inline]
    kmem_cache_free+0x86/0x260 mm/slab.c:3698
    kfree_skbmem net/core/skbuff.c:625 [inline]
    kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
    __kfree_skb net/core/skbuff.c:682 [inline]
    kfree_skb net/core/skbuff.c:699 [inline]
    kfree_skb+0xf0/0x390 net/core/skbuff.c:693
    kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
    __dev_xmit_skb net/core/dev.c:3551 [inline]
    __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
    dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
    neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
    ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
    __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
    ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
    dst_output include/net/dst.h:433 [inline]
    ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
    ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
    ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
    rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
    rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff888085a3cbc0
    which belongs to the cache skbuff_head_cache of size 224
    The buggy address is located 0 bytes inside of
    224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
    The buggy address belongs to the page:
    page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
    flags: 0x1fffc0000000200(slab)
    raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
    raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
    >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ^
    ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

    Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
    Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
    Signed-off-by: Eric Dumazet
    Cc: Pablo Neira Ayuso
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jun, 2019

3 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2019-05-31

    The following pull-request contains BPF updates for your *net-next* tree.

    Lots of exciting new features in the first PR of this development cycle!
    The main changes are:

    1) misc verifier improvements, from Alexei.

    2) bpftool can now convert btf to valid C, from Andrii.

    3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
    This feature greatly improves BPF speed on 32-bit architectures. From Jiong.

    4) cgroups will now auto-detach bpf programs. This fixes the issue of
    thousands of bpf programs getting stuck in dying cgroups. From Roman.

    5) new bpf_send_signal() helper, from Yonghong.

    6) cgroup inet skb programs can signal CN to the stack, from Lawrence.

    7) miscellaneous cleanups, from many developers.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
    congestion notifications from the BPF programs.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    brakmo
     
  • The phylink conflict was between a bug fix by Russell King
    to make sure we have a consistent PHY interface mode, and
    a change in net-next to pull some code in phylink_resolve()
    into the helper functions phylink_mac_link_{up,down}()

    On the dp83867 side it's mostly overlapping changes, with
    the 'net' side removing a condition that was supposed to
    trigger for RGMII but because of how it was coded never
    actually could trigger.

    Signed-off-by: David S. Miller

    David S. Miller
     

31 May, 2019

4 commits

  • TCP zerocopy takes a uarg reference for every skb, plus one for the
    tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
    as it builds, sends and frees skbs inside its inner loop.

    UDP and RAW zerocopy do not send inside the inner loop so do not need
    the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
    52900d22288e7 ("udp: elide zerocopy operation in hot path") introduced
    extra_uref to pass the initial reference taken in sock_zerocopy_alloc
    to the first generated skb.

    But, sock_zerocopy_realloc takes this extra reference at the start of
    every call. With MSG_MORE, no new skb may be generated to attach the
    extra_uref to, so refcnt is incorrectly 2 with only one skb.

    Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
    Update extra_uref accordingly.

    This conditional assignment triggers a false-positive "may be used
    uninitialized" warning, so extra_uref has to be initialized at its
    definition.

    Changes v1->v2: fix typo in Fixes SHA1

    Fixes: 52900d22288e7 ("udp: elide zerocopy operation in hot path")
    Reported-by: syzbot
    Diagnosed-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Deal with the IPCB() area away from the iterators.

    The bridge codebase has its own control buffer layout, move specific
    IP control buffer handling into the IPv4 codepath.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • This patch exposes a new API to refragment a skbuff. This allows you to
    either split a linear skbuff or force the refragmentation of an
    existing fraglist using a different mtu. The API consists of:

    * ip_frag_init(), that initializes the internal state of the transformer.
    * ip_frag_next(), that allows you to fetch the next fragment. This function
    internally allocates the skbuff that represents the fragment, it pushes
    the IPv4 header, and it also copies the payload for each fragment.

    The ip_frag_state object stores the internal state of the splitter.

    This code has been extracted from ip_do_fragment(). Symbols are also
    exported so that the bridge codepath can reuse this iterator to
    build its own refragmentation routine on top of the existing codebase.
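
    The init/next pattern can be sketched in userspace as follows. The
    toy_* names and structures are hypothetical and only compute the
    offset, length and more-fragments flag of each piece; the real
    functions additionally allocate the skbuff, push the IPv4 header and
    copy the payload. The sketch assumes the usual IPv4 rule that every
    fragment except the last carries a multiple of 8 payload bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical splitter state, loosely modelled on the description above. */
struct toy_frag_state {
    size_t left;      /* payload bytes not yet emitted */
    size_t offset;    /* byte offset of the next fragment's payload */
    size_t max_data;  /* payload capacity per fragment (multiple of 8) */
};

/* Description of one emitted fragment. */
struct toy_frag {
    size_t offset;
    size_t len;
    bool more_fragments;
};

/* Set up the state for a payload that must fit packets of size 'mtu'. */
static void toy_frag_init(struct toy_frag_state *st, size_t payload_len,
                          size_t hdr_len, size_t mtu)
{
    assert(st != NULL && mtu >= hdr_len + 8);
    st->left = payload_len;
    st->offset = 0;
    st->max_data = (mtu - hdr_len) & ~(size_t)7;  /* round down to 8 */
}

/* Fill 'out' with the next fragment; returns false when nothing is left. */
static bool toy_frag_next(struct toy_frag_state *st, struct toy_frag *out)
{
    size_t len;

    assert(st != NULL && out != NULL);
    if (st->left == 0)
        return false;

    len = st->left < st->max_data ? st->left : st->max_data;
    out->offset = st->offset;
    out->len = len;
    out->more_fragments = st->left > len;

    st->offset += len;
    st->left -= len;
    return true;
}
```

    A caller would run toy_frag_init() once and then call toy_frag_next()
    in a loop until it returns false, sending each fragment as it is
    produced.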

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     
  • This patch adds the skbuff fraglist splitter. This API provides an
    iterator to transform the fraglist into single skbuff objects, it
    consists of:

    * ip_fraglist_init(), that initializes the internal state of the
    fraglist splitter.
    * ip_fraglist_prepare(), that restores the IPv4 header on the
    fragments.
    * ip_fraglist_next(), that retrieves the fragment from the fraglist and
    it updates the internal state of the splitter to point to the next
    fragment skbuff in the fraglist.

    The ip_fraglist_iter object stores the internal state of the iterator.

    This code has been extracted from ip_do_fragment(). Symbols are also
    exported so that the bridge codepath can reuse this iterator to
    build its own refragmentation routine on top of the existing codebase.
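
    The three-step iterator can be sketched in userspace like this. All
    toy_* names are hypothetical, the "header" is reduced to a single id
    field, and the walk function only counts buffers where a real caller
    would transmit them.

```c
#include <assert.h>
#include <stddef.h>

/* Toy buffer: a head followed by a chain of fragments. */
struct toy_buf {
    struct toy_buf *next;   /* next buffer in the chain */
    unsigned int hdr_id;    /* stand-in for the header; 0 means none */
    size_t len;             /* payload bytes in this buffer */
    size_t offset;          /* payload offset within the datagram */
};

/* Hypothetical iterator state. */
struct toy_fraglist_iter {
    struct toy_buf *frag;   /* next fragment to hand out, or NULL */
    unsigned int hdr_id;    /* header remembered from the head buffer */
    size_t offset;          /* offset to assign to the next fragment */
};

/* Detach the chain from the head and remember the head's header. */
static void toy_fraglist_init(struct toy_buf *head,
                              struct toy_fraglist_iter *iter)
{
    assert(head != NULL && iter != NULL);
    iter->frag = head->next;
    iter->hdr_id = head->hdr_id;
    iter->offset = head->len;
    head->next = NULL;
    head->offset = 0;
}

/* Restore the header on the pending fragment and fix up its offset. */
static void toy_fraglist_prepare(struct toy_fraglist_iter *iter)
{
    assert(iter != NULL && iter->frag != NULL);
    iter->frag->hdr_id = iter->hdr_id;
    iter->frag->offset = iter->offset;
    iter->offset += iter->frag->len;
}

/* Hand out the pending fragment and advance to the one after it. */
static struct toy_buf *toy_fraglist_next(struct toy_fraglist_iter *iter)
{
    struct toy_buf *cur;

    assert(iter != NULL && iter->frag != NULL);
    cur = iter->frag;
    iter->frag = cur->next;
    cur->next = NULL;
    return cur;
}

/* Walk the whole chain; returns total payload bytes, stores buffer count. */
static size_t toy_fraglist_walk(struct toy_buf *head, int *count)
{
    struct toy_fraglist_iter iter;
    struct toy_buf *cur = head;
    size_t total = 0;

    assert(count != NULL);
    *count = 0;
    toy_fraglist_init(head, &iter);
    for (;;) {
        if (iter.frag)
            toy_fraglist_prepare(&iter);
        total += cur->len;      /* a real caller would transmit cur here */
        (*count)++;
        if (!iter.frag)
            break;
        cur = toy_fraglist_next(&iter);
    }
    return total;
}
```

    Preparing the next fragment before the current one is handed off
    keeps the walk safe even if the consumer frees each buffer.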

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

03 May, 2019

1 commit


02 May, 2019

1 commit

  • Previously, during fragmentation after forwarding, skb->skb_iif isn't
    preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given
    'from' skb.

    As a result, ip_do_fragment creates fragments with zero skb_iif,
    leading to inconsistent behavior.

    Assume for example an eBPF program attached at tc egress (post
    forwarding) that examines __sk_buff->ingress_ifindex:
    - the correct iif is observed if forwarding path does not involve
    fragmentation/refragmentation
    - a bogus iif is observed if forwarding path involves
    fragmentation/refragmentation

    Fix, by preserving skb_iif during 'ip_copy_metadata'.
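
    The effect can be sketched with a toy metadata copy. struct toy_skb,
    toy_copy_metadata() and the 'preserve_iif' switch are hypothetical
    and carry only a few of the many fields a real skb has.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the handful of skb fields relevant here. */
struct toy_skb {
    int skb_iif;            /* ingress interface index */
    unsigned int mark;
    unsigned int priority;
};

/*
 * Copy per-packet metadata from the original packet to a new fragment.
 * With preserve_iif == false the ingress index is left untouched,
 * which models the behavior before the fix.
 */
static void toy_copy_metadata(struct toy_skb *to, const struct toy_skb *from,
                              bool preserve_iif)
{
    assert(to != NULL && from != NULL);
    to->mark = from->mark;
    to->priority = from->priority;
    if (preserve_iif)
        to->skb_iif = from->skb_iif;
}
```

    A freshly zeroed fragment keeps skb_iif == 0 in the old model, which
    is the bogus value an egress program would then observe.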

    Signed-off-by: Shmulik Ladkani
    Signed-off-by: David S. Miller

    Shmulik Ladkani