18 Feb, 2017

1 commit

  • [ Upstream commit 57031eb794906eea4e1c7b31dc1e2429c0af0c66 ]

    Link layer protocols may unconditionally pull headers, as Ethernet
    does in eth_type_trans. Ensure that the entire link layer header
    always lies in the skb linear segment. tpacket_snd has such a check.
    Extend this to packet_snd.

    Variable length link layer headers complicate the computation
    somewhat. Here skb->len may be smaller than dev->hard_header_len.

    Round up the linear length to be at least as long as the smallest of
    the two.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
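
The rounding rule described in this commit can be illustrated with a small model. This is a sketch only: the function name `linear_len` and its parameters are hypothetical, and the kernel performs the equivalent computation in C inside packet_snd.

```python
def linear_len(requested_linear, packet_len, hard_header_len):
    """Return a linear length that covers the link layer header.

    Hypothetical model: with variable length link layer headers the
    packet may be shorter than hard_header_len, so the header bytes
    actually present are the smaller of the two values.
    """
    header_bytes = min(packet_len, hard_header_len)
    # Never shrink what the caller asked for; only round up.
    return max(requested_linear, header_bytes)
```

For a 1500 byte frame on a device with a 14 byte header, `linear_len(0, 1500, 14)` yields 14; for a 10 byte packet on the same device it yields 10.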
     

04 Feb, 2017

1 commit

  • [ Upstream commit 6391a4481ba0796805d6581e42f9f0418c099e34 ]

    Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
    xmit") in fact disables VIRTIO_NET_HDR_F_DATA_VALID on the receive path
    too. Fix this by adding a hint (has_data_valid) and setting it only on
    the receive path.

    Cc: Rolf Neugebauer
    Signed-off-by: Jason Wang
    Acked-by: Rolf Neugebauer
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
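
The effect of the has_data_valid hint can be sketched as a pure function over the checksum state. This is a simplified model, not the kernel helper: the function name `vnet_hdr_flags` is hypothetical, and the numeric constants are assumed to mirror the Linux definitions.

```python
# Assumed to mirror the Linux values for these flags and checksum states.
VIRTIO_NET_HDR_F_NEEDS_CSUM = 1
VIRTIO_NET_HDR_F_DATA_VALID = 2
CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_COMPLETE, CHECKSUM_PARTIAL = range(4)


def vnet_hdr_flags(ip_summed, has_data_valid):
    """Pick virtio_net_hdr flags for a packet (hypothetical model)."""
    if ip_summed == CHECKSUM_PARTIAL:
        # Checksum still has to be computed by the consumer.
        return VIRTIO_NET_HDR_F_NEEDS_CSUM
    if has_data_valid and ip_summed == CHECKSUM_UNNECESSARY:
        # Only callers on the receive path pass has_data_valid=True.
        return VIRTIO_NET_HDR_F_DATA_VALID
    return 0
```

With the hint set to False, as on the transmit path, an already validated packet reports no flags; with the hint set to True it reports DATA_VALID.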
     

03 Dec, 2016

1 commit

  • When packet_set_ring creates a ring buffer it will initialize a
    struct timer_list if the packet version is TPACKET_V3. This value
    can then be raced by a different thread calling setsockopt to
    set the version to TPACKET_V1 before packet_set_ring has finished.

    This leads to a use-after-free on a function pointer in the
    struct timer_list when the socket is closed as the previously
    initialized timer will not be deleted.

    The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
    changing the packet version while also taking the lock at the start
    of packet_set_ring.

    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Signed-off-by: Philip Pettersson
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Philip Pettersson
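
The locking pattern in this fix can be modelled in user space. The sketch below is a toy, assuming a `threading.Lock` as a stand-in for lock_sock(sk); the class `PacketSockModel` and its methods are hypothetical, and the refusal to change version once a ring exists is an assumption of the model.

```python
import threading

TPACKET_V1, TPACKET_V2, TPACKET_V3 = 0, 1, 2


class PacketSockModel:
    """Toy model of the state touched by setsockopt and ring setup."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for lock_sock(sk)
        self.version = TPACKET_V1
        self.ring = None
        self.timer_initialized = False

    def set_version(self, version):
        # Version changes take the same lock as ring setup.
        with self._lock:
            if self.ring is not None:
                return False  # model assumption: busy once a ring exists
            self.version = version
            return True

    def set_ring(self, nblocks):
        # The lock is held from the start, so the version cannot change
        # between timer initialization and the end of ring setup.
        with self._lock:
            self.timer_initialized = self.version == TPACKET_V3
            self.ring = [bytearray(0) for _ in range(nblocks)]

    def close(self):
        # Returns True when a timer had to be deleted on close.
        with self._lock:
            must_delete = self.ring is not None and self.version == TPACKET_V3
            assert must_delete == self.timer_initialized
            self.ring = None
            self.timer_initialized = False
            return must_delete
```

Because both paths serialize on one lock, the version observed at close always matches the version that decided whether the timer was initialized.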
     

30 Oct, 2016

1 commit

  • When transmitting on a packet socket with PACKET_VNET_HDR and
    PACKET_QDISC_BYPASS, validate device support for features requested
    in vnet_hdr.

    Drop TSO packets sent to devices that do not support TSO or have the
    feature disabled. Note that the latter currently do process those
    packets correctly, regardless of not advertising the feature.

    Because of SKB_GSO_DODGY, it is not sufficient to test device features
    with netif_needs_gso. Full validate_xmit_skb is needed.

    Switch to software checksum for non-TSO packets that request checksum
    offload if that device feature is unsupported or disabled. Note that
    similar to the TSO case, device drivers may perform checksum offload
    correctly even when not advertising it.

    When switching to software checksum, packets hit skb_checksum_help,
    which has two BUG_ON checks that trigger if the checksum is not in the
    linear segment. Packet sockets always allocate at least up to
    csum_start + csum_off + 2 as linear.

    Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

    ethtool -K eth0 tso off tx on
    psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
    psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

    ethtool -K eth0 tx off
    psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
    psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

    v2:
    - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)

    Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option")
    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

07 Oct, 2016

1 commit

  • If a socket has the FANOUT sockopt set, a new prot_hook is registered
    as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
    af_packet, __fanout_unlink is called for all sockets, but the prot_hook
    that was registered as part of fanout_add is not removed. Call
    fanout_release on NETDEV_UNREGISTER, which removes prot_hook and removes
    the fanout from the fanout_list.

    This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()

    Signed-off-by: Anoob Soman
    Signed-off-by: David S. Miller

    Anoob Soman
     

24 Jul, 2016

1 commit


22 Jul, 2016

1 commit


20 Jul, 2016

1 commit

  • This patch fixes an issue where transmit timestamping does not work
    for syscalls such as sendto. Since the sendto syscall doesn't have a
    msg_control buffer, sock_tx_timestamp() in packet_snd() cannot work
    correctly because sockc.tsflags is set to 0.
    So, this patch sets sockc.tsflags to sk->sk_tsflags as the default.

    Fixes: c14ac9451c34 ("sock: enable timestamping using control messages")
    Reported-by: Kazuya Mizuguchi
    Reported-by: Keita Kobayashi
    Signed-off-by: Yoshihiro Shimoda
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Yoshihiro Shimoda
     

07 Jul, 2016

1 commit


02 Jul, 2016

2 commits

  • People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
    they want packets going in both directions on a flow to hash to the
    same bucket.

    The core kernel SKB hash became non-symmetric when the ipv6 flow label
    and other entities were incorporated into the standard flow hash order
    to increase entropy.

    But there are no users of PACKET_FANOUT_HASH who want an asymmetric
    hash; they all want a symmetric one.

    Therefore, use the flow dissector to compute a flat symmetric hash
    over only the protocol, addresses and ports. This hash does not get
    installed into and override the normal skb hash, so this change has
    no effect whatsoever on the rest of the stack.

    Reported-by: Eric Leblond
    Tested-by: Eric Leblond
    Signed-off-by: David S. Miller

    David S. Miller
     
  • Since bpf_prog_get() and the program type check are used in a couple of
    places, refactor this into a small helper function that we can make use
    of. Since the non-RO prog->aux part is not used in performance-critical
    paths and program destruction via RCU is very unlikely when doing the
    put, we shouldn't have an issue just doing the bpf_prog_get() +
    prog->type != type check. However, not taking the ref at all (due to
    being in the fdget() / fdput() section of the bpf fd) is even cleaner
    and makes the diff smaller as well, so just go for that. Call sites are
    changed to make use of the new helper where possible.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
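
The symmetric hash requested for PACKET_FANOUT_HASH in the first commit of this group can be illustrated by ordering the two endpoints before hashing. This is a sketch under stated assumptions: the function name `symmetric_flow_hash` is hypothetical, and CRC32 is swapped in for the kernel's own hash function purely to keep the example self-contained.

```python
import ipaddress
import struct
import zlib


def symmetric_flow_hash(proto, src_ip, src_port, dst_ip, dst_port):
    """Hash protocol, addresses and ports so both directions match."""
    a = (ipaddress.ip_address(src_ip).packed, src_port)
    b = (ipaddress.ip_address(dst_ip).packed, dst_port)
    # Canonical ordering: the same (lo, hi) pair results no matter
    # which endpoint is the source of this particular packet.
    lo, hi = sorted([a, b])
    key = (struct.pack("!B", proto)
           + lo[0] + struct.pack("!H", lo[1])
           + hi[0] + struct.pack("!H", hi[1]))
    return zlib.crc32(key) & 0xFFFFFFFF
```

A fanout group of N sockets would then pick `symmetric_flow_hash(...) % N`, so requests and replies of one flow land in the same bucket.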
     

11 Jun, 2016

1 commit


10 Jun, 2016

1 commit

  • Socket option PACKET_FANOUT_DATA takes a struct sock_fprog as argument
    if PACKET_FANOUT has mode PACKET_FANOUT_CBPF. This structure contains
    a pointer into user memory. If userland is 32-bit and kernel is 64-bit
    the two disagree about the layout of struct sock_fprog.

    Add compat setsockopt support to convert a 32-bit compat_sock_fprog to
    a 64-bit sock_fprog. This is analogous to compat_sock_fprog support for
    SO_REUSEPORT added in commit 1957598840f4 ("soreuseport: add compat
    case for setsockopt SO_ATTACH_REUSEPORT_CBPF").

    Reported-by: Daniel Borkmann
    Signed-off-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
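
The layout disagreement can be made concrete with Python's struct module. The sketch assumes a little-endian machine and the usual alignment rules (a 16-bit length followed by a pointer); the names `COMPAT_SOCK_FPROG`, `NATIVE_SOCK_FPROG` and `compat_to_native` are hypothetical.

```python
import struct

# 32-bit userland view: u16 len, 2 bytes padding, 32-bit pointer.
COMPAT_SOCK_FPROG = struct.Struct("<H2xI")
# 64-bit kernel view: u16 len, 6 bytes padding, 64-bit pointer.
NATIVE_SOCK_FPROG = struct.Struct("<H6xQ")


def compat_to_native(raw):
    """Convert a 32-bit sock_fprog image to the 64-bit layout."""
    length, filter_ptr = COMPAT_SOCK_FPROG.unpack(raw)
    # The 32-bit user pointer is widened to 64 bits unchanged.
    return NATIVE_SOCK_FPROG.pack(length, filter_ptr)
```

The two views differ in size (8 versus 16 bytes), which is why the kernel cannot simply reinterpret the buffer handed over by a 32-bit process.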
     

24 Apr, 2016

1 commit


15 Apr, 2016

1 commit

  • consume_skb() is not meant for error cases; kfree_skb() is the proper
    call there. This patch fixes tpacket_rcv() and packet_rcv() to use the
    matching function for error and non-error cases, letting perf trace
    its events properly.

    Signed-off-by: Weongyo Jeong
    Signed-off-by: David S. Miller

    Weongyo Jeong
     

14 Apr, 2016

1 commit

  • Because we fail to wipe the remainder of i->addr[] in packet_mc_add(),
    pdiag_put_mclist() leaks uninitialized heap bytes via the
    PACKET_DIAG_MCLIST netlink attribute.

    Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].

    Fixes: eea68e2f1a00 ("packet: Report socket mclist info via diag module")
    Signed-off-by: Mathias Krause
    Cc: Eric W. Biederman
    Cc: Pavel Emelyanov
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Mathias Krause
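
The fix amounts to zeroing the unused tail of a fixed-size address buffer. A minimal sketch, assuming a 32 byte buffer (taken here as MAX_ADDR_LEN) and a hypothetical helper `store_mc_addr`; stale bytes in the slot stand in for uninitialized heap memory.

```python
MAX_ADDR_LEN = 32  # assumed size of the fixed address buffer


def store_mc_addr(slot, addr):
    """Copy addr into slot and wipe the bytes after it."""
    alen = len(addr)
    if alen > MAX_ADDR_LEN or len(slot) != MAX_ADDR_LEN:
        raise ValueError("address or slot has the wrong size")
    slot[:alen] = addr
    # Without this line the old contents of slot[alen:] would be
    # reported along with the address.
    slot[alen:] = bytes(MAX_ADDR_LEN - alen)
    return alen
```

Dumping the whole slot afterwards exposes only the address followed by zeros.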
     

10 Apr, 2016

1 commit


07 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
    This is very costly when users want to sample writes to gather
    tx timestamps.

    Add support for enabling SO_TIMESTAMPING via control messages by
    using tsflags added in `struct sockcm_cookie` (added in the previous
    patches in this series) to set the tx_flags of the last skb created in
    a sendmsg. With this patch, the timestamp recording bits in tx_flags
    of the skbuff are overridden if SO_TIMESTAMPING is passed in a cmsg.

    Please note that this is only effective for overriding the timestamp
    recording flags. Users should enable timestamp reporting (e.g.,
    SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
    socket options and then should ask for SOF_TIMESTAMPING_TX_*
    using control messages per sendmsg to sample timestamps for each
    write.

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
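
From user space, the split between reporting flags (socket option) and recording flags (control message) might look like the following. This is a sketch, not a tested recipe: `SO_TIMESTAMPING = 37` is an assumption that holds on x86 and most other Linux architectures, the flag values are assumed to match the Linux headers, and the helper names are hypothetical.

```python
import socket
import struct

SO_TIMESTAMPING = 37  # assumption: value on x86 and most Linux arches
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SOF_TIMESTAMPING_SOFTWARE = 1 << 4
SOF_TIMESTAMPING_OPT_ID = 1 << 7


def reporting_flags():
    """Flags to set once per socket with setsockopt."""
    return SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID


def tx_timestamp_cmsg(tx_flags=SOF_TIMESTAMPING_TX_SOFTWARE):
    """Ancillary data asking for a tx timestamp on one sendmsg call."""
    # The payload is a single 32-bit flags word in host byte order.
    return [(socket.SOL_SOCKET, SO_TIMESTAMPING, struct.pack("=I", tx_flags))]


def send_sampled(sock, payload):
    """Send payload and request a timestamp for this write only."""
    return sock.sendmsg([payload], tx_timestamp_cmsg())
```

A caller would run `sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, reporting_flags())` once, then use `send_sampled` only for the writes it wants to sample.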
     

10 Mar, 2016

1 commit

  • Replace link layer header validation check ll_header_truncate with
    more generic dev_validate_header.

    Validation based on hard_header_len incorrectly drops valid packets
    in variable length protocols, such as AX25. dev_validate_header
    calls header_ops.validate for such protocols to ensure correctness
    below hard_header_len.

    See also http://comments.gmane.org/gmane.linux.network/401064

    Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header")
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
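
The difference between a fixed length check and a per-protocol validator can be sketched as follows. This is a loose model: the function `validate_header` and its `proto_validate` callback are hypothetical, and the kernel helper may handle further cases that are omitted here.

```python
def validate_header(header, hard_header_len, proto_validate=None):
    """Decide whether a link layer header is acceptable (toy model)."""
    if len(header) >= hard_header_len:
        # A full-length header needs no protocol-specific check.
        return True
    if proto_validate is not None:
        # Variable length protocols judge shorter headers themselves.
        return bool(proto_validate(header))
    return False
```

A fixed-header device passes no callback and therefore rejects anything shorter than hard_header_len, while a variable length protocol can accept a short but well-formed header.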
     

26 Feb, 2016

1 commit


09 Feb, 2016

4 commits

  • Support socket option PACKET_VNET_HDR together with PACKET_TX_RING.

    When enabled, a struct virtio_net_hdr is expected to precede the data
    in the ring. The vnet option must be set before the ring is created.

    The implementation reuses the existing skb_copy_bits code that is used
    when dev->hard_header_len is non-zero. Move this ll_header check to
    before the skb alloc and combine it with a test for vnet_hdr->hdr_len.
    Allocate and copy the max of the two.

    Verified with test program at
    github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • GSO packet headers must be stored in the linear skb segment.
    Move tpacket header parsing before sock_alloc_send_skb. The GSO
    follow-on patch will later increase the skb linear argument to
    sock_alloc_send_skb if needed for large packets.

    The header parsing code does not require an allocated skb, so is
    safe to move. Later pass to tpacket_fill_skb the computed data
    start and length.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Support socket option PACKET_VNET_HDR together with PACKET_RX_RING.
    When enabled, a struct virtio_net_hdr will precede the data in the
    packet ring slots.

    Verified with test program at
    github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c

    pkt: 1454269209.798420 len=5066
    vnet: gso_type=tcpv4 gso_size=1448 hlen=66 ecn=off
    csum: start=34 off=16
    eth: proto=0x800
    ip: src= dst= proto=6 len=5052

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • packet_snd and packet_rcv support virtio net headers for GSO.
    Move this logic into helper functions to be able to reuse it in
    tpacket_snd and tpacket_rcv.

    This is a straightforward code move with one exception. Instead of
    creating and passing a separate gso_type variable, reuse
    vnet_hdr.gso_type after conversion from virtio to kernel gso type.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
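
For the ring commits in this group, a reader or writer of ring slots has to handle the struct virtio_net_hdr that precedes the packet data. The sketch below assumes the 10 byte header layout in host byte order and works on a contiguous byte string rather than a real mmap'ed ring; the names `VNET_HDR` and `split_vnet_hdr` are hypothetical.

```python
import struct

# flags, gso_type, hdr_len, gso_size, csum_start, csum_offset
VNET_HDR = struct.Struct("=BBHHHH")
VIRTIO_NET_HDR_GSO_TCPV4 = 1  # assumed to match the Linux header


def split_vnet_hdr(data):
    """Split bytes into (header fields, packet data)."""
    fields = VNET_HDR.unpack_from(data)
    names = ("flags", "gso_type", "hdr_len", "gso_size",
             "csum_start", "csum_offset")
    return dict(zip(names, fields)), data[VNET_HDR.size:]
```

Feeding it a header built from the values in the sample output above (gso_type tcpv4, gso_size 1448, hlen 66, csum start 34, off 16) returns those fields and leaves the Ethernet frame as the remainder.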
     

30 Nov, 2015

1 commit

  • Commit 9c7077622dd91 ("packet: make packet_snd fail on len smaller
    than l2 header") added validation for the packet size in packet_snd.
    This change enforces that every packet needs a header (with at least
    hard_header_len bytes) plus a payload with at least one byte. Before
    this change the payload was optional.

    This fixes PPPoE connections which do not have a "Service" or
    "Host-Uniq" configured (which is violating the spec, but is still
    widely used in real-world setups). Those are currently failing with the
    following message: "pppd: packet size is too short (24
    Signed-off-by: David S. Miller

    Martin Blumenstingl
     

18 Nov, 2015

2 commits


16 Nov, 2015

5 commits

  • Since its introduction in commit 69e3c75f4d54 ("net: TX_RING and
    packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
    side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
    reserve check should have reserve as 0, but currently, this is
    unconditionally set (in its original form as dev->hard_header_len).

    I think this is not correct since tpacket_fill_skb() would then
    take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM;
    the extra VLAN_HLEN could be possible in both cases. Presumably, the
    reserve code was copied from packet_snd(), but later on missed the
    check. Make it similar to what we have in packet_snd().

    Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap")
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In case no struct sockaddr_ll has been passed to packet
    socket's sendmsg() when doing a TX_RING flush run, then
    skb->protocol is set to po->num instead, which is the protocol
    passed via socket(2)/bind(2).

    Applications only xmitting can go the path of allocating the
    socket as socket(PF_PACKET, , 0) and do a bind(2) on the
    TX_RING with sll_protocol of 0. That way, register_prot_hook()
    is neither called on creation nor on bind time, which saves
    cycles when there's no interest in capturing anyway.

    That leaves us however with po->num 0 instead and therefore
    the TX_RING flush run sets skb->protocol to 0 as well. Eric
    reported that this leads to problems when using tools like
    trafgen over bonding device. I.e. the bonding's hash function
    could invoke the kernel's flow dissector, which depends on
    skb->protocol being properly set. In the current situation, all
    the traffic is then directed to a single slave.

    Fix it up by inferring skb->protocol from the Ethernet header
    when not set and we have ARPHRD_ETHER device type. This is only
    done in case of SOCK_RAW and where we have a dev->hard_header_len
    length. In case of ARPHRD_ETHER devices, this is guaranteed to
    cover ETH_HLEN, and therefore being accessed on the skb after
    the skb_store_bits().

    Reported-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Packet sockets can be used by various net devices and are not
    really restricted to ARPHRD_ETHER device types. However, when
    currently checking for the extra 4 bytes that can be transmitted
    in VLAN case, our assumption is that we generally probe on
    ARPHRD_ETHER devices. Therefore, before looking into Ethernet
    header, check the device type first.

    This also fixes the issue where non-ARPHRD_ETHER devices could
    have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus
    the check would test unfilled linear part of the skb (instead
    of non-linear).

    Fixes: 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.")
    Fixes: 52f1454f629f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We concluded that the skb_probe_transport_header() should better be
    called unconditionally. Avoiding the call into the flow dissector has
    also not really much to do with the direct xmit mode.

    While it seems that only virtio_net code makes use of GSO from non
    RX/TX ring packet socket paths, we should probe for a transport header
    nevertheless before they hit devices.

    Reference: http://thread.gmane.org/gmane.linux.network/386173/
    Signed-off-by: Daniel Borkmann
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In tpacket_fill_skb(), commit c1aad275b029 ("packet: set transport
    header before doing xmit") and later on 40893fd0fd4e ("net: switch
    to use skb_probe_transport_header()") probed for a transport
    header on the skb from a ring buffer slot, but at a time when
    the skb has _not even_ been filled with data yet. So that call into
    the flow dissector is pretty useless. Let's do it after we've set
    up the skb frags.

    Fixes: c1aad275b029 ("packet: set transport header before doing xmit")
    Reported-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
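
The protocol inference described in the second commit of this group (falling back to the Ethernet header when no protocol was bound) can be sketched in a few lines. The function `infer_protocol` is hypothetical; it returns the EtherType as a host integer, whereas the kernel keeps skb->protocol in network byte order.

```python
import struct

ETH_HLEN = 14  # destination MAC, source MAC, EtherType


def infer_protocol(frame, bound_protocol=0):
    """Return the bound protocol, else the EtherType from the frame."""
    if bound_protocol:
        return bound_protocol
    if len(frame) < ETH_HLEN:
        return 0  # not enough bytes to hold an Ethernet header
    # The EtherType sits at offset 12, after the two 6 byte addresses.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return ethertype
```

With a nonzero result the flow dissector has something to work with, so a bonding hash can spread flows across slaves instead of sending everything to one.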
     

06 Nov, 2015

1 commit

  • There is a race condition between packet_notifier and packet_bind{_spkt}.

    It happens if packet_notifier(NETDEV_UNREGISTER) executes between the
    time packet_bind{_spkt} takes a reference on the new netdevice and the
    time packet_do_bind sets po->ifindex.
    In this case the notification can be missed.
    If this happens during a dev_change_net_namespace this can result in the
    netdevice being moved to the new namespace while the packet_sock in the
    old namespace still holds a reference on it. When the netdevice is later
    deleted in the new namespace the deletion hangs since the packet_sock
    is not found in the new namespace's &net->packet.sklist.
    It can be reproduced with the script below.

    This patch makes packet_do_bind check again for the presence of the
    netdevice in the packet_sock's namespace after the synchronize_net
    in unregister_prot_hook.
    More generally, it also uses the rcu lock for the duration of the bind
    to stop dev_change_net_namespace/rollback_registered_many from
    going past the synchronize_net following unlist_netdevice, so that
    no NETDEV_UNREGISTER notifications can happen on the new netdevice
    while the bind is executing. In order to do this some code from
    packet_bind{_spkt} is consolidated into packet_do_bind.

    import socket, os, time, sys
    proto=7
    realDev='em1'
    vlanId=400
    if len(sys.argv) > 1:
        vlanId=int(sys.argv[1])
    dev='vlan%d' % vlanId

    os.system('taskset -p 0x10 %d' % os.getpid())

    s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto)
    os.system('ip link add link %s name %s type vlan id %d' %
              (realDev, dev, vlanId))
    os.system('ip netns add dummy')

    pid=os.fork()

    if pid == 0:
        # dev should be moved while packet_do_bind is in synchronize net
        os.system('taskset -p 0x20000 %d' % os.getpid())
        os.system('ip link set %s netns dummy' % dev)
        os.system('ip netns exec dummy ip link del %s' % dev)
        s.close()
        sys.exit(0)

    time.sleep(.004)
    try:
        s.bind(('%s' % dev, proto+1))
    except:
        print('Could not bind socket')
        s.close()
        os.system('ip netns del dummy')
        sys.exit(0)

    os.waitpid(pid, 0)
    s.close()
    os.system('ip netns del dummy')
    sys.exit(0)

    Signed-off-by: Francesco Ruggeri
    Signed-off-by: David S. Miller

    Francesco Ruggeri
     

13 Oct, 2015

3 commits

  • The function ip_defrag is called on both the input and the output
    paths of the networking stack. In particular conntrack when it is
    tracking outbound packets from the local machine calls ip_defrag.

    So add a struct net parameter and stop making ip_defrag guess which
    network namespace it needs to defragment packets in.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Recent TCP listener patches exposed a prior af_packet bug:
    match_fanout_group() blindly assumes it is always safe
    to cast sk to a packet socket to compare fanout with af_packet_priv.

    But SYNACK packets can be sent while attached to request_sock, which
    are smaller than a "struct sock".

    We can read non-existent memory and crash.

    Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group")
    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Cc: Eric Leblond
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Edward Hyunkoo Jee
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Edward Jee
     

11 Oct, 2015

1 commit

  • eBPF socket filter programs may see junk in 'u32 cb[5]' area,
    since it could have been used by protocol layers earlier.

    For socket filter programs used in af_packet we need to clean
    20 bytes of skb->cb area if it could be used by the program.
    For programs attached to TCP/UDP sockets we need to save/restore
    these 20 bytes, since it's used by protocol layers.

    Remove SK_RUN_FILTER macro, since it's no longer used.

    Long term we may move this bpf cb area to per-cpu scratch, but that
    requires addition of new 'per-cpu load/store' instructions,
    so not suitable as a short term fix.

    Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields")
    Reported-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

05 Oct, 2015

1 commit

  • The current ongoing effort to dump existing cBPF seccomp filters back
    to user space requires holding the pre-transformed instructions, as
    we do for socket filters on the sk_attach_filter() side, so they
    can be reloaded in original form at a later point in time by utilities
    such as criu.

    To prepare for this, simply extend the bpf_prog_create_from_user()
    API to hold a flag that tells whether we should store the original
    or not. Also, fanout filters could make use of that in future for
    things like diag. While fanout filters already use bpf_prog_destroy(),
    move seccomp over to them as well to handle original programs when
    present.

    Signed-off-by: Daniel Borkmann
    Cc: Tycho Andersen
    Cc: Pavel Emelyanov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Alexei Starovoitov
    Tested-by: Tycho Andersen
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

24 Sep, 2015

1 commit

  • Commit 7d82410950aa ("virtio: add explicit big-endian support to memory
    accessors") accidentally changed the virtio_net header used by
    AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

    Since virtio_legacy_is_little_endian() is a very long identifier,
    define a vio_le macro and use that throughout the code instead of the
    hard-coded 'false' for little-endian.

    This restores the ABI to match 4.1 and earlier kernels, and makes my
    test program work again.

    Signed-off-by: David Woodhouse
    Signed-off-by: David S. Miller

    David Woodhouse
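
The ABI difference between a hard-coded big-endian conversion and the restored host-endian behaviour can be shown with a 16-bit field such as gso_size. The helper names below are hypothetical and only model the byte order choice; they are not the kernel's accessors.

```python
import struct
import sys


def host_is_little_endian():
    """True on little-endian hosts."""
    return sys.byteorder == "little"


def to_virtio16(little_endian, value):
    """Encode a 16-bit header field in the selected byte order."""
    return struct.pack("<H" if little_endian else ">H", value)
```

Passing `host_is_little_endian()` as the first argument reproduces the host-endian encoding, whereas a hard-coded `False` always yields big-endian bytes, which on a little-endian host differs from what existing applications expect.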