30 Jan, 2018

10 commits

  • Introduce a new qdisc ops ->change_tx_queue_len() so that
    each qdisc can decide how to implement this if it wants.
    Previously we simply read dev->tx_queue_len; after pfifo_fast
    switches to skb array, we need this API to resize the skb array
    when we change dev->tx_queue_len.

    To avoid handling race conditions with TX BH, we need to
    deactivate all TX queues before changing the value and bring them
    back after we are done; this also makes the implementation easier.

    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • This patch promotes the local change_tx_queue_len() to a core
    helper function, dev_change_tx_queue_len(), so that rtnetlink
    and net-sysfs could share the code. This also prepares for the
    following patch.

    Note, the -EFAULT in the original code doesn't make sense,
    we should propagate the errno from notifiers.

    Cc: John Fastabend
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • The goal is to let the user follow an interface that moves to another
    netns.

    CC: Jiri Benc
    CC: Christian Brauner
    Signed-off-by: Nicolas Dichtel
    Reviewed-by: Jiri Benc
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • In theory the compiler could tear queue loads or stores in two. This
    does not seem to happen in practice, but it seems easier to convert the
    cases where this would be a problem to READ_ONCE/WRITE_ONCE than to
    worry about it.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
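
The tearing concern in the commit above can be illustrated with a minimal userspace sketch. The DEMO_* macros and demo_* functions are hypothetical stand-ins, not the kernel's READ_ONCE()/WRITE_ONCE(), which are more elaborate; the sketch only shows the idea of forcing a single access through a volatile-qualified pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for READ_ONCE()/WRITE_ONCE(), limited to pointer
 * slots. The volatile access asks the compiler to emit one load or store
 * rather than splitting or repeating it. */
#define DEMO_READ_ONCE_PTR(x)     (*(void *volatile *)&(x))
#define DEMO_WRITE_ONCE_PTR(x, v) (*(void *volatile *)&(x) = (v))

/* Load one queue slot with a single access. */
static void *demo_slot_load(void **queue, int idx)
{
	return DEMO_READ_ONCE_PTR(queue[idx]);
}

/* Store one queue slot with a single access. */
static void demo_slot_store(void **queue, int idx, void *ptr)
{
	DEMO_WRITE_ONCE_PTR(queue[idx], ptr);
}

/* Usage example: a stored pointer is read back unchanged. */
void demo_once_example(void)
{
	int item = 42;
	void *queue[2] = { NULL, NULL };

	demo_slot_store(queue, 1, &item);
	assert(demo_slot_load(queue, 1) == &item);
	assert(demo_slot_load(queue, 0) == NULL);
}
```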
     
  • __skb_array_empty should use __ptr_ring_empty since that's the only
    legal lockless function.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • This reverts commit bcecb4bbf88aa03171c30652bca761cf27755a6b.

    If we try to allocate an extra entry as the above commit did, and
    the requested size is UINT_MAX, the addition overflows, causing a zero
    size to be passed to kmalloc().

    kmalloc() then returns ZERO_SIZE_PTR, with a subsequent crash.

    Reported-by: syzbot+87678bcf753b44c39b67@syzkaller.appspotmail.com
    Cc: John Fastabend
    Signed-off-by: Michael S. Tsirkin
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
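
The overflow described in the revert above is plain unsigned wraparound. The following is a minimal sketch of the arithmetic only; the helper names are hypothetical and nothing here calls the kernel allocator.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: slot count when one extra entry is requested.
 * Unsigned arithmetic wraps, so UINT_MAX + 1 yields 0. */
static unsigned int demo_slots_with_extra_entry(unsigned int size)
{
	return size + 1u;
}

/* Number of bytes that would be requested from the allocator. */
static size_t demo_alloc_bytes(unsigned int size)
{
	return (size_t)demo_slots_with_extra_entry(size) * sizeof(void *);
}

/* Usage example: the largest request collapses to a zero-byte allocation. */
void demo_overflow_example(void)
{
	assert(demo_alloc_bytes(4) == 5 * sizeof(void *));
	assert(demo_slots_with_extra_entry(UINT_MAX) == 0);
	assert(demo_alloc_bytes(UINT_MAX) == 0);
}
```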
     
  • Similar to bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can
    overrun array bounds") a lockless use of __ptr_ring_full might
    cause an out of bounds access.

    We can fix this, but it's easier to just disallow lockless
    __ptr_ring_full for now.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • Lockless __ptr_ring_empty requires that the consumer head is read and
    written at once, atomically. Annotate accordingly to make sure the
    compiler does it correctly. Switch locked callers to __ptr_ring_peek,
    which does not support lockless operation.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • The only function safe to call without locks
    is __ptr_ring_empty. Move documentation about
    lockless use there to make sure people do not
    try to use __ptr_ring_peek outside locks.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • The comment near __ptr_ring_peek says:

    * If ring is never resized, and if the pointer is merely
    * tested, there's no need to take the lock - see e.g. __ptr_ring_empty.

    but this was in fact never possible since consumer_head would sometimes
    point outside the ring. Refactor the code so that it's always
    pointing within a ring.

    Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
    Signed-off-by: Michael S. Tsirkin
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
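
The invariant from the commit above (consumer_head always indexes a valid slot) can be sketched as follows. The demo_ring type and functions are hypothetical and leave out the batching and invalidation logic of the real ptr_ring; only the wrap-before-store ordering is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much reduced ring: NULL marks an empty slot. */
struct demo_ring {
	int size;
	int consumer_head;
	void **queue;
};

/* Advance the consumer. The wrap happens on a local copy, so the value
 * stored in consumer_head is always within [0, size). */
static void demo_ring_advance(struct demo_ring *r)
{
	int head = r->consumer_head + 1;

	if (head >= r->size)
		head = 0;
	r->consumer_head = head;
}

/* Emptiness test that only inspects the slot at consumer_head. */
static int demo_ring_empty(const struct demo_ring *r)
{
	assert(r->consumer_head >= 0 && r->consumer_head < r->size);
	return r->queue[r->consumer_head] == NULL;
}
```

Because the stored index never equals size, a reader that merely tests the slot cannot index past the end of the array.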
     

29 Jan, 2018

2 commits

  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2018-01-26

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) A number of extensions to tcp-bpf, from Lawrence.
    - direct R or R/W access to many tcp_sock fields via bpf_sock_ops
    - passing up to 3 arguments to bpf_sock_ops functions
    - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
    - optionally calling bpf_sock_ops program when RTO fires
    - optionally calling bpf_sock_ops program when packet is retransmitted
    - optionally calling bpf_sock_ops program when TCP state changes
    - access to tclass and sk_txhash
    - new selftest

    2) div/mod exception handling, from Daniel.
    One of the ugly leftovers from the early eBPF days is that div/mod
    operations based on registers have a hard-coded src_reg == 0 test
    in the interpreter as well as in JIT code generators that would
    return from the BPF program with exit code 0. This was basically
    adopted from cBPF interpreter for historical reasons.
    There are multiple reasons why this is very suboptimal and prone
    to bugs. To name one: the return code mapping for such abnormal
    program exit of 0 does not always match with a suitable program
    type's exit code mapping. For example, '0' in tc means action 'ok'
    where the packet gets passed further up the stack, which is just
    undesirable for such cases (e.g. when implementing policy) and
    also does not match with other program types.
    After considering _four_ different ways to address the problem,
    we adopt the same behavior as on some major archs like ARMv8:
    X div 0 results in 0, and X mod 0 results in X. The aarch64 and
    aarch32 ISAs do not generate any traps or otherwise abort
    program execution for unsigned divides.
    Given the options, this seems the most suitable of
    them all, also since major archs have similar schemes in
    place. Given this is all in the realm of undefined behavior,
    we still have the option to adapt if deemed necessary.

    3) sockmap sample refactoring, from John.

    4) lpm map get_next_key fixes, from Yonghong.

    5) test cleanups, from Alexei and Prashant.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
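
The div/mod semantics described in item 2 of the pull request above can be written out as a small sketch. The function names are hypothetical; this models the stated results only (X div 0 gives 0, X mod 0 gives X), not the interpreter or JIT code.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned division where a zero divisor yields 0 instead of an exit. */
static uint64_t demo_bpf_div(uint64_t x, uint64_t y)
{
	return y ? x / y : 0;
}

/* Unsigned modulo where a zero divisor leaves the dividend unchanged. */
static uint64_t demo_bpf_mod(uint64_t x, uint64_t y)
{
	return y ? x % y : x;
}

/* Usage example covering both the normal and the zero-divisor case. */
void demo_divmod_example(void)
{
	assert(demo_bpf_div(10, 3) == 3);
	assert(demo_bpf_mod(10, 3) == 1);
	assert(demo_bpf_div(10, 0) == 0);
	assert(demo_bpf_mod(10, 0) == 10);
}
```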
     

27 Jan, 2018

1 commit

  • Recent findings by syzkaller fixed in 7891a87efc71 ("bpf: arsh is
    not supported in 32 bit alu thus reject it") triggered a warning
    in the interpreter due to unknown opcode not being rejected by
    the verifier. The 'return 0' for an unknown opcode is really not
    optimal, since with BPF to BPF calls, this would go untracked by
    the verifier.

    Do two things here to improve the situation: i) perform basic insn
    sanity check early on in the verification phase and reject every
    non-uapi insn right there. The bpf_opcode_in_insntable() table
    reuses the same mapping as the jumptable in ___bpf_prog_run() sans
    the non-public mappings. And ii) in ___bpf_prog_run() we do need
    to BUG in the case where the verifier would ever create an unknown
    opcode due to some rewrites.

    Note that JITs do not have such issues since they would punt to
    interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
    would also help to avoid such unknown opcodes in the first place.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

26 Jan, 2018

14 commits

  • Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2018-01-26

    One last patch for this development cycle:

    1) Add ESN support for IPSec HW offload.
    From Yossef Efraim.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The patch adds support for openvswitch to configure erspan
    v1 and v2. The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added
    to uapi as a binary blob to support all ERSPAN v1 and v2's
    fields. Note that the previous commit "openvswitch: Add erspan tunnel
    support." was reverted since it was not designed properly.

    Signed-off-by: William Tu
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    William Tu
     
  • The patch adds a new uapi header file, erspan.h, and moves
    the 'struct erspan_metadata' from internal erspan.h to it.

    Signed-off-by: William Tu
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    William Tu
     
  • Originally the erspan fields were defined as a group in a __be16 field
    and accessed with a mask and offset. This is more costly due to
    calling ntohs/htons. The patch changes it to use bitfields.

    Signed-off-by: William Tu
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    William Tu
     
  • Very few (mlxsw) upstream drivers seem to allow offload of chains
    other than 0. Save driver developers typing and add a helper for
    checking both if ethtool's TC offload flag is on and if chain is 0.
    This helper will set the extack appropriately in both error cases.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
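
The combined check described in the commit above might look roughly like the sketch below. All types and names are hypothetical userspace stand-ins (the real helper works on struct net_device and the extack infrastructure), and the message strings are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device and the common offload state. */
struct demo_netdev {
	int hw_tc_offload;        /* ethtool TC offload flag */
};

struct demo_cls_common {
	unsigned int chain_index;
	const char *extack_msg;   /* set when the check fails */
};

/* Return 1 only if offload is enabled and the filter is on chain 0;
 * otherwise record why the request was refused. */
static int demo_can_offload_and_chain0(const struct demo_netdev *dev,
				       struct demo_cls_common *common)
{
	if (!dev->hw_tc_offload) {
		common->extack_msg = "TC offload is disabled on net device";
		return 0;
	}
	if (common->chain_index != 0) {
		common->extack_msg = "Driver supports only offload of chain 0";
		return 0;
	}
	return 1;
}

/* Usage example: chain 1 is refused even with offload enabled. */
void demo_offload_example(void)
{
	struct demo_netdev dev = { 1 };
	struct demo_cls_common common = { 1, NULL };

	assert(demo_can_offload_and_chain0(&dev, &common) == 0);
	assert(common.extack_msg != NULL);
}
```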
     
  • Adds support for calling sock_ops BPF program when there is a TCP state
    change. Two arguments are used; one for the old state and another for
    the new state.

    There is a new enum in include/uapi/linux/bpf.h that exports the TCP
    states, prepending BPF_ to the current TCP state names. If it is ever
    necessary to change the internal TCP state values (other than adding
    more to the end), then it will become necessary to convert from the
    internal TCP state value to the BPF value before calling the BPF
    sock_ops function. There is a set of compile-time checks added in tcp.c
    to detect if the internal and BPF values differ so we can make the
    necessary fixes.

    New op: BPF_SOCK_OPS_STATE_CB.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
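
The compile-time checks mentioned in the commit above can be sketched with C11 _Static_assert. The two enums below are hypothetical, shortened copies used only for illustration; the kernel uses its own build-time assertion macro and the full state lists.

```c
#include <assert.h>

/* Hypothetical internal state values (shortened list). */
enum demo_tcp_state {
	DEMO_TCP_ESTABLISHED = 1,
	DEMO_TCP_SYN_SENT,
	DEMO_TCP_SYN_RECV,
};

/* Hypothetical exported values that must stay identical. */
enum demo_bpf_tcp_state {
	DEMO_BPF_TCP_ESTABLISHED = 1,
	DEMO_BPF_TCP_SYN_SENT,
	DEMO_BPF_TCP_SYN_RECV,
};

/* Fail the build if an exported value drifts from the internal one. */
#define DEMO_CHECK_STATE(name) \
	_Static_assert((int)DEMO_BPF_##name == (int)DEMO_##name, \
		       "exported and internal TCP state values differ")

DEMO_CHECK_STATE(TCP_ESTABLISHED);
DEMO_CHECK_STATE(TCP_SYN_SENT);
DEMO_CHECK_STATE(TCP_SYN_RECV);

/* With the checks in place the state can be passed through unconverted. */
static int demo_state_for_bpf(enum demo_tcp_state state)
{
	return (int)state;
}
```

If the two enums ever diverge, the build fails at the matching DEMO_CHECK_STATE line, which is the point where a conversion step would have to be added.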
     
  • Adds support for calling sock_ops BPF program when there is a
    retransmission. Three arguments are used; one for the sequence number,
    another for the number of segments retransmitted, and the last one for
    the return value of tcp_transmit_skb (0 => success).
    Does not include syn-ack retransmissions.

    New op: BPF_SOCK_OPS_RETRANS_CB.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
     
  • Add support for reading many more tcp_sock fields

    state, same as sk->sk_state
    rtt_min same as sk->rtt_min.s[0].v (current rtt_min)
    snd_ssthresh
    rcv_nxt
    snd_nxt
    snd_una
    mss_cache
    ecn_flags
    rate_delivered
    rate_interval_us
    packets_out
    retrans_out
    total_retrans
    segs_in
    data_segs_in
    segs_out
    data_segs_out
    lost_out
    sacked_out
    sk_txhash
    bytes_received (__u64)
    bytes_acked (__u64)

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
     
  • Adds an optional call to sock_ops BPF program based on whether the
    BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_cb_flags.
    The BPF program is passed 2 arguments: icsk_retransmits and whether the
    RTO has expired.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
     
  • Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
    use is to determine if there should be calls to sock_ops bpf program at
    various points in the TCP code. The field is initialized to zero,
    disabling the calls. A sock_ops BPF program can set it, per connection and
    as necessary, when the connection is established.

    It also adds support for reading and writing the field within a
    sock_ops BPF program. Reading is done by accessing the field directly.
    However, writing is done through the helper function
    bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
    is trying to set a callback that is not supported in the current kernel
    (i.e. running an older kernel). The helper function returns 0 if it was
    able to set all of the bits set in the argument, a positive number
    containing the bits that could not be set, or -EINVAL if the socket is
    not a full TCP socket.

    Examples of where one could call the bpf program:

    1) When RTO fires
    2) When a packet is retransmitted
    3) When the connection terminates
    4) When a packet is sent
    5) When a packet is received

    Signed-off-by: Lawrence Brakmo
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
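
The return convention of the flag-setting helper described above (0 on full success, the unsupported bits otherwise, -EINVAL for a socket that is not a full TCP socket) can be sketched in userspace as follows. The flag names, the struct, and the function are hypothetical; this is not the kernel helper itself.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback flags known to "this kernel". */
#define DEMO_CB_RTO_FLAG      (1u << 0)
#define DEMO_CB_RETRANS_FLAG  (1u << 1)
#define DEMO_CB_ALL_FLAGS     (DEMO_CB_RTO_FLAG | DEMO_CB_RETRANS_FLAG)

/* Hypothetical stand-in for the socket state. */
struct demo_sock {
	int is_full_tcp_sock;
	unsigned int cb_flags;
};

/* Set the supported bits and report the ones that could not be set. */
static int demo_cb_flags_set(struct demo_sock *sk, unsigned int argval)
{
	if (!sk->is_full_tcp_sock)
		return -EINVAL;

	sk->cb_flags = argval & DEMO_CB_ALL_FLAGS;
	return (int)(argval & ~DEMO_CB_ALL_FLAGS);
}

/* Usage example: bit 3 is unknown here, so it is handed back. */
void demo_cb_flags_example(void)
{
	struct demo_sock sk = { 1, 0 };

	assert(demo_cb_flags_set(&sk, DEMO_CB_RTO_FLAG | (1u << 3)) == (1 << 3));
	assert(sk.cb_flags == DEMO_CB_RTO_FLAG);
}
```

A program built against newer headers can use the positive return value to detect which callbacks the running kernel does not know about.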
     
  • Adds support for passing up to 4 arguments to sock_ops bpf functions. It
    reuses the reply union, so the bpf_sock_ops structures are not
    increased in size.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
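
Why reusing the reply union keeps the structure size unchanged can be seen in a small sketch. The struct below is a hypothetical, shortened model (one leading field plus the union), not the real bpf_sock_ops layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced context structure. The argument array overlays
 * the reply storage instead of being appended after it. */
struct demo_sock_ops {
	uint32_t op;
	union {
		uint32_t args[4];      /* inputs passed to the program */
		uint32_t reply;        /* single-word result */
		uint32_t replylong[4]; /* multi-word result */
	};
};

/* The union is as large as its largest member, so adding args[4] next to
 * replylong[4] does not grow the structure. */
_Static_assert(sizeof(struct demo_sock_ops) == 5 * sizeof(uint32_t),
	       "args must not increase the structure size");

/* Fill in up to three arguments before invoking a program. */
static void demo_set_args(struct demo_sock_ops *ops, uint32_t op,
			  uint32_t a0, uint32_t a1, uint32_t a2)
{
	ops->op = op;
	ops->args[0] = a0;
	ops->args[1] = a1;
	ops->args[2] = a2;
}
```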
     
  • This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
    struct tcp_sock or struct sock fields. This required adding a new
    field "temp" to struct bpf_sock_ops_kern for temporary storage that
    is used by sock_ops_convert_ctx_access. It is used to store and recover
    the contents of a register, so the register can be used to store the
    address of the sk. Since we cannot overwrite the dst_reg because it
    contains the pointer to ctx, nor the src_reg since it contains the value
    we want to store, we need an extra register to contain the address
    of the sk.

    Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
    GET or SET macros depending on the value of the TYPE field.

    Signed-off-by: Lawrence Brakmo
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov

    Lawrence Brakmo
     
  • …etooth/bluetooth-next

    Johan Hedberg says:

    ====================
    pull request: bluetooth-next 2018-01-25

    Here's one last bluetooth-next pull request for the 4.16 kernel:

    - Improved support for Intel controllers
    - New set_parity method to serdev (agreed with maintainers to be taken
    through bluetooth-next)
    - Fix error path in hci_bcm (missing call to serdev close)
    - New ID for BCM4343A0 UART controller

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Some dst_ops (e.g. md_dst_ops) do not set this handler. This may result in:
    "BUG: unable to handle kernel NULL pointer dereference at (null)"

    Let's add a helper to check if update_pmtu is available before calling it.

    Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
    Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
    CC: Roman Kapl
    CC: Xin Long
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
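
The guard described in the commit above amounts to testing the function pointer before calling through it. A minimal sketch, using hypothetical userspace types rather than the kernel's struct dst_entry and struct dst_ops:

```c
#include <assert.h>
#include <stddef.h>

struct demo_dst;

/* Hypothetical ops table; update_pmtu may legitimately be NULL. */
struct demo_dst_ops {
	void (*update_pmtu)(struct demo_dst *dst, unsigned int mtu);
};

struct demo_dst {
	const struct demo_dst_ops *ops;
	unsigned int pmtu;
};

/* Example handler that simply records the new path MTU. */
static void demo_store_pmtu(struct demo_dst *dst, unsigned int mtu)
{
	dst->pmtu = mtu;
}

/* Helper: call the handler only if the dst and the handler exist. */
static void demo_dst_update_pmtu(struct demo_dst *dst, unsigned int mtu)
{
	if (dst && dst->ops->update_pmtu)
		dst->ops->update_pmtu(dst, mtu);
}

/* Usage example: a dst without a handler is left untouched. */
void demo_pmtu_example(void)
{
	const struct demo_dst_ops no_handler = { NULL };
	struct demo_dst dst = { &no_handler, 1500 };

	demo_dst_update_pmtu(&dst, 1400);
	assert(dst.pmtu == 1500);
}
```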
     

25 Jan, 2018

13 commits

  • When a tcp socket is closed, if it detects that its net namespace is
    exiting, close immediately and do not wait for FIN sequence.

    For normal sockets, a reference is taken to their net namespace, so it will
    never exit while the socket is open. However, kernel sockets do not take a
    reference to their net namespace, so it may begin exiting while the kernel
    socket is still open. In this case if the kernel socket is a tcp socket,
    it will stay open trying to complete its close sequence. The sock's dst(s)
    hold a reference to their interface, which are all transferred to the
    namespace's loopback interface when the real interfaces are taken down.
    When the namespace tries to take down its loopback interface, it hangs
    waiting for all references to the loopback interface to release, which
    results in messages like:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

    These messages continue until the socket finally times out and closes.
    Since the net namespace cleanup holds the net_mutex while calling its
    registered pernet callbacks, any new net namespace initialization is
    blocked until the current net namespace finishes exiting.

    After this change, the tcp socket notices the exiting net namespace, and
    closes immediately, releasing its dst(s) and their reference to the
    loopback interface, which lets the net namespace continue exiting.

    Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
    Signed-off-by: Dan Streetman
    Signed-off-by: David S. Miller

    Dan Streetman
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:

    1) Avoid negative netdev refcount in error flow of xfrm state add, from
    Aviad Yehezkel.

    2) Fix tcpdump decoding of IPSEC decap'd frames by filling in the
    ethernet header protocol field in xfrm{4,6}_mode_tunnel_input().
    From Yossi Kuperman.

    3) Fix a syzbot triggered skb_under_panic in pppoe having to do with
    failing to allocate an appropriate amount of headroom. From
    Guillaume Nault.

    4) Fix memory leak in vmxnet3 driver, from Neil Horman.

    5) Cure out-of-bounds packet memory access in em_nbyte EMATCH module,
    from Wolfgang Bumiller.

    6) Restrict what kinds of sockets can be bound to the KCM multiplexer
    and also disallow when another layer has attached to the socket and
    made use of sk_user_data. From Tom Herbert.

    7) Fix use before init of IOTLB in vhost code, from Jason Wang.

    8) Correct STACR register write bit definition in IBM emac driver, from
    Ivan Mikhaylov.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    net/ibm/emac: wrong bit is used for STA control register write
    net/ibm/emac: add 8192 rx/tx fifo size
    vhost: do not try to access device IOTLB when not initialized
    vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
    i40e: flower: check if TC offload is enabled on a netdev
    qed: Free reserved MR tid
    qed: Remove reserveration of dpi for kernel
    kcm: Check if sk_user_data already set in kcm_attach
    kcm: Only allow TCP sockets to be attached to a KCM mux
    net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr
    net: sched: em_nbyte: don't add the data offset twice
    mlxsw: spectrum_router: Don't log an error on missing neighbor
    vmxnet3: repair memory leak
    ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
    pppoe: take ->needed_headroom of lower device into account on xmit
    xfrm: fix boolean assignment in xfrm_get_type_offload
    xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version
    xfrm: fix error flow in case of add state fails
    xfrm: Add SA to hardware at the end of xfrm_state_construct()

    Linus Torvalds
     
  • no users since 2014

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • Only two of dev_ioctl() callers may pass SIOCGIFCONF to it.
    Separating that codepath from the rest of dev_ioctl() allows both
    to simplify dev_ioctl() itself (all other cases work with struct ifreq *)
    *and* seriously simplify the compat side of that beast: all it takes
    is passing to inet_gifconf() an extra argument - the size of individual
    records (sizeof(struct ifreq) or sizeof(struct compat_ifreq)). With
    dev_ifconf() called directly from sock_do_ioctl()/compat_dev_ifconf()
    that's easy to arrange.

    As a result, the compat side of SIOCGIFCONF doesn't need any
    allocations, copy_in_user() back and forth, etc.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Al Viro
     
  • When building the erspan header for either v1 or v2, the eth_hdr()
    does not point to the right inner packet's eth_hdr,
    causing KASAN to report use-after-free and slab-out-of-bounds reads.

    The patch fixes the following syzkaller issues:
    [1] BUG: KASAN: slab-out-of-bounds in erspan_xmit+0x22d4/0x2430 net/ipv4/ip_gre.c:735
    [2] BUG: KASAN: slab-out-of-bounds in erspan_build_header+0x3bf/0x3d0 net/ipv4/ip_gre.c:698
    [3] BUG: KASAN: use-after-free in erspan_xmit+0x22d4/0x2430 net/ipv4/ip_gre.c:735
    [4] BUG: KASAN: use-after-free in erspan_build_header+0x3bf/0x3d0 net/ipv4/ip_gre.c:698

    [2] CPU: 0 PID: 3654 Comm: syzkaller377964 Not tainted 4.15.0-rc9+ #185
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:53
    print_address_description+0x73/0x250 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x25b/0x340 mm/kasan/report.c:409
    __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:440
    erspan_build_header+0x3bf/0x3d0 net/ipv4/ip_gre.c:698
    erspan_xmit+0x3b8/0x13b0 net/ipv4/ip_gre.c:740
    __netdev_start_xmit include/linux/netdevice.h:4042 [inline]
    netdev_start_xmit include/linux/netdevice.h:4051 [inline]
    packet_direct_xmit+0x315/0x6b0 net/packet/af_packet.c:266
    packet_snd net/packet/af_packet.c:2943 [inline]
    packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2968
    sock_sendmsg_nosec net/socket.c:638 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:648
    SYSC_sendto+0x361/0x5c0 net/socket.c:1729
    SyS_sendto+0x40/0x50 net/socket.c:1697
    do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
    do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
    entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:129
    RIP: 0023:0xf7fcfc79
    RSP: 002b:00000000ffc6976c EFLAGS: 00000286 ORIG_RAX: 0000000000000171
    RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000020011000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020008000
    RBP: 000000000000001c R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

    Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
    Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
    Reported-by: syzbot+9723f2d288e49b492cf0@syzkaller.appspotmail.com
    Reported-by: syzbot+f0ddeb2b032a8e1d9098@syzkaller.appspotmail.com
    Reported-by: syzbot+f14b3703cd8d7670203f@syzkaller.appspotmail.com
    Reported-by: syzbot+eefa384efad8d7997f20@syzkaller.appspotmail.com
    Signed-off-by: William Tu
    Signed-off-by: David S. Miller

    William Tu
     
  • All users are now converted to tc_cls_common_offload_init().

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • cls_bpf now guarantees that only device-bound programs are
    allowed with skip_sw. The drivers no longer pay attention to
    flags on filter load, therefore the bpf_offload member can be
    removed. If flags are needed again they should probably be
    added to struct tc_cls_common_offload instead.

    Signed-off-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski