01 May, 2017

39 commits

  • The kernel side of XDP_FLAGS_SKB_MODE is unsigned, and the rtnetlink
    IFLA_XDP_FLAGS is defined as NLA_U32. Thus, userspace programs under
    samples/bpf/ should use the correct type.

    Fixes: 3993f2cb983b ("samples/bpf: Add support for SKB_MODE to xdp1 and xdp_tx_iptunnel")
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Daniel Borkmann
    Reviewed-by: Andy Gospodarek
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • Jakub Kicinski says:

    ====================
    xdp: use netlink extended ACK reporting

    This series is an attempt to make XDP more user friendly by
    exploiting the recently added netlink extended ACK reporting
    to carry messages to user space.

    David Ahern's iproute2 ext ack patches for ip link are sufficient
    to show the errors like this:

    Error: nfp: MTU too large w/ XDP enabled

    Where the message is coming directly from the driver. There could
    still be a bit of a leap for a complete novice from the message
    above to the right settings, but it's a big improvement over the
    standard "Invalid argument" message.

    v1/non-rfc:
    - add a separate macro in patch 1;
    - add KBUILD_MODNAME as part of the message (Daniel);
    - don't print the error to logs in patch 1.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Try to carry error messages to the user via the netlink extended
    ack message attribute.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Try to carry error messages to the user via the netlink extended
    ack message attribute.

    Signed-off-by: Jakub Kicinski
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Drivers usually have a number of restrictions for running XDP
    - most common being buffer sizes, LRO and number of rings.
    Even though some drivers try to be helpful and print error
    messages, experience shows that users don't often consult
    kernel logs on netlink errors. Try to use the new extended
    ack mechanism to carry the message back to user space.

    Signed-off-by: Jakub Kicinski
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • As we propagate extended ack reporting throughout various paths in
    the kernel it may be that the same function is called with the
    extended ack parameter passed as NULL. One place where that happens
    is in drivers which have a centralized reconfiguration function
    called both from ndos and from ethtool_ops. Add a new helper for
    setting the error message in such conditions.

    The existing helper is left as is to encourage propagating the ext ack
    fully wherever possible. It also makes it clear in the code which
    messages may be lost due to ext ack being NULL.

    Signed-off-by: Jakub Kicinski
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
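
A minimal user-space sketch of the idea described above, not the kernel's actual helper. The names `struct ext_ack`, `SET_ERR_MSG`, `TRY_SET_ERR_MSG` and `reconfig()` are hypothetical stand-ins chosen for illustration. The strict macro assumes a valid pointer, while the tolerant one drops the message when the ext ack is NULL, as could happen when a shared reconfiguration function is reached from ethtool_ops.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's extended ack structure. */
struct ext_ack {
	const char *msg;
};

/* Strict variant: the caller must pass a valid ext ack pointer. */
#define SET_ERR_MSG(extack, text) ((extack)->msg = (text))

/* Tolerant variant: the message is silently lost when extack is NULL. */
#define TRY_SET_ERR_MSG(extack, text)			\
	do {						\
		struct ext_ack *ea__ = (extack);	\
		if (ea__)				\
			ea__->msg = (text);		\
	} while (0)

/*
 * Hypothetical centralized reconfiguration function. An ndo caller
 * would pass a real ext ack, an ethtool caller would pass NULL.
 */
static int reconfig(struct ext_ack *extack, unsigned int mtu,
		    unsigned int max_mtu)
{
	if (mtu > max_mtu) {
		TRY_SET_ERR_MSG(extack, "MTU too large w/ XDP enabled");
		return -1;
	}
	return 0;
}
```

Keeping two separate macros makes it visible at each call site whether a message may be lost.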
     
  • This patch allows users to enable/disable internal TX and/or RX
    clock delay for BCM5481x series PHYs so as to satisfy RGMII timing
    specifications.

    On a particular platform, whether TX and/or RX clock delay is required
    depends on how the PHY is connected to the MAC IP. This requirement can be
    specified through the "phy-mode" property in the platform device tree.

    Signed-off-by: Abhishek Shah
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Abhishek Shah
     
  • Trivial fix to spelling mistakes in printk message.

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • The bnx2x driver is not providing proper alignment on the receive buffers it
    passes to build_skb(), causing skb_shared_info to be misaligned.
    skb_shared_info contains an atomic, and while PPC normally supports
    unaligned accesses, it does not support unaligned atomics.

    Aligning the size of rx buffers will ensure that page_frag_alloc() returns
    aligned addresses.

    This can be reproduced on PPC by setting the network MTU to 1450 (or other
    non-multiple-of-4) and then generating sufficient inbound network traffic
    (one or two large "wget"s usually does it), producing the following oops:

    Unable to handle kernel paging request for unaligned access at address 0xc00000ffc43af656
    Faulting instruction address: 0xc00000000080ef8c
    Oops: Kernel access of bad area, sig: 7 [#1]
    SMP NR_CPUS=2048
    NUMA
    PowerNV
    Modules linked in: vmx_crypto powernv_rng rng_core powernv_op_panel leds_powernv led_class nfsd ip_tables x_tables autofs4 xfs lpfc bnx2x mdio libcrc32c crc_t10dif crct10dif_generic crct10dif_common
    CPU: 104 PID: 0 Comm: swapper/104 Not tainted 4.11.0-rc8-00088-g4c761da #2
    task: c00000ffd4892400 task.stack: c00000ffd4920000
    NIP: c00000000080ef8c LR: c00000000080eee8 CTR: c0000000001f8320
    REGS: c00000ffffc33710 TRAP: 0600 Not tainted (4.11.0-rc8-00088-g4c761da)
    MSR: 9000000000009033
    CR: 24082042 XER: 00000000
    CFAR: c00000000080eea0 DAR: c00000ffc43af656 DSISR: 00000000 SOFTE: 1
    GPR00: c000000000907f64 c00000ffffc33990 c000000000dd3b00 c00000ffcaf22100
    GPR04: c00000ffcaf22e00 0000000000000000 0000000000000000 0000000000000000
    GPR08: 0000000000b80008 c00000ffc43af636 c00000ffc43af656 0000000000000000
    GPR12: c0000000001f6f00 c00000000fe1a000 000000000000049f 000000000000c51f
    GPR16: 00000000ffffef33 0000000000000000 0000000000008a43 0000000000000001
    GPR20: c00000ffc58a90c0 0000000000000000 000000000000dd86 0000000000000000
    GPR24: c000007fd0ed10c0 00000000ffffffff 0000000000000158 000000000000014a
    GPR28: c00000ffc43af010 c00000ffc9144000 c00000ffcaf22e00 c00000ffcaf22100
    NIP [c00000000080ef8c] __skb_clone+0xdc/0x140
    LR [c00000000080eee8] __skb_clone+0x38/0x140
    Call Trace:
    [c00000ffffc33990] [c00000000080fb74] skb_clone+0x74/0x110 (unreliable)
    [c00000ffffc339c0] [c000000000907f64] packet_rcv+0x144/0x510
    [c00000ffffc33a40] [c000000000827b64] __netif_receive_skb_core+0x5b4/0xd80
    [c00000ffffc33b00] [c00000000082b2bc] netif_receive_skb_internal+0x2c/0xc0
    [c00000ffffc33b40] [c00000000082c49c] napi_gro_receive+0x11c/0x260
    [c00000ffffc33b80] [d000000066483d68] bnx2x_poll+0xcf8/0x17b0 [bnx2x]
    [c00000ffffc33d00] [c00000000082babc] net_rx_action+0x31c/0x480
    [c00000ffffc33e10] [c0000000000d5a44] __do_softirq+0x164/0x3d0
    [c00000ffffc33f00] [c0000000000d60a8] irq_exit+0x108/0x120
    [c00000ffffc33f20] [c000000000015b98] __do_irq+0x98/0x200
    [c00000ffffc33f90] [c000000000027f14] call_do_irq+0x14/0x24
    [c00000ffd4923a90] [c000000000015d94] do_IRQ+0x94/0x110
    [c00000ffd4923ae0] [c000000000008d90] hardware_interrupt_common+0x150/0x160

    Signed-off-by: David S. Miller

    Scott Wood
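
The sizing fix described above can be sketched as rounding the buffer size up to an alignment boundary. This is an illustration only: `align_up()` and `rx_buf_size()` are hypothetical helpers, and the overhead and alignment values in the usage note are assumptions, not the driver's real constants.

```c
#include <stddef.h>

/* Round size up to the next multiple of align (a power of two). */
static size_t align_up(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/*
 * Hypothetical rx buffer sizing: payload plus a fixed overhead,
 * rounded up so that whatever follows the buffer stays aligned.
 */
static size_t rx_buf_size(size_t mtu, size_t overhead, size_t align)
{
	return align_up(mtu + overhead, align);
}
```

With an MTU of 1450, an assumed overhead of 14 bytes and an assumed 64-byte alignment, the raw size of 1464 becomes 1472, so consecutive allocations stay on aligned addresses.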
     
  • Commit 7e26bf45e4cb ("net: bridge: allow SW learn to take over HW fdb
    entries") added the ability to "take over an entry which was previously
    learned via HW when it shows up from a SW port".

    However, if an entry was learned via HW and then a control packet
    (e.g., ARP request) was trapped to the CPU, the bridge driver will
    update the entry and remove the externally learned flag, although the
    entry is still present in HW. Instead, only clear the externally learned
    flag in case of roaming.

    Fixes: 7e26bf45e4cb ("net: bridge: allow SW learn to take over HW fdb entries")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Arkadi Sharshevsky
    Cc: Nikolay Aleksandrov
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Arkadi Sharshevsky
     
  • After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
    we always take RTNL lock for ip_ra_control() which is the only place
    we update the list ip_ra_chain, so the ip_ra_lock is no longer needed.

    As Eric points out, BH does not need to be disabled either; RCU readers
    don't care.

    Signed-off-by: Cong Wang
    Acked-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    WANG Cong
     
  • The struct bpf_map_def was extended in commit fb30d4b71214 ("bpf: Add tests
    for map-in-map") with member unsigned int inner_map_idx. This changed the size
    of the maps section in the generated ELF _kern.o files.

    Unfortunately the loader in bpf_load.c does not detect or handle this. Thus,
    older _kern.o files became incompatible, and caused hard-to-debug errors
    where the syscall validation rejected BPF_MAP_CREATE request.

    This patch only detects the situation and aborts load_bpf_file(). It also
    adds code comments warning people who read this loader for inspiration
    about these pitfalls.

    Fixes: fb30d4b71214 ("bpf: Add tests for map-in-map")
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
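
One way such a mismatch could be detected is to check that the maps section is a whole number of map definitions. The sketch below assumes that approach; `struct map_def_sketch` is a hypothetical layout modeled on the description above (the older layout lacking `inner_map_idx`), and `count_maps()` is not the loader's real function.

```c
#include <stddef.h>

/* Hypothetical current layout: six unsigned int members. */
struct map_def_sketch {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
	unsigned int map_flags;
	unsigned int inner_map_idx;
};

/*
 * Return the number of map definitions in a section of the given
 * size, or -1 when the size is not a whole multiple of the struct,
 * which hints at an object built against a different layout.
 */
static int count_maps(size_t section_size)
{
	if (section_size % sizeof(struct map_def_sketch) != 0)
		return -1;
	return (int)(section_size / sizeof(struct map_def_sketch));
}
```

A size check like this is only a heuristic: some counts of old-layout entries happen to add up to a multiple of the new struct size and would pass undetected.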
     
  • We recently added a check to see if nla_nest_start() fails. There are
    two issues with that. First, if it fails then I don't think we should
    call nla_nest_cancel(). Second, it's slightly convoluted but the
    current code returns success but we should return -EMSGSIZE instead.

    Fixes: a50fe0ffd76f ("lwtunnel: check return value of nla_nest_start")
    Signed-off-by: Dan Carpenter
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Dan Carpenter
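
The two points above can be illustrated with a small user-space model. Everything here is hypothetical: `struct msg_buf`, `nest_start()`, `nest_cancel()` and `fill_encap()` only mimic the shape of the netlink nesting calls and are not the kernel API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical message buffer with a fixed capacity. */
struct msg_buf {
	size_t used;
	size_t cap;
};

/* Reserve a nest header; return its offset, or -1 if there is no room. */
static long nest_start(struct msg_buf *b, size_t hdr_len)
{
	long start;

	if (b->cap - b->used < hdr_len)
		return -1;
	start = (long)b->used;
	b->used += hdr_len;
	return start;
}

/* Roll the buffer back to where the nest began. */
static void nest_cancel(struct msg_buf *b, long start)
{
	b->used = (size_t)start;
}

static int fill_encap(struct msg_buf *b, size_t payload_len)
{
	long nest = nest_start(b, 4);

	/* Nothing was added, so there is nothing to cancel. */
	if (nest < 0)
		return -EMSGSIZE;

	if (b->cap - b->used < payload_len) {
		nest_cancel(b, nest);
		return -EMSGSIZE;
	}
	b->used += payload_len;
	return 0;
}
```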
     
  • Presumably we never hit this return, but static checkers complain that
    we need to unlock so we may as well fix that.

    Signed-off-by: Dan Carpenter
    Acked-by: Felix Manlunas
    Signed-off-by: David S. Miller

    Dan Carpenter
     
  • My static checker complains that we're holding a mutex on this error
    path. Let's goto exit instead of returning directly.

    Fixes: b0bccb69eba3 ("qed: Change locking scheme for VF channel")
    Signed-off-by: Dan Carpenter
    Acked-by: Yuval Mintz
    Signed-off-by: David S. Miller

    Dan Carpenter
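
A minimal sketch of the "goto exit instead of returning directly" pattern, using a hypothetical `struct fake_mutex` in place of a real kernel mutex so the example stays self-contained. `handle_request()` and its parameters are illustrative assumptions, not the driver's code.

```c
/* Hypothetical lock that only records whether it is held. */
struct fake_mutex {
	int locked;
};

static void fake_lock(struct fake_mutex *m)
{
	m->locked = 1;
}

static void fake_unlock(struct fake_mutex *m)
{
	m->locked = 0;
}

static int handle_request(struct fake_mutex *m, int req_len, int max_len)
{
	int rc = 0;

	fake_lock(m);

	if (req_len > max_len) {
		/* A direct "return -1;" here would leave the lock held. */
		rc = -1;
		goto exit;
	}

	/* ... normal processing would go here ... */

exit:
	fake_unlock(m);
	return rc;
}
```

Routing every return through a single label keeps the unlock in one place, which is also what lets static checkers verify the error paths.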
     
  • lipeng says:

    ====================
    net: hns: bug fix for HNS driver

    This patchset adds support for deferred dsaf probe when the mdio and
    mbigen modules are not yet loaded (insmod).

    For more details, please refer to individual patch.

    change log:
    V4 -> V5:
    1. Float on net-next;
    2. Delete patch "net: hns: fixed bug that skb used after kfree"
    from this patchset;

    V3 -> V4:
    1. Delete redundant commit message;
    2. Add Reviewed-by: Matthias Brugger ;

    V2 -> V3:
    1. Check return value when platform_get_irq in hns_rcb_get_cfg;

    V1 -> V2:
    1. Return appropriate errno in hns_mac_register_phy;
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In the hip06 and hip07 SoCs, the phy connects to the mdio bus. The mdio
    module is probed with module_init, and, as such,
    is not guaranteed to probe before the HNS driver. So we need
    to support deferred probe.

    We check for probe deferral in the mac init, so we do not init DSAF
    when there is no mdio, and free all resources, to later learn that
    we need to defer the probe.

    Signed-off-by: lipeng
    Reviewed-by: Yisen Zhuang
    Reviewed-by: Matthias Brugger
    Signed-off-by: David S. Miller

    lipeng
     
  • In the hip06 and hip07 SoCs, the interrupt lines from the
    DSAF controllers are connected to mbigen hw module.
    The mbigen module is probed with module_init, and, as such,
    is not guaranteed to probe before the HNS driver. So we need
    to support deferred probe.

    Signed-off-by: lipeng
    Reviewed-by: Yisen Zhuang
    Reviewed-by: Matthias Brugger
    Signed-off-by: David S. Miller

    lipeng
     
  • Jakub Kicinski says:

    ====================
    nfp: optimize XDP TX and small fixes

    This series optimizes the nfp XDP TX performance a little bit.
    I ran quick tests on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
    Single core/queue performance for both touch and drop and touch and
    forward is above 20Mpps @64B packets, drop being 2Mpps faster.
    I think this is max for a single queue on the low power NFPs.

    There are also a few minor fixes included for code in net-next.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • For legacy reasons NFP FW may be compiled to DMA packets to a constant
    offset into the buffer and use the space before it for metadata. This
    ensures that packet data always starts at a certain offset regardless of
    the amount of preceding metadata.

    If rx offset is set to 0 there may still be up to 64 bytes of metadata
    but metadata will start at the beginning of the buffer, instead of:

    data_start_offset = rx_offset - meta_len

    Even though we make the buffers larger to accommodate up to 64 bytes of
    metadata, if there are only N bytes of metadata, we will end up with
    N bytes of headroom and 64 - N bytes of tailroom. Therefore we can't
    rely on that space for XDP headroom. Make sure we always allocate
    the full 256 bytes. This, unfortunately, means we can't fit the headroom
    in a u8 any more.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
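
The arithmetic above can be checked with a short sketch. The constants 256 and 64 come from the text; the function names are hypothetical, and the two-mode split is a simplified reading of the description.

```c
#include <stdint.h>

#define XDP_HEADROOM	256	/* "full 256 bytes" from the text */
#define MAX_META_LEN	64	/* "up to 64 bytes of metadata" */

/* Constant-offset mode: data_start_offset = rx_offset - meta_len. */
static unsigned int data_start_offset(unsigned int rx_offset,
				      unsigned int meta_len)
{
	return rx_offset - meta_len;
}

/*
 * rx offset 0 mode: N bytes of metadata sit at the buffer start, so
 * the extra 64 bytes split into N of headroom and 64 - N of tailroom.
 */
static unsigned int spare_tailroom(unsigned int meta_len)
{
	return MAX_META_LEN - meta_len;
}

/* A u8 holds 0..255, so a headroom of 256 no longer fits. */
static int fits_in_u8(unsigned int value)
{
	return value <= UINT8_MAX;
}
```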
     
  • Right now the required Service Processor ABI version is still tied
    to the max ID of known commands. For the new NSP commands we are adding,
    we check whether the NSP version is recent enough on a command-by-command
    basis. The driver doesn't have to force the device to have the
    very latest flash; anything newer than 0.8 should do.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Reading TX queue indexes from the device memory on each interrupt
    is expensive. It's doubly expensive with XDP running since we have
    two TX rings to check there. If the software indexes indicate that
    the TX queue is completely empty, however, we don't need to look at
    the device completion index at all.

    The queuing CPU is doing a wmb() before kicking the device TX, so
    it should be safe to assume that the CPU handling the completions will
    never see an old value of the software copy of the index.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
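
A simplified model of the shortcut described above. `struct tx_ring`, `read_hw_index()` and `tx_complete()` are hypothetical. The device read is simulated by a plain field plus a counter so the saving is observable, and memory barriers are left out of the sketch.

```c
#include <stdint.h>

struct tx_ring {
	uint32_t wr_p;		/* software write index */
	uint32_t rd_p;		/* software read (completion) index */
	uint32_t hw_rd_p;	/* stands in for the device index */
	unsigned int hw_reads;	/* counts simulated device reads */
};

/* Simulated expensive read of the completion index from the device. */
static uint32_t read_hw_index(struct tx_ring *r)
{
	r->hw_reads++;
	return r->hw_rd_p;
}

/* Return the number of newly completed descriptors. */
static uint32_t tx_complete(struct tx_ring *r)
{
	uint32_t done;

	/* Software indexes say the ring is empty: skip the device read. */
	if (r->wr_p == r->rd_p)
		return 0;

	done = read_hw_index(r) - r->rd_p;
	r->rd_p += done;
	return done;
}
```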
     
  • On the RX path we follow the "drop if allocation of replacement
    buffer fails" rule. With XDP we extended that to the TX action,
    so if XDP prog returned TX but allocation of replacement RX buffer
    failed, we will drop the packet.

    To improve our XDP TX performance extend the idea of rings being
    always full to XDP TX rings. Pre-fill the XDP TX rings with RX
    buffers, and when XDP prog returns TX action swap the RX buffer
    with the next buffer from the TX ring.

    XDP TX complete will no longer free the buffers but let them
    sit on the TX ring and wait for swap with RX buffer, instead.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
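
The buffer swap described above might look roughly like this. The sketch is deliberately simplified: `struct xdp_tx_ring`, `RING_SIZE` and `xdp_tx_swap()` are hypothetical, and completion tracking (whether the slot's previous transmit has finished) is ignored.

```c
#include <stddef.h>

#define RING_SIZE 4	/* assumed ring size for the sketch */

/* Hypothetical XDP TX ring, pre-filled with spare RX buffers. */
struct xdp_tx_ring {
	void *bufs[RING_SIZE];
	unsigned int next;	/* next slot to use for transmit */
};

/*
 * Queue rx_buf for transmit and hand back the buffer that was
 * parked in that slot, to be used as the replacement RX buffer.
 * No allocation happens on this path.
 */
static void *xdp_tx_swap(struct xdp_tx_ring *tx, void *rx_buf)
{
	unsigned int idx = tx->next % RING_SIZE;
	void *spare = tx->bufs[idx];

	tx->bufs[idx] = rx_buf;
	tx->next++;
	return spare;
}
```

Because a spare buffer is always available in the slot, the TX action no longer depends on a fresh allocation succeeding.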
     
  • We will soon allocate RX buffers for caching on XDP TX rings.
    The rx_ring parameter passed to nfp_net_rx_alloc_one() is not
    actually used, remove it.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • As Or points out in commit 423b3aecf290 ("net/mlx4: Change ENOTSUPP
    to EOPNOTSUPP"), ENOTSUPP is an NFS-specific error. Replace it with
    EOPNOTSUPP.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Avoid hashing the tx napi struct into napi_hash[], which is used for
    busy polling receive queues.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Initialise init_net.count to 1 for its pointer from init_nsproxy lest
    someone tries to do a get_net() and a put_net() in a process in which
    current->ns_proxy->net_ns points to the initial network namespace.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Creating a geneve link with 'udpcsum' set results in the creation of a link
    for which UDP checksum will NOT be computed on outbound packets, as can
    be seen below.

    11: gen0: mtu 1500 qdisc noop state DOWN
    link/ether c2:85:27:b6:b4:15 brd ff:ff:ff:ff:ff:ff promiscuity 0
    geneve id 200 remote 192.168.13.1 dstport 6081 noudpcsum

    Similarly, creating a link with 'noudpcsum' set results in the creation
    of a link for which UDP checksum will be computed on outbound packets.

    Fixes: 9b4437a5b870 ("geneve: Unify LWT and netdev handling.")
    Signed-off-by: Girish Moodalbail
    Acked-by: Pravin B Shelar
    Acked-by: Lance Richardson
    Signed-off-by: David S. Miller

    Girish Moodalbail
     
  • Jiri Benc says:

    ====================
    vxlan: do not error out on disabled IPv6

    This patchset fixes a bug with metadata based tunnels when booted with
    ipv6.disable=1.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The message "Cannot bind port X, err=Y" creates only confusion. In metadata
    based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and
    no error message should be printed. But when IPv6 tunnel was requested, such
    failure is fatal. The vxlan_socket_create does not know when the error is
    harmless and when it's not.

    Instead of passing such information down to vxlan_socket_create, remove the
    message completely. It's not useful. We propagate the error code up to the
    user space and the port number comes from the user space. There's nothing in
    the message that the process creating vxlan interface does not know.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns
    -EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
    operation of bringing up the tunnel.

    Ignore failure of IPv6 socket creation for metadata based tunnels caused by
    IPv6 not being available.

    Fixes: b1be00a6c39f ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device")
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
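
A user-space sketch of the error handling described above, under the assumption that a metadata based tunnel opens both an IPv6 and an IPv4 socket. `sock_add()` and `tunnel_open()` are hypothetical stand-ins, not the vxlan driver's functions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical socket creation: IPv6 fails when disabled at runtime. */
static int sock_add(bool ipv6, bool ipv6_enabled)
{
	if (ipv6 && !ipv6_enabled)
		return -EAFNOSUPPORT;
	return 0;
}

static int tunnel_open(bool metadata, bool want_ipv6, bool ipv6_enabled)
{
	int ret = 0;

	if (want_ipv6 || metadata) {
		ret = sock_add(true, ipv6_enabled);
		/*
		 * Only a metadata based tunnel may ignore the failure,
		 * and only when it is caused by IPv6 being unavailable.
		 */
		if (ret < 0 && (!metadata || ret != -EAFNOSUPPORT))
			return ret;
	}

	if (!want_ipv6 || metadata)
		ret = sock_add(false, ipv6_enabled);

	return ret;
}
```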
     
  • Replace pattern

    int status;
    ...
    status = func(...);
    return status;

    by

    return func(...);

    No functional change intended.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: David S. Miller

    Andy Shevchenko
     
  • Reuse bnx2x_null_format_ver() in functions where it's appropriate
    instead of open-coded variants.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: David S. Miller

    Andy Shevchenko
     
  • Use scnprintf() when printing version instead of custom open coded variants.

    Signed-off-by: Andy Shevchenko
    Acked-by: Yuval Mintz
    Signed-off-by: David S. Miller

    Andy Shevchenko
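
For readers outside the kernel: scnprintf() returns the number of characters actually written to the buffer, whereas snprintf() returns the length that would have been written. The helper below, `my_scnprintf()`, is a hypothetical user-space approximation built on the standard vsnprintf(), not the kernel's implementation.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format into buf and return the number of characters actually
 * stored, not counting the terminating NUL.
 */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;
	/* On truncation only size - 1 characters were stored. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

Because the return value never exceeds what was stored, it is safe to use as an offset when appending further pieces of a version string to the same buffer.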
     
  • …ux/kernel/git/mkl/linux-can-next

    Marc Kleine-Budde says:

    ====================
    pull-request: can-next 2017-04-25

    This is a pull request of 1 patch for net-next/master.

    This patch by Oliver Hartkopp fixes the build of the broadcast manager
    with CONFIG_PROC_FS disabled.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • The description inside uapi/linux/bpf.h about bpf_get_socket_uid
    helper function is no longer valid. It returns overflowuid rather
    than 0 on failure.

    Signed-off-by: Chenbo Feng
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Chenbo Feng
     
  • Avoid direct access to sk->sk_state when tcp_poll() is called on a socket
    using active TCP fastopen with deferred connect. Use local variable
    'state', which stores the result of sk_state_load(), like it was done in
    commit 00fd38d938db ("tcp: ensure proper barriers in lockless contexts").

    Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
    Signed-off-by: Davide Caratti
    Acked-by: Wei Wang
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • While testing a fix [1] in ___pskb_trim(), addressing the WARN_ON_ONCE()
    in skb_try_coalesce() reported by Andrey, I found that we had an skb
    with skb->sk set but no skb->destructor.

    This invalidated the heuristic found in commit 158f323b9868 ("net: adjust
    skb->truesize in pskb_expand_head()") and in the cited patch.

    Considering the BUG_ON(skb->sk) we have in skb_orphan(), we should
    restrain the temporary setting to a minimal section.

    [1] https://patchwork.ozlabs.org/patch/755570/
    net: adjust skb->truesize in ___pskb_trim()

    Fixes: 8f917bba0042 ("bpf: pass sk to helper functions")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Cc: Andrey Konovalov
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since 83a77e9ec415, the phydev irq is explicitly set to PHY_POLL when
    there is no pdata. It doesn't work on DT enabled platforms because the
    phydev irq is already set by libphy before.

    Fixes: 83a77e9ec415 ("net: macb: Added PCI wrapper for Platform Driver.")
    Signed-off-by: Alexandre Belloni
    Acked-by: Nicolas Ferre
    Signed-off-by: David S. Miller

    Alexandre Belloni
     

30 Apr, 2017

1 commit

  • Jeff Kirsher says:

    ====================
    40GbE Intel Wired LAN Driver Updates 2017-04-30

    This series contains updates to i40e and i40evf only.

    Jake provides the majority of the changes in this series, starting with the
    renaming of a flag to avoid confusion. Then renamed a variable to a
    more meaningful name to clarify what is actually being done and to
    reduce confusion. Amortizes the wait time when initializing or disabling
    lots of VFs by using i40e_reset_all_vfs() and
    i40e_vsi_stop_rings_no_wait(). Cleaned up an unnecessary delay since
    pci_disable_sriov() already has its own delay, so there is no need to add
    an additional delay when removing VFs. Avoid using the same flag names for
    both vsi->state and pf->state, to make code review easier and assist future
    work to use the correct state field when checking bits. Use
    DECLARE_BITMAP() to ensure that we always allocate enough space for flags.
    Replace hw_disabled_flags with the new _AUTO_DISABLED flags, which are
    more readable because we are not setting an *_ENABLED flag to
    disable the feature.

    Alex corrects an oversight where we were not reprogramming the ports
    after a reset, which was causing us to lose all of the receive tunnel
    offloads.

    Arnd Bergmann moves the declaration of a local variable to avoid a
    warning seen on architectures with larger pages about an unused variable.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller