07 Jun, 2015

5 commits

  • eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
    For ingress L2 header is already pulled, whereas for egress it's present.
    This is known to program writers, who are currently forced to use
    BPF_LL_OFF workaround.
    Since programs don't change skb internal pointers it is safe to do
    pull/push right around invocation of the program and earlier taps and
    later pt->func() will not be affected.
    Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
    around run_filter/BPF_PROG_RUN even if skb_shared.

    This fix finally allows programs to use optimized LD_ABS/IND instructions
    without BPF_LL_OFF for higher performance.
    tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
               w/o JIT    w/JIT
    before     20.5       23.6 Mpps
    after      21.8       26.6 Mpps

    Old programs with BPF_LL_OFF will still work as-is.

    We can now undo most of the earlier workaround commit:
    a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets")

    Signed-off-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Alexei Starovoitov
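
The push/pull trick described in this commit can be sketched in user space. This is only an illustration under stated assumptions: `toy_skb`, `toy_push`, `toy_pull`, `read_ethertype` and `run_on_ingress` are hypothetical names, not the kernel's sk_buff API, and the "program" is a plain C function rather than an eBPF program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct toy_skb {
    uint8_t *head; /* start of the buffer, where the L2 header lives */
    uint8_t *data; /* current start-of-packet pointer */
    size_t len;    /* bytes from data to the end of the packet */
};

/* Expose n more bytes in front of data (comparable to a push). */
static void toy_push(struct toy_skb *skb, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
}

/* Hide n bytes at the front of data (comparable to a pull). */
static void toy_pull(struct toy_skb *skb, size_t n)
{
    assert(skb->len >= n);
    skb->data += n;
    skb->len -= n;
}

/* A "program" that reads the EtherType at offset 12 from data. */
static uint16_t read_ethertype(const struct toy_skb *skb)
{
    return (uint16_t)((skb->data[12] << 8) | skb->data[13]);
}

/*
 * Ingress case: the L2 header is already pulled, so push it back right
 * before the program runs and pull it again afterwards. The caller sees
 * the same data/len it passed in.
 */
static uint16_t run_on_ingress(struct toy_skb *skb)
{
    uint16_t ret;

    toy_push(skb, TOY_ETH_HLEN);
    ret = read_ethertype(skb);
    toy_pull(skb, TOY_ETH_HLEN);
    return ret;
}
```

Because the pointers are restored before returning, code that runs before or after `run_on_ingress()` observes an unchanged buffer, which is the property the commit message relies on.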
     
  • For the same reasons as in commit 12e25e1041d0 ("tcp: remove redundant
    checks"), we can remove redundant checks done for timewait sockets.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The debug is printing the struct smt_header * address using
    the %x format specifier. Fix it to use %p instead.

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
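
A minimal user-space sketch of the format-specifier point: `%x` expects an `unsigned int`, which does not match a pointer argument (and is too narrow on 64-bit targets), while `%p` is the specifier meant for pointers. The struct and function names below are hypothetical; the driver's real debug macro is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the header type named in the commit message. */
struct toy_smt_header {
    int dummy;
};

/*
 * Format the address of a header with %p. The argument is cast to
 * void * because that is the type %p expects.
 * Returns the value of snprintf().
 */
static int format_header_addr(char *buf, size_t size,
                              struct toy_smt_header *hdr)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "smt header at %p", (void *)hdr);
}
```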
     
  • Fix:
    drivers/net/wan/dscc4.c: In function 'dscc4_open':
    drivers/net/wan/dscc4.c:1049:25: warning: variable 'ppriv' set but not used
    [-Wunused-but-set-variable]

    This has been in there unused since 1da177e4c3f (Linux-2.6.12-rc2); simply
    remove it.

    Signed-off-by: Nicholas Mc Guire
    Signed-off-by: David S. Miller

    Nicholas Mc Guire
     
  • When an application needs to force a source IP on an active TCP socket
    it has to use bind(IP, port=x).

    As most applications do not want to deal with already used ports, x is
    often set to 0, meaning the kernel is in charge of finding an available
    port.
    But the kernel does not know yet if this socket is going to be a listener or
    be connected.
    It has very limited choices (no full knowledge of the final 4-tuple for a
    connect()).

    With limited ephemeral port range (about 32K ports), it is very easy to
    fill the space.

    This patch adds a new SOL_IP socket option, asking kernel to ignore
    the 0 port provided by application in bind(IP, port=0) and only
    remember the given IP address.

    The port will be automatically chosen at connect() time, in a way
    that allows sharing a source port as long as the 4-tuples are unique.

    This new feature is available for both IPv4 and IPv6 (Thanks Neal)

    Tested:

    Wrote a test program and checked its behavior on IPv4 and IPv6.

    strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
    connect().
    Also getsockname() shows that the port is still 0 right after bind()
    but properly allocated after connect().

    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
    setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
    connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

    IPv6 test :

    socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
    setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
    connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0

    I was able to bind()/connect() a million concurrent IPv4 sockets,
    instead of ~32000 before patch.

    lpaa23:~# ulimit -n 1000010
    lpaa23:~# ./bind --connect --num-flows=1000000 &
    1000000 sockets

    lpaa23:~# grep TCP /proc/net/sockstat
    TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

    Check that a given source port is indeed used by many different
    connections :

    lpaa23:~# ss -t src :40000 | head -10
    State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
    ESTAB  0       0       127.0.0.2:40000     127.0.202.33:44983
    ESTAB  0       0       127.0.0.2:40000     127.2.27.240:44983
    ESTAB  0       0       127.0.0.2:40000     127.2.98.5:44983
    ESTAB  0       0       127.0.0.2:40000     127.0.124.196:44983
    ESTAB  0       0       127.0.0.2:40000     127.2.139.38:44983
    ESTAB  0       0       127.0.0.2:40000     127.1.59.80:44983
    ESTAB  0       0       127.0.0.2:40000     127.3.6.228:44983
    ESTAB  0       0       127.0.0.2:40000     127.0.38.53:44983
    ESTAB  0       0       127.0.0.2:40000     127.1.197.10:44983

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
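
The strace sequence in this commit corresponds roughly to the following application code. This is a sketch, not the author's test program: the helper names `bind_addr_no_port` and `local_port` are made up for illustration, `IPPROTO_IP` is used as the option level (it has the same value as `SOL_IP` on Linux), and the fallback definition of `IP_BIND_ADDRESS_NO_PORT` assumes the Linux value 24 for C libraries whose headers predate the option.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24 /* assumed: value used by Linux headers */
#endif

/*
 * Bind fd to src_ip with port 0 after asking the kernel to defer the
 * port choice until connect(). Returns 0 on success, -1 on error.
 */
static int bind_addr_no_port(int fd, const char *src_ip)
{
    struct sockaddr_in src;
    int one = 1;

    assert(src_ip != NULL);
    if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0)
        return -1;

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = htons(0);
    if (inet_pton(AF_INET, src_ip, &src.sin_addr) != 1)
        return -1;
    return bind(fd, (struct sockaddr *)&src, sizeof(src));
}

/* Return the local port of fd in host byte order, or -1 on error. */
static int local_port(int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -1;
    return ntohs(sa.sin_port);
}
```

On a kernel with this option, `local_port()` is expected to report 0 right after `bind_addr_no_port()` and a real port only after a successful `connect()`, matching the getsockname() output shown in the commit message.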
     

06 Jun, 2015

8 commits

  • The patch e85c9a7abfa4: ("cxgb4/cxgb4vf: Add code to calculate T5 BAR2
    Offsets for SGE Queue Registers") from Dec 3, 2014, leads to the
    following static checker warning:

    drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:5358
    t4_bar2_sge_qregs()
    warn: should '(qid >> qpp_shift) << page_shift' be a 64 bit type?

    This patch fixes it

    Signed-off-by: Hariprasad Shenai
    Signed-off-by: David S. Miller

    Hariprasad Shenai
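
The static checker warning is about a shift that is computed in 32 bits and only then widened. A small sketch of the difference follows; the function names are hypothetical and this is one common way to address such a warning, not necessarily the change made in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Shift done in 32-bit arithmetic: high bits are lost before widening. */
static uint64_t bar2_offset_32(uint32_t qid, unsigned qpp_shift,
                               unsigned page_shift)
{
    assert(qpp_shift < 32 && page_shift < 32);
    return (qid >> qpp_shift) << page_shift;
}

/* Widen to 64 bits first, then shift, so the high bits survive. */
static uint64_t bar2_offset_64(uint32_t qid, unsigned qpp_shift,
                               unsigned page_shift)
{
    assert(qpp_shift < 32 && page_shift < 64);
    return (uint64_t)(qid >> qpp_shift) << page_shift;
}
```

With `qid = 0x00100000`, `qpp_shift = 4` and `page_shift = 16`, the 32-bit variant yields 0 while the 64-bit variant yields 0x100000000.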
     
  • Hariprasad Shenai says:

    ====================
    Free VI, flush sge ec and some other misc. fixes

    This patch series adds the following.
    Free VI interface during remove, flush SGE ec routine, rename
    t4_link_start to t4_link_l1cfg since it only does l1 configuration, set
    mac addr from VPD when we can't contact firmware (for debug purposes), set pcie
    completion timeout and use fw interface to access TP_PIO_XXX registers.

    This patch series has been created against net-next tree and includes
    patches on cxgb4 driver.

    We have included all the maintainers of respective drivers. Kindly review
    the change and let us know in case of any review comments.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The TP_PIO_{ADDR,DATA} registers are in conflict with the firmware's
    use of these registers. Added a routine to access them through the FW LDST
    cmd.
    Route all TP_PIO_{ADDR,DATA} register accesses through the new routine if FW
    is alive. If firmware is dead, then fall back to indirect access.

    Signed-off-by: Hariprasad Shenai
    Signed-off-by: David S. Miller

    Hariprasad Shenai
     
  • Set pci completion timeout to 0xd.

    Signed-off-by: Hariprasad Shenai
    Signed-off-by: David S. Miller

    Hariprasad Shenai
     
  • Grab the Adapter MAC Address out of the VPD and use it for the "debug"
    network interface when either we can't contact the firmware

    Signed-off-by: Hariprasad Shenai
    Signed-off-by: David S. Miller

    Hariprasad Shenai
     
  • t4_link_start() was completely misnamed. It does _not_ start up the
    link. It merely does the L1 Configuration for the link. The Link Up
    process is started automatically by the firmware when the number of
    enabled Virtual Interfaces on a port goes from 0 to 1. So renaming
    this routine to t4_link_l1cfg() for better documentation.

    Signed-off-by: Hariprasad Shenai
    Signed-off-by: David S. Miller

    Hariprasad Shenai
     
  • Add function to flush the sge ec context cache, and utilize
    this new function in the driver

    Signed-off-by: Hariprasad Shenai
    Signed-off-by: David S. Miller

    Hariprasad Shenai
     
  • Free VI interfaces in remove routine. If we don't do this then the
    firmware will never drop the physical link to the peer.

    Signed-off-by: Hariprasad Shenai
    Signed-off-by: David S. Miller

    Hariprasad Shenai
     

05 Jun, 2015

27 commits

  • Or Gerlitz says:

    ====================
    mlx5: Add Interface Step Sequence ID support

    ISSI (Interface Step Sequence ID) defines the step sequence ID of the
    interface between the driver and the firmware and is incremented by
    steps of one. ISSI is used to enable deprecating/modifying features,
    command interfaces and such, while maintaining compatibility.

    As the driver serves both ConnectIB (CIB) and ConnectX4, we carefully
    made sure that the IB functionality keeps running also on older CIB
    firmware releases that don't support ISSI.

    The Ethernet functionality is available only on ConnectX4 where all
    firmware releases support the feature since the very basic ISSI level.
    So at this point there is no need for compatibility code there.

    As done prior to this series, when the Ethernet functionality is enabled,
    during the initialization flow, the core driver performs a query of the
    supported ISSIs using the QUERY_ISSI command, and then, if ISSI is supported,
    sets the actual issi value informing the firmware on which ISSI level to run,
    using SET_ISSI command.

    Previously, the IB driver wasn't ready to work in that mode, and hence
    building both the IB driver and the Ethernet functionality in the core
    driver was disallowed by Kconfigs. With this series, we allow users to
    enable them both.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Ethernet functionality is only available when working in ISSI > 0 mode.

    Previously, the IB driver wasn't ready to work in that mode, and hence
    building both the IB driver and the Ethernet functionality in the core
    driver was disallowed by Kconfigs.

    Now, once we have all the pre-steps in place, we can remove this limitation.

    The last steps in the IB driver for getting that setup to work are:
    create dummy SRQ for the driver's use (until now we could use XRC_SRQ
    as SRQ and XRC_SRQ, after moving to ISSI > 0, we separate XRC SRQs from
    basic SRQs) and adapt the create QP function to be compatible with ISSI > 0.

    Signed-off-by: Haggai Abramovsky
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Haggai Abramonvsky
     
  • Since we still don't have RoCE support in mlx5, avoid
    creating IB driver instance over Ethernet ports.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • In ISSI > 0 mode, most of the MAD_IFC command features are deprecated, and can't
    be used. Therefore, when in that mode, we replace all of them with other commands
    that provide the required functionality.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • Add the following helpers:

    1. mlx5_query_port_proto_oper -- queries the port speed port mask
    2. mlx5_query_port_link_width_oper -- queries the port link width bitmask
    3. mlx5_query_port_vl_hw_cap - queries the Virtual Lanes supported on this port

    These helpers will be used from the IB driver when working in ISSI > 0 mode.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • Until now, mlx5_query_port_ptys always queried port number one.

    Added new argument in the function's prototype so we can also query
    the second port. This will be needed when the helper is invoked
    from the IB driver on non FPP (Function-Per-Port) devices.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • Extend the function prototypes for max and operational mtu to take the
    local port number. In the Ethernet driver this is hard coded to one,
    since ConnectX4 Ethernet devices are always function-per-port.
    The IB driver also serves older devices (ConnectIB) which aren't,
    and hence the port can vary.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • Add two wrapper functions to the query adapter command:

    1. mlx5_query_board_id -- replaces the old mlx5_cmd_query_adapter.

    2. mlx5_core_query_vendor_id -- retrieves the vendor_id from the
    query_adapter command.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • Added the implementation for the following commands:

    1. QUERY_HCA_VPORT_GID
    2. QUERY_HCA_VPORT_PKEY
    3. QUERY_HCA_VPORT_CONTEXT

    They will be needed when we move to work with ISSI > 0 in the IB driver too.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • Move the vport header file to be under include/linux/mlx5, such that
    the mlx5 IB can use it as well.

    Also add nic_ prefix to the vport NIC commands to differentiate between
    HCA vport commands and NIC vport commands.

    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Majd Dibbiny
     
  • The determination of the supported ISSI versions should be conditioned
    on the returned mask, and not only on the return status of the query
    ISSI command, fix that.

    Signed-off-by: Haggai Abramovsky
    Signed-off-by: Majd Dibbiny
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Haggai Abramonvsky
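
The condition described in this commit (consult the returned mask, not only the command status) can be sketched as below. Everything here is an assumption made for illustration: the function name, the meaning of `status` (0 on success) and the mask layout (bit n set means ISSI level n is supported) are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide whether a wanted ISSI level can be used.
 * status:   return code of a hypothetical query command, 0 on success
 * sup_mask: assumed bitmask of supported levels, bit n = level n
 * Returns 1 if the level is usable, 0 otherwise.
 */
static int issi_supported(int status, uint32_t sup_mask, unsigned level)
{
    assert(level < 32);
    if (status != 0)
        return 0;                     /* the query itself failed */
    return (sup_mask >> level) & 1u;  /* query succeeded: check the mask */
}
```

A successful query whose mask lacks the wanted bit is therefore treated as "not supported", which a status-only check would miss.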
     
  • When working in ISSI > 0 mode, the model exposed by the device for
    XRCs and SRQs is different. XRCs use XRC SRQs and plain SRQs are based
    on RMP (Receive Memory Pool).

    Add helper functions to create, modify, query, and arm XRC SRQs and RMPs.

    Signed-off-by: Haggai Abramovsky
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Haggai Abramonvsky
     
  • Some core helper functions were named with mlx5_ only prefix, fix that to
    mlx5_core_ so we're aligned with the overall scheme used for core services.

    Signed-off-by: Haggai Abramovsky
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Haggai Abramonvsky
     
  • The patch afb736e9330a: "net/mlx5: Ethernet resource handling files"
    from May 28, 2015, leads to the following static checker warning:

    drivers/net/ethernet/mellanox/mlx5/core/en_flow_table.c:726 mlx5e_create_main_flow_table()
    error: potential null dereference 'g'. (kcalloc returns null)

    Fixes: afb736e9330a ("net/mlx5: Ethernet resource handling files")
    Reported-by: Dan Carpenter
    Signed-off-by: Amir Vadai
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Amir Vadai
     
  • Tom Herbert says:

    ====================
    net: Increase inputs to flow_keys hashing

    This patch set adds new fields to the flow_keys structure and hashes
    over these fields to get a better flow hash. In particular, these
    patches now include hashing over the full IPv6 addresses in order
    to defend against address spoofing that always results in the
    same hash. The new input also includes the Ethertype, L4 protocol,
    VLAN, flow label, GRE keyid, and MPLS entropy label.

    In order to increase hash inputs, we switch to using jhash2
    which operates on an array of u32's. jhash2 operates on multiples of
    three words. The data in the hash is constructed for that, and there
    are two variants for IPv4 and IPv6 addressing. For IPv4 addresses,
    jhash is performed over six u32's and for IPv6 it is done over twelve.

    flow_keys can store either IPv4 or IPv6 addresses (addr_proto field
    is a selector). ipv6_addr_hash is no longer used to convert addresses
    for setting in flow table. For legacy uses of flow keys outside of
    flow_dissector the flow_get_u32_src and flow_get_u32_dst functions
    have been added to get u32 representations of addresses
    in flow_keys.

    For flow labels we also eliminate the short circuit in flow_dissector
    for non-zero flow label. The flow label is now considered additional
    input to ports.

    Testing: Ran netperf TCP_RR for 200 flows using IPv4 and IPv6 comparing
    before the patches and with the patches. Did not detect any performance
    degradation.

    v2:
    - Took out MPLS entropy label. Will add this later.
    v3:
    - Ensure hash start offset is a four byte boundary. Add BUILD_BUG_ON
    to check for this.
    - Fixes sparse error in GRE to get entropy from keyid.
    v4:
    - Rebase to Jiri changes to generalize flow dissection
    - Support TIPC as its own address
    - Bring back MPLS entropy label dissection
    - Remove FLOW_DISSECTOR_KEY_IPV6_HASH_ADDRS

    v5:
    - Minor fixes from feedback

    v6:
    - Cleanup and sparse issue with flow label
    - Change keyid returned by flow_dissector to be __be32
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In flow dissector if an MPLS header contains an entropy label this is
    saved in the new keyid field of flow_keys. The entropy label is
    then represented in the flow hash function input.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • In flow dissector if a GRE header contains a keyid this is saved in the
    new keyid field of flow_keys. The GRE keyid is then represented
    in the flow hash function input.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • In flow_dissector set the flow label in flow_keys for IPv6. This also
    removes the short-circuiting of flow dissection when a non-zero label
    is present; the flow label can be considered to provide additional
    entropy for a hash.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • In flow_dissector set vlan_id in flow_keys when VLAN is found.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • We don't need to return the IPv6 address hash as part of flow keys.
    In general, using the IPv6 address hash is risky in a hash value
    since the underlying use of xor provides no entropy. If someone
    really needs the hash value they can get it from the full IPv6
    addresses in flow keys (e.g. from flow_get_u32_src).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add a new flow key for TIPC addresses.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • This patch adds full IPv6 addresses into flow_keys and uses them as
    input to the flow hash function. The implementation supports either
    IPv4 or IPv6 addresses in a union, and selector is used to determine
    how many words to input to jhash2.

    We also add flow_get_u32_dst and flow_get_u32_src functions which are
    used to get a u32 representation of the source and destination
    addresses. For IPv6, ipv6_addr_hash is called. These functions retain
    getting the legacy values of src and dst in flow_keys.

    With this patch, Ethertype and IP protocol are now included in the
    flow hash input.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
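
The union-plus-selector idea can be sketched as follows. The layout is hypothetical (four fixed words followed by the address union), chosen only so that the word counts come out as six for IPv4 and twelve for IPv6 as stated in the cover letter; the mixing function is a simple FNV-1a-style stand-in over words, not jhash2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_addr_proto { TOY_ADDR_IPV4, TOY_ADDR_IPV6 };

/* Hypothetical layout, not the kernel's struct flow_keys. */
struct toy_flow_keys {
    enum toy_addr_proto addr_proto; /* selector, not hashed itself */
    uint32_t fixed[4];              /* e.g. protocol info, ports, labels */
    union {
        struct { uint32_t src, dst; } v4;       /* 2 words */
        struct { uint32_t src[4], dst[4]; } v6; /* 8 words */
    } addrs;
};

/* Number of u32 words fed to the hash for a given selector. */
static size_t toy_hash_words(enum toy_addr_proto p)
{
    return p == TOY_ADDR_IPV4 ? 6 : 12;
}

/* Stand-in mixer over 32-bit words (FNV-1a style); not jhash2. */
static uint32_t toy_mix(const uint32_t *w, size_t n)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < n; i++)
        h = (h ^ w[i]) * 16777619u;
    return h;
}

/* Gather the words selected by addr_proto and hash them. */
static uint32_t toy_flow_hash(const struct toy_flow_keys *k)
{
    uint32_t w[12];
    size_t n = toy_hash_words(k->addr_proto);
    size_t i;

    assert(n <= 12);
    for (i = 0; i < 4; i++)
        w[i] = k->fixed[i];
    if (k->addr_proto == TOY_ADDR_IPV4) {
        w[4] = k->addrs.v4.src;
        w[5] = k->addrs.v4.dst;
    } else {
        for (i = 0; i < 4; i++) {
            w[4 + i] = k->addrs.v6.src[i];
            w[8 + i] = k->addrs.v6.dst[i];
        }
    }
    return toy_mix(w, n);
}
```

Since every address word is part of the input, two IPv6 flows that differ in a single address word hash differently in this sketch, unlike with an xor fold of the address.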
     
  • This patch changes flow hashing to use jhash2 over the flow_keys
    structure instead of just doing jhash_3words over src, dst, and ports.
    This method will allow us to take more input into the hashing function
    so that we can include full IPv6 addresses, VLAN, flow labels etc.
    without needing to resort to xor'ing which makes for a poor hash.

    Acked-by: Jiri Pirko
    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • key_basic is set twice in __skb_flow_dissect which seems unnecessary.
    Remove second one.

    Acked-by: Jiri Pirko
    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add uapi define for MPLS over IP.

    Acked-by: Jiri Pirko
    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Do break when we see routing flag or a non-zero version number in GRE
    header.

    Acked-by: Jiri Pirko
    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • According to false is always '0' and
    Static variables are initialised to 0 by GCC.

    Signed-off-by: Shailendra Verma
    Signed-off-by: David S. Miller

    Shailendra Verma
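
A short illustration of why explicit `= false` or `= 0` initializers on static variables are redundant: objects with static storage duration start out zeroed, which the C language itself guarantees regardless of compiler. The variable and function names are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_flag;   /* no initializer: starts out false */
static int toy_counter; /* no initializer: starts out 0 */

/* Verify that both statics hold their implicit initial values. */
static void toy_check_defaults(void)
{
    assert(toy_flag == false);
    assert(toy_counter == 0);
}
```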