07 Jun, 2015
5 commits
-
eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
For ingress L2 header is already pulled, whereas for egress it's present.
This is known to program writers which are currently forced to use
BPF_LL_OFF workaround.
Since programs don't change skb internal pointers it is safe to do
pull/push right around invocation of the program and earlier taps and
later pt->func() will not be affected.
Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
around run_filter/BPF_PROG_RUN even if skb_shared.

This fix finally allows programs to use optimized LD_ABS/IND instructions
without BPF_LL_OFF for higher performance.
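
The pull/push idea described above can be modelled in userspace with a minimal C sketch. This is not kernel code: struct toy_skb and every toy_* function are hypothetical stand-ins invented for this illustration. Only the two pointer adjustments around the program call mirror what the commit message describes.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_HLEN 14 /* Ethernet header length without VLAN tags */

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct toy_skb {
    const uint8_t *head; /* start of the frame buffer (L2 header) */
    size_t data_off;     /* offset of "data" from head */
    size_t len;          /* bytes visible from data onwards */
};

/* A "program" only reads the packet; it never moves the pointers. */
typedef uint16_t (*toy_prog_t)(const struct toy_skb *skb);

/* Expose n more bytes in front of data (models skb_push()). */
static void toy_push(struct toy_skb *skb, size_t n)
{
    skb->data_off -= n;
    skb->len += n;
}

/* Hide n bytes at the front of data (models skb_pull()). */
static void toy_pull(struct toy_skb *skb, size_t n)
{
    skb->data_off += n;
    skb->len -= n;
}

/* Reads the EtherType at a fixed offset from data, with no LL_OFF fixup. */
static uint16_t toy_read_ethertype(const struct toy_skb *skb)
{
    const uint8_t *p = skb->head + skb->data_off;

    return (uint16_t)((p[12] << 8) | p[13]);
}

/* Ingress: data points at L3, so push the L2 header, run, then pull back. */
static uint16_t toy_run_ingress(struct toy_skb *skb, toy_prog_t prog)
{
    uint16_t ret;

    toy_push(skb, TOY_ETH_HLEN);
    ret = prog(skb);
    toy_pull(skb, TOY_ETH_HLEN);
    return ret;
}

/* Egress: data already points at L2, so the program runs unchanged. */
static uint16_t toy_run_egress(struct toy_skb *skb, toy_prog_t prog)
{
    return prog(skb);
}
```

In this model the same toy_read_ethertype() works on both paths, and the skb offsets are back to their original values once toy_run_ingress() returns, so later consumers see the packet as before.
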
tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
          w/o JIT   w/JIT
before    20.5      23.6 Mpps
after     21.8      26.6 Mpps

Old programs with BPF_LL_OFF will still work as-is.
We can now undo most of the earlier workaround commit:
a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets")

Signed-off-by: Alexei Starovoitov
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller
-
For the same reasons as in commit 12e25e1041d0 ("tcp: remove redundant
checks"), we can remove redundant checks done for timewait sockets.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
The debug is printing the struct smt_header * address using
the %x format specifier. Fix it to use %p instead.

Signed-off-by: Colin Ian King
Signed-off-by: David S. Miller
-
Fix:
drivers/net/wan/dscc4.c: In function 'dscc4_open':
drivers/net/wan/dscc4.c:1049:25: warning: variable 'ppriv' set but not used
[-Wunused-but-set-variable]

This has been in there unused since 1da177e4c3f (Linux-2.6.12-rc2), so simply
remove it.

Signed-off-by: Nicholas Mc Guire
Signed-off-by: David S. Miller
-
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).

As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge to find an available
port.
But kernel does not know yet if this socket is going to be a listener or
be connected.
It has very limited choices (no full knowledge of final 4-tuple for a
connect())

With limited ephemeral port range (about 32K ports), it is very easy to
fill the space.

This patch adds a new SOL_IP socket option, asking kernel to ignore
the 0 port provided by application in bind(IP, port=0) and only
remember the given IP address.

The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.

This new feature is available for both IPv4 and IPv6 (Thanks Neal)

Tested:
Wrote a test program and checked its behavior on IPv4 and IPv6.
strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also getsockname() shows that the port is still 0 right after bind()
but properly allocated after connect().

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

IPv6 test :
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0

I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before patch.

lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets

lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

Check that a given source port is indeed used by many different
connections :

lpaa23:~# ss -t src :40000 | head -10
State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB  0       0       127.0.0.2:40000      127.0.202.33:44983
ESTAB  0       0       127.0.0.2:40000      127.2.27.240:44983
ESTAB  0       0       127.0.0.2:40000      127.2.98.5:44983
ESTAB  0       0       127.0.0.2:40000      127.0.124.196:44983
ESTAB  0       0       127.0.0.2:40000      127.2.139.38:44983
ESTAB  0       0       127.0.0.2:40000      127.1.59.80:44983
ESTAB  0       0       127.0.0.2:40000      127.3.6.228:44983
ESTAB  0       0       127.0.0.2:40000      127.0.38.53:44983
ESTAB  0       0       127.0.0.2:40000      127.1.197.10:44983

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
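
For readers who want to try the option from the commit above, here is a hedged C sketch of the bind() step shown in the strace output. The helper names bind_source_no_port() and local_port() are made up for this example. The fallback #define uses the value 24 from linux/in.h for toolchains whose headers predate the option, and IPPROTO_IP is used as the level because it has the same value as SOL_IP on Linux.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24 /* from linux/in.h; older libc headers lack it */
#endif

/*
 * Create a TCP socket bound to src_ip with port 0, asking the kernel to
 * postpone source port selection until connect().
 * Returns the socket fd, or -1 on any error.
 */
static int bind_source_no_port(const char *src_ip)
{
    struct sockaddr_in src;
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = htons(0);
    if (inet_pton(AF_INET, src_ip, &src.sin_addr) != 1 ||
        setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0 ||
        bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Return the local port of fd in host byte order, or -1 on error. */
static int local_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    return ntohs(addr.sin_port);
}
```

On a kernel that implements the option, local_port() is expected to report 0 right after bind_source_no_port() and a real port only after a later connect() on the same fd, matching the getsockname() calls in the trace. Kernels without the option should make setsockopt() fail, in which case the helper returns -1.
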
06 Jun, 2015
8 commits
-
The patch e85c9a7abfa4: ("cxgb4/cxgb4vf: Add code to calculate T5 BAR2
Offsets for SGE Queue Registers") from Dec 3, 2014, leads to the
following static checker warning:

drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:5358
t4_bar2_sge_qregs()
warn: should '(qid >> qpp_shift) << page_shift' be a 64 bit type?

This patch fixes it.
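
The class of bug behind this warning can be shown with a small standalone C sketch. The two function names are hypothetical and this is not the driver's code. It only contrasts a shift evaluated in 32 bits with one evaluated after widening to 64 bits, which is presumably the kind of change the warning asks for.

```c
#include <stdint.h>

/* Shift evaluated in 32 bits: high bits are lost before the widening. */
static uint64_t bar2_offset_truncating(uint32_t qid, unsigned int qpp_shift,
                                       unsigned int page_shift)
{
    return (qid >> qpp_shift) << page_shift;
}

/* Widen to 64 bits first, then shift: the full offset survives. */
static uint64_t bar2_offset_wide(uint32_t qid, unsigned int qpp_shift,
                                 unsigned int page_shift)
{
    return (uint64_t)(qid >> qpp_shift) << page_shift;
}
```

With qid = 0x00300001, qpp_shift = 0 and page_shift = 12, the first variant yields 0x1000 while the second yields 0x300001000. For small inputs the two agree, which is why such truncation can go unnoticed in testing.
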
Signed-off-by: Hariprasad Shenai
Signed-off-by: David S. Miller
-
Hariprasad Shenai says:
====================
Free VI, flush sge ec and some other misc. fixes

This patch series adds the following.
Free VI interface during remove, flush SGE ec routine, rename
t4_link_start to t4_link_l1cfg since it only does l1 configuration, set
mac addr from VPD when we can't contact firmware, for debug purposes, set pcie
completion timeout and use fw interface to access TP_PIO_XXX registers

This patch series has been created against net-next tree and includes
patches on cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller
-
The TP_PIO_{ADDR,DATA} registers are in conflict with the firmware's
use of these registers. Added a routine to access them through the FW LDST
cmd.
Access all TP_PIO_{ADDR,DATA} registers through the new routine if FW
is alive. If firmware is dead, then fall back to indirect access.

Signed-off-by: Hariprasad Shenai
Signed-off-by: David S. Miller
-
Set pci completion timeout to 0xd.
Signed-off-by: Hariprasad Shenai
Signed-off-by: David S. Miller
-
Grab the Adapter MAC Address out of the VPD and use it for the "debug"
network interface when we can't contact the firmware.

Signed-off-by: Hariprasad Shenai
Signed-off-by: David S. Miller
-
t4_link_start() was completely misnamed. It does _not_ start up the
link. It merely does the L1 Configuration for the link. The Link Up
process is started automatically by the firmware when the number of
enabled Virtual Interfaces on a port goes from 0 to 1. So renaming
this routine to t4_link_l1cfg() for better documentation.

Signed-off-by: Hariprasad Shenai
Signed-off-by: David S. Miller
-
Add function to flush the sge ec context cache, and utilize
this new function in the driver.

Signed-off-by: Hariprasad Shenai
Signed-off-by: David S. Miller
-
Free VI interfaces in remove routine. If we don't do this then the
firmware will never drop the physical link to the peer.

Signed-off-by: Hariprasad Shenai
Signed-off-by: David S. Miller
05 Jun, 2015
27 commits
-
Or Gerlitz says:
====================
mlx5: Add Interface Step Sequence ID support

ISSI (Interface Step Sequence ID) defines the step sequence ID of the
interface between the driver and the firmware and is incremented by
steps of one. ISSI is used to enable deprecating/modifying features,
command interfaces and such, while maintaining compatibility.

As the driver serves both ConnectIB (CIB) and ConnectX4, we carefully
made sure that the IB functionality keeps running also on older CIB
firmware releases that don't support ISSI.

The Ethernet functionality is available only on ConnectX4 where all
firmware releases support the feature since the very basic ISSI level.
So at this point no need for compatibility code there.

As done prior to this series, when the Ethernet functionality is enabled,
during the initialization flow, the core driver performs a query of the
supported ISSIs using the QUERY_ISSI command, and then, if ISSI is supported,
sets the actual issi value informing the firmware on which ISSI level to run,
using SET_ISSI command.

Previously, the IB driver wasn't ready to work in that mode, and hence
building both the IB driver and the Ethernet functionality in the core
driver was disallowed by Kconfigs. With this series, we allow users to
enable them both.
====================

Signed-off-by: David S. Miller
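
The query-then-set flow described in the cover letter above can be sketched as a small selection function in C. Everything here is hypothetical and invented for this illustration: the name toy_pick_issi(), the choice to fall back to level 0 when the query itself fails (modelling firmware that predates ISSI), and the -1 return when there is no common level. It is not the mlx5 driver's implementation.

```c
#include <stdint.h>

/*
 * Pick the ISSI level to run at.
 *   query_status:   0 if the (hypothetical) query succeeded, nonzero otherwise
 *   supported_mask: bit n set means the firmware supports ISSI level n
 *   driver_max:     highest ISSI level this driver implements
 * Returns the chosen level, or -1 if driver and firmware share no level.
 */
static int toy_pick_issi(int query_status, uint32_t supported_mask,
                         int driver_max)
{
    int level;

    /* Assumption: a failed query means pre-ISSI firmware, so stay at 0. */
    if (query_status != 0)
        return 0;

    if (driver_max > 31)
        driver_max = 31; /* the mask only has 32 bits */

    /* A successful query is not enough on its own; the mask decides. */
    for (level = driver_max; level >= 0; level--) {
        if (supported_mask & (UINT32_C(1) << level))
            return level;
    }
    return -1;
}
```

The caller would then pass the chosen level to whatever command sets the ISSI, and skip that command entirely when the result is 0 because the query was unsupported.
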
-
Ethernet functionality is only available when working in ISSI > 0 mode.
Previously, the IB driver wasn't ready to work in that mode, and hence
building both the IB driver and the Ethernet functionality in the core
driver was disallowed by Kconfigs.

Now, once we have all the pre-steps in place, we can remove this limitation.
The last steps in the IB driver for getting that setup to work are:
create dummy SRQ for the driver's use (until now we could use XRC_SRQ
as SRQ and XRC_SRQ, after moving to ISSI > 0, we separate XRC SRQs from
basic SRQs) and adapt the create QP function to be compatible with ISSI > 0.

Signed-off-by: Haggai Abramovsky
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Since we still don't have RoCE support in mlx5, avoid
creating IB driver instance over Ethernet ports.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
In ISSI > 0 mode, most of the MAD_IFC command features are deprecated, and can't
be used. Therefore, when in that mode, we replace all of them with other commands
that provide the required functionality.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Add the following helpers:
1. mlx5_query_port_proto_oper -- queries the port speed port mask
2. mlx5_query_port_link_width_oper -- queries the port link width bitmask
3. mlx5_query_port_vl_hw_cap -- queries the Virtual Lanes supported on this port

These helpers will be used from the IB driver when working in ISSI > 0 mode.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Until now, mlx5_query_port_ptys always queried port number one.
Added new argument in the function's prototype so we can also query
the second port. This will be needed when the helper is invoked
from the IB driver on non FPP (Function-Per-Port) devices.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Extend the function prototypes for max and operational mtu to take the
local port number. In the Ethernet driver this is hard coded to one,
since ConnectX4 Ethernet devices are always function-per-port.
The IB driver also serves older devices (ConnectIB) which aren't such,
and hence the port can vary.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Add two wrapper functions to the query adapter command:
1. mlx5_query_board_id -- replaces the old mlx5_cmd_query_adapter.
2. mlx5_core_query_vendor_id -- retrieves the vendor_id from the
query_adapter command.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Added the implementation for the following commands:
1. QUERY_HCA_VPORT_GID
2. QUERY_HCA_VPORT_PKEY
3. QUERY_HCA_VPORT_CONTEXT

They will be needed when we move to work with ISSI > 0 in the IB driver too.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Move the vport header file to be under include/linux/mlx5, such that
the mlx5 IB can use it as well.

Also add nic_ prefix to the vport NIC commands to differentiate between
HCA vport commands and NIC vport commands.

Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
The determination of the supported ISSI versions should be conditioned
on the returned mask, and not only on the return status of the query
ISSI command, fix that.

Signed-off-by: Haggai Abramovsky
Signed-off-by: Majd Dibbiny
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
When working in ISSI > 0 mode, the model exposed by the device for
XRCs and SRQs is different. XRCs use XRC SRQs and plain SRQs are based
on RMP (Receive Memory Pool).

Add helper functions to create, modify, query, and arm XRC SRQs and RMPs.

Signed-off-by: Haggai Abramovsky
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Some core helper functions were named with mlx5_ only prefix, fix that to
mlx5_core_ so we're aligned with the overall scheme used for core services.

Signed-off-by: Haggai Abramovsky
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
The patch afb736e9330a: "net/mlx5: Ethernet resource handling files"
from May 28, 2015, leads to the following static checker warning:

drivers/net/ethernet/mellanox/mlx5/core/en_flow_table.c:726 mlx5e_create_main_flow_table()
error: potential null dereference 'g'. (kcalloc returns null)

Fixes: afb736e9330a ("net/mlx5: Ethernet resource handling files")
Reported-by: Dan Carpenter
Signed-off-by: Amir Vadai
Signed-off-by: Or Gerlitz
Signed-off-by: David S. Miller
-
Tom Herbert says:
====================
net: Increase inputs to flow_keys hashing

This patch set adds new fields to the flow_keys structure and hashes
over these fields to get a better flow hash. In particular, these
patches now include hashing over the full IPv6 addresses in order
to defend against address spoofing that always results in the
same hash. The new input also includes the Ethertype, L4 protocol,
VLAN, flow label, GRE keyid, and MPLS entropy label.

In order to increase hash inputs, we switch to using jhash2
which operates on an array of u32's. jhash2 operates on multiples of
three words. The data in the hash is constructed for that, and there
are two variants for IPv4 and IPv6 addressing. For IPv4 addresses,
jhash is performed over six u32's and for IPv6 it is done over twelve.

flow_keys can store either IPv4 or IPv6 addresses (addr_proto field
is a selector). ipv6_addr_hash is no longer used to convert addresses
for setting in flow table. For legacy uses of flow keys outside of
flow_dissector the flow_get_u32_src and flow_get_u32_dst functions
have been added to get u32 representations of addresses
in flow_keys.

For flow labels we also eliminate the short circuit in flow_dissector
for non-zero flow label. The flow label is now considered additional
input to ports.

Testing: Ran netperf TCP_RR for 200 flows using IPv4 and IPv6 comparing
before the patches and with the patches. Did not detect any performance
degradation.

v2:
- Took out MPLS entropy label. Will add this later.
v3:
- Ensure hash start offset is a four byte boundary. Add BUILD_BUG_ON
to check for this.
- Fixes sparse error in GRE to get entropy from keyid.
v4:
- Rebase to Jiri changes to generalize flow dissection
- Support TIPC as its own address
- Bring back MPLS entropy label dissection
- Remove FLOW_DISSECTOR_KEY_IPV6_HASH_ADDRS
v5:
- Minor fixes from feedback
v6:
- Cleanup and sparse issue with flow label
- Change keyid returned by flow_dissector to be __be32
====================

Signed-off-by: David S. Miller
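
The cover letter's concern about XOR-based address hashing can be illustrated with a short C sketch. Both functions are hypothetical toys: toy_xor_fold() folds the four 32-bit words of an IPv6 address with XOR, and toy_full_hash() uses FNV-1a purely as a simple stand-in for a hash that consumes every word. The series itself uses jhash2, which is not reproduced here.

```c
#include <stdint.h>

/* XOR-fold the four 32-bit words of an IPv6 address into one u32. */
static uint32_t toy_xor_fold(const uint32_t addr[4])
{
    return addr[0] ^ addr[1] ^ addr[2] ^ addr[3];
}

/*
 * FNV-1a over every byte of the four words (low byte of each word first).
 * Only a stand-in for "hash the full address"; this is not jhash2.
 */
static uint32_t toy_full_hash(const uint32_t addr[4])
{
    uint32_t h = 2166136261u; /* FNV-1a 32-bit offset basis */
    int i, b;

    for (i = 0; i < 4; i++) {
        for (b = 0; b < 4; b++) {
            h ^= (addr[i] >> (8 * b)) & 0xffu;
            h *= 16777619u; /* FNV-1a 32-bit prime */
        }
    }
    return h;
}
```

Two addresses whose words are {0x20010db8, 0, 0, 1} and {0x20010db8, 0, 1, 0} fold to the same XOR value, because XOR ignores which word a bit came from, while the byte-wise hash tells them apart. An attacker who controls address bits can generate such collisions at will against an XOR fold.
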
-
In flow dissector if an MPLS header contains an entropy label this is
saved in the new keyid field of flow_keys. The entropy label is
then represented in the flow hash function input.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
In flow dissector if a GRE header contains a keyid this is saved in the
new keyid field of flow_keys. The GRE keyid is then represented
in the flow hash function input.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
In flow_dissector set the flow label in flow_keys for IPv6. This also
removes the short-circuiting of flow dissection when a non-zero label
is present; the flow label can be considered to provide additional
entropy for a hash.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
In flow_dissector set vlan_id in flow_keys when VLAN is found.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
We don't need to return the IPv6 address hash as part of flow keys.
In general, using the IPv6 address hash is risky in a hash value
since the underlying use of xor provides no entropy. If someone
really needs the hash value they can get it from the full IPv6
addresses in flow keys (e.g. from flow_get_u32_src).

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Add a new flow key for TIPC addresses.
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
This patch adds full IPv6 addresses into flow_keys and uses them as
input to the flow hash function. The implementation supports either
IPv4 or IPv6 addresses in a union, and selector is used to determine
how many words to input to jhash2.

We also add flow_get_u32_dst and flow_get_u32_src functions which are
used to get a u32 representation of the source and destination
addresses. For IPv6, ipv6_addr_hash is called. These functions retain
getting the legacy values of src and dst in flow_keys.

With this patch, Ethertype and IP protocol are now included in the
flow hash input.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
This patch changes flow hashing to use jhash2 over the flow_keys
structure instead of just doing jhash_3words over src, dst, and ports.
This method will allow us to take more input into the hashing function
so that we can include full IPv6 addresses, VLAN, flow labels etc.
without needing to resort to xor'ing which makes for a poor hash.

Acked-by: Jiri Pirko
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
key_basic is set twice in __skb_flow_dissect which seems unnecessary.
Remove second one.

Acked-by: Jiri Pirko
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Add uapi define for MPLS over IP.
Acked-by: Jiri Pirko
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Do break when we see routing flag or a non-zero version number in GRE
header.

Acked-by: Jiri Pirko
Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
According to false is always '0' and
Static variables are initialised to 0 by GCC.

Signed-off-by: Shailendra Verma
Signed-off-by: David S. Miller