04 May, 2016
4 commits
-
During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only been stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.

This commit adds this functionality.
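To make the idea concrete, here is a minimal userspace sketch of caching a 16-bit capability bit map in a per-socket structure. The names (toy_node, toy_sock, TOY_CAP_*) are hypothetical and are not TIPC's identifiers or bit assignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; the real bit assignments are not shown here. */
#define TOY_CAP_FEATURE_A (1u << 0)
#define TOY_CAP_FEATURE_B (1u << 1)

struct toy_node { uint16_t capabilities; };  /* per-peer copy, learned at discovery */
struct toy_sock { uint16_t peer_caps; };     /* per-socket copy */

/* Copy the peer node's capability bit map into the socket. */
void toy_sock_cache_caps(struct toy_sock *sk, const struct toy_node *peer)
{
	assert(sk != NULL && peer != NULL);
	sk->peer_caps = peer->capabilities;
}

/* Test one capability bit using only the socket's cached copy. */
bool toy_sock_peer_supports(const struct toy_sock *sk, uint16_t cap)
{
	assert(sk != NULL);
	return (sk->peer_caps & cap) != 0;
}
```

With the copy in place, per-socket code can test a capability without looking up the node structure again.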
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
-
In the refactoring commit d570d86497ee ("tipc: enqueue arrived buffers
in socket in separate function") we accidentally replaced the test

if (sk->sk_backlog.len == 0)
        atomic_set(&tsk->dupl_rcvcnt, 0);

with

if (sk->sk_backlog.len)
        atomic_set(&tsk->dupl_rcvcnt, 0);

This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but the compensation becomes indispensable when we reduce
this buffer limit later in this series.

We now fix this by inverting the mentioned condition.
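The effect of the condition can be illustrated with a toy userspace model. It assumes that the backlog length is only cleared once draining has finished, so a buffer moved to the receive queue is counted twice in the meantime. The struct and function names are hypothetical and this is not the TIPC code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock {
	size_t backlog_len;  /* bytes still accounted to the backlog */
	size_t rcvq_len;     /* bytes in the receive queue */
	size_t dupl_rcvcnt;  /* bytes currently counted in both places */
	size_t rcvbuf;       /* nominal receive buffer limit */
};

/* Limit used when queueing to the backlog: nominal limit plus compensation. */
size_t toy_backlog_limit(struct toy_sock *sk)
{
	assert(sk != NULL);
	/* Corrected test: drop the compensation only when the backlog is empty. */
	if (sk->backlog_len == 0)
		sk->dupl_rcvcnt = 0;
	return sk->rcvbuf + sk->dupl_rcvcnt;
}

/* A buffer moves to the receive queue while the backlog is being drained. */
void toy_backlog_to_rcvq(struct toy_sock *sk, size_t truesize)
{
	assert(sk != NULL);
	sk->rcvq_len += truesize;
	sk->dupl_rcvcnt += truesize;  /* still counted in backlog_len as well */
}

/* Draining finished: the backlog accounting is cleared in one step. */
void toy_backlog_drained(struct toy_sock *sk)
{
	assert(sk != NULL);
	sk->backlog_len = 0;
}
```

With the inverted (buggy) test, the counter would be reset exactly while the backlog is non-empty, which is when the extra headroom is needed.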
Acked-by: Ying Xue
Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
-
The vsock_transport structure is never modified, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall
Signed-off-by: David S. Miller
-
In presence of inelastic flows and stress, we can call
fq_codel_drop() for every packet entering fq_codel qdisc.

fq_codel_drop() is quite expensive, as it does a linear scan
of 4 KB of memory to find a fat flow.
Once found, it drops the oldest packet of this flow.

Instead of dropping a single packet, try to drop 50% of the backlog
of this fat flow, with a configurable limit of 64 packets per round.

TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this
limit configurable.

With this strategy the 4 KB search is amortized to a single cache line
per drop [1], so fq_codel_drop() no longer appears at the top of kernel
profile in presence of few inelastic flows.

[1] Assuming a 64byte cache line, and 1024 buckets
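The batching strategy can be sketched in userspace as follows. This is a simplified model, not the qdisc code: the names are hypothetical, and every packet of a flow is assumed to have the same size so that no packet queue is needed. The arithmetic behind [1] is 1024 buckets * 4 bytes = 4096 bytes scanned, and 4096 / 64 packets = 64 bytes, i.e. one cache line per dropped packet when a full batch is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FLOWS      1024  /* bucket count assumed in footnote [1] */
#define TOY_DROP_BATCH 64    /* per-round packet limit from the message */

struct toy_sched {
	uint32_t backlogs[TOY_FLOWS];  /* queued bytes per flow: 4 KB scanned */
	uint32_t pkt_len[TOY_FLOWS];   /* toy simplification: uniform packet size */
};

/* Drop up to max_packets from the fattest flow, stopping once about half
 * of its backlog is gone. Returns the number of packets dropped. */
unsigned int toy_drop_batch(struct toy_sched *q, unsigned int max_packets,
			    unsigned int *idx_out)
{
	unsigned int i, idx = 0, dropped = 0;
	uint32_t maxbacklog = 0, threshold, len = 0;

	assert(q != NULL && max_packets > 0);

	/* Linear scan for the flow with the largest backlog. */
	for (i = 0; i < TOY_FLOWS; i++) {
		if (q->backlogs[i] > maxbacklog) {
			maxbacklog = q->backlogs[i];
			idx = i;
		}
	}
	if (idx_out)
		*idx_out = idx;
	if (maxbacklog == 0)
		return 0;
	assert(q->pkt_len[idx] > 0 && q->pkt_len[idx] <= maxbacklog);

	threshold = maxbacklog / 2;
	do {
		len += q->pkt_len[idx];
		q->backlogs[idx] -= q->pkt_len[idx];
		dropped++;
	} while (dropped < max_packets && len < threshold &&
		 q->backlogs[idx] >= q->pkt_len[idx]);

	return dropped;
}
```

In this model a flow holding 100 packets of 1500 bytes loses 50 packets in one call, while a flow holding 200 such packets is capped at the 64-packet batch limit.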
Signed-off-by: Eric Dumazet
Reported-by: Dave Taht
Cc: Jonathan Morton
Acked-by: Jesper Dangaard Brouer
Acked-by: Dave Taht
Signed-off-by: David S. Miller
03 May, 2016
19 commits
-
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone().

Reallocating skb->head is a lot of work.
Test should really check if a 'real clone' of the packet was done.
TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_unclone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.
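The difference between the two checks can be shown with a toy model of a split reference count. It assumes, based on my understanding of the kernel's dataref encoding, that the low 16 bits count all references to the data and the high 16 bits count references that only care about the payload. All names below are hypothetical and the sketch is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DATAREF_SHIFT 16
#define TOY_DATAREF_MASK  ((1u << TOY_DATAREF_SHIFT) - 1)

struct toy_buf {
	bool cloned;       /* a clone of this buffer exists */
	uint32_t dataref;  /* low half: all refs, high half: payload-only refs */
};

/* Coarse check: is the data shared with anyone at all? */
bool toy_cloned(const struct toy_buf *b)
{
	assert(b != NULL);
	return b->cloned && (b->dataref & TOY_DATAREF_MASK) != 1;
}

/* Finer check: is the header shared with another holder that still
 * cares about the header? */
bool toy_header_cloned(const struct toy_buf *b)
{
	int total, payload_only;

	assert(b != NULL);
	if (!b->cloned)
		return false;
	total = (int)(b->dataref & TOY_DATAREF_MASK);
	payload_only = (int)(b->dataref >> TOY_DATAREF_SHIFT);
	return total - payload_only != 1;
}
```

A buffer with two references, one of which is payload-only, counts as cloned under the coarse check but not under the header check, so a caller using the header check can skip the reallocation in that case.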
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it) so
avoiding at least a few more netlink attributes.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
allocated a per-cpu stats which is then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
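A size callback paired with a resumable fill callback can be sketched in userspace like this. The names and signatures are hypothetical and do not reproduce the rtnl link ops; the point is only that the fill step records where it stopped so the next message can continue from there.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device exposing a fixed array of 64-bit counters. */
struct toy_dev {
	const uint64_t *counters;
	int n_counters;
};

/* "Size" callback: bytes needed to dump every counter in one message. */
size_t toy_get_xstats_size(const struct toy_dev *dev)
{
	assert(dev != NULL);
	return (size_t)dev->n_counters * sizeof(uint64_t);
}

/* "Fill" callback: copy counters into msg (room is in entries), starting
 * at *resume_idx. Returns 0 when the dump is complete, or -EMSGSIZE after
 * saving the index to resume from. */
int toy_fill_xstats(const struct toy_dev *dev, uint64_t *msg, size_t room,
		    size_t *used, int *resume_idx)
{
	int i;

	assert(dev != NULL && msg != NULL && used != NULL && resume_idx != NULL);
	*used = 0;
	for (i = *resume_idx; i < dev->n_counters; i++) {
		if (*used == room) {
			*resume_idx = i;  /* continue here in the next message */
			return -EMSGSIZE;
		}
		msg[(*used)++] = dev->counters[i];
	}
	*resume_idx = 0;  /* dump finished, reset the saved state */
	return 0;
}
```

A caller would keep invoking the fill step with a fresh message buffer until it returns 0.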
Each link type should nest its private attributes under the per-link type
attribute. This allows any number of separate private attributes
and avoids one call to get the dev link type.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
The new prividx argument allows the current dumping device to save a
private state counter which would enable it to continue dumping from
where it left off. And the idxattr is used to save the current idx user
so multiple prividx-using attributes can be requested at the same time
as suggested by Roopa Prabhu.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
Changes in GREv6 transmit path:
- Call gre_checksum, remove gre6_checksum
- Rename ip6gre_xmit2 to __gre6_xmit
- Call gre_build_header utility function
- Call ip6_tnl_xmit common function
- Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
A few generic changes to generalize tunnels in IPv6:
- Export ip6_tnl_change_mtu so that it can be called by ip6_gre
- Add tun_hlen to ip6_tnl structure.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.

Common functions are for:
- GRE checksum calculation. Move gre_checksum to gre.h.
- Building a GRE header. Move GRE build_header and rename it to
  gre_build_header.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
- Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
- Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
  most of the code in ip6gre_rcv.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).

These include:
- GRE flag handling functions are moved to gre.h
- GRE build_header is moved to gre.h and renamed gre_build_header
- parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
- iptunnel_pull_header is taken out of gre_parse_header. This is now
  done by caller. The header length is returned from gre_parse_header
  in an int* argument.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
- Make ip6_tnl_rcv non-static so that GREv6 and others can call it
- Make ip6_tnl_rcv look like ip_tunnel_rcv
- Switch to gro_cells_receive
- Make ip6_tnl_rcv non-static and export it

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults).

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet
Cc: Alexei Starovoitov
Cc: Marcelo Ricardo Leitner
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
-
Socket backlog processing is a major latency source.
With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
-
sctp_inq_push() will soon be called without BH being blocked
when generic socket code flushes the socket backlog.

It is very possible SCTP can be converted to not rely on BH,
but this needs to be done by SCTP experts.

Signed-off-by: Eric Dumazet
Acked-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller
-
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.

We need to use appropriate SNMP accessors.
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
-
DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
-
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.

As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
-
We want to make TCP stack preemptible, as draining prequeue
and backlog queues can take a lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.
Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
02 May, 2016
2 commits
-
Dave Miller pointed out that fb586f25300f ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency especially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.

This patch thus basically inverts the logic on fb586f25300f and signals
it as early as possible, similar to what we had before.

Fixes: fb586f25300f ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller
Signed-off-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller
-
When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although in reality it
is ACTIVE.

This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.

Signed-off-by: Jon Maloy
Signed-off-by: David S. Miller
30 Apr, 2016
1 commit
-
is_skb_forwardable is not supposed to change anything so constify its
arguments

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
29 Apr, 2016
9 commits
-
This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).

We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.

We take this approach as opposed to providing a set of DSA helper
functions that would retrieve the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces, therefore, statistics overlay in such
drivers would simply not scale.

The new ethtool -S output would therefore look like this now:
statistics
p_

Signed-off-by: Florian Fainelli
Signed-off-by: David S. Miller
-
TCP prequeue goal is to defer processing of incoming packets
to user space thread currently blocked in a recvmsg() system call.

Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.

Current prequeue implementation has following issues :

1) It only checks size of the prequeue against sk_rcvbuf

   It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
   But we now have ~8MB values to cope with modern networking needs.
   We have to add sk_rmem_alloc in the equation, since out of order
   packets can definitely use up to sk_rcvbuf memory themselves.

2) Even with a fixed memory truesize check, prequeue can be filled
   by thousands of packets. When prequeue needs to be flushed, either
   from softirq context (in tcp_prequeue() or timer code), or process
   context (in tcp_prequeue_process()), this adds a latency spike
   which is often not desirable.
   I added a fixed limit of 32 packets, as this translated to a max
   flush time of 60 us on my test hosts.

Also note that all packets in prequeue are not accounted for tcp_mem,
since they are not charged against sk_forward_alloc at this point.
This is probably not a big deal.

Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.

Signed-off-by: Dan Carpenter
Acked-by: Ying Xue
Signed-off-by: David S. Miller
-
There's no need to calculate rps hash if it was not enabled. So this
patch exports rps_needed and checks it before trying to get rps
hash. Tests (using pktgen to inject packets to guest) show this can
improve pps by about 13% (when rps is disabled).

Before:
~1150000 pps
After:
~1300000 pps

Cc: Michael S. Tsirkin
Signed-off-by: Jason Wang
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: David S. Miller
-
When fragmenting a skb, the next_skb should carry
the eor from prev_skb. The eor of prev_skb should
also be reset.

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
0.200 sendto(4, ..., 730, 0, ..., ...) = 730

0.200 > . 1:7301(7300) ack 1
0.200 > . 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Neal Cardwell
Cc: Soheil Hassas Yeganeh
Cc: Willem de Bruijn
Cc: Yuchung Cheng
Acked-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
-
This patch:
1. Prevent next_skb from coalescing to the prev_skb if
   TCP_SKB_CB(prev_skb)->eor is set
2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
   allowed

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257
0.300 > P. 1:731(730) ack 1
0.300 > P. 731:1461(730) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Neal Cardwell
Cc: Soheil Hassas Yeganeh
Cc: Willem de Bruijn
Cc: Yuchung Cheng
Acked-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
-
This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg. The eor bit
will prevent data from being appended to that skb in the future.

The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.

This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.

One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Neal Cardwell
Cc: Soheil Hassas Yeganeh
Cc: Willem de Bruijn
Cc: Yuchung Cheng
Suggested-by: Eric Dumazet
Acked-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
-
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.

Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.

Note that this frees one bit in shinfo->tx_flags.
Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Martin KaFai Lau
Suggested-by: Willem de Bruijn
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.
tcp_tx_timestamp() receives the tsflags as a parameter. As a
result the "sk->sk_tsflags || tsflags" is redundant, since
tsflags already includes sk->sk_tsflags plus overrides from
control messages.

Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
28 Apr, 2016
5 commits
-
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.

Rename helpers to use __ prefix instead of _BH prefix,
for contexts where preemption is disabled.

This more closely matches convention used to update
percpu variables.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
IPv6 ICMP stats are atomics anyway.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller