04 May, 2016
1 commit
-
In the presence of inelastic flows and stress, we can call
fq_codel_drop() for every packet entering the fq_codel qdisc.

fq_codel_drop() is quite expensive, as it does a linear scan
of 4 KB of memory to find a fat flow.
Once found, it drops the oldest packet of this flow.

Instead of dropping a single packet, try to drop 50% of the backlog
of this fat flow, with a configurable limit of 64 packets per round.

TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this
limit configurable.

With this strategy the 4 KB search is amortized to a single cache line
per drop [1], so fq_codel_drop() no longer appears at the top of kernel
profiles in the presence of a few inelastic flows.

[1] Assuming a 64-byte cache line and 1024 buckets.
Signed-off-by: Eric Dumazet
Reported-by: Dave Taht
Cc: Jonathan Morton
Acked-by: Jesper Dangaard Brouer
Acked-by: Dave Taht
Signed-off-by: David S. Miller
03 May, 2016
11 commits
-
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone().

Reallocating skb->head is a lot of work.
The test should really check if a 'real clone' of the packet was done.
TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone(), a variant of skb_unclone() using the
skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
- trans_timeout is incremented when a tx queue times out (tx watchdog).
- tx_maxrate is set via sysfs.

Moving tx_maxrate to the read-mostly part shrinks the struct by 64 bytes.
While at it, also move trans_timeout (it is out-of-place in the
'write-mostly' part).

Signed-off-by: Florian Westphal
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it),
thus avoiding at least a few more netlink attributes.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
per-cpu stats allocated, which are then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
Each link type should nest its private attributes under the per-link-type
attribute. This allows any number of separate private attributes and
avoids one call to get the dev link type.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
A few generic changes to generalize tunnels in IPv6:
- Export ip6_tnl_change_mtu so that it can be called by ip6_gre
- Add tun_hlen to ip6_tnl structure.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.

Common functions are for:
- GRE checksum calculation. Move gre_checksum to gre.h.
- Building a GRE header. Move GRE build_header and rename it
gre_build_header.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for an IPv6 GRE implementation (that is, they are protocol agnostic).

These include:
- GRE flag handling functions are moved to gre.h.
- GRE build_header is moved to gre.h and renamed gre_build_header.
- parse_gre_header is moved to gre_demux.c and renamed gre_parse_header.
- iptunnel_pull_header is taken out of gre_parse_header. This is now
done by the caller. The header length is returned from gre_parse_header
in an int* argument.

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
- Make ip6_tnl_rcv non-static and export it so that GREv6 and others
can call it
- Make ip6_tnl_rcv look like ip_tunnel_rcv
- Switch to gro_cells_receive

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller
-
Large sendmsg()/write() calls hold the socket lock for the duration of
the call, unless the sk->sk_sndbuf limit is hit. This is bad because
incoming packets are parked in the socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out-of-order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults).

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking the socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership, so it is not
equivalent to a {release_sock(sk); lock_sock(sk);} pair, to preserve
the implicit atomicity rules that sendmsg() was giving to (possibly
buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet
Cc: Alexei Starovoitov
Cc: Marcelo Ricardo Leitner
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
02 May, 2016
2 commits
-
This patch adds the functionality and APIs needed for selftests.
It adds the ability to configure the link-mode which is required for the
implementation of loopback tests. It adds the APIs for clock test,
register test, interrupt test and memory test.

Signed-off-by: Sudarsana Reddy Kalluru
Signed-off-by: Yuval Mintz
Signed-off-by: David S. Miller
-
Dave Miller pointed out that fb586f25300f ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency, especially if
the receiving application is running on another CPU, and that it would
be better if we signalled as early as possible.

This patch thus basically inverts the logic of fb586f25300f and signals
it as early as possible, similar to what we had before.

Fixes: fb586f25300f ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller
Signed-off-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller
30 Apr, 2016
5 commits
-
Allocate a CPU rmap and add an entry for each IRQ.
The CPU rmap is used in aRFS to get the RX queue number
of the RX completion interrupts.

Signed-off-by: Maor Gottlieb
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller
-
Currently, consumers of the flow steering infrastructure can't
choose their own flow table levels and are limited to one
flow table per level. This just wastes levels.
Instead, we introduce here the possibility of using multiple
flow tables in a level. The user is free to connect these
flow tables, while following the rule that FTEs in an FT of level x
may only point to FTs of level y where y > x.

In addition, this patch switches the order of the create/destroy
flow tables of the NIC (vlan and main).

Signed-off-by: Maor Gottlieb
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller
-
This API is used for modifying the flow rule destination.
This is needed for modifying the flow table pointed to by the
traffic type classifier rules to point to the aRFS tables.

Signed-off-by: Maor Gottlieb
Signed-off-by: Saeed Mahameed
Signed-off-by: David S. Miller
-
is_skb_forwardable is not supposed to change anything, so constify its
arguments.

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller
-
Define PPP device handler for use with rtnetlink.
The only PPP specific attribute is IFLA_PPP_DEV_FD. It is mandatory and
contains the file descriptor of the associated /dev/ppp instance (the
file descriptor which would have been used for ioctl(PPPIOCNEWUNIT) in
the ioctl-based API). The PPP device is removed when this file
descriptor is released (same behaviour as with ioctl-based PPP
devices).

PPP devices created with the rtnetlink API behave like the ones created
with ioctl(PPPIOCNEWUNIT). In particular, existing ioctls work the same
way, no matter how the PPP device was created.
The rtnl callbacks are also assigned to ioctl-based PPP devices. This
way, rtnl messages have the same effect on any PPP device.
The immediate effect is that all PPP devices, even ioctl-based
ones, can now be removed with "ip link del".

A minor difference still exists between ioctl- and rtnl-based PPP
interfaces: in the device name, the number following the "ppp" prefix
corresponds to the PPP unit number for ioctl-based devices, while it is
just an unrelated incrementing index for rtnl ones.

Signed-off-by: Guillaume Nault
Signed-off-by: David S. Miller
29 Apr, 2016
4 commits
-
This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues, etc.).

We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overloading the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats, so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.

We take this approach as opposed to providing a set of DSA helper
functions that would retrieve the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces; therefore, statistics overlay in such
drivers would simply not scale.

The new ethtool -S output would therefore look like this now:
statistics
p_

Signed-off-by: Florian Fainelli
Signed-off-by: David S. Miller
-
mac80211 (which will be the first user of fq.h) recently started to
support software A-MSDU aggregation. It glues skbuffs together into a
single one, so the backlog accounting needs to be more fine-grained.

To avoid duplicating the backlog sorting logic, split it up for re-use.

Signed-off-by: Michal Kazior
Signed-off-by: David S. Miller
-
This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set on the skb
containing the last byte of the userland's msg. The eor bit
will prevent data from being appended to that skb in the future.

The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.

This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.

One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get more accurate
TCP ack timestamping on application protocols with
multiple outgoing response messages (e.g. HTTP2).

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792
0.100 > S. 0:0(0) ack 1
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau
Cc: Eric Dumazet
Cc: Neal Cardwell
Cc: Soheil Hassas Yeganeh
Cc: Willem de Bruijn
Cc: Yuchung Cheng
Suggested-by: Eric Dumazet
Acked-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
-
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
the error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes the
SKBTX_ACK_TSTAMP flag redundant.

Remove the SKBTX_ACK_TSTAMP flag and instead use the txstamp_ack bit
everywhere.

Note that this frees one bit in shinfo->tx_flags.
Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Martin KaFai Lau
Suggested-by: Willem de Bruijn
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
28 Apr, 2016
17 commits
-
I accidentally replaced BH disabling by preemption disabling
in SNMP_ADD_STATS64() and SNMP_UPD_PO_STATS64() on 32bit builds.

For 64bit stats on a 32bit arch, we really need to disable BH,
since the "struct u64_stats_sync syncp" might be manipulated
both from process and BH contexts.

Fixes: 6aef70a851ac ("net: snmp: kill various STATS_USER() helpers")
Reported-by: Nicolas Dichtel
Tested-by: Nicolas Dichtel
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
and equivalent functions, so that sock_wake_async() can send
a SIGIO only when necessary.

Since these atomic operations are really not needed unless the
socket expressed interest in FASYNC, we can omit them in most
cases.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
so that a SIGIO signal is sent when needed.

tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when the stream is not writeable.

We can avoid two atomic operations by first checking if the socket
is actually interested in the FASYNC business (most sockets in
real applications do not use AIO, but select()/poll()/epoll()).

This also removes one cache line miss to access sk->sk_wq->flags
in tcp_sendmsg().

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.

Rename helpers to use a __ prefix instead of the _BH prefix,
for contexts where preemption is disabled.

This more closely matches the convention used to update
percpu variables.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
IPv6 ICMP stats are atomics anyway.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS().

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS().

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename IP_ADD_STATS_BH() to __IP_ADD_STATS()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express that it is used in a non-preemptible context.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Remove misleading _BH suffix.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS()
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Not used anymore.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
-
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS().

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller