09 Sep, 2016

5 commits

  • Over the years, TCP BDP has increased by several orders of magnitude,
    and some people are considering reaching the 2 Gbyte limit.

    Even with the current window scale limit of 14, ~1 Gbyte maps to ~740,000
    MSS.

    In the presence of packet losses (or reorders), TCP stores incoming
    packets in an out-of-order queue, and the number of skbs sitting there
    waiting for the missing packets to be received can be in the 10^5 range.

    Most packets are appended to the tail of this queue, and when
    packets can finally be transferred to receive queue, we scan the queue
    from its head.

    However, in the presence of heavy losses, we might have to find an
    arbitrary point in this queue, involving a linear scan for every incoming
    packet and thrashing CPU caches.

    This patch converts it to an RB tree to get bounded latencies.
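    As a rough user-space illustration of the data-structure change (not the
    kernel code itself: POSIX tsearch()/tfind() stand in for the kernel
    rb_node API, and ofo_seg/ofo_insert/ofo_find are invented names), ordered
    inserts and lookups stay O(log n) instead of a linear list walk:

```c
#include <search.h>
#include <stdlib.h>

/* Hypothetical stand-in for an out-of-order segment, keyed by sequence number. */
struct ofo_seg {
        unsigned int seq;
};

static int seg_cmp(const void *a, const void *b)
{
        const struct ofo_seg *x = a, *y = b;
        return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Insert in O(log n) instead of walking a list head-to-tail. */
static struct ofo_seg *ofo_insert(void **root, unsigned int seq)
{
        struct ofo_seg *seg = malloc(sizeof(*seg));
        if (!seg)
                return NULL;
        seg->seq = seq;
        struct ofo_seg **slot = tsearch(seg, root, seg_cmp);
        if (*slot != seg) {     /* duplicate segment already queued */
                free(seg);
                return *slot;
        }
        return seg;
}

static struct ofo_seg *ofo_find(void **root, unsigned int seq)
{
        struct ofo_seg key = { .seq = seq };
        void *slot = tfind(&key, root, seg_cmp);
        return slot ? *(struct ofo_seg **)slot : NULL;
}
```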

    Yaogong wrote a preliminary patch about 2 years ago.
    Eric did the rebase, added ofo_last_skb cache, polishing and tests.

    Tested with the network dropping between 1 and 10 % of packets, with good
    success (about a 30 % throughput increase in stress tests).

    The next step would be to also use an RB tree for the write queue on the
    sender side ;)

    Signed-off-by: Yaogong Wang
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Ilpo Järvinen
    Acked-By: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yaogong Wang
     
  • This simplifies the use of double-tagged vlans. The new function allows
    all valid vlan ethertypes to be checked in a single function call.
    Also replace some instances that check for both ETH_P_8021Q and
    ETH_P_8021AD.
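    A minimal user-space sketch of such a check (the function body is a
    paraphrase, not the kernel code; for simplicity it takes the ethertype in
    host byte order, whereas the kernel helper takes a __be16):

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100  /* 802.1Q VLAN tag */
#define ETH_P_8021AD 0x88A8  /* 802.1ad service VLAN (QinQ outer tag) */

/* One place to ask "is this ethertype any valid VLAN tag?" instead of
 * open-coding the ETH_P_8021Q || ETH_P_8021AD pair at every call site. */
static bool eth_type_vlan(uint16_t ethertype)
{
        switch (ethertype) {
        case ETH_P_8021Q:
        case ETH_P_8021AD:
                return true;
        default:
                return false;
        }
}
```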

    Patch based on one originally by Thomas F Herbert.

    Signed-off-by: Thomas F Herbert
    Signed-off-by: Eric Garver
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Eric Garver
     
  • openvswitch: Add support for 802.1AD

    Change the description of the VLAN tpid field.

    Signed-off-by: Thomas F Herbert
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thomas F Herbert
     
  • This adds the capability for a process that has CAP_NET_ADMIN on
    a socket to see the socket mark in socket dumps.

    Commit a52e95abf772 ("net: diag: allow socket bytecode filters to
    match socket marks") recently gave privileged processes the
    ability to filter socket dumps based on mark. This patch is
    complementary: it ensures that the mark is also passed to
    userspace in the socket's netlink attributes. It is useful for
    tools like ss which display information about sockets.

    Tested: https://android-review.googlesource.com/270210
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     
  • Steffen Klassert says:

    ====================
    ipsec-next 2016-09-08

    1) Constify the xfrm_replay structures. From Julia Lawall

    2) Protect xfrm state hash tables with rcu, lookups
    can be done now without acquiring xfrm_state_lock.
    From Florian Westphal.

    3) Protect xfrm policy hash tables with rcu, lookups
    can be done now without acquiring xfrm_policy_lock.
    From Florian Westphal.

    4) We don't need to have a garbage collector list per
    namespace anymore, so use a global one instead.
    From Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Sep, 2016

2 commits


07 Sep, 2016

4 commits

  • Add a tracepoint for working out where local aborts happen. Each
    tracepoint call is labelled with a 3-letter code so that the call sites
    can be distinguished, and the DATA sequence number is added too where
    available.

    rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
    indicate the circumstances when it aborts a call.

    Signed-off-by: David Howells

    David Howells
     
  • Improve the call tracking tracepoint by showing more differentiation
    between some of the put and get events, including:

    (1) Getting and putting refs for the socket call user ID tree.

    (2) Getting and putting refs for queueing and failing to queue the call
    processor work item.

    Note that these aren't necessarily used in this patch, but will be taken
    advantage of in future patches.

    An enum is added for the event subtype numbers rather than coding them
    directly as decimal numbers and a table of 3-letter strings is provided
    rather than a sequence of ?: operators.

    Signed-off-by: David Howells

    David Howells
     
  • …l/git/dhowells/linux-fs

    David Howells says:

    ====================
    rxrpc: Small fixes

    Here's a set of small fix patches:

    (1) Fix some uninitialised variables.

    (2) Set the client call state before making it live by attaching it to the
    conn struct.

    (3) Randomise the epoch and starting client conn ID values, and don't
    change the epoch when the client conn ID rolls round.

    (4) Replace deprecated create_singlethread_workqueue() calls.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. Most relevant updates are the removal of per-conntrack timers to
    use a workqueue/garbage collection approach instead from Florian
    Westphal, the hash and numgen expression for nf_tables from Laura
    Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
    removal of ip_conntrack sysctl and many other incremental updates on our
    Netfilter codebase.

    More specifically, they are:

    1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
    transport area in dccp, sctp, tcp, udp and udplite protocol
    conntrackers, from Gao Feng.

    2) Missing whitespace on error message in physdev match, from Hangbin Liu.

    3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.

    4) Add nf_ct_expires() helper function and use it, from Florian Westphal.

    5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
    from Florian.

    6) Rename nf_tables set implementation to nft_set_{name}.c

    7) Introduce the hash expression to allow arbitrary hashing of selector
    concatenations, from Laura Garcia Liebana.

    8) Remove the ip_conntrack sysctl backward compatibility code; this code
    has been around for a long time already, and we have two interfaces to do
    this already: nf_conntrack sysctl and ctnetlink.

    9) Use nf_conntrack_get_ht() helper function whenever possible, instead
    of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.

    10) Add quota expression for nf_tables.

    11) Add number generator expression for nf_tables, this supports
    incremental and random generators that can be combined with maps,
    very useful for load balancing purpose, again from Laura Garcia Liebana.

    12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.

    13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
    configuration, this is used by a follow up patch to perform better chain
    update validation.

    14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
    nft_set_hash implementation to honor the NLM_F_EXCL flag.

    15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
    patch from Florian Westphal.

    16) Don't use the DYING bit to know if the conntrack event has been already
    delivered, instead a state variable to track event re-delivery
    states, also from Florian.

    17) Remove the per-conntrack timer, use the workqueue approach that was
    discussed during the NFWS, from Florian Westphal.

    18) Use the netlink conntrack table dump path to kill stale entries,
    again from Florian.

    19) Add a garbage collector to get rid of stale conntracks, from
    Florian.

    20) Reschedule garbage collector if eviction rate is high.

    21) Get rid of the __nf_ct_kill_acct() helper.

    22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.

    23) Make nf_log_set() interface assertive on unsupported families.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Sep, 2016

1 commit


03 Sep, 2016

3 commits

  • This patch fixes the return value of switchdev_port_fdb_dump() when
    CONFIG_NET_SWITCHDEV is not set. This avoids the "warning: return makes
    integer from pointer without a cast [-Wint-conversion]" emitted by
    several compiler versions when building with CONFIG_NET_SWITCHDEV unset.
    This warning is due to commit d297653dd6f07afbe7e6c702a4bcd7615680002e
    ("rtnetlink: fdb dump: optimize by saving last interface markers").

    Signed-off-by: Rami Rosen
    Acked-by: Roopa Prabhu
    Reported-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Rosen, Rami
     
  • Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to SW and HW perf
    events via the overflow_handler mechanism.
    When a program is attached, the overflow handlers become stacked.
    The program acts as a filter.
    Returning zero from the program means that the normal perf_event_output
    handler will not be called and the sampling event won't be stored in the
    ring buffer.

    The overflow_handler_context == NULL check is an additional safety
    measure to make sure programs are not attached to hw breakpoints or the
    watchdog, in case other checks (which prevent that now anyway) get
    accidentally relaxed in the future.

    The program refcnt is incremented in case perf_events are inherited
    when the target task is forked.
    As with kprobe and tracepoint programs, there is no ioctl to
    detach the program or swap an already attached program. User space is
    expected to close(perf_event_fd) as it does right now for kprobe+bpf.
    That restriction simplifies the code quite a bit.

    The invocation of overflow_handler in __perf_event_overflow() is now
    done via READ_ONCE, since that pointer can be replaced when the program
    is attached while perf_event itself could have been active already.
    There is no need to do similar treatment for event->prog, since it's
    assigned only once before it's accessed.
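    A user-space C11 sketch of the idea (invented names; atomic_load_explicit
    plays the role of READ_ONCE, loading the handler pointer exactly once
    even though it may be swapped while the event is live):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef void (*overflow_handler_t)(void *data);

struct event {
        _Atomic(overflow_handler_t) overflow_handler;
};

static int calls;  /* side effect so the demo handler is observable */

static void sample_handler(void *data)
{
        (void)data;
        calls++;
}

static void event_overflow(struct event *ev, void *data)
{
        /* single atomic load, analogous to READ_ONCE(event->overflow_handler) */
        overflow_handler_t h =
                atomic_load_explicit(&ev->overflow_handler, memory_order_relaxed);
        if (h)
                h(data);
}
```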

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
    HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
    correspondingly in uapi/linux/perf_event.h)

    The program-visible context meta structure is:

        struct bpf_perf_event_data {
                struct pt_regs regs;
                __u64 sample_period;
        };

    which is accessible directly from the program:

        int bpf_prog(struct bpf_perf_event_data *ctx)
        {
                ... ctx->sample_period ...
                ... ctx->regs.ip ...
        }

    The bpf verifier rewrites the accesses into kernel internal
    struct bpf_perf_event_data_kern which allows changing
    struct perf_sample_data without affecting bpf programs.
    New fields can be added to the end of struct bpf_perf_event_data
    in the future.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

02 Sep, 2016

5 commits

  • Access the priv member of the dsa_switch structure directly, instead of
    having an unnecessary helper.

    Signed-off-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • Add a per-port flag to control the unknown multicast flood, similar to the
    unknown unicast flood flag and break a few long lines in the netlink flag
    exports.

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • fdb dumps spanning multiple skbs currently restart from the first
    interface again for every skb. This results in unnecessary
    iterations over the already visited interfaces and their fdb
    entries. In large scale setups, we have seen this slow
    down fdb dumps considerably. On a system with 30k macs we
    see fdb dumps spanning across more than 300 skbs.

    To fix the problem, this patch replaces the existing single fdb
    marker with three markers: netdev hash entries, netdevs and fdb
    index to continue where we left off instead of restarting from the
    first netdev. This is consistent with link dumps.
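    A toy sketch of the resume-from-marker idea (invented names; only the
    netdev and fdb markers of the three are modelled, and "emitting" an entry
    just counts it in place of copying it into an skb):

```c
#include <stddef.h>

/* Miniature of the resume scheme: instead of restarting at device 0 for
 * every message, pick up at the (dev, fdb) markers left by the last call. */
struct dump_state {
        size_t dev_marker;  /* which netdev we stopped at         */
        size_t fdb_marker;  /* which fdb entry within that netdev */
};

/* Emit up to 'budget' entries (one skb's worth), updating the markers so
 * the next call continues where this one left off. Returns entries emitted. */
static size_t fdb_dump(size_t ndevs, const size_t *nfdb_per_dev,
                       size_t budget, struct dump_state *st)
{
        size_t emitted = 0;

        for (size_t d = st->dev_marker; d < ndevs; d++) {
                size_t f = (d == st->dev_marker) ? st->fdb_marker : 0;
                for (; f < nfdb_per_dev[d]; f++) {
                        if (emitted == budget) {
                                st->dev_marker = d;  /* resume here next time */
                                st->fdb_marker = f;
                                return emitted;
                        }
                        emitted++;          /* "copy entry into the skb" */
                }
        }
        st->dev_marker = ndevs;             /* dump complete */
        st->fdb_marker = 0;
        return emitted;
}
```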

    In the process of fixing the performance issue, this patch also
    re-implements the fix done by
    commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
    (with an internal fix from Wilson Kok) in the following ways:
    - change ndo_fdb_dump handlers to return error code instead
    of the last fdb index
    - use cb->args strictly for dump frag markers and not error codes.
    This is consistent with other dump functions.

    Below results were taken on a system with 1000 netdevs
    and 35085 fdb entries:
    before patch:
    $time bridge fdb show | wc -l
    15065

    real 1m11.791s
    user 0m0.070s
    sys 1m8.395s

    (existing code does not return all macs)

    after patch:
    $time bridge fdb show | wc -l
    35085

    real 0m2.017s
    user 0m0.113s
    sys 0m1.942s

    Signed-off-by: Roopa Prabhu
    Signed-off-by: Wilson Kok
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Constify the parameter of flow_keys_have_l4() for readability.

    Signed-off-by: Gao Feng
    Signed-off-by: David S. Miller

    Gao Feng
     
  • Don't expose skbs to in-kernel users, such as the AFS filesystem, but
    instead provide a notification hook that indicates that a call needs
    attention and another that indicates that there's a new call to be
    collected.

    This makes the following possibilities more achievable:

    (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

    (2) skbs referring to non-data events will be able to be freed much sooner
    rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
    will be able to consult the call state.

    (3) We can shortcut the receive phase when a call is remotely aborted
    because we don't have to go through all the packets to get to the one
    cancelling the operation.

    (4) It makes it easier to do encryption/decryption directly between AFS's
    buffers and sk_buffs.

    (5) Encryption/decryption can more easily be done in AFS's thread
    contexts - usually that of the userspace process that issued a syscall
    - rather than in one of rxrpc's background threads on a workqueue.

    (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

    To make this work, the following interface function has been added:

        int rxrpc_kernel_recv_data(struct socket *sock, struct rxrpc_call *call,
                                   void *buffer, size_t bufsize, size_t *_offset,
                                   bool want_more, u32 *_abort_code);

    This is the recvmsg equivalent. It allows the caller to find out about the
    state of a specific call and to transfer received data into a buffer
    piecemeal.

    afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
    logic between them. They don't wait synchronously yet because the socket
    lock needs to be dealt with.

    Five interface functions have been removed:

    rxrpc_kernel_is_data_last()
    rxrpc_kernel_get_abort_code()
    rxrpc_kernel_get_error_number()
    rxrpc_kernel_free_skb()
    rxrpc_kernel_data_consumed()

    As a temporary hack, sk_buffs going to an in-kernel call are queued on the
    rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
    in-kernel user. To process the queue internally, a temporary function,
    temp_deliver_data() has been added. This will be replaced with common code
    between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
    future patch.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

01 Sep, 2016

1 commit


31 Aug, 2016

1 commit

  • Today the mpls iptunnel lwtunnel_output redirect expects the tunnel
    output function to handle fragmentation. This is ok but can be
    avoided if we do not do the mpls output redirect too early,
    i.e. we could wait until ip fragmentation is done and then call
    mpls output for each ip fragment.

    To make this work we will need:
    1) the lwtunnel state to carry the encap headroom
    2) to do the redirect to the encap output handler on the ip fragment
    (essentially do the output redirect after fragmentation)

    This patch adds the tunnel headroom in lwtstate to make sure we
    account for tunnel data in mtu calculations during fragmentation,
    and adds a new xmit redirect handler to redirect to the lwtunnel
    xmit func after ip fragmentation.

    This includes IPV6 and some mtu fixes and testing from David Ahern.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    Roopa Prabhu
     

30 Aug, 2016

9 commits

  • Pass struct socket * to more rxrpc kernel interface functions. They
    should start from this rather than from the socket pointer in the
    rxrpc_call struct if they need to access the socket.

    I have left:

    rxrpc_kernel_is_data_last()
    rxrpc_kernel_get_abort_code()
    rxrpc_kernel_get_error_number()
    rxrpc_kernel_free_skb()
    rxrpc_kernel_data_consumed()

    unmodified as they're all about to be removed (and, in any case, don't
    touch the socket).

    Signed-off-by: David Howells

    David Howells
     
  • Provide a function so that kernel users, such as AFS, can ask for the peer
    address of a call:

        void rxrpc_kernel_get_peer(struct rxrpc_call *call,
                                   struct sockaddr_rxrpc *_srx);

    In the future the kernel service won't get sk_buffs to look inside.
    Further, this allows us to hide any canonicalisation inside AF_RXRPC for
    when IPv6 support is added.

    Also propagate this through to afs_find_server() and issue a warning if we
    can't handle the address family yet.

    Signed-off-by: David Howells

    David Howells
     
  • Add a trace event for debugging rxrpc_call struct usage.

    Signed-off-by: David Howells

    David Howells
     
  • nf_log_set() is an interface function, so it should do strict sanity
    checking of its parameters. Convert the return value of nf_log_set()
    from void to int; when the pf is invalid, return -EOPNOTSUPP.
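    A sketch of the assertive-interface idea (nf_log_set_sketch and the
    NFPROTO_NUMPROTO bound are stand-ins for illustration, not the kernel
    code):

```c
#include <errno.h>

#define NFPROTO_NUMPROTO 13   /* assumed upper bound on protocol families */

/* An interface function validates its arguments: reject an unsupported
 * family with an error code instead of silently accepting it. */
static int nf_log_set_sketch(int pf)
{
        if (pf < 0 || pf >= NFPROTO_NUMPROTO)
                return -EOPNOTSUPP;
        return 0;               /* would go on to install the logger */
}
```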

    Signed-off-by: Gao Feng
    Signed-off-by: Pablo Neira Ayuso

    Gao Feng
     
  • After the timer removal this just calls nf_ct_delete, so remove the
    __-prefixed version and make nf_ct_kill a shorthand for nf_ct_delete.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
    Eric Dumazet pointed out during netfilter workshop 2016.

    Eric also says: "Another reason was the fact that Thomas was about to
    change max timer range [..]" (500462a9de657f8, 'timers: Switch to
    a non-cascading wheel').

    Remove the timer and use a 32bit jiffies value containing the timestamp
    until which the entry is valid.

    During conntrack lookup, even before doing the tuple comparison, check
    the timeout value and evict the entry in case it is too old.
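    The wrap-safe timeout test can be sketched in user space as follows
    (entry_is_expired is an invented name; the signed subtraction is the
    same trick the kernel's time_after() family uses to survive jiffies
    wraparound):

```c
#include <stdbool.h>
#include <stdint.h>

/* "Has this 32-bit deadline passed?" - comparing via signed subtraction
 * gives the right answer even when the jiffies counter has wrapped. */
static bool entry_is_expired(uint32_t timeout, uint32_t now)
{
        return (int32_t)(timeout - now) <= 0;
}
```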

    The dying bit is used as a synchronization point to avoid races where
    multiple cpus try to evict the same entry.

    Because lookup is always lockless, we need to bump the refcnt once
    when we evict, else we could try to evict an already-dead entry that
    is being recycled.

    This is the standard/expected way when conntrack entries are destroyed.

    Followup patches will introduce garbage collection via a work queue
    and further places where we can reap obsolete entries (e.g. during
    netlink dumps); this is needed to keep expired conntracks from hanging
    around for too long when the lookup rate is low after a busy period.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The reliable event delivery mode currently (ab)uses the DYING bit to
    detect which entries on the dying list have to be skipped when the
    ecache worker re-delivers events.

    Currently when we delete the conntrack from main table we only set this
    bit if we could also deliver the netlink destroy event to userspace.

    If we fail we move it to the dying list, the ecache worker will
    reattempt event delivery for all confirmed conntracks on the dying list
    that do not have the DYING bit set.

    Once the timer is gone, we can no longer use if (del_timer()) to detect
    when we 'stole' the reference count owned by the timer/hash entry, so
    we need some other way to avoid racing with other cpus.

    Pablo suggested to add a marker in the ecache extension that skips
    entries that have been unhashed from main table but are still waiting
    for the last reference count to be dropped (e.g. because one skb waiting
    on nfqueue verdict still holds a reference).

    We do this by adding a tristate.
    If we fail to deliver the destroy event, make a note of this in the
    ecache extension. The worker can then skip all entries that are in
    a different state: either they never delivered a destroy event,
    e.g. because the netlink backend was not loaded, or redelivery took
    place already.

    Once the conntrack timer is removed we will be able to replace the
    del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
    racing with other cpus trying to evict the same conntrack.

    Because DYING will then be set right before we report the destroy event,
    we can no longer skip event reporting when the dying bit is set.

    Suggested-by: Pablo Neira Ayuso
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • All three conflicts were cases of simple overlapping
    changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:

    1) Segregate namespaces properly in conntrack dumps, from Liping Zhang.

    2) tcp listener refcount fix in netfilter tproxy, from Eric Dumazet.

    3) Fix timeouts in qed driver due to xmit_more, from Yuval Mintz.

    4) Fix use-after-free in tcp_xmit_retransmit_queue().

    5) Userspace header fixups (use of __u32, missing includes, etc.) from
    Mikko Rapeli.

    6) Further refinements to fragmentation wrt gso and tunnels, from
    Shmulik Ladkani.

    7) Trigger poll correctly for zero length UDP packets, from Eric
    Dumazet.

    8) TCP window scaling fix, also from Eric Dumazet.

    9) SLAB_DESTROY_BY_RCU is not relevant any more for UDP sockets.

    10) Module refcount leak in qdisc_create_dflt(), from Eric Dumazet.

    11) Fix deadlock in cp_rx_poll() of 8139cp driver, from Gao Feng.

    12) Memory leak in rhashtable's alloc_bucket_locks(), from Eric Dumazet.

    13) Add new device ID to alx driver, from Owen Lin.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
    Add Killer E2500 device ID in alx driver.
    net: smc91x: fix SMC accesses
    Documentation: networking: dsa: Remove platform device TODO
    net/mlx5: Increase number of ethtool steering priorities
    net/mlx5: Add error prints when validate ETS failed
    net/mlx5e: Fix memory leak if refreshing TIRs fails
    net/mlx5e: Add ethtool counter for TX xmit_more
    net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ
    net/mlx5e: Don't wait for SQ completions on close
    net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
    net/mlx5e: Don't wait for RQ completions on close
    net/mlx5e: Limit UMR length to the device's limitation
    rhashtable: fix a memory leak in alloc_bucket_locks()
    sfc: fix potential stack corruption from running past stat bitmask
    team: loadbalance: push lacpdus to exact delivery
    net: hns: dereference ppe_cb->ppe_common_cb if it is non-null
    8139cp: Fix one possible deadloop in cp_rx_poll
    i40e: Change some init flow for the client
    Revert "phy: IRQ cannot be shared"
    net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
    ...

    Linus Torvalds
     

29 Aug, 2016

7 commits

  • This patch enhances the ethtool link mode bitmap to include
    missing interface modes for 1G/10G speeds.

    Changes:
    1000baseX is the mode introduced to cover all 1G fiber cases.
    The modes under 1000baseX, i.e. 1000BASE-SX, 1000BASE-LX, 1000BASE-LX10
    and 1000BASE-BX10, are not explicitly defined at this moment.
    10G CR, SR, LR and ER link modes are included for 10G speeds.

    Issue:
    ethtool on a 1G/10G SFP port reports Base-T
    even though this port supports 1000baseX, 10G CR, SR and LR modes.

    root@tor-02$ ethtool swp1
    Settings for swp1:
    Supported ports: [ FIBRE ]
    Supported link modes: 1000baseT/Full
    10000baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Advertised link modes: 1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Speed: 10000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: external
    Auto-negotiation: off
    Current message level: 0x00000000 (0)

    Link detected: yes

    After fix:
    root@tor-02$ ethtool swp1
    Settings for swp1:
    Supported ports: [ FIBRE ]
    Supported link modes: 1000baseX/Full
    10000baseCR/Full
    10000baseSR/Full
    10000baseLR/Full
    10000baseER/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Advertised link modes: 1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Speed: 10000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: external
    Auto-negotiation: off
    Current message level: 0x00000000 (0)
    Link detected: yes

    Signed-off-by: Vidya Sagar Ravipati
    Signed-off-by: David S. Miller

    Vidya Sagar Ravipati
     
  • When TCP operates in lossy environments (between 1 and 10 % packet
    loss), many SACK blocks can be exchanged, and I noticed we could
    drop them on busy senders if these SACK blocks have to be queued
    into the socket backlog.

    While the main cause is the poor performance of RACK/SACK processing,
    we can try to avoid these drops of valuable information that can lead to
    spurious timeouts and retransmits.

    The drops are caused by skb->truesize overestimation, due to:

    - drivers allocating ~2048 (or more) bytes as a fragment to hold an
    Ethernet frame.

    - various pskb_may_pull() calls bringing the headers into skb->head
    might have pulled all the frame content, but skb->truesize could
    not be lowered, as the stack has no idea of each fragment's truesize.

    The backlog drops are also more visible on bidirectional flows, since
    their sk_rmem_alloc can be quite big.

    Let's add some room for the backlog, as only the socket owner
    can selectively take action to lower memory needs, like collapsing
    receive queues or partial ofo pruning.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Commit b70661c70830 ("net: smc91x: use run-time configuration on all ARM
    machines") broke some ARM platforms through several mistakes. Firstly,
    the access size must correspond to the following rule:

    (a) at least one of 16-bit or 8-bit access size must be supported
    (b) 32-bit accesses are optional, and may be enabled in addition to
    the above.

    Secondly, it provides no emulation of 16-bit accesses, instead blindly
    making 16-bit accesses even when the platform specifies that only 8-bit
    is supported.

    Reorganise smc91x.h so we can make use of the existing 16-bit access
    emulation already provided - if 16-bit accesses are supported, use
    16-bit accesses directly, otherwise if 8-bit accesses are supported,
    use the provided 16-bit access emulation. If neither, BUG(). This
    exactly reflects the driver behaviour prior to the commit being fixed.
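    The 16-bit-via-8-bit emulation direction can be sketched as follows
    (invented name; a little-endian register pair is assumed), mirroring
    what the driver does when a platform offers only 8-bit I/O:

```c
#include <stdint.h>

/* Emulate one 16-bit register read with two 8-bit accesses: read the low
 * and high bytes separately and combine them (little-endian layout). */
static uint16_t read16_via_8(const volatile uint8_t *reg)
{
        uint16_t lo = reg[0];
        uint16_t hi = reg[1];
        return (uint16_t)(lo | (hi << 8));
}
```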

    Since the conversion incorrectly cut down the available access sizes on
    several platforms, we also need to go through every platform and fix up
    the overly-restrictive access size: Arnd assumed that if a platform can
    perform 32-bit, 16-bit and 8-bit accesses, then only a 32-bit access
    size needed to be specified - not so, all available access sizes must
    be specified.

    This also likely fixes some performance regressions: if a platform
    does not support 8-bit accesses, 8-bit accesses have been emulated
    by performing a 16-bit read-modify-write access.

    Tested on the Intel Assabet/Neponset platform, which supports only 8-bit
    accesses, which was broken by the original commit.

    Fixes: b70661c70830 ("net: smc91x: use run-time configuration on all ARM machines")
    Signed-off-by: Russell King
    Tested-by: Robert Jarzmik
    Signed-off-by: David S. Miller

    Russell King
     
  • kcm and strparser need to work with any type of stream socket, not just
    TCP. Eliminate references to TCP and call the generic proto_ops
    functions read_sock and peek_len. Also, in strp_init, check that the
    socket supports the read_sock and peek_len proto_ops.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
    tcp_peek_len (which is just a stub function that calls tcp_inq).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Add a new function to the proto_ops structure. This includes moving the
    typedef for sk_read_actor into net.h and removing the definition from
    tcp.h.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Pull drm fixes from Dave Airlie:
    "A bunch of fixes covering i915, amdgpu, one tegra and some core DRM
    ones. Nothing too strange at this point"

    * tag 'drm-fixes-for-4.8-rc4' of git://people.freedesktop.org/~airlied/linux: (21 commits)
    drm/atomic: Don't potentially reset color_mgmt_changed on successive property updates.
    drm: Protect fb_defio in drivers with CONFIG_KMS_FBDEV_EMULATION
    drm/amdgpu: skip TV/CV in display parsing
    drm/amdgpu: avoid a possible array overflow
    drm/amdgpu: fix lru size grouping v2
    drm/tegra: dsi: Enhance runtime power management
    drm/i915: Fix botched merge that downgrades CSR versions.
    drm/i915/skl: Ensure pipes with changed wms get added to the state
    drm/i915/gen9: Only copy WM results for changed pipes to skl_hw
    drm/i915/skl: Add support for the SAGV, fix underrun hangs
    drm/i915/gen6+: Interpret mailbox error flags
    drm/i915: Reattach comment, complete type specification
    drm/i915: Unconditionally flush any chipset buffers before execbuf
    drm/i915/gen9: Drop invalid WARN() during data rate calculation
    drm/i915/gen9: Initialize intel_state->active_crtcs during WM sanitization (v2)
    drm: Reject page_flip for !DRIVER_MODESET
    drm/amdgpu: fix timeout value check in amd_sched_job_recovery
    drm/amdgpu: fix sdma_v2_4_ring_test_ib
    drm/amdgpu: fix amdgpu_move_blit on 32bit systems
    drm/radeon: fix radeon_move_blit on 32bit systems
    ...

    Linus Torvalds
     

28 Aug, 2016

1 commit

  • Pull KVM fixes from Paolo Bonzini:
    "ARM:
    - fixes for ITS init issues, error handling, IRQ leakage, race
    conditions
    - an erratum workaround for timers
    - some removal of misleading use of errors and comments
    - a fix for GICv3 on 32-bit guests

    MIPS:
    - fix for where the guest could wrongly map the first page of
    physical memory

    x86:
    - nested virtualization fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    MIPS: KVM: Check for pfn noslot case
    kvm: nVMX: fix nested tsc scaling
    KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
    KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC
    arm64: KVM: report configured SRE value to 32-bit world
    arm64: KVM: remove misleading comment on pmu status
    KVM: arm/arm64: timer: Workaround misconfigured timer interrupt
    arm64: Document workaround for Cortex-A72 erratum #853709
    KVM: arm/arm64: Change misleading use of is_error_pfn
    KVM: arm64: ITS: avoid re-mapping LPIs
    KVM: arm64: check for ITS device on MSI injection
    KVM: arm64: ITS: move ITS registration into first VCPU run
    KVM: arm64: vgic-its: Make updates to propbaser/pendbaser atomic
    KVM: arm64: vgic-its: Plug race in vgic_put_irq
    KVM: arm64: vgic-its: Handle errors from vgic_add_lpi
    KVM: arm64: ITS: return 1 on successful MSI injection

    Linus Torvalds
     

27 Aug, 2016

1 commit

  • Merge fixes from Andrew Morton:
    "11 fixes"

    * emailed patches from Andrew Morton :
    mm: silently skip readahead for DAX inodes
    dax: fix device-dax region base
    fs/seq_file: fix out-of-bounds read
    mm: memcontrol: avoid unused function warning
    mm: clarify COMPACTION Kconfig text
    treewide: replace config_enabled() with IS_ENABLED() (2nd round)
    printk: fix parsing of "brl=" option
    soft_dirty: fix soft_dirty during THP split
    sysctl: handle error writing UINT_MAX to u32 fields
    get_maintainer: quiet noisy implicit -f vcs_file_exists checking
    byteswap: don't use __builtin_bswap*() with sparse

    Linus Torvalds