10 Jan, 2019

1 commit

  • [ Upstream commit 3a0ed3e9619738067214871e9cb826fa23b2ddb9 ]

    Al Viro mentioned (Message-ID ) that there is probably a race
    condition lurking in accesses of sk_stamp on 32-bit machines.

    sock->sk_stamp is of type ktime_t, which is always an s64.
    On a 32-bit architecture, we might run into situations of
    unsafe access, as access to the field becomes non-atomic.

    Use a seqlock for synchronization.
    This allows us to avoid using spinlocks for readers, as
    readers do not need mutual exclusion (see the sketch at the
    end of this entry).

    Another approach to solve this is to require sk_lock for all
    modifications of the timestamps. The current approach allows
    for timestamps to have their own lock: sk_stamp_lock.
    This allows for the patch to not compete with already
    existing critical sections, and side effects are limited
    to the paths in the patch.

    The addition of the new field maintains the data locality
    optimizations from
    commit 9115e8cd2a0c ("net: reorganize struct sock for better data
    locality")

    Note that all the instances of the sk_stamp accesses
    are either through the ioctl or the syscall recvmsg.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Deepa Dinamani
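
    A minimal sketch of the seqlock pattern described above; the structure and
    helper names are illustrative, not the exact upstream code (which only
    needs the seqlock on 32-bit builds):

    #include <linux/ktime.h>
    #include <linux/seqlock.h>

    struct demo_sock {
            seqlock_t stamp_lock;   /* guards the 64-bit stamp on 32-bit CPUs */
            ktime_t   stamp;        /* s64 under the hood */
    };

    static ktime_t demo_read_stamp(const struct demo_sock *dsk)
    {
            unsigned int seq;
            ktime_t kt;

            do {
                    seq = read_seqbegin(&dsk->stamp_lock);
                    kt = dsk->stamp;
            } while (read_seqretry(&dsk->stamp_lock, seq));

            return kt;
    }

    static void demo_write_stamp(struct demo_sock *dsk, ktime_t kt)
    {
            write_seqlock(&dsk->stamp_lock);
            dsk->stamp = kt;
            write_sequnlock(&dsk->stamp_lock);
    }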
     

01 Dec, 2018

1 commit

  • commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

    syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
    in tcp_close()

    While a socket is being closed, it is quite possible that other
    threads find it in an rtnetlink dump.

    tcp_get_info() will acquire the socket lock for a short amount
    of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
    which is enough to trigger the warning (see the sketch below).

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
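
    For reference, a sketch of the lock_sock_fast() pattern mentioned above,
    as used by tcp_get_info() to briefly own the socket (illustrative only):

    #include <net/sock.h>

    static void demo_read_sock_state(struct sock *sk)
    {
            bool slow;

            slow = lock_sock_fast(sk);      /* may or may not take the slow path */
            /* ... copy out socket state while owning the socket ... */
            unlock_sock_fast(sk, slow);     /* release accordingly */
    }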
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

17 Dec, 2017

1 commit

  • [ Upstream commit d7efc6c11b277d9d80b99b1334a78bfe7d7edf10 ]

    Alexander Potapenko reported use of uninitialized memory [1]

    This happens when inserting a request socket into TCP ehash,
    in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.

    The bug was added by commit d894ba18d4e4 ("soreuseport: fix ordering for
    mixed v4/v6 sockets").

    Note that d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6
    ordering fix") missed the opportunity to get rid of
    hlist_nulls_add_tail_rcu():

    Both UDP sockets and TCP/DCCP listeners no longer use
    __sk_nulls_add_node_rcu() for their hash insertion.

    Since all other sockets have a unique 4-tuple, the reuseport status
    has no special meaning there, so we can always use hlist_nulls_add_head_rcu()
    for them and save a few cycles/instructions (see the sketch below).

    [1]

    ==================================================================
    BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     
     __dump_stack lib/dump_stack.c:16
     dump_stack+0x185/0x1d0 lib/dump_stack.c:52
     kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016
     __msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766
     __sk_nulls_add_node_rcu ./include/net/sock.h:684
     inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413
     reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754
     inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765
     tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414
     tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314
     tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917
     tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483
     tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763
     ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216
     NF_HOOK ./include/linux/netfilter.h:248
     ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257
     dst_input ./include/net/dst.h:477
     ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397
     NF_HOOK ./include/linux/netfilter.h:248
     ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488
     __netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298
     __netif_receive_skb net/core/dev.c:4336
     netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497
     napi_skb_finish net/core/dev.c:4858
     napi_gro_receive+0x629/0xa50 net/core/dev.c:4889
     e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018
     e1000_clean_rx_irq+0x1492/0x1d30
    drivers/net/ethernet/intel/e1000/e1000_main.c:4474
     e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
     napi_poll net/core/dev.c:5500
     net_rx_action+0x73c/0x1820 net/core/dev.c:5566
     __do_softirq+0x4b4/0x8dd kernel/softirq.c:284
     invoke_softirq kernel/softirq.c:364
     irq_exit+0x203/0x240 kernel/softirq.c:405
     exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638
     do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263
     common_interrupt+0x86/0x86

    Fixes: d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
    Fixes: d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix")
    Signed-off-by: Eric Dumazet
    Reported-by: Alexander Potapenko
    Acked-by: Craig Gallek
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
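
    A simplified sketch of the resulting insertion helper described above
    (names other than hlist_nulls_add_head_rcu() are illustrative):

    #include <linux/rculist_nulls.h>
    #include <net/sock.h>

    static inline void demo_sk_nulls_add_node_rcu(struct sock *sk,
                                                  struct hlist_nulls_head *list)
    {
            /* Previously the insert position depended on sk->sk_reuseport,
             * which is uninitialized for request sockets.  For TCP ehash the
             * 4-tuple is unique, so head insertion is always correct. */
            hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
    }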
     

22 Sep, 2017

1 commit

  • In linux-4.13, Wei worked hard to convert dst to a traditional
    refcounted model, removing GC.

    We now want to make sure a dst refcount can not transition from 0 back
    to 1.

    The problem here is that the input path attached a non-refcounted dst to
    an skb. Later, because the packet is forwarded and hits skb_dst_force()
    before exiting the RCU section, we might try to take a refcount on a dst
    that is about to be freed, if another cpu saw the 1 -> 0 transition in
    dst_release() and queued the dst for freeing after one RCU grace period.

    Let's unify skb_dst_force() and skb_dst_force_safe(), since we should
    always perform the complete check against the dst refcount, and not assume
    it is not zero (see the sketch below).

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=197005

    [ 989.919496] skb_dst_force+0x32/0x34
    [ 989.919498] __dev_queue_xmit+0x1ad/0x482
    [ 989.919501] ? eth_header+0x28/0xc6
    [ 989.919502] dev_queue_xmit+0xb/0xd
    [ 989.919504] neigh_connected_output+0x9b/0xb4
    [ 989.919507] ip_finish_output2+0x234/0x294
    [ 989.919509] ? ipt_do_table+0x369/0x388
    [ 989.919510] ip_finish_output+0x12c/0x13f
    [ 989.919512] ip_output+0x53/0x87
    [ 989.919513] ip_forward_finish+0x53/0x5a
    [ 989.919515] ip_forward+0x2cb/0x3e6
    [ 989.919516] ? pskb_trim_rcsum.part.9+0x4b/0x4b
    [ 989.919518] ip_rcv_finish+0x2e2/0x321
    [ 989.919519] ip_rcv+0x26f/0x2eb
    [ 989.919522] ? vlan_do_receive+0x4f/0x289
    [ 989.919523] __netif_receive_skb_core+0x467/0x50b
    [ 989.919526] ? tcp_gro_receive+0x239/0x239
    [ 989.919529] ? inet_gro_receive+0x226/0x238
    [ 989.919530] __netif_receive_skb+0x4d/0x5f
    [ 989.919532] netif_receive_skb_internal+0x5c/0xaf
    [ 989.919533] napi_gro_receive+0x45/0x81
    [ 989.919536] ixgbe_poll+0xc8a/0xf09
    [ 989.919539] ? kmem_cache_free_bulk+0x1b6/0x1f7
    [ 989.919540] net_rx_action+0xf4/0x266
    [ 989.919543] __do_softirq+0xa8/0x19d
    [ 989.919545] irq_exit+0x5d/0x6b
    [ 989.919546] do_IRQ+0x9c/0xb5
    [ 989.919548] common_interrupt+0x93/0x93
    [ 989.919548]

    Similarly dst_clone() can use dst_hold() helper to have additional
    debugging, as a follow up to commit 44ebe79149ff ("net: add debug
    atomic_inc_not_zero() in dst_hold()")

    In net-next we will convert dst atomic_t to refcount_t for peace of
    mind.

    Fixes: a4c2fd7f7891 ("net: remove DST_NOCACHE flag")
    Signed-off-by: Eric Dumazet
    Cc: Wei Wang
    Reported-by: Paweł Staszewski
    Bisected-by: Paweł Staszewski
    Acked-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Eric Dumazet
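
    A rough sketch of the unified skb_dst_force() behaviour described above:
    keep the reference only if dst_hold_safe() (an atomic_inc_not_zero) wins
    the race, otherwise drop the dst. This is a sketch, not the exact patch:

    #include <linux/skbuff.h>
    #include <net/dst.h>

    static inline void demo_skb_dst_force(struct sk_buff *skb)
    {
            if (skb_dst_is_noref(skb)) {
                    struct dst_entry *dst = skb_dst(skb);

                    if (!dst_hold_safe(dst))
                            dst = NULL;     /* refcount already hit zero */

                    skb->_skb_refdst = (unsigned long)dst;
            }
    }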
     

30 Aug, 2017

1 commit

  • Florian reported UDP xmit drops that could be root caused to the
    too small neigh limit.

    The current limit is 64 KB, meaning that even a single UDP socket would hit
    it, since its default sk_sndbuf comes from net.core.wmem_default
    (~212992 bytes on 64-bit arches).

    Once ARP/ND resolution is in progress, we should allow a few more
    packets to be queued, at least for one producer.

    Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
    limit and either block in sendmsg() or return -EAGAIN.

    Signed-off-by: Eric Dumazet
    Reported-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
    headers from UDP packets before queueing"), when udp packets are being
    peeked the requested extra offset is always 0 as there is no need to skip
    the udp header. However, when the offset is 0 and the next skb is
    of length 0, it is only returned once. The behaviour can be seen with
    the following python script:

    from socket import *;
    f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    f.bind(('::', 0));
    addr=('::1', f.getsockname()[1]);
    g.sendto(b'', addr)
    g.sendto(b'b', addr)
    print(f.recvfrom(10, MSG_PEEK));
    print(f.recvfrom(10, MSG_PEEK));

    Where the expected output should be the empty string twice.

    Instead, make sk_peek_offset return negative values, and pass those values
    to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the offset passed
    to __skb_try_recv_from_queue is negative, the checked skb is never skipped
    (see the conceptual sketch at the end of this entry).
    __skb_try_recv_from_queue will then ensure the offset is reset back to 0
    if a peek is requested without an offset, unless no packets are found.

    Also simplify the if condition in __skb_try_recv_from_queue. If _off is
    greater than 0, and off is greater than or equal to skb->len, then
    (_off || skb->len) must always be true, assuming skb->len >= 0 is always
    true.

    Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
    as it double checked if MSG_PEEK was set in the flags.

    V2:
    - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
    redundant checks
    - Fix peeking in udp{,v6}_recvmsg to report the right value when the
    offset is 0

    V3:
    - Marked new branch in __skb_try_recv_from_queue as unlikely.

    Signed-off-by: Matthew Dawson
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Matthew Dawson
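
    A conceptual sketch of the negative-offset rule described above (not the
    exact kernel code): a negative peek offset means "never skip", so a
    zero-length datagram at the head of the queue is still returned:

    #include <linux/skbuff.h>

    static struct sk_buff *demo_peek(struct sk_buff_head *queue, int off)
    {
            struct sk_buff *skb;

            skb_queue_walk(queue, skb) {
                    if (off > 0 && off >= skb->len) {
                            off -= skb->len;        /* already peeked past */
                            continue;
                    }
                    return skb;                     /* off < 0: take the first skb */
            }
            return NULL;
    }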
     

04 Aug, 2017

2 commits

  • The kernel supports zerocopy sendmsg in virtio and tap. Expand the
    infrastructure to support other socket types. Introduce a completion
    notification channel over the socket error queue. Notifications are
    returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
    blocking the send/recv path on receiving notifications.

    Add reference counting, to support the skb split, merge, resize and
    clone operations possible with SOCK_STREAM and other socket types.

    The patch does not yet modify any datapaths. (A hedged userspace sketch of
    consuming these notifications appears after this date's entries.)

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Add sock_omalloc and sock_ofree to be able to allocate control skbs,
    for instance for looping errors onto sk_error_queue.

    The transmit budget (sk_wmem_alloc) is involved in transmit skb
    shaping, most notably in TCP Small Queues. Using this budget for
    control packets would impact transmission.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
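
    Referring back to the zerocopy notification entry above, a hedged
    userspace sketch of draining such completion notifications from the error
    queue; the constants follow the description above, details may differ:

    #include <linux/errqueue.h>
    #include <string.h>
    #include <sys/socket.h>

    static void demo_drain_zerocopy_notifications(int fd)
    {
            char control[128];
            struct msghdr msg;
            struct cmsghdr *cm;
            struct sock_extended_err *serr;

            memset(&msg, 0, sizeof(msg));
            msg.msg_control = control;
            msg.msg_controllen = sizeof(control);

            if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
                    return;                         /* nothing queued */

            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                    serr = (struct sock_extended_err *)CMSG_DATA(cm);
                    if (serr->ee_errno == 0 &&
                        serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                            /* serr->ee_info .. serr->ee_data identify the
                             * completed sends */
                    }
            }
    }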
     

02 Aug, 2017

1 commit

  • Add new proto_ops sendmsg_locked and sendpage_locked that can be
    called when the socket lock is already held. Correspondingly, add
    kernel_sendmsg_locked and kernel_sendpage_locked as front end
    functions.

    These functions will be used in zero proxy so that we can take
    the socket lock in a ULP sendmsg/sendpage and then directly call the
    backend transport proto_ops functions (see the sketch below).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
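
    A sketch of the calling convention described above; the surrounding ULP
    code is hypothetical:

    #include <linux/net.h>
    #include <net/sock.h>

    static int demo_ulp_send(struct sock *sk, struct kvec *vec, size_t len)
    {
            struct msghdr msg = { };
            int ret;

            lock_sock(sk);
            /* ... ULP bookkeeping that must stay under the socket lock ... */
            ret = kernel_sendmsg_locked(sk, &msg, vec, 1, len);
            release_sock(sk);

            return ret;
    }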
     

19 Jul, 2017

1 commit

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     

13 Jul, 2017

1 commit


08 Jul, 2017

1 commit

    sock_graft() unilaterally sets up parent->sk based on the
    assumption that the existing parent->sk is NULL. If this
    condition is not true, the existing parent->sk would
    be leaked, so add a WARN_ON() to alert callers who may fall
    into this category (see the sketch below).

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
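
    A simplified sketch of the check being added; the real sock_graft() also
    transfers the wait queue and a few other fields:

    #include <net/sock.h>

    static inline void demo_sock_graft(struct sock *sk, struct socket *parent)
    {
            WARN_ON(parent->sk);    /* grafting over a live sk would leak it */

            write_lock_bh(&sk->sk_callback_lock);
            parent->sk = sk;
            sk_set_socket(sk, parent);
            write_unlock_bh(&sk->sk_callback_lock);
    }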
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP for cases where using the sch_fq
    packet scheduler for this is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brakmo. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

01 Jul, 2017

3 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the hint() version must
    be used, we might need to revisit the API. (A generic sketch of the
    conversion pattern follows after this date's entries.)

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain function pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
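
    Referring back to the refcount_t conversion entries above, a generic
    sketch of the pattern; the object and helpers here are illustrative, not
    any specific converted field:

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct demo_obj {
            refcount_t refcnt;
    };

    static bool demo_obj_get(struct demo_obj *obj)
    {
            /* replaces atomic_inc_not_zero()/atomic_inc_not_zero_hint() */
            return refcount_inc_not_zero(&obj->refcnt);
    }

    static void demo_obj_put(struct demo_obj *obj)
    {
            /* saturates and warns on over/underflow instead of wrapping */
            if (refcount_dec_and_test(&obj->refcnt))
                    kfree(obj);
    }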
     

21 Jun, 2017

1 commit

    For a connected socket, the incoming_cpu field in the sock struct
    is not going to change frequently, but we are setting it
    unconditionally for each packet.

    Since sk_incoming_cpu and sk_flags share the same cacheline,
    and the latter is accessed by udp_recvmsg(), this causes a cache
    miss for each packet on a connected UDP socket.

    With this patch, we set the incoming cpu field only when the
    ingress cpu really changes (see the sketch below).

    This gives a small but measurable performance improvement for
    connected UDP sockets.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
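
    A sketch of the change described above: only dirty the cacheline when the
    ingress CPU actually changed (helper name illustrative):

    #include <linux/smp.h>
    #include <net/sock.h>

    static inline void demo_sk_incoming_cpu_update(struct sock *sk)
    {
            int cpu = raw_smp_processor_id();

            if (unlikely(sk->sk_incoming_cpu != cpu))
                    sk->sk_incoming_cpu = cpu;
    }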
     

08 Jun, 2017

1 commit

  • DRAM supply shortage and poor memory pressure tracking in TCP
    stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
    limits) and tcp_mem[] quite hazardous.

    TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
    limits being hit, but only tracking number of transitions.

    If TCP stack behavior under stress were perfect:
    1) It would maintain memory usage close to the limit.
    2) Memory pressure state would be entered for short times.

    We certainly prefer 100 events lasting 10ms compared to one event
    lasting 200 seconds.

    This patch adds a new SNMP counter tracking cumulative duration of
    memory pressure events, given in ms units.

    $ cat /proc/sys/net/ipv4/tcp_mem
    3088 4117 6176
    $ grep TCP /proc/net/sockstat
    TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
    $ nstat -n ; sleep 10 ; nstat |grep Pressure
    TcpExtTCPMemoryPressures 1700
    TcpExtTCPMemoryPressuresChrono 5209

    v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
    instructed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 May, 2017

1 commit

  • Mauro says:

    This patch series convert the remaining DocBooks to ReST.

    The first version was originally
    sent as 3 patch series:

    [PATCH 00/36] Convert DocBook documents to ReST
    [PATCH 0/5] Convert more books to ReST
    [PATCH 00/13] Get rid of DocBook

    The lsm book was added as if it were a text file under
    Documentation. The plan is to merge it with another file
    under Documentation/security, after both this series and
    a security Documentation patch series gets merged.

    It also adjusts some Sphinx-pedantic errors/warnings on
    some kernel-doc markups.

    I also added some patches here to add PDF output for all
    existing ReST books.

    Jonathan Corbet
     

17 May, 2017

2 commits

    BBR congestion control depends on pacing, and pacing is
    currently handled by the sch_fq packet scheduler for performance reasons,
    and also because implementing pacing with FQ was convenient to truly
    avoid bursts.

    However there are many cases where this packet scheduler constraint
    is not practical.
    - Many linux hosts are not focusing on handling thousands of TCP
    flows in the most efficient way.
    - Some routers use fq_codel or other AQM, but still would like
    to use BBR for the few TCP flows they initiate/terminate.

    This patch implements an automatic fallback to internal pacing.

    Pacing is requested either by BBR or by use of the SO_MAX_PACING_RATE option.

    If sch_fq happens to be in the egress path, pacing is delegated to
    the qdisc, otherwise pacing is done by TCP itself (see the sketch after
    this date's entries).

    One advantage of pacing from the TCP stack is to get more precise rtt
    estimations, and less work done from TX completion, since TCP Small
    Queues limits are not generally hit. Setups with a single TX queue but
    many cpus might even benefit from this.

    Note that unlike sch_fq, we do not take into account header sizes.
    Taking care of these headers would add additional complexity for
    no practical differences in behavior.

    Some performance numbers using 800 TCP_STREAM flows rate limited to
    ~48 Mbit per second on 40Gbit NIC.

    If MQ+pfifo_fast is used on the NIC :

    $ sar -n DEV 1 5 | grep eth
    14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
    14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
    14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
    14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
    14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
    Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
    4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
    0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
    1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
    1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0

    Now use MQ+FQ :

    lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
    lpaa23:~# tc qdisc replace dev eth0 root mq

    $ sar -n DEV 1 5 | grep eth
    14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
    14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
    14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
    14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
    14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
    Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
    1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
    0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
    4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
    3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0

    As expected, number of interrupts per second is very different.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • And update __sk_queue_drop_skb() to work on the specified queue.
    This will help the udp protocol to use an additional private
    rx queue in a later patch.

    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
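
    Referring back to the internal-pacing entry above, a sketch of the
    fallback decision; the sk_pacing_status values follow the upstream design,
    but treat the exact names here as an assumption:

    #include <net/sock.h>

    static bool demo_needs_internal_pacing(const struct sock *sk)
    {
            /* sch_fq marks the socket as SK_PACING_FQ when it takes over;
             * otherwise TCP paces by itself once pacing was requested. */
            return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
    }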
     

16 May, 2017

1 commit

  • Sphinx is very pedantic with regards to identation and
    escape sequences:

    ./include/net/sock.h:1967: ERROR: Unexpected indentation.
    ./include/net/sock.h:1969: ERROR: Unexpected indentation.
    ./include/net/sock.h:1970: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./include/net/sock.h:1971: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./include/net/sock.h:2268: WARNING: Inline emphasis start-string without end-string.
    ./net/core/sock.c:2686: ERROR: Unexpected indentation.
    ./net/core/sock.c:2687: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./net/core/datagram.c:182: WARNING: Inline emphasis start-string without end-string.
    ./include/linux/netdevice.h:1444: ERROR: Unexpected indentation.
    ./drivers/net/phy/phy.c:381: ERROR: Unexpected indentation.
    ./drivers/net/phy/phy.c:382: WARNING: Block quote ends without a blank line; unexpected unindent.

    - Fix spacing where needed;
    - Properly escape constants;
    - Use a literal block for a race description.

    No functional changes.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

23 Apr, 2017

1 commit


19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion (see the revalidation
    sketch below).

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
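
    A hedged sketch of what the renamed flag does and does not guarantee: the
    memory stays valid memory of the same type across an RCU read-side
    critical section, but the object may have been freed and reused, so it
    must be revalidated after a reference is taken. All demo_* names are
    hypothetical:

    #include <linux/rcupdate.h>
    #include <linux/refcount.h>
    #include <linux/types.h>

    struct demo_obj {
            u32 key;
            refcount_t refcnt;
    };

    struct demo_obj *demo_hash_find(u32 key);       /* hypothetical RCU lookup */
    void demo_obj_put(struct demo_obj *obj);        /* hypothetical ref drop */

    static struct demo_obj *demo_lookup(u32 key)
    {
            struct demo_obj *obj;

            rcu_read_lock();
            obj = demo_hash_find(key);
            if (obj && !refcount_inc_not_zero(&obj->refcnt))
                    obj = NULL;                     /* object is being freed */
            if (obj && obj->key != key) {           /* slab block was reused */
                    demo_obj_put(obj);
                    obj = NULL;
            }
            rcu_read_unlock();

            return obj;
    }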
     

03 Apr, 2017

1 commit


31 Mar, 2017

1 commit

    sock_recv_ts_and_drops() unconditionally sets sk->sk_stamp for
    every packet, even if the SOCK_TIMESTAMP flag is not set on the
    related socket.
    If SELinux is enabled, this causes a cache miss for every packet,
    since sk->sk_stamp and sk->sk_security share the same cacheline.
    With this change sk_stamp is set only if the SOCK_TIMESTAMP
    flag is set, and is cleared for the first packet, so that the
    user-perceived behavior is unchanged (see the sketch below).

    This gives up to 5% speed-up under udp-flood with small packets.

    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
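
    A sketch of the behaviour described above (simplified; the real helper
    also handles the first-packet reset mentioned in the log):

    #include <net/sock.h>

    static inline void demo_sock_recv_ts(struct sock *sk, struct sk_buff *skb)
    {
            if (sock_flag(sk, SOCK_TIMESTAMP))
                    sk->sk_stamp = skb->tstamp;     /* only then touch the field */
    }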
     

23 Mar, 2017

1 commit

    Allows reading of SK_MEMINFO_VARS via a socket option. This way an
    application can get all meminfo-related information in a single socket
    option call instead of multiple calls (see the userspace sketch below).

    Adds a helper function, sk_get_meminfo(), and uses it for both
    getsockopt and sock_diag_put_meminfo().

    Suggested by Eric Dumazet.

    Signed-off-by: Josh Hunt
    Reviewed-by: Jason Baron
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Josh Hunt
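
    A hedged userspace sketch of the single-call read described above; the
    option name SO_MEMINFO is what eventually landed upstream, but treat it
    (and header availability) as an assumption:

    #include <linux/sock_diag.h>    /* SK_MEMINFO_* indices */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static void demo_dump_meminfo(int fd)
    {
            uint32_t mem[SK_MEMINFO_VARS];
            socklen_t len = sizeof(mem);

            if (getsockopt(fd, SOL_SOCKET, SO_MEMINFO, mem, &len) == 0)
                    printf("rmem_alloc=%u rcvbuf=%u wmem_alloc=%u\n",
                           mem[SK_MEMINFO_RMEM_ALLOC],
                           mem[SK_MEMINFO_RCVBUF],
                           mem[SK_MEMINFO_WMEM_ALLOC]);
    }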
     

16 Mar, 2017

1 commit


10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_rcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

09 Mar, 2017

1 commit


03 Mar, 2017

1 commit

  • When handling problems in cloning a socket with the sk_clone_locked()
    function we need to perform several steps that were open coded in it and
    its callers, so introduce a routine to avoid this duplication:
    sk_free_unlock_clone().

    Cc: Cong Wang
    Cc: Dmitry Vyukov
    Cc: Eric Dumazet
    Cc: Gerrit Renker
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/n/net-ui6laqkotycunhtmqryl9bfx@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

08 Feb, 2017

4 commits

  • The conflict was an interaction between a bug fix in the
    netvsc driver in 'net' and an optimization of the RX path
    in 'net-next'.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add new skbuff flag to allow protocols to confirm neighbour.
    When the same struct dst_entry can be used for many different
    neighbours, we cannot use it for pending confirmations.

    Add sock_confirm_neigh() helper to confirm the neighbour and
    use it for IPv4, IPv6 and VRF before dst_neigh_output.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Add new sock flag to allow sockets to confirm neighbour.
    When the same struct dst_entry can be used for many different
    neighbours, we cannot use it for pending confirmations.
    As not all call paths lock the socket, use a full word for
    the flag.

    Add sk_dst_confirm as replacement for dst_confirm when
    called for received packets.

    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Dmitry reported that UDP sockets being destroyed would trigger the
    WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct()

    It turns out we do not properly destroy skb(s) that have a wrong UDP
    checksum.

    Thanks again to syzkaller team.

    Fixes: 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Cc: Hannes Frederic Sowa
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jan, 2017

1 commit

  • Slava Shwartsman reported a warning in skb_try_coalesce(), when we
    detect skb->truesize is completely wrong.

    In his case, the issue came from IPv6 reassembly coping with malicious
    datagrams that forced various pskb_may_pull() calls to reallocate a bigger
    skb->head than the one allocated by the NIC driver before entering the GRO
    layer.

    Current code does not change skb->truesize, leaving this burden to
    callers if they care enough.

    Blindly changing skb->truesize in pskb_expand_head() is not
    easy, as some producers might track skb->truesize, for example
    in xmit path for back pressure feedback (sk->sk_wmem_alloc)

    We can detect the cases where it should be safe to change
    skb->truesize (see the sketch below):

    1) skb is not attached to a socket.
    2) If it is attached to a socket, the destructor is sock_edemux().

    My audit gave only two callers doing their own skb->truesize
    manipulation.

    I had to remove the skb parameter from the sock_edemux macro when
    CONFIG_INET is not set, to avoid a compile error.

    Signed-off-by: Eric Dumazet
    Reported-by: Slava Shwartsman
    Signed-off-by: David S. Miller

    Eric Dumazet
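
    A sketch of the safety test described above: truesize may be fixed up only
    when no socket is accounting for the skb, or when the destructor is
    sock_edemux() (helper name illustrative):

    #include <linux/skbuff.h>
    #include <net/sock.h>

    static inline bool demo_can_adjust_truesize(const struct sk_buff *skb)
    {
            return !skb->sk || skb->destructor == sock_edemux;
    }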
     

21 Jan, 2017

1 commit