Eric Lee / smarc-fsl-linux-kernel

10 Jan, 2019

1 commit

3bade4b76 tcp: fix a race in inet_diag_dump_icsk()

[ Upstream commit f0c928d878e7d01b613c9ae5c971a6b1e473a938 ]

Alexei reported use after frees in inet_diag_dump_icsk() [1]

Because we use refcount_set() when various sockets are set up and
inserted into ehash, we also need to make sure inet_diag_dump_icsk()
won't race with the refcount_set() operations.

Jonathan Lemon sent a patch changing inet_twsk_hashdance() but
other spots would need risky changes.

Instead, fix inet_diag_dump_icsk() as this bug came with
linux-4.10 only.
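
The race described above can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming the safe pattern for a lockless iterator is "take a reference only if the count is already nonzero". It models the counter with C11 atomics; `struct ref` and `ref_inc_not_zero()` are hypothetical stand-ins, not the kernel's refcount_t implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel reference counter. */
struct ref {
    atomic_uint count;
};

/*
 * Take a reference only if the object is live (count > 0).
 * Returns false when the count is 0, i.e. the object is either not
 * fully set up yet or already on its way to being freed, so the
 * caller must skip it instead of touching it.
 */
static bool ref_inc_not_zero(struct ref *r)
{
    unsigned int old = atomic_load(&r->count);

    while (old != 0) {
        /* On failure 'old' is refreshed and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}
```

An unconditional increment on such a counter is what produces the "increment on 0; use-after-free" warning quoted below; the conditional form lets the iterator simply skip the entry.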

[1] Quoting Alexei :

First something iterating over sockets finds already freed tw socket:

refcount_t: increment on 0; use-after-free.
WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
RIP: 0010:refcount_inc+0x26/0x30
RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
FS: 00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag] // sock_hold(sk); in net/ipv4/inet_diag.c:1002
? kmalloc_large_node+0x37/0x70
? __kmalloc_node_track_caller+0x1cb/0x260
? __alloc_skb+0x72/0x1b0
? __kmalloc_reserve.isra.40+0x2e/0x80
__inet_diag_dump+0x3b/0x80 [inet_diag]
netlink_dump+0x116/0x2a0
netlink_recvmsg+0x205/0x3c0
sock_read_iter+0x89/0xd0
__vfs_read+0xf7/0x140
vfs_read+0x8a/0x140
SyS_read+0x3f/0xa0
do_syscall_64+0x5a/0x100

then a minute later twsk timer fires and hits two bad refcnts
for this freed socket:

refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
Modules linked in:
RIP: 0010:refcount_dec+0x2e/0x40
RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:

inet_twsk_kill+0x9d/0xc0 // inet_twsk_bind_unhash(tw, hashinfo);
call_timer_fn+0x29/0x110
run_timer_softirq+0x36b/0x3a0

refcount_t: underflow; use-after-free.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
RIP: 0010:refcount_sub_and_test+0x46/0x50
RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:

inet_twsk_put+0x12/0x20 // inet_twsk_put(tw);
call_timer_fn+0x29/0x110
run_timer_softirq+0x36b/0x3a0

Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet
Reported-by: Alexei Starovoitov
Cc: Jonathan Lemon
Acked-by: Jonathan Lemon
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Dumazet
2019-01-10 00:14:44 +0800

02 Sep, 2017

1 commit

b37e88407 inet_diag: allow protocols to provide additional data

Extend inet_diag_handler to allow individual protocols to report
additional data on INET_DIAG_INFO through idiag_get_aux. The size
can be dynamic and is computed by idiag_get_aux_size.
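
As an illustration of the two-callback design, here is a simplified sketch. The struct layout, the function signatures, and the 16-byte example payload are assumptions made for this example; the real inet_diag_handler callbacks take kernel types and differ in detail. The point is only that the reply size is computed from an optional per-protocol size callback before the message is filled.

```c
#include <stddef.h>

/* Hypothetical, simplified handler with optional aux callbacks. */
struct proto_diag_handler {
    size_t (*get_aux_size)(const void *sk);            /* may be NULL */
    int (*get_aux)(const void *sk, unsigned char *buf,
                   size_t len);                        /* may be NULL */
};

/* Example protocol that always reports 16 bytes of extra data. */
static size_t example_aux_size(const void *sk)
{
    (void)sk;
    return 16;
}

/* Reply size = fixed part + whatever the protocol asks for, if anything. */
static size_t reply_size(size_t base, const struct proto_diag_handler *h,
                         const void *sk)
{
    size_t aux = (h && h->get_aux_size) ? h->get_aux_size(sk) : 0;

    return base + aux;
}
```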

Signed-off-by: Ivan Delalande
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Ivan Delalande
2017-09-02 09:38:09 +0800

19 Aug, 2017

1 commit

0888e372c net: inet: diag: expose sockets cgroup classid

This is useful for directly looking up a task based on class id rather than
having to scan through all open file descriptors.

Signed-off-by: Sasha Levin
Signed-off-by: David S. Miller

Levin, Alexander (Sasha Levin)
2017-08-19 07:10:50 +0800

14 Jan, 2017

2 commits

bec41a11d tcp: remove early retransmit

This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with
Signed-off-by: Neal Cardwell
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2017-01-14 11:37:16 +0800

57dde7f70 tcp: add reordering timer in RACK loss detection

This patch makes RACK install a reordering timer when it suspects
some packets might be lost, but wants to delay the decision
a little bit to accommodate reordering.

It does not create a new timer but instead repurposes the existing
RTO timer, because both are meant to retransmit packets.
Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
the RACK timing check fails. The wait time is set to

RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

This translates to expecting a packet (Packet) should take
(RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.

When there are multiple packets that need a timer, we use one timer
with the maximum timeout. Therefore the timer conservatively uses
the maximum window to expire N packets by one timeout, instead of
N timeouts to expire N packets sent at different times.

The fudge factor is 2 jiffies to ensure when the timer fires, all
the suspected packets would exceed the deadline and be marked lost
by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
clock may tick between calling icsk_reset_xmit_timer(timeout) and
actually hanging the timer. The next jiffy is to lower-bound the timeout
to 2 jiffies when reo_wnd is < 1ms.
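
The timeout arithmetic above can be sketched as follows. This is an illustrative model, not the kernel code: it assumes every quantity is already expressed in jiffies, and `struct pkt`, `rack_pkt_timeout()` and `rack_reo_timeout()` are hypothetical names introduced for the example.

```c
#include <stdint.h>

#define REO_FUDGE_JIFFIES 2 /* fudge factor named in the text above */

/* Hypothetical per-packet record; times are in jiffies. */
struct pkt {
    uint32_t xmit_time;
};

/* Remaining wait for one packet:
 * RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge */
static int32_t rack_pkt_timeout(uint32_t rtt, uint32_t reo_wnd,
                                uint32_t now, uint32_t xmit_time)
{
    int32_t elapsed = (int32_t)(now - xmit_time);

    return (int32_t)(rtt + reo_wnd) - elapsed + REO_FUDGE_JIFFIES;
}

/* One timer for N packets: arm it with the maximum remaining wait. */
static int32_t rack_reo_timeout(const struct pkt *pkts, int n,
                                uint32_t rtt, uint32_t reo_wnd,
                                uint32_t now)
{
    int32_t max = 0;

    for (int i = 0; i < n; i++) {
        int32_t t = rack_pkt_timeout(rtt, reo_wnd, now, pkts[i].xmit_time);

        if (t > max)
            max = t;
    }
    return max;
}
```

With RTT = 10, reo_wnd = 3 and NOW = 108, packets sent at 100 and 105 have remaining waits of 7 and 12 jiffies, so a single timer of 12 jiffies covers both.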

When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
in Recovery we'll enter fast recovery and force fast retransmit.
This is very similar to the early retransmit (RFC5827) except RACK
is not constrained to only enter recovery for small outstanding
flights.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Yuchung Cheng
2017-01-14 11:37:16 +0800

10 Nov, 2016

1 commit

67db3e4bf tcp: no longer hold ehash lock while calling tcp_get_info()

We had various problems in the past in tcp_get_info() and used
specific synchronization to avoid deadlocks.

We would like to add more instrumentation points for TCP, and
avoiding grabbing the socket lock in tcp_get_info() was too costly.

Being able to lock the socket allows us to provide a consistent set
of fields.

inet_diag_dump_icsk() can make sure ehash locks are not
held any more when tcp_get_info() is called.

We can remove syncp added in commit d654976cbf85
("tcp: fix a potential deadlock in tcp_get_info()"), but we need
to use lock_sock_fast() instead of spin_lock_bh() since TCP input
path can now be run from process context.

Signed-off-by: Eric Dumazet
Signed-off-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller

Eric Dumazet
2016-11-10 02:02:27 +0800

24 Oct, 2016

1 commit

432490f9d net: ip, diag -- Add diag interface for raw sockets

In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
raw sockets do not have one. Thus add it.

v2:
- add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
- implement @destroy for diag requests (by dsa@)

v3:
- add export of raw_abort for IPv6 (by dsa@)
- pass net-admin flag into inet_sk_diag_fill due to
changes in net-next branch (by dsa@)

v4:
- use @pad in struct inet_diag_req_v2 for raw socket
protocol specification: raw module carries sockets
which may have custom protocol passed from socket()
syscall and sole @sdiag_protocol is not enough to
match the underlying ones
- start reporting protocol specified in socket() call
when sockets are raw ones for the same reason: user
space tools like ss may parse this attribute and use
it for socket matching

v5 (by eric.dumazet@):
- use sock_hold in raw_sock_get instead of atomic_inc,
we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
when looking up so counter won't be zero here.

v6:
- use sdiag_raw_protocol() helper which will access @pad
structure used for raw sockets protocol specification:
we can't simply rename this member without breaking uapi

v7:
- since sdiag_raw_protocol() helper is not suitable for
uapi let's rather make an alias structure with proper
names. __check_inet_diag_req_raw helper will catch
any unintentional change to either structure.

CC: David S. Miller
CC: Eric Dumazet
CC: David Ahern
CC: Alexey Kuznetsov
CC: James Morris
CC: Hideaki YOSHIFUJI
CC: Patrick McHardy
CC: Andrey Vagin
CC: Stephen Hemminger
Signed-off-by: Cyrill Gorcunov
Signed-off-by: David S. Miller

Cyrill Gorcunov
2016-10-24 07:35:24 +0800

20 Oct, 2016

1 commit

9652dc2eb tcp: relax listening_hash operations

softirq handlers use RCU protection to lookup listeners,
and write operations all happen from process context.
We do not need to block BH for dump operations.

Also, since SYN_RECV request sockets are stored in the ehash table :

1) inet_diag_dump_icsk() no longer needs to clear
cb->args[3] and cb->args[4] that were used as cursors while
iterating the old per listener hash table.

2) Also factorize a test : No need to scan listening_hash[]
if r->id.idiag_dport is not zero.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-10-20 23:24:32 +0800

09 Sep, 2016

1 commit

d545caca8 net: inet: diag: expose the socket mark to privileged processes.

This adds the capability for a process that has CAP_NET_ADMIN on
a socket to see the socket mark in socket dumps.

Commit a52e95abf772 ("net: diag: allow socket bytecode filters to
match socket marks") recently gave privileged processes the
ability to filter socket dumps based on mark. This patch is
complementary: it ensures that the mark is also passed to
userspace in the socket's netlink attributes. It is useful for
tools like ss which display information about sockets.

Tested: https://android-review.googlesource.com/270210
Signed-off-by: Lorenzo Colitti
Signed-off-by: David S. Miller

Lorenzo Colitti
2016-09-09 07:13:09 +0800

25 Aug, 2016

2 commits

a52e95abf net: diag: allow socket bytecode filters to match socket marks

This allows a privileged process to filter by socket mark when
dumping sockets via INET_DIAG_BY_FAMILY. This is useful on
systems that use mark-based routing such as Android.

The ability to filter socket marks requires CAP_NET_ADMIN, which
is consistent with other privileged operations allowed by the
SOCK_DIAG interface such as the ability to destroy sockets and
the ability to inspect BPF filters attached to packet sockets.
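
A mark filter of this kind can be sketched as a value/mask comparison guarded by a privilege check. The names below (`struct mark_cond`, `mark_matches()`, `mark_filter_audit()`) are hypothetical, and returning -EPERM for unprivileged callers is an assumption of this sketch rather than something stated in the commit message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filter operand: match when (mark & mask) == value. */
struct mark_cond {
    uint32_t value;
    uint32_t mask;
};

/* True when the socket's mark satisfies the filter condition. */
static bool mark_matches(uint32_t sk_mark, const struct mark_cond *c)
{
    return (sk_mark & c->mask) == c->value;
}

/* Reject mark filters from callers without the required capability.
 * 'net_admin' stands for "caller has CAP_NET_ADMIN" (assumed input). */
static int mark_filter_audit(bool net_admin)
{
    return net_admin ? 0 : -EPERM;
}
```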

Tested: https://android-review.googlesource.com/261350
Signed-off-by: Lorenzo Colitti
Acked-by: David Ahern
Signed-off-by: David S. Miller

Lorenzo Colitti
2016-08-25 12:57:20 +0800

627cc4add net: diag: slightly refactor the inet_diag_bc_audit error checks.

This simplifies the code a bit and also allows inet_diag_bc_audit
to send to userspace an error that isn't EINVAL.

Signed-off-by: Lorenzo Colitti
Acked-by: David Ahern
Signed-off-by: David S. Miller

Lorenzo Colitti
2016-08-25 12:57:20 +0800

28 Jun, 2016

1 commit

637c841dd net: diag: Add support to filter on device index

Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.
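
The filtering rule can be sketched in a few lines. `struct sock_entry` and `count_bound_to()` are hypothetical names for this example; only the comparison against the bound device index reflects the text above.

```c
#include <stddef.h>

/* Hypothetical view of a socket: 0 means "not bound to a device". */
struct sock_entry {
    int bound_dev_if;
};

/* Count the sockets a dump would return for a given interface index:
 * only sockets bound to exactly that index pass the filter. */
static size_t count_bound_to(const struct sock_entry *socks, size_t n,
                             int ifindex)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if (socks[i].bound_dev_if == ifindex)
            hits++;
    }
    return hits;
}
```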

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2016-06-28 17:25:04 +0800

27 Apr, 2016

1 commit

6ed46d124 sock_diag: align nlattr properly when needed

I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f190
which is only in net-next right now, thus I didn't make a separate patch.

Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file")
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2016-04-27 00:00:48 +0800

22 Apr, 2016

1 commit

b7de529c7 net: use jiffies_to_msecs to replace EXPIRES_IN_MS in inet/sctp_diag

EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates
back to before jiffies_to_msecs() has been introduced.

Now we can remove it and use jiffies_to_msecs().
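
For readers unfamiliar with the conversion, here is a userspace sketch of turning a remaining timer interval in jiffies into milliseconds. The tick rate `HZ = 250` and the round-up behaviour are assumptions of this example; `expires_in_ms()` is a hypothetical helper, not the removed macro or the kernel's jiffies_to_msecs().

```c
#include <stdint.h>

#define HZ 250 /* assumed tick rate: one jiffy is 4 ms */

/* Milliseconds left until 'expires', rounded up to a whole ms.
 * Unsigned subtraction keeps the result correct across counter wrap. */
static uint32_t expires_in_ms(uint32_t expires, uint32_t now)
{
    uint64_t delta = (uint32_t)(expires - now);

    return (uint32_t)((delta * 1000u + HZ - 1) / HZ);
}
```

With the assumed tick rate, a timer three jiffies in the future is reported as 12 ms.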

Suggested-by: Jakub Sitnicki
Signed-off-by: Xin Long
Acked-by: Jakub Sitnicki
Acked-by: Marcelo Ricardo Leitner
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Xin Long
2016-04-22 01:55:33 +0800

16 Apr, 2016

1 commit

cb2050a7b sctp: export some functions for sctp_diag in inet_diag

inet_diag_msg_common_fill is used to fill the diag msg common info,
we need to use it in sctp_diag as well, so export it.

inet_diag_msg_attrs_fill is used to fill some common attrs info between
sctp diag and tcp diag.

v2->v3:
- do not need to define and export inet_diag_get_handler any more.
cause all the functions in it are in sctp_diag.ko, we just call
them in sctp_diag.ko.

- add inet_diag_msg_attrs_fill to make codes clear.

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2016-04-16 05:29:36 +0800

05 Apr, 2016

2 commits

3b24d854c tcp/dccp: do not touch listener sk_refcnt under synflood

When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :

On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-04-05 10:11:20 +0800

2d331915a tcp/dccp: use rcu locking in inet_diag_find_one_icsk()

RX packet processing holds rcu_read_lock(), so we can remove
pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
if inet_diag also holds rcu before calling them.

This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment
refcount on the found listener.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-04-05 10:11:19 +0800

15 Mar, 2016

1 commit

acffb584c net: diag: add a scheduling point in inet_diag_dump_icsk()

On loaded TCP servers, looking at millions of sockets can hold
cpu for many seconds, if the lookup condition is very narrow.

(eg : ss dst 1.2.3.4 )

Better add a cond_resched() to allow other processes to access
the cpu.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-03-15 07:38:09 +0800

11 Feb, 2016

1 commit

a583636a8 inet: refactor inet[6]_lookup functions to take skb

This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups. Doing so with a BPF filter will require access to the
skb in question. This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.

Signed-off-by: Craig Gallek
Signed-off-by: David S. Miller

Craig Gallek
2016-02-11 16:54:14 +0800

21 Jan, 2016

1 commit

7c1306723 net: diag: support v4mapped sockets in inet_diag_find_one_icsk()

Lorenzo reported that we could not properly find v4mapped sockets
in inet_diag_find_one_icsk(). This patch fixes the issue.
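
To show what "v4mapped" means for a lookup, the sketch below detects an IPv4-mapped IPv6 address (::ffff:a.b.c.d) and extracts the embedded IPv4 address, which is what a lookup would need in order to search the IPv4 tables. It is a userspace illustration using the standard `IN6_IS_ADDR_V4MAPPED` macro; `v4mapped_to_v4()` is a hypothetical helper, and the way the kernel patch actually performs the lookup is not shown here.

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* If 'a6' is an IPv4-mapped IPv6 address, copy the embedded IPv4
 * address (the last four bytes, network byte order) into 'a4' and
 * return true. Otherwise leave 'a4' untouched and return false. */
static bool v4mapped_to_v4(const struct in6_addr *a6, struct in_addr *a4)
{
    if (!IN6_IS_ADDR_V4MAPPED(a6))
        return false;

    memcpy(&a4->s_addr, &a6->s6_addr[12], sizeof(a4->s_addr));
    return true;
}
```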

Reported-by: Lorenzo Colitti
Signed-off-by: Eric Dumazet
Acked-by: Lorenzo Colitti
Signed-off-by: David S. Miller

Eric Dumazet
2016-01-21 10:51:31 +0800

16 Dec, 2015

2 commits

6eb5d2e08 net: diag: Support SOCK_DESTROY for inet sockets.

This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.
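
The dispatch rule can be sketched with a table of optional function pointers. `struct diag_handler`, `example_destroy()` and `diag_destroy()` are hypothetical names with simplified signatures; only the "return -EOPNOTSUPP when no destroy operation is defined" behaviour comes from the text above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified per-protocol diag handler. */
struct diag_handler {
    int (*dump_one)(int sock_id);
    int (*destroy)(int sock_id); /* optional: may be NULL */
};

/* Example destroy operation that always succeeds. */
static int example_destroy(int sock_id)
{
    (void)sock_id;
    return 0;
}

/* Pass the destroy request to the protocol handler, or report that
 * the operation is not supported if the handler does not define it. */
static int diag_destroy(const struct diag_handler *h, int sock_id)
{
    if (h == NULL || h->destroy == NULL)
        return -EOPNOTSUPP;

    return h->destroy(sock_id);
}
```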

Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.

Signed-off-by: Lorenzo Colitti
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Lorenzo Colitti
2015-12-16 12:26:51 +0800

b613f56ec net: diag: split inet_diag_dump_one_icsk into two

Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.

Signed-off-by: Lorenzo Colitti
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Lorenzo Colitti
2015-12-16 12:26:51 +0800

03 Oct, 2015

2 commits

079096f10 tcp/dccp: install syn_recv requests into ehash table

In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet
Cc: Ying Cai
Cc: Willem de Bruijn
Signed-off-by: David S. Miller

Eric Dumazet
2015-10-03 19:32:41 +0800

aac065c50 tcp: move qlen/young out of struct listen_sock

qlen_inc & young_inc were protected by listener lock,
while qlen_dec & young_dec were atomic fields.

Everything needs to be atomic for upcoming lockless listener.

Also move qlen/young in request_sock_queue as we'll get rid
of struct listen_sock eventually.
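
A rough userspace picture of "everything needs to be atomic": once no lock serializes the increments, both counters have to be updated with atomic operations on every path. The sketch uses C11 atomics; `struct req_queue`, `queue_added()` and `queue_removed()` are hypothetical names, and treating "young" as a flag passed by the caller is a simplification made for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counters kept alongside a request queue. */
struct req_queue {
    atomic_int qlen;  /* pending requests */
    atomic_int young; /* pending requests still counted as young */
};

/* A new request starts out young; no listener lock is assumed. */
static void queue_added(struct req_queue *q)
{
    atomic_fetch_add(&q->qlen, 1);
    atomic_fetch_add(&q->young, 1);
}

/* Removing a request decrements 'young' only if it was still young. */
static void queue_removed(struct req_queue *q, bool was_young)
{
    if (was_young)
        atomic_fetch_sub(&q->young, 1);
    atomic_fetch_sub(&q->qlen, 1);
}
```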

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-10-03 19:32:36 +0800

11 Jul, 2015

1 commit

8220ea232 net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets

Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
sockopt", I am not happy with the limitations it causes for socket
analysing code in userspace. Exporting the value only if it is set makes
it hard for userspace to decide whether the option is not set or the
kernel does not support exporting the option at all.

From an auditor's perspective, the interesting question for listening
AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
the unexpected case. This patch allows answering this question reliably.

Signed-off-by: Phil Sutter
Cc: Eric Dumazet
Signed-off-by: David S. Miller

Phil Sutter
2015-07-11 14:25:24 +0800

24 Jun, 2015

1 commit

204621551 net: inet_diag: export IPV6_V6ONLY sockopt

For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
exported to userspace. It indicates whether a socket bound to in6addr_any
listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
listed by e.g. 'ss -l -4'.

This patch is accompanied by an appropriate one for iproute2 to enable
the additional information in 'ss -e'.

Signed-off-by: Phil Sutter
Signed-off-by: David S. Miller

Phil Sutter
2015-06-24 17:51:39 +0800

23 Jun, 2015

1 commit

3b1884435 inet_diag: Remove _bh suffix in inet_diag_dump_reqs().

inet_diag_dump_reqs() is called from inet_diag_dump_icsk() with BH
disabled. So no need to disable BH in inet_diag_dump_reqs().

Signed-off-by: Hiroaki Shimoda
Signed-off-by: David S. Miller

Hiroaki SHIMODA
2015-06-23 16:19:52 +0800

22 Jun, 2015

1 commit

e0df02e0c sock_diag: fetch source port from inet_sock

When an inet_sock is destroyed, its source port (sk_num) is set to
zero as part of the unhash procedure. In order to supply a source
port as part of the NETLINK_SOCK_DIAG socket destruction broadcasts,
the source port number must be read from inet_sport instead.

Tested: ss -E
Signed-off-by: Craig Gallek
Signed-off-by: David S. Miller

Craig Gallek
2015-06-22 01:16:50 +0800

16 Jun, 2015

2 commits

35ac838a9 sock_diag: implement a get_info handler for inet

This get_info handler will simply dispatch to the appropriate
existing inet protocol handler.

This patch also includes a new netlink attribute
(INET_DIAG_PROTOCOL). This attribute is currently only used
for multicast messages. Without this attribute, there is no
way of knowing the IP protocol used by the socket information
being broadcast. This attribute is not necessary in the 'dump'
variant of this protocol (though it could easily be added)
because dump requests are issued for specific family/protocol
pairs.

Tested: ss -E (note, the -E option has not yet been merged into
the upstream version of ss).

Signed-off-by: Craig Gallek
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Craig Gallek
2015-06-16 10:49:22 +0800

3fd22af80 sock_diag: specify info_size per inet protocol

Previously, there was no clear distinction between the inet protocols
that used struct tcp_info to report information and those that didn't.
This change adds a specific size attribute to the inet_diag_handler
struct which defines these interfaces. This will make dispatching
sock_diag get_info requests identical for all inet protocols in a
following patch.

Tested: ss -au
Tested: ss -at
Signed-off-by: Craig Gallek
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Craig Gallek
2015-06-16 10:49:22 +0800

30 Apr, 2015

1 commit

64f40ff5b tcp: prepare CC get_info() access from getsockopt()

We would like the optional info provided by Congestion Control
modules through netlink to also be readable using getsockopt().

This patch changes get_info() to put this information in a buffer,
instead of skb, like tcp_get_info(), so that following patch
can reuse this common infrastructure.

Signed-off-by: Eric Dumazet
Cc: Yuchung Cheng
Cc: Neal Cardwell
Acked-by: Neal Cardwell
Acked-by: Daniel Borkmann
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller

Eric Dumazet
2015-04-30 05:10:38 +0800

18 Apr, 2015

1 commit

521f1cf1d inet_diag: fix access to tcp cc information

Two different problems are fixed here :

1) inet_sk_diag_fill() might be called without socket lock held.
icsk->icsk_ca_ops can change under us and module be unloaded.
-> Access to freed memory.
Fix this using rcu_read_lock() to prevent module unload.

2) Some TCP Congestion Control modules provide information
but again this is not safe against icsk->icsk_ca_ops
change and nla_put() errors were ignored. Some sockets
could not get the additional info if skb was almost full.

Fix this by returning a status from get_info() handlers and
using rcu protection as well.

Signed-off-by: Eric Dumazet
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Eric Dumazet
2015-04-18 01:28:31 +0800

14 Apr, 2015

1 commit

789f558cf tcp/dccp: get rid of central timewait timer

Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale; the code is ugly and a source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-04-14 04:40:05 +0800

24 Mar, 2015

1 commit

b28270533 net: convert syn_wait_lock to a spinlock

This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.

We hold syn_wait_lock for such small sections, that it makes no sense to use
a read/write lock. A spin lock is simply faster.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-03-24 04:52:26 +0800

21 Mar, 2015

2 commits

0fa74a4be Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
net/core/sysctl_net_core.c
net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky. The conflict
hunks generated by GIT were very unhelpful, to say the least. It
split functions in half and moved them around, when the actual
conflict existed solely inside one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged. And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easy to resolve.

Signed-off-by: David S. Miller

David S. Miller
2015-03-21 06:51:09 +0800

fa76ce732 inet: get rid of central tcp/dccp listener timer

One of the major issues for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awfully long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundreds of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more than what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-03-21 00:40:25 +0800

19 Mar, 2015

1 commit

08d2cc3b2 inet: request sock should init IPv6/IPv4 addresses

In order to be able to use sk_ehashfn() for request socks,
we need to initialize their IPv6/IPv4 addresses.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-03-19 10:00:35 +0800

17 Mar, 2015

1 commit

a58917f58 inet_diag: allow sk_diag_fill() to handle request socks

inet_diag_fill_req() is renamed to inet_req_diag_fill()
and moved up, so that it can be called from sk_diag_fill()

inet_diag_bc_sk() is ready to handle request socks.

inet_twsk_diag_dump() is no longer needed.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-03-17 03:55:29 +0800

15 Mar, 2015

2 commits

a4458343a inet_diag: factorize code in new inet_diag_msg_common_fill() helper

Now that the three types of sockets share a common base, we can factorize
code in inet_diag_msg_common_fill().

inet_diag_entry no longer requires saddr_storage & daddr_storage
and the extra copies.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-03-15 03:05:10 +0800

a07c92078 inet_diag: adjust inet_sk_diag_fill() bug condition

inet_sk_diag_fill() only copes with non timewait and non request socks

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-03-15 03:05:10 +0800