Eric Lee / smarc-fsl-linux-kernel

15 Dec, 2015

1 commit

5c77e2686 unix: avoid use-after-free in ep_remove_wait_queue ... Browse Code »

[ Upstream commit 7d267278a9ece963d77eefec61630223fce08c6c ]

Rainer Weikusat writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Rainer Weikusat
10 years ago

27 Oct, 2015

2 commits

e09e88967 net/unix: fix logic about sk_peek_offset ... Browse Code »

[ Upstream commit e9193d60d363e4dff75ff6d43a48f22be26d59c7 ]

Now send with MSG_PEEK can return data from multiple SKBs.

Unfortunately we take into account the peek offset for each skb,
that is wrong. We need to apply the peek offset only once.

In addition, the peek offset should be used only if MSG_PEEK is set.

Cc: "David S. Miller" (maintainer:NETWORKING
Cc: Eric Dumazet (commit_signer:1/14=7%)
Cc: Aaron Conole
Fixes: 9f389e35674f ("af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag")
Signed-off-by: Andrey Vagin
Tested-by: Aaron Conole
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Andrey Vagin
10 years ago
9bf31c538 af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag ... Browse Code »

[ Upstream commit 9f389e35674f5b086edd70ed524ca0f287259725 ]

AF_UNIX sockets now return multiple skbs from recv() when MSG_PEEK flag
is set.

This is referenced in kernel bugzilla #12323 @
https://bugzilla.kernel.org/show_bug.cgi?id=12323

As described both in the BZ and lkml thread @
http://lkml.org/lkml/2008/1/8/444 calling recv() with MSG_PEEK on an
AF_UNIX socket only reads a single skb, where the desired effect is
to return as much skb data has been queued, until hitting the recv
buffer size (whichever comes first).

The modified MSG_PEEK path will now move to the next skb in the tree
and jump to the again: label, rather than following the natural loop
structure. This requires duplicating some of the loop head actions.

This was tested using the python socketpair python code attached to
the bugzilla issue.

Signed-off-by: Aaron Conole
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Aaron Conole
10 years ago

27 May, 2015

1 commit

b48732e4a unix/caif: sk_socket can disappear when state is unlocked ... Browse Code »

got a rare NULL pointer dereference in clear_bit

Signed-off-by: Mark Salyzyn
Acked-by: Hannes Frederic Sowa
----
v2: switch to sock_flag(sk, SOCK_DEAD) and added net/caif/caif_socket.c
v3: return -ECONNRESET in upstream caller of wait function for SOCK_DEAD
Signed-off-by: David S. Miller

Mark Salyzyn
10 years ago

28 Apr, 2015

1 commit

2decb2682 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) mlx4 doesn't check fully for supported valid RSS hash function, fix
from Amir Vadai

2) Off by one in ibmveth_change_mtu(), from David Gibson

3) Prevent altera chip from reporting false error interrupts in some
circumstances, from Chee Nouk Phoon

4) Get rid of that stupid endless loop trying to allocate a FIN packet
in TCP, and in the process kill deadlocks. From Eric Dumazet

5) Fix get_rps_cpus() crash due to wrong invalid-cpu value, also from
Eric Dumazet

6) Fix two bugs in async rhashtable resizing, from Thomas Graf

7) Fix topology server listener socket namespace bug in TIPC, from Ying
Xue

8) Add some missing HAS_DMA kconfig dependencies, from Geert
Uytterhoeven

9) bgmac driver intends to force re-polling but does so by returning
the wrong value from it's ->poll() handler. Fix from Rafał Miłecki

10) When the creater of an rhashtable configures a max size for it,
don't bark in the logs and drop insertions when that is exceeded.
Fix from Johannes Berg

11) Recover from out of order packets in ppp mppe properly, from Sylvain
Rochet

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
bnx2x: really disable TPA if 'disable_tpa' option is set
net:treewide: Fix typo in drivers/net
net/mlx4_en: Prevent setting invalid RSS hash function
mdio-mux-gpio: use new gpiod_get_array and gpiod_put_array functions
netfilter; Add some missing default cases to switch statements in nft_reject.
ppp: mppe: discard late packet in stateless mode
ppp: mppe: sanity error path rework
net/bonding: Make DRV macros private
net: rfs: fix crash in get_rps_cpus()
altera tse: add support for fixed-links.
pxa168: fix double deallocation of managed resources
net: fix crash in build_skb()
net: eth: altera: Resolve false errors from MSGDMA to TSE
ehea: Fix memory hook reference counting crashes
net/tg3: Release IRQs on permanent error
net: mdio-gpio: support access that may sleep
inet: fix possible panic in reqsk_queue_unlink()
rhashtable: don't attempt to grow when at max_size
bgmac: fix requests for extra polling calls from NAPI
tcp: avoid looping in tcp_send_fin()
...

Linus Torvalds
10 years ago

24 Apr, 2015

1 commit

d1ab39f17 net: unix: garbage: fixed several comment and whitespace style issues ... Browse Code »

fixed several comment and whitespace style issues

Signed-off-by: Jason Eastman
Signed-off-by: David S. Miller

Jason Eastman
10 years ago

16 Apr, 2015

2 commits

a25b376bd VFS: net/unix: d_backing_inode() annotations ... Browse Code »

places where we are dealing with S_ISSOCK file creation/lookups.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
10 years ago
ee8ac4d61 VFS: AF_UNIX sockets should call mknod on the top layer only ... Browse Code »

AF_UNIX sockets should call mknod on the top layer only and should not attempt
to modify the lower layer in a layered filesystem such as overlayfs.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
10 years ago

03 Mar, 2015

1 commit

1b7841404 net: Remove iocb argument from sendmsg and recvmsg ... Browse Code »

After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig
Suggested-by: Al Viro
Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
10 years ago

29 Jan, 2015

1 commit

7cc056626 net: remove sock_iocb ... Browse Code »

The sock_iocb structure is allocate on stack for each read/write-like
operation on sockets, and contains various fields of which only the
embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
Get rid of the sock_iocb and put a msghdr directly on the stack and pass
the scm_cookie explicitly to netlink_mmap_sendmsg.

Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller

Christoph Hellwig
11 years ago

18 Jan, 2015

1 commit

053c095a8 netlink: make nlmsg_end() and genlmsg_end() void ... Browse Code »

Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

if (my_function(...))
/* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for
Signed-off-by: David S. Miller

Johannes Berg
11 years ago

10 Dec, 2014

1 commit

c0371da60 put iov_iter into msghdr ... Browse Code »

Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro

Al Viro
11 years ago

24 Nov, 2014

1 commit

8feb2fb2b switch AF_PACKET and AF_UNIX to skb_copy_datagram_from_iter() ... Browse Code »

... and kill skb_copy_datagram_iovec()

Signed-off-by: Al Viro

Al Viro
11 years ago

06 Nov, 2014

1 commit

51f3d02b9 net: Add and use skb_copy_datagram_msg() helper. ... Browse Code »

This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be less places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.

Signed-off-by: David S. Miller

David S. Miller
11 years ago

08 Oct, 2014

1 commit

505e907db af_unix: remove 0 assignment on static ... Browse Code »

static values are automatically initialized to 0

Signed-off-by: Fabian Frederick
Signed-off-by: David S. Miller

Fabian Frederick
11 years ago

13 Jun, 2014

1 commit

f9da455b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.

3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.

4) BPF now has a "random" opcode, from Chema Gonzalez.

5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.

6) Support TCP fastopen over ipv6, from Daniel Lee.

7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

8) Support software TSO in fec driver too, from Nimrod Andy.

9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.

11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.

12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.

13) Support busy polling in SCTP, from Neal Horman.

14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

15) Bridge promisc mode handling improvements from Vlad Yasevich.

16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...

Linus Torvalds
11 years ago

17 May, 2014

1 commit

31ff6aa5c net: unix: Align send data_len up to PAGE_SIZE ... Browse Code »

Using whole of allocated pages reduces requested skb->data size.
This is just a little more thriftily allocation.

netperf does not show difference with the current performance.

Signed-off-by: Kirill Tkhai
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Kirill Tkhai
11 years ago

18 Apr, 2014

1 commit

4e857c58e arch: Mass conversion of smp_mb__*() ... Browse Code »

Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra
Acked-by: Paul E. McKenney
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
11 years ago

12 Apr, 2014

1 commit

676d23690 net: Fix use after free by removing length arg from sk_data_ready callbacks. ... Browse Code »

Several spots in the kernel perform a sequence like:

skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller

David S. Miller
11 years ago

27 Mar, 2014

1 commit

de1443916 net: unix: non blocking recvmsg() should not return -EINTR ... Browse Code »

Some applications didn't expect recvmsg() on a non blocking socket
could return -EINTR. This possibility was added as a side effect
of commit b3ca9b02b00704 ("net: fix multithreaded signal handling in
unix recv routines").

To hit this bug, you need to be a bit unlucky, as the u->readlock
mutex is usually held for very small periods.

Fixes: b3ca9b02b00704 ("net: fix multithreaded signal handling in unix recv routines")
Signed-off-by: Eric Dumazet
Cc: Rainer Weikusat
Signed-off-by: David S. Miller

Eric Dumazet
11 years ago

07 Mar, 2014

1 commit

0a13404dd net: unix socket code abuses csum_partial ... Browse Code »

The unix socket code is using the result of csum_partial to
hash into a lookup table:

unix_hash_fold(csum_partial(sunaddr, len, 0));

csum_partial is only guaranteed to produce something that can be
folded into a checksum, as its prototype explains:

* returns a 32-bit number suitable for feeding into itself
* or csum_tcpudp_magic

The 32bit value should not be used directly.

Depending on the alignment, the ppc64 csum_partial will return
different 32bit partial checksums that will fold into the same
16bit checksum.

This difference causes the following testcase (courtesy of
Gustavo) to sometimes fail:

#include
#include

int main()
{
int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

int i = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, 4);

struct sockaddr addr;
addr.sa_family = AF_LOCAL;
bind(fd, &addr, 2);

listen(fd, 128);

struct sockaddr_storage ss;
socklen_t sslen = (socklen_t)sizeof(ss);
getsockname(fd, (struct sockaddr*)&ss, &sslen);

fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){
perror(NULL);
return 1;
}
printf("OK\n");
return 0;
}

As suggested by davem, fix this by using csum_fold to fold the
partial 32bit checksum into a 16bit checksum before using it.

Signed-off-by: Anton Blanchard
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller

Anton Blanchard
11 years ago

19 Jan, 2014

1 commit

342dfc306 net: add build-time checks for msg->msg_name size ... Browse Code »

This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
handler msg_name and msg_namelen logic").

DECLARE_SOCKADDR validates that the structure we use for writing the
name information to is not larger than the buffer which is reserved
for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
consistently in sendmsg code paths.

Signed-off-by: Steffen Hurrle
Suggested-by: Hannes Frederic Sowa
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Steffen Hurrle
12 years ago

19 Dec, 2013

1 commit

143c90549 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/intel/i40e/i40e_main.c
drivers/net/macvtap.c

Both minor merge hassles, simple overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
12 years ago

18 Dec, 2013

1 commit

37ab4fa78 net: unix: allow bind to fail on mutex lock ... Browse Code »

This is similar to the set_peek_off patch where calling bind while the
socket is stuck in unix_dgram_recvmsg() will block and cause a hung task
spew after a while.

This is also the last place that did a straightforward mutex_lock(), so
there shouldn't be any more of these patches.

Signed-off-by: Sasha Levin
Signed-off-by: David S. Miller

Sasha Levin
12 years ago

11 Dec, 2013

1 commit

12663bfc9 net: unix: allow set_peek_off to fail ... Browse Code »

unix_dgram_recvmsg() will hold the readlock of the socket until recv
is complete.

In the same time, we may try to setsockopt(SO_PEEK_OFF) which will hang until
unix_dgram_recvmsg() will complete (which can take a while) without allowing
us to break out of it, triggering a hung task spew.

Instead, allow set_peek_off to fail, this way userspace will not hang.

Signed-off-by: Sasha Levin
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Sasha Levin
12 years ago

07 Dec, 2013

1 commit

5cc208bec unix: convert printks to pr_<level> ... Browse Code »

use pr_ instead of printk(LEVEL)

Signed-off-by: Wang Weidong
Signed-off-by: David S. Miller

wangweidong
12 years ago

21 Nov, 2013

1 commit

f3d334260 net: rework recvmsg handler msg_name and msg_namelen logic ... Browse Code »

This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size
Suggested-by: Eric Dumazet
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Hannes Frederic Sowa
12 years ago

20 Oct, 2013

1 commit

90c6bd34f net: unix: inherit SOCK_PASS{CRED, SEC} flags from socket to fix race ... Browse Code »

In the case of credentials passing in unix stream sockets (dgram
sockets seem not affected), we get a rather sparse race after
commit 16e5726 ("af_unix: dont send SCM_CREDENTIALS by default").

We have a stream server on receiver side that requests credential
passing from senders (e.g. nc -U). Since we need to set SO_PASSCRED
on each spawned/accepted socket on server side to 1 first (as it's
not inherited), it can happen that in the time between accept() and
setsockopt() we get interrupted, the sender is being scheduled and
continues with passing data to our receiver. At that time SO_PASSCRED
is neither set on sender nor receiver side, hence in cmsg's
SCM_CREDENTIALS we get eventually pid:0, uid:65534, gid:65534
(== overflow{u,g}id) instead of what we actually would like to see.

On the sender side, here nc -U, the tests in maybe_add_creds()
invoked through unix_stream_sendmsg() would fail, as at that exact
time, as mentioned, the sender has neither SO_PASSCRED on his side
nor sees it on the server side, and we have a valid 'other' socket
in place. Thus, sender believes it would just look like a normal
connection, not needing/requesting SO_PASSCRED at that time.

As reverting 16e5726 would not be an option due to the significant
performance regression reported when having creds always passed,
one way/trade-off to prevent that would be to set SO_PASSCRED on
the listener socket and allow inheriting these flags to the spawned
socket on server side in accept(). It seems also logical to do so
if we'd tell the listener socket to pass those flags onwards, and
would fix the race.

Before, strace:

recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS{pid=0, uid=65534, gid=65534}},
msg_flags=0}, 0) = 5

After, strace:

recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS{pid=11580, uid=1000, gid=1000}},
msg_flags=0}, 0) = 5

Signed-off-by: Daniel Borkmann
Cc: Eric Dumazet
Cc: Eric W. Biederman
Signed-off-by: David S. Miller

Daniel Borkmann
12 years ago

03 Oct, 2013

1 commit

6865d1e83 unix_diag: fix info leak ... Browse Code »

When filling the netlink message we miss to wipe the pad field,
therefore leak one byte of heap memory to userland. Fix this by
setting pad to 0.

Signed-off-by: Mathias Krause
Signed-off-by: David S. Miller

Mathias Krause
12 years ago

12 Aug, 2013

1 commit

f3dfd2086 af_unix: fix bug on large send() ... Browse Code »

commit e370a723632 ("af_unix: improve STREAM behavior with fragmented
memory") added a bug on large send() because the
skb_copy_datagram_from_iovec() call always start from the beginning
of iovec.

We must instead use the @sent variable to properly skip the
already processed part.

Reported-by: Hannes Frederic Sowa
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
12 years ago

10 Aug, 2013

2 commits

28d642710 net: attempt high order allocations in sock_alloc_send_pskb() ... Browse Code »

Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

2304 212992 212992 10.00 46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

2304 212992 212992 10.00 57981.11

Signed-off-by: Eric Dumazet
Cc: David Rientjes
Signed-off-by: David S. Miller

Eric Dumazet
12 years ago
e370a7236 af_unix: improve STREAM behavior with fragmented memory ... Browse Code »

unix_stream_sendmsg() currently uses order-2 allocations,
and we had numerous reports this can fail.

The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
not helping.

This patch extends the work done in commit eb6a24816b247c
("af_unix: reduce high order page allocations) for
datagram sockets.

This opens the possibility of zero copy IO (splice() and
friends)

The trick is to not use skb_pull() anymore in recvmsg() path,
and instead add a @consumed field in UNIXCB() to track amount
of already read payload in the skb.

There is a performance regression for large sends
because of extra page allocations that will be addressed
in a follow-up patch, allowing sock_alloc_send_pskb()
to attempt high order page allocations.

Signed-off-by: Eric Dumazet
Cc: David Rientjes
Signed-off-by: David S. Miller

Eric Dumazet
12 years ago

10 Jul, 2013

1 commit

496322bc9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:
"This is a re-do of the net-next pull request for the current merge
window. The only difference from the one I made the other day is that
this has Eliezer's interface renames and the timeout handling changes
made based upon your feedback, as well as a few bug fixes that have
trickeled in.

Highlights:

1) Low latency device polling, eliminating the cost of interrupt
handling and context switches. Allows direct polling of a network
device from socket operations, such as recvmsg() and poll().

Currently ixgbe, mlx4, and bnx2x support this feature.

Full high level description, performance numbers, and design in
commit 0a4db187a999 ("Merge branch 'll_poll'")

From Eliezer Tamir.

2) With the routing cache removed, ip_check_mc_rcu() gets exercised
more than ever before in the case where we have lots of multicast
addresses. Use a hash table instead of a simple linked list, from
Eric Dumazet.

3) Add driver for Atheros CQA98xx 802.11ac wireless devices, from
Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

4) Support reporting the TUN device persist flag to userspace, from
Pavel Emelyanov.

5) Allow controlling network device VF link state using netlink, from
Rony Efraim.

6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
Daniel Borkmann and Eric Dumazet.

8) Allow controlling of TCP quickack behavior on a per-route basis,
from Cong Wang.

9) Several bug fixes and improvements to vxlan from Stephen
Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
support receiving on multiple UDP ports.

10) Major cleanups, particular in the area of debugging and cookie
lifetime handline, to the SCTP protocol code. From Daniel
Borkmann.

11) Allow packets to cross network namespaces when traversing tunnel
devices. From Nicolas Dichtel.

12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
manner akin to how we monitor real network traffic via ptype_all.
From Daniel Borkmann.

13) Several bug fixes and improvements for the new alx device driver,
from Johannes Berg.

14) Fix scalability issues in the netem packet scheduler's time queue,
by using an rbtree. From Eric Dumazet.

15) Several bug fixes in TCP loss recovery handling, from Yuchung
Cheng.

16) Add support for GSO segmentation of MPLS packets, from Simon
Horman.

17) Make network notifiers have a real data type for the opaque
pointer that's passed into them. Use this to properly handle
network device flag changes in arp_netdev_event(). From Jiri
Pirko and Timo Teräs.

18) Convert several drivers over to module_pci_driver(), from Peter
Huewe.

19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
O(1) calculation instead. From Eric Dumazet.

20) Support setting of explicit tunnel peer addresses in ipv6, just
like ipv4. From Nicolas Dichtel.

21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

22) Prevent a single high rate flow from overruning an individual cpu
during RX packet processing via selective flow shedding. From
Willem de Bruijn.

23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
Dumazet.

24) Don't just drop GSO packets which are above the TBF scheduler's
burst limit, chop them up so they are in-bounds instead. Also
from Eric Dumazet.

25) VLAN offloads are missed when configured on top of a bridge, fix
from Vlad Yasevich.

26) Support IPV6 in ping sockets. From Lorenzo Colitti.

27) Receive flow steering targets should be updated at poll() time
too, from David Majnemer.

28) Fix several corner case regressions in PMTU/redirect handling due
to the routing cache removal, from Timo Teräs.

29) We have to be mindful of ipv4 mapped ipv6 sockets in
upd_v6_push_pending_frames(). From Hannes Frederic Sowa.

30) Fix L2TP sequence number handling bugs, from James Chapman."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
drivers/net: caif: fix wrong rtnl_is_locked() usage
drivers/net: enic: release rtnl_lock on error-path
vhost-net: fix use-after-free in vhost_net_flush
net: mv643xx_eth: do not use port number as platform device id
net: sctp: confirm route during forward progress
virtio_net: fix race in RX VQ processing
virtio: support unlocked queue poll
net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
Documentation: Fix references to defunct linux-net@vger.kernel.org
net/fs: change busy poll time accounting
net: rename low latency sockets functions to busy poll
bridge: fix some kernel warning in multicast timer
sfc: Fix memory leak when discarding scattered packets
sit: fix tunnel update via netlink
dt:net:stmmac: Add dt specific phy reset callback support.
dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
dt:net:stmmac: Allocate platform data only if its NULL.
net:stmmac: fix memleak in the open method
ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
net: ipv6: fix wrong ping_v6_sendmsg return value
...

Linus Torvalds
12 years ago

13 Jun, 2013

1 commit

fe2c6338f net: Convert uses of typedef ctl_table to struct ctl_table ... Browse Code »

Reduce the uses of this unnecessary typedef.

Done via perl script:

$ git grep --name-only -w ctl_table net | \
xargs perl -p -i -e '\
sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

Reflow the modified lines that now exceed 80 columns.

Signed-off-by: Joe Perches
Signed-off-by: David S. Miller

Joe Perches
12 years ago

12 May, 2013

1 commit

2b15af6f9 af_unix: use freezable blocking calls in read ... Browse Code »

Avoid waking up every thread sleeping in read call on an AF_UNIX
socket during suspend and resume by calling a freezable blocking
call. Previous patches modified the freezer to avoid sending
wakeups to threads that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo
Signed-off-by: Colin Cross
Signed-off-by: Rafael J. Wysocki

Colin Cross
12 years ago

02 May, 2013

1 commit

60bc851ae af_unix: fix a fatal race with bit fields ... Browse Code »

Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
uses 64bit instructions to manipulate them.
If the 64bit word includes any atomic_t or spinlock_t, we can lose
critical concurrent changes.

This is happening in af_unix, where unix_sk(sk)->gc_candidate/
gc_maybe_cycle/lock share the same 64bit word.

This leads to fatal deadlock, as one/several cpus spin forever
on a spinlock that will never be available again.

A safer way would be to use a long to store flags.
This way we are sure compiler/arch wont do bad things.

As we own unix_gc_lock spinlock when clearing or setting bits,
we can use the non atomic __set_bit()/__clear_bit().

recursion_level can share the same 64bit location with the spinlock,
as it is set only with this spinlock held.

[1] bug fixed in gcc-4.8.0 :
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

Reported-by: Ambrose Feinstein
Signed-off-by: Eric Dumazet
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Signed-off-by: David S. Miller

Eric Dumazet
12 years ago

30 Apr, 2013

2 commits

58717686c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
drivers/net/ethernet/emulex/benet/be.h
include/net/tcp.h
net/mac802154/mac802154.h

Most conflicts were minor overlapping stuff.

The be2net driver brought in some fixes that added __vlan_put_tag
calls, which in net-next take an additional argument.

Signed-off-by: David S. Miller

David S. Miller
12 years ago
79f632c71 unix/stream: fix peeking with an offset larger than data in queue ... Browse Code »

Currently, peeking on a unix stream socket with an offset larger than len of
the data in the sk receive queue returns immediately with bogus data.

This patch fixes this so that the behavior is the same as peeking with no
offset on an empty queue: the caller blocks.

Signed-off-by: Benjamin Poirier
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Benjamin Poirier
12 years ago

23 Apr, 2013

1 commit

6e0895c2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
include/net/scm.h
net/batman-adv/routing.c
net/ipv4/tcp_input.c

The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.

The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.

An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.

Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.

Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.

Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.

Signed-off-by: David S. Miller

David S. Miller
12 years ago

08 Apr, 2013

1 commit

6b0ee8c03 scm: Stop passing struct cred ... Browse Code »

Now that uids and gids are completely encapsulated in kuid_t
and kgid_t we no longer need to pass struct cred which allowed
us to test both the uid and the user namespace for equality.

Passing struct cred potentially allows us to pass the entire group
list as BSD does but I don't believe the cost of cache line misses
justifies retaining code for a future potential application.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
12 years ago