Eric Lee / smarc-fsl-linux-kernel

08 Mar, 2011

1 commit

b3ca9b02b net: fix multithreaded signal handling in unix recv routines ... Browse Code »

The unix_dgram_recvmsg and unix_stream_recvmsg routines in
net/af_unix.c utilize mutex_lock(&u->readlock) calls in order to
serialize read operations of multiple threads on a single socket. This
implies that, if all n threads of a process block in an AF_UNIX recv
call trying to read data from the same socket, one of these threads
will be sleeping in state TASK_INTERRUPTIBLE and all others in state
TASK_UNINTERRUPTIBLE. Provided that a particular signal is supposed to
be handled by a signal handler defined by the process and that none of
this threads is blocking the signal, the complete_signal routine in
kernel/signal.c will select the 'first' such thread it happens to
encounter when deciding which thread to notify that a signal is
supposed to be handled and if this is one of the TASK_UNINTERRUPTIBLE
threads, the signal won't be handled until the one thread not blocking
on the u->readlock mutex is woken up because some data to process has
arrived (if this ever happens). The included patch fixes this by
changing mutex_lock to mutex_lock_interruptible and handling possible
error returns in the same way interruptions are handled by the actual
receive-code.

Signed-off-by: Rainer Weikusat
Signed-off-by: David S. Miller

Rainer Weikusat
2011-03-08 07:31:16 +0800

06 Jan, 2011

1 commit

3610cda53 af_unix: Avoid socket->sk NULL OOPS in stream connect security hooks. ... Browse Code »

unix_release() can asynchornously set socket->sk to NULL, and
it does so without holding the unix_state_lock() on "other"
during stream connects.

However, the reverse mapping, sk->sk_socket, is only transitioned
to NULL under the unix_state_lock().

Therefore make the security hooks follow the reverse mapping instead
of the forward mapping.

Reported-by: Jeremy Fitzhardinge
Reported-by: Linus Torvalds
Signed-off-by: David S. Miller

David S. Miller
2011-01-06 07:38:53 +0800

09 Dec, 2010

1 commit

fe6c79157 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 ... Browse Code »

Conflicts:
drivers/net/wireless/ath/ath9k/ar9003_eeprom.c
net/llc/af_llc.c

David S. Miller
2010-12-09 05:47:38 +0800

30 Nov, 2010

1 commit

25888e303 af_unix: limit recursion level ... Browse Code »

Its easy to eat all kernel memory and trigger NMI watchdog, using an
exploit program that queues unix sockets on top of others.

lkml ref : http://lkml.org/lkml/2010/11/25/8

This mechanism is used in applications, one choice we have is to have a
recursion limit.

Other limits might be needed as well (if we queue other types of files),
since the passfd mechanism is currently limited by socket receive queue
sizes only.

Add a recursion_level to unix socket, allowing up to 4 levels.

Each time we send an unix socket through sendfd mechanism, we copy its
recursion level (plus one) to receiver. This recursion level is cleared
when socket receive queue is emptied.

Reported-by: Марк Коренберг
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-11-30 01:45:15 +0800

25 Nov, 2010

1 commit

9915672d4 af_unix: limit unix_tot_inflight ... Browse Code »

Vegard Nossum found a unix socket OOM was possible, posting an exploit
program.

My analysis is we can eat all LOWMEM memory before unix_gc() being
called from unix_release_sock(). Moreover, the thread blocked in
unix_gc() can consume huge amount of time to perform cleanup because of
huge working set.

One way to handle this is to have a sensible limit on unix_tot_inflight,
tested from wait_for_unix_gc() and to force a call to unix_gc() if this
limit is hit.

This solves the OOM and also reduce overall latencies, and should not
slowdown normal workloads.

Reported-by: Vegard Nossum
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-11-25 01:15:27 +0800

09 Nov, 2010

3 commits

973a34aa8 af_unix: optimize unix_dgram_poll() ... Browse Code »

unix_dgram_poll() is pretty expensive to check POLLOUT status, because
it has to lock the socket to get its peer, take a reference on the peer
to check its receive queue status, and queue another poll_wait on
peer_wait. This all can be avoided if the process calling
unix_dgram_poll() is not interested in POLLOUT status. It makes
unix_dgram_recvmsg() faster by not queueing irrelevant pollers in
peer_wait.

On a test program provided by Alan Crequy :

Before:

real 0m0.211s
user 0m0.000s
sys 0m0.208s

After:

real 0m0.044s
user 0m0.000s
sys 0m0.040s

Suggested-by: Davide Libenzi
Reported-by: Alban Crequy
Acked-by: Davide Libenzi
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-11-09 05:50:09 +0800
5456f09aa af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event ... Browse Code »

Alban Crequy reported a problem with connected dgram af_unix sockets and
provided a test program. epoll() would miss to send an EPOLLOUT event
when a thread unqueues a packet from the other peer, making its receive
queue not full.

This is because unix_dgram_poll() fails to call sock_poll_wait(file,
&unix_sk(other)->peer_wait, wait);
if the socket is not writeable at the time epoll_ctl(ADD) is called.

We must call sock_poll_wait(), regardless of 'writable' status, so that
epoll can be notified later of states changes.

Misc: avoids testing twice (sk->sk_shutdown & RCV_SHUTDOWN)

Reported-by: Alban Crequy
Cc: Davide Libenzi
Signed-off-by: Eric Dumazet
Acked-by: Davide Libenzi
Signed-off-by: David S. Miller

Eric Dumazet
2010-11-09 05:50:09 +0800
67426b756 af_unix: use keyed wakeups ... Browse Code »

Instead of wakeup all sleepers, use wake_up_interruptible_sync_poll() to
wakeup only ones interested into writing the socket.

This patch is a specialization of commit 37e5540b3c9d (epoll keyed
wakeups: make sockets use keyed wakeups).

On a test program provided by Alan Crequy :

Before:
real 0m3.101s
user 0m0.000s
sys 0m6.104s

After:

real 0m0.211s
user 0m0.000s
sys 0m0.208s

Reported-by: Alban Crequy
Signed-off-by: Eric Dumazet
Cc: Davide Libenzi
Signed-off-by: David S. Miller

Eric Dumazet
2010-11-09 05:50:08 +0800

27 Oct, 2010

1 commit

518de9b39 fs: allow for more than 2^31 files ... Browse Code »

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:

atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.

get_max_files() is changed to return an unsigned long. get_nr_files() is
changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648

Reported-by: Robin Holt
Signed-off-by: Eric Dumazet
Acked-by: David Miller
Reviewed-by: Robin Holt
Tested-by: Robin Holt
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2010-10-27 07:52:15 +0800

06 Oct, 2010

1 commit

3f66116e8 AF_UNIX: Implement SO_TIMESTAMP and SO_TIMETAMPNS on Unix sockets ... Browse Code »

Userspace applications can already request to receive timestamps with:
setsockopt(sockfd, SOL_SOCKET, SO_TIMESTAMP, ...)

Although setsockopt() returns zero (success), timestamps are not added to the
ancillary data. This patch fixes that on SOCK_DGRAM and SOCK_SEQPACKET Unix
sockets.

Signed-off-by: Alban Crequy
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Alban Crequy
2010-10-06 05:54:36 +0800

10 Sep, 2010

1 commit

e548833df Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 ... Browse Code »

Conflicts:
net/mac80211/main.c

David S. Miller
2010-09-10 13:27:33 +0800

08 Sep, 2010

1 commit

8df73ff90 UNIX: Do not loop forever at unix_autobind(). ... Browse Code »

We assumed that unix_autobind() never fails if kzalloc() succeeded.
But unix_autobind() allows only 1048576 names. If /proc/sys/fs/file-max is
larger than 1048576 (e.g. systems with more than 10GB of RAM), a local user can
consume all names using fork()/socket()/bind().

If all names are in use, those who call bind() with addr_len == sizeof(short)
or connect()/sendmsg() with setsockopt(SO_PASSCRED) will continue

while (1)
yield();

loop at unix_autobind() till a name becomes available.
This patch adds a loop counter in order to give up after 1048576 attempts.

Calling yield() for once per 256 attempts may not be sufficient when many names
are already in use, for __unix_find_socket_byname() can take long time under
such circumstance. Therefore, this patch also adds cond_resched() call.

Note that currently a local user can consume 2GB of kernel memory if the user
is allowed to create and autobind 1048576 UNIX domain sockets. We should
consider adding some restriction for autobind operation.

Signed-off-by: Tetsuo Handa
Signed-off-by: David S. Miller

Tetsuo Handa
2010-09-08 04:57:23 +0800

07 Sep, 2010

1 commit

db40980fc net: poll() optimizations ... Browse Code »

No need to test twice sk->sk_shutdown & RCV_SHUTDOWN

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-09-07 09:48:45 +0800

21 Jul, 2010

1 commit

70d4bf6d4 drop_monitor: convert some kfree_skb call sites to consume_skb ... Browse Code »

Convert a few calls from kfree_skb to consume_skb

Noticed while I was working on dropwatch that I was detecting lots of internal
skb drops in several places. While some are legitimate, several were not,
freeing skbs that were at the end of their life, rather than being discarded due
to an error. This patch converts those calls sites from using kfree_skb to
consume_skb, which quiets the in-kernel drop_monitor code from detecting them as
drops. Tested successfully by myself

Signed-off-by: Neil Horman
Signed-off-by: David S. Miller

Neil Horman
2010-07-21 04:28:05 +0800

17 Jun, 2010

3 commits

6616f7888 af_unix: Allow connecting to sockets in other network namespaces. ... Browse Code »

Remove the restriction that only allows connecting to a unix domain
socket identified by unix path that is in the same network namespace.

Crossing network namespaces is always tricky and we did not support
this at first, because of a strict policy of don't mix the namespaces.
Later after Pavel proposed this we did not support this because no one
had performed the audit to make certain using unix domain sockets
across namespaces is safe.

What fundamentally makes connecting to af_unix sockets in other
namespaces is safe is that you have to have the proper permissions on
the unix domain socket inode that lives in the filesystem. If you
want strict isolation you just don't create inodes where unfriendlys
can get at them, or with permissions that allow unfriendlys to open
them. All nicely handled for us by the mount namespace and other
standard file system facilities.

I looked through unix domain sockets and they are a very controlled
environment so none of the work that goes on in dev_forward_skb to
make crossing namespaces safe appears needed, we are not loosing
controll of the skb and so do not need to set up the skb to look like
it is comming in fresh from the outside world. Further the fields in
struct unix_skb_parms should not have any problems crossing network
namespaces.

Now that we handle SCM_CREDENTIALS in a way that gives useable values
across namespaces. There does not appear to be any operational
problems with encouraging the use of unix domain sockets across
containers either.

Signed-off-by: Eric W. Biederman
Acked-by: Daniel Lezcano
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Eric W. Biederman
2010-06-17 05:58:17 +0800
7361c36c5 af_unix: Allow credentials to work across user and pid namespaces. ... Browse Code »

In unix_skb_parms store pointers to struct pid and struct cred instead
of raw uid, gid, and pid values, then translate the credentials on
reception into values that are meaningful in the receiving processes
namespaces.

Signed-off-by: Eric W. Biederman
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Eric W. Biederman
2010-06-17 05:58:16 +0800
109f6e39f af_unix: Allow SO_PEERCRED to work across namespaces. ... Browse Code »

Use struct pid and struct cred to store the peer credentials on struct
sock. This gives enough information to convert the peer credential
information to a value relative to whatever namespace the socket is in
at the time.

This removes nasty surprises when using SO_PEERCRED on socket
connetions where the processes on either side are in different pid and
user namespaces.

Signed-off-by: Eric W. Biederman
Acked-by: Daniel Lezcano
Acked-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Eric W. Biederman
2010-06-17 05:55:55 +0800

04 May, 2010

1 commit

a2f3be17c unix/garbage: kill copy of the skb queue walker ... Browse Code »

Worse yet, it seems that its arguments were in reverse order. Also
remove one related helper which seems hardly worth keeping.

Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller

Ilpo Järvinen
2010-05-04 06:39:58 +0800

02 May, 2010

1 commit

438154823 net: sock_def_readable() and friends RCU conversion ... Browse Code »

sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.

RCU conversion is pretty much needed :

1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).

[Future patch will add a list anchor for wakeup coalescing]

2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().

3) Respect RCU grace period when freeing a "struct socket_wq"

4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"

5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep

6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.

7) Change all sk_has_sleeper() callers to :
- Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
- Use wq_has_sleeper() to eventually wakeup tasks.
- Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

8) sock_wake_async() is modified to use rcu protection as well.

9) Exceptions :
macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.

Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-05-02 06:00:15 +0800

21 Apr, 2010

1 commit

aa3951451 net: sk_sleep() helper ... Browse Code »

Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-04-21 07:37:13 +0800

30 Mar, 2010

1 commit

5a0e3ad6a include cleanup: Update gfp.h and slab.h includes to prepare for breaking implic… ... Browse Code »

…it slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Tejun Heo
2010-03-30 21:02:32 +0800

19 Feb, 2010

1 commit

663717f65 AF_UNIX: update locking comment ... Browse Code »

The lock used in unix_state_lock() is a spin_lock not reader-writer.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2010-02-19 06:12:06 +0800

18 Jan, 2010

1 commit

2c8c1e729 net: spread __net_init, __net_exit ... Browse Code »

__net_init/__net_exit are apparently not going away, so use them
to full extent.

In some cases __net_init was removed, because it was called from
__net_exit code.

Signed-off-by: Alexey Dobriyan
Signed-off-by: David S. Miller

Alexey Dobriyan
2010-01-18 11:16:02 +0800

08 Dec, 2009

1 commit

d7fc02c7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...

Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c

Linus Torvalds
2009-12-08 23:55:01 +0800

30 Nov, 2009

1 commit

f64f9e719 net: Move && and || to end of previous line ... Browse Code »

Not including net/atm/

Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.

Signed-off-by: Joe Perches
Signed-off-by: David S. Miller

Joe Perches
2009-11-30 08:55:45 +0800

12 Nov, 2009

1 commit

f8572d8f2 sysctl net: Remove unused binary sysctl code ... Browse Code »

Now that sys_sysctl is a compatiblity wrapper around /proc/sys
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables are unused, and can be
revmoed.

In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and it's callers have been modified not
to pass one.

Cc: "David Miller"
Cc: Hideaki YOSHIFUJI
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-12 18:05:06 +0800

11 Nov, 2009

1 commit

13cfa97be net: netlink_getname, packet_getname -- use DECLARE_SOCKADDR guard ... Browse Code »

Use guard DECLARE_SOCKADDR in a few more places which allow
us to catch if the structure copied back is too big.

Signed-off-by: Cyrill Gorcunov
Signed-off-by: David S. Miller

Cyrill Gorcunov
2009-11-11 12:54:41 +0800

06 Nov, 2009

1 commit

3f378b684 net: pass kern to net_proto_family create function ... Browse Code »

The generic __sock_create function has a kern argument which allows the
security system to make decisions based on if a socket is being created by
the kernel or by userspace. This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.

Signed-off-by: Eric Paris
Acked-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller

Eric Paris
2009-11-06 14:18:14 +0800

27 Oct, 2009

1 commit

cfadf853f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 ... Browse Code »

Conflicts:
drivers/net/sh_eth.c

David S. Miller
2009-10-27 16:03:26 +0800

19 Oct, 2009

1 commit

77238f2b9 AF_UNIX: Fix deadlock on connecting to shutdown socket ... Browse Code »

I found a deadlock bug in UNIX domain socket, which makes able to DoS
attack against the local machine by non-root users.

How to reproduce:
1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
namespace(*), and shutdown(2) it.
2. Repeat connect(2)ing to the listening socket from the other sockets
until the connection backlog is full-filled.
3. connect(2) takes the CPU forever. If every core is taken, the
system hangs.

PoC code: (Run as many times as cores on SMP machines.)

int main(void)
{
int ret;
int csd;
int lsd;
struct sockaddr_un sun;

/* make an abstruct name address (*) */
memset(&sun, 0, sizeof(sun));
sun.sun_family = PF_UNIX;
sprintf(&sun.sun_path[1], "%d", getpid());

/* create the listening socket and shutdown */
lsd = socket(AF_UNIX, SOCK_STREAM, 0);
bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
listen(lsd, 1);
shutdown(lsd, SHUT_RDWR);

/* connect loop */
alarm(15); /* forcely exit the loop after 15 sec */
for (;;) {
csd = socket(AF_UNIX, SOCK_STREAM, 0);
ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
if (-1 == ret) {
perror("connect()");
break;
}
puts("Connection OK");
}
return 0;
}

(*) Make sun_path[0] = 0 to use the abstruct namespace.
If a file-based socket is used, the system doesn't deadlock because
of context switches in the file system layer.

Why this happens:
Error checks between unix_socket_connect() and unix_wait_for_peer() are
inconsistent. The former calls the latter to wait until the backlog is
processed. Despite the latter returns without doing anything when the
socket is shutdown, the former doesn't check the shutdown state and
just retries calling the latter forever.

Patch:
The patch below adds shutdown check into unix_socket_connect(), so
connect(2) to the shutdown socket will return -ECONREFUSED.

Signed-off-by: Tomoki Sekiyama
Signed-off-by: Masanori Yoshida
Signed-off-by: David S. Miller

Tomoki Sekiyama
2009-10-19 14:17:37 +0800

07 Oct, 2009

1 commit

ec1b4cf74 net: mark net_proto_ops as const ... Browse Code »

All usages of structure net_proto_ops should be declared const.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

Stephen Hemminger
2009-10-07 16:10:46 +0800

12 Sep, 2009

1 commit

8ba69ba6a net: unix: fix sending fds in multiple buffers ... Browse Code »

Kalle Olavi Niemitalo reported that:

"..., when one process calls sendmsg once to send 43804 bytes of
data and one file descriptor, and another process then calls recvmsg
three times to receive the 16032+16032+11740 bytes, each of those
recvmsg calls returns the file descriptor in the ancillary data. I
confirmed this with strace. The behaviour differs from Linux
2.6.26, where reportedly only one of those recvmsg calls (I think
the first one) returned the file descriptor."

This bug was introduced by a patch from me titled "net: unix: fix inflight
counting bug in garbage collector", commit 6209344f5.

And the reason is, quoting Kalle:

"Before your patch, unix_attach_fds() would set scm->fp = NULL, so
that if the loop in unix_stream_sendmsg() ran multiple iterations,
it could not call unix_attach_fds() again. But now,
unix_attach_fds() leaves scm->fp unchanged, and I think this causes
it to be called multiple times and duplicate the same file
descriptors to each struct sk_buff."

Fix this by introducing a flag that is cleared at the start and set
when the fds attached to the first buffer. The resulting code should
work equivalently to the one on 2.6.26.

Reported-by: Kalle Olavi Niemitalo
Signed-off-by: Miklos Szeredi
Signed-off-by: David S. Miller

Miklos Szeredi
2009-09-12 02:31:45 +0800

10 Jul, 2009

1 commit

a57de0b43 net: adding memory barrier to the poll and receive callbacks ... Browse Code »

Adding memory barrier after the poll_wait function, paired with
receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper
to wrap the memory barrier.

Without the memory barrier, following race can happen.
The race fires, when following code paths meet, and the tp->rcv_nxt
and __add_wait_queue updates stay in CPU caches.

CPU1 CPU2

sys_select receive packet
... ...
__add_wait_queue update tp->rcv_nxt
... ...
tp->rcv_nxt check sock_def_readable
... {
schedule ...
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible(sk->sk_sleep)
...
}

If there was no cache the code would work ok, since the wait_queue and
rcv_nxt are opposit to each other.

Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then
endup calling schedule and sleep forever if there are no more data on the
socket.

Calls to poll_wait in following modules were ommited:
net/bluetooth/af_bluetooth.c
net/irda/af_irda.c
net/irda/irnet/irnet_ppp.c
net/mac80211/rc80211_pid_debugfs.c
net/phonet/socket.c
net/rds/af_rds.c
net/rfkill/core.c
net/sunrpc/cache.c
net/sunrpc/rpc_pipe.c
net/tipc/socket.c

Signed-off-by: Jiri Olsa
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Olsa
2009-07-10 08:06:57 +0800

18 Jun, 2009

1 commit

31e6d363a net: correct off-by-one write allocations reports ... Browse Code »

commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
(net: No more expensive sock_hold()/sock_put() on each tx)
changed initial sk_wmem_alloc value.

We need to take into account this offset when reporting
sk_wmem_alloc to user, in PROC_FS files or various
ioctls (SIOCOUTQ/TIOCOUTQ)

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2009-06-18 15:29:12 +0800

01 Apr, 2009

1 commit

ce3b0f8d5 New helper - current_umask() ... Browse Code »

current->fs->umask is what most of fs_struct users are doing.
Put that into a helper function.

Signed-off-by: Al Viro

Al Viro
2009-04-01 11:00:26 +0800

27 Feb, 2009

1 commit

40d44446c unix: remove some pointless conditionals before kfree_skb() ... Browse Code »

Remove some pointless conditionals before kfree_skb().

Signed-off-by: Wei Yongjun
Signed-off-by: David S. Miller

Wei Yongjun
2009-02-27 15:07:34 +0800

01 Jan, 2009

1 commit

be6d3e56a introduce new LSM hooks where vfsmount is available. ... Browse Code »

Add new LSM hooks for path-based checks. Call them on directory-modifying
operations at the points where we still know the vfsmount involved.

Signed-off-by: Kentaro Takeda
Signed-off-by: Tetsuo Handa
Signed-off-by: Toshiharu Harada
Signed-off-by: Al Viro

Kentaro Takeda
2009-01-01 07:07:37 +0800

29 Dec, 2008

1 commit

0191b625c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
net: Allow dependancies of FDDI & Tokenring to be modular.
igb: Fix build warning when DCA is disabled.
net: Fix warning fallout from recent NAPI interface changes.
gro: Fix potential use after free
sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
sfc: When disabling the NIC, close the device rather than unregistering it
sfc: SFT9001: Add cable diagnostics
sfc: Add support for multiple PHY self-tests
sfc: Merge top-level functions for self-tests
sfc: Clean up PHY mode management in loopback self-test
sfc: Fix unreliable link detection in some loopback modes
sfc: Generate unique names for per-NIC workqueues
802.3ad: use standard ethhdr instead of ad_header
802.3ad: generalize out mac address initializer
802.3ad: initialize ports LACPDU from const initializer
802.3ad: remove typedef around ad_system
802.3ad: turn ports is_individual into a bool
802.3ad: turn ports is_enabled into a bool
802.3ad: make ntt bool
ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
...

Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
to the conversion to %pI (in this networking merge) and the addition of
doing IPv6 addresses (from the earlier merge of CIFS).

Linus Torvalds
2008-12-29 04:49:40 +0800

04 Dec, 2008

1 commit

ec98ce480 Merge branch 'master' into next ... Browse Code »

Conflicts:
fs/nfsd/nfs4recover.c

Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().

Signed-off-by: James Morris

James Morris
2008-12-04 14:16:36 +0800

03 Dec, 2008

1 commit

aa2ba5f10 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 ... Browse Code »

Conflicts:

drivers/net/ixgbe/ixgbe_main.c
drivers/net/smc91x.c

David S. Miller
2008-12-03 11:50:27 +0800