15 Oct, 2007
1 commit
-
make sync wakeups affine for cache-cold tasks: if a cache-cold task
is woken up by a sync wakeup then use the opportunity to migrate it
straight away. (the two tasks are 'related' because they communicate)Signed-off-by: Ingo Molnar
11 Oct, 2007
3 commits
-
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.Signed-off-by: Pavel Emelyanov
Signed-off-by: David S. Miller -
This patch passes in the namespace a new socket should be created in
and has the socket code do the appropriate reference counting. By
virtue of this all socket create methods are touched. In addition
the socket create methods are modified so that they will fail if
you attempt to create a socket in a non-default network namespace.Failing if we attempt to create a socket outside of the default
network namespace ensures that as we incrementally make the network stack
network namespace aware we will not export functionality that someone
has not audited and made certain is network namespace safe.
Allowing us to partially enable network namespaces before all of the
exotic protocols are supported.Any protocol layers I have missed will fail to compile because I now
pass an extra parameter into the socket creation code.[ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]
Signed-off-by: Eric W. Biederman
Signed-off-by: David S. Miller -
This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.Signed-off-by: Eric W. Biederman
Signed-off-by: David S. Miller
31 Jul, 2007
1 commit
-
The following code can now become static:
- struct unix_socket_table
- unix_table_lockSigned-off-by: Adrian Bunk
Signed-off-by: David S. Miller
12 Jul, 2007
1 commit
-
Throw out the old mark & sweep garbage collector and put in a
refcounting cycle detecting one.The old one had a race with recvmsg, that resulted in false positives
and hence data loss. The old algorithm operated on all unix sockets
in the system, so any additional locking would have meant performance
problems for all users of these.The new algorithm instead only operates on "in flight" sockets, which
are very rare, and the additional locking for these doesn't negatively
impact the vast majority of users.In fact it's probable, that there weren't *any* heavy senders of
sockets over sockets, otherwise the above race would have been
discovered long ago.The patch works OK with the app that exposed the race with the old
code. The garbage collection has also been verified to work in a few
simple cases.Signed-off-by: Miklos Szeredi
Signed-off-by: David S. Miller
11 Jul, 2007
1 commit
-
Make all initialized struct seq_operations in net/ const
Signed-off-by: Philippe De Muyter
Signed-off-by: David S. Miller
08 Jun, 2007
1 commit
-
A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
send()+close() on the peer, causing recv() to return zero, even though
the sent data should be received.This happens if the send() and the close() is performed between
skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():process A skb_dequeue() returns NULL, there's no data in the socket queue
process B new data is inserted onto the queue by unix_stream_sendmsg()
process B sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
process A sk->sk_shutdown is checked, unix_release_sock() returns zeroI'm surprised nobody noticed this, it's not hard to trigger. Maybe
it's just (un)luck with the timing.It's possible to work around this bug in userspace, by retrying the
recv() once in case of a zero return value.Signed-off-by: Miklos Szeredi
Signed-off-by: David S. Miller
04 Jun, 2007
2 commits
-
Based upon an excellent bug report and initial patch by
Frederik Deweerdt.The UNIX datagram connect code blindly dereferences other->sk_socket
via the call down to the security_unix_may_send() function.Without locking 'other' that pointer can go NULL via unix_release_sock()
which does sock_orphan() which also marks the socket SOCK_DEAD.So we have to lock both 'sk' and 'other' yet avoid all kinds of
potential deadlocks (connect to self is OK for datagram sockets and it
is possible for two datagram sockets to perform a simultaneous connect
to each other). So what we do is have a "double lock" function similar
to how we handle this situation in other areas of the kernel. We take
the lock of the socket pointer with the smallest address first in
order to avoid ABBA style deadlocks.Once we have them both locked, we check to see if SOCK_DEAD is set
for 'other' and if so, drop everything and retry the lookup.Signed-off-by: David S. Miller
-
The unix_state_*() locking macros imply that there is some
rwlock kind of thing going on, but the implementation is
actually a spinlock which makes the code more confusing than
it needs to be.So use plain unix_state_lock and unix_state_unlock.
Signed-off-by: David S. Miller
09 May, 2007
1 commit
-
Remove includes of where it is not used/needed.
Suggested by Al Viro.Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Apr, 2007
1 commit
-
For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.This one touches just the most simple cases:
skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()The next ones will handle the slightly more "complex" cases.
Signed-off-by: Arnaldo Carvalho de Melo
Signed-off-by: David S. Miller
07 Mar, 2007
1 commit
-
This reverts two changes:
8488df894d05d6fa41c2bd298c335f944bb0e401
248f06726e866942b3d8ca8f411f9067713b7ff8A backlog value of N really does mean allow "N + 1" connections
to queue to a listening socket. This allows one to specify
"0" as the backlog and still get 1 connection.Noticed by Gerrit Renker and Rick Jones.
Signed-off-by: David S. Miller
03 Mar, 2007
1 commit
-
This brings things inline with the sk_acceptq_is_full() bug
fix. The limit test should be x >= sk_max_ack_backlog.Signed-off-by: David S. Miller
15 Feb, 2007
2 commits
-
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.Signed-off-by: Eric W. Biederman
Acked-by: Ralf Baechle
Acked-by: Martin Schwidefsky
Cc: Russell King
Cc: David Howells
Cc: "Luck, Tony"
Cc: Ralf Baechle
Cc: Paul Mackerras
Cc: Martin Schwidefsky
Cc: Andi Kleen
Cc: Jens Axboe
Cc: Corey Minyard
Cc: Neil Brown
Cc: "John W. Linville"
Cc: James Bottomley
Cc: Jan Kara
Cc: Trond Myklebust
Cc: Mark Fasheh
Cc: David Chinner
Cc: "David S. Miller"
Cc: Patrick McHardy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).Signed-off-by: Tim Schmielau
Acked-by: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Feb, 2007
1 commit
-
Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.Signed-off-by: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Feb, 2007
1 commit
-
Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller
09 Dec, 2006
1 commit
-
Signed-off-by: Josef Sipek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Dec, 2006
1 commit
-
Signed-off-by: Al Viro
Signed-off-by: David S. Miller
23 Sep, 2006
2 commits
-
Signed-off-by: Brian Haley
Signed-off-by: David S. Miller -
Signed-off-by: YOSHIFUJI Hideaki
Signed-off-by: David S. Miller
03 Aug, 2006
1 commit
-
From: Catherine Zhang
This patch implements a cleaner fix for the memory leak problem of the
original unix datagram getpeersec patch. Instead of creating a
security context each time a unix datagram is sent, we only create the
security context when the receiver requests it.This new design requires modification of the current
unix_getsecpeer_dgram LSM hook and addition of two new hooks, namely,
secid_to_secctx and release_secctx. The former retrieves the security
context and the latter releases it. A hook is required for releasing
the security context because it is up to the security module to decide
how that's done. In the case of Selinux, it's a simple kfree
operation.Acked-by: Stephen Smalley
Signed-off-by: David S. Miller
22 Jul, 2006
1 commit
-
Signed-off-by: Panagiotis Issaris
Signed-off-by: David S. Miller
04 Jul, 2006
2 commits
-
The unix_get_peersec_dgram() stub should have been inlined so that it
disappears.Signed-off-by: Andrew Morton
Signed-off-by: David S. Miller -
Teach special (recursive) locking code to the lock validator. Also splits
af_unix's sk_receive_queue.lock class from the other networking skb-queue
locks. Has no effect on non-lockdep kernels.Signed-off-by: Ingo Molnar
Signed-off-by: Arjan van de Ven
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Jul, 2006
1 commit
-
Signed-off-by: Jörn Engel
Signed-off-by: Adrian Bunk
30 Jun, 2006
1 commit
-
This patch implements an API whereby an application can determine the
label of its peer's Unix datagram sockets via the auxiliary data mechanism of
recvmsg.Patch purpose:
This patch enables a security-aware application to retrieve the
security context of the peer of a Unix datagram socket. The application
can then use this security context to determine the security context for
processing on behalf of the peer who sent the packet.Patch design and implementation:
The design and implementation is very similar to the UDP case for INET
sockets. Basically we build upon the existing Unix domain socket API for
retrieving user credentials. Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message). To retrieve the security
context, the application first indicates to the kernel such desire by
setting the SO_PASSSEC option via getsockopt. Then the application
retrieves the security context using the auxiliary data mechanism.An example server application for Unix datagram socket should look like this:
toggle = 1;
toggle_len = sizeof(toggle);setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, &toggle_len);
recvmsg(sockfd, &msg_hdr, 0);
if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
if (cmsg_hdr->cmsg_len cmsg_level == SOL_SOCKET &&
cmsg_hdr->cmsg_type == SCM_SECURITY) {
memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
}
}sock_setsockopt is enhanced with a new socket option SOCK_PASSSEC to allow
a server socket to receive security context of the peer.Testing:
We have tested the patch by setting up Unix datagram client and server
applications. We verified that the server can retrieve the security context
using the auxiliary data mechanism of recvmsg.Signed-off-by: Catherine Zhang
Acked-by: Acked-by: James Morris
Signed-off-by: David S. Miller
26 Mar, 2006
1 commit
-
Implement the half-closed devices notifiation, by adding a new POLLRDHUP
(and its alias EPOLLRDHUP) bit to the existing poll/select sets. Since the
existing POLLHUP handling, that does not report correctly half-closed
devices, was feared to be changed, this implementation leaves the current
POLLHUP reporting unchanged and simply add a new bit that is set in the few
places where it makes sense. The same thing was discussed and conceptually
agreed quite some time ago:http://lkml.org/lkml/2003/7/12/116
Since this new event bit is added to the existing Linux poll infrastruture,
even the existing poll/select system calls will be able to use it. As far
as the existing POLLHUP handling, the patch leaves it as is. The
pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
archs and sets the bit in the six relevant files. The other attached diff
is the simple change required to sys/epoll.h to add the EPOLLRDHUP
definition.There is "a stupid program" to test POLLRDHUP delivery here:
http://www.xmailserver.org/pollrdhup-test.c
It tests poll(2), but since the delivery is same epoll(2) will work equally.
Signed-off-by: Davide Libenzi
Cc: "David S. Miller"
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Mar, 2006
3 commits
-
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.Signed-off-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: David S. Miller -
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.Signed-off-by: Arjan van de Ven
Signed-off-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: David S. Miller -
The patch below replaces a divide by 2 with a shift -- sk_sndbuf is an
integer, so gcc emits an idiv, which takes 10x longer than a shift by 1.
This improves af_unix bandwidth by ~6-10K/s. Also, tidy up the comment
to fit in 80 columns while we're at it.Signed-off-by: Benjamin LaHaise
Signed-off-by: David S. Miller
09 Mar, 2006
1 commit
-
I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench. Tested on both x86_64 and powerpc.The way we do file struct accounting is not very suitable for batched
freeing. For scalability reasons, file accounting was
constructor/destructor based. This meant that nr_files was decremented
only when the object was removed from the slab cache. This is susceptible
to slab fragmentation. With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -llm22:~ # cat /proc/sys/fs/file-nr
587730 0 758844At the same time, I see only a 2000+ objects in filp cache. The following
patch I fixes this problem.This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.Signed-off-by: Dipankar Sarma
Cc: "Paul E. McKenney"
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Jan, 2006
1 commit
-
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.Modified-by: Ingo Molnar
(finished the conversion)
Signed-off-by: Jes Sorensen
Signed-off-by: Ingo Molnar
04 Jan, 2006
5 commits
-
Currently all network protocols need to call dev_ioctl as the default
fallback in their ioctl implementations. This patch adds a fallback
to dev_ioctl to sock_ioctl if the protocol returned -ENOIOCTLCMD.
This way all the procotol ioctl handlers can be simplified and we don't
need to export dev_ioctl.Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller -
From: Benjamin LaHaise
In af_unix, a rwlock is used to protect internal state. At least on my
P4 with HT it is faster to use a spinlock due to the simpler memory
barrier used to unlock. This patch raises bw_unix to ~690K/s.Signed-off-by: David S. Miller
-
I noticed that some of 'struct proto_ops' used in the kernel may share
a cache line used by locks or other heavily modified data. (default
linker alignement is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
least)This patch makes sure a 'struct proto_ops' can be declared as const,
so that all cpus can share all parts of it without false sharing.This is not mandatory : a driver can still use a read/write structure
if it needs to (and eventually a __read_mostly)I made a global stubstitute to change all existing occurences to make
them const.This should reduce the possibility of false sharing on SMP, and
speedup some socket system calls.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
This lock is actually taken mostly as a writer,
so using a rwlock actually just makes performance
worse especially on chips like the Intel P4.Signed-off-by: David S. Miller
-
AF_UNIX stream socket performance on P4 CPUs tends to suffer due to a
lot of pipeline flushes from atomic operations. The patch below
removes the sock_hold() and sock_put() in unix_stream_sendmsg(). This
should be safe as the socket still holds a reference to its peer which
is only released after the file descriptor's final user invokes
unix_release_sock(). The only consideration is that we must add a
memory barrier before setting the peer initially.Signed-off-by: Benjamin LaHaise
Signed-off-by: David S. Miller
09 Nov, 2005
1 commit
-
Most permission() calls have a struct nameidata * available. This helper
takes that as an argument and thus makes sure we pass it down for lookup
intents and prepares for per-mount read-only support where we need a struct
vfsmount for checking whether a file is writeable.Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds