15 Oct, 2007

1 commit

  • make sync wakeups affine for cache-cold tasks: if a cache-cold task
    is woken up by a sync wakeup then use the opportunity to migrate it
    straight away. (the two tasks are 'related' because they communicate)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

11 Oct, 2007

3 commits

  • This concerns the ipv4 and ipv6 code mostly, but also the netlink
    and unix sockets.

    The netlink code is an example of how to use the __seq_open_private()
    call - it saves the net namespace on this private.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This patch passes in the namespace a new socket should be created in
    and has the socket code do the appropriate reference counting. By
    virtue of this all socket create methods are touched. In addition
    the socket create methods are modified so that they will fail if
    you attempt to create a socket in a non-default network namespace.

    Failing if we attempt to create a socket outside of the default
    network namespace ensures that as we incrementally make the network stack
    network namespace aware we will not export functionality that someone
    has not audited and made certain is network namespace safe.
    Allowing us to partially enable network namespaces before all of the
    exotic protocols are supported.

    Any protocol layers I have missed will fail to compile because I now
    pass an extra parameter into the socket creation code.

    [ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to be handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

31 Jul, 2007

1 commit


12 Jul, 2007

1 commit

  • Throw out the old mark & sweep garbage collector and put in a
    refcounting cycle detecting one.

    The old one had a race with recvmsg, that resulted in false positives
    and hence data loss. The old algorithm operated on all unix sockets
    in the system, so any additional locking would have meant performance
    problems for all users of these.

    The new algorithm instead only operates on "in flight" sockets, which
    are very rare, and the additional locking for these doesn't negatively
    impact the vast majority of users.

    In fact it's probable, that there weren't *any* heavy senders of
    sockets over sockets, otherwise the above race would have been
    discovered long ago.

    The patch works OK with the app that exposed the race with the old
    code. The garbage collection has also been verified to work in a few
    simple cases.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
     

11 Jul, 2007

1 commit


08 Jun, 2007

1 commit

  • A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
    send()+close() on the peer, causing recv() to return zero, even though
    the sent data should be received.

    This happens if the send() and the close() is performed between
    skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():

    process A skb_dequeue() returns NULL, there's no data in the socket queue
    process B new data is inserted onto the queue by unix_stream_sendmsg()
    process B sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
    process A sk->sk_shutdown is checked, unix_release_sock() returns zero

    I'm surprised nobody noticed this, it's not hard to trigger. Maybe
    it's just (un)luck with the timing.

    It's possible to work around this bug in userspace, by retrying the
    recv() once in case of a zero return value.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
     

04 Jun, 2007

2 commits

  • Based upon an excellent bug report and initial patch by
    Frederik Deweerdt.

    The UNIX datagram connect code blindly dereferences other->sk_socket
    via the call down to the security_unix_may_send() function.

    Without locking 'other' that pointer can go NULL via unix_release_sock()
    which does sock_orphan() which also marks the socket SOCK_DEAD.

    So we have to lock both 'sk' and 'other' yet avoid all kinds of
    potential deadlocks (connect to self is OK for datagram sockets and it
    is possible for two datagram sockets to perform a simultaneous connect
    to each other). So what we do is have a "double lock" function similar
    to how we handle this situation in other areas of the kernel. We take
    the lock of the socket pointer with the smallest address first in
    order to avoid ABBA style deadlocks.

    Once we have them both locked, we check to see if SOCK_DEAD is set
    for 'other' and if so, drop everything and retry the lookup.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The unix_state_*() locking macros imply that there is some
    rwlock kind of thing going on, but the implementation is
    actually a spinlock which makes the code more confusing than
    it needs to be.

    So use plain unix_state_lock and unix_state_unlock.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 May, 2007

1 commit


26 Apr, 2007

1 commit

  • For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
    later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
    64bit land while possibly keeping it as a pointer on 32bit.

    This one touches just the most simple cases:

    skb->h.raw = skb->data;
    skb->h.raw = {skb_push|[__]skb_pull}()

    The next ones will handle the slightly more "complex" cases.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

07 Mar, 2007

1 commit

  • This reverts two changes:

    8488df894d05d6fa41c2bd298c335f944bb0e401
    248f06726e866942b3d8ca8f411f9067713b7ff8

    A backlog value of N really does mean allow "N + 1" connections
    to queue to a listening socket. This allows one to specify
    "0" as the backlog and still get 1 connection.

    Noticed by Gerrit Renker and Rick Jones.

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Mar, 2007

1 commit


15 Feb, 2007

2 commits

  • The semantic effect of insert_at_head is that it would allow new registered
    sysctl entries to override existing sysctl entries of the same name. Which is
    pain for caching and the proc interface never implemented.

    I have done an audit and discovered that none of the current users of
    register_sysctl care as (excpet for directories) they do not register
    duplicate sysctl entries.

    So this patch simply removes the support for overriding existing entries in
    the sys_sysctl interface since no one uses it or cares and it makes future
    enhancments harder.

    Signed-off-by: Eric W. Biederman
    Acked-by: Ralf Baechle
    Acked-by: Martin Schwidefsky
    Cc: Russell King
    Cc: David Howells
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Jens Axboe
    Cc: Corey Minyard
    Cc: Neil Brown
    Cc: "John W. Linville"
    Cc: James Bottomley
    Cc: Jan Kara
    Cc: Trond Myklebust
    Cc: Mark Fasheh
    Cc: David Chinner
    Cc: "David S. Miller"
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • After Al Viro (finally) succeeded in removing the sched.h #include in module.h
    recently, it makes sense again to remove other superfluous sched.h includes.
    There are quite a lot of files which include it but don't actually need
    anything defined in there. Presumably these includes were once needed for
    macros that used to live in sched.h, but moved to other header files in the
    course of cleaning it up.

    To ease the pain, this time I did not fiddle with any header files and only
    removed #includes from .c-files, which tend to cause less trouble.

    Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
    arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
    allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
    configs in arch/arm/configs on arm. I also checked that no new warnings were
    introduced by the patch (actually, some warnings are removed that were emitted
    by unnecessarily included header files).

    Signed-off-by: Tim Schmielau
    Acked-by: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Schmielau
     

13 Feb, 2007

1 commit

  • Many struct file_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

11 Feb, 2007

1 commit


09 Dec, 2006

1 commit


03 Dec, 2006

1 commit


23 Sep, 2006

2 commits


03 Aug, 2006

1 commit

  • From: Catherine Zhang

    This patch implements a cleaner fix for the memory leak problem of the
    original unix datagram getpeersec patch. Instead of creating a
    security context each time a unix datagram is sent, we only create the
    security context when the receiver requests it.

    This new design requires modification of the current
    unix_getsecpeer_dgram LSM hook and addition of two new hooks, namely,
    secid_to_secctx and release_secctx. The former retrieves the security
    context and the latter releases it. A hook is required for releasing
    the security context because it is up to the security module to decide
    how that's done. In the case of Selinux, it's a simple kfree
    operation.

    Acked-by: Stephen Smalley
    Signed-off-by: David S. Miller

    Catherine Zhang
     

22 Jul, 2006

1 commit


04 Jul, 2006

2 commits

  • The unix_get_peersec_dgram() stub should have been inlined so that it
    disappears.

    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Andrew Morton
     
  • Teach special (recursive) locking code to the lock validator. Also splits
    af_unix's sk_receive_queue.lock class from the other networking skb-queue
    locks. Has no effect on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jul, 2006

1 commit


30 Jun, 2006

1 commit

  • This patch implements an API whereby an application can determine the
    label of its peer's Unix datagram sockets via the auxiliary data mechanism of
    recvmsg.

    Patch purpose:

    This patch enables a security-aware application to retrieve the
    security context of the peer of a Unix datagram socket. The application
    can then use this security context to determine the security context for
    processing on behalf of the peer who sent the packet.

    Patch design and implementation:

    The design and implementation is very similar to the UDP case for INET
    sockets. Basically we build upon the existing Unix domain socket API for
    retrieving user credentials. Linux offers the API for obtaining user
    credentials via ancillary messages (i.e., out of band/control messages
    that are bundled together with a normal message). To retrieve the security
    context, the application first indicates to the kernel such desire by
    setting the SO_PASSSEC option via getsockopt. Then the application
    retrieves the security context using the auxiliary data mechanism.

    An example server application for Unix datagram socket should look like this:

    toggle = 1;
    toggle_len = sizeof(toggle);

    setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, &toggle_len);
    recvmsg(sockfd, &msg_hdr, 0);
    if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
    cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
    if (cmsg_hdr->cmsg_len cmsg_level == SOL_SOCKET &&
    cmsg_hdr->cmsg_type == SCM_SECURITY) {
    memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
    }
    }

    sock_setsockopt is enhanced with a new socket option SOCK_PASSSEC to allow
    a server socket to receive security context of the peer.

    Testing:

    We have tested the patch by setting up Unix datagram client and server
    applications. We verified that the server can retrieve the security context
    using the auxiliary data mechanism of recvmsg.

    Signed-off-by: Catherine Zhang
    Acked-by: Acked-by: James Morris
    Signed-off-by: David S. Miller

    Catherine Zhang
     

26 Mar, 2006

1 commit

  • Implement the half-closed devices notifiation, by adding a new POLLRDHUP
    (and its alias EPOLLRDHUP) bit to the existing poll/select sets. Since the
    existing POLLHUP handling, that does not report correctly half-closed
    devices, was feared to be changed, this implementation leaves the current
    POLLHUP reporting unchanged and simply add a new bit that is set in the few
    places where it makes sense. The same thing was discussed and conceptually
    agreed quite some time ago:

    http://lkml.org/lkml/2003/7/12/116

    Since this new event bit is added to the existing Linux poll infrastruture,
    even the existing poll/select system calls will be able to use it. As far
    as the existing POLLHUP handling, the patch leaves it as is. The
    pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
    archs and sets the bit in the six relevant files. The other attached diff
    is the simple change required to sys/epoll.h to add the EPOLLRDHUP
    definition.

    There is "a stupid program" to test POLLRDHUP delivery here:

    http://www.xmailserver.org/pollrdhup-test.c

    It tests poll(2), but since the delivery is same epoll(2) will work equally.

    Signed-off-by: Davide Libenzi
    Cc: "David S. Miller"
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

21 Mar, 2006

3 commits

  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Ingo Molnar
     
  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Arjan van de Ven
     
  • The patch below replaces a divide by 2 with a shift -- sk_sndbuf is an
    integer, so gcc emits an idiv, which takes 10x longer than a shift by 1.
    This improves af_unix bandwidth by ~6-10K/s. Also, tidy up the comment
    to fit in 80 columns while we're at it.

    Signed-off-by: Benjamin LaHaise
    Signed-off-by: David S. Miller

    Benjamin LaHaise
     

09 Mar, 2006

1 commit

  • I have benchmarked this on an x86_64 NUMA system and see no significant
    performance difference on kernbench. Tested on both x86_64 and powerpc.

    The way we do file struct accounting is not very suitable for batched
    freeing. For scalability reasons, file accounting was
    constructor/destructor based. This meant that nr_files was decremented
    only when the object was removed from the slab cache. This is susceptible
    to slab fragmentation. With RCU based file structure, consequent batched
    freeing and a test program like Serge's, we just speed this up and end up
    with a very fragmented slab -

    llm22:~ # cat /proc/sys/fs/file-nr
    587730 0 758844

    At the same time, I see only a 2000+ objects in filp cache. The following
    patch I fixes this problem.

    This patch changes the file counting by removing the filp_count_lock.
    Instead we use a separate percpu counter, nr_files, for now and all
    accesses to it are through get_nr_files() api. In the sysctl handler for
    nr_files, we populate files_stat.nr_files before returning to user.

    Counting files as an when they are created and destroyed (as opposed to
    inside slab) allows us to correctly count open files with RCU.

    Signed-off-by: Dipankar Sarma
    Cc: "Paul E. McKenney"
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     

10 Jan, 2006

1 commit


04 Jan, 2006

5 commits

  • Currently all network protocols need to call dev_ioctl as the default
    fallback in their ioctl implementations. This patch adds a fallback
    to dev_ioctl to sock_ioctl if the protocol returned -ENOIOCTLCMD.
    This way all the procotol ioctl handlers can be simplified and we don't
    need to export dev_ioctl.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • From: Benjamin LaHaise

    In af_unix, a rwlock is used to protect internal state. At least on my
    P4 with HT it is faster to use a spinlock due to the simpler memory
    barrier used to unlock. This patch raises bw_unix to ~690K/s.

    Signed-off-by: David S. Miller

    Benjamin LaHaise
     
  • I noticed that some of 'struct proto_ops' used in the kernel may share
    a cache line used by locks or other heavily modified data. (default
    linker alignement is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
    least)

    This patch makes sure a 'struct proto_ops' can be declared as const,
    so that all cpus can share all parts of it without false sharing.

    This is not mandatory : a driver can still use a read/write structure
    if it needs to (and eventually a __read_mostly)

    I made a global stubstitute to change all existing occurences to make
    them const.

    This should reduce the possibility of false sharing on SMP, and
    speedup some socket system calls.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This lock is actually taken mostly as a writer,
    so using a rwlock actually just makes performance
    worse especially on chips like the Intel P4.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • AF_UNIX stream socket performance on P4 CPUs tends to suffer due to a
    lot of pipeline flushes from atomic operations. The patch below
    removes the sock_hold() and sock_put() in unix_stream_sendmsg(). This
    should be safe as the socket still holds a reference to its peer which
    is only released after the file descriptor's final user invokes
    unix_release_sock(). The only consideration is that we must add a
    memory barrier before setting the peer initially.

    Signed-off-by: Benjamin LaHaise
    Signed-off-by: David S. Miller

    Benjamin LaHaise
     

09 Nov, 2005

1 commit

  • Most permission() calls have a struct nameidata * available. This helper
    takes that as an argument and thus makes sure we pass it down for lookup
    intents and prepares for per-mount read-only support where we need a struct
    vfsmount for checking whether a file is writeable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig