05 Sep, 2016

1 commit

  • Right now we use the 'readlock' both for protecting some of the af_unix
    IO path and for making the bind be single-threaded.

    The two are independent, but using the same lock makes for a nasty
    deadlock due to ordering with regards to filesystem locking. The bind
    locking would want to nest outside the VFS pathname locking, but the IO
    locking wants to nest inside some of those same locks.

    We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
    splice-bind deadlock") which moved the readlock inside the vfs locks,
    but that caused problems with overlayfs that will then call back into
    filesystem routines that take the lock in the wrong order anyway.

    Splitting the locks means that we can go back to having the bind lock be
    the outermost lock, and we don't have any deadlocks with lock ordering.
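
    A user-space sketch of the resulting lock order, using pthread mutexes in
    place of kernel locks. The names demo_sock, demo_fs_lock, demo_bind and
    demo_read_under_fs are hypothetical and only illustrate the nesting
    described above; this is not the kernel code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a unix socket after the lock split. */
struct demo_sock {
	pthread_mutex_t bindlock; /* serializes bind; outermost lock */
	pthread_mutex_t iolock;   /* protects the IO path; nests inside fs locks */
	int bound;
	int bytes_read;
};

/* Hypothetical stand-in for the VFS pathname locking. */
static pthread_mutex_t demo_fs_lock = PTHREAD_MUTEX_INITIALIZER;

static void demo_sock_init(struct demo_sock *s)
{
	pthread_mutex_init(&s->bindlock, NULL);
	pthread_mutex_init(&s->iolock, NULL);
	s->bound = 0;
	s->bytes_read = 0;
}

/* bind path: bindlock first, then the fs lock. iolock is never taken here. */
static int demo_bind(struct demo_sock *s)
{
	int ret = -1;

	pthread_mutex_lock(&s->bindlock);
	if (!s->bound) {
		pthread_mutex_lock(&demo_fs_lock); /* e.g. creating the path */
		s->bound = 1;
		pthread_mutex_unlock(&demo_fs_lock);
		ret = 0;
	}
	pthread_mutex_unlock(&s->bindlock);
	return ret;
}

/* IO reached from a filesystem path: fs lock first, then iolock. */
static void demo_read_under_fs(struct demo_sock *s, int n)
{
	assert(n >= 0);
	pthread_mutex_lock(&demo_fs_lock);
	pthread_mutex_lock(&s->iolock);
	s->bytes_read += n;
	pthread_mutex_unlock(&s->iolock);
	pthread_mutex_unlock(&demo_fs_lock);
}
```

    With a single lock playing both roles, the two paths would take it and
    the fs lock in opposite orders; with two locks, each path has one
    consistent order.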

    Acked-by: Rainer Weikusat
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Linus Torvalds
     

08 Feb, 2016

1 commit

  • The commit referenced in the Fixes tag incorrectly accounted the number
    of in-flight fds over a unix domain socket to the original opener
    of the file descriptor. This allows another process to arbitrarily
    deplete the original file opener's resource limit for the maximum number
    of open files. Instead, the sending process and its struct cred should
    be credited.

    To do so, we add a reference counted struct user_struct pointer to the
    scm_fp_list and use it to account for the number of inflight unix fds.
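
    A minimal sketch of the accounting idea in plain C. The types demo_user
    and demo_fp_list and all demo_* helpers are hypothetical stand-ins for
    struct user_struct and scm_fp_list; the point is only that the list pins
    the sender and charges the in-flight count to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_struct. */
struct demo_user {
	int refcount;
	unsigned long unix_inflight;
};

/* Hypothetical stand-in for scm_fp_list: remembers who sent the fds. */
struct demo_fp_list {
	int count;              /* number of fds carried */
	struct demo_user *user; /* sender, pinned while the list exists */
};

static void demo_fp_list_init(struct demo_fp_list *fpl,
			      struct demo_user *sender, int count)
{
	assert(sender != NULL && count >= 0);
	sender->refcount++; /* take a reference on the sender */
	fpl->user = sender;
	fpl->count = count;
}

/* Charge the in-flight fds to the sender, not to whoever opened them. */
static void demo_inflight(struct demo_fp_list *fpl)
{
	fpl->user->unix_inflight += (unsigned long)fpl->count;
}

static void demo_notinflight(struct demo_fp_list *fpl)
{
	assert(fpl->user->unix_inflight >= (unsigned long)fpl->count);
	fpl->user->unix_inflight -= (unsigned long)fpl->count;
}

static void demo_fp_list_destroy(struct demo_fp_list *fpl)
{
	fpl->user->refcount--; /* drop the reference taken at init */
	fpl->user = NULL;
}
```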

    Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets")
    Reported-by: David Herrmann
    Cc: David Herrmann
    Cc: Willy Tarreau
    Cc: Linus Torvalds
    Suggested-by: Linus Torvalds
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

24 Nov, 2015

1 commit

  • Rainer Weikusat writes:
    An AF_UNIX datagram socket being the client in an n:1 association with
    some server socket is only allowed to send messages to the server if the
    receive queue of this socket contains at most sk_max_ack_backlog
    datagrams. This implies that prospective writers might be forced to go
    to sleep even though none of the messages presently enqueued on the server
    receive queue were sent by them. In order to ensure that these will be
    woken up once space becomes again available, the present unix_dgram_poll
    routine does a second sock_poll_wait call with the peer_wait wait queue
    of the server socket as queue argument (unix_dgram_recvmsg does a wake
    up on this queue after a datagram was received). This is inherently
    problematic because the server socket is only guaranteed to remain alive
    for as long as the client still holds a reference to it. In case the
    connection is dissolved via connect or by the dead peer detection logic
    in unix_dgram_sendmsg, the server socket may be freed even though "the
    polling mechanism" (in particular, epoll) still has a pointer to the
    corresponding peer_wait queue. There's no way to forcibly deregister a
    wait queue with epoll.

    Based on an idea by Jason Baron, the patch below changes the code such
    that a wait_queue_t belonging to the client socket is enqueued on the
    peer_wait queue of the server whenever the peer receive queue full
    condition is detected by either a sendmsg or a poll. A wake up on the
    peer queue is then relayed to the ordinary wait queue of the client
    socket via wake function. The connection to the peer wait queue is again
    dissolved if either a wake up is about to be relayed or the client
    socket reconnects or a dead peer is detected or the client socket is
    itself closed. This enables removing the second sock_poll_wait from
    unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
    that no blocked writer sleeps forever.

    Signed-off-by: Rainer Weikusat
    Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
    Reviewed-by: Jason Baron
    Signed-off-by: David S. Miller

    Rainer Weikusat
     

08 Oct, 2015

1 commit


30 Sep, 2015

1 commit


11 Jun, 2015

1 commit

  • SCM_SECURITY was originally only implemented for datagram sockets,
    not for stream sockets. However, SCM_CREDENTIALS is supported on
    Unix stream sockets. For consistency, implement Unix stream support
    for SCM_SECURITY as well. Also clean up the existing code and get
    rid of the superfluous UNIXSID macro.

    Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1224211,
    where systemd was using SCM_CREDENTIALS and assumed wrongly that
    SCM_SECURITY was also supported on Unix stream sockets.

    Signed-off-by: Stephen Smalley
    Acked-by: Paul Moore
    Signed-off-by: David S. Miller

    Stephen Smalley
     

10 Aug, 2013

1 commit

  • unix_stream_sendmsg() currently uses order-2 allocations,
    and we had numerous reports this can fail.

    The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
    not helping.

    This patch extends the work done in commit eb6a24816b247c
    ("af_unix: reduce high order page allocations") for
    datagram sockets.

    This opens the possibility of zero copy IO (splice() and
    friends).

    The trick is to not use skb_pull() anymore in recvmsg() path,
    and instead add a @consumed field in UNIXCB() to track amount
    of already read payload in the skb.
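
    The consumed-offset idea can be sketched in user-space C as follows.
    demo_buf, demo_unread and demo_read are hypothetical names; the real
    field lives in UNIXCB() and the real reader is the recvmsg() path, so
    treat this only as an illustration of tracking progress without
    modifying the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb plus its control block. */
struct demo_buf {
	const char *data; /* payload is never shifted or trimmed */
	size_t len;       /* total payload length */
	size_t consumed;  /* bytes already handed to the reader */
};

static size_t demo_unread(const struct demo_buf *b)
{
	assert(b->consumed <= b->len);
	return b->len - b->consumed;
}

/* Copy up to 'want' bytes and advance the offset instead of pulling data. */
static size_t demo_read(struct demo_buf *b, char *dst, size_t want)
{
	size_t chunk = demo_unread(b);

	if (chunk > want)
		chunk = want;
	memcpy(dst, b->data + b->consumed, chunk);
	b->consumed += chunk;
	return chunk;
}
```

    Because the payload itself is left untouched, the same buffer could in
    principle be shared with other users, which is what makes zero-copy
    paths plausible.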

    There is a performance regression for large sends
    because of extra page allocations that will be addressed
    in a follow-up patch, allowing sock_alloc_send_pskb()
    to attempt high order page allocations.

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Aug, 2013

1 commit

  • There is a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.
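
    A small illustration, with a hypothetical function demo_add: both
    declarations below declare the same function with external linkage, so
    the extern keyword adds nothing.

```c
/* Declared with extern ... */
extern int demo_add(int a, int b);

/* ... and without. The two declarations are compatible and equivalent. */
int demo_add(int a, int b);

int demo_add(int a, int b)
{
	return a + b;
}
```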

    Reflow modified prototypes to 80 columns.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

02 May, 2013

1 commit

  • Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
    uses 64bit instructions to manipulate them.
    If the 64bit word includes any atomic_t or spinlock_t, we can lose
    critical concurrent changes.

    This is happening in af_unix, where unix_sk(sk)->gc_candidate/
    gc_maybe_cycle/lock share the same 64bit word.

    This leads to fatal deadlock, as one/several cpus spin forever
    on a spinlock that will never be available again.

    A safer way would be to use a long to store flags.
    This way we are sure compiler/arch won't do bad things.

    As we own unix_gc_lock spinlock when clearing or setting bits,
    we can use the non atomic __set_bit()/__clear_bit().

    recursion_level can share the same 64bit location with the spinlock,
    as it is set only with this spinlock held.

    [1] bug fixed in gcc-4.8.0 :
    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080
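
    A sketch of the long-based flags in plain C. The names demo_unix_sock,
    DEMO_GC_CANDIDATE, DEMO_GC_MAYBE_CYCLE and the demo_*_bit helpers are
    hypothetical; the helpers mimic the non-atomic __set_bit()/__clear_bit()
    and assume the caller already holds the lock that guards the flags.

```c
/* Bit numbers inside the flags word (hypothetical names). */
#define DEMO_GC_CANDIDATE   0
#define DEMO_GC_MAYBE_CYCLE 1

struct demo_unix_sock {
	int lock;                      /* placeholder for the spinlock */
	unsigned char recursion_level; /* only written with the lock held */
	unsigned long gc_flags;        /* own word; replaces the bit fields */
};

/* Non-atomic helpers: the caller is assumed to hold the guarding lock. */
static void demo_set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static void demo_clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static int demo_test_bit(int nr, const unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}
```

    Since the flags occupy a full unsigned long, a read-modify-write on them
    cannot touch the neighbouring lock word.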

    Reported-by: Ambrose Feinstein
    Signed-off-by: Eric Dumazet
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Apr, 2013

1 commit

  • Now that uids and gids are completely encapsulated in kuid_t
    and kgid_t we no longer need to pass struct cred which allowed
    us to test both the uid and the user namespace for equality.

    Passing struct cred potentially allows us to pass the entire group
    list as BSD does but I don't believe the cost of cache line misses
    justifies retaining code for a future potential application.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

22 Oct, 2012

1 commit


09 Jun, 2012

1 commit

  • /proc/net/unix has quadratic behavior, and can hold unix_table_lock for
    a while if a high number of unix sockets are alive. (90 ms for 200k
    sockets...)

    We already have a hash table, so it's quite easy to use it.

    Problem is unbound sockets are still hashed in a single hash slot
    (unix_socket_table[UNIX_HASH_TABLE])

    This patch also spreads unbound sockets to 256 hash slots, to speedup
    both /proc/net/unix and unix_diag.

    Time to read /proc/net/unix with 200k unix sockets :
    (time dd if=/proc/net/unix of=/dev/null bs=4k)

    before : 520 secs
    after : 2 secs
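
    One way to spread unbound sockets over 256 slots is to hash the socket's
    address. The sketch below is an assumption for illustration only:
    demo_unbound_slot, DEMO_HASH_SIZE, the mixing steps and the layout (a
    second block of 256 slots placed after the slots for bound sockets) are
    hypothetical and not taken from the patch.

```c
#include <stdint.h>

#define DEMO_HASH_SIZE 256

/*
 * Map a socket pointer to one of DEMO_HASH_SIZE slots reserved for
 * unbound sockets, assumed here to sit after the slots for bound ones.
 */
static unsigned int demo_unbound_slot(const void *sk)
{
	uintptr_t h = (uintptr_t)sk;

	h ^= h >> 16; /* fold higher address bits into the low ones */
	h ^= h >> 8;
	return DEMO_HASH_SIZE + (unsigned int)(h % DEMO_HASH_SIZE);
}
```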

    Signed-off-by: Eric Dumazet
    Cc: Steven Whitehouse
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Apr, 2012

1 commit


21 Mar, 2012

1 commit


31 Dec, 2011

1 commit


17 Dec, 2011

1 commit


25 Apr, 2011

1 commit

  • These header files are never installed for user consumption, so any
    __KERNEL__ cpp checks are superfluous.

    Projects should also not copy these files into their userland utility
    sources and try to use them there. If they insist on doing so, the
    onus is on them to sanitize the headers as needed.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Nov, 2010

1 commit

  • It's easy to eat all kernel memory and trigger NMI watchdog, using an
    exploit program that queues unix sockets on top of others.

    lkml ref : http://lkml.org/lkml/2010/11/25/8

    This mechanism is used in applications, so one choice we have is to add a
    recursion limit.

    Other limits might be needed as well (if we queue other types of files),
    since the passfd mechanism is currently limited by socket receive queue
    sizes only.

    Add a recursion_level to unix socket, allowing up to 4 levels.

    Each time we send a unix socket through sendfd mechanism, we copy its
    recursion level (plus one) to receiver. This recursion level is cleared
    when socket receive queue is emptied.
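
    The recursion-level bookkeeping can be sketched as below. demo_usock,
    demo_send_socket and demo_drain are hypothetical names, and two details
    are choices of this sketch rather than facts from the description: the
    limit is checked after adding one, and the receiver keeps the larger of
    its current level and the new one.

```c
#define DEMO_MAX_RECURSION_LEVEL 4

/* Hypothetical stand-in for a unix socket. */
struct demo_usock {
	unsigned char recursion_level;
	int queued; /* number of sockets sitting in the receive queue */
};

/*
 * Queue 'passed' on 'receiver'. Returns 0 on success, -1 if the nesting
 * would exceed the limit (the kernel would return an errno instead).
 */
static int demo_send_socket(const struct demo_usock *passed,
			    struct demo_usock *receiver)
{
	unsigned int level = passed->recursion_level + 1u;

	if (level > DEMO_MAX_RECURSION_LEVEL)
		return -1;
	if (level > receiver->recursion_level)
		receiver->recursion_level = (unsigned char)level;
	receiver->queued++;
	return 0;
}

/* Emptying the receive queue clears the recursion level. */
static void demo_drain(struct demo_usock *s)
{
	s->queued = 0;
	s->recursion_level = 0;
}
```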

    Reported-by: Марк Коренберг
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jun, 2010

1 commit


02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Replace sk_sleep pointer in "struct sock" with sk_wq, a pointer to
    "struct socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
    a rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to eventually wakeup tasks.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
    instead of dynamically allocated ones. They don't need rcu freeing.

    Some cleanups or followups are probably needed, (possible
    sk_callback_lock conversion to a spinlock for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Nov, 2008

1 commit

  • This is an implementation of David Miller's suggested fix in:
    https://bugzilla.redhat.com/show_bug.cgi?id=470201

    It has been updated to use wait_event() instead of
    wait_event_interruptible().

    Paraphrasing the description from the above report, it makes sendmsg()
    block while UNIX garbage collection is in progress. This avoids a
    situation where child processes continue to queue new FDs over an
    AF_UNIX socket to a parent which is in the exit path and running
    garbage collection on these FDs. This contention can result in soft
    lockups and oom-killing of unrelated processes.

    Signed-off-by: dann frazier
    Signed-off-by: David S. Miller

    dann frazier
     

10 Nov, 2008

1 commit

  • Previously I assumed that the receive queues of candidates don't
    change during the GC. This is only half true, nothing can be received
    from the queues (see comment in unix_gc()), but buffers could be added
    through the other half of the socket pair, which may still have file
    descriptors referring to it.

    This can result in inc_inflight_move_tail() erroneously increasing the
    "inflight" counter for a unix socket for which dec_inflight() wasn't
    previously called. This in turn can trigger the "BUG_ON(total_refs <
    inflight_refs)" in a later garbage collection run.

    Fix this by only manipulating the "inflight" counter for sockets which
    are candidates themselves. Duplicating the file references in
    unix_attach_fds() is also needed to prevent a socket becoming a
    candidate for GC while the skb that contains it is not yet queued.

    Reported-by: Andrea Bittau
    Signed-off-by: Miklos Szeredi
    CC: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

27 Jul, 2008

1 commit


29 Jan, 2008

2 commits


11 Nov, 2007

1 commit


31 Jul, 2007

1 commit


12 Jul, 2007

1 commit

  • Throw out the old mark & sweep garbage collector and put in a
    refcounting cycle detecting one.

    The old one had a race with recvmsg, that resulted in false positives
    and hence data loss. The old algorithm operated on all unix sockets
    in the system, so any additional locking would have meant performance
    problems for all users of these.

    The new algorithm instead only operates on "in flight" sockets, which
    are very rare, and the additional locking for these doesn't negatively
    impact the vast majority of users.

    In fact, it's probable that there weren't *any* heavy senders of
    sockets over sockets, otherwise the above race would have been
    discovered long ago.

    The patch works OK with the app that exposed the race with the old
    code. The garbage collection has also been verified to work in a few
    simple cases.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
     

04 Jun, 2007

1 commit


03 Aug, 2006

1 commit

  • From: Catherine Zhang

    This patch implements a cleaner fix for the memory leak problem of the
    original unix datagram getpeersec patch. Instead of creating a
    security context each time a unix datagram is sent, we only create the
    security context when the receiver requests it.

    This new design requires modification of the current
    unix_getsecpeer_dgram LSM hook and addition of two new hooks, namely,
    secid_to_secctx and release_secctx. The former retrieves the security
    context and the latter releases it. A hook is required for releasing
    the security context because it is up to the security module to decide
    how that's done. In the case of SELinux, it's a simple kfree
    operation.
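
    The retrieve/release pairing can be sketched in user-space C as follows.
    demo_lsm_ops and the demo_* functions are hypothetical; the sketch
    assumes a module that allocates the context string on demand and frees
    it with free(), standing in for the kfree case mentioned above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hook table with the retrieve/release pair. */
struct demo_lsm_ops {
	int (*secid_to_secctx)(unsigned int secid, char **secdata,
			       unsigned int *seclen);
	void (*release_secctx)(char *secdata, unsigned int seclen);
};

/* Build the context string only when a receiver asks for it. */
static int demo_secid_to_secctx(unsigned int secid, char **secdata,
				unsigned int *seclen)
{
	char buf[32];
	int n = snprintf(buf, sizeof(buf), "demo_ctx_%u", secid);

	if (n < 0 || (size_t)n >= sizeof(buf))
		return -1;
	*secdata = (char *)malloc((size_t)n + 1);
	if (*secdata == NULL)
		return -1;
	memcpy(*secdata, buf, (size_t)n + 1);
	*seclen = (unsigned int)n;
	return 0;
}

/* The module decides how to release; here it is a plain free(). */
static void demo_release_secctx(char *secdata, unsigned int seclen)
{
	(void)seclen;
	free(secdata);
}

static const struct demo_lsm_ops demo_ops = {
	demo_secid_to_secctx,
	demo_release_secctx,
};
```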

    Acked-by: Stephen Smalley
    Signed-off-by: David S. Miller

    Catherine Zhang
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Also splits
    af_unix's sk_receive_queue.lock class from the other networking skb-queue
    locks. Has no effect on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

30 Jun, 2006

1 commit

  • This patch implements an API whereby an application can determine the
    label of its peer's Unix datagram sockets via the auxiliary data mechanism of
    recvmsg.

    Patch purpose:

    This patch enables a security-aware application to retrieve the
    security context of the peer of a Unix datagram socket. The application
    can then use this security context to determine the security context for
    processing on behalf of the peer who sent the packet.

    Patch design and implementation:

    The design and implementation is very similar to the UDP case for INET
    sockets. Basically we build upon the existing Unix domain socket API for
    retrieving user credentials. Linux offers the API for obtaining user
    credentials via ancillary messages (i.e., out of band/control messages
    that are bundled together with a normal message). To retrieve the security
    context, the application first indicates to the kernel such desire by
    setting the SO_PASSSEC option via setsockopt. Then the application
    retrieves the security context using the auxiliary data mechanism.

    An example server application for a Unix datagram socket should look like this:

    toggle = 1;
    toggle_len = sizeof(toggle);

    /* setsockopt() takes the option length by value, not by pointer */
    setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, toggle_len);
    recvmsg(sockfd, &msg_hdr, 0);
    if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
        cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
        if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
            cmsg_hdr->cmsg_level == SOL_SOCKET &&
            cmsg_hdr->cmsg_type == SCM_SECURITY) {
            memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
        }
    }

    sock_setsockopt is enhanced with a new socket option SOCK_PASSSEC to allow
    a server socket to receive security context of the peer.

    Testing:

    We have tested the patch by setting up Unix datagram client and server
    applications. We verified that the server can retrieve the security context
    using the auxiliary data mechanism of recvmsg.

    Signed-off-by: Catherine Zhang
    Acked-by: James Morris
    Signed-off-by: David S. Miller

    Catherine Zhang
     

26 Apr, 2006

1 commit


21 Mar, 2006

1 commit

  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Ingo Molnar
     

04 Jan, 2006

2 commits


30 Aug, 2005

1 commit

  • Of this type, mostly:

    CHECK net/ipv6/netfilter.c
    net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
    net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds