02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

21 Jul, 2017

1 commit


19 Jul, 2017

1 commit

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     

17 Jul, 2017

1 commit

  • All unix sockets now account inflight FDs to the respective sender.
    This was introduced in:

    commit 712f4aad406bb1ed67f3f98d04c044191f0ff593
    Author: willy tarreau
    Date: Sun Jan 10 07:54:56 2016 +0100

    unix: properly account for FDs passed over unix sockets

    and further refined in:

    commit 415e3d3e90ce9e18727e8843ae343eda5a58fad6
    Author: Hannes Frederic Sowa
    Date: Wed Feb 3 02:11:03 2016 +0100

    unix: correctly track in-flight fds in sending process user_struct

    Hence, regardless of the stacking depth of FDs, the total number of
    inflight FDs is limited, and accounted. There is no known way for a
    local user to exceed those limits or exploit the accounting.

    Furthermore, the GC logic is independent of the recursion/stacking depth
    as well. It solely depends on the total number of inflight FDs,
    regardless of their layout.

    Lastly, the current `recursion_level' suffers a TOCTOU race, since it
    checks and inherits depths only at queue time. If we consider `A
    Cc: Simon McVittie
    Signed-off-by: David Herrmann
    Reviewed-by: Tom Gundersen
    Signed-off-by: David S. Miller

    David Herrmann
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP when using the sch_fq packet
    scheduler for this is not practical. From Eric Dumazet.

    3) Support mutliple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

01 Jul, 2017

2 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain functions pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Sep, 2016

1 commit

  • Right now we use the 'readlock' both for protecting some of the af_unix
    IO path and for making the bind be single-threaded.

    The two are independent, but using the same lock makes for a nasty
    deadlock due to ordering with regards to filesystem locking. The bind
    locking would want to nest outside the VSF pathname locking, but the IO
    locking wants to nest inside some of those same locks.

    We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
    splice-bind deadlock") which moved the readlock inside the vfs locks,
    but that caused problems with overlayfs that will then call back into
    filesystem routines that take the lock in the wrong order anyway.

    Splitting the locks means that we can go back to having the bind lock be
    the outermost lock, and we don't have any deadlocks with lock ordering.

    Acked-by: Rainer Weikusat
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Linus Torvalds
     

08 Feb, 2016

1 commit

  • The commit referenced in the Fixes tag incorrectly accounted the number
    of in-flight fds over a unix domain socket to the original opener
    of the file-descriptor. This allows another process to arbitrary
    deplete the original file-openers resource limit for the maximum of
    open files. Instead the sending processes and its struct cred should
    be credited.

    To do so, we add a reference counted struct user_struct pointer to the
    scm_fp_list and use it to account for the number of inflight unix fds.

    Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets")
    Reported-by: David Herrmann
    Cc: David Herrmann
    Cc: Willy Tarreau
    Cc: Linus Torvalds
    Suggested-by: Linus Torvalds
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

24 Nov, 2015

1 commit

  • Rainer Weikusat writes:
    An AF_UNIX datagram socket being the client in an n:1 association with
    some server socket is only allowed to send messages to the server if the
    receive queue of this socket contains at most sk_max_ack_backlog
    datagrams. This implies that prospective writers might be forced to go
    to sleep despite none of the message presently enqueued on the server
    receive queue were sent by them. In order to ensure that these will be
    woken up once space becomes again available, the present unix_dgram_poll
    routine does a second sock_poll_wait call with the peer_wait wait queue
    of the server socket as queue argument (unix_dgram_recvmsg does a wake
    up on this queue after a datagram was received). This is inherently
    problematic because the server socket is only guaranteed to remain alive
    for as long as the client still holds a reference to it. In case the
    connection is dissolved via connect or by the dead peer detection logic
    in unix_dgram_sendmsg, the server socket may be freed despite "the
    polling mechanism" (in particular, epoll) still has a pointer to the
    corresponding peer_wait queue. There's no way to forcibly deregister a
    wait queue with epoll.

    Based on an idea by Jason Baron, the patch below changes the code such
    that a wait_queue_t belonging to the client socket is enqueued on the
    peer_wait queue of the server whenever the peer receive queue full
    condition is detected by either a sendmsg or a poll. A wake up on the
    peer queue is then relayed to the ordinary wait queue of the client
    socket via wake function. The connection to the peer wait queue is again
    dissolved if either a wake up is about to be relayed or the client
    socket reconnects or a dead peer is detected or the client socket is
    itself closed. This enables removing the second sock_poll_wait from
    unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
    that no blocked writer sleeps forever.

    Signed-off-by: Rainer Weikusat
    Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
    Reviewed-by: Jason Baron
    Signed-off-by: David S. Miller

    Rainer Weikusat
     

08 Oct, 2015

1 commit


30 Sep, 2015

1 commit


11 Jun, 2015

1 commit

  • SCM_SECURITY was originally only implemented for datagram sockets,
    not for stream sockets. However, SCM_CREDENTIALS is supported on
    Unix stream sockets. For consistency, implement Unix stream support
    for SCM_SECURITY as well. Also clean up the existing code and get
    rid of the superfluous UNIXSID macro.

    Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1224211,
    where systemd was using SCM_CREDENTIALS and assumed wrongly that
    SCM_SECURITY was also supported on Unix stream sockets.

    Signed-off-by: Stephen Smalley
    Acked-by: Paul Moore
    Signed-off-by: David S. Miller

    Stephen Smalley
     

10 Aug, 2013

1 commit

  • unix_stream_sendmsg() currently uses order-2 allocations,
    and we had numerous reports this can fail.

    The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
    not helping.

    This patch extends the work done in commit eb6a24816b247c
    ("af_unix: reduce high order page allocations) for
    datagram sockets.

    This opens the possibility of zero copy IO (splice() and
    friends)

    The trick is to not use skb_pull() anymore in recvmsg() path,
    and instead add a @consumed field in UNIXCB() to track amount
    of already read payload in the skb.

    There is a performance regression for large sends
    because of extra page allocations that will be addressed
    in a follow-up patch, allowing sock_alloc_send_pskb()
    to attempt high order page allocations.

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Aug, 2013

1 commit

  • There are a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.

    Reflow modified prototypes to 80 columns.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

02 May, 2013

1 commit

  • Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
    uses 64bit instructions to manipulate them.
    If the 64bit word includes any atomic_t or spinlock_t, we can lose
    critical concurrent changes.

    This is happening in af_unix, where unix_sk(sk)->gc_candidate/
    gc_maybe_cycle/lock share the same 64bit word.

    This leads to fatal deadlock, as one/several cpus spin forever
    on a spinlock that will never be available again.

    A safer way would be to use a long to store flags.
    This way we are sure compiler/arch wont do bad things.

    As we own unix_gc_lock spinlock when clearing or setting bits,
    we can use the non atomic __set_bit()/__clear_bit().

    recursion_level can share the same 64bit location with the spinlock,
    as it is set only with this spinlock held.

    [1] bug fixed in gcc-4.8.0 :
    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

    Reported-by: Ambrose Feinstein
    Signed-off-by: Eric Dumazet
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Apr, 2013

1 commit

  • Now that uids and gids are completely encapsulated in kuid_t
    and kgid_t we no longer need to pass struct cred which allowed
    us to test both the uid and the user namespace for equality.

    Passing struct cred potentially allows us to pass the entire group
    list as BSD does but I don't believe the cost of cache line misses
    justifies retaining code for a future potential application.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

22 Oct, 2012

1 commit


09 Jun, 2012

1 commit

  • /proc/net/unix has quadratic behavior, and can hold unix_table_lock for
    a while if high number of unix sockets are alive. (90 ms for 200k
    sockets...)

    We already have a hash table, so its quite easy to use it.

    Problem is unbound sockets are still hashed in a single hash slot
    (unix_socket_table[UNIX_HASH_TABLE])

    This patch also spreads unbound sockets to 256 hash slots, to speedup
    both /proc/net/unix and unix_diag.

    Time to read /proc/net/unix with 200k unix sockets :
    (time dd if=/proc/net/unix of=/dev/null bs=4k)

    before : 520 secs
    after : 2 secs

    Signed-off-by: Eric Dumazet
    Cc: Steven Whitehouse
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Apr, 2012

1 commit


21 Mar, 2012

1 commit


31 Dec, 2011

1 commit


17 Dec, 2011

1 commit


25 Apr, 2011

1 commit

  • These header files are never installed to user consumption, so any
    __KERNEL__ cpp checks are superfluous.

    Projects should also not copy these files into their userland utility
    sources and try to use them there. If they insist on doing so, the
    onus is on them to sanitize the headers as needed.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Nov, 2010

1 commit

  • Its easy to eat all kernel memory and trigger NMI watchdog, using an
    exploit program that queues unix sockets on top of others.

    lkml ref : http://lkml.org/lkml/2010/11/25/8

    This mechanism is used in applications, one choice we have is to have a
    recursion limit.

    Other limits might be needed as well (if we queue other types of files),
    since the passfd mechanism is currently limited by socket receive queue
    sizes only.

    Add a recursion_level to unix socket, allowing up to 4 levels.

    Each time we send an unix socket through sendfd mechanism, we copy its
    recursion level (plus one) to receiver. This recursion level is cleared
    when socket receive queue is emptied.

    Reported-by: Марк Коренберг
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jun, 2010

1 commit


02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
    socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
    a rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to eventually wakeup tasks.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
    instead of dynamically allocated ones. They dont need rcu freeing.

    Some cleanups or followups are probably needed, (possible
    sk_callback_lock conversion to a spinlock for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Nov, 2008

1 commit

  • This is an implementation of David Miller's suggested fix in:
    https://bugzilla.redhat.com/show_bug.cgi?id=470201

    It has been updated to use wait_event() instead of
    wait_event_interruptible().

    Paraphrasing the description from the above report, it makes sendmsg()
    block while UNIX garbage collection is in progress. This avoids a
    situation where child processes continue to queue new FDs over a
    AF_UNIX socket to a parent which is in the exit path and running
    garbage collection on these FDs. This contention can result in soft
    lockups and oom-killing of unrelated processes.

    Signed-off-by: dann frazier
    Signed-off-by: David S. Miller

    dann frazier
     

10 Nov, 2008

1 commit

  • Previously I assumed that the receive queues of candidates don't
    change during the GC. This is only half true, nothing can be received
    from the queues (see comment in unix_gc()), but buffers could be added
    through the other half of the socket pair, which may still have file
    descriptors referring to it.

    This can result in inc_inflight_move_tail() erronously increasing the
    "inflight" counter for a unix socket for which dec_inflight() wasn't
    previously called. This in turn can trigger the "BUG_ON(total_refs <
    inflight_refs)" in a later garbage collection run.

    Fix this by only manipulating the "inflight" counter for sockets which
    are candidates themselves. Duplicating the file references in
    unix_attach_fds() is also needed to prevent a socket becoming a
    candidate for GC while the skb that contains it is not yet queued.

    Reported-by: Andrea Bittau
    Signed-off-by: Miklos Szeredi
    CC: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

27 Jul, 2008

1 commit


29 Jan, 2008

2 commits


11 Nov, 2007

1 commit


31 Jul, 2007

1 commit


12 Jul, 2007

1 commit

  • Throw out the old mark & sweep garbage collector and put in a
    refcounting cycle detecting one.

    The old one had a race with recvmsg, that resulted in false positives
    and hence data loss. The old algorithm operated on all unix sockets
    in the system, so any additional locking would have meant performance
    problems for all users of these.

    The new algorithm instead only operates on "in flight" sockets, which
    are very rare, and the additional locking for these doesn't negatively
    impact the vast majority of users.

    In fact it's probable, that there weren't *any* heavy senders of
    sockets over sockets, otherwise the above race would have been
    discovered long ago.

    The patch works OK with the app that exposed the race with the old
    code. The garbage collection has also been verified to work in a few
    simple cases.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
     

04 Jun, 2007

1 commit


03 Aug, 2006

1 commit

  • From: Catherine Zhang

    This patch implements a cleaner fix for the memory leak problem of the
    original unix datagram getpeersec patch. Instead of creating a
    security context each time a unix datagram is sent, we only create the
    security context when the receiver requests it.

    This new design requires modification of the current
    unix_getsecpeer_dgram LSM hook and addition of two new hooks, namely,
    secid_to_secctx and release_secctx. The former retrieves the security
    context and the latter releases it. A hook is required for releasing
    the security context because it is up to the security module to decide
    how that's done. In the case of Selinux, it's a simple kfree
    operation.

    Acked-by: Stephen Smalley
    Signed-off-by: David S. Miller

    Catherine Zhang
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Also splits
    af_unix's sk_receive_queue.lock class from the other networking skb-queue
    locks. Has no effect on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

30 Jun, 2006

1 commit

  • This patch implements an API whereby an application can determine the
    label of its peer's Unix datagram sockets via the auxiliary data mechanism of
    recvmsg.

    Patch purpose:

    This patch enables a security-aware application to retrieve the
    security context of the peer of a Unix datagram socket. The application
    can then use this security context to determine the security context for
    processing on behalf of the peer who sent the packet.

    Patch design and implementation:

    The design and implementation is very similar to the UDP case for INET
    sockets. Basically we build upon the existing Unix domain socket API for
    retrieving user credentials. Linux offers the API for obtaining user
    credentials via ancillary messages (i.e., out of band/control messages
    that are bundled together with a normal message). To retrieve the security
    context, the application first indicates to the kernel such desire by
    setting the SO_PASSSEC option via getsockopt. Then the application
    retrieves the security context using the auxiliary data mechanism.

    An example server application for Unix datagram socket should look like this:

    toggle = 1;
    toggle_len = sizeof(toggle);

    setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, &toggle_len);
    recvmsg(sockfd, &msg_hdr, 0);
    if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
    cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
    if (cmsg_hdr->cmsg_len cmsg_level == SOL_SOCKET &&
    cmsg_hdr->cmsg_type == SCM_SECURITY) {
    memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
    }
    }

    sock_setsockopt is enhanced with a new socket option SOCK_PASSSEC to allow
    a server socket to receive security context of the peer.

    Testing:

    We have tested the patch by setting up Unix datagram client and server
    applications. We verified that the server can retrieve the security context
    using the auxiliary data mechanism of recvmsg.

    Signed-off-by: Catherine Zhang
    Acked-by: Acked-by: James Morris
    Signed-off-by: David S. Miller

    Catherine Zhang