03 Nov, 2017

1 commit

  • …el/git/gregkh/driver-core

    Pull initial SPDX identifiers from Greg KH:
    "License cleanup: add SPDX license identifiers to some files

    Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boiler plate
    text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart
    and Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset
    of the use cases:

    - file had no licensing information in it,

    - file was a */uapi/* one with no licensing information in it,

    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied
    to a file was done in a spreadsheet of side by side results from
    the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files created by Philippe Ombredanne.
    Philippe prepared the base worksheet, and did an initial spot review
    of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537
    files assessed. Kate Stewart did a file by file comparison of the
    scanner results in the spreadsheet to determine which SPDX license
    identifier(s) to be applied to the file. She confirmed any
    determination that was not immediately clear with lawyers working with
    the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:

    - Files considered eligible had to be source code files.

    - Make and config files were included as candidates if they contained
    >5 lines of source

    - File already had some variant of a license header in it (even if <5
    lines).

    All documentation files were explicitly excluded.

    The following heuristics were used to determine which SPDX license
    identifiers to apply.

    - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.

    For non */uapi/* files that summary was:

    SPDX license identifier                             # files
    ---------------------------------------------------|-------
    GPL-2.0                                               11139

    and resulted in the first patch in this series.

    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
    was:

    SPDX license identifier                             # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                         930

    and resulted in the second patch in this series.

    - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point). Results summary:

    SPDX license identifier                             # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                         270
    GPL-2.0+ WITH Linux-syscall-note                        169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
    LGPL-2.1+ WITH Linux-syscall-note                        15
    GPL-1.0+ WITH Linux-syscall-note                         14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
    LGPL-2.0+ WITH Linux-syscall-note                         4
    LGPL-2.1 WITH Linux-syscall-note                          3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1

    and that resulted in the third patch in this series.

    - when the two scanners agreed on the detected license(s), that
    became the concluded license(s).

    - when there was disagreement between the two scanners (one detected
    a license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.

    - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply
    (and which scanner probably needed to revisit its heuristics).

    - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.

    - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.

    In total, over 70 hours of logged manual review was done on the
    spreadsheet to determine the SPDX license identifiers to apply to the
    source files by Kate, Philippe, Thomas and, in some cases,
    confirmation by lawyers working with the Linux Foundation.

    Kate also obtained a third independent scan of the 4.13 code base from
    FOSSology, and compared selected files where the other two scanners
    disagreed against that SPDX file, to see if there were any new insights.
    The Windriver scanner is based on an older version of FOSSology in
    part, so they are related.

    Thomas did random spot checks in about 500 files from the spreadsheets
    for the uapi headers and agreed with SPDX license identifier in the
    files he inspected. For the non-uapi files Thomas did random spot
    checks in about 15000 files.

    In the initial set of patches against 4.14-rc6, 3 files were found to have
    copy/paste license identifier errors, and have been fixed to reflect
    the correct identifier.

    Additionally Philippe spent 10 hours this week doing a detailed manual
    inspection and review of the 12,461 patched files from the initial
    patch version early this week with:

    - a full scancode scan run, collecting the matched texts, detected
    license ids and scores

    - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct

    - reviewing anything where there was no detection but the patch
    license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
    applied SPDX license was correct

    This produced a worksheet with 20 files needing minor correction. This
    worksheet was then exported into 3 different .csv files for the
    different types of files to be modified.

    These .csv files were then reviewed by Greg. Thomas wrote a script to
    parse the csv files and add the proper SPDX tag to the file, in the
    format that the file expected. This script was further refined by Greg
    based on the output to detect more types of files automatically and to
    distinguish between header and source .c files (which need different
    comment types.) Finally Greg ran the script using the .csv files to
    generate the patches.

    Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
    Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

    * tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    License cleanup: add SPDX license identifier to uapi header files with a license
    License cleanup: add SPDX license identifier to uapi header files with no license
    License cleanup: add SPDX GPL-2.0 license identifier to files with no license

    Linus Torvalds
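
The tagging step described in the message above (parse a CSV, then add the SPDX tag in the comment style each file type expects) can be sketched as follows. This is a minimal illustration, not the script Thomas and Greg used: the two-column `path,license` CSV layout, the function names, and the handling of script files are assumptions. The `//` form for `.c` files and the `/* */` form for headers follow the kernel's convention for SPDX lines.

```python
import csv
import io
import os

# Comment style per file extension. The .c and .h forms follow the kernel
# convention; the other entries are assumptions made for this sketch.
COMMENT_STYLES = {
    ".c": "// SPDX-License-Identifier: {id}",
    ".h": "/* SPDX-License-Identifier: {id} */",
    ".S": "/* SPDX-License-Identifier: {id} */",
    ".sh": "# SPDX-License-Identifier: {id}",
    ".py": "# SPDX-License-Identifier: {id}",
}


def spdx_line(path, license_id):
    """Return the SPDX tag line in the comment style for this file type."""
    ext = os.path.splitext(path)[1]
    if ext not in COMMENT_STYLES:
        raise ValueError("no comment style known for %r" % path)
    return COMMENT_STYLES[ext].format(id=license_id)


def add_spdx(text, path, license_id):
    """Return the file text with an SPDX line added at the top.

    Files that already carry a tag are returned unchanged, and a '#!'
    line (if present) stays first.
    """
    if "SPDX-License-Identifier:" in text:
        return text
    tag = spdx_line(path, license_id) + "\n"
    if text.startswith("#!"):
        first, _, rest = text.partition("\n")
        return first + "\n" + tag + rest
    return tag + text


def plan_from_csv(csv_text):
    """Parse hypothetical 'path,license' rows into (path, tag line) pairs."""
    rows = csv.reader(io.StringIO(csv_text))
    return [(path, spdx_line(path, lic)) for path, lic in rows]
```

For example, `plan_from_csv("net/unix/af_unix.c,GPL-2.0\n")` pairs the path with a `//`-style tag, while a header path gets the `/* */` form.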
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).

    [...]

    Reviewed-by: Kate Stewart
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

26 Oct, 2017

1 commit

  • socket_diag shows information only about sockets from a namespace where
    a diag socket lives.

    But if we request information about one unix socket, the kernel doesn't
    check that its netns matches the diag socket's namespace, so any
    user can get information about any unix socket in a system. This looks
    like a bug.

    v2: add a Fixes tag

    Fixes: 51d7cccf0723 ("net: make sock diag per-namespace")
    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrei Vagin
     

22 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
    headers from UDP packets before queueing"), when udp packets are being
    peeked the requested extra offset is always 0 as there is no need to skip
    the udp header. However, when the offset is 0 and the next skb is
    of length 0, it is only returned once. The behaviour can be seen with
    the following python script:

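    from socket import *

    # Two non-blocking IPv6 UDP sockets: f receives, g sends.
    f = socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0)
    g = socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0)
    f.bind(('::', 0))
    addr = ('::1', f.getsockname()[1])
    # Queue a zero-length datagram followed by a one-byte datagram.
    g.sendto(b'', addr)
    g.sendto(b'b', addr)
    # Peek twice; both calls should return the zero-length datagram.
    print(f.recvfrom(10, MSG_PEEK))
    print(f.recvfrom(10, MSG_PEEK))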

    Where the expected output should be the empty string twice.

    Instead, make sk_peek_offset return negative values, and pass those values
    to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset
    to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
    __skb_try_recv_from_queue will then ensure the offset is reset back to 0
    if a peek is requested without an offset, unless no packets are found.

    Also simplify the if condition in __skb_try_recv_from_queue. If _off is
    greater than 0, and off is greater than or equal to skb->len, then
    (_off || skb->len) must always be true assuming skb->len >= 0 is always
    true.

    Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
    as it double checked if MSG_PEEK was set in the flags.

    V2:
    - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
    redundant checks
    - Fix peeking in udp{,v6}_recvmsg to report the right value when the
    offset is 0

    V3:
    - Marked new branch in __skb_try_recv_from_queue as unlikely.

    Signed-off-by: Matthew Dawson
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Matthew Dawson
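
The skip decision discussed in the message above can be modelled in a few lines. This is a toy model, not kernel code: the queue is a list of dicts, `peek()` and its `fixed` flag are invented for illustration, and the pre-fix condition is reconstructed from the commit text, so treat it as an assumption.

```python
def peek(queue, off, fixed):
    """Return the index of the skb a MSG_PEEK call would see, or None.

    queue: list of {"len": int, "peeked": bool} entries (toy skbs).
    off:   peek offset; in the fixed model a negative value means
           "no offset requested".
    fixed: False models the behaviour before the patch, True after it.
    """
    _off = max(off, 0)
    for index, skb in enumerate(queue):
        if fixed:
            # After the patch: a negative offset never skips an skb.
            skip = (off >= 0 and _off >= skb["len"]
                    and (_off > 0 or skb["peeked"]))
        else:
            # Before the patch (assumed): offset 0 plus an already-peeked
            # zero-length skb satisfies the skip condition.
            skip = (_off >= skb["len"]
                    and (skb["len"] > 0 or _off > 0 or skb["peeked"]))
        if skip:
            _off -= skb["len"]
            continue
        skb["peeked"] = True
        return index
    return None


def fresh_queue():
    """A zero-length datagram followed by a one-byte datagram."""
    return [{"len": 0, "peeked": False}, {"len": 1, "peeked": False}]
```

With the old condition two successive peeks at offset 0 return entries 0 and then 1, which is the behaviour the reproducer exposes; with the negative "no offset" value both peeks return entry 0.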
     

17 Jul, 2017

1 commit

  • All unix sockets now account inflight FDs to the respective sender.
    This was introduced in:

    commit 712f4aad406bb1ed67f3f98d04c044191f0ff593
    Author: willy tarreau
    Date: Sun Jan 10 07:54:56 2016 +0100

    unix: properly account for FDs passed over unix sockets

    and further refined in:

    commit 415e3d3e90ce9e18727e8843ae343eda5a58fad6
    Author: Hannes Frederic Sowa
    Date: Wed Feb 3 02:11:03 2016 +0100

    unix: correctly track in-flight fds in sending process user_struct

    Hence, regardless of the stacking depth of FDs, the total number of
    inflight FDs is limited, and accounted. There is no known way for a
    local user to exceed those limits or exploit the accounting.

    Furthermore, the GC logic is independent of the recursion/stacking depth
    as well. It solely depends on the total number of inflight FDs,
    regardless of their layout.

    Lastly, the current `recursion_level' suffers a TOCTOU race, since it
    checks and inherits depths only at queue time. If we consider `A
    Cc: Simon McVittie
    Signed-off-by: David Herrmann
    Reviewed-by: Tom Gundersen
    Signed-off-by: David S. Miller

    David Herrmann
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP when using the sch_fq packet
    scheduler for this is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brakmo. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

01 Jul, 2017

3 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to absence of a _hint()
    version of refcount API. If the hint() version must
    be used, we might need to revisit API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
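
The motivation shared by these three conversions can be illustrated with a toy counter. The sketch below is not the kernel's refcount_t: the class, the 8-bit ceiling, and the method names are assumptions chosen to keep the example small. It contrasts a wrapping increment, which can reach zero and make a live object look free, with a counter that saturates and refuses to revive a dead object.

```python
class Refcount:
    """Toy saturating reference counter (illustrative only)."""

    SATURATED = 0xFF  # hypothetical 8-bit ceiling, to keep the demo short

    def __init__(self, value=1):
        self.value = value

    def inc_not_zero(self):
        """Take a reference unless the count already dropped to zero."""
        if self.value == 0:
            return False
        if self.value < self.SATURATED:
            self.value += 1
        return True

    def dec_and_test(self):
        """Drop a reference; True means the caller dropped the last one."""
        if self.value == 0:
            raise ValueError("refcount underflow")
        if self.value == self.SATURATED:
            # A saturated counter stays pinned: leak rather than free.
            return False
        self.value -= 1
        return self.value == 0


def wrapping_inc(value):
    """What a plain 8-bit counter does on overflow: wrap around to 0."""
    return (value + 1) & 0xFF
```

A wrapped counter reads as zero while references still exist, which is the path to the use-after-free the message mentions; the saturating variant trades that for a bounded leak.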
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

09 Jun, 2017

1 commit


07 Apr, 2017

1 commit

  • Prepare to mark sensitive kernel structures for randomization by making
    sure they're using designated initializers. These were identified during
    allyesconfig builds of x86, arm, and arm64, and the initializer fixes
    were extracted from grsecurity. In this case, NULL initialize with { }
    instead of undesignated NULLs.

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     

22 Mar, 2017

1 commit

  • Dmitry has reported that a BUG_ON() condition in unix_notinflight()
    may be triggered by a simple code that forwards unix socket in an
    SCM_RIGHTS message.
    That is caused by incorrect unix socket GC implementation in unix_gc().

    The GC first collects list of candidates, then (a) decrements their
    "children's" inflight counter, (b) checks which inflight counters are
    now 0, and then (c) increments all inflight counters back.
    (a) and (c) are done by calling scan_children() with inc_inflight or
    dec_inflight as the second argument.

    Commit 6209344f5a37 ("net: unix: fix inflight counting bug in garbage
    collector") changed scan_children() such that it no longer considers
    sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
    of code that unsets this flag _before_ invoking
    scan_children(, dec_inflight, ). This may lead to incorrect inflight
    counters for some sockets.

    This change fixes this bug by changing order of operations:
    UNIX_GC_CANDIDATE is now unset only after all inflight counters are
    restored to the original state.

    kernel BUG at net/unix/garbage.c:149!
    RIP: 0010:[] []
    unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
    Call Trace:
    [] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
    [] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
    [] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
    [] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
    [] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
    [] kfree_skb+0x184/0x570 net/core/skbuff.c:705
    [] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
    [] unix_release+0x49/0x90 net/unix/af_unix.c:836
    [] sock_release+0x92/0x1f0 net/socket.c:570
    [] sock_close+0x1b/0x20 net/socket.c:1017
    [] __fput+0x34e/0x910 fs/file_table.c:208
    [] ____fput+0x1a/0x20 fs/file_table.c:244
    [] task_work_run+0x1a0/0x280 kernel/task_work.c:116
    [< inline >] exit_task_work include/linux/task_work.h:21
    [] do_exit+0x183a/0x2640 kernel/exit.c:828
    [] do_group_exit+0x14e/0x420 kernel/exit.c:931
    [] get_signal+0x663/0x1880 kernel/signal.c:2307
    [] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
    [] exit_to_usermode_loop+0x1ea/0x2d0
    arch/x86/entry/common.c:156
    [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
    [] syscall_return_slowpath+0x4d3/0x570
    arch/x86/entry/common.c:259
    [] entry_SYSCALL_64_fastpath+0xc4/0xc6

    Link: https://lkml.org/lkml/2017/3/6/252
    Signed-off-by: Andrey Ulanov
    Reported-by: Dmitry Vyukov
    Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
    Signed-off-by: David S. Miller

    Andrey Ulanov
     

10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_tcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
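
The reasoning above is a cycle check over a "must be taken before" graph whose nodes are lock classes. The sketch below is an illustration of that idea, not of lockdep itself: `find_cycle()` is a plain depth-first search, and the `k-` prefix used for kernel-created sockets is a naming assumption for the example.

```python
def find_cycle(edges):
    """Return one cycle in the 'taken before' graph as a list, or None."""
    graph = {}
    for before, after in edges:
        graph.setdefault(before, []).append(after)
        graph.setdefault(after, [])

    state = {}  # node -> "visiting" while on the DFS path, "done" afterwards
    path = []

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                # Back edge: the slice of the path from nxt closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for node in graph:
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# One class per address family: the three orderings form a loop.
single_class = [
    ("mmap_sem", "sk_lock-AF_RXRPC"),
    ("sk_lock-AF_RXRPC", "sk_lock-AF_INET"),
    ("sk_lock-AF_INET", "mmap_sem"),
]

# Separate classes for kernel-created sockets: the loop disappears.
split_class = [
    ("mmap_sem", "k-sk_lock-AF_RXRPC"),
    ("k-sk_lock-AF_RXRPC", "k-sk_lock-AF_INET"),
    ("sk_lock-AF_INET", "mmap_sem"),
]
```

`find_cycle(single_class)` reports the loop through all three classes, while `find_cycle(split_class)` returns `None`, since the userspace AF_INET class no longer appears downstream of mmap_sem.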
     

02 Mar, 2017

1 commit


03 Feb, 2017

1 commit

  • This ioctl opens a file to which a socket is bound and
    returns a file descriptor. The caller has to have CAP_NET_ADMIN
    in the socket network namespace.

    Currently it is impossible to get a path and a mount point
    for a socket file. socket_diag reports address, device ID and inode
    number for unix sockets. An address can contain a relative path or
    a file may be moved somewhere. And these properties say nothing about
    a mount namespace and a mount point of a socket file.

    With the introduced ioctl, we can get a path by reading
    /proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.

    In CRIU we are going to use this ioctl to dump and restore unix socket.

    Here is an example how it can be used:

    $ strace -e socket,bind,ioctl ./test /tmp/test_sock
    socket(AF_UNIX, SOCK_STREAM, 0) = 3
    bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
    ioctl(3, SIOCUNIXFILE, 0) = 4
    ^Z

    $ ss -a | grep test_sock
    u_str LISTEN 0 1 test_sock 17798 * 0

    $ ls -l /proc/760/fd/{3,4}
    lrwx------ 1 root root 64 Feb 1 09:41 3 -> 'socket:[17798]'
    l--------- 1 root root 64 Feb 1 09:41 4 -> /tmp/test_sock

    $ cat /proc/760/fdinfo/4
    pos: 0
    flags: 012000000
    mnt_id: 40

    $ cat /proc/self/mountinfo | grep "^40\s"
    40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw

    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
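
From userspace the same sequence can be sketched in Python. The request number is taken from the uapi headers (`SIOCPROTOPRIVATE` is 0x89E0 and `SIOCUNIXFILE` is defined as `SIOCPROTOPRIVATE + 0`); the helper name and the choice to return `None` on any failure are assumptions for this example. The call needs CAP_NET_ADMIN and a kernel that carries this patch, so on other systems it simply returns `None`.

```python
import fcntl
import os
import socket

SIOCPROTOPRIVATE = 0x89E0            # <linux/sockios.h>
SIOCUNIXFILE = SIOCPROTOPRIVATE + 0  # <linux/un.h>


def socket_file_path(sock):
    """Return the filesystem path behind a bound unix socket, or None.

    None covers every failure: missing privilege, an unbound or abstract
    socket, or a kernel without SIOCUNIXFILE.
    """
    try:
        # With an integer argument, fcntl.ioctl() returns the ioctl's
        # return value, which here is the new file descriptor.
        fd = fcntl.ioctl(sock.fileno(), SIOCUNIXFILE, 0)
    except OSError:
        return None
    try:
        return os.readlink("/proc/self/fd/%d" % fd)
    finally:
        os.close(fd)
```

The mount ID mentioned in the message could be read the same way from `/proc/self/fdinfo/<fd>` before the descriptor is closed.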
     

25 Jan, 2017

1 commit

  • Dmitry reported a deadlock scenario:

    unix_bind() path:
    u->bindlock ==> sb_writer

    do_splice() path:
    sb_writer ==> pipe->mutex ==> u->bindlock

    In the unix_bind() code path, unix_mknod() does not have to
    be done with u->bindlock held, since it is a pure fs operation,
    so we can just move unix_mknod() out.

    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Cc: Rainer Weikusat
    Cc: Al Viro
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

25 Dec, 2016

1 commit


17 Dec, 2016

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This update contains:

    - try to clone on copy-up

    - allow renaming a directory

    - split source into manageable chunks

    - misc cleanups and fixes

    It does not contain the read-only fd data inconsistency fix, which Al
    didn't like. I'll leave that to the next year..."

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
    ovl: fix reStructuredText syntax errors in documentation
    ovl: fix return value of ovl_fill_super
    ovl: clean up kstat usage
    ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
    ovl: create directories inside merged parent opaque
    ovl: opaque cleanup
    ovl: show redirect_dir mount option
    ovl: allow setting max size of redirect
    ovl: allow redirect_dir to default to "on"
    ovl: check for emptiness of redirect dir
    ovl: redirect on rename-dir
    ovl: lookup redirects
    ovl: consolidate lookup for underlying layers
    ovl: fix nested overlayfs mount
    ovl: check namelen
    ovl: split super.c
    ovl: use d_is_dir()
    ovl: simplify lookup
    ovl: check lower existence of rename target
    ovl: rename: simplify handling of lower/merged directory
    ...

    Linus Torvalds
     

16 Dec, 2016

1 commit

  • This reverts commit eb0a4a47ae89aaa0674ab3180de6a162f3be2ddf.

    Since commit 51f7e52dc943 ("ovl: share inode for hard link") there's no
    need to call d_real_inode() to check two overlay inodes for equality.

    Side effect of this revert is that it's no longer possible to connect one
    socket on overlayfs to one on the underlying layer (something which didn't
    make sense anyway).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

23 Nov, 2016

1 commit

  • All conflicts were simple overlapping changes except perhaps
    for the Thunder driver.

    That driver has a change_mtu method explicitly for sending
    a message to the hardware. If that fails it returns an
    error.

    Normally a driver doesn't need an ndo_change_mtu method because those
    are usually just range changes, which are now handled generically.
    But since this extra operation is needed in the Thunder driver, it has
    to stay.

    However, if the message send fails we have to restore the original
    MTU before the change because the entire call chain expects that if
    an error is thrown by ndo_change_mtu then the MTU did not change.
    Therefore code is added to nicvf_change_mtu to remember the original
    MTU, and to restore it upon nicvf_update_hw_max_frs() failure.

    Signed-off-by: David S. Miller

    David S. Miller
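
The contract described above (an error from the MTU change must leave the MTU unchanged) is a remember-and-restore pattern. The sketch below shows the pattern only; `Device`, `change_mtu()` and the `update_hw` callback are hypothetical stand-ins, not the Thunder driver's code.

```python
class Device:
    """Hypothetical stand-in for a network device."""

    def __init__(self, mtu):
        self.mtu = mtu


def change_mtu(dev, new_mtu, update_hw):
    """Set dev.mtu, restoring the old value if the hardware update fails.

    update_hw is a hypothetical callback that returns 0 on success and a
    negative errno-style value on failure.
    """
    orig_mtu = dev.mtu
    dev.mtu = new_mtu
    err = update_hw(dev)
    if err:
        # Callers assume a failed change left the MTU untouched.
        dev.mtu = orig_mtu
        return err
    return 0
```

A failing callback leaves `dev.mtu` at its original value and propagates the error code; a successful one keeps the new value.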
     

19 Nov, 2016

1 commit

  • Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
    converts schedule_timeout() to its freezable version, it was probably
    correct at that time, but later, commit 2b514574f7e8
    ("net: af_unix: implement splice for stream af_unix sockets") breaks
    the strong requirement for a freezable sleep, according to
    commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g. by dpm). Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

    The pipe_lock is still held at that point.

    So use freezable version only for the recvmsg call path, avoid impact for
    Android.

    Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets")
    Reported-by: Dmitry Vyukov
    Cc: Tejun Heo
    Cc: Colin Cross
    Cc: Rafael J. Wysocki
    Cc: Hannes Frederic Sowa
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

15 Nov, 2016

1 commit


08 Nov, 2016

1 commit

  • A new argument is added to __skb_recv_datagram to provide
    an explicit skb destructor, invoked under the receive queue
    lock.
    The UDP protocol uses this argument to perform memory
    reclaiming on dequeue, so that it no longer sets
    skb->destructor.
    Instead explicit memory reclaiming is performed at close() time and
    when skbs are removed from the receive queue.
    In-kernel UDP protocol users now need to call a
    skb_recv_udp() variant instead of skb_recv_datagram() to
    properly perform memory accounting on dequeue.

    Overall, this allows acquiring the receive queue lock only once
    on dequeue.
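
    A rough user-space model of the idea, assuming a toy queue whose "lock"
    merely counts acquisitions. All names here (struct pkt, struct rx_queue,
    dequeue(), reclaim(), dequeue_then_reclaim()) are hypothetical, and the
    two-step variant is an assumed contrast case, not the previous kernel
    code.

```c
#include <stddef.h>

struct pkt {
    struct pkt *next;
    size_t truesize;
};

struct rx_queue {
    struct pkt *head;
    size_t rmem;                    /* bytes accounted to the queue */
    unsigned int lock_acquisitions; /* how often the "lock" was taken */
};

static void queue_lock(struct rx_queue *q)   { q->lock_acquisitions++; }
static void queue_unlock(struct rx_queue *q) { (void)q; }

/* Memory reclaim step; must run with the queue lock held. */
static void reclaim(struct rx_queue *q, struct pkt *p)
{
    q->rmem -= p->truesize;
}

/* Dequeue one packet; the optional destructor runs under the same lock. */
static struct pkt *dequeue(struct rx_queue *q,
                           void (*destructor)(struct rx_queue *, struct pkt *))
{
    struct pkt *p;

    queue_lock(q);
    p = q->head;
    if (p) {
        q->head = p->next;
        p->next = NULL;
        if (destructor)
            destructor(q, p);   /* lock is still held here */
    }
    queue_unlock(q);
    return p;
}

/* Contrast case: accounting as a separate step needs a second lock trip. */
static struct pkt *dequeue_then_reclaim(struct rx_queue *q)
{
    struct pkt *p = dequeue(q, NULL);

    if (p) {
        queue_lock(q);
        reclaim(q, p);
        queue_unlock(q);
    }
    return p;
}
```

    Calling dequeue(&q, reclaim) takes the lock once per packet, while
    dequeue_then_reclaim(&q) takes it twice for the same accounting result.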

    Tested using pktgen with random src port, 64 bytes packet,
    wire-speed on a 10G link as sender and udp_sink as the receiver,
    using an l4 tuple rxhash to stress the contention, and one or more
    udp_sink instances with reuseport.

    nr sinks    vanilla    patched
    1           440        560
    3           2150       2300
    6           3650       3800
    9           4450       4600
    12          6250       6450

    v1 -> v2:
    - do rmem and allocated memory scheduling under the receive lock
    - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
    - avoid the typedef for the dequeue callback

    Suggested-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     

02 Nov, 2016

1 commit

  • Abstract unix domain socket names may embed null characters;
    these should be translated to '@' when printed out to
    proc, the same way the null prefix is currently being
    translated.

    This helps for tools such as netstat, lsof and the proc
    based implementation in ss to show all the significant
    bytes of the name (instead of getting cut at the first
    null occurrence).
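
    The translation can be illustrated with a small user-space helper. The
    function name format_abstract_name() and its signature are assumptions
    made for this sketch; the kernel change itself lives in the proc
    printing code.

```c
#include <stddef.h>

/*
 * Copy an abstract socket name of `len` bytes into `out`, writing '@' for
 * every NUL byte, including the leading one. `out` must have room for
 * len + 1 bytes; the result is NUL terminated.
 */
static void format_abstract_name(const char *name, size_t len, char *out)
{
    size_t i;

    for (i = 0; i < len; i++)
        out[i] = name[i] ? name[i] : '@';
    out[len] = '\0';
}
```

    A five-byte name consisting of NUL, 'a', 'b', NUL, 'c' is rendered as
    "@ab@c" rather than being cut short at the embedded null.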

    Signed-off-by: Isaac Boukris
    Signed-off-by: David S. Miller

    Isaac Boukris
     

04 Oct, 2016

1 commit

  • since pipe_lock is the outermost now, we don't need to drop/regain
    socket locks around the call of splice_to_pipe() from skb_splice_bits(),
    which kills the need to have a socket-specific callback; we can just
    call splice_to_pipe() and be done with that.

    Signed-off-by: Al Viro

    Al Viro
     

05 Sep, 2016

2 commits

  • Right now we use the 'readlock' both for protecting some of the af_unix
    IO path and for making the bind be single-threaded.

    The two are independent, but using the same lock makes for a nasty
    deadlock due to ordering with regards to filesystem locking. The bind
    locking would want to nest outside the VFS pathname locking, but the IO
    locking wants to nest inside some of those same locks.

    We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
    splice-bind deadlock") which moved the readlock inside the vfs locks,
    but that caused problems with overlayfs that will then call back into
    filesystem routines that take the lock in the wrong order anyway.

    Splitting the locks means that we can go back to having the bind lock be
    the outermost lock, and we don't have any deadlocks with lock ordering.

    Acked-by: Rainer Weikusat
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Linus Torvalds
     
  • This reverts commit c845acb324aa85a39650a14e7696982ceea75dc1.

    It turns out that it just replaces one deadlock with another one: we can
    still get the wrong lock ordering with the readlock due to overlayfs
    calling back into the filesystem layer and still taking the vfs locks
    after the readlock.

    The proper solution ends up being to just split the readlock into two
    pieces: the bind lock (taken *outside* the vfs locks) and the IO lock
    (taken *inside* the filesystem locks). The two locks are independent
    anyway.

    Signed-off-by: Linus Torvalds
    Reviewed-by: Shmulik Ladkani
    Signed-off-by: David S. Miller

    Linus Torvalds
     

27 Jul, 2016

1 commit


12 Jun, 2016

1 commit


21 May, 2016

1 commit

  • Overlayfs uses separate inodes even in the case of hard links on the
    underlying filesystems. This is a problem for the AF_UNIX socket
    implementation, which indexes sockets based on the inode. As a result,
    hard linked sockets did not work.

    The fix is to use the real, underlying inode.

    Test case follows:

    -- ovl-sock-test.c --
    #include <err.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    #define SOCK "test-sock"
    #define SOCK2 "test-sock2"

    int main(void)
    {
        int fd, fd2;
        struct sockaddr_un addr = {
            .sun_family = AF_UNIX,
            .sun_path = SOCK,
        };
        struct sockaddr_un addr2 = {
            .sun_family = AF_UNIX,
            .sun_path = SOCK2,
        };

        unlink(SOCK);
        unlink(SOCK2);
        if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
            err(1, "socket");
        if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
            err(1, "bind");
        if (listen(fd, 0) == -1)
            err(1, "listen");
        if (link(SOCK, SOCK2) == -1)
            err(1, "link");
        if ((fd2 = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
            err(1, "socket");
        if (connect(fd2, (struct sockaddr *) &addr2, sizeof(addr2)) == -1)
            err(1, "connect");
        return 0;
    }
    ----

    Reported-by: Alexander Morozov
    Signed-off-by: Miklos Szeredi
    Cc:

    Miklos Szeredi
     

28 Mar, 2016

1 commit


23 Feb, 2016

1 commit


20 Feb, 2016

2 commits

  • The unix_stream_read_generic function tries to use a continue statement
    to restart the receive loop after waiting for a message. This may not
    work as intended as the caller might use a recvmsg call to peek at
    control messages without specifying a message buffer. In that case, if
    the function had to wait for a message, the continue causes it to
    return without an error and without the credential information, whereas
    it would have returned the credentials had a message already been
    queued. Change to using goto to restart the loop without checking the
    loop condition first, so that credentials are returned either way.
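
    The difference between the two restart styles can be reproduced with a
    small model. In a do/while loop, continue jumps to the loop condition,
    so a zero-length request leaves the loop before credentials are picked
    up. The function model_recv() and struct recv_result are hypothetical
    and only mimic the control flow, not the real receive path.

```c
#include <stdbool.h>
#include <stddef.h>

struct recv_result {
    bool got_creds;
    size_t copied;
};

static struct recv_result model_recv(size_t size, bool queue_empty,
                                     bool restart_with_goto)
{
    struct recv_result res = { false, 0 };

    do {
redo:
        if (queue_empty) {
            queue_empty = false;    /* pretend we slept until data arrived */
            if (restart_with_goto)
                goto redo;          /* restart the body unconditionally */
            continue;               /* jumps to the while (size) test */
        }

        res.got_creds = true;       /* credentials accompany the message */
        res.copied += size;         /* copy whatever the caller asked for */
        size = 0;
    } while (size);

    return res;
}
```

    With size == 0 and an initially empty queue, the continue variant
    returns without credentials, while the goto variant returns them, just
    as it would if the message had already been queued.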

    Signed-off-by: Rainer Weikusat
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Rainer Weikusat
     
  • The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
    __u32, but unix_lookup_by_ino's argument ino has type int, which is not
    a problem yet.
    However, when ino is compared with the sock_i_ino return value of type
    unsigned long, ino is sign extended to signed long, and this results
    in an incorrect comparison on 64-bit architectures for inode numbers
    greater than INT_MAX.
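
    The effect is easy to reproduce outside the kernel. The two helpers
    below are hypothetical and only contrast a signed int parameter with an
    unsigned int one; the sketch assumes a platform where unsigned long is
    64 bits wide and where converting an out-of-range value to int wraps,
    as it does with GCC and Clang.

```c
#include <stdbool.h>

/* Signed parameter: values above INT_MAX arrive as negative numbers. */
static bool ino_matches_int(int ino, unsigned long sock_ino)
{
    /* Converting a negative int to unsigned long sign extends it. */
    return (unsigned long)ino == sock_ino;
}

/* Unsigned parameter: the value is zero extended and compares equal. */
static bool ino_matches_uint(unsigned int ino, unsigned long sock_ino)
{
    return ino == sock_ino;
}
```

    For an inode number such as 0x80000001, ino_matches_uint() reports a
    match, whereas ino_matches_int() compares 0xffffffff80000001 against
    0x80000001 on a 64-bit system and fails.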

    This bug was found by the strace test suite.

    Fixes: 5d3cae8bc39d ("unix_diag: Dumping exact socket core")
    Signed-off-by: Dmitry V. Levin
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Dmitry V. Levin
     

17 Feb, 2016

2 commits

  • The unix_dgram_sendmsg routine uses the following test

    if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {

    to determine if sk and other are in an n:1 association (either
    established via connect or by using sendto to send messages to an
    unrelated socket identified by address). This isn't correct as the
    specified address could have been bound to the sending socket itself or
    because this socket could have been connected to itself by the time of
    the unix_peer_get but disconnected before the unix_state_lock(other). In
    both cases, the if-block would be entered despite other == sk which
    might either block the sender unintentionally or lead to trying to unlock
    the same spin lock twice for a non-blocking send. Add an other != sk
    check to guard against this.
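
    A simplified model of the condition, with and without the extra guard.
    struct fake_sock and both helper functions are hypothetical; they only
    mirror the shape of the test quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_sock {
    struct fake_sock *peer;     /* connected peer, or NULL */
    bool recvq_full;            /* receive queue has no room left */
};

/* Shape of the original test: misfires when other == sk. */
static bool must_wait_old(const struct fake_sock *sk,
                          const struct fake_sock *other)
{
    return other->peer != sk && other->recvq_full;
}

/* Same test with the additional other != sk guard. */
static bool must_wait_guarded(const struct fake_sock *sk,
                              const struct fake_sock *other)
{
    return other != sk && other->peer != sk && other->recvq_full;
}
```

    For a socket that sends to its own address while not connected to
    itself, must_wait_old() enters the n:1 path, while must_wait_guarded()
    does not.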

    Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue")
    Reported-By: Philipp Hahn
    Signed-off-by: Rainer Weikusat
    Tested-by: Philipp Hahn
    Signed-off-by: David S. Miller

    Rainer Weikusat
     
  • The present unix_stream_read_generic contains various code sequences of
    the form

    err = -EDISASTER;
    if (<condition>)
    goto out;

    This has the unfortunate side effect of possibly causing the error code
    to bleed through to the final

    out:
    return copied ? : err;

    and then to be wrongly returned if no data was copied because the caller
    didn't supply a data buffer, as demonstrated by the program available at

    http://pad.lv/1540731

    Change it such that err is only set if an error condition was detected.
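
    The bleed-through can be shown with two small functions that share the
    same exit path. Both are hypothetical models, using -EINVAL as an
    arbitrary error code; `failure` stands for the checked error condition
    and `size` for the length of the caller's buffer.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Error code is assigned before the check, even if the check passes. */
static int recv_preassigned_err(size_t size, bool failure)
{
    int copied = 0;
    int err;

    err = -EINVAL;
    if (failure)
        goto out;

    copied = (int)size;     /* stays 0 when no data buffer was supplied */
out:
    return copied ? copied : err;
}

/* Error code is assigned only when the condition was actually detected. */
static int recv_err_on_failure(size_t size, bool failure)
{
    int copied = 0;
    int err = 0;

    if (failure) {
        err = -EINVAL;
        goto out;
    }

    copied = (int)size;
out:
    return copied ? copied : err;
}
```

    With size == 0 and no failure, the first variant wrongly returns
    -EINVAL, while the second returns 0. Both behave identically when data
    was copied or when the failure really occurred.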

    Fixes: 3822b5c2fc62 ("af_unix: Revert 'lock_interruptible' in stream receive code")
    Reported-by: Joseph Salisbury
    Signed-off-by: Rainer Weikusat
    Signed-off-by: David S. Miller

    Rainer Weikusat
     

08 Feb, 2016

2 commits

  • The commit referenced in the Fixes tag incorrectly accounted the number
    of in-flight fds over a unix domain socket to the original opener
    of the file descriptor. This allows another process to arbitrarily
    deplete the original file opener's resource limit for the maximum
    number of open files. Instead the sending process and its struct cred
    should be credited.

    To do so, we add a reference counted struct user_struct pointer to the
    scm_fp_list and use it to account for the number of inflight unix fds.
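
    The accounting scheme can be modelled as follows. struct fake_user and
    struct fake_fp_list are hypothetical simplifications of user_struct and
    scm_fp_list, and the helper names are invented for this sketch.

```c
#include <stddef.h>

struct fake_user {
    unsigned long unix_inflight;    /* fds this user has in flight */
    int refcount;
};

struct fake_fp_list {
    struct fake_user *user;         /* the sender, pinned while in flight */
    unsigned int count;             /* number of fds in this list */
};

/* Record the sender and take a reference on it. */
static void fp_list_init(struct fake_fp_list *fpl, struct fake_user *sender,
                         unsigned int count)
{
    sender->refcount++;
    fpl->user = sender;
    fpl->count = count;
}

/* Charge and uncharge the in-flight fds to the recorded sender. */
static void fds_inflight(struct fake_fp_list *fpl)
{
    fpl->user->unix_inflight += fpl->count;
}

static void fds_notinflight(struct fake_fp_list *fpl)
{
    fpl->user->unix_inflight -= fpl->count;
}

/* Drop the reference taken in fp_list_init(). */
static void fp_list_release(struct fake_fp_list *fpl)
{
    fpl->user->refcount--;
    fpl->user = NULL;
}
```

    Because the list carries its own user pointer, the charge lands on the
    sender regardless of who originally opened the files being passed.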

    Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets")
    Reported-by: David Herrmann
    Cc: David Herrmann
    Cc: Willy Tarreau
    Cc: Linus Torvalds
    Suggested-by: Linus Torvalds
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • Remove a write-only stack variable from unix_attach_fds(). This is a
    left-over from the security fix in:

    commit 712f4aad406bb1ed67f3f98d04c044191f0ff593
    Author: willy tarreau
    Date: Sun Jan 10 07:54:56 2016 +0100

    unix: properly account for FDs passed over unix sockets

    Signed-off-by: David Herrmann
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    David Herrmann