19 Mar, 2019

1 commit

  • [ Upstream commit ae3b564179bfd06f32d051b9e5d72ce4b2a07c37 ]

    Several u->addr and u->path users are not holding any locks in
    common with unix_bind(). unix_state_lock() is useless for those
    purposes.

    u->addr is assign-once and *(u->addr) is fully set up by the time
    we set u->addr (all under unix_table_lock). u->path is also
    set in the same critical area, also before setting u->addr, and
    any unix_sock with ->path filled will have non-NULL ->addr.

    So setting ->addr with smp_store_release() is all we need for those
    "lockless" users - just have them fetch ->addr with smp_load_acquire()
    and don't even bother looking at ->path if they see NULL ->addr.

    Users of ->addr and ->path fall into several classes now:
    1) ones that do smp_load_acquire(u->addr) and access *(u->addr)
    and u->path only if smp_load_acquire() has returned non-NULL.
    2) places holding unix_table_lock. These are guaranteed that
    *(u->addr) is seen fully initialized. If unix_sock is in one of the
    "bound" chains, so's ->path.
    3) unix_sock_destructor() using ->addr is safe. All places
    that set u->addr are guaranteed to have seen all stores *(u->addr)
    while holding a reference to u and unix_sock_destructor() is called
    when (atomic) refcount hits zero.
    4) unix_release_sock() using ->path is safe. unix_bind()
    is serialized wrt unix_release() (normally - by struct file
    refcount), and for the instances that had ->path set by unix_bind()
    unix_release_sock() comes from unix_release(), so they are fine.
    Instances that had it set in unix_stream_connect() either end up
    attached to a socket (in unix_accept()), in which case the call
    chain to unix_release_sock() and serialization are the same as in
    the previous case, or they never get accept'ed and unix_release_sock()
    is called when the listener is shut down and its queue gets purged.
    In that case the listener's queue lock provides the barriers needed -
    unix_stream_connect() shoves our unix_sock into listener's queue
    under that lock right after having set ->path and eventual
    unix_release_sock() caller picks them from that queue under the
    same lock right before calling unix_release_sock().
    5) unix_find_other() use of ->path is pointless, but safe -
    it happens with successful lookup by (abstract) name, so ->path.dentry
    is guaranteed to be NULL there.

    earlier-variant-reviewed-by: "Paul E. McKenney"
    Signed-off-by: Al Viro
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

04 Nov, 2018

1 commit

  • [ Upstream commit 89ab066d4229acd32e323f1569833302544a4186 ]

    This reverts commit dd979b4df817e9976f18fb6f9d134d6bc4a3c317.

    This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
    internal TCP socket for the initial handshake with the remote peer.
    Whenever the SMC connection can not be established this TCP socket is
    used as a fallback. All socket operations on the SMC socket are then
    forwarded to the TCP socket. In case of poll, the file->private_data
    pointer references the SMC socket because the TCP socket has no file
    assigned. This causes tcp_poll to wait on the wrong socket.

    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Karsten Graul
     

04 Aug, 2018

1 commit

  • Applications use -ECONNREFUSED as returned from write() in order to
    determine that a socket should be closed. However, when using connected
    dgram unix sockets in a poll/write loop, a final POLLOUT event can be
    missed when the remote end closes. Thus, the poll is stuck forever:

    thread 1 (client) thread 2 (server)

    connect() to server
    write() returns -EAGAIN
    unix_dgram_poll()
    -> unix_recvq_full() is true
    close()
    ->unix_release_sock()
    ->wake_up_interruptible_all()
    unix_dgram_poll() (due to the
    wake_up_interruptible_all)
    -> unix_recvq_full() still is true
    ->free all skbs

    Now thread 1 is stuck and will not receive anymore wakeups. In this
    case, when thread 1 gets the -EAGAIN, it has not queued any skbs
    otherwise the 'free all skbs' step would in fact cause a wakeup and
    a POLLOUT return. So the race here is probably fairly rare because
    it means there are no skbs that thread 1 queued and that thread 1
    schedules before the 'free all skbs' step.

    This issue was reported as a hang when /dev/log is closed.

    The fix is to signal POLLOUT if the socket is marked as SOCK_DEAD, which
    means a subsequent write() will get -ECONNREFUSED.

    Reported-by: Ian Lance Taylor
    Cc: David Rientjes
    Cc: Rainer Weikusat
    Cc: Eric Dumazet
    Signed-off-by: Jason Baron
    Signed-off-by: David S. Miller

    Jason Baron
     

31 Jul, 2018

1 commit


29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Jun, 2018

1 commit

  • Pull aio updates from Al Viro:
    "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.

    The only thing I'm holding back for a day or so is Adam's aio ioprio -
    his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
    but let it sit in -next for decency sake..."

    * 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    aio: sanitize the limit checking in io_submit(2)
    aio: fold do_io_submit() into callers
    aio: shift copyin of iocb into io_submit_one()
    aio_read_events_ring(): make a bit more readable
    aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
    aio: take list removal to (some) callers of aio_complete()
    aio: add missing break for the IOCB_CMD_FDSYNC case
    random: convert to ->poll_mask
    timerfd: convert to ->poll_mask
    eventfd: switch to ->poll_mask
    pipe: convert to ->poll_mask
    crypto: af_alg: convert to ->poll_mask
    net/rxrpc: convert to ->poll_mask
    net/iucv: convert to ->poll_mask
    net/phonet: convert to ->poll_mask
    net/nfc: convert to ->poll_mask
    net/caif: convert to ->poll_mask
    net/bluetooth: convert to ->poll_mask
    net/sctp: convert to ->poll_mask
    net/tipc: convert to ->poll_mask
    ...

    Linus Torvalds
     

26 May, 2018

1 commit


16 May, 2018

1 commit

  • Variants of proc_create{,_data} that directly take a struct seq_operations
    and deal with network namespaces in ->open and ->release. All callers of
    proc_create + seq_open_net converted over, and seq_{open,release}_net are
    removed entirely.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

04 Apr, 2018

1 commit

  • After commit 581319c58600 ("net/socket: use per af lockdep classes for sk queues")
    sock queue locks now have per-af lockdep classes, including unix socket.
    It is no longer necessary to workaround it.

    I noticed this while looking at a syzbot deadlock report, this patch
    itself doesn't fix it (this is why I don't add Reported-by).

    Fixes: 581319c58600 ("net/socket: use per af lockdep classes for sk queues")
    Cc: Paolo Abeni
    Signed-off-by: Cong Wang
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Cong Wang
     

28 Mar, 2018

1 commit


20 Feb, 2018

1 commit


14 Feb, 2018

1 commit


13 Feb, 2018

2 commits

  • These pernet_operations are just create and destroy
    /proc and sysctl entries, and are not touched by
    foreign pernet_operations.

    So, we are able to make them async.

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Kirill Tkhai
     
  • Changes since v1:
    Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

    Before:
    All these functions either return a negative error indicator,
    or store length of sockaddr into "int *socklen" parameter
    and return zero on success.

    "int *socklen" parameter is awkward. For example, if caller does not
    care, it still needs to provide on-stack storage for the value
    it does not need.

    None of the many FOO_getname() functions of various protocols
    ever used old value of *socklen. They always just overwrite it.

    This change drops this parameter, and makes all these functions, on success,
    return length of sockaddr. It's always >= 0 and can be differentiated
    from an error.

    Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

    rpc_sockname() lost "int buflen" parameter, since its only use was
    to be passed to kernel_getsockname() as &buflen and subsequently
    not used in any way.

    Userspace API is not changed.

    text data bss dec hex filename
    30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
    30108109 2633612 873672 33615393 200ee21 vmlinux.o

    Signed-off-by: Denys Vlasenko
    CC: David S. Miller
    CC: linux-kernel@vger.kernel.org
    CC: netdev@vger.kernel.org
    CC: linux-bluetooth@vger.kernel.org
    CC: linux-decnet-user@lists.sourceforge.net
    CC: linux-wireless@vger.kernel.org
    CC: linux-rdma@vger.kernel.org
    CC: linux-sctp@vger.kernel.org
    CC: linux-nfs@vger.kernel.org
    CC: linux-x25@vger.kernel.org
    Signed-off-by: David S. Miller

    Denys Vlasenko
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But they keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

    2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

    3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

    4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

    5) Add eBPF based queue selection to tun, from Jason Wang.

    6) Lockless qdisc support, from John Fastabend.

    7) SCTP stream interleave support, from Xin Long.

    8) Smoother TCP receive autotuning, from Eric Dumazet.

    9) Lots of erspan tunneling enhancements, from William Tu.

    10) Add true function call support to BPF, from Alexei Starovoitov.

    11) Add explicit support for GRO HW offloading, from Michael Chan.

    12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

    13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

    14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

    15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

    16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

    17) Add resource abstration to devlink, from Arkadi Sharshevsky.

    18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

    19) Avoid locking in act_csum, from Davide Caratti.

    20) devinet_ioctl() simplifications from Al viro.

    21) More TCP bpf improvements from Lawrence Brakmo.

    22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
    tls: Add support for encryption using async offload accelerator
    ip6mr: fix stale iterator
    net/sched: kconfig: Remove blank help texts
    openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
    tcp_nv: fix potential integer overflow in tcpnv_acked
    r8169: fix RTL8168EP take too long to complete driver initialization.
    qmi_wwan: Add support for Quectel EP06
    rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
    ipmr: Fix ptrdiff_t print formatting
    ibmvnic: Wait for device response when changing MAC
    qlcnic: fix deadlock bug
    tcp: release sk_frag.page in tcp_disconnect
    ipv4: Get the address of interface correctly.
    net_sched: gen_estimator: fix lockdep splat
    net: macb: Handle HRESP error
    net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
    ipv6: addrconf: break critical section in addrconf_verify_rtnl()
    ipv6: change route cache aging logic
    i40e/i40evf: Update DESC_NEEDED value to reflect larger value
    bnxt_en: cleanup DIM work on device shutdown
    ...

    Linus Torvalds
     

17 Jan, 2018

1 commit

  • /proc has been ignoring struct file_operations::owner field for 10 years.
    Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba
    ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
    inode->i_fop is initialized with proxy struct file_operations for
    regular files:

    - if (de->proc_fops)
    - inode->i_fop = de->proc_fops;
    + if (de->proc_fops) {
    + if (S_ISREG(inode->i_mode))
    + inode->i_fop = &proc_reg_file_ops;
    + else
    + inode->i_fop = de->proc_fops;
    + }

    VFS stopped pinning module at this point.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

28 Nov, 2017

2 commits


04 Nov, 2017

1 commit


03 Nov, 2017

1 commit

  • …el/git/gregkh/driver-core

    Pull initial SPDX identifiers from Greg KH:
    "License cleanup: add SPDX license identifiers to some files

    Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boiler plate
    text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart
    and Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset
    of the use cases:

    - file had no licensing information it it.

    - file was a */uapi/* one with no licensing information in it,

    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied
    to a file was done in a spreadsheet of side by side results from of
    the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files created by Philippe Ombredanne.
    Philippe prepared the base worksheet, and did an initial spot review
    of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537
    files assessed. Kate Stewart did a file by file comparison of the
    scanner results in the spreadsheet to determine which SPDX license
    identifier(s) to be applied to the file. She confirmed any
    determination that was not immediately clear with lawyers working with
    the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:

    - Files considered eligible had to be source code files.

    - Make and config files were included as candidates if they contained
    >5 lines of source

    - File already had some variant of a license header in it (even if <5
    lines).

    All documentation files were explicitly excluded.

    The following heuristics were used to determine which SPDX license
    identifiers to apply.

    - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.

    For non */uapi/* files that summary was:

    SPDX license identifier # files
    ---------------------------------------------------|-------
    GPL-2.0 11139

    and resulted in the first patch in this series.

    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
    was:

    SPDX license identifier # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note 930

    and resulted in the second patch in this series.

    - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point). Results summary:

    SPDX license identifier # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note 270
    GPL-2.0+ WITH Linux-syscall-note 169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
    LGPL-2.1+ WITH Linux-syscall-note 15
    GPL-1.0+ WITH Linux-syscall-note 14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
    LGPL-2.0+ WITH Linux-syscall-note 4
    LGPL-2.1 WITH Linux-syscall-note 3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

    and that resulted in the third patch in this series.

    - when the two scanners agreed on the detected license(s), that
    became the concluded license(s).

    - when there was disagreement between the two scanners (one detected
    a license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.

    - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply
    (and which scanner probably needed to revisit its heuristics).

    - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.

    - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.

    In total, over 70 hours of logged manual review was done on the
    spreadsheet to determine the SPDX license identifiers to apply to the
    source files by Kate, Philippe, Thomas and, in some cases,
    confirmation by lawyers working with the Linux Foundation.

    Kate also obtained a third independent scan of the 4.13 code base from
    FOSSology, and compared selected files where the other two scanners
    disagreed against that SPDX file, to see if there was new insights.
    The Windriver scanner is based on an older version of FOSSology in
    part, so they are related.

    Thomas did random spot checks in about 500 files from the spreadsheets
    for the uapi headers and agreed with SPDX license identifier in the
    files he inspected. For the non-uapi files Thomas did random spot
    checks in about 15000 files.

    In initial set of patches against 4.14-rc6, 3 files were found to have
    copy/paste license identifier errors, and have been fixed to reflect
    the correct identifier.

    Additionally Philippe spent 10 hours this week doing a detailed manual
    inspection and review of the 12,461 patched files from the initial
    patch version early this week with:

    - a full scancode scan run, collecting the matched texts, detected
    license ids and scores

    - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct

    - reviewing anything where there was no detection but the patch
    license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
    applied SPDX license was correct

    This produced a worksheet with 20 files needing minor correction. This
    worksheet was then exported into 3 different .csv files for the
    different types of files to be modified.

    These .csv files were then reviewed by Greg. Thomas wrote a script to
    parse the csv files and add the proper SPDX tag to the file, in the
    format that the file expected. This script was further refined by Greg
    based on the output to detect more types of files automatically and to
    distinguish between header and source .c files (which need different
    comment types.) Finally Greg ran the script using the .csv files to
    generate the patches.

    Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
    Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

    * tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    License cleanup: add SPDX license identifier to uapi header files with a license
    License cleanup: add SPDX license identifier to uapi header files with no license
    License cleanup: add SPDX GPL-2.0 license identifier to files with no license

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

30 Oct, 2017

1 commit

  • Several conflicts here.

    NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
    nfp_fl_output() needed some adjustments because the code block is in
    an else block now.

    Parallel additions to net/pkt_cls.h and net/sch_generic.h

    A bug fix in __tcp_retransmit_skb() conflicted with some of
    the rbtree changes in net-next.

    The tc action RCU callback fixes in 'net' had some overlap with some
    of the recent tcf_block reworking.

    Signed-off-by: David S. Miller

    David S. Miller
     

26 Oct, 2017

1 commit

  • socket_diag shows information only about sockets from a namespace where
    a diag socket lives.

    But if we request information about one unix socket, the kernel don't
    check that its netns is matched with a diag socket namespace, so any
    user can get information about any unix socket in a system. This looks
    like a bug.

    v2: add a Fixes tag

    Fixes: 51d7cccf0723 ("net: make sock diag per-namespace")
    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrei Vagin
     

22 Oct, 2017

1 commit


22 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
    headers from UDP packets before queueing"), when udp packets are being
    peeked the requested extra offset is always 0 as there is no need to skip
    the udp header. However, when the offset is 0 and the next skb is
    of length 0, it is only returned once. The behaviour can be seen with
    the following python script:

    from socket import *;
    f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    f.bind(('::', 0));
    addr=('::1', f.getsockname()[1]);
    g.sendto(b'', addr)
    g.sendto(b'b', addr)
    print(f.recvfrom(10, MSG_PEEK));
    print(f.recvfrom(10, MSG_PEEK));

    Where the expected output should be the empty string twice.

    Instead, make sk_peek_offset return negative values, and pass those values
    to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset
    to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
    __skb_try_recv_from_queue will then ensure the offset is reset back to 0
    if a peek is requested without an offset, unless no packets are found.

    Also simplify the if condition in __skb_try_recv_from_queue. If _off is
    greater then 0, and off is greater then or equal to skb->len, then
    (_off || skb->len) must always be true assuming skb->len >= 0 is always
    true.

    Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
    as it double checked if MSG_PEEK was set in the flags.

    V2:
    - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
    redundant checks
    - Fix peeking in udp{,v6}_recvmsg to report the right value when the
    offset is 0

    V3:
    - Marked new branch in __skb_try_recv_from_queue as unlikely.

    Signed-off-by: Matthew Dawson
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Matthew Dawson
     

17 Jul, 2017

1 commit

  • All unix sockets now account inflight FDs to the respective sender.
    This was introduced in:

    commit 712f4aad406bb1ed67f3f98d04c044191f0ff593
    Author: willy tarreau
    Date: Sun Jan 10 07:54:56 2016 +0100

    unix: properly account for FDs passed over unix sockets

    and further refined in:

    commit 415e3d3e90ce9e18727e8843ae343eda5a58fad6
    Author: Hannes Frederic Sowa
    Date: Wed Feb 3 02:11:03 2016 +0100

    unix: correctly track in-flight fds in sending process user_struct

    Hence, regardless of the stacking depth of FDs, the total number of
    inflight FDs is limited, and accounted. There is no known way for a
    local user to exceed those limits or exploit the accounting.

    Furthermore, the GC logic is independent of the recursion/stacking depth
    as well. It solely depends on the total number of inflight FDs,
    regardless of their layout.

    Lastly, the current `recursion_level' suffers a TOCTOU race, since it
    checks and inherits depths only at queue time. If we consider `A
    Cc: Simon McVittie
    Signed-off-by: David Herrmann
    Reviewed-by: Tom Gundersen
    Signed-off-by: David S. Miller

    David Herrmann
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP when using the sch_fq packet
    scheduler for this is not practical. From Eric Dumazet.

    3) Support mutliple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

01 Jul, 2017

3 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to absense of a _hint()
    version of refcount API. If the hint() version must
    be used, we might need to revisit API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

09 Jun, 2017

1 commit


07 Apr, 2017

1 commit

  • Prepare to mark sensitive kernel structures for randomization by making
    sure they're using designated initializers. These were identified during
    allyesconfig builds of x86, arm, and arm64, and the initializer fixes
    were extracted from grsecurity. In this case, NULL initialize with { }
    instead of undesignated NULLs.

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     

22 Mar, 2017

1 commit

  • Dmitry has reported that a BUG_ON() condition in unix_notinflight()
    may be triggered by a simple code that forwards unix socket in an
    SCM_RIGHTS message.
    That is caused by incorrect unix socket GC implementation in unix_gc().

    The GC first collects list of candidates, then (a) decrements their
    "children's" inflight counter, (b) checks which inflight counters are
    now 0, and then (c) increments all inflight counters back.
    (a) and (c) are done by calling scan_children() with inc_inflight or
    dec_inflight as the second argument.

    Commit 6209344f5a37 ("net: unix: fix inflight counting bug in garbage
    collector") changed scan_children() such that it no longer considers
    sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
    of code that that unsets this flag _before_ invoking
    scan_children(, dec_iflight, ). This may lead to incorrect inflight
    counters for some sockets.

    This change fixes this bug by changing order of operations:
    UNIX_GC_CANDIDATE is now unset only after all inflight counters are
    restored to the original state.

    kernel BUG at net/unix/garbage.c:149!
    RIP: 0010:[] []
    unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
    Call Trace:
    [] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
    [] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
    [] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
    [] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
    [] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
    [] kfree_skb+0x184/0x570 net/core/skbuff.c:705
    [] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
    [] unix_release+0x49/0x90 net/unix/af_unix.c:836
    [] sock_release+0x92/0x1f0 net/socket.c:570
    [] sock_close+0x1b/0x20 net/socket.c:1017
    [] __fput+0x34e/0x910 fs/file_table.c:208
    [] ____fput+0x1a/0x20 fs/file_table.c:244
    [] task_work_run+0x1a0/0x280 kernel/task_work.c:116
    [< inline >] exit_task_work include/linux/task_work.h:21
    [] do_exit+0x183a/0x2640 kernel/exit.c:828
    [] do_group_exit+0x14e/0x420 kernel/exit.c:931
    [] get_signal+0x663/0x1880 kernel/signal.c:2307
    [] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
    [] exit_to_usermode_loop+0x1ea/0x2d0
    arch/x86/entry/common.c:156
    [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
    [] syscall_return_slowpath+0x4d3/0x570
    arch/x86/entry/common.c:259
    [] entry_SYSCALL_64_fastpath+0xc4/0xc6

    Link: https://lkml.org/lkml/2017/3/6/252
    Signed-off-by: Andrey Ulanov
    Reported-by: Dmitry Vyukov
    Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
    Signed-off-by: David S. Miller

    Andrey Ulanov
     

10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_rcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

02 Mar, 2017

1 commit


03 Feb, 2017

1 commit

  • This ioctl opens a file to which a socket is bound and
    returns a file descriptor. The caller has to have CAP_NET_ADMIN
    in the socket network namespace.

    Currently it is impossible to get a path and a mount point
    for a socket file. socket_diag reports address, device ID and inode
    number for unix sockets. An address can contain a relative path or
    a file may be moved somewhere. And these properties say nothing about
    a mount namespace and a mount point of a socket file.

    With the introduced ioctl, we can get a path by reading
    /proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.

    In CRIU we are going to use this ioctl to dump and restore unix socket.

    Here is an example how it can be used:

    $ strace -e socket,bind,ioctl ./test /tmp/test_sock
    socket(AF_UNIX, SOCK_STREAM, 0) = 3
    bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
    ioctl(3, SIOCUNIXFILE, 0) = 4
    ^Z

    $ ss -a | grep test_sock
    u_str LISTEN 0 1 test_sock 17798 * 0

    $ ls -l /proc/760/fd/{3,4}
    lrwx------ 1 root root 64 Feb 1 09:41 3 -> 'socket:[17798]'
    l--------- 1 root root 64 Feb 1 09:41 4 -> /tmp/test_sock

    $ cat /proc/760/fdinfo/4
    pos: 0
    flags: 012000000
    mnt_id: 40

    $ cat /proc/self/mountinfo | grep "^40\s"
    40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw

    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

25 Jan, 2017

1 commit

  • Dmitry reported a deadlock scenario:

    unix_bind() path:
    u->bindlock ==> sb_writer

    do_splice() path:
    sb_writer ==> pipe->mutex ==> u->bindlock

    In the unix_bind() code path, unix_mknod() does not have to
    be done with u->bindlock held, since it is a pure fs operation,
    so we can just move unix_mknod() out.

    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Cc: Rainer Weikusat
    Cc: Al Viro
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong