01 Oct, 2020

1 commit

  • Grab actual references to the files_struct. To avoid circular reference
    issues due to this, we add a per-task note that keeps track of what
    io_uring contexts a task has used. When the task execs or exits its
    assigned files, we cancel requests based on this tracking.

    With that, we can grab proper references to the files table, and no
    longer need to rely on stashing away ring_fd and ring_file to check
    if the ring_fd may have been closed.

    Cc: stable@vger.kernel.org # v5.5+
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Aug, 2020

1 commit

  • Pull init and set_fs() cleanups from Al Viro:
    "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

    * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
    init: add an init_dup helper
    init: add an init_utimes helper
    init: add an init_stat helper
    init: add an init_mknod helper
    init: add an init_mkdir helper
    init: add an init_symlink helper
    init: add an init_link helper
    init: add an init_eaccess helper
    init: add an init_chmod helper
    init: add an init_chown helper
    init: add an init_chroot helper
    init: add an init_chdir helper
    init: add an init_rmdir helper
    init: add an init_unlink helper
    init: add an init_umount helper
    init: add an init_mount helper
    init: mark create_dev as __init
    init: mark console_on_rootfs as __init
    init: initialize ramdisk_execute_command at compile time
    devtmpfs: refactor devtmpfsd()
    ...

    Linus Torvalds
     

05 Aug, 2020

1 commit

  • Pull close_range() implementation from Christian Brauner:
    "This adds the close_range() syscall. It efficiently closes a range
    of file descriptors, up to all file descriptors of the calling
    task.

    This is coordinated with the FreeBSD folks, who have copied our
    version of this syscall and in the meantime have already merged it in
    April 2020:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

    The syscall originally came up in a discussion around the new mount
    API and making new file descriptor types cloexec by default. During
    this discussion, Al suggested the close_range() syscall.

    First, it helps to close all file descriptors of an exec()ing task.
    This can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that
    file descriptors are not cloexec by default. This is aggravated by the
    fact that we can't just switch them over without massively regressing
    userspace. For a whole class of programs having an in-kernel method of
    closing all file descriptors is very helpful (e.g. daemons, service
    managers, programming language standard libraries, container managers
    etc.).

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc/<pid>/fd/* and calling close() on
    each file descriptor and other hacks. From looking at various
    large(ish) userspace code bases this or similar patterns are very
    common in service managers, container runtimes, and programming
    language runtimes/standard libraries such as Python or Rust.

    In addition, the syscall will also work for tasks that do not have
    procfs mounted and on kernels that do not have procfs support compiled
    in. In such situations the only way to make sure that all file
    descriptors are closed is to call close() on each file descriptor up
    to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.

    Based on Linus' suggestion close_range() also comes with a new flag
    CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
    right before exec. This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part
    of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
    gets especially handy when we're closing all file descriptors above a
    certain threshold.

    Test-suite as always included"

    * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add CLOSE_RANGE_UNSHARE tests
    close_range: add CLOSE_RANGE_UNSHARE
    tests: add close_range() tests
    arch: wire-up close_range()
    open: add close_range()

    Linus Torvalds
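
The range-closing call described above can be sketched from user space as follows. This is a minimal sketch, not code from the series: close_fd_range() is a hypothetical helper name, the raw syscall is used on the assumption that the C library may predate a close_range() wrapper, and the fallback is the close() loop that the pull message describes as a hack.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Close every descriptor in [first, last]. Hypothetical helper for
 * illustration only. Returns 0 on success, -1 with errno set on bad input.
 */
static int close_fd_range(unsigned int first, unsigned int last)
{
    long open_max;
    unsigned int fd;

    if (first > last) {
        errno = EINVAL;
        return -1;
    }
#ifdef SYS_close_range
    /* One syscall for the whole range; flags are 0 in this sketch. */
    if (syscall(SYS_close_range, first, last, 0U) == 0)
        return 0;
#endif
    /*
     * Fallback (old kernel, or a sandbox that blocks the syscall): call
     * close() on each number, capped at the current descriptor limit so
     * that last == ~0U stays bounded. Descriptors above a lowered limit
     * would be missed, which is one reason the syscall is preferable.
     */
    open_max = sysconf(_SC_OPEN_MAX);
    if (open_max <= 0)
        open_max = 1024; /* assumption: conservative default */
    if ((unsigned long)open_max - 1 < last)
        last = (unsigned int)(open_max - 1);
    if (first > last)
        return 0;
    for (fd = first; ; fd++) {
        close((int)fd); /* errors (e.g. EBADF) are ignored on purpose */
        if (fd == last)
            break;
    }
    return 0;
}
```

A caller preparing for a sensitive exec would typically use close_fd_range(3, ~0U) to keep only stdin, stdout and stderr.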
     

31 Jul, 2020

1 commit


14 Jul, 2020

3 commits

  • Expand __receive_fd() with support for replace_fd() for the coming seccomp
    "addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
    and update existing wrappers to retain old behavior.

    Thanks to Colin Ian King for pointing out an
    uninitialized variable exposure in an earlier version of this patch.

    Cc: Alexander Viro
    Cc: Dmitry Kadashev
    Cc: Jens Axboe
    Cc: Arnd Bergmann
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook

    Kees Cook
     
  • For both pidfd and seccomp, the __user pointer is not used. Update
    __receive_fd() to make writing to ufd optional via a NULL check. However,
    for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
    can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
    new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
    argument. For the new helper, the allocated fd needs to be returned on
    success. Update the existing callers to handle it.

    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook

    Kees Cook
     
  • In preparation for users of the "install a received file" logic outside
    of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
    net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
    named receive_fd_user(), as future patches will change the interface
    to __receive_fd().

    Additionally add a comment to fd_install() as a counterpoint to how
    __receive_fd() interacts with fput().

    Cc: Alexander Viro
    Cc: "David S. Miller"
    Cc: Jakub Kicinski
    Cc: Dmitry Kadashev
    Cc: Jens Axboe
    Cc: Arnd Bergmann
    Cc: Sargun Dhillon
    Cc: Ido Schimmel
    Cc: Ioana Ciornei
    Cc: linux-fsdevel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Reviewed-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook

    Kees Cook
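
For context, the "install a received file" path that this commit relocates is the one user space reaches through SCM_RIGHTS. The following is a minimal user-space sketch of that interface, assuming an AF_UNIX socket pair; send_fd() and recv_fd() are hypothetical helper names, and nothing here is kernel code from the patch.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one descriptor over a UNIX socket. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the newly installed fd, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    /* The kernel has already installed the file in our fd table. */
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file as the sender's descriptor.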
     

17 Jun, 2020

2 commits

  • One of the use-cases of close_range() is to drop file descriptors just before
    execve(). This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part of
    close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

    This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
    maximum number of file descriptors to copy from the old struct files. When the
    user requests that all file descriptors are supposed to be closed via
    close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
    to do any of the heavy fput() work for everything above min.

    The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
    fact currently share our file descriptor table we create a new private copy.
    We then close all fds in the requested range and finally after we're done we
    install the new fd table.

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner

    Christian Brauner
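
A minimal user-space sketch of the "drop everything past stderr, then exec" use case with the new flag. The helper names drop_fds_from(), spawn_clean() and wait_exit_code() are hypothetical; the flag value is defined locally in case the installed headers lack it, and the child simply gives up (exit code 126) if the kernel rejects the call, where a real program would fall back to another method.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1) /* value from the kernel uapi header */
#endif

/* Unshare the fd table and close every descriptor >= first in one call. */
static int drop_fds_from(unsigned int first)
{
#ifdef SYS_close_range
    return (int)syscall(SYS_close_range, first, ~0U, CLOSE_RANGE_UNSHARE);
#else
    errno = ENOSYS;
    return -1;
#endif
}

/* Fork and exec path with only descriptors 0, 1 and 2 left open. */
static pid_t spawn_clean(const char *path, char *const argv[])
{
    pid_t pid = fork();

    if (pid != 0)
        return pid; /* parent, or fork() failure (-1) */
    if (drop_fds_from(3) != 0)
        _exit(126); /* sketch only: no fallback implemented */
    execv(path, argv);
    _exit(127);     /* exec failed */
}

/* Wait for the child and return its exit code, or -1 if it did not exit. */
static int wait_exit_code(pid_t pid)
{
    int status = 0;

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

After fork() the child already has a private descriptor table, so the flag is a no-op there; it matters when the caller shares its table with other threads or with CLONE_FILES siblings.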
     
  • This adds the close_range() syscall. It efficiently closes a range of
    file descriptors, up to all file descriptors of the calling task.

    I was contacted by FreeBSD as they wanted to have the same close_range()
    syscall as we proposed here. We've coordinated this and in the meantime, Kyle
    was fast enough to merge close_range() into FreeBSD already in April:
    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836
    and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
    once it's merged in Linux too. Python is in the process of switching to
    close_range() on FreeBSD and they are waiting on us to merge this to switch on
    Linux as well: https://bugs.python.org/issue38061

    The syscall came up in a recent discussion around the new mount API and
    making new file descriptor types cloexec by default. During this
    discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
    syscall in this manner has been requested by various people over time.

    First, it helps to close all file descriptors of an exec()ing task. This
    can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that file
    descriptors are not cloexec by default. This is aggravated by the fact that
    we can't just switch them over without massively regressing userspace. For
    a whole class of programs having an in-kernel method of closing all file
    descriptors is very helpful (e.g. daemons, service managers, programming
    language standard libraries, container managers etc.).
    (Please note, unshare(CLONE_FILES) should only be needed if the calling
    task is multi-threaded and shares the file descriptor table with another
    thread in which case two threads could race with one thread allocating file
    descriptors and the other one closing them via close_range(). For the
    general case close_range() before the execve() is sufficient.)

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
    file descriptor. From looking at various large(ish) userspace code bases
    this or similar patterns are very common in:
    - service managers (cf. [4])
    - libcs (cf. [6])
    - container runtimes (cf. [5])
    - programming language runtimes/standard libraries
    - Python (cf. [2])
    - Rust (cf. [7], [8])
    As Dmitry pointed out there's even a long-standing glibc bug about missing
    kernel support for this task (cf. [3]).
    In addition, the syscall will also work for tasks that do not have procfs
    mounted and on kernels that do not have procfs support compiled in. In such
    situations the only way to make sure that all file descriptors are closed
    is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
    OPEN_MAX trickery (cf. comment [8] on Rust).

    The performance is striking. For good measure, comparing the following
    simple close_all_fds() userspace implementation that is essentially just
    glibc's version in [6]:

    static int close_all_fds(void)
    {
            int dir_fd;
            DIR *dir;
            struct dirent *direntp;

            dir = opendir("/proc/self/fd");
            if (!dir)
                    return -1;
            dir_fd = dirfd(dir);
            while ((direntp = readdir(dir))) {
                    int fd;
                    if (strcmp(direntp->d_name, ".") == 0)
                            continue;
                    if (strcmp(direntp->d_name, "..") == 0)
                            continue;
                    fd = atoi(direntp->d_name);
                    if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                            continue;
                    close(fd);
            }
            closedir(dir);
            return 0;
    }

    to close_range() yields:
    1. closing 4 open files:
    - close_all_fds(): ~280 us
    - close_range(): ~24 us

    2. closing 1000 open files:
    - close_all_fds(): ~5000 us
    - close_range(): ~800 us

    close_range() is designed to allow for some flexibility. Specifically, it
    does not simply always close all open file descriptors of a task. Instead,
    callers can specify an upper bound.
    This is e.g. useful for scenarios where specific file descriptors are
    created with well-known numbers that are supposed to be excluded from
    getting closed.
    For extra paranoia close_range() comes with a flags argument. This can e.g.
    be used to implement extensions. One can imagine userspace wanting to stop
    at the first error instead of ignoring errors under certain circumstances.
    There might be other valid ideas in the future. In any case, a flag
    argument doesn't hurt and keeps us on the safe side.

    From an implementation side this is kept rather dumb. It saw some input
    from David and Jann but all nonsense is obviously my own!
    - Errors to close file descriptors are currently ignored. (Could be changed
    by setting a flag in the future if needed.)
    - __close_range() is a rather simplistic wrapper around __close_fd().
    My reasoning behind this is based on the nature of how __close_fd() needs
    to release an fd. But maybe I misunderstood specifics:
    We take the files_lock and rcu-dereference the fdtable of the calling
    task, we find the entry in the fdtable, get the file and need to release
    files_lock before calling filp_close().
    In the meantime the fdtable might have been altered so we can't just
    retake the spinlock and keep the old rcu-reference of the fdtable
    around. Instead we need to grab a fresh reference to the fdtable.
    If my reasoning is correct then there's really no point in fancyfying
    __close_range(): We just need to rcu-dereference the fdtable of the
    calling task once to cap the max_fd value correctly and then go on
    calling __close_fd() in a loop.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
    [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
    [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
    [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
    [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
    [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
    Note that this is an internal implementation that is not exported.
    Currently, libc seems to not provide an exported version of this
    because of missing kernel support to do this.
    Note, in a recent patch series Florian made grantpt() a nop thereby
    removing the code referenced here.
    [7]: https://github.com/rust-lang/rust/issues/12148
    [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
    Rust's solution is slightly different but is equally unperformant.
    Rust calls getdtablesize() which is a glibc library function that
    simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
    goes on to call close() on each fd. That's obviously overkill for most
    tasks. Tasks, especially non-daemons, rarely hit RLIMIT_NOFILE or
    OPEN_MAX.
    Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
    to 1024. Even in this case, there's a very high chance that in the
    common case Rust is calling the close() syscall 1021 times pointlessly
    if the task just has 0, 1, and 2 open.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Kyle Evans
    Cc: Jann Horn
    Cc: David Howells
    Cc: Dmitry V. Levin
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Florian Weimer
    Cc: linux-api@vger.kernel.org

    Christian Brauner
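
The implementation notes above (cap max_fd once against the table size, then close slot by slot and ignore errors) can be modelled in a few lines. This is a toy user-space model only: struct toy_fdtable, toy_close_fd() and toy_close_range() are made-up names, and no locking or RCU is represented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a task's fd table: open[i] says whether slot i is used. */
struct toy_fdtable {
    unsigned int max_fds;
    bool *open;
};

/* Toy stand-in for closing a single descriptor; fails on an unused slot. */
static int toy_close_fd(struct toy_fdtable *t, unsigned int fd)
{
    if (fd >= t->max_fds || !t->open[fd])
        return -1;
    t->open[fd] = false;
    return 0;
}

/* Close all slots in [fd, max_fd]. Returns 0, or -1 if fd > max_fd. */
static int toy_close_range(struct toy_fdtable *t, unsigned int fd,
                           unsigned int max_fd)
{
    if (fd > max_fd)
        return -1;
    if (t->max_fds == 0)
        return 0;
    /* Cap once, so a request up to ~0U walks only the existing slots. */
    if (max_fd >= t->max_fds)
        max_fd = t->max_fds - 1;
    while (fd <= max_fd) {
        toy_close_fd(t, fd); /* per-slot errors are deliberately ignored */
        fd++;
    }
    return 0;
}
```

With an 8-slot table, toy_close_range(&t, 3, ~0U) visits five slots rather than iterating up to UINT_MAX.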
     

20 May, 2020

1 commit

  • cpy and set really should be size_t; we won't get an overflow on that,
    since sysctl_nr_open can't be set above ~(size_t)0 / sizeof(void *),
    so nr that would've managed to overflow size_t on that multiplication
    won't get anywhere near copy_fdtable() - we'll fail with EMFILE
    before that.

    Cc: stable@kernel.org # v2.6.25+
    Fixes: 9cfe015aa424 (get rid of NR_OPEN and introduce a sysctl_nr_open)
    Reported-by: Thiago Macieira
    Signed-off-by: Al Viro

    Al Viro
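
The arithmetic behind this fix can be checked in isolation. The sketch below is not the kernel's copy_fdtable(); the function names are hypothetical. It shows the bound quoted in the message (nr at most ~(size_t)0 / sizeof(void *)) and what happens when the byte count is squeezed into an unsigned int instead of a size_t.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Largest nr for which nr * sizeof(void *) still fits in a size_t. */
static size_t max_safe_nr(void)
{
    return SIZE_MAX / sizeof(void *);
}

/* Store nr * sizeof(void *) in *bytes; return -1 if it would overflow. */
static int ptr_array_bytes(size_t nr, size_t *bytes)
{
    if (nr > max_safe_nr())
        return -1;
    *bytes = nr * sizeof(void *);
    return 0;
}

/* The same product kept in an unsigned int, as a too-narrow variable would. */
static unsigned int truncated_bytes(size_t nr)
{
    return (unsigned int)(nr * sizeof(void *));
}
```

On a typical 64-bit build, nr = 2^29 gives 2^32 bytes, which truncated_bytes() reports as 0 while ptr_array_bytes() reports the full value.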
     

20 Mar, 2020

1 commit

  • Dmitry reports that a test case shows that io_uring isn't honoring a
    modified rlimit nofile setting. get_unused_fd_flags() checks the task
    signal->rlim[] for the limits. As this isn't easily inheritable,
    provide a __get_unused_fd_flags() that takes the value instead. Then we
    can grab it when the request is prepared (from the original task), and
    pass that in when we do the async part of the open.

    Reported-by: Dmitry Kadashev
    Tested-by: Dmitry Kadashev
    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
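
The limit that this commit makes io_uring honor is the one a plain synchronous open() already enforces. A minimal user-space sketch of that baseline behavior follows; open_errno_at_nofile_limit() is a hypothetical helper, it assumes /dev/null exists, and it assumes a single-threaded caller so that no other thread allocates descriptors in between.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Lower the soft RLIMIT_NOFILE to the lowest free descriptor number, try
 * to open a file, and report the errno that open() produced (0 if it
 * unexpectedly succeeded, -1 if the setup itself failed).
 */
static int open_errno_at_nofile_limit(void)
{
    struct rlimit saved, lowered;
    int probe, fd, err;

    if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
        return -1;

    probe = open("/dev/null", O_RDONLY);
    if (probe < 0)
        return -1;
    close(probe); /* probe is now the lowest free descriptor number */

    lowered = saved;
    lowered.rlim_cur = (rlim_t)probe; /* only descriptors < probe allowed */
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        return -1;

    fd = open("/dev/null", O_RDONLY);
    err = fd < 0 ? errno : 0;
    if (fd >= 0)
        close(fd);

    setrlimit(RLIMIT_NOFILE, &saved); /* restore the original soft limit */
    return err;
}
```

The expected result is EMFILE, which is the outcome the reported test case was not getting from io_uring's asynchronous open path.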
     

30 Jan, 2020

1 commit

  • Pull thread management updates from Christian Brauner:
    "Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
    syscall.

    This syscall allows for the retrieval of file descriptors of a process
    based on its pidfd. A task needs to have ptrace_may_access()
    permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
    Andy) on the target.

    One of the main use-cases is in combination with seccomp's user
    notification feature. As a reminder, seccomp's user notification
    feature was made available in v5.0. It allows a task to retrieve a
    file descriptor for its seccomp filter. The file descriptor is usually
    handed off to a more privileged supervising process. The supervisor can
    then listen for syscall events caught by the seccomp filter of the
    supervisee and perform actions in lieu of the supervisee, usually
    emulating syscalls. pidfd_getfd() is needed to expand its uses.

    There are currently two major users that wait on pidfd_getfd() and one
    future user:

    - Netflix, Sargun said, is working on a service mesh where users
    should be able to connect to a dns-based VIP. When a user connects
    to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
    redirected to an envoy process. This service mesh uses seccomp user
    notifications and pidfd to intercept all connect calls and instead
    of connecting them to 1.2.3.4:80 connects them to e.g.
    127.0.0.1:8080.

    - LXD uses the seccomp notifier heavily to intercept and emulate
    mknod() and mount() syscalls for unprivileged containers/processes.
    With pidfd_getfd() more use-cases e.g. bridging socket connections
    will be possible.

    - The patchset has also seen some interest from the browser corner.
    Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
    broker process. In the future glibc will start blocking all signals
    during dlopen() rendering this type of sandbox impossible. Hence,
    in the future Firefox will switch to a seccomp-user-notification
    based sandbox which also makes use of file descriptor retrieval.
    The thread for this can be found at
    https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html

    With pidfd_getfd() it is e.g. possible to bridge socket connections
    for the supervisee (binding to a privileged port) and taking actions
    on file descriptors on behalf of the supervisee in general.

    Sargun's first version was using an ioctl on pidfds but various people
    pushed for it to be a proper syscall which he duly implemented as
    well over various review cycles. Selftests are of course included.
    I've also added instructions how to deal with merge conflicts below.

    There's also a small fix coming from the kernel mentee project to
    correctly annotate struct sighand_struct with __rcu to fix various
    sparse warnings. We've received a few more such fixes and even though
    they are mostly trivial I've decided to postpone them until after -rc1
    since they came in rather late and I don't want to risk introducing
    build warnings.

    Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
    needed to avoid allocation recursions triggerable by storage drivers
    that have userspace parts that run in the IO path (e.g. dm-multipath,
    iscsi, etc). These allocation recursions deadlock the device.

    The new prctl() allows such privileged userspace components to avoid
    allocation recursions by setting the PF_MEMALLOC_NOIO and
    PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
    relevant maintainers and is routed here as part of prctl()
    thread-management."

    * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
    sched.h: Annotate sighand_struct with __rcu
    test: Add test for pidfd getfd
    arch: wire up pidfd_getfd syscall
    pid: Implement pidfd_getfd syscall
    vfs, fdtable: Add fget_task helper

    Linus Torvalds
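
A minimal user-space sketch of the retrieval described above, using raw syscalls on the assumption that the C library may not provide wrappers. getfd_from_pid() is a hypothetical helper name. It opens a pidfd for the target and asks for a duplicate of one of the target's descriptors; the caller needs the ptrace access mentioned in the pull message.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Duplicate descriptor target_fd of process pid into the calling process.
 * Returns the new local descriptor, or -1 with errno set.
 */
static int getfd_from_pid(pid_t pid, int target_fd)
{
#if defined(SYS_pidfd_open) && defined(SYS_pidfd_getfd)
    int pidfd, fd, saved;

    pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);
    if (pidfd < 0)
        return -1;

    /* The flags argument must currently be 0. */
    fd = (int)syscall(SYS_pidfd_getfd, pidfd, target_fd, 0U);
    saved = errno;
    close(pidfd);
    errno = saved; /* keep the errno of pidfd_getfd, not of close() */
    return fd;
#else
    errno = ENOSYS; /* headers too old to know the syscall numbers */
    return -1;
#endif
}
```

A supervisor would pass the supervisee's pid and a descriptor number taken from a seccomp notification; calling it on the caller's own pid is a convenient smoke test.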
     

21 Jan, 2020

1 commit


14 Jan, 2020

1 commit

  • This introduces a function which can be used to fetch a file, given an
    arbitrary task. As long as the user holds a reference (refcnt) to the
    task_struct it is safe to call, and will either return NULL on failure,
    or a pointer to the file, with a refcnt.

    This patch is based on Oleg Nesterov's (cf. [1]) patch from September
    2018.

    [1]: Link: https://lore.kernel.org/r/20180915160423.GA31461@redhat.com

    Signed-off-by: Sargun Dhillon
    Suggested-by: Oleg Nesterov
    Acked-by: Christian Brauner
    Reviewed-by: Arnd Bergmann
    Link: https://lore.kernel.org/r/20200107175927.4558-2-sargun@sargun.me
    Signed-off-by: Christian Brauner

    Sargun Dhillon
     

03 Jan, 2020

1 commit

  • This reverts commit 8243186f0cc7 ("fs: remove ksys_dup()") and the
    subsequent fix for it in commit 2d3145f8d280 ("early init: fix error
    handling when opening /dev/console").

    Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
    caused more trouble than it is worth: it requires accessing vfs
    internals and it turns out there were other bugs in it too.

    In particular, the file reference counting was wrong - because unlike
    the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
    thus had an extra leaked file reference.

    That in turn then caused odd problems with Android-x86 long after boot
    because of how the extra reference to the console kept the session active
    even after all file descriptors had been closed.

    Reported-by: youling 257
    Cc: Arvind Sankar
    Cc: Al Viro
    Signed-off-by: Dominik Brodowski
    Signed-off-by: Linus Torvalds

    Dominik Brodowski
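
The effect of a stray reference can be illustrated from user space with a pipe standing in for the console. This is only an analogy, not the kernel code path: the hypothetical helper below creates three descriptors for one open file ("open + 2*dup") and checks that the other side sees the file go away only when the last reference is dropped.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if the reader sees end-of-file only after the last of three
 * writer descriptors is closed, 0 if not, -1 on setup failure.
 */
static int eof_only_after_last_close(void)
{
    int p[2], w[3], early, late;
    char c;

    if (pipe(p) != 0)
        return -1;
    if (fcntl(p[0], F_SETFL, O_NONBLOCK) != 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* One open file, three descriptors: the "open + 2*dup" pattern. */
    w[0] = p[1];
    w[1] = dup(p[1]);
    w[2] = dup(p[1]);
    if (w[1] < 0 || w[2] < 0) {
        if (w[1] >= 0)
            close(w[1]);
        close(w[0]);
        close(p[0]);
        return -1;
    }

    close(w[0]);
    close(w[1]);
    /* One reference left: the pipe is still open for writing, no EOF. */
    early = (read(p[0], &c, 1) == -1 && errno == EAGAIN);

    close(w[2]);
    /* Last reference gone: a read now returns 0, i.e. end-of-file. */
    late = (read(p[0], &c, 1) == 0);

    close(p[0]);
    return early && late;
}
```

A leaked fourth reference, as in the reverted code, would correspond to the reader never seeing end-of-file even though every visible descriptor had been closed.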
     

13 Dec, 2019

1 commit

  • ksys_dup() is used only at one place in the kernel, namely to duplicate
    fd 0 of /dev/console to stdout and stderr. The same functionality can be
    achieved by using functions already available within the kernel namespace.

    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

27 Nov, 2019

1 commit

  • This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.

    I was hoping it would be benign to switch over entirely to FMODE_STREAM,
    and we'd have just a couple of small fixups we'd need, but it looks like
    we're not quite there yet.

    While it worked fine on both my desktop and laptop, they are fairly
    similar in other respects, and run mostly the same loads. Kenneth
    Crudup reports that it seems to break both his vmware installation and
    the KDE upower service. In both cases apparently leading to timeouts
    due to waiting for the f_pos lock.

    There are a number of character devices in particular that definitely
    want stream-like behavior, but that currently don't get marked as
    streams, and as a result get the exclusion between concurrent
    read()/write() on the same file descriptor. Which doesn't work well for
    them.

    The most obvious example of this is /dev/console and /dev/tty, which use
    console_fops and tty_fops respectively (and ptmx_fops for the pty master
    side). It may be that it's just this that causes problems, but we
    clearly weren't ready yet.

    Because there's a number of other likely common cases that don't have
    llseek implementations and would seem to act as stream devices:

    /dev/fuse (fuse_dev_operations)
    /dev/mcelog (mce_chrdev_ops)
    /dev/mei0 (mei_fops)
    /dev/net/tun (tun_fops)
    /dev/nvme0 (nvme_dev_fops)
    /dev/tpm0 (tpm_fops)
    /proc/self/ns/mnt (ns_file_operations)
    /dev/snd/pcm* (snd_pcm_f_ops[])

    and while some of these could be trivially automatically detected by the
    vfs layer when the character device is opened by just noticing that they
    have no read or write operations either, it often isn't that obvious.

    Some character devices most definitely do use the file position, even if
    they don't allow seeking: the firmware update code, for example, uses
    simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
    back and forth.

    We'll revisit this when there's a better way to detect the problem and
    fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
    annotations).

    Reported-by: Kenneth R. Crudup
    Cc: Kirill Smelkov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Nov, 2019

1 commit

  • fdget_pos() is used by file operations that will read and update f_pos:
    things like "read()", "write()" and "lseek()" (but not, for example,
    "pread()/pwrite()" that get their file positions elsewhere).

    However, it had two separate escape clauses for this, because not
    everybody wants or needs serialization of the file position.

    The first and most obvious case is the "file descriptor doesn't have a
    position at all", ie a stream-like file. Except we didn't actually use
    FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that
    was that FMODE_STREAM didn't exist back in the days, but also that we
    didn't want to mark all the special cases, so we only marked the ones
    that _required_ position atomicity according to POSIX - regular files
    and directories.

    The first case was intentionally lazy, but now that we _do_ have
    FMODE_STREAM we could and should just use it. With the change to use
    FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
    the code to set it is deleted.

    Any cases where we don't want the serialization because the driver (or
    subsystem) doesn't use the file position should just be updated to do
    "stream_open()". We've done that for all the obvious and common
    situations, we may need a few more. Quoting Kirill Smelkov in the
    original FMODE_STREAM thread (see link below for full email):

    "And I appreciate if people could help at least somehow with "getting
    rid of mixed case entirely" (i.e. always lock f_pos_lock on
    !FMODE_STREAM), because this transition starts to diverge from my
    particular use-case too far. To me it makes sense to do that
    transition as follows:

    - convert nonseekable_open -> stream_open via stream_open.cocci;
    - audit other nonseekable_open calls and convert left users that
    truly don't depend on position to stream_open;
    - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
    will cover pipes and sockets), or maybe convert pipes and sockets
    to FMODE_STREAM manually;
    - extend stream_open.cocci to analyze file_operations that use
    no_llseek or noop_llseek, but do not use nonseekable_open or
    alloc_file_pseudo. This might find files that have stream semantic
    but are opened differently;
    - extend stream_open.cocci to analyze file_operations whose
    .read/.write do not use ppos at all (independently of how file was
    opened);
    - ...
    - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
    !FMODE_STREAM;
    - gather bug reports for deadlocked read/write and convert missed
    cases to FMODE_STREAM, probably extending stream_open.cocci along
    the road to catch similar cases

    i.e. always take f_pos_lock unless a file is explicitly marked as
    being stream, and try to find and cover all files that are streams"

    We have not done the "extend stream_open.cocci to analyze
    alloc_file_pseudo" as well, but the previous commit did manually handle
    the case of pipes and sockets.

    The other case where we can avoid locking f_pos is the "this file
    descriptor only has a single user and it is us, and thus there is no
    need to lock it".

    The second test was correct, although a bit subtle and worth just
    re-iterating here. There are two kinds of other sources of references
    to the same file descriptor: file descriptors that have been explicitly
    shared across fork() or with dup(), and file tables having elevated
    reference counts due to threading (or explicit file sharing with
    clone()).

    The first case would have incremented the file count explicitly, and in
    the second case the previous __fdget() would have incremented it for us
    and set the FDPUT_FPUT flag.

    But in both cases the file count would be greater than one, so the
    "file_count(file) > 1" test catches both situations. Also note that if
    file_count is 1, that also means that no other thread can have access to
    the file table, so there also cannot be races with concurrent calls to
    dup()/fork()/clone() that would increment the file count any other way.

    Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
    Cc: Kirill Smelkov
    Cc: Eric Dumazet
    Cc: Al Viro
    Cc: Alan Stern
    Cc: Marco Elver
    Cc: Andrea Parri
    Cc: Paul McKenney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
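
The distinction drawn at the top of this message (read() and lseek() use the shared file position, pread() does not) is visible from user space. A minimal sketch, assuming a writable /tmp; fpos_demo() is a hypothetical helper and says nothing about the locking itself.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fills pos[0..2] with the file position after a read(), after a pread(),
 * and as seen through a dup()ed descriptor. Returns 0 on success.
 */
static int fpos_demo(off_t pos[3])
{
    char path[] = "/tmp/fpos-demo-XXXXXX";
    char buf[2];
    int fd, dupfd, ret = -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once the descriptors are closed */

    dupfd = dup(fd); /* second descriptor for the same open file */
    if (dupfd < 0)
        goto out;
    if (write(fd, "abcdef", 6) != 6 || lseek(fd, 0, SEEK_SET) != 0)
        goto out;

    if (read(fd, buf, 2) != 2) /* read() advances the position to 2 */
        goto out;
    pos[0] = lseek(fd, 0, SEEK_CUR);

    if (pread(fd, buf, 2, 4) != 2) /* explicit offset, position untouched */
        goto out;
    pos[1] = lseek(fd, 0, SEEK_CUR);

    pos[2] = lseek(dupfd, 0, SEEK_CUR); /* dup shares the same position */
    ret = 0;
out:
    if (dupfd >= 0)
        close(dupfd);
    close(fd);
    return ret;
}
```

All three values come out as 2: the position moved once for the read(), pread() left it alone, and the dup()ed descriptor observes the same position because both descriptors refer to one open file, which is the sharing case the file count test is meant to catch.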
     

09 Mar, 2019

1 commit

  • Pull io_uring IO interface from Jens Axboe:
    "Second attempt at adding the io_uring interface.

    Since the first one, we've added basic unit testing of the three
    system calls, that resides in liburing like the other unit tests that
    we have so far. It'll take a while to get full coverage of it, but
    we're working towards it. I've also added two basic test programs to
    tools/io_uring. One uses the raw interface and has support for all the
    various features that io_uring supports outside of standard IO, like
    fixed files, fixed IO buffers, and polled IO. The other uses the
    liburing API, and is a simplified version of cp(1).

    This adds support for a new IO interface, io_uring.

    io_uring allows an application to communicate with the kernel through
    two rings, the submission queue (SQ) and completion queue (CQ) ring.
    This allows for very efficient handling of IOs, see the v5 posting for
    some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

    Outside of just efficiency, the interface is also flexible and
    extendable, and allows for future use cases like the upcoming NVMe
    key-value store API, networked IO, and so on. It also supports async
    buffered IO, something that we've always failed to support in the
    kernel.

    Outside of basic IO features, it supports async polled IO as well.
    This particular feature has already been tested at Facebook months ago
    for flash storage boxes, with 25-33% improvements. It makes polled IO
    actually useful for real world use cases, where even basic flash sees
    a nice win in terms of efficiency, latency, and performance. These
    boxes were IOPS bound before, now they are not.

    This series adds three new system calls. One for setting up an
    io_uring instance (io_uring_setup(2)), one for submitting/completing
    IO (io_uring_enter(2)), and one for aux functions like registering
    file sets, buffers, etc (io_uring_register(2)). Through the help of
    Arnd, I've coordinated the syscall numbers so merge on that front
    should be painless.

    Jon did a writeup of the interface a while back, which (except for
    minor details that have been tweaked) is still accurate. Find that
    here:

    https://lwn.net/Articles/776703/

    Huge thanks to Al Viro for helping to get the reference cycle code
    correct, and to Jann Horn for his extensive reviews focused on both
    security and bugs in general.

    There's a userspace library that provides basic functionality for
    applications that don't need or want to care about how to fiddle with
    the rings directly. It has helpers to allow applications to easily set
    up an io_uring instance, and submit/complete IO through it without
    knowing about the intricacies of the rings. It also includes man pages
    (thanks to Jeff Moyer), and will continue to grow support helper
    functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

    Fio has full support for the raw interface, both in the form of an IO
    engine (io_uring), but also with a small test application (t/io_uring)
    that can exercise and benchmark the interface"

    * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
    io_uring: add a few test tools
    io_uring: allow workqueue item to handle multiple buffered requests
    io_uring: add support for IORING_OP_POLL
    io_uring: add io_kiocb ref count
    io_uring: add submission polling
    io_uring: add file set registration
    net: split out functions related to registering inflight socket files
    io_uring: add support for pre-mapped user IO buffers
    block: implement bio helper to add iter bvec pages to bio
    io_uring: batch io_kiocb allocation
    io_uring: use fget/fput_many() for file references
    fs: add fget_many() and fput_many()
    io_uring: support for IO polling
    io_uring: add fsync support
    Add io_uring IO interface
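
    The two-ring arrangement described in the pull message above can be
    pictured with a small userspace model. This is only a sketch under
    stated assumptions: the struct and function names (ring, ring_push,
    ring_pop, process) are hypothetical, the layout is not the real
    io_uring ABI, and the code is single-threaded, so it omits the memory
    barriers a ring shared between an application and the kernel would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Toy ring: the producer only advances tail, the consumer only advances head. */
struct ring {
	uint32_t head;                  /* next slot to consume */
	uint32_t tail;                  /* next slot to fill */
	uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool ring_push(struct ring *r, uint64_t v)
{
	if (r->tail - r->head == RING_ENTRIES)
		return false;
	r->slots[r->tail & RING_MASK] = v;
	r->tail++;
	return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_pop(struct ring *r, uint64_t *out)
{
	if (r->head == r->tail)
		return false;
	*out = r->slots[r->head & RING_MASK];
	r->head++;
	return true;
}

/*
 * Stand-in for the kernel side: consume submissions from the SQ and post
 * one completion per request to the CQ, stopping when the CQ is full.
 */
static unsigned int process(struct ring *sq, struct ring *cq)
{
	uint64_t req;
	unsigned int done = 0;

	while (cq->tail - cq->head < RING_ENTRIES && ring_pop(sq, &req)) {
		ring_push(cq, req);     /* echo the request id as its completion */
		done++;
	}
	return done;
}
```

    In this model the application pushes request ids onto the SQ, calls
    process() where a real program would enter the kernel, and then pops
    completions from the CQ. Because the indices are free-running and masked
    on access, full and empty states are distinguished without a spare slot.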

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • (Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)

    Calling 'get_unused_fd_flags' in a kthread causes a kernel crash. It
    works fine on 4.1, but crashes after 64 fds have been allocated. It also
    crashes on ubuntu1404/1604/1804 and centos7.5, and the crash messages
    are almost the same.

    The crash message on centos7.5 is shown below:

    start fd 61
    start fd 62
    start fd 63
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: __wake_up_common+0x2e/0x90
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
    virtio floppy dm_mirror dm_region_hash dm_log dm_mod
    CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G OE ------------ 3.10.0-862.3.3.el7.x86_64 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
    task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
    RIP: 0010:__wake_up_common+0x2e/0x90
    RSP: 0018:ffff8e94247a2d18 EFLAGS: 00010086
    RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
    RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
    FS: 0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
    Call Trace:
    __wake_up+0x39/0x50
    expand_files+0x131/0x250
    __alloc_fd+0x47/0x170
    get_unused_fd_flags+0x30/0x40
    test_fd+0x12a/0x1c0 [test]
    kthread+0xd1/0xe0
    ret_from_fork_nospec_begin+0x21/0x21
    Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
    RIP __wake_up_common+0x2e/0x90
    RSP
    CR2: 0000000000000000

    This issue has existed since CentOS 7.5 (3.10.0-862); CentOS 7.4
    (3.10.0-693.21.1) is OK. Root cause: the item 'resize_wait' is not
    initialized before being used.

    Reported-by: Richard Zhang
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shuriyc Chu
     

28 Feb, 2019

1 commit

  • Some use cases repeatedly get and put references to the same file, but
    the only exposed interface does these one at a time. As each of
    these entails an atomic inc or dec on a shared structure, that cost can
    add up.

    Add fget_many(), which works just like fget(), except it takes an
    argument for how many references to get on the file. Ditto fput_many(),
    which can drop an arbitrary number of references to a file.
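
    The batching idea can be sketched in userspace with C11 atomics. The
    names below (struct obj, obj_get_many, obj_put_many) are hypothetical
    stand-ins, not the kernel helpers themselves; the point is only that N
    references cost one atomic operation instead of N.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as struct file. */
struct obj {
	atomic_long count;
};

/*
 * Take `refs` references with a single successful atomic update.
 * Fails if the count has already reached zero, so a dead object is
 * never resurrected.
 */
static bool obj_get_many(struct obj *o, long refs)
{
	long old = atomic_load(&o->count);

	assert(refs > 0);
	do {
		if (old == 0)
			return false;
		/* on failure, `old` is reloaded with the current count */
	} while (!atomic_compare_exchange_weak(&o->count, &old, old + refs));
	return true;
}

/*
 * Drop `refs` references with a single atomic subtraction.
 * Returns true when this call released the last reference.
 */
static bool obj_put_many(struct obj *o, long refs)
{
	assert(refs > 0);
	return atomic_fetch_sub(&o->count, refs) == refs;
}
```

    A caller that expects to handle, say, eight requests against the same
    object would call obj_get_many(o, 8) once up front and obj_put_many()
    with however many references remain unused, rather than touching the
    shared counter once per request.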

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Dec, 2018

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of char and misc driver patches for 4.21-rc1.

    Lots of different types of driver things in here, as this tree seems
    to be the "collection of various driver subsystems not big enough to
    have their own git tree" lately.

    Anyway, some highlights of the changes in here:

    - binderfs: is it a rule that all driver subsystems will eventually
    grow to have their own filesystem? Binder now has one to handle the
    use of it in containerized systems.

    This was discussed at the Plumbers conference a few months ago and
    knocked into mergeable shape very fast by Christian Brauner. Who
    also has signed up to be another binder maintainer, showing a
    distinct lack of good judgement :)

    - binder updates and fixes

    - mei driver updates

    - fpga driver updates and additions

    - thunderbolt driver updates

    - soundwire driver updates

    - extcon driver updates

    - nvmem driver updates

    - hyper-v driver updates

    - coresight driver updates

    - pvpanic driver additions and reworking for more device support

    - lp driver updates. Yes really, it's _finally_ moved to the proper
    parallel port driver model, something I never thought I would see
    happen. Good stuff.

    - other tiny driver updates and fixes.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (116 commits)
    MAINTAINERS: add another Android binder maintainer
    intel_th: msu: Fix an off-by-one in attribute store
    stm class: Add a reference to the SyS-T document
    stm class: Fix a module refcount leak in policy creation error path
    char: lp: use new parport device model
    char: lp: properly count the lp devices
    char: lp: use first unused lp number while registering
    char: lp: detach the device when parallel port is removed
    char: lp: introduce list to save port number
    bus: qcom: remove duplicated include from qcom-ebi2.c
    VMCI: Use memdup_user() rather than duplicating its implementation
    char/rtc: Use of_node_name_eq for node name comparisons
    misc: mic: fix a DMA pool free failure
    ptp: fix an IS_ERR() vs NULL check
    genwqe: Fix size check
    binder: implement binderfs
    binder: fix use-after-free due to ksys_close() during fdget()
    bus: fsl-mc: remove duplicated include files
    bus: fsl-mc: explicitly define the fsl_mc_command endianness
    misc: ti-st: make array read_ver_cmd static, shrinks object size
    ...

    Linus Torvalds
     

19 Dec, 2018

1 commit

  • 44d8047f1d8 ("binder: use standard functions to allocate fds")
    exposed a pre-existing issue in the binder driver.

    fdget() is used in ksys_ioctl() as a performance optimization.
    One of the rules associated with fdget() is that ksys_close() must
    not be called between the fdget() and the fdput(). There is a case
    where this requirement is not met in the binder driver which results
    in the reference count dropping to 0 when the device is still in
    use. This can result in use-after-free or other issues.

    If userspace has passed a file-descriptor for the binder driver using
    a BINDER_TYPE_FDA object, then ksys_close() is called on it when
    handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
    the assumptions for using fdget().

    The problem is fixed by deferring the close using task_work_add(). A
    new variant of __close_fd() was created that returns a struct file
    with a reference. The fput() is deferred instead of using ksys_close().
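
    The hazard and the fix can be illustrated with a small single-threaded
    model. Everything here is hypothetical (model_file, model_fdget,
    model_close_now, model_close_deferred, model_run_deferred); it only
    mimics the rule that a borrowed file must not lose the table's
    reference while the borrow is live, and is not the binder or VFS code.

```c
#include <assert.h>
#include <stddef.h>

#define NFDS 4

/* Hypothetical stand-in for struct file: a bare reference count. */
struct model_file {
	int refs;
	int released;   /* set once the last reference is dropped */
};

static struct model_file *fd_table[NFDS];   /* each entry owns one reference */
static struct model_file *deferred[NFDS];   /* puts queued for later */
static int ndeferred;

static void model_fput(struct model_file *f)
{
	assert(f->refs > 0);
	if (--f->refs == 0)
		f->released = 1;
}

/* fdget()-style borrow: relies on the table's reference, takes none itself. */
static struct model_file *model_fdget(int fd)
{
	return fd_table[fd];
}

/* Unsafe inside a borrow: the table's reference is dropped immediately. */
static void model_close_now(int fd)
{
	struct model_file *f = fd_table[fd];

	fd_table[fd] = NULL;
	if (f)
		model_fput(f);
}

/* Safe inside a borrow: detach the fd now, queue the final put for later. */
static void model_close_deferred(int fd)
{
	struct model_file *f = fd_table[fd];

	fd_table[fd] = NULL;
	if (f) {
		assert(ndeferred < NFDS);
		deferred[ndeferred++] = f;
	}
}

/* Stand-in for the deferred work that runs after the borrow has ended. */
static void model_run_deferred(void)
{
	while (ndeferred > 0)
		model_fput(deferred[--ndeferred]);
}
```

    With model_close_now() the borrowed file is released while its borrower
    still holds the pointer, which is the use-after-free pattern described
    above. With model_close_deferred() the descriptor disappears from the
    table at once, but the file stays alive until model_run_deferred() is
    called.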

    Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
    Suggested-by: Al Viro
    Signed-off-by: Todd Kjos
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman

    Todd Kjos
     

28 Nov, 2018

1 commit


03 Apr, 2018

2 commits

  • Using the ksys_close() wrapper allows us to get rid of in-kernel calls
    to the sys_close() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it
    uses the same calling convention as sys_close(), with one subtle
    difference:

    The few places which checked the return value did not care about the return
    value re-writing in sys_close(), so simply use a wrapper around
    __close_fd().
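
    As an illustration of the difference, the sketch below models the
    return value rewriting: the syscall entry point maps the kernel-internal
    restart codes to -EINTR, since a close cannot be restarted once the
    file table entry has been cleared, whereas a plain wrapper passes the
    value through. The function names and MODEL_ constants are
    hypothetical, and the numeric values are assumed from
    include/linux/errno.h.

```c
#include <assert.h>
#include <errno.h>

/*
 * Kernel-internal restart codes. They are not part of userspace <errno.h>,
 * so they are redefined here under MODEL_ names (values assumed from
 * include/linux/errno.h).
 */
#define MODEL_ERESTARTSYS           512
#define MODEL_ERESTARTNOINTR        513
#define MODEL_ERESTARTNOHAND        514
#define MODEL_ERESTART_RESTARTBLOCK 516

/* Syscall-style result: restart codes are rewritten to -EINTR. */
static int model_sys_close_result(int retval)
{
	switch (retval) {
	case -MODEL_ERESTARTSYS:
	case -MODEL_ERESTARTNOINTR:
	case -MODEL_ERESTARTNOHAND:
	case -MODEL_ERESTART_RESTARTBLOCK:
		return -EINTR;
	default:
		return retval;
	}
}

/* Wrapper-style result: whatever the close helper returned, unchanged. */
static int model_ksys_close_result(int retval)
{
	return retval;
}
```

    Callers that only test for success or failure see the same outcome from
    both variants, which is why the rewriting could be dropped for the
    in-kernel callers.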

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using ksys_dup() and ksys_dup3() as helper functions allows us to
    avoid the in-kernel calls to the sys_dup() and sys_dup3() syscalls.
    The ksys_ prefix denotes that these functions are meant as a drop-in
    replacement for the syscalls. In particular, they use the same
    calling convention as sys_dup{,3}().

    In the near future, the fs-external callers of ksys_dup{,3}() should be
    converted to call do_dup2() directly. Then, ksys_dup{,3}() can be moved
    within sys_dup{,3}() again.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Alexander Viro
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

01 Feb, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of misc stuff, without any unifying topic, from various
    people.

    Neil's d_anon patch, several bugfixes, introduction of kvmalloc
    analogue of kmemdup_user(), extending bitfield.h to deal with
    fixed-endians, assorted cleanups all over the place..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
    alpha: osf_sys.c: use timespec64 where appropriate
    alpha: osf_sys.c: fix put_tv32 regression
    jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
    dcache: delete unused d_hash_mask
    dcache: subtract d_hash_shift from 32 in advance
    fs/buffer.c: fold init_buffer() into init_page_buffers()
    fs: fold __inode_permission() into inode_permission()
    fs: add RWF_APPEND
    sctp: use vmemdup_user() rather than badly open-coding memdup_user()
    snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
    replace_user_tlv(): switch to vmemdup_user()
    new primitive: vmemdup_user()
    memdup_user(): switch to GFP_USER
    eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
    eventfd: fold eventfd_ctx_read() into eventfd_read()
    eventfd: convert to use anon_inode_getfd()
    nfs4file: get rid of pointless include of btrfs.h
    uvc_v4l2: clean copyin/copyout up
    vme_user: don't use __copy_..._user()
    usx2y: don't bother with memdup_user() for 16-byte structure
    ...

    Linus Torvalds
     

05 Dec, 2017

2 commits


18 Nov, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, really no common topic here"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: grab the lock instead of blocking in __fd_install during resizing
    vfs: stop clearing close on exec when closing a fd
    include/linux/fs.h: fix comment about struct address_space
    fs: make fiemap work from compat_ioctl
    coda: fix 'kernel memory exposure attempt' in fsync
    pstore: remove unneeded unlikely()
    vfs: remove unneeded unlikely()
    stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
    make vfs_ustat() static
    do_handle_open() should be static
    elf_fdpic: fix unused variable warning
    fold destroy_super() into __put_super()
    new helper: destroy_unused_super()
    fix address space warnings in ipc/
    acct.h: get rid of detritus

    Linus Torvalds
     

06 Nov, 2017

2 commits


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Jul, 2017

1 commit


09 May, 2017

1 commit

  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many people are not aware that they really want
    to give __GFP_HIGHMEM along with other flags because there is really no
    reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
    which are mapped to the kernel vmalloc space. About half of users don't
    use this flag, though. This signals that we make the API unnecessarily
    complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Mar, 2017

1 commit


28 Sep, 2016

1 commit

  • Propagate unsignedness for grand total of 149 bytes:

    $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
    add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
    function                    old     new   delta
    set_close_on_exec            99      98      -1
    put_files_struct            201     200      -1
    get_close_on_exec            59      58      -1
    do_prlimit                  498     497      -1
    do_execveat_common.isra    1662    1661      -1
    __close_fd                  178     173      -5
    do_dup2                     219     204     -15
    seq_show                    685     660     -25
    __alloc_fd                  384     357     -27
    dup_fd                      718     646     -72

    It mostly comes from converting "unsigned int" to "long" for bit operations.
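
    A rough userspace illustration of the kind of code involved follows.
    It is a sketch with hypothetical helper names (word_of, bit_of,
    set_fd_bit, test_fd_bit), not the kernel bitmap API. The general point
    is that an unsigned index can be widened to a long by zero extension
    and divided by a power of two with a plain shift, while a signed index
    may need an explicit sign extension and extra fixups for negative
    values.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(long) * CHAR_BIT))

/* Word index of bit `fd` in a bitmap of unsigned longs. */
static unsigned int word_of(unsigned int fd)
{
	return fd / BITS_PER_LONG;      /* unsigned: a shift is enough */
}

/* Bit position of `fd` inside its word. */
static unsigned int bit_of(unsigned int fd)
{
	return fd % BITS_PER_LONG;      /* unsigned: a mask is enough */
}

/* Mark descriptor `fd` in the bitmap. */
static void set_fd_bit(unsigned long *map, unsigned int fd)
{
	map[word_of(fd)] |= 1UL << bit_of(fd);
}

/* Return 1 if descriptor `fd` is marked, 0 otherwise. */
static int test_fd_bit(const unsigned long *map, unsigned int fd)
{
	return (int)((map[word_of(fd)] >> bit_of(fd)) & 1UL);
}
```

    Whether a given compiler and architecture actually emit fewer
    instructions for the unsigned variant has to be checked on the
    generated code, as the bloat-o-meter run above does for the kernel.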

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Al Viro

    Alexey Dobriyan
     

03 May, 2016

1 commit


15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Dec, 2015

1 commit