08 Apr, 2022

2 commits

  • [ Upstream commit d888c83fcec75194a8a48ccd283953bdba7b2550 ]

    Jason Donenfeld reports that my commit 1c24a186398f ("fs: fd tables have
    to be multiples of BITS_PER_LONG") doesn't work, and the reason is an
    embarrassing brown-paper-bag bug.

    Yes, we want to align the number of fds to BITS_PER_LONG, and yes, the
    reason they might not be aligned is because the incoming 'max_fd'
    argument might not be aligned.

    But aligining the argument - while simple - will cause a "infinitely
    big" maxfd (eg NR_OPEN_MAX) to just overflow to zero. Which most
    definitely isn't what we want either.

    The obvious fix was always just to do the alignment last, but I had
    moved it earlier just to make the patch smaller and the code look
    simpler. Duh. It certainly made _me_ look simple.

    Fixes: 1c24a186398f ("fs: fd tables have to be multiples of BITS_PER_LONG")
    Reported-and-tested-by: Jason A. Donenfeld
    Cc: Fedor Pchelkin
    Cc: Alexey Khoroshilov
    Cc: Christian Brauner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
     
  • [ Upstream commit 1c24a186398f59c80adb9a967486b65c1423a59d ]

    This has always been the rule: fdtables have several bitmaps in them,
    and as a result they have to be sized properly for bitmaps. We walk
    those bitmaps in chunks of 'unsigned long' in serveral cases, but even
    when we don't, we use the regular kernel bitops that are defined to work
    on arrays of 'unsigned long', not on some byte array.

    Now, the distinction between arrays of bytes and 'unsigned long'
    normally only really ends up being noticeable on big-endian systems, but
    Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps()
    could be called with an argument that wasn't even a multiple of
    BITS_PER_BYTE. And then it fails to do the proper copy even on
    little-endian machines.

    The bug wasn't in copy_fd_bitmap(), but in sane_fdtable_size(), which
    didn't actually sanitize the fdtable size sufficiently, and never made
    sure it had the proper BITS_PER_LONG alignment.

    That's partly because the alignment historically came not from having to
    explicitly align things, but simply from previous fdtable sizes, and
    from count_open_files(), which counts the file descriptors by walking
    them one 'unsigned long' word at a time and thus naturally ends up doing
    sizing in the proper 'chunks of unsigned long'.

    But with the introduction of close_range(), we now have an external
    source of "this is how many files we want to have", and so
    sane_fdtable_size() needs to do a better job.

    This also adds that explicit alignment to alloc_fdtable(), although
    there it is mainly just for documentation at a source code level. The
    arithmetic we do there to pick a reasonable fdtable size already aligns
    the result sufficiently.

    In fact,clang notices that the added ALIGN() in that function doesn't
    actually do anything, and does not generate any extra code for it.

    It turns out that gcc ends up confusing itself by combining a previous
    constant-sized shift operation with the variable-sized shift operations
    in roundup_pow_of_two(). And probably due to that doesn't notice that
    the ALIGN() is a no-op. But that's a (tiny) gcc misfeature that doesn't
    matter. Having the explicit alignment makes sense, and would actually
    matter on a 128-bit architecture if we ever go there.

    This also adds big comments above both functions about how fdtable sizes
    have to have that BITS_PER_LONG alignment.

    Fixes: 60997c3d45d9 ("close_range: add CLOSE_RANGE_UNSHARE")
    Reported-by: Fedor Pchelkin
    Reported-by: Alexey Khoroshilov
    Link: https://lore.kernel.org/all/20220326114009.1690-1-aissur0002@gmail.com/
    Tested-and-acked-by: Christian Brauner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
     

16 Jan, 2022

1 commit

  • commit e386dfc56f837da66d00a078e5314bc8382fab83 upstream.

    Commit 054aa8d439b9 ("fget: check that the fd still exists after getting
    a ref to it") fixed a race with getting a reference to a file just as it
    was being closed. It was a fairly minimal patch, and I didn't think
    re-checking the file pointer lookup would be a measurable overhead,
    since it was all right there and cached.

    But I was wrong, as pointed out by the kernel test robot.

    The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed
    quite noticeably. Admittedly it seems to be a very artificial test:
    doing "poll()" system calls on regular files in a very tight loop in
    multiple threads.

    That means that basically all the time is spent just looking up file
    descriptors without ever doing anything useful with them (not that doing
    'poll()' on a regular file is useful to begin with). And as a result it
    shows the extra "re-check fd" cost as a sore thumb.

    Happily, the regression is fixable by just writing the code to loook up
    the fd to be better and clearer. There's still a cost to verify the
    file pointer, but now it's basically in the noise even for that
    benchmark that does nothing else - and the code is more understandable
    and has better comments too.

    [ Side note: this patch is also a classic case of one that looks very
    messy with the default greedy Myers diff - it's much more legible with
    either the patience of histogram diff algorithm ]

    Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/
    Link: https://lore.kernel.org/lkml/20211213083154.GA20853@linux.intel.com/
    Reported-by: kernel test robot
    Tested-by: Carel Si
    Cc: Jann Horn
    Cc: Miklos Szeredi
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

08 Dec, 2021

1 commit

  • commit 054aa8d439b9185d4f5eb9a90282d1ce74772969 upstream.

    Jann Horn points out that there is another possible race wrt Unix domain
    socket garbage collection, somewhat reminiscent of the one fixed in
    commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").

    See the extended comment about the garbage collection requirements added
    to unix_peek_fds() by that commit for details.

    The race comes from how we can locklessly look up a file descriptor just
    as it is in the process of being closed, and with the right artificial
    timing (Jann added a few strategic 'mdelay(500)' calls to do that), the
    Unix domain socket garbage collector could see the reference count
    decrement of the close() happen before fget() took its reference to the
    file and the file was attached onto a new file descriptor.

    This is all (intentionally) correct on the 'struct file *' side, with
    RCU lookups and lockless reference counting very much part of the
    design. Getting that reference count out of order isn't a problem per
    se.

    But the garbage collector can get confused by seeing this situation of
    having seen a file not having any remaining external references and then
    seeing it being attached to an fd.

    In commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK") the
    fix was to serialize the file descriptor install with the garbage
    collector by taking and releasing the unix_gc_lock.

    That's not really an option here, but since this all happens when we are
    in the process of looking up a file descriptor, we can instead simply
    just re-check that the file hasn't been closed in the meantime, and just
    re-do the lookup if we raced with a concurrent close() of the same file
    descriptor.

    Reported-and-tested-by: Jann Horn
    Acked-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

12 Sep, 2021

1 commit

  • Pull virtio updates from Michael Tsirkin:

    - vduse driver ("vDPA Device in Userspace") supporting emulated virtio
    block devices

    - virtio-vsock support for end of record with SEQPACKET

    - vdpa: mac and mq support for ifcvf and mlx5

    - vdpa: management netlink for ifcvf

    - virtio-i2c, gpio dt bindings

    - misc fixes and cleanups

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (39 commits)
    Documentation: Add documentation for VDUSE
    vduse: Introduce VDUSE - vDPA Device in Userspace
    vduse: Implement an MMU-based software IOTLB
    vdpa: Support transferring virtual addressing during DMA mapping
    vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
    vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
    vhost-iotlb: Add an opaque pointer for vhost IOTLB
    vhost-vdpa: Handle the failure of vdpa_reset()
    vdpa: Add reset callback in vdpa_config_ops
    vdpa: Fix some coding style issues
    file: Export receive_fd() to modules
    eventfd: Export eventfd_wake_count to modules
    iova: Export alloc_iova_fast() and free_iova_fast()
    virtio-blk: remove unneeded "likely" statements
    virtio-balloon: Use virtio_find_vqs() helper
    vdpa: Make use of PFN_PHYS/PFN_UP/PFN_DOWN helper macro
    vsock_test: update message bounds test for MSG_EOR
    af_vsock: rename variables in receive loop
    virtio/vsock: support MSG_EOR bit processing
    vhost/vsock: support MSG_EOR bit processing
    ...

    Linus Torvalds
     

06 Sep, 2021

1 commit

  • Export receive_fd() so that some modules can use
    it to pass file descriptor between processes without
    missing any security stuffs.

    Signed-off-by: Xie Yongji
    Acked-by: Jason Wang
    Link: https://lore.kernel.org/r/20210831103634.33-4-xieyongji@bytedance.com
    Signed-off-by: Michael S. Tsirkin

    Xie Yongji
     

01 Sep, 2021

1 commit

  • Pull close_range() cleanup from Christian Brauner:
    "This is a cleanup for close_range() which was sent as part of a bugfix
    we did some time ago in commit 9b5b872215fe ("file: fix close_range()
    for unshare+cloexec").

    We used to share more code between some helpers for close_range()
    which made retrieving the maximum number of open fds before calling
    into the helpers sensible. But with the introduction of
    CLOSE_RANGE_CLOEXEC and the need to retrieve the number of maximum fds
    once more for CLOSE_RANGE_CLOEXEC that stopped making sense. So the
    code was in a dumb in-limbo state.

    Fix this by simplifying the code a bit.

    The original idea was to only fix the bug itself and make backporting
    easy. And since the cleanup wasn't very pressing I left it in
    linux-next for a very long time. I didn't pull the patches from the
    list again back then which is why they don't have lore-links. So I'm
    listing them below explicitly"

    Commit 03ba0fe4d09f ("file: simplify logic in __close_range()")
    Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-3-brauner@kernel.org

    Commit f49fd6d3c070 ("file: let pick_file() tell caller it's done")
    Link: https://lore.kernel.org/linux-fsdevel/20210402123548.108372-4-brauner@kernel.org

    * tag 'fs.close_range.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    file: simplify logic in __close_range()
    file: let pick_file() tell caller it's done

    Linus Torvalds
     

04 May, 2021

1 commit


16 Apr, 2021

1 commit

  • receive_fd_replace shares almost no code with the general case, so split
    it out. Also remove the "Bump the sock usage counts" comment from
    both copies, as that is now what __receive_sock actually does.

    [AV: ... and make the only user of receive_fd_replace() choose between
    it and receive_fd() according to what userland had passed to it in
    flags]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

02 Apr, 2021

3 commits

  • It never looked too pleasant and it doesn't really buy us anything
    anymore now that CLOSE_RANGE_CLOEXEC exists and we need to retake the
    current maximum under the lock for it anyway. This also makes the logic
    easier to follow.

    Cc: Christoph Hellwig
    Cc: Giuseppe Scrivano
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • syzbot reported a bug when putting the last reference to a tasks file
    descriptor table. Debugging this showed we didn't recalculate the
    current maximum fd number for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
    after we unshared the file descriptors table. So max_fd could exceed the
    current fdtable maximum causing us to set excessive bits. As a concrete
    example, let's say the user requested everything from fd 4 to ~0UL to be
    closed and their current fdtable size is 256 with their highest open fd
    being 4. With CLOSE_RANGE_UNSHARE the caller will end up with a new
    fdtable which has room for 64 file descriptors since that is the lowest
    fdtable size we accept. But now max_fd will still point to 255 and needs
    to be adjusted. Fix this by retrieving the correct maximum fd value in
    __range_cloexec().

    Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
    Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
    Fixes: fec8a6a69103 ("close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC")
    Cc: Christoph Hellwig
    Cc: Giuseppe Scrivano
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Let pick_file() report back that the fd it was passed exceeded the
    maximum fd in that fdtable. This allows us to simplify the caller of
    this helper because it doesn't need to care anymore whether the passed
    in max_fd is excessive. It can rely on pick_file() telling it that it's
    past the last valid fd.

    Cc: Christoph Hellwig
    Cc: Giuseppe Scrivano
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Christian Brauner

    Christian Brauner
     

02 Feb, 2021

1 commit


31 Dec, 2020

1 commit

  • For cancelling io_uring requests it needs either to be able to run
    currently enqueued task_works or having it shut down by that moment.
    Otherwise io_uring_cancel_files() may be waiting for requests that won't
    ever complete.

    Go with the first way and do cancellations before setting PF_EXITING and
    so before putting the task_work infrastructure into a transition state
    where task_work_run() would better not be called.

    Cc: stable@vger.kernel.org # 5.5+
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

19 Dec, 2020

1 commit

  • After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when
    CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE.
    When CLOSE_RANGE_UNSHARE is specified the caller will receive a private
    file descriptor table in case their file descriptor table is currently
    shared.

    For the case where the caller has requested all file descriptors to be
    actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that
    the caller does not need any of the file descriptors anymore and will
    optimize the close operation by only copying all files in the range from
    0 to 3 and no others.

    However, if the caller requested CLOSE_RANGE_CLOEXEC together with
    CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
    descriptors so the kernel needs to copy all of them and can't optimize.

    The original patch didn't account for this and thus could cause oopses
    as evidenced by the syzbot report because it assumed that all fds had
    been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.

    syzbot reported
    ==================================================================
    BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
    BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
    BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
    BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
    Read of size 8 at addr 0000000000000077 by task syz-executor511/8522

    CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:79 [inline]
    dump_stack+0x107/0x163 lib/dump_stack.c:120
    __kasan_report mm/kasan/report.c:549 [inline]
    kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
    check_memory_region_inline mm/kasan/generic.c:186 [inline]
    check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
    instrument_atomic_read include/linux/instrumented.h:71 [inline]
    atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
    atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
    filp_close+0x22/0x170 fs/open.c:1274
    close_files fs/file.c:402 [inline]
    put_files_struct fs/file.c:417 [inline]
    put_files_struct+0x1cc/0x350 fs/file.c:414
    exit_files+0x12a/0x170 fs/file.c:435
    do_exit+0xb4f/0x2a00 kernel/exit.c:818
    do_group_exit+0x125/0x310 kernel/exit.c:920
    get_signal+0x428/0x2100 kernel/signal.c:2792
    arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
    handle_signal_work kernel/entry/common.c:147 [inline]
    exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
    exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
    __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
    syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x447039
    Code: Unable to access opcode bytes at RIP 0x44700f.
    RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
    RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
    RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
    RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
    R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
    ==================================================================

    syzbot has tested the proposed patch and the reproducer did not trigger any issue:

    Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com

    Tested on:

    commit: 10f7cddd selftests/core: add regression test for CLOSE_RAN..
    git tree: git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
    kernel config: https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
    dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
    compiler: gcc (GCC) 10.1.0-syz 20200507

    Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
    Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
    Cc: Giuseppe Scrivano
    Cc: linux-fsdevel@vger.kernel.org
    Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com
    Signed-off-by: Christian Brauner

    Christian Brauner
     

16 Dec, 2020

1 commit

  • …biederm/user-namespace

    Pull execve updates from Eric Biederman:
    "This set of changes ultimately fixes the interaction of posix file
    lock and exec. Fundamentally most of the change is just moving where
    unshare_files is called during exec, and tweaking the users of
    files_struct so that the count of files_struct is not unnecessarily
    played with.

    Along the way fcheck and related helpers were renamed to more
    accurately reflect what they do.

    There were also many other small changes that fell out, as this is the
    first time in a long time much of this code has been touched.

    Benchmarks haven't turned up any practical issues but Al Viro has
    observed a possibility for a lot of pounding on task_lock. So I have
    some changes in progress to convert put_files_struct to always rcu
    free files_struct. That wasn't ready for the merge window so that will
    have to wait until next time"

    * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    exec: Move io_uring_task_cancel after the point of no return
    coredump: Document coredump code exclusively used by cell spufs
    file: Remove get_files_struct
    file: Rename __close_fd_get_file close_fd_get_file
    file: Replace ksys_close with close_fd
    file: Rename __close_fd to close_fd and remove the files parameter
    file: Merge __alloc_fd into alloc_fd
    file: In f_dupfd read RLIMIT_NOFILE once.
    file: Merge __fd_install into fd_install
    proc/fd: In fdinfo seq_show don't use get_files_struct
    bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
    proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
    file: Implement task_lookup_next_fd_rcu
    kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
    proc/fd: In tid_fd_mode use task_lookup_fd_rcu
    file: Implement task_lookup_fd_rcu
    file: Rename fcheck lookup_fd_rcu
    file: Replace fcheck_files with files_lookup_fd_rcu
    file: Factor files_lookup_fd_locked out of fcheck_files
    file: Rename __fcheck_files to files_lookup_fd_raw
    ...

    Linus Torvalds
     

11 Dec, 2020

12 commits

  • When discussing[1] exec and posix file locks it was realized that none
    of the callers of get_files_struct fundamentally needed to call
    get_files_struct, and that by switching them to helper functions
    instead it will both simplify their code and remove unnecessary
    increments of files_struct.count. Those unnecessary increments can
    result in exec unnecessarily unsharing files_struct which breaking
    posix locks, and it can result in fget_light having to fallback to
    fget reducing system performance.

    Now that get_files_struct has no more users and can not cause the
    problems for posix file locking and fget_light remove get_files_struct
    so that it does not gain any new users.

    [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
    Suggested-by: Oleg Nesterov
    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-13-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-24-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The function close_fd_get_file is explicitly a variant of
    __close_fd[1]. Now that __close_fd has been renamed close_fd, rename
    close_fd_get_file to be consistent with close_fd.

    When __alloc_fd, __close_fd and __fd_install were introduced the
    double underscore indicated that the function took a struct
    files_struct parameter. The function __close_fd_get_file never has so
    the naming has always been inconsistent. This just cleans things up
    so there are not any lingering mentions or references __close_fd left
    in the code.

    [1] 80cd795630d6 ("binder: fix use-after-free due to ksys_close() during fdget()")
    Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The function __close_fd was added to support binder[1]. Now that
    binder has been fixed to no longer need __close_fd[2] all calls
    to __close_fd pass current->files.

    Therefore transform the files parameter into a local variable
    initialized to current->files, and rename __close_fd to close_fd to
    reflect this change, and keep it in sync with the similar changes to
    __alloc_fd, and __fd_install.

    This removes the need for callers to care about the extra care that
    needs to be take if anything except current->files is passed, by
    limiting the callers to only operation on current->files.

    [1] 483ce1d4b8c3 ("take descriptor-related part of close() to file.c")
    [2] 44d8047f1d87 ("binder: use standard functions to allocate fds")
    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The function __alloc_fd was added to support binder[1]. With binder
    fixed[2] there are no more users.

    As alloc_fd just calls __alloc_fd with "files=current->files",
    merge them together by transforming the files parameter into a
    local variable initialized to current->files.

    [1] dcfadfa4ec5a ("new helper: __alloc_fd()")
    [2] 44d8047f1d87 ("binder: use standard functions to allocate fds")
    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-16-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-20-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Simplify the code, and remove the chance of races by reading
    RLIMIT_NOFILE only once in f_dupfd.

    Pass the read value of RLIMIT_NOFILE into alloc_fd which is the other
    location the rlimit was read in f_dupfd. As f_dupfd is the only
    caller of alloc_fd this changing alloc_fd is trivially safe.

    Further this causes alloc_fd to take all of the same arguments as
    __alloc_fd except for the files_struct argument.

    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-15-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-19-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The function __fd_install was added to support binder[1]. With binder
    fixed[2] there are no more users.

    As fd_install just calls __fd_install with "files=current->files",
    merge them together by transforming the files parameter into a
    local variable initialized to current->files.

    [1] f869e8a7f753 ("expose a low-level variant of fd_install() for binder")
    [2] 44d8047f1d87 ("binder: use standard functions to allocate fds")
    Acked-by: Christian Brauner
    v1:https://lkml.kernel.org/r/20200817220425.9389-14-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-18-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • As a companion to fget_task and task_lookup_fd_rcu implement
    task_lookup_next_fd_rcu that will return the struct file for the first
    file descriptor number that is equal or greater than the fd argument
    value, or NULL if there is no such struct file.

    This allows file descriptors of foreign processes to be iterated
    through safely, without needed to increment the count on files_struct.

    Some concern[1] has been expressed that this function takes the task_lock
    for each iteration and thus for each file descriptor. This place
    where this function will be called in a commonly used code path is for
    listing /proc//fd. I did some small benchmarks and did not see
    any measurable performance differences. For ordinary users ls is
    likely to stat each of the directory entries and tid_fd_mode called
    from tid_fd_revalidae has always taken the task lock for each file
    descriptor. So this does not look like it will be a big change in
    practice.

    At some point is will probably be worth changing put_files_struct to
    free files_struct after an rcu grace period so that task_lock won't be
    needed at all.

    [1] https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com
    v1: https://lkml.kernel.org/r/20200817220425.9389-9-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-14-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for
    querying an arbitrary process about a specific file.

    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein
    Link: https://lkml.kernel.org/r/20201120231441.29911-11-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • This change renames fcheck_files to files_lookup_fd_rcu. All of the
    remaining callers take the rcu_read_lock before calling this function
    so the _rcu suffix is appropriate. This change also tightens up the
    debug check to verify that all callers hold the rcu_read_lock.

    All callers that used to call files_check with the files->file_lock
    held have now been changed to call files_lookup_fd_locked.

    This change of name has helped remind me of which locks and which
    guarantees are in place helping me to catch bugs later in the
    patchset.

    The need for better names became apparent in the last round of
    discussion of this set of changes[1].

    [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • To make it easy to tell where files->file_lock protection is being
    used when looking up a file create files_lookup_fd_locked. Only allow
    this function to be called with the file_lock held.

    Update the callers of fcheck and fcheck_files that are called with the
    files->file_lock held to call files_lookup_fd_locked instead.

    Hopefully this makes it easier to quickly understand what is going on.

    The need for better names became apparent in the last round of
    discussion of this set of changes[1].

    [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-8-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The function fcheck despite it's comment is poorly named
    as it has no callers that only check it's return value.
    All of fcheck's callers use the returned file descriptor.
    The same is true for fcheck_files and __fcheck_files.

    A new less confusing name is needed. In addition the names
    of these functions are confusing as they do not report
    the kind of locks that are needed to be held when these
    functions are called making error prone to use them.

    To remedy this I am making the base functio name lookup_fd
    and will and prefixes and sufficies to indicate the rest
    of the context.

    Name the function (previously called __fcheck_files) that proceeds
    from a struct files_struct, looks up the struct file of a file
    descriptor, and requires it's callers to verify all of the appropriate
    locks are held files_lookup_fd_raw.

    The need for better names became apparent in the last round of
    discussion of this set of changes[1].

    [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-7-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Now that exec no longer needs to restore the previous value of current->files
    on error there are no more callers of reset_files_struct so remove it.

    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-3-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-3-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

04 Dec, 2020

1 commit

  • When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
    immediately close the files but it sets the close-on-exec bit.

    It is useful for e.g. container runtimes that usually install a
    seccomp profile "as late as possible" before execv'ing the container
    process itself. The container runtime could either do:
    1 2
    - install_seccomp_profile(); - close_range(MIN_FD, MAX_INT, 0);
    - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
    - execve(...); - execve(...);

    Both alternative have some disadvantages.

    In the first variant the seccomp_profile cannot block the close_range
    syscall, as well as opendir/read/close/... for the fallback on older
    kernels.
    In the second variant, close_range() can be used only on the fds
    that are not going to be needed by the runtime anymore, and it must be
    potentially called multiple times to account for the different ranges
    that must be closed.

    Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
    The runtime is able to use the existing open fds, the seccomp profile
    can block close_range() and the syscalls used for its fallback.

    Signed-off-by: Giuseppe Scrivano
    Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com
    Signed-off-by: Christian Brauner

    Giuseppe Scrivano
     

01 Oct, 2020

1 commit

  • Grab actual references to the files_struct. To avoid circular references
    issues due to this, we add a per-task note that keeps track of what
    io_uring contexts a task has used. When the tasks execs or exits its
    assigned files, we cancel requests based on this tracking.

    With that, we can grab proper references to the files table, and no
    longer need to rely on stashing away ring_fd and ring_file to check
    if the ring_fd may have been closed.

    Cc: stable@vger.kernel.org # v5.5+
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Aug, 2020

1 commit

  • Pull init and set_fs() cleanups from Al Viro:
    "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

    * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
    init: add an init_dup helper
    init: add an init_utimes helper
    init: add an init_stat helper
    init: add an init_mknod helper
    init: add an init_mkdir helper
    init: add an init_symlink helper
    init: add an init_link helper
    init: add an init_eaccess helper
    init: add an init_chmod helper
    init: add an init_chown helper
    init: add an init_chroot helper
    init: add an init_chdir helper
    init: add an init_rmdir helper
    init: add an init_unlink helper
    init: add an init_umount helper
    init: add an init_mount helper
    init: mark create_dev as __init
    init: mark console_on_rootfs as __init
    init: initialize ramdisk_execute_command at compile time
    devtmpfs: refactor devtmpfsd()
    ...

    Linus Torvalds
     

05 Aug, 2020

1 commit

  • Pull close_range() implementation from Christian Brauner:
    "This adds the close_range() syscall. It allows to efficiently close a
    range of file descriptors up to all file descriptors of a calling
    task.

    This is coordinated with the FreeBSD folks which have copied our
    version of this syscall and in the meantime have already merged it in
    April 2019:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

    The syscall originally came up in a discussion around the new mount
    API and making new file descriptor types cloexec by default. During
    this discussion, Al suggested the close_range() syscall.

    First, it helps to close all file descriptors of an exec()ing task.
    This can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that
    file descriptors are not cloexec by default. This is aggravated by the
    fact that we can't just switch them over without massively regressing
    userspace. For a whole class of programs having an in-kernel method of
    closing all file descriptors is very helpful (e.g. demons, service
    managers, programming language standard libraries, container managers
    etc.).

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc//fd/* and calling close() on
    each file descriptor and other hacks. From looking at various
    large(ish) userspace code bases this or similar patterns are very
    common in service managers, container runtimes, and programming
    language runtimes/standard libraries such as Python or Rust.

    In addition, the syscall will also work for tasks that do not have
    procfs mounted and on kernels that do not have procfs support compiled
    in. In such situations the only way to make sure that all file
    descriptors are closed is to call close() on each file descriptor up
    to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.

    Based on Linus' suggestion close_range() also comes with a new flag
    CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
    right before exec. This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part
    of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
    gets especially handy when we're closing all file descriptors above a
    certain threshold.

    Test-suite as always included"

    * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add CLOSE_RANGE_UNSHARE tests
    close_range: add CLOSE_RANGE_UNSHARE
    tests: add close_range() tests
    arch: wire-up close_range()
    open: add close_range()

    Linus Torvalds
     

31 Jul, 2020

1 commit


14 Jul, 2020

3 commits

  • Expand __receive_fd() with support for replace_fd() for the coming seccomp
    "addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
    and update existing wrappers to retain old behavior.

    Thanks to Colin Ian King for pointing out an
    uninitialized variable exposure in an earlier version of this patch.

    Cc: Alexander Viro
    Cc: Dmitry Kadashev
    Cc: Jens Axboe
    Cc: Arnd Bergmann
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook

    Kees Cook
     
  • For both pidfd and seccomp, the __user pointer is not used. Update
    __receive_fd() to make writing to ufd optional via a NULL check. However,
    for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
    can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
    new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
    argument. For the new helper, the allocated fd needs to be returned on
    success. Update the existing callers to handle it.

    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook

    Kees Cook
     
  • In preparation for users of the "install a received file" logic outside
    of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
    net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
    named receive_fd_user(), as future patches will change the interface
    to __receive_fd().

    Additionally add a comment to fd_install() as a counterpoint to how
    __receive_fd() interacts with fput().

    Cc: Alexander Viro
    Cc: "David S. Miller"
    Cc: Jakub Kicinski
    Cc: Dmitry Kadashev
    Cc: Jens Axboe
    Cc: Arnd Bergmann
    Cc: Sargun Dhillon
    Cc: Ido Schimmel
    Cc: Ioana Ciornei
    Cc: linux-fsdevel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Reviewed-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook

    Kees Cook
     

17 Jun, 2020

2 commits

  • One of the use-cases of close_range() is to drop file descriptors just before
    execve(). This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part of
    close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

    This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
    maximum number of file descriptors to copy from the old struct files. When the
    user requests that all file descriptors are supposed to be closed via
    close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
    to do any of the heavy fput() work for everything above min.

    The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
    fact currently share our file descriptor table we create a new private copy.
    We then close all fds in the requested range and finally after we're done we
    install the new fd table.

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • This adds the close_range() syscall. It allows to efficiently close a range
    of file descriptors up to all file descriptors of a calling task.

    I was contacted by FreeBSD as they wanted to have the same close_range()
    syscall as we proposed here. We've coordinated this and in the meantime, Kyle
    was fast enough to merge close_range() into FreeBSD already in April:
    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836
    and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
    once its merged in Linux too. Python is in the process of switching to
    close_range() on FreeBSD and they are waiting on us to merge this to switch on
    Linux as well: https://bugs.python.org/issue38061

    The syscall came up in a recent discussion around the new mount API and
    making new file descriptor types cloexec by default. During this
    discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
    syscall in this manner has been requested by various people over time.

    First, it helps to close all file descriptors of an exec()ing task. This
    can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that file
    descriptors are not cloexec by default. This is aggravated by the fact that
    we can't just switch them over without massively regressing userspace. For
    a whole class of programs having an in-kernel method of closing all file
    descriptors is very helpful (e.g. demons, service managers, programming
    language standard libraries, container managers etc.).
    (Please note, unshare(CLONE_FILES) should only be needed if the calling
    task is multi-threaded and shares the file descriptor table with another
    thread in which case two threads could race with one thread allocating file
    descriptors and the other one closing them via close_range(). For the
    general case close_range() before the execve() is sufficient.)

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc//fd/* and calling close() on each
    file descriptor. From looking at various large(ish) userspace code bases
    this or similar patterns are very common in:
    - service managers (cf. [4])
    - libcs (cf. [6])
    - container runtimes (cf. [5])
    - programming language runtimes/standard libraries
    - Python (cf. [2])
    - Rust (cf. [7], [8])
    As Dmitry pointed out there's even a long-standing glibc bug about missing
    kernel support for this task (cf. [3]).
    In addition, the syscall will also work for tasks that do not have procfs
    mounted and on kernels that do not have procfs support compiled in. In such
    situations the only way to make sure that all file descriptors are closed
    is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
    OPEN_MAX trickery (cf. comment [8] on Rust).

    The performance is striking. For good measure, comparing the following
    simple close_all_fds() userspace implementation that is essentially just
    glibc's version in [6]:

    static int close_all_fds(void)
    {
    int dir_fd;
    DIR *dir;
    struct dirent *direntp;

    dir = opendir("/proc/self/fd");
    if (!dir)
    return -1;
    dir_fd = dirfd(dir);
    while ((direntp = readdir(dir))) {
    int fd;
    if (strcmp(direntp->d_name, ".") == 0)
    continue;
    if (strcmp(direntp->d_name, "..") == 0)
    continue;
    fd = atoi(direntp->d_name);
    if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
    continue;
    close(fd);
    }
    closedir(dir);
    return 0;
    }

    to close_range() yields:
    1. closing 4 open files:
    - close_all_fds(): ~280 us
    - close_range(): ~24 us

    2. closing 1000 open files:
    - close_all_fds(): ~5000 us
    - close_range(): ~800 us

    close_range() is designed to allow for some flexibility. Specifically, it
    does not simply always close all open file descriptors of a task. Instead,
    callers can specify an upper bound.
    This is e.g. useful for scenarios where specific file descriptors are
    created with well-known numbers that are supposed to be excluded from
    getting closed.
    For extra paranoia close_range() comes with a flags argument. This can e.g.
    be used to implement extension. Once can imagine userspace wanting to stop
    at the first error instead of ignoring errors under certain circumstances.
    There might be other valid ideas in the future. In any case, a flag
    argument doesn't hurt and keeps us on the safe side.

    From an implementation side this is kept rather dumb. It saw some input
    from David and Jann but all nonsense is obviously my own!
    - Errors to close file descriptors are currently ignored. (Could be changed
    by setting a flag in the future if needed.)
    - __close_range() is a rather simplistic wrapper around __close_fd().
    My reasoning behind this is based on the nature of how __close_fd() needs
    to release an fd. But maybe I misunderstood specifics:
    We take the files_lock and rcu-dereference the fdtable of the calling
    task, we find the entry in the fdtable, get the file and need to release
    files_lock before calling filp_close().
    In the meantime the fdtable might have been altered so we can't just
    retake the spinlock and keep the old rcu-reference of the fdtable
    around. Instead we need to grab a fresh reference to the fdtable.
    If my reasoning is correct then there's really no point in fancyfying
    __close_range(): We just need to rcu-dereference the fdtable of the
    calling task once to cap the max_fd value correctly and then go on
    calling __close_fd() in a loop.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
    [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
    [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
    [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
    [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
    [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
    Note that this is an internal implementation that is not exported.
    Currently, libc seems to not provide an exported version of this
    because of missing kernel support to do this.
    Note, in a recent patch series Florian made grantpt() a nop thereby
    removing the code referenced here.
    [7]: https://github.com/rust-lang/rust/issues/12148
    [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
    Rust's solution is slightly different but is equally unperformant.
    Rust calls getdtablesize() which is a glibc library function that
    simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
    goes on to call close() on each fd. That's obviously overkill for most
    tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
    OPEN_MAX.
    Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
    to 1024. Even in this case, there's a very high chance that in the
    common case Rust is calling the close() syscall 1021 times pointlessly
    if the task just has 0, 1, and 2 open.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Kyle Evans
    Cc: Jann Horn
    Cc: David Howells
    Cc: Dmitry V. Levin
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Florian Weimer
    Cc: linux-api@vger.kernel.org

    Christian Brauner
     

20 May, 2020

1 commit

  • cpy and set really should be size_t; we won't get an overflow on that,
    since sysctl_nr_open can't be set above ~(size_t)0 / sizeof(void *),
    so nr that would've managed to overflow size_t on that multiplication
    won't get anywhere near copy_fdtable() - we'll fail with EMFILE
    before that.

    Cc: stable@kernel.org # v2.6.25+
    Fixes: 9cfe015aa424 (get rid of NR_OPEN and introduce a sysctl_nr_open)
    Reported-by: Thiago Macieira
    Signed-off-by: Al Viro

    Al Viro
     

20 Mar, 2020

1 commit

  • Dmitry reports that a test case shows that io_uring isn't honoring a
    modified rlimit nofile setting. get_unused_fd_flags() checks the task
    signal->rlimi[] for the limits. As this isn't easily inheritable,
    provide a __get_unused_fd_flags() that takes the value instead. Then we
    can grab it when the request is prepared (from the original task), and
    pass that in when we do the async part part of the open.

    Reported-by: Dmitry Kadashev
    Tested-by: Dmitry Kadashev
    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Jens Axboe