13 Mar, 2020

1 commit

  • several iterations of ->atomic_open() calling conventions ago, we
    used to need fput() if ->atomic_open() failed at some point after
    successful finish_open(). Now (since 2016) it's not needed -
    struct file carries enough state to make fput() work regardless
    of the point in struct file lifecycle and discarding it on
    failure exits in open() got unified. Unfortunately, I'd missed
    the fact that we had an instance of ->atomic_open() (cifs one)
    that used to need that fput(), as well as the stale comment in
    finish_open() demanding such late failure handling. Trivially
    fixed...

    Fixes: fe9ec8291fca "do_last(): take fput() on error after opening to out:"
    Cc: stable@kernel.org # v4.7+
    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2020

1 commit


18 Jan, 2020

1 commit

  • /* Background. */
    For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown flags
    are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road to
    being added to openat(2).

    Userspace also has a hard time figuring out whether a particular flag is
    supported on a particular kernel. While it is now possible with
    contemporary kernels (thanks to [3]), older kernels will expose unknown
    flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
    openat(2) time matches modern syscall designs and is far more
    fool-proof.

    In addition, the newly-added path resolution restriction LOOKUP flags
    (which we would like to expose to user-space) don't feel related to the
    pre-existing O_* flag set -- they affect all components of path lookup.
    We'd therefore like to add a new flag argument.

    Adding a new syscall allows us to finally fix the flag-ignoring problem,
    and we can make it extensible enough so that we will hopefully never
    need an openat3(2).

    /* Syscall Prototype. */
      /*
       * open_how is an extensible structure (similar in interface to
       * clone3(2) or sched_setattr(2)). The size parameter must be set to
       * sizeof(struct open_how), to allow for future extensions. All future
       * extensions will be appended to open_how, with their zero value
       * acting as a no-op default.
       */
      struct open_how { /* ... */ };

      int openat2(int dfd, const char *pathname,
                  struct open_how *how, size_t size);
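
    The size-based extension scheme described in the comment above can be
    modelled in userspace roughly as follows. This is only a sketch: the
    struct how_v0 layout and the copy_extensible_struct() helper are
    hypothetical names, and the rule that unknown trailing bytes must be
    zero (else -E2BIG) is an assumption borrowed from how the kernel's
    copy_struct_from_user() helper behaves.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "version 0" layout as the callee knows it. */
struct how_v0 {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/*
 * Copy a caller-supplied struct of usize bytes into a ksize-byte struct.
 * Returns 0 on success or a negative errno value.
 */
int copy_extensible_struct(void *dst, size_t ksize,
			   const void *src, size_t usize)
{
	size_t n = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + n;
	size_t i;

	/* Newer caller: the part we do not understand must be all zero. */
	for (i = n; i < usize; i++)
		if (tail[i - n] != 0)
			return -E2BIG;

	/* Older caller: the fields it did not supply default to zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, n);
	return 0;
}
```

    A caller built against a larger struct keeps working as long as it
    leaves the fields this callee does not know about at zero, which is
    why zero has to mean "no-op" for every future extension.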

    /* Description. */
    The initial version of 'struct open_how' contains the following fields:

      flags
        Used to specify openat(2)-style flags. However, any unknown flag
        bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
        will result in -EINVAL. In addition, this field is 64-bits wide to
        allow for more O_ flags than currently permitted with openat(2).

      mode
        The file mode for O_CREAT or O_TMPFILE.

        Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

      resolve
        Restrict path resolution (in contrast to O_* flags they affect all
        path components). The current set of flags is as follows (at the
        moment, all of the RESOLVE_ flags are implemented as just passing
        the corresponding LOOKUP_ flag).

          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT

    open_how does not contain an embedded size field, because it is of
    little benefit (userspace can figure out the kernel open_how size at
    runtime fairly easily without it). It also only contains u64s (even
    though ->mode arguably should be a u16) to avoid having padding fields
    which are never used in the future.

    Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
    is no longer permitted for openat(2). As far as I can tell, this has
    always been a bug and appears to not be used by userspace (and I've not
    seen any problems on my machines by disallowing it). If it turns out
    this breaks something, we can special-case it and only permit it for
    openat(2) but not openat2(2).

    After input from Florian Weimer, the new open_how and flag definitions
    are inside a separate header from uapi/linux/fcntl.h, to avoid problems
    that glibc has with importing that header.

    /* Testing. */
    In a follow-up patch there are over 200 selftests which ensure that this
    syscall has the correct semantics and will correctly handle several
    attack scenarios.

    In addition, I've written a userspace library[4] which provides
    convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
    because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
    must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
    syscalls). During the development of this patch, I've run numerous
    verification tests using libpathrs (showing that the API is reasonably
    usable by userspace).

    /* Future Work. */
    Additional RESOLVE_ flags have been suggested during the review period.
    These can be easily implemented separately (such as blocking auto-mount
    during resolution).

    Furthermore, there are some other proposed changes to the openat(2)
    interface (the most obvious example is magic-link hardening[5]) which
    would be a good opportunity to add a way for userspace to restrict how
    O_PATH file descriptors can be re-opened.

    Another possible avenue of future work would be some kind of
    CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
    which openat2(2) flags and fields are supported by the current kernel
    (to avoid userspace having to go through several guesses to figure it
    out).

    [1]: https://lwn.net/Articles/588444/
    [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
    [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
    [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
    [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
    [6]: https://youtu.be/ggD-eb3yPVs

    Suggested-by: Christian Brauner
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     

27 Nov, 2019

1 commit

  • This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.

    I was hoping it would be benign to switch over entirely to FMODE_STREAM,
    and we'd have just a couple of small fixups we'd need, but it looks like
    we're not quite there yet.

    While it worked fine on both my desktop and laptop, they are fairly
    similar in other respects, and run mostly the same loads. Kenneth
    Crudup reports that it seems to break both his vmware installation and
    the KDE upower service. In both cases apparently leading to timeouts
    due to waiting for the f_pos lock.

    There are a number of character devices in particular that definitely
    want stream-like behavior, but that currently don't get marked as
    streams, and as a result get the exclusion between concurrent
    read()/write() on the same file descriptor. Which doesn't work well for
    them.

    The most obvious example of this is /dev/console and /dev/tty, which use
    console_fops and tty_fops respectively (and ptmx_fops for the pty master
    side). It may be that it's just this that causes problems, but we
    clearly weren't ready yet.

    Because there's a number of other likely common cases that don't have
    llseek implementations and would seem to act as stream devices:

    /dev/fuse (fuse_dev_operations)
    /dev/mcelog (mce_chrdev_ops)
    /dev/mei0 (mei_fops)
    /dev/net/tun (tun_fops)
    /dev/nvme0 (nvme_dev_fops)
    /dev/tpm0 (tpm_fops)
    /proc/self/ns/mnt (ns_file_operations)
    /dev/snd/pcm* (snd_pcm_f_ops[])

    and while some of these could be trivially automatically detected by the
    vfs layer when the character device is opened by just noticing that they
    have no read or write operations either, it often isn't that obvious.

    Some character devices most definitely do use the file position, even if
    they don't allow seeking: the firmware update code, for example, uses
    simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
    back and forth.
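
    To illustrate why such a device depends on the file position even
    though it cannot seek, here is a userspace model of a read helper in
    the style of simple_read_from_buffer(). The name
    model_read_from_buffer() and the exact error handling are assumptions
    made for this sketch, not the kernel implementation.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy up to 'count' bytes from a fixed buffer, starting at *ppos, and
 * advance *ppos by the amount copied. Returns the number of bytes
 * copied, 0 at end of buffer, or a negative errno value.
 */
ssize_t model_read_from_buffer(void *to, size_t count, off_t *ppos,
			       const void *from, size_t available)
{
	off_t pos = *ppos;

	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available || count == 0)
		return 0;
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;
	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (off_t)count;
	return (ssize_t)count;
}
```

    Two concurrent readers sharing one position would each need to see
    the other's update to *ppos, which is exactly what the f_pos lock
    protects and what a stream annotation would take away.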

    We'll revisit this when there's a better way to detect the problem and
    fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
    annotations).

    Reported-by: Kenneth R. Crudup
    Cc: Kirill Smelkov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Nov, 2019

1 commit

  • fdget_pos() is used by file operations that will read and update f_pos:
    things like "read()", "write()" and "lseek()" (but not, for example,
    "pread()/pwrite()" that get their file positions elsewhere).

    However, it had two separate escape clauses for this, because not
    everybody wants or needs serialization of the file position.

    The first and most obvious case is the "file descriptor doesn't have a
    position at all", ie a stream-like file. Except we didn't actually use
    FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that
    was that FMODE_STREAM didn't exist back in the days, but also that we
    didn't want to mark all the special cases, so we only marked the ones
    that _required_ position atomicity according to POSIX - regular files
    and directories.

    The case one was intentionally lazy, but now that we _do_ have
    FMODE_STREAM we could and should just use it. With the change to use
    FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
    the code to set it is deleted.

    Any cases where we don't want the serialization because the driver (or
    subsystem) doesn't use the file position should just be updated to do
    "stream_open()". We've done that for all the obvious and common
    situations, we may need a few more. Quoting Kirill Smelkov in the
    original FMODE_STREAM thread (see link below for full email):

    "And I appreciate if people could help at least somehow with "getting
    rid of mixed case entirely" (i.e. always lock f_pos_lock on
    !FMODE_STREAM), because this transition starts to diverge from my
    particular use-case too far. To me it makes sense to do that
    transition as follows:

    - convert nonseekable_open -> stream_open via stream_open.cocci;
    - audit other nonseekable_open calls and convert left users that
    truly don't depend on position to stream_open;
    - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
    will cover pipes and sockets), or maybe convert pipes and sockets
    to FMODE_STREAM manually;
    - extend stream_open.cocci to analyze file_operations that use
    no_llseek or noop_llseek, but do not use nonseekable_open or
    alloc_file_pseudo. This might find files that have stream semantic
    but are opened differently;
    - extend stream_open.cocci to analyze file_operations whose
    .read/.write do not use ppos at all (independently of how file was
    opened);
    - ...
    - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
    !FMODE_STREAM;
    - gather bug reports for deadlocked read/write and convert missed
    cases to FMODE_STREAM, probably extending stream_open.cocci along
    the road to catch similar cases

    i.e. always take f_pos_lock unless a file is explicitly marked as
    being stream, and try to find and cover all files that are streams"

    We have not done the "extend stream_open.cocci to analyze
    alloc_file_pseudo" as well, but the previous commit did manually handle
    the case of pipes and sockets.

    The other case where we can avoid locking f_pos is the "this file
    descriptor only has a single user and it is us, and thus there is no
    need to lock it".

    The second test was correct, although a bit subtle and worth just
    re-iterating here. There are two kinds of other sources of references
    to the same file descriptor: file descriptors that have been explicitly
    shared across fork() or with dup(), and file tables having elevated
    reference counts due to threading (or explicit file sharing with
    clone()).

    The first case would have incremented the file count explicitly, and in
    the second case the previous __fdget() would have incremented it for us
    and set the FDPUT_FPUT flag.

    But in both cases the file count would be greater than one, so the
    "file_count(file) > 1" test catches both situations. Also note that if
    file_count is 1, that also means that no other thread can have access to
    the file table, so there also cannot be races with concurrent calls to
    dup()/fork()/clone() that would increment the file count any other way.
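
    The two escape clauses can be summarised in a small userspace model.
    This is a sketch of the decision as described in this message, not
    the kernel code: struct pos_file, the bit value of
    MODEL_FMODE_STREAM and model_needs_pos_lock() are all hypothetical.

```c
#include <stdbool.h>

#define MODEL_FMODE_STREAM 0x1u	/* hypothetical bit value */

struct pos_file {
	unsigned int f_mode;
	long f_count;		/* stands in for file_count(file) */
};

/* Should read()/write()/lseek() take the position lock for this file? */
bool model_needs_pos_lock(const struct pos_file *file)
{
	/* Escape clause 1: stream-like files have no position at all. */
	if (file->f_mode & MODEL_FMODE_STREAM)
		return false;
	/* Escape clause 2: a sole reference cannot race with anyone. */
	return file->f_count > 1;
}
```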

    Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
    Cc: Kirill Smelkov
    Cc: Eric Dumazet
    Cc: Al Viro
    Cc: Alan Stern
    Cc: Marco Elver
    Cc: Andrea Parri
    Cc: Paul McKenney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Sep, 2019

1 commit

  • "unlikely(WARN_ON(x))" is excessive. WARN_ON() already uses unlikely()
    internally.
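
    A userspace sketch of why the outer unlikely() adds nothing. The two
    macros below are stand-ins written for this illustration (they rely on
    GCC/Clang statement expressions and __builtin_expect); the real
    definitions live in the kernel headers and differ in detail.
    model_check_len() is a hypothetical caller.

```c
#include <stdio.h>

/* Userspace stand-in for the branch-prediction hint. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Stand-in: warn once evaluated true, and yield the hinted condition. */
#define WARN_ON(condition) ({                                       \
	int warned_ = !!(condition);                                \
	if (unlikely(warned_))                                      \
		fprintf(stderr, "WARNING at %s:%d\n",               \
			__FILE__, __LINE__);                        \
	unlikely(warned_);                                          \
})

/* The result of WARN_ON() already carries the hint. */
int model_check_len(long len)
{
	if (WARN_ON(len < 0))	/* not: unlikely(WARN_ON(len < 0)) */
		return -1;
	return 0;
}
```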

    Link: http://lkml.kernel.org/r/20190829165025.15750-5-efremov@linux.com
    Signed-off-by: Denis Efremov
    Cc: Alexander Viro
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Efremov
     

25 Sep, 2019

1 commit

  • In the previous patch, an application could put part of its text section in
    THP via madvise(). These THPs will be protected from writes when the
    application is still running (TXTBSY). However, after the application
    exits, the file is available for writes.

    This patch avoids writes to file THP by dropping page cache for the file
    when the file is open for write. A new counter nr_thps is added to struct
    address_space. In do_dentry_open(), if the file is open for write and
    nr_thps is non-zero, we drop page cache for the whole file.

    Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reported-by: kbuild test robot
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

25 Jul, 2019

1 commit

  • It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
    work because it installs a temporary credential that gets allocated and
    freed for each system call.

    The allocation and freeing overhead is mostly benign, but because
    credentials can be accessed under the RCU read lock, the freeing
    involves a RCU grace period.

    Which is not a huge deal normally, but if you have a lot of access()
    calls, this causes a fair amount of secondary damage: instead of having
    nice alloc/free patterns that hit in hot per-CPU slab caches, you have
    all those delayed free's, and on big machines with hundreds of cores,
    the RCU overhead can end up being enormous.

    But it turns out that all of this is entirely unnecessary. Exactly
    because access() only installs the credential as the thread-local
    subjective credential, the temporary cred pointer doesn't actually need
    to be RCU free'd at all. Once we're done using it, we can just free it
    synchronously and avoid all the RCU overhead.

    So add a 'non_rcu' flag to 'struct cred', which can be set by users that
    know they only use it in non-RCU context (there are other potential
    users for this). We can make it a union with the rcu freeing list head
    that we need for the RCU case, so this doesn't need any extra storage.
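
    The union trick can be modelled in userspace as below. Everything
    here is a simplified sketch: struct model_rcu_head, struct model_cred,
    the model_deferred list and both functions are hypothetical, and the
    "grace period" is just an explicit call that frees the deferred list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the list head used for deferred freeing. */
struct model_rcu_head {
	struct model_rcu_head *next;
	void (*func)(struct model_rcu_head *head);
};

struct model_cred {
	int usage;
	union {
		int non_rcu;			/* set: free synchronously */
		struct model_rcu_head rcu;	/* else: deferred freeing */
	};
};

static struct model_rcu_head *model_deferred;	/* pending frees */

/* Drop the final reference. Returns true if freed synchronously. */
bool model_put_cred(struct model_cred *cred)
{
	if (cred->non_rcu) {
		free(cred);
		return true;
	}
	/* The flag is no longer needed, so its storage becomes the list head. */
	cred->rcu.func = NULL;
	cred->rcu.next = model_deferred;
	model_deferred = &cred->rcu;
	return false;
}

/* Model of the grace period ending: free everything that was deferred. */
void model_grace_period(void)
{
	while (model_deferred) {
		struct model_rcu_head *head = model_deferred;

		model_deferred = head->next;
		free((char *)head - offsetof(struct model_cred, rcu));
	}
}
```

    The flag is only read before the list head is written, which is why
    the two can share storage.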

    Note that this also makes 'get_current_cred()' clear the new non_rcu
    flag, in case we have filesystems that take a long-term reference to the
    cred and then expect the RCU delayed freeing afterwards. It's not
    entirely clear that this is required, but it makes for clear semantics:
    the subjective cred remains non-RCU as long as you only access it
    synchronously using the thread-local accessors, but you _can_ use it as
    a generic cred if you want to.

    It is possible that we should just remove the whole RCU markings for
    ->cred entirely. Only ->real_cred is really supposed to be accessed
    through RCU, and the long-term cred copies that nfs uses might want to
    explicitly re-enable RCU freeing if required, rather than have
    get_current_cred() do it implicitly.

    But this is a "minimal semantic changes" change for the immediate
    problem.

    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Eric Dumazet
    Acked-by: Paul E. McKenney
    Cc: Oleg Nesterov
    Cc: Jan Glauber
    Cc: Jiri Kosina
    Cc: Jayachandran Chandrasekharan Nair
    Cc: Greg KH
    Cc: Kees Cook
    Cc: David Howells
    Cc: Miklos Szeredi
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 May, 2019

1 commit

  • This amends commit 10dce8af3422 ("fs: stream_open - opener for
    stream-like files so that read and write can run simultaneously without
    deadlock") in how position is passed into .read()/.write() handler for
    stream-like files:

    Rasmus noticed that we currently pass 0 as the position and ignore any
    position change made by a file implementation. This papers over bugs when
    ppos is used in files that declare themselves stream-like, because such bugs
    go unnoticed. Even if a file implementation is correctly converted to
    stream_open, its read/write could later be changed to use ppos, and although
    that would not work correctly, the bug might go unnoticed until someone
    analyses the wrong behaviour. It is thus better to pass ppos=NULL into
    read/write for stream-like files: that leaves no room for ppos usage bugs,
    because the kernel will oops if ppos is ever used inside .read() or .write().
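
    The idea can be sketched in userspace as a helper that selects the
    position pointer handed to the read/write handler. struct
    stream_file, the MODEL_FMODE_STREAM bit value and
    model_file_ppos() are hypothetical names for this illustration.

```c
#include <stddef.h>
#include <sys/types.h>

#define MODEL_FMODE_STREAM 0x1u	/* hypothetical bit value */

struct stream_file {
	unsigned int f_mode;
	off_t f_pos;
};

/* Position pointer for .read()/.write(): none at all for streams. */
off_t *model_file_ppos(struct stream_file *file)
{
	return (file->f_mode & MODEL_FMODE_STREAM) ? NULL : &file->f_pos;
}
```

    A handler that dereferences the result on a stream-like file fails
    immediately instead of silently reading a dummy zero.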

    Note 1: rw_verify_area, new_sync_{read,write} need to be updated
    because they are called by vfs_read/vfs_write & friends before
    file_operations .read/.write .

    Note 2: if file backend uses new-style .read_iter/.write_iter, position
    is still passed into there as non-pointer kiocb.ki_pos . Currently
    stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
    whose file_operations has *_iter methods.

    Suggested-by: Rasmus Villemoes
    Signed-off-by: Kirill Smelkov

    Kirill Smelkov
     

07 Apr, 2019

1 commit

  • …multaneously without deadlock

    Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
    locking for file.f_pos access and in particular made concurrent read and
    write not possible - now both those functions take f_pos lock for the
    whole run, and so if e.g. a read is blocked waiting for data, write will
    deadlock waiting for that read to complete.

    This caused regression for stream-like files where previously read and
    write could run simultaneously, but after that patch could not do so
    anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
    to /proc/xen/xenbus") which fixes such regression for particular case of
    /proc/xen/xenbus.

    The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
    safety for read/write/lseek and added the locking to file descriptors of
    all regular files. In 2014 that thread-safety problem was not new as it
    was already discussed earlier in 2006.

    However, even though the 2006 version of Linus's patch added f_pos
    locking "only for files that are marked seekable with FMODE_LSEEK (thus
    avoiding the stream-like objects like pipes and sockets)", the 2014
    version - the one that actually made it into the tree as 9c225f2655e3 -
    does so regardless of whether a file is seekable or not.

    See

    https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
    https://lwn.net/Articles/180387
    https://lwn.net/Articles/180396

    for historic context.

    The reason that it did so is, probably, that there are many files that
    are marked non-seekable, but e.g. their read implementation actually
    depends on knowing current position to correctly handle the read. Some
    examples:

    kernel/power/user.c snapshot_read
    fs/debugfs/file.c u32_array_read
    fs/fuse/control.c fuse_conn_waiting_read + ...
    drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
    arch/s390/hypfs/inode.c hypfs_read_iter
    ...

    Despite that, many nonseekable_open users implement read and write with
    pure stream semantics - they don't depend on the passed ppos at all. And for
    those cases where read could wait for something inside, it creates a
    situation similar to xenbus - the write can never proceed until the read
    is done, and the read is waiting for some, potentially external, event,
    for potentially unbounded time -> deadlock.

    Besides xenbus, there are 14 such places in the kernel that I've found
    with semantic patch (see below):

    drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
    drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
    drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
    drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
    net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
    drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
    drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
    drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
    net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
    drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
    drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
    drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
    drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
    drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

    In addition to the cases above another regression caused by f_pos
    locking is that now FUSE filesystems that implement open with
    FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
    stream-like files - for the same reason as above e.g. read can deadlock
    write locking on file.f_pos in the kernel.

    FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
    implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
    in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
    write routines not depending on current position at all, and with both
    read and write being potentially blocking operations:

    See

    https://github.com/libfuse/osspd
    https://lwn.net/Articles/308445

    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

    Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
    "somewhat pipe-like files ..." with read handler not using offset.
    However that test implements only read without write and cannot exercise
    the deadlock scenario:

    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

    I've actually hit the read vs write deadlock for real while implementing
    my FUSE filesystem where there is /head/watch file, for which open
    creates separate bidirectional socket-like stream in between filesystem
    and its user with both read and write being later performed
    simultaneously. And there it is semantically not easy to split the
    stream into two separate read-only and write-only channels:

    https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

    Let's fix this regression. The plan is:

    1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
    doing so would break many in-kernel nonseekable_open users which
    actually use ppos in read/write handlers.

    2. Add stream_open() to kernel to open stream-like non-seekable file
    descriptors. Read and write on such file descriptors would never use
    nor change ppos. And with that property on stream-like files read and
    write will be running without taking f_pos lock - i.e. read and write
    could be running simultaneously.

    3. With semantic patch search and convert to stream_open all in-kernel
    nonseekable_open users for which read and write actually do not
    depend on ppos and where there are no other methods in file_operations
    which assume @offset access.

    4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
    stream_open if that bit is present in filesystem open reply.

    It was tempting to change fs/fuse/ open handler to use stream_open
    instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
    grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
    and in particular GVFS which actually uses offset in its read and
    write handlers

    https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

    so if we made such a change it would break a real user.

    5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
    from v3.14+ (the kernel where 9c225f2655 first appeared).

    This will allow patching OSSPD and other FUSE filesystems that
    provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
    in their open handler and this way avoid the deadlock on all kernel
    versions. This should work because fs/fuse/ ignores unknown open
    flags returned from a filesystem and so passing FOPEN_STREAM to a
    kernel that is not aware of this flag cannot hurt. In turn the kernel
    that is not aware of FOPEN_STREAM will be < v3.14 where just
    FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
    write deadlock.

    This patch adds stream_open, converts /proc/xen/xenbus to it and adds
    semantic patch to automatically locate in-kernel places that are either
    required to be converted due to read vs write deadlock, or that are just
    safe to be converted because read and write do not use ppos and there
    are no other funky methods in file_operations.
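
    The difference between the two openers comes down to which mode bits
    they clear and set. The sketch below models that with hypothetical
    bit values (M_*) and hypothetical function names. Only the set/clear
    pattern is meant to be illustrative, and the setting of a stream bit
    is an assumption based on how FMODE_STREAM is described elsewhere in
    this log.

```c
/* Hypothetical bit values; only the set/clear pattern matters here. */
#define M_LSEEK      0x01u
#define M_PREAD      0x02u
#define M_PWRITE     0x04u
#define M_ATOMIC_POS 0x08u
#define M_STREAM     0x10u

/* Non-seekable, but read/write may still use and lock the position. */
unsigned int model_nonseekable_open(unsigned int f_mode)
{
	return f_mode & ~(M_LSEEK | M_PREAD | M_PWRITE);
}

/* Stream-like: no seeking, no position, hence no f_pos locking. */
unsigned int model_stream_open(unsigned int f_mode)
{
	f_mode &= ~(M_LSEEK | M_PREAD | M_PWRITE | M_ATOMIC_POS);
	return f_mode | M_STREAM;
}
```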

    Regarding semantic patch I've verified each generated change manually -
    that it is correct to convert - and each other nonseekable_open instance
    left - that it is either not correct to convert there, or that it is not
    converted due to current stream_open.cocci limitations.

    The script also does not convert files that should be valid to convert,
    but that currently have .llseek = noop_llseek or generic_file_llseek for
    unknown reason despite file being opened with nonseekable_open (e.g.
    drivers/input/mousedev.c)

    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Yongzhi Pan <panyongzhi@gmail.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: David Vrabel <david.vrabel@citrix.com>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Julia Lawall <Julia.Lawall@lip6.fr>
    Cc: Nikolaus Rath <Nikolaus@rath.org>
    Cc: Han-Wen Nienhuys <hanwen@google.com>
    Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Kirill Smelkov
     

30 Mar, 2019

1 commit

  • syzbot is hitting lockdep warning [1] due to trying to open a fifo
    during an execve() operation. But we don't need to open non-regular
    files during an execve() operation, because the only files we need are
    the executable file itself and interpreter programs like /bin/sh and
    ld-linux.so.2.

    Since the manpage for execve(2) says that execve() returns EACCES when
    the file or a script interpreter is not a regular file, and the manpage
    for uselib(2) says that uselib() can return EACCES, and we use
    FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
    regular file is requested with FMODE_EXEC set.

    Since this deadlock followed by khungtaskd warnings is trivially
    reproducible by a local unprivileged user, and syzbot's frequent crash
    due to this deadlock defers finding other bugs, let's work around this
    deadlock until we get a chance to find a better solution.

    [1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce

    Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Reported-by: syzbot
    Fixes: 8924feff66f35fe2 ("splice: lift pipe_lock out of splice_to_pipe()")
    Signed-off-by: Tetsuo Handa
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Eric Biggers
    Cc: Dmitry Vyukov
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

22 Aug, 2018

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This contains two new features:

    - Stack file operations: this allows removal of several hacks from
    the VFS, proper interaction of read-only open files with copy-up,
    possibility to implement fs modifying ioctls properly, and others.

    - Metadata only copy-up: when file is on lower layer and only
    metadata is modified (except size) then only copy up the metadata
    and continue to use the data from the lower file"

    * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
    ovl: Enable metadata only feature
    ovl: Do not do metacopy only for ioctl modifying file attr
    ovl: Do not do metadata only copy-up for truncate operation
    ovl: add helper to force data copy-up
    ovl: Check redirect on index as well
    ovl: Set redirect on upper inode when it is linked
    ovl: Set redirect on metacopy files upon rename
    ovl: Do not set dentry type ORIGIN for broken hardlinks
    ovl: Add an inode flag OVL_CONST_INO
    ovl: Treat metacopy dentries as type OVL_PATH_MERGE
    ovl: Check redirects for metacopy files
    ovl: Move some dir related ovl_lookup_single() code in else block
    ovl: Do not expose metacopy only dentry from d_real()
    ovl: Open file with data except for the case of fsync
    ovl: Add helper ovl_inode_realdata()
    ovl: Store lower data inode in ovl_inode
    ovl: Fix ovl_getattr() to get number of blocks from lower
    ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
    ovl: Copy up meta inode data from lowest data inode
    ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
    ...

    Linus Torvalds
     

18 Jul, 2018

5 commits


12 Jul, 2018

11 commits


11 Jul, 2018

2 commits

  • ->open() instances really, really should not be doing that. There are
    a lot of places, e.g. around atomic_open(), that could be confused by that,
    so let's catch that early.

    Acked-by: Linus Torvalds
    Signed-off-by: Al Viro

    Al Viro
     
  • it's exactly the same thing as
    dentry_open(&file->f_path, file->f_flags, file->f_cred)

    ... and rename it to file_clone_open(), while we are at it.
    'filp' naming convention is bogus; sure, it's "file pointer",
    but we generally don't do that kind of Hungarian notation.
    Some of the instances have too many callers to touch, but this
    one has only two, so let's sanitize it while we can...

    Acked-by: Linus Torvalds
    Signed-off-by: Al Viro

    Al Viro
     

04 Jun, 2018

1 commit

  • This reverts commit cab64df194667dc5d9d786f0a895f647f5501c0d.

    Having vfs_open() in some cases drop the reference to
    struct file combined with

        error = vfs_open(path, f, cred);
        if (error) {
                put_filp(f);
                return ERR_PTR(error);
        }
        return f;

    is flat-out wrong. It used to be

        error = vfs_open(path, f, cred);
        if (!error) {
                /* from now on we need fput() to dispose of f */
                error = open_check_o_direct(f);
                if (error) {
                        fput(f);
                        f = ERR_PTR(error);
                }
        } else {
                put_filp(f);
                f = ERR_PTR(error);
        }

    and sure, having that open_check_o_direct() boilerplate gotten rid of is
    nice, but not that way...

    Worse, another call chain (via finish_open()) is FUBAR now wrt
    FILE_OPENED handling - in that case we get error returned, with file
    already hit by fput() *AND* FILE_OPENED not set. Guess what happens in
    path_openat(), when it hits

        if (!(opened & FILE_OPENED)) {
                BUG_ON(!error);
                put_filp(file);
        }

    The root cause of all that crap is that the callers of do_dentry_open()
    have no way to tell which way it failed; while that could be fixed up
    (by passing something like int *opened to do_dentry_open() and have it
    marked if we'd called ->open()), it's probably much too late in the
    cycle to do so right now.
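
The "int *opened" idea mentioned above can be sketched as a small userspace model. Everything here is hypothetical (struct model_file, model_do_dentry_open(), model_open() and the enum are invented for illustration); it only shows the ownership rule that the callee never drops the reference and the caller disposes of the file exactly once, choosing the primitive from the flag.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, heavily simplified stand-in for struct file. */
struct model_file {
	int refs;	/* outstanding references */
};

/* Which disposal the caller ended up using. */
enum model_disposal { MODEL_KEPT, MODEL_PUT_FILP, MODEL_FPUT };

/*
 * Model of the callee: it reports through *opened whether the
 * ->open() stage was reached, and never drops the reference itself.
 */
static int model_do_dentry_open(int fail_before_open, int fail_after_open,
				int *opened)
{
	if (fail_before_open)
		return -ENOENT;		/* ->open() never ran */
	*opened = 1;			/* ->open() has been called */
	if (fail_after_open)
		return -EINVAL;		/* a later check failed */
	return 0;
}

/* Model of the caller: exactly one disposal on any failure. */
static enum model_disposal model_open(struct model_file *f,
				      int fail_before, int fail_after)
{
	int opened = 0;
	int error = model_do_dentry_open(fail_before, fail_after, &opened);

	if (!error)
		return MODEL_KEPT;
	f->refs--;	/* the single drop, whichever primitive is used */
	return opened ? MODEL_FPUT : MODEL_PUT_FILP;
}

/* Self-check: success, early failure, late failure. */
static void model_open_demo(void)
{
	struct model_file a = { 1 }, b = { 1 }, c = { 1 };

	assert(model_open(&a, 0, 0) == MODEL_KEPT && a.refs == 1);
	assert(model_open(&b, 1, 0) == MODEL_PUT_FILP && b.refs == 0);
	assert(model_open(&c, 0, 1) == MODEL_FPUT && c.refs == 0);
}
```

In this model the reference count can never go below zero, which is the property the reverted commit broke: there, the callee sometimes dropped the reference and the caller dropped it again.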

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

07 Apr, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, including Christoph's I_DIRTY patches"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: move I_DIRTY_INODE to fs.h
    ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
    ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
    gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
    fs: fold open_check_o_direct into do_dentry_open
    vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
    vfs: make sure struct filename->iname is word-aligned
    get rid of pointless includes of fs_struct.h
    [poll] annotate SAA6588_CMD_POLL users

    Linus Torvalds
     

03 Apr, 2018

7 commits

  • Using the ksys_fallocate() wrapper allows us to get rid of in-kernel
    calls to the sys_fallocate() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_fallocate().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_truncate() wrapper allows us to get rid of in-kernel
    calls to the sys_truncate() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_truncate().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using this wrapper allows us to avoid the in-kernel calls to the
    sys_open() syscall. The ksys_ prefix denotes that this function is meant
    as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_open().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_close() wrapper allows us to get rid of in-kernel calls
    to the sys_close() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it
    uses the same calling convention as sys_close(), with one subtle
    difference:

    The few places which checked the return value did not care about the return
    value re-writing in sys_close(), so simply use a wrapper around
    __close_fd().
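
A userspace model of that difference, as a rough sketch: the syscall entry point rewrites restart codes to -EINTR, while the in-kernel wrapper passes the raw result through. All names are hypothetical, the restart code is defined locally because it is kernel-internal and not exported to userspace headers, and only one restart code is modelled.

```c
#include <assert.h>
#include <errno.h>

/* Local model of a kernel-internal restart code (not in <errno.h>). */
#define MODEL_ERESTARTSYS 512

/* Hypothetical fd whose close is "interrupted" in this model. */
#define MODEL_INTERRUPTED_FD 99

/* Hypothetical stand-in for __close_fd(). */
static int model_close_fd(int fd)
{
	if (fd < 0)
		return -EBADF;
	if (fd == MODEL_INTERRUPTED_FD)
		return -MODEL_ERESTARTSYS;
	return 0;
}

/* Model of the in-kernel wrapper: a plain pass-through. */
static int model_ksys_close(int fd)
{
	return model_close_fd(fd);
}

/* Model of the syscall: close cannot be restarted, so report -EINTR. */
static int model_sys_close(int fd)
{
	int retval = model_close_fd(fd);

	if (retval == -MODEL_ERESTARTSYS)
		retval = -EINTR;
	return retval;
}

/* Self-check: the two differ only for the interrupted case. */
static void model_close_demo(void)
{
	assert(model_sys_close(3) == 0 && model_ksys_close(3) == 0);
	assert(model_sys_close(-1) == -EBADF && model_ksys_close(-1) == -EBADF);
	assert(model_sys_close(MODEL_INTERRUPTED_FD) == -EINTR);
	assert(model_ksys_close(MODEL_INTERRUPTED_FD) == -MODEL_ERESTARTSYS);
}
```

Callers that only test for zero versus non-zero see no difference between the two, which matches the observation that the few in-kernel callers did not care about the rewriting.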

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_ftruncate() wrapper allows us to get rid of in-kernel
    calls to the sys_ftruncate() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_ftruncate().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the fs-internal do_fchownat() wrapper allows us to get rid of
    fs-internal calls to the sys_fchownat() syscall.

    Introducing the ksys_fchown() helper and the ksys_{,}chown() wrappers
    allows us to avoid the in-kernel calls to the sys_{,l,f}chown() syscalls.
    The ksys_ prefix denotes that these functions are meant as a drop-in
    replacement for the syscalls. In particular, they use the same calling
    convention as sys_{,l,f}chown().
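
The same routing is visible from userspace, where chown() and lchown() can be expressed through fchownat() with AT_FDCWD. The sketch below illustrates that relationship only; the model_* names are hypothetical and this is not the kernel code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical common helper that every variant funnels into. */
static int model_do_fchownat(int dfd, const char *path,
			     uid_t user, gid_t group, int flag)
{
	return fchownat(dfd, path, user, group, flag);
}

/* chown(): resolve relative to the cwd and follow symlinks. */
static int model_chown(const char *path, uid_t user, gid_t group)
{
	return model_do_fchownat(AT_FDCWD, path, user, group, 0);
}

/* lchown(): same, but operate on a symlink itself. */
static int model_lchown(const char *path, uid_t user, gid_t group)
{
	return model_do_fchownat(AT_FDCWD, path, user, group,
				 AT_SYMLINK_NOFOLLOW);
}
```

Passing (uid_t)-1 and (gid_t)-1 leaves ownership unchanged, which makes these wrappers easy to exercise without privileges.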

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the fs-internal do_faccessat() helper allows us to get rid of
    fs-internal calls to the sys_faccessat() syscall.

    Introducing the ksys_access() wrapper allows us to avoid the in-kernel
    calls to the sys_access() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_access().
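
As a userspace illustration of the same layering, access() can be written as a thin wrapper over faccessat() with AT_FDCWD and no flags. model_access() is a hypothetical name used only for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* access(): resolve relative to the cwd, with no extra flags. */
static int model_access(const char *path, int mode)
{
	return faccessat(AT_FDCWD, path, mode, 0);
}
```

For any path and mode, model_access() should return what access() returns; only the entry point differs.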

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski