30 Dec, 2020

1 commit

  • commit 398840f8bb935d33c64df4ec4fed77a7d24c267d upstream.

    This was an oversight in the original implementation, as it makes no
    sense to specify both scoping flags to the same openat2(2) invocation
    (before this patch, the result of such an invocation was equivalent to
    RESOLVE_IN_ROOT being ignored).

    This is a userspace-visible ABI change, but the only user of openat2(2)
    at the moment is LXC which doesn't specify both flags and so no
    userspace programs will break as a result.
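
    As a rough userspace illustration of the rule described above (not the
    kernel code itself), the check can be sketched as follows. The helper
    name and the SKETCH_ prefixed macros are hypothetical; the two values
    are assumed to mirror RESOLVE_BENEATH and RESOLVE_IN_ROOT from the
    uapi header linux/openat2.h.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed to match RESOLVE_BENEATH / RESOLVE_IN_ROOT in linux/openat2.h. */
#define SKETCH_RESOLVE_BENEATH 0x08
#define SKETCH_RESOLVE_IN_ROOT 0x10

/*
 * Hypothetical helper: returns 0 if the scoping flags in "resolve" are
 * compatible, or -EINVAL if both scoping flags are requested at once.
 */
static int sketch_check_scoping(uint64_t resolve)
{
    if ((resolve & SKETCH_RESOLVE_BENEATH) &&
        (resolve & SKETCH_RESOLVE_IN_ROOT))
        return -EINVAL;
    return 0;
}
```

    A caller could run such a check before invoking openat2(2) to get the
    same answer on kernels with and without this patch.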

    Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall")
    Signed-off-by: Aleksa Sarai
    Acked-by: Christian Brauner
    Cc: # v5.6+
    Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Aleksa Sarai
     

13 Aug, 2020

1 commit

  • The execve(2)/uselib(2) syscalls have always rejected non-regular files.
    Recently, it was noticed that a deadlock was introduced when trying to
    execute pipes, as the S_ISREG() test was happening too late. This was
    fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
    during execve()"), but it was added after inode_permission() had already
    run, which meant LSMs could see bogus attempts to execute non-regular
    files.

    Move the test into the other inode type checks (which already look for
    other pathological conditions[1]). Since there is no need to use
    FMODE_EXEC while we still have access to "acc_mode", also switch the test
    to MAY_EXEC.

    Also include a comment with the redundant S_ISREG() checks at the end of
    execve(2)/uselib(2) to note that they are present to avoid any mistakes.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
        struct open_flags open_exec_flags = {
            .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
            .acc_mode = MAY_EXEC,
            ...
        do_filp_open(dfd, filename, open_flags)
            path_openat(nameidata, open_flags, flags)
                file = alloc_empty_file(open_flags, current_cred());
                do_open(nameidata, file, open_flags)
                    may_open(path, acc_mode, open_flag)
                        /* new location of MAY_EXEC vs S_ISREG() test */
                        inode_permission(inode, MAY_OPEN | acc_mode)
                            security_inode_permission(inode, acc_mode)
                    vfs_open(path, file)
                        do_dentry_open(file, path->dentry->d_inode, open)
                            /* old location of FMODE_EXEC vs S_ISREG() test */
                            security_file_open(f)
                            open()

    [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     

08 Aug, 2020

1 commit

  • Pull init and set_fs() cleanups from Al Viro:
    "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

    * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
    init: add an init_dup helper
    init: add an init_utimes helper
    init: add an init_stat helper
    init: add an init_mknod helper
    init: add an init_mkdir helper
    init: add an init_symlink helper
    init: add an init_link helper
    init: add an init_eaccess helper
    init: add an init_chmod helper
    init: add an init_chown helper
    init: add an init_chroot helper
    init: add an init_chdir helper
    init: add an init_rmdir helper
    init: add an init_unlink helper
    init: add an init_umount helper
    init: add an init_mount helper
    init: mark create_dev as __init
    init: mark console_on_rootfs as __init
    init: initialize ramdisk_execute_command at compile time
    devtmpfs: refactor devtmpfsd()
    ...

    Linus Torvalds
     

31 Jul, 2020

7 commits


16 Jul, 2020

2 commits


17 Jun, 2020

2 commits

  • One of the use-cases of close_range() is to drop file descriptors just before
    execve(). This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part of
    close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

    This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
    maximum number of file descriptors to copy from the old struct files. When the
    user requests that all file descriptors are supposed to be closed via
    close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
    to do any of the heavy fput() work for everything above min.

    The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
    fact currently share our file descriptor table we create a new private copy.
    We then close all fds in the requested range and finally after we're done we
    install the new fd table.
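
    The capping decision described above can be modelled roughly as
    follows. This is a simplified sketch under stated assumptions, not the
    kernel implementation: the function name and SKETCH_COPY_ALL are
    hypothetical, and cur_max stands for the highest slot in the current
    file descriptor table.

```c
#include <limits.h>

/* Hypothetical marker meaning "copy the whole table, no cap". */
#define SKETCH_COPY_ALL UINT_MAX

/*
 * Hypothetical model: if the requested range reaches past the end of
 * the current table, everything from min_fd upwards would be closed
 * anyway, so only the descriptors below min_fd need to be copied.
 */
static unsigned int sketch_max_fds_to_copy(unsigned int min_fd,
                                           unsigned int max_fd,
                                           unsigned int cur_max)
{
    if (max_fd >= cur_max)
        return min_fd;
    return SKETCH_COPY_ALL;
}
```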

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • This adds the close_range() syscall. It allows efficiently closing a range
    of file descriptors, up to all file descriptors of the calling task.

    I was contacted by FreeBSD as they wanted to have the same close_range()
    syscall as we proposed here. We've coordinated this and in the meantime, Kyle
    was fast enough to merge close_range() into FreeBSD already in April:
    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836
    and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
    once it's merged in Linux too. Python is in the process of switching to
    close_range() on FreeBSD and they are waiting on us to merge this to switch on
    Linux as well: https://bugs.python.org/issue38061

    The syscall came up in a recent discussion around the new mount API and
    making new file descriptor types cloexec by default. During this
    discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
    syscall in this manner has been requested by various people over time.

    First, it helps to close all file descriptors of an exec()ing task. This
    can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that file
    descriptors are not cloexec by default. This is aggravated by the fact that
    we can't just switch them over without massively regressing userspace. For
    a whole class of programs having an in-kernel method of closing all file
    descriptors is very helpful (e.g. daemons, service managers, programming
    language standard libraries, container managers etc.).
    (Please note, unshare(CLONE_FILES) should only be needed if the calling
    task is multi-threaded and shares the file descriptor table with another
    thread in which case two threads could race with one thread allocating file
    descriptors and the other one closing them via close_range(). For the
    general case close_range() before the execve() is sufficient.)

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
    file descriptor. From looking at various large(ish) userspace code bases
    this or similar patterns are very common in:
    - service managers (cf. [4])
    - libcs (cf. [6])
    - container runtimes (cf. [5])
    - programming language runtimes/standard libraries
      - Python (cf. [2])
      - Rust (cf. [7], [8])
    As Dmitry pointed out there's even a long-standing glibc bug about missing
    kernel support for this task (cf. [3]).
    In addition, the syscall will also work for tasks that do not have procfs
    mounted and on kernels that do not have procfs support compiled in. In such
    situations the only way to make sure that all file descriptors are closed
    is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
    OPEN_MAX trickery (cf. comment [8] on Rust).

    The performance is striking. For good measure, comparing the following
    simple close_all_fds() userspace implementation that is essentially just
    glibc's version in [6]:

    static int close_all_fds(void)
    {
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
            return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
            int fd;
            if (strcmp(direntp->d_name, ".") == 0)
                continue;
            if (strcmp(direntp->d_name, "..") == 0)
                continue;
            fd = atoi(direntp->d_name);
            if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                continue;
            close(fd);
        }
        closedir(dir);
        return 0;
    }

    to close_range() yields:
    1. closing 4 open files:
    - close_all_fds(): ~280 us
    - close_range(): ~24 us

    2. closing 1000 open files:
    - close_all_fds(): ~5000 us
    - close_range(): ~800 us

    close_range() is designed to allow for some flexibility. Specifically, it
    does not simply always close all open file descriptors of a task. Instead,
    callers can specify an upper bound.
    This is e.g. useful for scenarios where specific file descriptors are
    created with well-known numbers that are supposed to be excluded from
    getting closed.
    For extra paranoia close_range() comes with a flags argument. This can e.g.
    be used to implement extensions. One can imagine userspace wanting to stop
    at the first error instead of ignoring errors under certain circumstances.
    There might be other valid ideas in the future. In any case, a flag
    argument doesn't hurt and keeps us on the safe side.

    From an implementation side this is kept rather dumb. It saw some input
    from David and Jann but all nonsense is obviously my own!
    - Errors to close file descriptors are currently ignored. (Could be changed
    by setting a flag in the future if needed.)
    - __close_range() is a rather simplistic wrapper around __close_fd().
    My reasoning behind this is based on the nature of how __close_fd() needs
    to release an fd. But maybe I misunderstood specifics:
    We take the files_lock and rcu-dereference the fdtable of the calling
    task, we find the entry in the fdtable, get the file and need to release
    files_lock before calling filp_close().
    In the meantime the fdtable might have been altered so we can't just
    retake the spinlock and keep the old rcu-reference of the fdtable
    around. Instead we need to grab a fresh reference to the fdtable.
    If my reasoning is correct then there's really no point in fancyfying
    __close_range(): We just need to rcu-dereference the fdtable of the
    calling task once to cap the max_fd value correctly and then go on
    calling __close_fd() in a loop.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
    [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
    [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
    [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
    [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
    [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
    Note that this is an internal implementation that is not exported.
    Currently, libc seems to not provide an exported version of this
    because of missing kernel support to do this.
    Note, in a recent patch series Florian made grantpt() a nop thereby
    removing the code referenced here.
    [7]: https://github.com/rust-lang/rust/issues/12148
    [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
    Rust's solution is slightly different but is equally unperformant.
    Rust calls getdtablesize() which is a glibc library function that
    simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
    goes on to call close() on each fd. That's obviously overkill for most
    tasks. Rarely, tasks - especially non-daemons - hit RLIMIT_NOFILE or
    OPEN_MAX.
    Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
    to 1024. Even in this case, there's a very high chance that in the
    common case Rust is calling the close() syscall 1021 times pointlessly
    if the task just has 0, 1, and 2 open.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Kyle Evans
    Cc: Jann Horn
    Cc: David Howells
    Cc: Dmitry V. Levin
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Florian Weimer
    Cc: linux-api@vger.kernel.org

    Christian Brauner
     

03 Jun, 2020

2 commits

  • Merge updates from Andrew Morton:
    "A few little subsystems and a start of a lot of MM patches.

    Subsystems affected by this patch series: squashfs, ocfs2, parisc,
    vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
    swap, memcg, pagemap, memory-failure, vmalloc, kasan"

    * emailed patches from Andrew Morton: (128 commits)
    kasan: move kasan_report() into report.c
    mm/mm_init.c: report kasan-tag information stored in page->flags
    ubsan: entirely disable alignment checks under UBSAN_TRAP
    kasan: fix clang compilation warning due to stack protector
    x86/mm: remove vmalloc faulting
    mm: remove vmalloc_sync_(un)mappings()
    x86/mm/32: implement arch_sync_kernel_mappings()
    x86/mm/64: implement arch_sync_kernel_mappings()
    mm/ioremap: track which page-table levels were modified
    mm/vmalloc: track which page-table levels were modified
    mm: add functions to track page directory modifications
    s390: use __vmalloc_node in stack_alloc
    powerpc: use __vmalloc_node in alloc_vm_stack
    arm64: use __vmalloc_node in arch_alloc_vmap_stack
    mm: remove vmalloc_user_node_flags
    mm: switch the test_vmalloc module to use __vmalloc_node
    mm: remove __vmalloc_node_flags_caller
    mm: remove both instances of __vmalloc_node_flags
    mm: remove the prot argument to __vmalloc_node
    mm: remove the pgprot argument to __vmalloc
    ...

    Linus Torvalds
     
  • Patch series "vfs: have syncfs() return error when there are writeback
    errors", v6.

    Currently, syncfs does not return errors when one of the inodes fails to
    be written back. It will return errors based on the legacy AS_EIO and
    AS_ENOSPC flags when syncing out the block device fails, but that's not
    particularly helpful for filesystems that aren't backed by a blockdev.
    It's also possible for a stray sync to lose those errors.

    The basic idea in this set is to track writeback errors at the
    superblock level, so that we can quickly and easily check whether
    something bad happened without having to fsync each file individually.
    syncfs is then changed to reliably report writeback errors after they
    occur, much in the same fashion as fsync does now.

    This patch (of 2):

    Usually we suggest that applications call fsync when they want to ensure
    that all data written to the file has made it to the backing store, but
    that can be inefficient when there are a lot of open files.

    Calling syncfs on the filesystem can be more efficient in some
    situations, but the error reporting doesn't currently work the way most
    people expect. If a single inode on a filesystem reports a writeback
    error, syncfs won't necessarily return an error. syncfs only returns an
    error if __sync_blockdev fails, and on some filesystems that's a no-op.

    It would be better if syncfs reported an error if there were any
    writeback failures. Then applications could call syncfs to see if there
    are any errors on any open files, and could then call fsync on all of
    the other descriptors to figure out which one failed.

    This patch adds a new errseq_t to struct super_block, and has
    mapping_set_error also record writeback errors there.

    To report those errors, we also need to keep an errseq_t in struct file
    to act as a cursor. This patch adds a dedicated field for that purpose,
    which slots nicely into 4 bytes of padding at the end of struct file on
    x86_64.
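
    The "error sequence plus per-file cursor" idea can be illustrated with
    the toy model below. It is deliberately much simpler than the real
    errseq_t (which packs the error, a "seen" flag and a counter into one
    32-bit value); all struct and function names here are hypothetical.

```c
#include <errno.h>

/* Stands in for the errseq_t added to struct super_block. */
struct sketch_sb {
    unsigned int seq; /* bumped on every recorded writeback error */
    int err;          /* most recent error, as a negative errno */
};

/* Stands in for the cursor kept in struct file. */
struct sketch_file {
    unsigned int seen; /* value of sb->seq last observed by this file */
};

/* Record a writeback error at the superblock level. */
static void sketch_set_error(struct sketch_sb *sb, int err)
{
    sb->err = err;
    sb->seq++;
}

/* Sample the current sequence when the file is opened. */
static void sketch_open(const struct sketch_sb *sb, struct sketch_file *f)
{
    f->seen = sb->seq;
}

/* Report an error once per file if one was recorded since the last check. */
static int sketch_syncfs_check(const struct sketch_sb *sb, struct sketch_file *f)
{
    if (f->seen == sb->seq)
        return 0;
    f->seen = sb->seq;
    return sb->err;
}
```

    In this model each open file sees a given error exactly once, which is
    the behaviour the cursor in struct file is meant to provide.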

    An earlier version of this patch used an O_PATH file descriptor to cue
    the kernel that the open file should track the superblock error and not
    the inode's writeback error.

    I think that API is just too weird though. This is simpler and should
    make syncfs error reporting "just work" even if someone is multiplexing
    fsync and syncfs on the same fds.

    Signed-off-by: Jeff Layton
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Andres Freund
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: David Howells
    Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
    Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

14 May, 2020

2 commits

  • POSIX defines faccessat() as having a fourth "flags" argument, while the
    linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and
    AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

    Add a new faccessat(2) syscall with the added flags argument and implement
    both flags.

    The value of AT_EACCESS is defined in glibc headers to be the same as
    AT_REMOVEDIR. Use this value for the kernel interface as well, together
    with the explanatory comment.

    Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
    be useful and is trivial to implement.
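
    A userspace caller could reach the new syscall along the lines of the
    sketch below. It assumes syscall number 439 when the libc headers do
    not define SYS_faccessat2; the wrapper name and the fallback to libc's
    faccessat() are illustrative and not part of this patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SYS_faccessat2
#define SYS_faccessat2 439 /* assumed when libc headers predate the syscall */
#endif

/*
 * Hypothetical wrapper: check access with the flags argument handled by
 * the kernel. Returns 0 if access is allowed, -1 with errno set otherwise.
 */
static int sketch_faccessat2(int dirfd, const char *path, int mode, int flags)
{
    int ret = (int)syscall(SYS_faccessat2, dirfd, path, mode, flags);

    /* Old kernel (ENOSYS) or filtering sandbox (EPERM): use libc emulation. */
    if (ret == -1 && (errno == ENOSYS || errno == EPERM))
        ret = faccessat(dirfd, path, mode, flags);
    return ret;
}
```

    For example, sketch_faccessat2(AT_FDCWD, path, R_OK, AT_EACCESS) checks
    readability using the effective rather than the real IDs.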

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Split out a helper that overrides the credentials in preparation for
    actually doing the access check.

    This prepares for the next patch that optionally disables the creds
    override.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

03 Apr, 2020

1 commit

  • Pull vfs pathwalk sanitizing from Al Viro:
    "Massive pathwalk rewrite and cleanups.

    Several iterations have been posted; hopefully this thing is getting
    readable and understandable now. Pretty much all parts of pathname
    resolutions are affected...

    The branch is identical to what has sat in -next, except for commit
    message in "lift all calls of step_into() out of follow_dotdot/
    follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
    commit message changed there."

    * 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
    lookup_open(): don't bother with fallbacks to lookup+create
    atomic_open(): no need to pass struct open_flags anymore
    open_last_lookups(): move complete_walk() into do_open()
    open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
    open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
    open_last_lookups(): consolidate fsnotify_create() calls
    take post-lookup part of do_last() out of loop
    link_path_walk(): sample parent's i_uid and i_mode for the last component
    __nd_alloc_stack(): make it return bool
    reserve_stack(): switch to __nd_alloc_stack()
    pick_link(): take reserving space on stack into a new helper
    pick_link(): more straightforward handling of allocation failures
    fold path_to_nameidata() into its only remaining caller
    pick_link(): pass it struct path already with normal refcounting rules
    fs/namei.c: kill follow_mount()
    non-RCU analogue of the previous commit
    helper for mount rootwards traversal
    follow_dotdot(): be lazy about changing nd->path
    follow_dotdot_rcu(): be lazy about changing nd->path
    follow_dotdot{,_rcu}(): massage loops
    ...

    Linus Torvalds
     

13 Mar, 2020

1 commit

  • several iterations of ->atomic_open() calling conventions ago, we
    used to need fput() if ->atomic_open() failed at some point after
    successful finish_open(). Now (since 2016) it's not needed -
    struct file carries enough state to make fput() work regardless
    of the point in struct file lifecycle and discarding it on
    failure exits in open() got unified. Unfortunately, I'd missed
    the fact that we had an instance of ->atomic_open() (cifs one)
    that used to need that fput(), as well as the stale comment in
    finish_open() demanding such late failure handling. Trivially
    fixed...

    Fixes: fe9ec8291fca "do_last(): take fput() on error after opening to out:"
    Cc: stable@kernel.org # v4.7+
    Signed-off-by: Al Viro

    Al Viro
     

28 Feb, 2020

1 commit

  • O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
    As it is, we might or might not have LOOKUP_FOLLOW in op->intent
    in that case - that depends upon having O_NOFOLLOW in open flags.
    It doesn't matter, since we won't be checking it in that case -
    do_last() bails out earlier.

    However, making sure it's not set (i.e. acting as if we had an explicit
    O_NOFOLLOW) makes the behaviour more explicit and allows reordering the
    check for O_CREAT | O_EXCL in do_last() with the call of step_into()
    immediately following it.
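
    The userspace-visible rule referred to above can be observed with a
    small helper like the one below. This is only an illustration of the
    existing O_CREAT | O_EXCL semantics, not of the internal change; the
    helper name is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt an exclusive create at "path".
 * Returns 0 on success, or the errno value of the failed open().
 */
static int sketch_excl_create(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

    If path names a symlink, even a dangling one, the helper is expected to
    return EEXIST rather than create the link's target.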

    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2020

1 commit


18 Jan, 2020

1 commit

  • /* Background. */
    For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown flags
    are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road to
    being added to openat(2).

    Userspace also has a hard time figuring out whether a particular flag is
    supported on a particular kernel. While it is now possible with
    contemporary kernels (thanks to [3]), older kernels will expose unknown
    flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
    openat(2) time matches modern syscall designs and is far more
    fool-proof.

    In addition, the newly-added path resolution restriction LOOKUP flags
    (which we would like to expose to user-space) don't feel related to the
    pre-existing O_* flag set -- they affect all components of path lookup.
    We'd therefore like to add a new flag argument.

    Adding a new syscall allows us to finally fix the flag-ignoring problem,
    and we can make it extensible enough so that we will hopefully never
    need an openat3(2).

    /* Syscall Prototype. */
    /*
     * open_how is an extensible structure (similar in interface to
     * clone3(2) or sched_setattr(2)). The size parameter must be set to
     * sizeof(struct open_how), to allow for future extensions. All future
     * extensions will be appended to open_how, with their zero value
     * acting as a no-op default.
     */
    struct open_how { /* ... */ };

    int openat2(int dfd, const char *pathname,
                struct open_how *how, size_t size);

    /* Description. */
    The initial version of 'struct open_how' contains the following fields:

    flags
      Used to specify openat(2)-style flags. However, any unknown flag
      bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
      will result in -EINVAL. In addition, this field is 64-bits wide to
      allow for more O_ flags than currently permitted with openat(2).

    mode
      The file mode for O_CREAT or O_TMPFILE.

      Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

    resolve
      Restrict path resolution (in contrast to O_* flags they affect all
      path components). The current set of flags are as follows (at the
      moment, all of the RESOLVE_ flags are implemented as just passing
      the corresponding LOOKUP_ flag).

      RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
      RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
      RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
      RESOLVE_BENEATH       => LOOKUP_BENEATH
      RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT

    open_how does not contain an embedded size field, because it is of
    little benefit (userspace can figure out the kernel open_how size at
    runtime fairly easily without it). It also only contains u64s (even
    though ->mode arguably should be a u16) to avoid having padding fields
    which are never used in the future.
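
    As a rough illustration of how userspace might call the interface
    described above without libc support, consider the sketch below. It
    assumes syscall number 437 when SYS_openat2 is not defined, mirrors
    the initial three-field struct open_how locally, and takes the value
    0x08 for RESOLVE_BENEATH from linux/openat2.h. All sketch_ and SKETCH_
    names are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SYS_openat2
#define SYS_openat2 437 /* assumed when libc headers predate the syscall */
#endif

/* Assumed to match RESOLVE_BENEATH in linux/openat2.h. */
#define SKETCH_RESOLVE_BENEATH 0x08

/* Local mirror of the initial struct open_how: three u64 fields. */
struct sketch_open_how {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Hypothetical wrapper: returns a file descriptor, or -errno on failure. */
static int sketch_openat2(int dfd, const char *path,
                          uint64_t flags, uint64_t mode, uint64_t resolve)
{
    struct sketch_open_how how = {
        .flags = flags,
        .mode = mode,       /* must stay 0 without O_CREAT or O_TMPFILE */
        .resolve = resolve,
    };
    long fd = syscall(SYS_openat2, dfd, path, &how, sizeof(how));

    return fd < 0 ? -errno : (int)fd;
}
```

    Passing sizeof(how) is what lets the kernel tell which version of the
    structure the caller was built against.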

    Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
    is no longer permitted for openat(2). As far as I can tell, this has
    always been a bug and appears to not be used by userspace (and I've not
    seen any problems on my machines by disallowing it). If it turns out
    this breaks something, we can special-case it and only permit it for
    openat(2) but not openat2(2).

    After input from Florian Weimer, the new open_how and flag definitions
    are inside a separate header from uapi/linux/fcntl.h, to avoid problems
    that glibc has with importing that header.

    /* Testing. */
    In a follow-up patch there are over 200 selftests which ensure that this
    syscall has the correct semantics and will correctly handle several
    attack scenarios.

    In addition, I've written a userspace library[4] which provides
    convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
    because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
    must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
    syscalls). During the development of this patch, I've run numerous
    verification tests using libpathrs (showing that the API is reasonably
    usable by userspace).

    /* Future Work. */
    Additional RESOLVE_ flags have been suggested during the review period.
    These can be easily implemented separately (such as blocking auto-mount
    during resolution).

    Furthermore, there are some other proposed changes to the openat(2)
    interface (the most obvious example is magic-link hardening[5]) which
    would be a good opportunity to add a way for userspace to restrict how
    O_PATH file descriptors can be re-opened.

    Another possible avenue of future work would be some kind of
    CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
    which openat2(2) flags and fields are supported by the current kernel
    (to avoid userspace having to go through several guesses to figure it
    out).

    [1]: https://lwn.net/Articles/588444/
    [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
    [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
    [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
    [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
    [6]: https://youtu.be/ggD-eb3yPVs

    Suggested-by: Christian Brauner
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     

27 Nov, 2019

1 commit

  • This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.

    I was hoping it would be benign to switch over entirely to FMODE_STREAM,
    and we'd have just a couple of small fixups we'd need, but it looks like
    we're not quite there yet.

    While it worked fine on both my desktop and laptop, they are fairly
    similar in other respects, and run mostly the same loads. Kenneth
    Crudup reports that it seems to break both his vmware installation and
    the KDE upower service. In both cases apparently leading to timeouts
    due to waiting for the f_pos lock.

    There are a number of character devices in particular that definitely
    want stream-like behavior, but that currently don't get marked as
    streams, and as a result get the exclusion between concurrent
    read()/write() on the same file descriptor. Which doesn't work well for
    them.

    The most obvious example of this is /dev/console and /dev/tty, which use
    console_fops and tty_fops respectively (and ptmx_fops for the pty master
    side). It may be that it's just this that causes problems, but we
    clearly weren't ready yet.

    Because there's a number of other likely common cases that don't have
    llseek implementations and would seem to act as stream devices:

    /dev/fuse (fuse_dev_operations)
    /dev/mcelog (mce_chrdev_ops)
    /dev/mei0 (mei_fops)
    /dev/net/tun (tun_fops)
    /dev/nvme0 (nvme_dev_fops)
    /dev/tpm0 (tpm_fops)
    /proc/self/ns/mnt (ns_file_operations)
    /dev/snd/pcm* (snd_pcm_f_ops[])

    and while some of these could be trivially automatically detected by the
    vfs layer when the character device is opened by just noticing that they
    have no read or write operations either, it often isn't that obvious.

    Some character devices most definitely do use the file position, even if
    they don't allow seeking: the firmware update code, for example, uses
    simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
    back and forth.

    We'll revisit this when there's a better way to detect the problem and
    fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
    annotations).

    Reported-by: Kenneth R. Crudup
    Cc: Kirill Smelkov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Nov, 2019

1 commit

  • fdget_pos() is used by file operations that will read and update f_pos:
    things like "read()", "write()" and "lseek()" (but not, for example,
    "pread()/pwrite()" that get their file positions elsewhere).

    However, it had two separate escape clauses for this, because not
    everybody wants or needs serialization of the file position.

    The first and most obvious case is the "file descriptor doesn't have a
    position at all", ie a stream-like file. Except we didn't actually use
    FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that
    was that FMODE_STREAM didn't exist back in the days, but also that we
    didn't want to mark all the special cases, so we only marked the ones
    that _required_ position atomicity according to POSIX - regular files
    and directories.

    That first case was intentionally lazy, but now that we _do_ have
    FMODE_STREAM we could and should just use it. With the change to use
    FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
    the code to set it is deleted.

    Any cases where we don't want the serialization because the driver (or
    subsystem) doesn't use the file position should just be updated to do
    "stream_open()". We've done that for all the obvious and common
    situations, we may need a few more. Quoting Kirill Smelkov in the
    original FMODE_STREAM thread (see link below for full email):

    "And I appreciate if people could help at least somehow with "getting
    rid of mixed case entirely" (i.e. always lock f_pos_lock on
    !FMODE_STREAM), because this transition starts to diverge from my
    particular use-case too far. To me it makes sense to do that
    transition as follows:

    - convert nonseekable_open -> stream_open via stream_open.cocci;
    - audit other nonseekable_open calls and convert left users that
      truly don't depend on position to stream_open;
    - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
      will cover pipes and sockets), or maybe convert pipes and sockets
      to FMODE_STREAM manually;
    - extend stream_open.cocci to analyze file_operations that use
      no_llseek or noop_llseek, but do not use nonseekable_open or
      alloc_file_pseudo. This might find files that have stream semantic
      but are opened differently;
    - extend stream_open.cocci to analyze file_operations whose
      .read/.write do not use ppos at all (independently of how file was
      opened);
    - ...
    - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
      !FMODE_STREAM;
    - gather bug reports for deadlocked read/write and convert missed
      cases to FMODE_STREAM, probably extending stream_open.cocci along
      the road to catch similar cases

    i.e. always take f_pos_lock unless a file is explicitly marked as
    being stream, and try to find and cover all files that are streams"

    We have not done the "extend stream_open.cocci to analyze
    alloc_file_pseudo" as well, but the previous commit did manually handle
    the case of pipes and sockets.

    The other case where we can avoid locking f_pos is the "this file
    descriptor only has a single user and it is us, and thus there is no
    need to lock it".

    The second test was correct, although a bit subtle and worth just
    re-iterating here. There are two kinds of other sources of references
    to the same file descriptor: file descriptors that have been explicitly
    shared across fork() or with dup(), and file tables having elevated
    reference counts due to threading (or explicit file sharing with
    clone()).

    The first case would have incremented the file count explicitly, and in
    the second case the previous __fdget() would have incremented it for us
    and set the FDPUT_FPUT flag.

    But in both cases the file count would be greater than one, so the
    "file_count(file) > 1" test catches both situations. Also note that if
    file_count is 1, that also means that no other thread can have access to
    the file table, so there also cannot be races with concurrent calls to
    dup()/fork()/clone() that would increment the file count any other way.
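
    The two escape clauses described above can be sketched as a small
    userspace model. This is only an illustration of the decision, not the
    kernel's fdget_pos(): the struct, the flag value and the function name
    are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value for this sketch; not the kernel's definition. */
#define SKETCH_FMODE_STREAM 0x1u

/* Simplified stand-in for the fields of struct file that matter here. */
struct sketch_file {
	unsigned int f_mode;	/* mode bits, e.g. SKETCH_FMODE_STREAM */
	long count;		/* number of references to this file */
};

/*
 * Return true when a read()/write()/lseek() caller would have to take
 * the position lock: the file has a position (it is not a stream) and
 * somebody else may also hold a reference to it.
 */
static inline bool sketch_needs_pos_lock(const struct sketch_file *f)
{
	assert(f->count >= 1);		/* the caller itself holds a reference */

	if (f->f_mode & SKETCH_FMODE_STREAM)
		return false;		/* escape clause 1: no position at all */
	return f->count > 1;		/* escape clause 2: sole user, no lock */
}
```

    In this model a file shared via dup(), fork() or threads has a count
    above one and is locked, unless it was marked as a stream.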

    Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
    Cc: Kirill Smelkov
    Cc: Eric Dumazet
    Cc: Al Viro
    Cc: Alan Stern
    Cc: Marco Elver
    Cc: Andrea Parri
    Cc: Paul McKenney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Sep, 2019

1 commit

  • "unlikely(WARN_ON(x))" is excessive. WARN_ON() already uses unlikely()
    internally.
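
    A userspace sketch of why the outer unlikely() adds nothing. The
    macros below are simplified, hypothetical stand-ins modelled on the
    kernel ones (they rely on GNU statement expressions, so gcc or clang
    is assumed), not the kernel definitions themselves.

```c
#include <stdio.h>

/* Userspace stand-in for the kernel's branch hint. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Simplified model of WARN_ON(): the branch hint is already applied
 * inside the macro, both for the warning itself and for the value the
 * macro evaluates to.
 */
#define SKETCH_WARN_ON(cond) ({					\
	int ret_warn_on_ = !!(cond);				\
	if (sketch_unlikely(ret_warn_on_))			\
		fprintf(stderr, "WARNING at %s:%d\n",		\
			__FILE__, __LINE__);			\
	sketch_unlikely(ret_warn_on_);				\
})

/* Redundant form that the commit removes. */
static inline int sketch_check_old(int x)
{
	return sketch_unlikely(SKETCH_WARN_ON(x)) ? -1 : 0;
}

/* Equivalent plain form. */
static inline int sketch_check_new(int x)
{
	return SKETCH_WARN_ON(x) ? -1 : 0;
}
```

    Both helpers return the same result for every input; only the
    duplicated hint differs.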

    Link: http://lkml.kernel.org/r/20190829165025.15750-5-efremov@linux.com
    Signed-off-by: Denis Efremov
    Cc: Alexander Viro
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Efremov
     

25 Sep, 2019

1 commit

  • In the previous patch, an application could put part of its text section in
    THP via madvise(). These THPs will be protected from writes when the
    application is still running (TXTBSY). However, after the application
    exits, the file is available for writes.

    This patch avoids writes to file THP by dropping page cache for the file
    when the file is open for write. A new counter nr_thps is added to struct
    address_space. In do_dentry_open(), if the file is open for write and
    nr_thps is non-zero, we drop page cache for the whole file.

    Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reported-by: kbuild test robot
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

25 Jul, 2019

1 commit

  • It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
    work because it installs a temporary credential that gets allocated and
    freed for each system call.

    The allocation and freeing overhead is mostly benign, but because
    credentials can be accessed under the RCU read lock, the freeing
    involves an RCU grace period.

    Which is not a huge deal normally, but if you have a lot of access()
    calls, this causes a fair amount of secondary damage: instead of having a
    nice alloc/free pattern that hits in hot per-CPU slab caches, you have
    all those delayed free's, and on big machines with hundreds of cores,
    the RCU overhead can end up being enormous.

    But it turns out that all of this is entirely unnecessary. Exactly
    because access() only installs the credential as the thread-local
    subjective credential, the temporary cred pointer doesn't actually need
    to be RCU free'd at all. Once we're done using it, we can just free it
    synchronously and avoid all the RCU overhead.

    So add a 'non_rcu' flag to 'struct cred', which can be set by users that
    know they only use it in non-RCU context (there are other potential
    users for this). We can make it a union with the rcu freeing list head
    that we need for the RCU case, so this doesn't need any extra storage.
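
    The "no extra storage" point can be illustrated with a reduced layout.
    The structures below are hypothetical stand-ins (the real struct cred
    has many more fields); the sketch only shows that a flag placed in a
    union with the RCU list head does not grow the structure.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's RCU callback head (two pointers). */
struct sketch_rcu_head {
	struct sketch_rcu_head *next;
	void (*func)(struct sketch_rcu_head *head);
};

/* Layout before: only the RCU freeing hook. */
struct sketch_cred_old {
	int usage;
	struct sketch_rcu_head rcu;
};

/* Layout after: the flag shares storage with the hook. */
struct sketch_cred_new {
	int usage;
	union {
		int non_rcu;			/* set when RCU freeing can be skipped */
		struct sketch_rcu_head rcu;	/* used only on the RCU freeing path */
	};
};

/* The union adds no storage (C11 static_assert and anonymous union). */
static_assert(sizeof(struct sketch_cred_new) == sizeof(struct sketch_cred_old),
	      "non_rcu flag must not grow the structure");
```

    The flag is only meaningful while the object is live; once the RCU
    head is in use for freeing, the same bytes belong to the list head.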

    Note that this also makes 'get_current_cred()' clear the new non_rcu
    flag, in case we have filesystems that take a long-term reference to the
    cred and then expect the RCU delayed freeing afterwards. It's not
    entirely clear that this is required, but it makes for clear semantics:
    the subjective cred remains non-RCU as long as you only access it
    synchronously using the thread-local accessors, but you _can_ use it as
    a generic cred if you want to.

    It is possible that we should just remove the whole RCU markings for
    ->cred entirely. Only ->real_cred is really supposed to be accessed
    through RCU, and the long-term cred copies that nfs uses might want to
    explicitly re-enable RCU freeing if required, rather than have
    get_current_cred() do it implicitly.

    But this is a "minimal semantic changes" change for the immediate
    problem.

    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Eric Dumazet
    Acked-by: Paul E. McKenney
    Cc: Oleg Nesterov
    Cc: Jan Glauber
    Cc: Jiri Kosina
    Cc: Jayachandran Chandrasekharan Nair
    Cc: Greg KH
    Cc: Kees Cook
    Cc: David Howells
    Cc: Miklos Szeredi
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 May, 2019

1 commit

  • This amends commit 10dce8af3422 ("fs: stream_open - opener for
    stream-like files so that read and write can run simultaneously without
    deadlock") in how position is passed into .read()/.write() handler for
    stream-like files:

    Rasmus noticed that we currently pass 0 as position and ignore any position
    change if that is done by a file implementation. This papers over bugs if ppos
    is used in files that declare themselves as being stream-like as such bugs will
    go unnoticed. Even if a file implementation is correctly converted into using
    stream_open, its read/write later could be changed to use ppos and even though
    that won't be working correctly, that bug might go unnoticed without someone
    doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
    read/write for stream-like files as that doesn't give any chance for ppos usage
    bugs because it will oops if ppos is ever used inside .read() or .write().
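
    A minimal userspace sketch of the idea, assuming a helper that picks
    the position pointer handed to the handler. The struct, flag value and
    function name are hypothetical; only the "NULL for streams" behaviour
    is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical flag value for this sketch; not the kernel's definition. */
#define SKETCH_PPOS_FMODE_STREAM 0x1u

/* Simplified stand-in for struct file. */
struct sketch_pos_file {
	unsigned int f_mode;
	long long f_pos;
};

/*
 * Position pointer to pass to a .read()/.write() handler: stream-like
 * files get NULL, so any accidental use of the position faults
 * immediately instead of being silently ignored.
 */
static inline long long *sketch_file_ppos(struct sketch_pos_file *file)
{
	return (file->f_mode & SKETCH_PPOS_FMODE_STREAM) ? NULL : &file->f_pos;
}
```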

    Note 1: rw_verify_area, new_sync_{read,write} need to be updated
    because they are called by vfs_read/vfs_write & friends before
    file_operations .read/.write .

    Note 2: if file backend uses new-style .read_iter/.write_iter, position
    is still passed into there as non-pointer kiocb.ki_pos . Currently
    stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
    whose file_operations has *_iter methods.

    Suggested-by: Rasmus Villemoes
    Signed-off-by: Kirill Smelkov

    Kirill Smelkov
     

07 Apr, 2019

1 commit

  • …multaneously without deadlock

    Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
    locking for file.f_pos access and in particular made concurrent read and
    write not possible - now both those functions take f_pos lock for the
    whole run, and so if e.g. a read is blocked waiting for data, write will
    deadlock waiting for that read to complete.

    This caused regression for stream-like files where previously read and
    write could run simultaneously, but after that patch could not do so
    anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
    to /proc/xen/xenbus") which fixes such regression for particular case of
    /proc/xen/xenbus.

    The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
    safety for read/write/lseek and added the locking to file descriptors of
    all regular files. In 2014 that thread-safety problem was not new as it
    was already discussed earlier in 2006.

    However, even though the 2006 version of Linus's patch was adding f_pos
    locking "only for files that are marked seekable with FMODE_LSEEK (thus
    avoiding the stream-like objects like pipes and sockets)", the 2014
    version - the one that actually made it into the tree as 9c225f2655e3 -
    is doing so regardless of whether a file is seekable or not.

    See

    https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
    https://lwn.net/Articles/180387
    https://lwn.net/Articles/180396

    for historic context.

    The reason that it did so is, probably, that there are many files that
    are marked non-seekable, but e.g. their read implementation actually
    depends on knowing current position to correctly handle the read. Some
    examples:

    kernel/power/user.c snapshot_read
    fs/debugfs/file.c u32_array_read
    fs/fuse/control.c fuse_conn_waiting_read + ...
    drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
    arch/s390/hypfs/inode.c hypfs_read_iter
    ...

    Despite that, many nonseekable_open users implement read and write with
    pure stream semantics - they don't depend on passed ppos at all. And for
    those cases where read could wait for something inside, it creates a
    situation similar to xenbus - the write could be never made to go until
    read is done, and read is waiting for some, potentially external, event,
    for potentially unbounded time -> deadlock.

    Besides xenbus, there are 14 such places in the kernel that I've found
    with semantic patch (see below):

    drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
    drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
    drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
    drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
    net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
    drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
    drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
    drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
    net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
    drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
    drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
    drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
    drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
    drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

    In addition to the cases above another regression caused by f_pos
    locking is that now FUSE filesystems that implement open with
    FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
    stream-like files - for the same reason as above e.g. read can deadlock
    write locking on file.f_pos in the kernel.

    FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
    implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
    in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
    write routines not depending on current position at all, and with both
    read and write being potentially blocking operations:

    See

    https://github.com/libfuse/osspd
    https://lwn.net/Articles/308445

    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

    Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
    "somewhat pipe-like files ..." with read handler not using offset.
    However that test implements only read without write and cannot exercise
    the deadlock scenario:

    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

    I've actually hit the read vs write deadlock for real while implementing
    my FUSE filesystem where there is /head/watch file, for which open
    creates separate bidirectional socket-like stream in between filesystem
    and its user with both read and write being later performed
    simultaneously. And there it is semantically not easy to split the
    stream into two separate read-only and write-only channels:

    https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

    Let's fix this regression. The plan is:

    1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
    doing so would break many in-kernel nonseekable_open users which
    actually use ppos in read/write handlers.

    2. Add stream_open() to kernel to open stream-like non-seekable file
    descriptors. Read and write on such file descriptors would never use
    nor change ppos. And with that property on stream-like files read and
    write will be running without taking f_pos lock - i.e. read and write
    could be running simultaneously.

    3. With semantic patch search and convert to stream_open all in-kernel
    nonseekable_open users for which read and write actually do not
    depend on ppos and where there are no other methods in file_operations
    which assume @offset access.

    4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
    stream_open if that bit is present in filesystem open reply.

    It was tempting to change fs/fuse/ open handler to use stream_open
    instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
    grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
    and in particular GVFS which actually uses offset in its read and
    write handlers

    https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

    so if we would do such a change it will break a real user.

    5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
    from v3.14+ (the kernel where 9c225f2655 first appeared).

    This will allow patching OSSPD and other FUSE filesystems that
    provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
    in their open handler and this way avoid the deadlock on all kernel
    versions. This should work because fs/fuse/ ignores unknown open
    flags returned from a filesystem and so passing FOPEN_STREAM to a
    kernel that is not aware of this flag cannot hurt. In turn the kernel
    that is not aware of FOPEN_STREAM will be < v3.14 where just
    FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
    write deadlock.

    This patch adds stream_open, converts /proc/xen/xenbus to it and adds
    semantic patch to automatically locate in-kernel places that are either
    required to be converted due to read vs write deadlock, or that are just
    safe to be converted because read and write do not use ppos and there
    are no other funky methods in file_operations.
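
    The difference between the two openers in the plan above can be
    sketched as mode-bit manipulation. This is a userspace model with
    hypothetical bit values and names; that both helpers also clear the
    pread/pwrite bits is an assumption of the sketch, not something this
    message states.

```c
/* Hypothetical bit values for this sketch; the kernel's differ. */
#define SK_FMODE_LSEEK		0x01u
#define SK_FMODE_PREAD		0x02u
#define SK_FMODE_PWRITE		0x04u
#define SK_FMODE_ATOMIC_POS	0x08u
#define SK_FMODE_STREAM		0x10u

/* Simplified stand-in for struct file. */
struct sk_file {
	unsigned int f_mode;
};

/*
 * Non-seekable, but read/write handlers may still use the position,
 * so position locking (ATOMIC_POS) must stay in place (plan item 1).
 */
static inline int sk_nonseekable_open(struct sk_file *filp)
{
	filp->f_mode &= ~(SK_FMODE_LSEEK | SK_FMODE_PREAD | SK_FMODE_PWRITE);
	return 0;
}

/*
 * Stream-like: handlers never use the position, so position locking
 * is dropped as well and the file is marked as a stream (plan item 2).
 */
static inline int sk_stream_open(struct sk_file *filp)
{
	filp->f_mode &= ~(SK_FMODE_LSEEK | SK_FMODE_PREAD |
			  SK_FMODE_PWRITE | SK_FMODE_ATOMIC_POS);
	filp->f_mode |= SK_FMODE_STREAM;
	return 0;
}
```

    With position locking cleared, a blocked read no longer holds a lock
    that a concurrent write would need.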

    Regarding semantic patch I've verified each generated change manually -
    that it is correct to convert - and each other nonseekable_open instance
    left - that it is either not correct to convert there, or that it is not
    converted due to current stream_open.cocci limitations.

    The script also does not convert files that should be valid to convert,
    but that currently have .llseek = noop_llseek or generic_file_llseek for
    unknown reason despite file being opened with nonseekable_open (e.g.
    drivers/input/mousedev.c)

    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Yongzhi Pan <panyongzhi@gmail.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: David Vrabel <david.vrabel@citrix.com>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Julia Lawall <Julia.Lawall@lip6.fr>
    Cc: Nikolaus Rath <Nikolaus@rath.org>
    Cc: Han-Wen Nienhuys <hanwen@google.com>
    Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Kirill Smelkov
     

30 Mar, 2019

1 commit

  • syzbot is hitting lockdep warning [1] due to trying to open a fifo
    during an execve() operation. But we don't need to open non regular
    files during an execve() operation, for all files which we will need are
    the executable file itself and the interpreter programs like /bin/sh and
    ld-linux.so.2 .

    Since the manpage for execve(2) says that execve() returns EACCES when
    the file or a script interpreter is not a regular file, and the manpage
    for uselib(2) says that uselib() can return EACCES, and we use
    FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
    regular file is requested with FMODE_EXEC set.

    Since this deadlock followed by khungtaskd warnings is trivially
    reproducible by a local unprivileged user, and syzbot's frequent crash
    due to this deadlock defers finding other bugs, let's work around this
    deadlock until we get a chance to find a better solution.

    [1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce

    Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Reported-by: syzbot
    Fixes: 8924feff66f35fe2 ("splice: lift pipe_lock out of splice_to_pipe()")
    Signed-off-by: Tetsuo Handa
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Eric Biggers
    Cc: Dmitry Vyukov
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

22 Aug, 2018

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This contains two new features:

    - Stack file operations: this allows removal of several hacks from
    the VFS, proper interaction of read-only open files with copy-up,
    possibility to implement fs modifying ioctls properly, and others.

    - Metadata only copy-up: when file is on lower layer and only
    metadata is modified (except size) then only copy up the metadata
    and continue to use the data from the lower file"

    * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
    ovl: Enable metadata only feature
    ovl: Do not do metacopy only for ioctl modifying file attr
    ovl: Do not do metadata only copy-up for truncate operation
    ovl: add helper to force data copy-up
    ovl: Check redirect on index as well
    ovl: Set redirect on upper inode when it is linked
    ovl: Set redirect on metacopy files upon rename
    ovl: Do not set dentry type ORIGIN for broken hardlinks
    ovl: Add an inode flag OVL_CONST_INO
    ovl: Treat metacopy dentries as type OVL_PATH_MERGE
    ovl: Check redirects for metacopy files
    ovl: Move some dir related ovl_lookup_single() code in else block
    ovl: Do not expose metacopy only dentry from d_real()
    ovl: Open file with data except for the case of fsync
    ovl: Add helper ovl_inode_realdata()
    ovl: Store lower data inode in ovl_inode
    ovl: Fix ovl_getattr() to get number of blocks from lower
    ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
    ovl: Copy up meta inode data from lowest data inode
    ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
    ...

    Linus Torvalds
     

18 Jul, 2018

5 commits


12 Jul, 2018

2 commits

  • open a file by given inode, faking ->f_path. Use with shitloads
    of caution - at the very least you'd damn better make sure that
    some dentry alias of that inode is pinned down by the path in
    question. Again, this is no general-purpose interface and I hope
    it will eventually go away. Right now overlayfs wants something
    like that, but nothing else should.

    Any out-of-tree code with bright idea of using this one *will*
    eventually get hurt, with zero notice and great delight on my part.
    I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
    it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
    as "you are welcome to use it".

    Signed-off-by: Al Viro

    Al Viro
     
  • FMODE_OPENED can be used to distinguish "successful open" from the
    "called finish_no_open(), do it yourself" cases. Since finish_no_open()
    has been adjusted, no changes in the instances were actually needed.
    The caller has been adjusted.

    Acked-by: Linus Torvalds
    Signed-off-by: Al Viro

    Al Viro