20 Jan, 2021

1 commit

  • commit a0a6df9afcaf439a6b4c88a3b522e3d05fdef46f upstream.

    Unfortunately, there's userland code that used to rely upon these
    checks being done before anything else to check for UMOUNT_NOFOLLOW
    support. That broke in 41525f56e256 ("fs: refactor ksys_umount").
    Separate those from the rest of checks and move them to ksys_umount();
    unlike everything else in there, this can be sanely done there.

    Reported-by: Sargun Dhillon
    Fixes: 41525f56e256 ("fs: refactor ksys_umount")
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
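
The ordering issue described in this commit can be pictured with a small user-space model. This is only a sketch under assumptions, not the kernel code: the `SK_*` constants mirror the Linux umount2() flag values, and `sketch_lookup()` and `sketch_umount()` are hypothetical names. It shows why validating the flags before the path lookup lets a caller tell "unknown flag" apart from "bad path" even when it passes a bogus path.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Values mirror the Linux umount2() flags; the names are local to this sketch. */
#define SK_MNT_FORCE       0x1
#define SK_MNT_DETACH      0x2
#define SK_MNT_EXPIRE      0x4
#define SK_UMOUNT_NOFOLLOW 0x8
#define SK_VALID_FLAGS \
    (SK_MNT_FORCE | SK_MNT_DETACH | SK_MNT_EXPIRE | SK_UMOUNT_NOFOLLOW)

/* Hypothetical path lookup: "/mnt" is the only path that resolves. */
int sketch_lookup(const char *path)
{
    return strcmp(path, "/mnt") == 0 ? 0 : -ENOENT;
}

/* Flags are validated first, so a bad flag is reported regardless of the path. */
int sketch_umount(const char *path, int flags)
{
    if (flags & ~SK_VALID_FLAGS)
        return -EINVAL;
    return sketch_lookup(path);
}

/* Exercises the model: an unknown flag wins over a bad path. */
void sketch_umount_selfcheck(void)
{
    assert(sketch_umount("", 0x10) == -EINVAL);
    assert(sketch_umount("", SK_UMOUNT_NOFOLLOW) == -ENOENT);
    assert(sketch_umount("/mnt", SK_UMOUNT_NOFOLLOW) == 0);
}
```

If the lookup ran first, both probes with an empty path would fail the same way and the caller could no longer detect flag support.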
     

06 Jan, 2021

1 commit

  • [ Upstream commit edf7ddbf1c5eb98b720b063b73e20e8a4a1ce673 ]

    Missing calls to mntget() (or equivalently, too many calls to mntput())
    are hard to detect because mntput() delays freeing mounts using
    task_work_add(), then again using call_rcu(). As a result, mnt_count
    can often be decremented to -1 without getting a KASAN use-after-free
    report. Such cases are still bugs though, and they point to real
    use-after-frees being possible.

    For an example of this, see the bug fixed by commit 1b0b9cc8d379
    ("vfs: fsmount: add missing mntget()"), discussed at
    https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u.
    This bug *should* have been trivial to find. But actually, it wasn't
    found until syzkaller happened to use fchdir() to manipulate the
    reference count just right for the bug to be noticeable.

    Address this by making mntput_no_expire() issue a WARN if mnt_count has
    become negative.

    Suggested-by: Miklos Szeredi
    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Eric Biggers
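
The idea of warning on a negative count can be illustrated with a toy reference counter. This is a sketch under assumptions: `struct sketch_mount`, `sketch_mntget()` and `sketch_mntput()` are hypothetical user-space stand-ins, and the deferred freeing that hides the bug in the kernel is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a mount with a plain reference count. */
struct sketch_mount {
    int count;
};

void sketch_mntget(struct sketch_mount *m)
{
    m->count++;
}

/* Drops one reference; returns true if the warning fired. */
bool sketch_mntput(struct sketch_mount *m)
{
    m->count--;
    if (m->count < 0) {
        fprintf(stderr, "WARN: mount count went negative (%d)\n", m->count);
        return true;
    }
    return false;
}

/* One get and three puts on a count of 1: only the last put warns. */
void sketch_mntput_selfcheck(void)
{
    struct sketch_mount m = { .count = 1 };

    sketch_mntget(&m);
    assert(!sketch_mntput(&m));
    assert(!sketch_mntput(&m));
    assert(sketch_mntput(&m));
}
```

The unbalanced put is reported at the moment it happens instead of depending on a later use-after-free being observed.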
     

25 Oct, 2020

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff all over the place (the largest group here is
    Christoph's stat cleanups)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: remove KSTAT_QUERY_FLAGS
    fs: remove vfs_stat_set_lookup_flags
    fs: move vfs_fstatat out of line
    fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
    fs: remove vfs_statx_fd
    fs: omfs: use kmemdup() rather than kmalloc+memcpy
    [PATCH] reduce boilerplate in fsid handling
    fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
    selftests: mount: add nosymfollow tests
    Add a "nosymfollow" mount option.

    Linus Torvalds
     

18 Oct, 2020

1 commit

  • A previous commit changed the notification mode from true/false to an
    int, allowing notify-no, notify-yes, or signal-notify. This was
    backwards compatible in the sense that any existing true/false user
    would translate to either 0 (no notification sent) or 1, the latter
    of which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

    Clean this up properly, and define a proper enum for the notification
    mode. Now we have:

    - TWA_NONE. This is 0, same as before the original change, meaning no
    notification requested.
    - TWA_RESUME. This is 1, same as before the original change, meaning
    that we use TIF_NOTIFY_RESUME.
    - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
    notification.

    Clean up all the callers, switching their 0/1/false/true to using the
    appropriate TWA_* mode for notifications.

    Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Jens Axboe
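
A minimal sketch of the resulting notification modes, assuming the numeric values given in the message above. The `SK_` prefix and `sketch_mode_from_bool()` are hypothetical and only illustrate how an old true/false caller maps onto the enum.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the three modes named in the commit message. */
enum sketch_twa_mode {
    SK_TWA_NONE   = 0,  /* no notification requested */
    SK_TWA_RESUME = 1,  /* notify via TIF_NOTIFY_RESUME */
    SK_TWA_SIGNAL = 2,  /* notify via the signal path */
};

/* How a legacy bool argument translates to the new enum. */
enum sketch_twa_mode sketch_mode_from_bool(bool notify)
{
    return notify ? SK_TWA_RESUME : SK_TWA_NONE;
}

void sketch_twa_selfcheck(void)
{
    assert(sketch_mode_from_bool(false) == SK_TWA_NONE);
    assert(sketch_mode_from_bool(true) == SK_TWA_RESUME);
}
```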
     

13 Oct, 2020

1 commit

  • Pull compat mount cleanups from Al Viro:
    "The last remnants of mount(2) compat buried by Christoph.

    Buried into NFS, that is.

    Generally I'm less enthusiastic about "let's use in_compat_syscall()
    deep in call chain" kind of approach than Christoph seems to be, but
    in this case it's warranted - that had been an NFS-specific wart,
    hopefully not to be repeated in any other filesystems (read: any new
    filesystem introducing non-text mount options will get NAKed even if
    it doesn't mess the layout up).

    IOW, not worth trying to grow an infrastructure that would avoid that
    use of in_compat_syscall()..."

    * 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: remove compat_sys_mount
    fs,nfs: lift compat nfs4 mount data handling into the nfs code
    nfs: simplify nfs4_parse_monolithic

    Linus Torvalds
     

23 Sep, 2020

1 commit


04 Sep, 2020

1 commit

  • The copy_mount_options() function takes a user pointer argument but no
    size and it tries to read up to a PAGE_SIZE. However, copy_from_user()
    is not guaranteed to return all the accessible bytes if, for example,
    the access crosses a page boundary and gets a fault on the second page.
    To work around this, the current copy_mount_options() implementation
    performs two copy_from_user() passes, first to the end of the current
    page and the second to what's left in the subsequent page.

    On arm64 with MTE enabled, access to a user page may trigger a fault
    after part of the buffer in a page has been copied (when the user
    pointer tag, bits 56-59, no longer matches the allocation tag stored in
    memory). Allow copy_mount_options() to handle such intra-page faults by
    resorting to byte at a time copy in case of copy_from_user() failure.

    Note that copy_from_user() handles the zeroing of the kernel buffer in
    case of error.

    Signed-off-by: Catalin Marinas
    Cc: Alexander Viro

    Catalin Marinas
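
The byte-at-a-time fallback can be sketched in user space. Everything here is an assumption made for illustration: `struct sketch_user_buf` models a user buffer whose first `readable` bytes are accessible, `sketch_copy_from_user()` is a deliberately conservative bulk copy that gives up on a whole 8-byte chunk when any byte in it faults, and `sketch_get_user()` reads a single byte. None of these are the kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of a user buffer: only the first `readable` bytes can be accessed. */
struct sketch_user_buf {
    const char *data;
    size_t readable;
};

/* Bulk copy in 8-byte chunks; stops at the first chunk that would fault,
 * zeroes the rest of dst, and returns the number of bytes NOT copied. */
size_t sketch_copy_from_user(char *dst, const struct sketch_user_buf *u, size_t n)
{
    size_t done = 0;

    while (n - done >= 8 && done + 8 <= u->readable) {
        memcpy(dst + done, u->data + done, 8);
        done += 8;
    }
    /* Final short chunk, copied only if it is fully readable. */
    if (n - done < 8 && n <= u->readable) {
        memcpy(dst + done, u->data + done, n - done);
        done = n;
    }
    memset(dst + done, 0, n - done);
    return n - done;
}

/* Single-byte read; returns 0 on success, -1 on a simulated fault. */
int sketch_get_user(char *c, const struct sketch_user_buf *u, size_t off)
{
    if (off >= u->readable)
        return -1;
    *c = u->data[off];
    return 0;
}

/* Bulk copy first, then recover the remaining accessible bytes one by one.
 * Returns the number of bytes actually copied. */
size_t sketch_copy_options(char *dst, const struct sketch_user_buf *u, size_t n)
{
    size_t left = sketch_copy_from_user(dst, u, n);
    size_t off = n - left;

    while (left) {
        char c;

        if (sketch_get_user(&c, u, off))
            break;
        dst[off++] = c;
        left--;
    }
    return n - left;
}

void sketch_copy_selfcheck(void)
{
    const char src[] = "abcdefghijklmnopqrstuvwxyz012345";
    struct sketch_user_buf u = { src, 13 };
    char dst[32];

    assert(sketch_copy_options(dst, &u, sizeof(dst)) == 13);
    assert(memcmp(dst, src, 13) == 0);
    assert(dst[13] == 0);
}
```

In this model the bulk copy alone recovers only 8 of the 13 readable bytes; the fallback loop picks up the remaining 5 and leaves the zeroed tail untouched.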
     

28 Aug, 2020

1 commit

  • For mounts that have the new "nosymfollow" option, don't follow symlinks
    when resolving paths. The new option is similar in spirit to the
    existing "nodev", "noexec", and "nosuid" options, as well as to the
    LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
    variants have been supporting the "nosymfollow" mount option for a long
    time with equivalent implementations.

    Note that symlinks may still be created on file systems mounted with
    the "nosymfollow" option present. readlink() remains functional, so
    user space code that is aware of symlinks can still choose to follow
    them explicitly.

    Setting the "nosymfollow" mount option helps prevent privileged
    writers from modifying files unintentionally in case there is an
    unexpected link along the accessed path. The "nosymfollow" option is
    thus useful as a defensive measure for systems that need to deal with
    untrusted file systems in privileged contexts.

    More information on the history and motivation for this patch can be
    found here:

    https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal

    Signed-off-by: Mattias Nissler
    Signed-off-by: Ross Zwisler
    Reviewed-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Mattias Nissler
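
The effect on path resolution can be modeled with a toy walk over path components. This is a sketch under assumptions: `struct sketch_component` and `sketch_walk()` are hypothetical, and returning -ELOOP for a refused symlink is an assumption of this model rather than something the message above states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* One path component and whether it is a symlink. */
struct sketch_component {
    const char *name;
    bool is_symlink;
};

/* Walks the components; with nosymfollow set, any symlink stops the walk. */
int sketch_walk(const struct sketch_component *c, size_t n, bool nosymfollow)
{
    for (size_t i = 0; i < n; i++) {
        if (c[i].is_symlink && nosymfollow)
            return -ELOOP; /* assumed error code for this sketch */
    }
    return 0;
}

void sketch_walk_selfcheck(void)
{
    const struct sketch_component path[] = {
        { "var", false }, { "link", true }, { "file", false },
    };

    assert(sketch_walk(path, 3, false) == 0);
    assert(sketch_walk(path, 3, true) == -ELOOP);
}
```

Reading the link target itself is outside this model, which matches the note that readlink() remains functional.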
     

08 Aug, 2020

3 commits

  • Pull mount leak fix from Al Viro:
    "Regression fix for the syscalls-for-init series - fix a leak of a 'struct path'"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: fix a struct path leak in path_umount

    Linus Torvalds
     
  • Make sure we also put the dentry and vfsmnt in the illegal flags
    and !may_umount cases.

    Fixes: 41525f56e256 ("fs: refactor ksys_umount")
    Reported-by: Vikas Kumar
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Pull init and set_fs() cleanups from Al Viro:
    "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

    * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
    init: add an init_dup helper
    init: add an init_utimes helper
    init: add an init_stat helper
    init: add an init_mknod helper
    init: add an init_mkdir helper
    init: add an init_symlink helper
    init: add an init_link helper
    init: add an init_eaccess helper
    init: add an init_chmod helper
    init: add an init_chown helper
    init: add an init_chroot helper
    init: add an init_chdir helper
    init: add an init_rmdir helper
    init: add an init_unlink helper
    init: add an init_umount helper
    init: add an init_mount helper
    init: mark create_dev as __init
    init: mark console_on_rootfs as __init
    init: initialize ramdisk_execute_command at compile time
    devtmpfs: refactor devtmpfsd()
    ...

    Linus Torvalds
     

31 Jul, 2020

4 commits

  • Like ksys_umount, but takes a kernel pointer for the destination path.
    Switch over the umount in the init code, which just happens to work due to
    the implicit set_fs(KERNEL_DS) during early init right now.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Like do_mount, but takes a kernel pointer for the destination path.
    Switch over the mounts in the init code and devtmpfs to it, which
    just happen to work due to the implicit set_fs(KERNEL_DS) during early
    init right now.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Factor out a path_umount helper that takes a struct path * instead of the
    actual file name. This will allow converting the init and devtmpfs code
    to properly unmount based on a kernel pointer instead of relying on the
    implicit set_fs(KERNEL_DS) during early init.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Factor out a path_mount helper that takes a struct path * instead of the
    actual file name. This will allow converting the init and devtmpfs code
    to properly mount based on a kernel pointer instead of relying on the
    implicit set_fs(KERNEL_DS) during early init.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

14 Jul, 2020

1 commit

  • Previous patch changed handling of remount/reconfigure to ignore all
    options, including those that are unknown to the fuse kernel fs. This was
    done for backward compatibility, but this likely only affects the old
    mount(2) API.

    The new fsconfig(2) based reconfiguration could possibly be improved. This
    would make the new API less of a drop in replacement for the old, OTOH this
    is a good chance to get rid of some weirdnesses in the old API.

    Several other behaviors might make sense:

    1) unknown options are rejected, known options are ignored

    2) unknown options are rejected, known options are rejected if the value
    is changed, allowed otherwise

    3) all options are rejected

    Prior to the backward compatibility fix to ignore all options all known
    options were accepted (1), even if they change the value of a mount
    parameter; fuse_reconfigure() does not look at the config values set by
    fuse_parse_param().

    To fix that we'd need to verify that the value provided is the same as set
    in the initial configuration (2). The major drawback is that this is much
    more complex than just rejecting all attempts at changing options (3);
    i.e. all options signify initial configuration values and don't make sense
    on reconfigure.

    This patch opts for (3) with the rationale that no mount options are
    reconfigurable in fuse.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
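
The three candidate behaviors listed in the message can be compared side by side in a small decision function. This is a sketch under assumptions: the enum, `sketch_reconfigure()` and the use of -EINVAL for a rejected option are hypothetical and are not taken from the fuse code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* The numbering follows options (1), (2) and (3) in the message. */
enum sketch_policy {
    SK_POLICY_IGNORE_KNOWN   = 1, /* reject unknown, ignore known */
    SK_POLICY_REJECT_CHANGED = 2, /* reject unknown and changed values */
    SK_POLICY_REJECT_ALL     = 3, /* reject every option */
};

/* Returns 0 if the option is accepted on reconfigure, -EINVAL otherwise. */
int sketch_reconfigure(enum sketch_policy p, bool known, bool value_changed)
{
    if (p == SK_POLICY_REJECT_ALL)
        return -EINVAL;
    if (!known)
        return -EINVAL;
    if (p == SK_POLICY_REJECT_CHANGED && value_changed)
        return -EINVAL;
    return 0;
}

void sketch_reconfigure_selfcheck(void)
{
    assert(sketch_reconfigure(SK_POLICY_IGNORE_KNOWN, true, true) == 0);
    assert(sketch_reconfigure(SK_POLICY_REJECT_CHANGED, true, true) == -EINVAL);
    assert(sketch_reconfigure(SK_POLICY_REJECT_ALL, true, false) == -EINVAL);
}
```

Policy (3) needs no knowledge of the initial configuration, which is the simplicity argument the message makes for it.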
     

11 Jun, 2020

1 commit


10 Jun, 2020

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "Fixes:

    - Resolve mount option conflicts consistently

    - Sync before remount R/O

    - Fix file handle encoding corner cases

    - Fix metacopy related issues

    - Fix an uninitialized return value

    - Add missing permission checks for underlying layers

    Optimizations:

    - Allow multiple whiteouts to share an inode

    - Optimize small writes by inheriting SB_NOSEC from upper layer

    - Do not call ->syncfs() multiple times for sync(2)

    - Do not cache negative lookups on upper layer

    - Make private internal mounts longterm"

    * tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (27 commits)
    ovl: remove unnecessary lock check
    ovl: make oip->index bool
    ovl: only pass ->ki_flags to ovl_iocb_to_rwf()
    ovl: make private mounts longterm
    ovl: get rid of redundant members in struct ovl_fs
    ovl: add accessor for ofs->upper_mnt
    ovl: initialize error in ovl_copy_xattr
    ovl: drop negative dentry in upper layer
    ovl: check permission to open real file
    ovl: call secutiry hook in ovl_real_ioctl()
    ovl: verify permissions in ovl_path_open()
    ovl: switch to mounter creds in readdir
    ovl: pass correct flags for opening real directory
    ovl: fix redirect traversal on metacopy dentries
    ovl: initialize OVL_UPPERDATA in ovl_lookup()
    ovl: use only uppermetacopy state in ovl_lookup()
    ovl: simplify setting of origin for index lookup
    ovl: fix out of bounds access warning in ovl_check_fb_len()
    ovl: return required buffer size for file handles
    ovl: sync dirty data when remounting to ro mode
    ...

    Linus Torvalds
     

04 Jun, 2020

2 commits

  • Overlayfs is using clone_private_mount() to create internal mounts for
    underlying layers. These are used for operations requiring a path, such as
    dentry_open().

    Since these private mounts are not in any namespace they are treated as
    short term, "detached" mounts and mntput() involves taking the global
    mount_lock, which can result in serious cacheline pingpong.

    Make these private mounts longterm instead, which trade the penalty on
    mntput() for a slightly longer shutdown time due to an added RCU grace
    period when putting these mounts.

    Introduce a new helper kern_unmount_many() that can take care of multiple
    longterm mounts with a single RCU grace period.

    Cc: Al Viro
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Pull thread updates from Christian Brauner:
    "We have been discussing using pidfds to attach to namespaces for quite
    a while and the patches have in one form or another already existed
    for about a year. But I wanted to wait to see how the general api
    would be received and adopted.

    This contains the changes to make it possible to use pidfds to attach
    to the namespaces of a process, i.e. they can be passed as the first
    argument to the setns() syscall.

    When only a single namespace type is specified the semantics are
    equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
    equals setns(pidfd, CLONE_NEWNET).

    However, when a pidfd is passed, multiple namespace flags can be
    specified in the second setns() argument and setns() will attach the
    caller to all the specified namespaces all at once or to none of them.

    Specifying 0 is not valid together with a pidfd. Here are just two
    obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

    Allowing to also attach subsets of namespaces supports various
    use-cases where callers setns to a subset of namespaces to retain
    privilege, perform an action and then re-attach another subset of
    namespaces.

    Apart from significantly reducing the number of syscalls needed to
    attach to all currently supported namespaces (eight "open+setns"
    sequences vs just a single "setns()"), this also allows atomic setns
    to a set of namespaces, i.e. either attaching to all namespaces
    succeeds or we fail without having changed anything.

    This is centered around a new internal struct nsset which holds all
    information necessary for a task to switch to a new set of namespaces
    atomically. Fwiw, with this change a pidfd becomes the only token
    needed to interact with a container. I'm expecting this to be
    picked-up by util-linux for nsenter rather soon.

    Associated with this change is a shiny new test-suite dedicated to
    setns() (for pidfds and nsfds alike)"

    * tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests/pidfd: add pidfd setns tests
    nsproxy: attach to namespaces via pidfds
    nsproxy: add struct nsset

    Linus Torvalds
     

02 Jun, 2020

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted patches from Miklos.

    An interesting part here is /proc/mounts stuff..."

    The "/proc/mounts stuff" is using a cursor for keeping the location
    data while traversing the mount listing.

    Also probably worth noting is the addition of faccessat2(), which takes
    an additional set of flags to specify how the lookup is done
    (AT_EACCESS, AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH).

    * 'from-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: add faccessat2 syscall
    vfs: don't parse "silent" option
    vfs: don't parse "posixacl" option
    vfs: don't parse forbidden flags
    statx: add mount_root
    statx: add mount ID
    statx: don't clear STATX_ATIME on SB_RDONLY
    uapi: deprecate STATX_ALL
    utimensat: AT_EMPTY_PATH support
    vfs: split out access_override_creds()
    proc/mounts: add cursor
    aio: fix async fsync creds
    vfs: allow unprivileged whiteout creation

    Linus Torvalds
     

29 May, 2020

1 commit

    This function acts as an out-of-line helper for is_local_mountpoint() and
    is only called after the latter verifies the dentry is not a mountpoint.
    There are no semantic changes and the resulting object code is smaller:

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-26 (-26)
    Function old new delta
    __is_local_mountpoint 147 121 -26
    Total: Before=34161, After=34135, chg -0.08%

    Signed-off-by: Nikolay Borisov
    Signed-off-by: Al Viro

    Nikolay Borisov
     

14 May, 2020

1 commit

  • If mounts are deleted after a read(2) call on /proc/self/mounts (or its
    kin), the subsequent read(2) could miss a mount that comes after the
    deleted one in the list. This is because the file position is interpreted
    as the number of mount entries from the start of the list.

    E.g. first read gets entries #0 to #9; the seq file index will be 10. Then
    entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10,
    etc... The next read will continue from entry #10, and #9 is missed.

    Solve this by adding a cursor entry for each open instance. Taking the
    global namespace_sem for write seems excessive, since we are only dealing
    with a per-namespace list. Instead add a per-namespace spinlock and use
    that together with namespace_sem taken for read to protect against
    concurrent modification of the mount list. This may reduce parallelism of
    is_local_mountpoint(), but it's hardly a big contention point. We could
    also use RCU freeing of cursors to make traversal not need additional
    locks, if that turns out to be necessary.

    Only move the cursor once for each read (cursor is not added on open) to
    minimize cacheline invalidation. When EOF is reached, the cursor is taken
    off the list, in order to prevent an excessive number of cursors due to
    inactive open file descriptors.

    Reported-by: Karel Zak
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

13 May, 2020

1 commit

  • For quite a while we have been thinking about using pidfds to attach to
    namespaces. This patchset has existed for about a year already but we've
    wanted to wait to see how the general api would be received and adopted.
    Now that more and more programs in userspace have started using pidfds
    for process management it's time to send this one out.

    This patch makes it possible to use pidfds to attach to the namespaces
    of another process, i.e. they can be passed as the first argument to the
    setns() syscall. When only a single namespace type is specified the
    semantics are equivalent to passing an nsfd. That means
    setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
    when a pidfd is passed, multiple namespace flags can be specified in the
    second setns() argument and setns() will attach the caller to all the
    specified namespaces all at once or to none of them. Specifying 0 is not
    valid together with a pidfd.

    Here are just two obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

    Allowing to also attach subsets of namespaces supports various use-cases
    where callers setns to a subset of namespaces to retain privilege, perform
    an action and then re-attach another subset of namespaces.

    If the need arises, as Eric suggested, we can extend this patchset to
    assume even more context than just attaching all namespaces. His suggestion
    specifically was about assuming the process' root directory when
    setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
    keep it flexible in terms of supporting subsets of namespaces but let's
    wait until we have users asking for even more context to be assumed. At
    that point we can add an extension.

    The obvious example where this is useful is a standard container
    manager interacting with a running container: pushing and pulling files
    or directories, injecting mounts, attaching/execing any kind of process,
    managing network devices; all these operations require attaching to all
    or at least multiple namespaces at the same time. Given that nowadays
    most containers are spawned with all namespaces enabled we're currently
    looking at at least 14 syscalls, 7 to open the /proc//ns/
    nsfds, another 7 to actually perform the namespace switch. With time
    namespaces we're looking at about 16 syscalls.
    (We could amortize the first 7 or 8 syscalls for opening the nsfds by
    stashing them in each container's monitor process but that would mean
    we need to send around those file descriptors through unix sockets
    every time we want to interact with the container or keep on-disk
    state. Even in scenarios where a caller wants to join a particular
    namespace in a particular order callers still profit from batching
    other namespaces. That mostly applies to the user namespace but
    all container runtimes I found join the user namespace first no matter
    if it privileges or deprivileges the container similar to how unshare
    behaves.)
    With pidfds this becomes a single syscall no matter how many namespaces
    are supposed to be attached to.

    A decently designed, large-scale container manager usually isn't the
    parent of any of the containers it spawns so the containers don't die
    when it crashes or needs to update or reinitialize. This means that
    for the manager to interact with containers through pids is inherently
    racy especially on systems where the maximum pid number is not
    significantly bumped. This is even more problematic since we often spawn
    and manage thousands or ten-thousands of containers. Interacting with a
    container through a pid thus can become risky quite quickly. Especially
    since we allow for an administrator to enable advanced features such as
    syscall interception where we're performing syscalls in lieu of the
    container. In all of those cases we use pidfds if they are available and
    we pass them around as stable references. Using them to setns() to the
    target process' namespaces is as reliable as using nsfds. Either the
    target process is already dead and we get ESRCH or we manage to attach
    to its namespaces but we can't accidentally attach to another process'
    namespaces. So pidfds lend themselves to be used with this api.
    The other main advantage is that with this change the pidfd becomes the
    only relevant token for most container interactions and it's the only
    token we need to create and send around.

    Apart from significantly reducing the number of syscalls from double
    digit to single digit which is a decent reason post-spectre/meltdown
    this also allows to switch to a set of namespaces atomically, i.e.
    either attaching to all the specified namespaces succeeds or we fail. If
    we fail we haven't changed a single namespace. There are currently three
    namespaces that can fail (other than for ENOMEM which really is not
    very interesting since we then have other problems anyway) for
    non-trivial reasons, user, mount, and pid namespaces. We can fail to
    attach to a pid namespace if it is not our current active pid namespace
    or a descendant of it. We can fail to attach to a user namespace because
    we are multi-threaded or because our current mount namespace shares
    filesystem state with other tasks, or because we're trying to setns()
    to the same user namespace, i.e. the target task has the same user
    namespace as we do. We can fail to attach to a mount namespace because
    it shares filesystem state with other tasks or because we fail to lookup
    the new root for the new mount namespace. In most non-pathological
    scenarios these issues can be somewhat mitigated. But there are cases where
    we're half-attached to some namespace and failing to attach to another one.
    I've talked about some of these problems during the hallway track (something
    only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
    in 2018(?). Even if all these issues could be avoided with super careful
    userspace coding it would be nicer to have this done in-kernel. Pidfds seem
    to lend themselves nicely for this.

    The other neat thing about this is that setns() becomes an actual
    counterpart to the namespace bits of unshare().

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com

    Christian Brauner
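
From user space, the single-call attach described above might look like the following. This is a sketch under assumptions: `sketch_attach()` is a hypothetical helper, raw syscalls are used so the example does not depend on libc wrappers being available, and the fallback syscall number 434 for `pidfd_open` is assumed for older headers. Actually switching namespaces needs a kernel with this change and sufficient privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumed generic syscall number for old headers */
#endif

/* Opens a pidfd for `pid` and attaches the caller to the namespaces of that
 * process selected by `nsflags` (a mask of CLONE_NEW* values).
 * Returns 0 on success, -1 on failure with errno set. */
int sketch_attach(pid_t pid, int nsflags)
{
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    int ret;

    if (pidfd < 0)
        return -1;
    ret = (int)syscall(SYS_setns, pidfd, nsflags);
    close(pidfd);
    return ret;
}

/* A flags value of 0 is never accepted together with a pidfd, so this call
 * fails whether or not the running kernel supports pidfds in setns(). */
void sketch_attach_selfcheck(void)
{
    assert(sketch_attach(getpid(), 0) == -1);
}
```

A container manager would call it with a mask such as `CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET`, taken from `<sched.h>`, and treat a failure as "nothing was changed".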
     

09 May, 2020

1 commit

  • Add a simple struct nsset. It holds all necessary pieces to switch to a new
    set of namespaces without leaving a task in a half-switched state which we
    will make use of in the next patch. This patch switches the existing setns
    logic over without causing a change in setns() behavior. This brings
    setns() closer to how unshare() works. The prepare_ns() function is
    responsible for preparing all necessary information. This has two reasons.
    First it minimizes dependencies between individual namespaces, i.e. all
    install handlers can expect that all fields are properly initialized
    independent of the order in which they are called. Second, this makes the code
    easier to maintain and easier to follow if it needs to be changed.

    The prepare_ns() helper will only be switched over to use a flags argument
    in the next patch. Here it will still use nstype as a simple integer
    argument which was argued would be clearer. I'm not particularly
    opinionated about this if it really helps or not. The struct nsset itself
    already contains the flags field since its name already indicates that it
    can contain information required by different namespaces. None of this
    should have functional consequences.

    Signed-off-by: Christian Brauner
    Reviewed-by: Serge Hallyn
    Cc: Eric W. Biederman
    Cc: Serge Hallyn
    Cc: Jann Horn
    Cc: Michael Kerrisk
    Cc: Aleksa Sarai
    Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com

    Christian Brauner
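
The "stage everything, then commit" idea behind struct nsset can be modeled in a few lines. This is a sketch under assumptions: namespaces are reduced to integer IDs, a negative target stands for "cannot be installed", and `struct sketch_nsset`, `sketch_install()` and `sketch_setns()` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

enum { SK_NS_USER, SK_NS_MNT, SK_NS_PID, SK_NS_COUNT };

/* A task and the namespace IDs it currently lives in. */
struct sketch_task {
    int ns[SK_NS_COUNT];
};

/* Staging area: nothing here is visible to the task until commit. */
struct sketch_nsset {
    unsigned flags;          /* bit i set: namespace i will be switched */
    int staged[SK_NS_COUNT];
};

/* Validates and stages one namespace; a negative target is refused. */
bool sketch_install(struct sketch_nsset *set, int which, int target)
{
    if (target < 0)
        return false;
    set->staged[which] = target;
    set->flags |= 1u << which;
    return true;
}

/* Switches all requested namespaces or none of them. */
int sketch_setns(struct sketch_task *t, const int *targets, unsigned want)
{
    struct sketch_nsset set = { 0 };

    for (int i = 0; i < SK_NS_COUNT; i++) {
        if ((want & (1u << i)) && !sketch_install(&set, i, targets[i]))
            return -1; /* the task has not been touched yet */
    }
    for (int i = 0; i < SK_NS_COUNT; i++) {
        if (set.flags & (1u << i))
            t->ns[i] = set.staged[i];
    }
    return 0;
}

void sketch_nsset_selfcheck(void)
{
    struct sketch_task t = { { 1, 1, 1 } };
    const int bad[SK_NS_COUNT] = { 2, 2, -1 };

    assert(sketch_setns(&t, bad, 0x7) == -1);
    assert(t.ns[SK_NS_USER] == 1 && t.ns[SK_NS_MNT] == 1 && t.ns[SK_NS_PID] == 1);
}
```

A failure while staging leaves the task exactly as it was, which is the "no half-switched state" property the message describes.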
     

21 Apr, 2020

1 commit

  • Some filesystem references got broken by a previous patch
    series I submitted. Address those.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: David Sterba # fs/affs/Kconfig
    Link: https://lore.kernel.org/r/57318c53008dbda7f6f4a5a9e5787f4d37e8565a.1586881715.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

14 Mar, 2020

1 commit


28 Feb, 2020

2 commits

  • 1) no instances of ->d_automount() have ever made use of the "return
    ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
    a rudiment of plans that got superseded before the thing went into
    the tree. Despite the comment in follow_automount(), autofs has
    never done that.

    2) if there's no ->d_automount() in dentry_operations, filesystems
    should not set DCACHE_NEED_AUTOMOUNT in the first place. None have
    ever done so...

    Signed-off-by: Al Viro

    Al Viro
     
  • Protection against automount/automount races (two threads hitting the same
    referral point at the same time) is based upon do_add_mount() prevention of
    identical overmounts - trying to overmount the root of mounted tree with
    the same tree fails with -EBUSY. It's unreliable (the other thread might've
    mounted something on top of the automount it has triggered) *and* causes
    no end of headache for follow_automount() and its caller, since
    finish_automount() behaves like do_new_mount() - if the mountpoint to be is
    overmounted, it mounts on top what's overmounting it. It's not only wrong
    (we want to go into what's overmounting the automount point and quietly
    discard what we planned to mount there), it introduces the possibility of
    original parent mount getting dropped. That's what 8aef18845266 (VFS: Fix
    vfsmount overput on simultaneous automount) deals with, but it can't do
    anything about the reliability of conflict detection - if something had
    been overmounted the other thread's automount (e.g. that other thread
    having stepped into automount in mount(2)), we don't get that -EBUSY and
    the result is
    referral point under automounted NFS under explicit overmount
    under another copy of automounted NFS

    What we need is finish_automount() *NOT* digging into overmounts - if it
    finds one, it should just quietly discard the thing it was asked to mount.
    And don't bother with actually crossing into the results of finish_automount() -
    the same loop that calls follow_automount() will do that just fine on the
    next iteration.

    IOW, instead of calling lock_mount() have finish_automount() do it manually,
    _without_ the "move into overmount and retry" part. And leave crossing into
    the results to the caller of follow_automount(), which simplifies it a lot.

    Moral: if you end up with a lot of glue working around the calling conventions
    of something, perhaps these calling conventions are simply wrong...

    Fixes: 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount)
    Signed-off-by: Al Viro

    Al Viro
     

11 Feb, 2020

1 commit


04 Feb, 2020

1 commit


05 Jan, 2020

1 commit

  • Make to_mnt_ns() static to address the following 'sparse' warning:

    fs/namespace.c:1731:22: warning: symbol 'to_mnt_ns' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191209234830.156260-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

12 Dec, 2019

1 commit

  • In prepare_namespace(), do_mount() can be used instead of ksys_mount()
    as the first and third argument are const strings in the kernel, the
    second and fourth argument are passed through anyway, and the fifth
    argument is NULL.

    In do_mount_root(), ksys_mount() is called with the first and third
    argument being already kernelspace strings, which do not need to be
    copied over from userspace to kernelspace (again). The second and
    fourth arguments are passed through to do_mount() anyway. The fifth
    argument, while already residing in kernelspace, needs to be put into
    a page of its own. Then, do_mount() can be used instead of
    ksys_mount().

    Once this is done, there are no in-kernel users to ksys_mount() left,
    which can therefore be removed.

    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

09 Dec, 2019

1 commit


22 Oct, 2019

1 commit

  • The open_tree and move_mount syscalls take names from the
    user, so add the __user annotation to these arguments to fix
    the following warnings from sparse:

    fs/namespace.c:2392:35: warning: incorrect type in argument 2 (different address spaces)
    fs/namespace.c:2392:35: expected char const [noderef] *name
    fs/namespace.c:2392:35: got char const *filename
    fs/namespace.c:3541:38: warning: incorrect type in argument 2 (different address spaces)
    fs/namespace.c:3541:38: expected char const [noderef] *name
    fs/namespace.c:3541:38: got char const *from_pathname
    fs/namespace.c:3550:36: warning: incorrect type in argument 2 (different address spaces)
    fs/namespace.c:3550:36: expected char const [noderef] *name
    fs/namespace.c:3550:36: got char const *to_pathname
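
A minimal sketch of the kind of annotation mismatch sparse reports above, assuming a userspace stand-in for the kernel's __user macro. The names sketch_user, sketch_lookup() and sketch_open_tree() are hypothetical and exist only for illustration; only under sparse (__CHECKER__ defined) does the macro expand to the noderef/address_space attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's __user: visible to sparse, empty for the compiler. */
#ifdef __CHECKER__
# define sketch_user __attribute__((noderef, address_space(1)))
#else
# define sketch_user
#endif

/* Hypothetical callee that expects a pointer into user memory. */
static int sketch_lookup(const char sketch_user *name)
{
    /* Only the pointer value is inspected; user memory is never dereferenced. */
    return name != NULL;
}

/*
 * Hypothetical syscall-style wrapper. If 'filename' lacked the annotation,
 * sparse would flag the call below as "different address spaces".
 */
static int sketch_open_tree(const char sketch_user *filename)
{
    return sketch_lookup(filename);
}
```

With the annotation present on both the wrapper's parameter and the callee's parameter, the address spaces agree and the warning goes away; the generated code is unchanged.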

    Signed-off-by: Ben Dooks
    Signed-off-by: Al Viro

    Ben Dooks
     

17 Oct, 2019

1 commit

  • After do_add_mount() returns success, the caller doesn't hold a
    reference to the 'struct mount' anymore. So it's invalid to access it
    in mnt_warn_timestamp_expiry().

    Fix it by calling mnt_warn_timestamp_expiry() before do_add_mount()
    rather than after, and adjusting the warning message accordingly.

    Reported-by: syzbot+da4f525235510683d855@syzkaller.appspotmail.com
    Fixes: f8b92ba67c5d ("mount: Add mount warning for impending timestamp expiry")
    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

27 Sep, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - almost all of the rest of -mm

    - various other subsystems

    Subsystems affected by this patch series:
    memcg, misc, core-kernel, lib, checkpatch, reiserfs, fat, fork,
    cpumask, kexec, uaccess, kconfig, kgdb, bug, ipc, lzo, kasan, madvise,
    cleanups, pagemap

    * emailed patches from Andrew Morton : (77 commits)
    arch/sparc/include/asm/pgtable_64.h: fix build
    mm: treewide: clarify pgtable_page_{ctor,dtor}() naming
    ntfs: remove (un)?likely() from IS_ERR() conditions
    IB/hfi1: remove unlikely() from IS_ERR*() condition
    xfs: remove unlikely() from WARN_ON() condition
    wimax/i2400m: remove unlikely() from WARN*() condition
    fs: remove unlikely() from WARN_ON() condition
    xen/events: remove unlikely() from WARN() condition
    checkpatch: check for nested (un)?likely() calls
    hexagon: drop empty and unused free_initrd_mem
    mm: factor out common parts between MADV_COLD and MADV_PAGEOUT
    mm: introduce MADV_PAGEOUT
    mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
    mm: introduce MADV_COLD
    mm: untag user pointers in mmap/munmap/mremap/brk
    vfio/type1: untag user pointers in vaddr_get_pfn
    tee/shm: untag user pointers in tee_shm_register
    media/v4l2-core: untag user pointers in videobuf_dma_contig_user_get
    drm/radeon: untag user pointers in radeon_gem_userptr_ioctl
    drm/amdgpu: untag user pointers
    ...

    Linus Torvalds
     

26 Sep, 2019

2 commits

  • This patch is part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    In copy_mount_options a user address is being subtracted from TASK_SIZE.
    If the address is lower than TASK_SIZE, the size is calculated to not
    allow the exact_copy_from_user() call to cross TASK_SIZE boundary.
    However if the address is tagged, then the size will be calculated
    incorrectly.

    Untag the address before subtracting.
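
The size calculation described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: the constants (a 48-bit user address space, a 4096-byte page, the tag in bits 56..63) and the sketch_* names are hypothetical, and sketch_untag() only approximates what an untagging helper does for user pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants, not the kernel's real values. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_TASK_SIZE (1ULL << 48)  /* assumed 48-bit user address space */
#define SKETCH_TAG_SHIFT 56            /* assumed: tag lives in the top byte */

/* Clear the top-byte tag so the address can be compared with the task size. */
static uint64_t sketch_untag(uint64_t addr)
{
    return addr & ((1ULL << SKETCH_TAG_SHIFT) - 1);
}

/*
 * Number of bytes that may be copied starting at addr without crossing
 * SKETCH_TASK_SIZE, capped at one page.
 */
static uint64_t sketch_copy_size(uint64_t addr)
{
    uint64_t plain = sketch_untag(addr);
    uint64_t size;

    /* The sketch assumes the untagged address is a valid user address. */
    assert(plain < SKETCH_TASK_SIZE);

    size = SKETCH_TASK_SIZE - plain;
    if (size > SKETCH_PAGE_SIZE)
        size = SKETCH_PAGE_SIZE;
    return size;
}
```

Without the untagging step, a tagged address is numerically larger than the task size, so the unsigned subtraction wraps around and the result is capped to a full page even when fewer bytes remain below the boundary.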

    Link: http://lkml.kernel.org/r/1de225e4a54204bfd7f25dac2635e31aa4aa1d90.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Kees Cook
    Reviewed-by: Catalin Marinas
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Pull fuse updates from Miklos Szeredi:

    - Continue separating the transport (user/kernel communication) and the
    filesystem layers of fuse. Getting rid of most layering violations
    will allow for easier cleanup and optimization later on.

    - Prepare for the addition of the virtio-fs filesystem. The actual
    filesystem will be introduced by a separate pull request.

    - Convert to new mount API.

    - Various fixes, optimizations and cleanups.

    * tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits)
    fuse: Make fuse_args_to_req static
    fuse: fix memleak in cuse_channel_open
    fuse: fix beyond-end-of-page access in fuse_parse_cache()
    fuse: unexport fuse_put_request
    fuse: kmemcg account fs data
    fuse: on 64-bit store time in d_fsdata directly
    fuse: fix missing unlock_page in fuse_writepage()
    fuse: reserve byteswapped init opcodes
    fuse: allow skipping control interface and forced unmount
    fuse: dissociate DESTROY from fuseblk
    fuse: delete dentry if timeout is zero
    fuse: separate fuse device allocation and installation in fuse_conn
    fuse: add fuse_iqueue_ops callbacks
    fuse: extract fuse_fill_super_common()
    fuse: export fuse_dequeue_forget() function
    fuse: export fuse_get_unique()
    fuse: export fuse_send_init_request()
    fuse: export fuse_len_args()
    fuse: export fuse_end_request()
    fuse: fix request limit
    ...

    Linus Torvalds
     

20 Sep, 2019

1 commit

  • Pull y2038 vfs updates from Arnd Bergmann:
    "Add inode timestamp clamping.

    This series from Deepa Dinamani adds a per-superblock minimum/maximum
    timestamp limit for a file system, and clamps timestamps as they are
    written, to avoid random behavior from integer overflow as well as
    having different time stamps on disk vs in memory.

    At mount time, a warning is now printed for any file system that can
    represent current timestamps but not future timestamps more than 30
    years into the future, similar to the arbitrary 30 year limit that was
    added to settimeofday().

    This was picked as a compromise to warn users to migrate to other file
    systems (e.g. ext4 instead of ext3) when they need the file system to
    survive beyond 2038 (or similar limits in other file systems), but not
    get in the way of normal usage"

    * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    ext4: Reduce ext4 timestamp warnings
    isofs: Initialize filesystem timestamp ranges
    pstore: fs superblock limits
    fs: omfs: Initialize filesystem timestamp ranges
    fs: hpfs: Initialize filesystem timestamp ranges
    fs: ceph: Initialize filesystem timestamp ranges
    fs: sysv: Initialize filesystem timestamp ranges
    fs: affs: Initialize filesystem timestamp ranges
    fs: fat: Initialize filesystem timestamp ranges
    fs: cifs: Initialize filesystem timestamp ranges
    fs: nfs: Initialize filesystem timestamp ranges
    ext4: Initialize timestamps limits
    9p: Fill min and max timestamps in sb
    fs: Fill in max and min timestamps in superblock
    utimes: Clamp the timestamps before update
    mount: Add mount warning for impending timestamp expiry
    timestamp_truncate: Replace users of timespec64_trunc
    vfs: Add timestamp_truncate() api
    vfs: Add file timestamp range support
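
The clamping and the mount-time warning described in the pull message above can be sketched roughly as follows. This is a simplified userspace illustration, not the kernel implementation: struct sketch_sb, its fields, SKETCH_30_YEARS and both functions are hypothetical names, times are plain seconds since the epoch, and a year is approximated as 365 days.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-superblock timestamp limits, in seconds since the epoch. */
struct sketch_sb {
    int64_t time_min;
    int64_t time_max;
};

/* Roughly 30 years in seconds (365-day years; an approximation). */
#define SKETCH_30_YEARS (30LL * 365 * 24 * 3600)

/* Clamp a timestamp into the range the filesystem can represent. */
static int64_t sketch_clamp_time(const struct sketch_sb *sb, int64_t t)
{
    assert(sb->time_min <= sb->time_max);
    if (t < sb->time_min)
        return sb->time_min;
    if (t > sb->time_max)
        return sb->time_max;
    return t;
}

/*
 * Warn when the filesystem can represent the current time but not a time
 * 30 years from now. Assumes 'now' is far enough below INT64_MAX that the
 * addition cannot overflow.
 */
static bool sketch_expiry_warning(const struct sketch_sb *sb, int64_t now)
{
    return now <= sb->time_max && now + SKETCH_30_YEARS > sb->time_max;
}
```

For a filesystem limited to signed 32-bit seconds (maximum 2147483647, i.e. January 2038), a mount in 2020 would trigger the warning in this sketch, while a filesystem with a much larger maximum would not.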

    Linus Torvalds