31 Aug, 2022

1 commit

  • [ Upstream commit bf1ac16edf6770a92bc75cf2373f1f9feea398a4 ]

    Idmapped mounts should not allow a user to map file ownership into a
    range of ids which is not under the control of that user. However, we
    currently don't check whether the mounter is privileged with respect to
    the target user namespace.

    Currently no FS_USERNS_MOUNT filesystems support idmapped mounts, thus
    this is not a problem as only CAP_SYS_ADMIN in init_user_ns is allowed
    to set up idmapped mounts. But this could change in the future, so add a
    check to refuse to create idmapped mounts when the mounter does not have
    CAP_SYS_ADMIN in the target user namespace.
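
    The check described above can be pictured with a small userspace model.
    This is a sketch under stated assumptions, not the kernel code: the
    struct layouts, field names, and the helpers has_sys_admin_in() and
    may_attach_userns() are hypothetical, and the walk only loosely follows
    how capabilities are resolved across nested user namespaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not the kernel's types. */
struct user_ns {
    const struct user_ns *parent; /* NULL for the initial namespace */
    unsigned int owner_uid;       /* uid that created this namespace */
    int level;                    /* nesting depth, 0 for the initial one */
};

struct cred {
    const struct user_ns *user_ns; /* namespace the caller lives in */
    unsigned int euid;
    bool cap_sys_admin;            /* CAP_SYS_ADMIN raised in user_ns */
};

/* Is the caller privileged over the target namespace? */
static bool has_sys_admin_in(const struct cred *c, const struct user_ns *target)
{
    assert(c->user_ns != NULL);

    for (const struct user_ns *ns = target; ns != NULL; ns = ns->parent) {
        /* Reached the caller's own namespace: its capability set decides. */
        if (ns == c->user_ns)
            return c->cap_sys_admin;
        /* Target is not a descendant of the caller's namespace. */
        if (ns->level <= c->user_ns->level)
            return false;
        /* The creator of a child namespace is privileged over it. */
        if (ns->parent == c->user_ns && ns->owner_uid == c->euid)
            return true;
    }
    return false;
}

/* Refuse the idmapped mount when the mounter lacks CAP_SYS_ADMIN in the
 * user namespace it wants to attach to the mount. */
static bool may_attach_userns(const struct cred *mounter,
                              const struct user_ns *target)
{
    return has_sys_admin_in(mounter, target);
}
```

    In this model, root inside one container namespace is refused when it
    tries to attach a sibling container's namespace, while host root with
    the capability raised is allowed for both.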

    Fixes: bd303368b776 ("fs: support mapped mounts of mapped filesystems")
    Signed-off-by: Seth Forshee
    Reviewed-by: Christian Brauner (Microsoft)
    Link: https://lore.kernel.org/r/20220816164752.2595240-1-sforshee@digitalocean.com
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Sasha Levin

    Seth Forshee
     

02 Jul, 2022

2 commits

  • commit bd303368b776eead1c29e6cdda82bde7128b82a7 upstream.

    In previous patches we added new and modified existing helpers to handle
    idmapped mounts of filesystems mounted with an idmapping. In this final
    patch we convert all relevant places in the vfs to actually pass the
    filesystem's idmapping into these helpers.

    With this the vfs is in shape to handle idmapped mounts of filesystems
    mounted with an idmapping. Note that this is just the generic
    infrastructure. Actually adding support for idmapped mounts to a
    filesystem mountable with an idmapping is follow-up work.

    In this patch we extend the definition of an idmapped mount from a mount
    that has the initial idmapping attached to it to a mount that has
    an idmapping attached to it which is not the same as the idmapping the
    filesystem was mounted with.

    As before we do not allow the initial idmapping to be attached to a
    mount. In addition this patch prevents the idmapping the filesystem
    was mounted with from being attached to a mount created based on this
    filesystem.

    This has multiple reasons and advantages. First, attaching the initial
    idmapping or the filesystem's idmapping doesn't make much sense as in
    both cases the values of the i_{g,u}id and other places where k{g,u}ids
    are used do not change. Second, a user that really wants to do this for
    whatever reason can just create a separate dedicated identical idmapping
    to attach to the mount. Third, we can continue to use the initial
    idmapping as an indicator that a mount is not idmapped allowing us to
    continue to keep passing the initial idmapping into the mapping helpers
    to tell them that something isn't an idmapped mount even if the
    filesystem is mounted with an idmapping.

    Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
    Cc: Seth Forshee
    Cc: Amir Goldstein
    Cc: Christoph Hellwig
    Cc: Al Viro
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Seth Forshee
    Signed-off-by: Christian Brauner
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • commit bb49e9e730c2906a958eee273a7819f401543d6c upstream.

    Multiple places open-code the same check to determine whether a given
    mount is idmapped. Introduce a simple helper function that can be used
    instead. This allows us to get rid of the fragile open-coding. We will
    later change the check that is used to determine whether a given mount
    is idmapped. Introducing a helper allows us to do this in a single
    place instead of doing it for multiple places.

    Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
    Cc: Seth Forshee
    Cc: Christoph Hellwig
    Cc: Al Viro
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Amir Goldstein
    Reviewed-by: Seth Forshee
    Signed-off-by: Christian Brauner
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

05 Jan, 2022

1 commit

  • commit 012e332286e2bb9f6ac77d195f17e74b2963d663 upstream.

    Make sure that finish_mount_kattr() is called after mount_kattr was
    successfully built in both the success and failure case to prevent
    leaking any references we took when we built it. We returned early if
    path lookup failed, thereby risking a leak of the additional reference
    we took when building mount_kattr when an idmapped mount was requested.
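
    The shape of the fix can be sketched with a toy reference counter. This
    is an illustrative model only: struct ref, struct kattr_model, and the
    functions below are hypothetical names, and the simulated path lookup
    is just a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter standing in for the reference that is taken
 * while building the attribute structure for an idmapped mount. */
struct ref { int count; };

struct kattr_model { struct ref *userns; };

static void build_kattr(struct kattr_model *k, struct ref *userns)
{
    k->userns = userns;
    if (userns)
        userns->count++;        /* reference taken at build time */
}

static void finish_kattr(struct kattr_model *k)
{
    if (k->userns) {
        k->userns->count--;     /* drop the build-time reference */
        k->userns = NULL;
    }
}

/* Returns 0 on success, -1 if the simulated path lookup fails. */
static int setattr_model(struct ref *userns, bool lookup_ok)
{
    struct kattr_model k;
    int err = 0;

    build_kattr(&k, userns);
    if (!lookup_ok)
        err = -1;               /* no early return: fall through to cleanup */
    /* ... attributes would be applied here when err == 0 ... */
    finish_kattr(&k);           /* runs on both success and failure */
    assert(k.userns == NULL);
    return err;
}
```

    With a single cleanup point after the build step, the counter returns
    to its starting value whether or not the lookup succeeds.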

    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 9caccd41541a ("fs: introduce MOUNT_ATTR_IDMAP")
    Signed-off-by: Christian Brauner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

04 Sep, 2021

3 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton: (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
  • Container admin can create new namespaces and force kernel to allocate up
    to several pages of memory for the namespaces and its associated
    structures.

    Net and uts namespaces have enabled accounting for such allocations. It
    makes sense to account for the remaining ones to restrict the host's
    memory consumption from inside the memcg-limited container.

    Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
    Signed-off-by: Vasily Averin
    Acked-by: Serge Hallyn
    Acked-by: Christian Brauner
    Acked-by: Kirill Tkhai
    Reviewed-by: Shakeel Butt
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: Borislav Petkov
    Cc: Dmitry Safonov
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Jiri Slaby
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Yutian Yang
    Cc: Zefan Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • Patch series "memcg accounting from OpenVZ", v7.

    OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
    kernels. Initially we used our own accounting subsystem, then partially
    committed it to upstream, and a few years ago switched to cgroups v1. Now
    we're rebasing again, revising our old patches and trying to push them
    upstream.

    We try to protect the host system from any misuse of kernel memory
    allocation triggered by untrusted users inside the containers.

    Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
    list, though I would be very grateful for any comments from maintainers
    of affected subsystems or other people added in cc:

    Compared to the upstream, we additionally account the following kernel objects:
    - network devices and its Tx/Rx queues
    - ipv4/v6 addresses and routing-related objects
    - inet_bind_bucket cache objects
    - VLAN group arrays
    - ipv6/sit: ip_tunnel_prl
    - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
    - nsproxy and namespace objects itself
    - IPC objects: semaphores, message queues and share memory segments
    - mounts
    - pollfd and select bits arrays
    - signals and posix timers
    - file lock
    - fasync_struct used by the file lease code and driver's fasync queues
    - tty objects
    - per-mm LDT

    We have incorrect/incomplete/obsolete accounting for a few other kernel
    objects: sk_filter, af_packets, netlink and xt_counters for iptables.
    They require rework and will probably be dropped altogether.

    Also we're going to add an accounting for nft, however it is not ready
    yet.

    We have not tested performance on upstream; however, our performance team
    compared our current RHEL7-based production kernel and reports that it
    is at least no worse than the corresponding original RHEL7 kernel.

    This patch (of 10):

    The kernel allocates ~400 bytes of 'struct mount' for any new mount.
    Creating a new mount namespace clones most of the parent mounts, and this
    can be repeated many times. Additionally, each mount allocates up to
    PATH_MAX=4096 bytes for mnt->mnt_devname.

    It makes sense to account for these allocations to restrict the host's
    memory consumption from inside the memcg-limited container.

    Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
    Signed-off-by: Vasily Averin
    Reviewed-by: Shakeel Butt
    Acked-by: Christian Brauner
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Yutian Yang
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: Borislav Petkov
    Cc: Dmitry Safonov
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Jiri Slaby
    Cc: Kirill Tkhai
    Cc: Oleg Nesterov
    Cc: Serge Hallyn
    Cc: Thomas Gleixner
    Cc: Zefan Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

01 Sep, 2021

1 commit

  • …/scm/linux/kernel/git/brauner/linux

    Pull move_mount updates from Christian Brauner:
    "This contains an extension to the move_mount() syscall making it
    possible to add a single private mount into an existing propagation
    tree.

    The use-case comes from the criu folks which have been struggling with
    restoring complex mount trees for a long time. Variations of this work
    have been discussed at Plumbers before, e.g.

    https://www.linuxplumbersconf.org/event/7/contributions/640/

    The extension to move_mount() enables criu to restore any set of mount
    namespaces, mount trees and sharing group trees without introducing
    yet more complexity into mount propagation itself.

    The changes required to criu to make use of this and restore complex
    propagation trees are available at

    https://github.com/Snorch/criu/commits/mount-v2-poc

    A cleaned-up version of this will go up for merging into the main criu
    repo after this lands"

    * tag 'fs.move_mount.move_mount_set_group.v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest
    move_mount: allow to add a mount into an existing group

    Linus Torvalds
     

23 Aug, 2021

1 commit

  • We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
    off in Fedora and RHEL8. Several other distros have followed suit.

    I've heard of one problem in all that time: Someone migrated from an
    older distro that supported "-o mand" to one that didn't, and the host
    had a fstab entry with "mand" in it which broke on reboot. They didn't
    actually _use_ mandatory locking so they just removed the mount option
    and moved on.

    This patch rips out mandatory locking support wholesale from the kernel,
    along with the Kconfig option and the Documentation file. It also
    changes the mount code to ignore the "mand" mount option instead of
    erroring out, and to throw a big, ugly warning.

    Signed-off-by: Jeff Layton

    Jeff Layton
     

22 Aug, 2021

1 commit

  • Pull mandatory file locking deprecation warning from Jeff Layton:
    "As discussed on the list, this patch just adds a new warning for folks
    who still have mandatory locking enabled and actually mount with '-o
    mand'. I'd like to get this in for v5.14 so we can push this out into
    stable kernels and hopefully reach folks who have mounts with -o mand.

    For now, I'm operating under the assumption that we'll fully remove
    this support in v5.15, but we can move that out if any legitimate
    users of this facility speak up between now and then"

    * tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    fs: warn about impending deprecation of mandatory locks

    Linus Torvalds
     

21 Aug, 2021

1 commit


10 Aug, 2021

1 commit


26 Jul, 2021

1 commit

    Previously a sharing group (shared and master ids pair) could only be
    inherited when a mount is created via bind mount. This patch adds the
    ability to add an existing private mount into an existing sharing group.

    With this functionality one can first create the desired mount tree from
    only private mounts (without the need to care about undesired mount
    propagation or mount creation order implied by sharing group
    dependencies), and then set up any desired mount sharing between
    those mounts in the tree as needed.

    This allows CRIU to restore any set of mount namespaces, mount trees and
    sharing group trees for a container.

    We have many issues with restoring mounts in CRIU related to sharing
    groups and propagation:
    - reverse sharing groups vs mount tree order requires complex mounts
    reordering which mostly implies also using some temporary mounts
    (please see https://lkml.org/lkml/2021/3/23/569 for more info)

    - mount() syscall creates tons of mounts due to propagation
    - mount re-parenting due to propagation
    - "Mount Trap" due to propagation
    - "Non Uniform" propagation, meaning that with different tricks with
    mount order and temporary children-"lock" mounts one can create mount
    trees which can't be restored without those tricks
    (see https://www.linuxplumbersconf.org/event/7/contributions/640/)

    With this new functionality we can resolve all the problems with
    propagation at once.

    Link: https://lore.kernel.org/r/20210715100714.120228-1-ptikhomirov@virtuozzo.com
    Cc: Eric W. Biederman
    Cc: Alexander Viro
    Cc: Christian Brauner
    Cc: Mattias Nissler
    Cc: Aleksa Sarai
    Cc: Andrei Vagin
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-api@vger.kernel.org
    Cc: lkml
    Co-developed-by: Andrei Vagin
    Acked-by: Christian Brauner
    Signed-off-by: Pavel Tikhomirov
    Signed-off-by: Andrei Vagin
    Signed-off-by: Christian Brauner

    Pavel Tikhomirov
     

01 Jun, 2021

1 commit

  • Commit dab741e0e02b ("Add a "nosymfollow" mount option.") added support
    for the "nosymfollow" mount option allowing to block following symlinks
    when resolving paths. The mount option so far was only available in the
    old mount api. Make it available in the new mount api as well. Bonus is
    that it can be applied to a whole subtree not just a single mount.

    Cc: Christoph Hellwig
    Cc: Mattias Nissler
    Cc: Aleksa Sarai
    Cc: Al Viro
    Cc: Ross Zwisler
    Signed-off-by: Christian Brauner

    Christian Brauner
     

12 May, 2021

1 commit

  • We currently don't have any filesystems that support idmapped mounts
    which are mountable inside a user namespace. That was a deliberate
    decision for now as a userns root can just mount the filesystem
    themselves. So enforce this restriction explicitly until there's a real
    use-case for this. This way we can notice it and will have a chance to
    adapt and audit our translation helpers and fstests appropriately if we
    need to support such filesystems.

    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: stable@vger.kernel.org
    CC: linux-fsdevel@vger.kernel.org
    Suggested-by: Seth Forshee
    Signed-off-by: Christian Brauner

    Christian Brauner
     

01 Apr, 2021

1 commit

  • Fix kernel-doc warnings in fs/namespace.c:

    ./fs/namespace.c:1379: warning: Function parameter or member 'm' not described in 'may_umount_tree'
    ./fs/namespace.c:1379: warning: Excess function parameter 'mnt' description in 'may_umount_tree'
    ./fs/namespace.c:1950: warning: Function parameter or member 'path' not described in 'clone_private_mount'

    Also convert path_is_mountpoint() comments to kernel-doc.

    Signed-off-by: Randy Dunlap
    Allegedly-acked-by: Al Viro
    Link: https://lore.kernel.org/r/20210318025227.4162-1-rdunlap@infradead.org
    Signed-off-by: Jonathan Corbet

    Randy Dunlap
     

28 Feb, 2021

1 commit


24 Jan, 2021

8 commits

    Introduce a new bind mount property to allow idmapping mounts. The
    MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
    together with a file descriptor referring to a user namespace.

    The user namespace referenced by the namespace file descriptor will be
    attached to the bind mount. All interactions with the filesystem going
    through that mount will be mapped according to the mapping specified in
    the user namespace attached to it.

    Using user namespaces to mark mounts means we can reuse all the
    infrastructure that already exists in the kernel to handle idmappings
    and can also use this for permission checking to allow unprivileged users
    to create idmapped mounts in the future.

    Idmapping a mount is decoupled from the caller's user and mount
    namespace. This means idmapped mounts can be created in the initial
    user namespace which is an important use-case for systemd-homed,
    portable usb-sticks between systems, sharing data between the initial
    user namespace and unprivileged containers, and other use-cases that
    have been brought up. For example, assume a home directory where all
    files are owned by uid and gid 1000 and the home directory is brought to
    a new laptop where the user has id 12345. The system administrator can
    simply create a mount of this home directory with a mapping of
    1000:12345:1 and other mappings to indicate the ids should be kept.
    (With this it is e.g. also possible to create idmapped mounts on the
    host with an identity mapping 1:1:100000 where the root user is not
    mapped. A user with root access that e.g. has been pivot rooted into
    such a mount on the host will not be able to execute, read, write, or
    create files as root.)
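
    The two mappings mentioned above can be worked through with a minimal
    lookup function. This is a sketch, assuming that a triple a:b:n means
    "ids a..a+n-1 as stored on the filesystem are presented as b..b+n-1
    through the mount"; struct id_extent, map_id() and ID_UNMAPPED are
    hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ID_UNMAPPED UINT32_MAX  /* hypothetical sentinel for "no mapping" */

/* Hypothetical extent: ids [from, from + count) are presented as
 * [to, to + count). */
struct id_extent {
    uint32_t from;
    uint32_t to;
    uint32_t count;
};

/* Translate an id through the first extent that covers it. */
static uint32_t map_id(const struct id_extent *ext, unsigned int n, uint32_t id)
{
    for (unsigned int i = 0; i < n; i++) {
        assert(ext[i].count > 0);
        if (id >= ext[i].from && id - ext[i].from < ext[i].count)
            return ext[i].to + (id - ext[i].from);
    }
    return ID_UNMAPPED;         /* id has no mapping through this mount */
}
```

    With the single extent 1000:12345:1, id 1000 is presented as 12345 and
    every other id is unmapped unless further extents are added. With
    1:1:100000, ids 1 through 100000 map to themselves while id 0 (root)
    has no mapping.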

    Given that mapping a mount is decoupled from the caller's user namespace
    a sufficiently privileged process such as a container manager can set up
    an idmapped mount for the container and the container can simply pivot
    root to it. There's no need for the container to do anything. The mount
    will appear correctly mapped independent of the user namespace the
    container uses. This means we don't need to mark a mount as idmappable.

    In order to create an idmapped mount the caller must currently be
    privileged in the user namespace of the superblock the mount belongs to.
    Once a mount has been idmapped we don't allow it to change its mapping.
    This keeps permission checking and life-cycle management simple. Users
    wanting to change the idmapping can always create a new detached mount
    with a different idmapping.

    Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Mauricio Vásquez Bernal
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • This implements the missing mount_setattr() syscall. While the new mount
    api allows to change the properties of a superblock there is currently
    no way to change the properties of a mount or a mount tree using file
    descriptors which the new mount api is based on. In addition the old
    mount api has the restriction that mount options cannot be applied
    recursively. This hasn't changed since changing mount options on a
    per-mount basis was implemented in [1] and has been a frequent request
    not just for convenience but also for security reasons. The legacy
    mount syscall is unable to accommodate this behavior without introducing
    a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
    MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
    mount. Changing MS_REC to apply to the whole mount tree would mean
    introducing a significant uapi change and would likely cause significant
    regressions.

    The new mount_setattr() syscall allows to recursively clear and set
    mount options in one shot. Multiple calls to change mount options
    requesting the same changes are idempotent:

        int mount_setattr(int dfd, const char *path, unsigned flags,
                          struct mount_attr *uattr, size_t usize);

    Flags to modify path resolution behavior are specified in the @flags
    argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
    and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
    restrict path resolution as introduced with openat2() might be supported
    in the future.

    The mount_setattr() syscall can be expected to grow over time and is
    designed with extensibility in mind. It follows the extensible syscall
    pattern we have used with other syscalls such as openat2(), clone3(),
    sched_{set,get}attr(), and others.
    The set of mount options is passed in the uapi struct mount_attr which
    currently has the following layout:

        struct mount_attr {
                __u64 attr_set;
                __u64 attr_clr;
                __u64 propagation;
                __u64 userns_fd;
        };

    The @attr_set and @attr_clr members are used to clear and set mount
    options. This way a user can e.g. request that a set of flags is to be
    raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
    @attr_set while at the same time requesting that another set of flags is
    to be lowered such as removing noexec from a mount tree by specifying
    MOUNT_ATTR_NOEXEC in @attr_clr.

    Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
    not a bitmap, users wanting to transition to a different atime setting
    cannot simply specify the atime setting in @attr_set, but must also
    specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
    MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
    can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
    @attr_clr.
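
    The set/clear semantics and the atime rule can be modelled in a few
    lines. This is a sketch that works directly on the uapi flag values;
    the EX_-prefixed constants are local copies of the MOUNT_ATTR_* values
    from <linux/mount.h>, and attr_request_valid() and attr_apply() are
    hypothetical helpers rather than kernel functions (the kernel first
    translates these flags to its internal mount flags).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of uapi values, prefixed EX_ to keep the sketch
 * self-contained. */
#define EX_ATTR_RDONLY      0x00000001u
#define EX_ATTR_NOEXEC      0x00000008u
#define EX_ATTR__ATIME      0x00000070u /* mask covering all atime settings */
#define EX_ATTR_RELATIME    0x00000000u
#define EX_ATTR_NOATIME     0x00000010u
#define EX_ATTR_STRICTATIME 0x00000020u

/* The atime field is an enum inside a mask: it may only be cleared as a
 * whole, and may only be set if it is also cleared. */
static bool attr_request_valid(uint64_t set, uint64_t clr)
{
    uint64_t clr_atime = clr & EX_ATTR__ATIME;

    if (clr_atime && clr_atime != EX_ATTR__ATIME)
        return false;           /* partial clear of the atime mask */
    if ((set & EX_ATTR__ATIME) && !clr_atime)
        return false;           /* atime set without clearing the mask */
    return true;
}

/* Clear first, then set; repeating the same request changes nothing. */
static uint64_t attr_apply(uint64_t cur, uint64_t set, uint64_t clr)
{
    assert(attr_request_valid(set, clr));
    return (cur & ~clr) | set;
}
```

    For example, a mount that is noexec with strictatime can be turned
    read-only with noatime, dropping noexec, by passing RDONLY | NOATIME as
    the set mask and NOEXEC | the full atime mask as the clear mask.
    Applying the same request a second time leaves the flags unchanged.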

    The @propagation field lets callers specify the propagation type of a
    mount tree. Propagation is a single property that has four different
    settings and as such is not really a flag argument but an enum.
    Specifically, it would be unclear what setting and clearing propagation
    settings in combination would amount to. The legacy mount() syscall thus
    forbids the combination of multiple propagation settings too. The goal
    is to keep the semantics of mount propagation somewhat simple as they
    are overly complex as it is.
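
    Treating propagation as an enum rather than a flag set amounts to
    accepting zero (no change) or exactly one of the four settings. A
    minimal sketch, assuming the MS_* bit values from <linux/mount.h>
    copied locally under an EX_ prefix; propagation_valid() and
    propagation_name() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the propagation bits. */
#define EX_MS_UNBINDABLE (1u << 17)
#define EX_MS_PRIVATE    (1u << 18)
#define EX_MS_SLAVE      (1u << 19)
#define EX_MS_SHARED     (1u << 20)
#define EX_PROP_MASK \
    (EX_MS_UNBINDABLE | EX_MS_PRIVATE | EX_MS_SLAVE | EX_MS_SHARED)

/* Valid values: 0 (leave propagation alone) or exactly one setting. */
static bool propagation_valid(uint64_t p)
{
    if (p == 0)
        return true;
    if (p & ~(uint64_t)EX_PROP_MASK)
        return false;           /* unknown bits */
    return (p & (p - 1)) == 0;  /* exactly one bit set */
}

static const char *propagation_name(uint64_t p)
{
    assert(propagation_valid(p));
    switch (p) {
    case EX_MS_UNBINDABLE: return "unbindable";
    case EX_MS_PRIVATE:    return "private";
    case EX_MS_SLAVE:      return "slave";
    case EX_MS_SHARED:     return "shared";
    default:               return "unchanged";
    }
}
```

    A request such as EX_MS_SHARED | EX_MS_PRIVATE is rejected by this
    check, which mirrors the point above that combining propagation
    settings has no clear meaning.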

    The @userns_fd field lets users specify a user namespace whose idmapping
    becomes the idmapping of the mount. This is implemented and explained in
    detail in the next patch.

    [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")

    Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Aleksa Sarai
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-api@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags
    which we will use in follow-up patches too.

    Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
    aren't currently any active writers. Split this logic out into simple
    helpers that we can use in follow-up patches.

    Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
    which seems unnecessary since we're not changing the superblock. We're
    only checking whether it is already read-only. Setting other mount
    attributes is protected by lock_mount_hash() afaict and not by s_umount.

    The history of down_write(&sb->s_umount) lock being taken when setting
    mount attributes dates back to the introduction of MNT_READONLY in [2].
    This introduced the concept of having read-only mounts in contrast to
    just having a read-only superblock. When it got introduced it was simply
    plumbed into do_remount() which already took down_write(&sb->s_umount)
    because it was only used to actually change the superblock before [2].
    Afaict, it would've already been possible back then to only use
    down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount
    options were protected by the vfsmount lock already. But that would've
    meant special casing the locking for MS_BIND | MS_REMOUNT in
    do_remount() which people might not have considered worth it.
    Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
    do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
    lock was simply copied over.
    Now that we have this be a separate helper only take the
    down_read(&sb->s_umount) lock since we're only interested in checking
    whether the super block is currently read-only and blocking any writers
    from changing it. Essentially, checking that the super block is
    read-only has the advantage that we can avoid having to go into the
    slowpath and through MNT_WRITE_HOLD and can simply set the read-only
    flag on the mount in set_mount_attributes().

    [1]: commit 43f5e655eff7 ("vfs: Separate changing mount flags full remount")
    [2]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")

    Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The lock_mount_hash() and unlock_mount_hash() helpers are never called
    outside a single file. Remove them from the header and make them static
    to reflect this fact. There's no need to have them callable from other
    places right now, as Christoph observed.

    Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Changing mount options always ends up taking lock_mount_hash() but when
    MNT_READONLY is requested and neither the mount nor the superblock are
    MNT_READONLY we end up taking the lock, dropping it, and retaking it to
    change the other mount attributes. Instead, let's acquire the lock once
    when changing the mount attributes. This simplifies the locking in these
    codepaths, makes them easier to reason about, and avoids having to
    reacquire the lock right after dropping it.

    Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • In order to support per-mount idmappings vfsmounts are marked with user
    namespaces. The idmapping of the user namespace will be used to map the
    ids of vfs objects when they are accessed through that mount. By default
    all vfsmounts are marked with the initial user namespace. The initial
    user namespace is used to indicate that a mount is not idmapped. All
    operations behave as before.

    Based on prior discussions we want to attach the whole user namespace
    and not just a dedicated idmapping struct. This allows us to reuse all
    the helpers that already exist for dealing with idmappings instead of
    introducing a whole new range of helpers. In addition, if we decide in
    the future that we are confident enough to enable unprivileged users to
    setup idmapped mounts the permission checking can take into account
    whether the caller is privileged in the user namespace the mount is
    currently marked with.
    Later patches enforce that once a mount has been idmapped it can't be
    remapped. This keeps permission checking and life-cycle management
    simple. Users wanting to change the idmapping can always create a new
    detached mount with a different idmapping.

    Add a new mnt_userns member to vfsmount and two simple helpers to
    retrieve the mnt_userns from vfsmounts and files.

    The idea to attach user namespaces to vfsmounts has been floated around
    in various forms at Linux Plumbers in ~2018 with the original idea
    tracing back to a discussion in 2017 at a conference in St. Petersburg
    between Christoph, Tycho, and myself.

    Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     

05 Jan, 2021

2 commits

  • Unfortunately, there's userland code that used to rely upon these
    checks being done before anything else to check for UMOUNT_NOFOLLOW
    support. That broke in 41525f56e256 ("fs: refactor ksys_umount").
    Separate those from the rest of checks and move them to ksys_umount();
    unlike everything else in there, this can be sanely done there.

    Reported-by: Sargun Dhillon
    Fixes: 41525f56e256 ("fs: refactor ksys_umount")
    Signed-off-by: Al Viro

    Al Viro
     
  • There's no need for mnt_want_write_file() to increment mnt_writers when
    the file is already open for writing, provided that
    mnt_drop_write_file() is changed to conditionally decrement it.

    We seem to have ended up in the current situation because
    mnt_want_write_file() used to be paired with mnt_drop_write(), due to
    mnt_drop_write_file() not having been added yet. So originally
    mnt_want_write_file() had to always increment mnt_writers.

    But later mnt_drop_write_file() was added, and all callers of
    mnt_want_write_file() were paired with it. This makes the compatibility
    between mnt_want_write_file() and mnt_drop_write() no longer necessary.

    Therefore, make __mnt_want_write_file() and __mnt_drop_write_file() skip
    incrementing mnt_writers on files already open for writing. This
    removes the only caller of mnt_clone_write(), so remove that too.

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

26 Dec, 2020

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted patches from previous cycle(s)..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix hostfs_open() use of ->f_path.dentry
    Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode
    fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set
    fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()
    fs/namespace.c: WARN if mnt_count has become negative

    Linus Torvalds
     

15 Dec, 2020

1 commit

  • Pull misc fixes from Christian Brauner:
    "This contains several fixes which felt worth being combined into a
    single branch:

    - Use put_nsproxy() instead of open-coding it in switch_task_namespaces()

    - Kirill's work to unify lifecycle management for all namespaces. The
    lifetime counters are used identically for all namespace types.
    Namespaces may of course have additional unrelated counters and
    these are not altered. This work allows us to unify the type of the
    counters and reduces maintenance cost by moving the counter in one
    place and indicating that basic lifetime management is identical
    for all namespaces.

    - Peilin's fix adding three byte padding to Dmitry's
    PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

    - Two small patches to convert from the /* fall through */ comment
    annotation to the fallthrough keyword annotation which I had taken
    into my branch and into -next before df561f6688fe ("treewide: Use
    fallthrough pseudo-keyword") made it upstream which fixed this
    tree-wide.

    Since I didn't want to invalidate all testing for other commits I
    didn't rebase and kept them"

    * tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    nsproxy: use put_nsproxy() in switch_task_namespaces()
    sys: Convert to the new fallthrough notation
    signal: Convert to the new fallthrough notation
    time: Use generic ns_common::count
    cgroup: Use generic ns_common::count
    mnt: Use generic ns_common::count
    user: Use generic ns_common::count
    pid: Use generic ns_common::count
    ipc: Use generic ns_common::count
    uts: Use generic ns_common::count
    net: Use generic ns_common::count
    ns: Add a common refcount into ns_common
    ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()

    Linus Torvalds
     

11 Dec, 2020

1 commit

  • Missing calls to mntget() (or equivalently, too many calls to mntput())
    are hard to detect because mntput() delays freeing mounts using
    task_work_add(), then again using call_rcu(). As a result, mnt_count
    can often be decremented to -1 without getting a KASAN use-after-free
    report. Such cases are still bugs though, and they point to real
    use-after-frees being possible.

    For an example of this, see the bug fixed by commit 1b0b9cc8d379
    ("vfs: fsmount: add missing mntget()"), discussed at
    https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u.
    This bug *should* have been trivial to find. But actually, it wasn't
    found until syzkaller happened to use fchdir() to manipulate the
    reference count just right for the bug to be noticeable.

    Address this by making mntput_no_expire() issue a WARN if mnt_count has
    become negative.
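
A user-space sketch of the idea: decrement the count, and complain loudly if it went below zero instead of returning silently. The names are hypothetical, and a counter plus fprintf() stands in for the kernel's WARN machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct mount_sketch { int mnt_count; };

static int warn_count; /* number of warnings emitted so far */

/* Drop one reference. Returns 1 when the last reference is gone and the
 * mount should be cleaned up, 0 otherwise. */
static int mntput_sketch(struct mount_sketch *m)
{
    int count;

    assert(m != NULL);
    count = --m->mnt_count;
    if (count != 0) {
        if (count < 0) {
            /* One put too many: a missing get somewhere else. */
            warn_count++;
            fprintf(stderr, "WARN: mnt_count became negative (%d)\n", count);
        }
        return 0;
    }
    return 1;
}
```

The warning does not repair anything; it only makes the imbalance visible at the point where it first becomes detectable.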

    Suggested-by: Miklos Szeredi
    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

25 Oct, 2020

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff all over the place (the largest group here is
    Christoph's stat cleanups)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: remove KSTAT_QUERY_FLAGS
    fs: remove vfs_stat_set_lookup_flags
    fs: move vfs_fstatat out of line
    fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
    fs: remove vfs_statx_fd
    fs: omfs: use kmemdup() rather than kmalloc+memcpy
    [PATCH] reduce boilerplate in fsid handling
    fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
    selftests: mount: add nosymfollow tests
    Add a "nosymfollow" mount option.

    Linus Torvalds
     

18 Oct, 2020

1 commit

  • A previous commit changed the notification mode from true/false to an
    int, allowing notify-no, notify-yes, or signal-notify. This was
    backwards compatible in the sense that any existing true/false user
    would translate to either 0 (no notification sent) or 1, the latter of
    which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

    Clean this up properly, and define a proper enum for the notification
    mode. Now we have:

    - TWA_NONE. This is 0, same as before the original change, meaning no
    notification requested.
    - TWA_RESUME. This is 1, same as before the original change, meaning
    that we use TIF_NOTIFY_RESUME.
    - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
    notification.

    Clean up all the callers, switching their 0/1/false/true to using the
    appropriate TWA_* mode for notifications.
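
The three modes listed above can be written as an enum along these lines. The enum tag and the twa_from_legacy_bool() helper are illustrative assumptions; the helper only shows how an old true/false argument lines up with the named modes.

```c
#include <assert.h>
#include <stdbool.h>

/* Enumerators get 0, 1, 2 in declaration order. */
enum task_work_notify_mode {
    TWA_NONE,   /* no notification requested */
    TWA_RESUME, /* notify via TIF_NOTIFY_RESUME */
    TWA_SIGNAL, /* notify via TIF_SIGPENDING/JOBCTL_TASK_WORK */
};

/* Compile-time checks that the numeric values stay backwards compatible. */
static_assert(TWA_NONE == 0, "TWA_NONE must stay 0");
static_assert(TWA_RESUME == 1, "TWA_RESUME must stay 1");
static_assert(TWA_SIGNAL == 2, "TWA_SIGNAL must stay 2");

/* Hypothetical helper: translate a legacy boolean argument. */
static enum task_work_notify_mode twa_from_legacy_bool(bool notify)
{
    return notify ? TWA_RESUME : TWA_NONE;
}
```

Named enumerators make call sites self-describing, where a bare 0, 1 or true gave no hint of which notification path was meant.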

    Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 Oct, 2020

1 commit

  • Pull compat mount cleanups from Al Viro:
    "The last remnants of mount(2) compat buried by Christoph.

    Buried into NFS, that is.

    Generally I'm less enthusiastic about "let's use in_compat_syscall()
    deep in call chain" kind of approach than Christoph seems to be, but
    in this case it's warranted - that had been an NFS-specific wart,
    hopefully not to be repeated in any other filesystems (read: any new
    filesystem introducing non-text mount options will get NAKed even if
    it doesn't mess the layout up).

    IOW, not worth trying to grow an infrastructure that would avoid that
    use of in_compat_syscall()..."

    * 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: remove compat_sys_mount
    fs,nfs: lift compat nfs4 mount data handling into the nfs code
    nfs: simplify nfs4_parse_monolithic

    Linus Torvalds
     

23 Sep, 2020

1 commit


04 Sep, 2020

1 commit

  • The copy_mount_options() function takes a user pointer argument but no
    size and it tries to read up to a PAGE_SIZE. However, copy_from_user()
    is not guaranteed to return all the accessible bytes if, for example,
    the access crosses a page boundary and gets a fault on the second page.
    To work around this, the current copy_mount_options() implementation
    performs two copy_from_user() passes, first to the end of the current
    page and the second to what's left in the subsequent page.

    On arm64 with MTE enabled, access to a user page may trigger a fault
    after part of the buffer in a page has been copied (when the user
    pointer tag, bits 56-59, no longer matches the allocation tag stored in
    memory). Allow copy_mount_options() to handle such intra-page faults by
    falling back to a byte-at-a-time copy when copy_from_user() fails.

    Note that copy_from_user() handles the zeroing of the kernel buffer in
    case of error.
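
The fallback can be modelled in user space as below. Everything here is a hypothetical stand-in: bulk_copy_sketch() plays the role of a bulk copy that may give up early, get_byte_sketch() the role of a single-byte user access, and fault_at simulates where the source buffer stops being accessible. In this simplified model the bulk copy transfers nothing on a fault, so the byte loop restarts from offset 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fault model: source bytes at index >= fault_at are inaccessible. */
static size_t fault_at;

/* Bulk copy: returns the number of bytes NOT copied. On a fault it copies
 * nothing and zeroes the destination. */
static size_t bulk_copy_sketch(char *dst, const char *src, size_t n)
{
    if (n > fault_at) {
        memset(dst, 0, n);
        return n;
    }
    memcpy(dst, src, n);
    return 0;
}

/* Single-byte access: fails exactly at the first inaccessible byte. */
static int get_byte_sketch(char *c, const char *src, size_t i)
{
    if (i >= fault_at)
        return -1;
    *c = src[i];
    return 0;
}

/* Try the bulk copy first; on failure, salvage bytes one at a time up to
 * the fault. Returns the number of bytes copied. */
static size_t copy_options_sketch(char *dst, const char *src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    if (bulk_copy_sketch(dst, src, n) == 0)
        return n;

    for (i = 0; i < n; i++) {
        char c;

        if (get_byte_sketch(&c, src, i) != 0)
            break;
        dst[i] = c;
    }
    return i;
}
```

Because the destination was zeroed when the bulk copy failed, the bytes after the fault point stay zero rather than holding stale data.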

    Signed-off-by: Catalin Marinas
    Cc: Alexander Viro

    Catalin Marinas
     

28 Aug, 2020

1 commit

  • For mounts that have the new "nosymfollow" option, don't follow symlinks
    when resolving paths. The new option is similar in spirit to the
    existing "nodev", "noexec", and "nosuid" options, as well as to the
    LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
    variants have been supporting the "nosymfollow" mount option for a long
    time with equivalent implementations.

    Note that symlinks may still be created on file systems mounted with
    the "nosymfollow" option present. readlink() remains functional, so
    user space code that is aware of symlinks can still choose to follow
    them explicitly.

    Setting the "nosymfollow" mount option helps prevent privileged
    writers from modifying files unintentionally in case there is an
    unexpected link along the accessed path. The "nosymfollow" option is
    thus useful as a defensive measure for systems that need to deal with
    untrusted file systems in privileged contexts.
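
As a rough illustration of the semantics, path resolution can be thought of as consulting a per-mount flag before stepping through a symlink. The flag name and value, the struct and the function below are hypothetical, and returning -ELOOP is an assumption made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NOSYMFOLLOW 0x80u /* hypothetical per-mount flag bit */

struct mount_sketch { unsigned int flags; };

/* Called for each component during path resolution. Symlinks can still
 * exist and be read on the mount; only traversal through them is refused. */
static int may_step_into_sketch(const struct mount_sketch *m, bool is_symlink)
{
    assert(m != NULL);
    if (is_symlink && (m->flags & SKETCH_NOSYMFOLLOW))
        return -ELOOP;
    return 0;
}
```

Regular files and directories on the same mount resolve as usual, and a program that calls readlink() and opens the target itself is unaffected.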

    More information on the history and motivation for this patch can be
    found here:

    https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal

    Signed-off-by: Mattias Nissler
    Signed-off-by: Ross Zwisler
    Reviewed-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Mattias Nissler
     

19 Aug, 2020

1 commit

  • Switch over mount namespaces to use the newly introduced common lifetime
    counter.

    Currently every namespace type has its own lifetime counter which is stored
    in the specific namespace struct. The lifetime counters are used
    identically for all namespace types. Namespaces may of course have
    additional unrelated counters and these are not altered.

    This introduces a common lifetime counter into struct ns_common. The
    ns_common struct encompasses information that all namespaces share. That
    should include the lifetime counter since it's common for all of them.

    It also allows us to unify the type of the counters across all namespaces.
    Most of them use refcount_t but one uses atomic_t and at least one uses
    kref. Especially the last one doesn't make much sense since it's just a
    wrapper around refcount_t since 2016 and actually complicates cleanup
    operations by having to use container_of() to cast the correct namespace
    struct out of struct ns_common.

    Having the lifetime counter for the namespaces in one place reduces
    maintenance cost. Not just because after switching all namespaces over we
    will have removed more code than we added but also because the logic is
    more easily understandable and we indicate to the user that the basic
    lifetime requirements for all namespaces are currently identical.
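
The shape of the change can be sketched in user space as follows. A plain int stands in for the kernel's refcount_t, and every name carrying a _sketch suffix is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State shared by every namespace type, now including the lifetime counter. */
struct ns_common_sketch {
    int count; /* plain int here; a real implementation needs an atomic type */
};

/* One specific namespace type embedding the common part. */
struct mnt_namespace_sketch {
    struct ns_common_sketch ns;
    int mounts; /* unrelated, type-specific counter stays where it was */
};

/* Recover the containing namespace from a pointer to its common part. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void ns_get_sketch(struct ns_common_sketch *ns)
{
    assert(ns != NULL && ns->count > 0);
    ns->count++;
}

/* Returns true when the last reference was dropped. */
static bool ns_put_sketch(struct ns_common_sketch *ns)
{
    assert(ns != NULL && ns->count > 0);
    return --ns->count == 0;
}
```

With the counter in the common struct, get and put can be written once and reused by every namespace type, and only the final cleanup needs to know the concrete type.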

    Signed-off-by: Kirill Tkhai
    Reviewed-by: Kees Cook
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/159644980287.604812.761686947449081169.stgit@localhost.localdomain
    Signed-off-by: Christian Brauner

    Kirill Tkhai
     

08 Aug, 2020

3 commits

  • Pull mount leak fix from Al Viro:
    "Regression fix for the syscalls-for-init series - fix a leak of a 'struct path'"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: fix a struct path leak in path_umount

    Linus Torvalds
     
  • Make sure we also put the dentry and vfsmnt in the illegal flags
    and !may_umount cases.

    Fixes: 41525f56e256 ("fs: refactor ksys_umount")
    Reported-by: Vikas Kumar
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Pull init and set_fs() cleanups from Al Viro:
    "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

    * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
    init: add an init_dup helper
    init: add an init_utimes helper
    init: add an init_stat helper
    init: add an init_mknod helper
    init: add an init_mkdir helper
    init: add an init_symlink helper
    init: add an init_link helper
    init: add an init_eaccess helper
    init: add an init_chmod helper
    init: add an init_chown helper
    init: add an init_chroot helper
    init: add an init_chdir helper
    init: add an init_rmdir helper
    init: add an init_unlink helper
    init: add an init_umount helper
    init: add an init_mount helper
    init: mark create_dev as __init
    init: mark console_on_rootfs as __init
    init: initialize ramdisk_execute_command at compile time
    devtmpfs: refactor devtmpfsd()
    ...

    Linus Torvalds