02 Jul, 2022

3 commits

  • commit bd303368b776eead1c29e6cdda82bde7128b82a7 upstream.

    In previous patches we added new and modified existing helpers to handle
    idmapped mounts of filesystems mounted with an idmapping. In this final
    patch we convert all relevant places in the vfs to actually pass the
    filesystem's idmapping into these helpers.

    With this the vfs is in shape to handle idmapped mounts of filesystems
    mounted with an idmapping. Note that this is just the generic
    infrastructure. Actually adding support for idmapped mounts to a
    filesystem mountable with an idmapping is follow-up work.

    In this patch we extend the definition of an idmapped mount from a mount
    that has a non-initial idmapping attached to it to a mount that has
    an idmapping attached to it which is not the same as the idmapping the
    filesystem was mounted with.

    As before we do not allow the initial idmapping to be attached to a
    mount. In addition this patch prevents the idmapping the filesystem
    was mounted with from being attached to a mount created based on this
    filesystem.

    This has multiple reasons and advantages. First, attaching the initial
    idmapping or the filesystem's idmapping doesn't make much sense as in
    both cases the values of the i_{g,u}id and other places where k{g,u}ids
    are used do not change. Second, a user that really wants to do this for
    whatever reason can just create a separate dedicated identical idmapping
    to attach to the mount. Third, we can continue to use the initial
    idmapping as an indicator that a mount is not idmapped allowing us to
    continue to keep passing the initial idmapping into the mapping helpers
    to tell them that something isn't an idmapped mount even if the
    filesystem is mounted with an idmapping.
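
    The rules above can be sketched as a small userspace model. This is only
    an illustration under stated assumptions: struct model_idmap, struct
    model_mount and both functions are hypothetical names, not the kernel's
    types or helpers. A mount is treated as not idmapped when it carries
    either the initial idmapping or the filesystem's own idmapping, and
    neither of those can be attached.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an idmapping; not a kernel type. */
struct model_idmap {
    const char *name;
};

/* Stand-in for the initial idmapping. */
static const struct model_idmap model_initial_idmap = { "initial" };

struct model_mount {
    const struct model_idmap *mnt_idmap; /* idmapping attached to the mount */
    const struct model_idmap *fs_idmap;  /* idmapping the filesystem was mounted with */
};

/*
 * A mount is idmapped only if its idmapping is neither the initial
 * idmapping nor the idmapping the filesystem was mounted with.
 */
bool model_is_idmapped(const struct model_mount *m)
{
    return m->mnt_idmap != &model_initial_idmap &&
           m->mnt_idmap != m->fs_idmap;
}

/*
 * Refuse to attach the initial idmapping or the filesystem's idmapping.
 * Returns 0 on success and -1 if the idmapping is rejected.
 */
int model_attach_idmap(struct model_mount *m, const struct model_idmap *idmap)
{
    if (idmap == &model_initial_idmap || idmap == m->fs_idmap)
        return -1;
    m->mnt_idmap = idmap;
    return 0;
}
```

    In this model a mount of a filesystem mounted with an idmapping still
    starts out with the initial idmapping attached, so the helpers can keep
    using the initial idmapping as the "not idmapped" marker.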

    Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
    Cc: Seth Forshee
    Cc: Amir Goldstein
    Cc: Christoph Hellwig
    Cc: Al Viro
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Seth Forshee
    Signed-off-by: Christian Brauner
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • commit 4472071331549e911a5abad41aea6e3be855a1a4 upstream.

    In a few places the vfs needs to interact with bare k{g,u}ids directly
    instead of struct inode. These are just a few. In previous patches we
    introduced low-level mapping helpers that are able to support
    filesystems mounted with an idmapping. This patch simply converts these
    places to use the new helpers.

    Link: https://lore.kernel.org/r/20211123114227.3124056-7-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-7-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-7-brauner@kernel.org
    Cc: Seth Forshee
    Cc: Amir Goldstein
    Cc: Christoph Hellwig
    Cc: Al Viro
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Seth Forshee
    Signed-off-by: Christian Brauner
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • commit a793d79ea3e041081cd7cbd8ee43d0b5e4914a2b upstream.

    The low-level mapping helpers were so far crammed into fs.h. They are
    out of place there. The fs.h header should just contain the higher-level
    mapping helpers that interact directly with vfs objects such as struct
    super_block or struct inode and not the bare mapping helpers. Similarly,
    only vfs and specific fs code shall interact with low-level mapping
    helpers. And so they won't be made accessible automatically through
    regular {g,u}id helpers.

    Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
    Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
    Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
    Cc: Seth Forshee
    Cc: Christoph Hellwig
    Cc: Al Viro
    CC: linux-fsdevel@vger.kernel.org
    Reviewed-by: Amir Goldstein
    Reviewed-by: Seth Forshee
    Signed-off-by: Christian Brauner
    Signed-off-by: Christian Brauner (Microsoft)
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

19 Nov, 2021

2 commits

  • commit 8468e937df1f31411d1e127fa38db064af051fe5 upstream.

    When truncating the page cache of a file THP, the private pages of a
    process should not be unmapped. This incorrect behavior affects dynamic
    shared libraries and causes the related processes to dump core.

    A simple test for a DSO (Prerequisite is the DSO mapped in file THP):

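    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            int fd;

            /* Opening the DSO for write is enough to trigger the truncation. */
            fd = open(argv[1], O_WRONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            close(fd);
            return 0;
    }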

    The test only opens a target DSO and does nothing else. But this
    operation causes one or more processes to dump core. This patch fixes
    this bug.

    Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com
    Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
    Signed-off-by: Rongwei Wang
    Tested-by: Xu Yu
    Cc: Matthew Wilcox (Oracle)
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Hugh Dickins
    Cc: Yang Shi
    Cc: Mike Kravetz
    Cc: Collin Fijalkovich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rongwei Wang
     
  • commit 55fc0d91746759c71bc165bba62a2db64ac98e35 upstream.

    Patch series "fix two bugs for file THP".

    This patch (of 2):

    Transparent huge pages support read-only non-shmem files. The
    file-backed THP is collapsed by khugepaged and truncated when written
    (for shared libraries).

    However, there is a race when multiple writers truncate the same page
    cache concurrently.

    In that case, subpage(s) of file THP can be revealed by find_get_entry
    in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
    truncate_inode_page, as follows:

    page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
    head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
    flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
    raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
    head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
    head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
    page dumped because: VM_BUG_ON_PAGE(PageTail(page))
    ------------[ cut here ]------------
    kernel BUG at mm/truncate.c:213!
    Internal error: Oops - BUG: 0 [#1] SMP
    Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
    CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
    Hardware name: ECS, BIOS 0.0.0 02/06/2015
    pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
    Call trace:
    truncate_inode_page+0x64/0x70
    truncate_inode_pages_range+0x550/0x7e4
    truncate_pagecache+0x58/0x80
    do_dentry_open+0x1e4/0x3c0
    vfs_open+0x38/0x44
    do_open+0x1f0/0x310
    path_openat+0x114/0x1dc
    do_filp_open+0x84/0x134
    do_sys_openat2+0xbc/0x164
    __arm64_sys_openat+0x74/0xc0
    el0_svc_common.constprop.0+0x88/0x220
    do_el0_svc+0x30/0xa0
    el0_svc+0x20/0x30
    el0_sync_handler+0x1a4/0x1b0
    el0_sync+0x180/0x1c0
    Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)

    This patch takes the filemap lock when entering truncate_pagecache(),
    which avoids truncating the same page cache concurrently.

    Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba.com
    Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
    Signed-off-by: Xu Yu
    Signed-off-by: Rongwei Wang
    Suggested-by: Matthew Wilcox (Oracle)
    Tested-by: Song Liu
    Cc: Collin Fijalkovich
    Cc: Hugh Dickins
    Cc: Mike Kravetz
    Cc: William Kucharski
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rongwei Wang
     

23 Aug, 2021

1 commit

  • We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
    off in Fedora and RHEL8. Several other distros have followed suit.

    I've heard of one problem in all that time: Someone migrated from an
    older distro that supported "-o mand" to one that didn't, and the host
    had a fstab entry with "mand" in it which broke on reboot. They didn't
    actually _use_ mandatory locking so they just removed the mount option
    and moved on.

    This patch rips out mandatory locking support wholesale from the kernel,
    along with the Kconfig option and the Documentation file. It also
    changes the mount code to ignore the "mand" mount option instead of
    erroring out, and to throw a big, ugly warning.

    Signed-off-by: Jeff Layton

    Jeff Layton
     

04 Jul, 2021

1 commit

  • Pull vfs name lookup updates from Al Viro:
    "Small namei.c patch series, mostly to simplify the rules for nameidata
    state. It's actually from the previous cycle - but I didn't post it
    for review in time...

    Changes visible outside of fs/namei.c: file_open_root() calling
    conventions change, some freed bits in LOOKUP_... space"

    * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    namei: make sure nd->depth is always valid
    teach set_nameidata() to handle setting the root as well
    take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
    switch file_open_root() to struct path

    Linus Torvalds
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton: (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

01 Jul, 2021

1 commit

  • Transparent huge pages are supported for read-only non-shmem files, but
    are only used for vmas with VM_DENYWRITE. This condition ensures that
    file THPs are protected from writes while an application is running
    (ETXTBSY). Any existing file THPs are then dropped from the page cache
    when a file is opened for write in do_dentry_open(). Since sys_mmap
    ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
    produced by execve().

    Systems that make heavy use of shared libraries (e.g. Android) are unable
    to apply VM_DENYWRITE through the dynamic linker, preventing them from
    benefiting from the resultant reduced contention on the TLB.

    This patch reduces the constraint on file THPs allowing use with any
    executable mapping from a file not opened for write (see
    inode_is_open_for_write()). It also introduces additional conditions to
    ensure that files opened for write will never be backed by file THPs.

    Restricting the use of THPs to executable mappings eliminates the risk
    that a read-only file later opened for write would encounter significant
    latencies due to page cache truncation.

    The ld linker flag '-z max-page-size=(hugepage size)' can be used to
    produce executables with the necessary layout. The dynamic linker must
    map these files' segments at a hugepage-size-aligned vma for the mapping
    to be backed with THPs.
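
    The conditions described above can be summarised in a short model. It is
    a sketch only: struct model_mapping and both functions are hypothetical
    names and do not reflect the kernel's actual checks or data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a file mapping; not a kernel type. */
struct model_mapping {
    bool file_backed;     /* backed by a regular, non-shmem file */
    bool executable;      /* mapping is executable */
    int  file_writers;    /* how many times the file is open for write */
    uint64_t start;       /* start address of the mapping */
};

/* True if addr is a multiple of the hugepage size (for example 2MB). */
bool model_is_hugepage_aligned(uint64_t addr, uint64_t hugepage_size)
{
    return hugepage_size != 0 && (addr % hugepage_size) == 0;
}

/*
 * File THPs are considered only for executable mappings of files that are
 * not open for write and that start at a hugepage-aligned address.
 */
bool model_file_thp_allowed(const struct model_mapping *m, uint64_t hugepage_size)
{
    return m->file_backed && m->executable && m->file_writers == 0 &&
           model_is_hugepage_aligned(m->start, hugepage_size);
}
```

    With a 2MB hugepage size, a mapping that starts at 0x200000 passes the
    alignment test while one that starts at 0x201000 does not.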

    Comparison of the performance characteristics of 4KB and 2MB-backed
    libraries follows; the Android dex2oat tool was used to AOT compile an
    example application on a single ARM core.

    4KB Pages:
    ==========

              count  event_name          # count / runtime
    598,995,035,942  cpu-cycles          # 1.800861 GHz
     81,195,620,851  raw-stall-frontend  # 244.112 M/sec
    347,754,466,597  iTLB-loads          # 1.046 G/sec
      2,970,248,900  iTLB-load-misses    # 0.854122% miss rate

    Total test time: 332.854998 seconds.

    2MB Pages:
    ==========

              count  event_name          # count / runtime
    592,872,663,047  cpu-cycles          # 1.800358 GHz
     76,485,624,143  raw-stall-frontend  # 232.261 M/sec
    350,478,413,710  iTLB-loads          # 1.064 G/sec
        803,233,322  iTLB-load-misses    # 0.229182% miss rate

    Total test time: 329.826087 seconds.
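
    The miss-rate column follows directly from the two iTLB counters. The
    helper below is a hypothetical illustration of that arithmetic and is
    not part of the patch or of the profiling tool.

```c
#include <stdint.h>

/* iTLB miss rate in percent: misses / loads * 100. Returns 0 if loads is 0. */
double model_miss_rate_percent(uint64_t misses, uint64_t loads)
{
    if (loads == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)loads;
}
```

    Feeding in the counts from the two tables gives about 0.854% for 4KB
    pages and about 0.229% for 2MB pages, so the miss rate drops by a factor
    of roughly 3.7.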

    A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:

    /apex/com.android.art/lib64/libart.so
    FilePmdMapped: 4096 kB

    /apex/com.android.art/lib64/libart-compiler.so
    FilePmdMapped: 2048 kB

    Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
    Signed-off-by: Collin Fijalkovich
    Acked-by: Hugh Dickins
    Reviewed-by: William Kucharski
    Acked-by: Song Liu
    Cc: Suren Baghdasaryan
    Cc: Hridya Valsaraju
    Cc: Kalesh Singh
    Cc: Tim Murray
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Collin Fijalkovich
     

28 May, 2021

1 commit

  • The new openat2() syscall verifies that no unknown O-flag values are
    set and returns an error to userspace if they are while the older open
    syscalls like open() and openat() simply ignore unknown flag values:

    #define O_FLAG_CURRENTLY_INVALID (1 << 31)
    struct open_how how = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
            .resolve = 0,
    };

    /* fails */
    fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

    /* succeeds */
    fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

    However, openat2() silently truncates the upper 32 bits meaning:

    #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1U << 31)
    #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1ULL << 40)

    struct open_how how_lower32 = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
    };

    struct open_how how_upper32 = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
    };

    /* fails */
    fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

    /* succeeds */
    fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

    Fix this by preventing the immediate truncation in build_open_flags().

    There's a snafu here though: stripping FMODE_* directly from flags would
    cause the upper 32 bits to be truncated as well due to integer promotion
    rules, since FMODE_* is unsigned int while O_* are signed ints (yuck).

    In addition, struct open_flags currently defines flags to be 32 bit
    which is reasonable. If we simply were to bump it to 64 bit we would
    need to change a lot of code preemptively which doesn't seem worth it.
    So simply add a compile-time check verifying that all currently known
    O_* flags are within the 32 bit range and fail to build if they aren't
    anymore.

    This change shouldn't regress old open syscalls since they silently
    truncate any unknown values anyway. It is a tiny semantic change for
    openat2() but it is very unlikely that people are passing > 32 bit
    unknown flags, and the syscall is relatively new too.
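
    Both pitfalls can be reproduced in plain C. The sketch below is a model
    only: MODEL_VALID_FLAGS, MODEL_FMODE_BIT and the four functions are
    hypothetical and do not correspond to the real flag values or to
    build_open_flags() itself. It also assumes a 32-bit unsigned int.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_VALID_FLAGS UINT64_C(0x00ffffff) /* hypothetical set of known flags */
#define MODEL_FMODE_BIT   0x20u                /* hypothetical unsigned int flag */

/* Old behaviour: narrow to 32 bits first, so bits 32..63 vanish unchecked. */
bool model_accepts_truncating(uint64_t how_flags)
{
    uint32_t flags = (uint32_t)how_flags;

    return (flags & ~(uint32_t)MODEL_VALID_FLAGS) == 0;
}

/* Fixed behaviour: validate the full 64-bit value before narrowing. */
bool model_accepts_full(uint64_t how_flags)
{
    return (how_flags & ~MODEL_VALID_FLAGS) == 0;
}

/*
 * Promotion pitfall: ~ is applied to an unsigned int, and the 32-bit
 * result is zero-extended, so the mask also clears bits 32..63.
 */
uint64_t model_strip_narrow(uint64_t flags)
{
    return flags & ~MODEL_FMODE_BIT;
}

/* Widening before ~ keeps the upper 32 bits intact. */
uint64_t model_strip_wide(uint64_t flags)
{
    return flags & ~(uint64_t)MODEL_FMODE_BIT;
}
```

    A flag at bit 40 slips through the truncating check but is rejected by
    the full-width check, and only the widened mask preserves it when the
    hypothetical FMODE bit is stripped.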

    Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org
    Cc: Christoph Hellwig
    Cc: Aleksa Sarai
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reported-by: Richard Guy Briggs
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Aleksa Sarai
    Reviewed-by: Richard Guy Briggs
    Signed-off-by: Christian Brauner

    Christian Brauner
     

08 Apr, 2021

1 commit


24 Feb, 2021

1 commit

  • Pull idmapped mounts from Christian Brauner:
    "This introduces idmapped mounts which have been in the making for some
    time. Simply put, different mounts can expose the same file or
    directory with different ownership. This initial implementation comes
    with ports for fat, ext4 and with Christoph's port for xfs with more
    filesystems being actively worked on by independent people and
    maintainers.

    Idmapped mounts handle a wide range of long standing use-cases. Here
    are just a few:

    - Idmapped mounts make it possible to easily share files between
    multiple users or multiple machines especially in complex
    scenarios. For example, idmapped mounts will be used in the
    implementation of portable home directories in
    systemd-homed.service(8) where they allow users to move their home
    directory to an external storage device and use it on multiple
    computers where they are assigned different uids and gids. This
    effectively makes it possible to assign random uids and gids at
    login time.

    - It is possible to share files from the host with unprivileged
    containers without having to change ownership permanently through
    chown(2).

    - It is possible to idmap a container's rootfs without having to
    mangle every file. For example, Chromebooks use it to share the
    user's Download folder with their unprivileged containers in their
    Linux subsystem.

    - It is possible to share files between containers with
    non-overlapping idmappings.

    - Filesystems that lack a proper concept of ownership such as fat can
    use idmapped mounts to implement discretionary access (DAC)
    permission checking.

    - They allow users to efficiently change ownership on a per-mount
    basis without having to (recursively) chown(2) all files. In
    contrast to chown(2), changing ownership of large sets of files is
    instantaneous with idmapped mounts. This is especially useful when
    ownership of a whole root filesystem of a virtual machine or
    container is changed. With idmapped mounts a single
    mount_setattr syscall will be sufficient to change the ownership of
    all files.

    - Idmapped mounts always take the current ownership into account as
    idmappings specify what a given uid or gid is supposed to be mapped
    to. This contrasts with the chown(2) syscall which cannot by itself
    take the current ownership of the files it changes into account. It
    simply changes the ownership to the specified uid and gid. This is
    especially problematic when recursively chown(2)ing a large set of
    files which is common with the aforementioned portable home
    directory and container and vm scenarios.

    - Idmapped mounts allow changing ownership locally, restricting it
    to specific mounts, and temporarily as the ownership changes only
    apply as long as the mount exists.

    Several userspace projects have either already put up patches and
    pull-requests for this feature or will do so should you decide to pull
    this:

    - systemd: In a wide variety of scenarios but especially right away
    in their implementation of portable home directories.

    https://systemd.io/HOME_DIRECTORY/

    - container runtimes: containerd, runC, LXD: To share data between
    host and unprivileged containers, unprivileged and privileged
    containers, etc. The pull request for idmapped mounts support in
    containerd, the default Kubernetes runtime, has already been up for
    quite a while now: https://github.com/containerd/containerd/pull/4734

    - The virtio-fs developers and several users have expressed interest
    in using this feature with virtual machines once virtio-fs is
    ported.

    - ChromeOS: Sharing host-directories with unprivileged containers.

    I've tightly synced with all those projects and all of those listed
    here have also expressed their need/desire for this feature on the
    mailing list. For more info on how people use this there's a bunch of
    talks about this too. Here's just two recent ones:

    https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
    https://fosdem.org/2021/schedule/event/containers_idmap/

    This comes with an extensive xfstests suite covering both ext4 and
    xfs:

    https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

    It covers truncation, creation, opening, xattrs, vfscaps, setid
    execution, setgid inheritance and more both with idmapped and
    non-idmapped mounts. It already helped to discover an unrelated xfs
    setgid inheritance bug which has since been fixed in mainline. It will
    be sent for inclusion with the xfstests project should you decide to
    merge this.

    In order to support per-mount idmappings vfsmounts are marked with
    user namespaces. The idmapping of the user namespace will be used to
    map the ids of vfs objects when they are accessed through that mount.
    By default all vfsmounts are marked with the initial user namespace.
    The initial user namespace is used to indicate that a mount is not
    idmapped. All operations behave as before and this is verified in the
    testsuite.

    Based on prior discussions we want to attach the whole user namespace
    and not just a dedicated idmapping struct. This allows us to reuse all
    the helpers that already exist for dealing with idmappings instead of
    introducing a whole new range of helpers. In addition, if we decide in
    the future that we are confident enough to enable unprivileged users
    to setup idmapped mounts the permission checking can take into account
    whether the caller is privileged in the user namespace the mount is
    currently marked with.

    The user namespace the mount will be marked with can be specified by
    passing a file descriptor referring to the user namespace as an
    argument to the new mount_setattr() syscall together with the new
    MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
    of extensibility.

    The following conditions must be met in order to create an idmapped
    mount:

    - The caller must currently have the CAP_SYS_ADMIN capability in the
    user namespace the underlying filesystem has been mounted in.

    - The underlying filesystem must support idmapped mounts.

    - The mount must not already be idmapped. This also implies that the
    idmapping of a mount cannot be altered once it has been idmapped.

    - The mount must be a detached/anonymous mount, i.e. it must have
    been created by calling open_tree() with the OPEN_TREE_CLONE flag
    and it must not already have been visible in the filesystem.

    The last two points guarantee easier semantics for userspace and the
    kernel and make the implementation significantly simpler.

    By default vfsmounts are marked with the initial user namespace and no
    behavioral or performance changes are observed.

    The manpage with a detailed description can be found here:

    https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8

    In order to support idmapped mounts, filesystems need to be changed
    and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
    patches to convert individual filesystems are not very large or
    complicated overall as can be seen from the included fat, ext4, and
    xfs ports. Patches for other filesystems are actively worked on and
    will be sent out separately. The xfstests suite can be used to verify
    that a port has been done correctly.

    The mount_setattr() syscall is motivated independent of the idmapped
    mounts patches and it's been around since July 2019. One of the most
    valuable features of the new mount api is the ability to perform
    mounts based on file descriptors only.

    Together with the lookup restrictions available in the openat2()
    RESOLVE_* flag namespace which we added in v5.6 this is the first time
    we are close to hardened and race-free (e.g. symlinks) mounting and
    path resolution.

    While userspace has started porting to the new mount api to mount
    proper filesystems and create new bind-mounts it is currently not
    possible to change mount options of an already existing bind mount in
    the new mount api since the mount_setattr() syscall is missing.

    With the addition of the mount_setattr() syscall we remove this last
    restriction and userspace can now fully port to the new mount api,
    covering every use-case the old mount api could. We also add the
    crucial ability to recursively change mount options for a whole mount
    tree, both removing and adding mount options at the same time. This
    syscall has been requested multiple times by various people and
    projects.

    There is a simple tool available at

    https://github.com/brauner/mount-idmapped

    that allows creating idmapped mounts so people can play with this
    patch series. I'll add support for the regular mount binary should you
    decide to pull this in the following weeks.

    Here's an example of a simple idmapped mount of another user's home
    directory:

    u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

    u1001@f2-vm:/$ ls -al /home/ubuntu/
    total 28
    drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
    drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
    -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
    -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ ls -al /mnt/
    total 28
    drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
    drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
    -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
    -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ touch /mnt/my-file

    u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

    u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

    u1001@f2-vm:/$ ls -al /mnt/my-file
    -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

    u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
    -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

    u1001@f2-vm:/$ getfacl /mnt/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: mnt/my-file
    # owner: u1001
    # group: u1001
    user::rw-
    user:u1001:rwx
    group::rw-
    mask::rwx
    other::r--

    u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: home/ubuntu/my-file
    # owner: ubuntu
    # group: ubuntu
    user::rw-
    user:ubuntu:rwx
    group::rw-
    mask::rwx
    other::r--"
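
    The effect seen in the session above can be modelled with a single
    idmapping range. This sketch assumes that "both:1000:1001:1" means: map
    one id starting at 1000 to 1001, for uids and gids alike. The struct, the
    functions and the MODEL_ID_UNMAPPED marker are hypothetical and are not
    the kernel's representation.

```c
#include <stdint.h>

#define MODEL_ID_UNMAPPED UINT32_MAX /* hypothetical "no mapping" marker */

/* One range: ids [from, from + count) appear as [to, to + count). */
struct model_id_range {
    uint32_t from;
    uint32_t to;
    uint32_t count;
};

/* Map an id stored in the filesystem to the id seen through the mount. */
uint32_t model_map_id(const struct model_id_range *r, uint32_t id)
{
    if (id < r->from || id - r->from >= r->count)
        return MODEL_ID_UNMAPPED;
    return r->to + (id - r->from);
}

/* Reverse direction: an id used through the mount back to the stored id. */
uint32_t model_unmap_id(const struct model_id_range *r, uint32_t id)
{
    if (id < r->to || id - r->to >= r->count)
        return MODEL_ID_UNMAPPED;
    return r->from + (id - r->to);
}
```

    With the range {1000, 1001, 1}, files owned by uid 1000 show up as uid
    1001 under /mnt, and a file created by uid 1001 through /mnt is stored
    as uid 1000, which matches the listings for /mnt and /home/ubuntu.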

    * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
    xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
    xfs: support idmapped mounts
    ext4: support idmapped mounts
    fat: handle idmapped mounts
    tests: add mount_setattr() selftests
    fs: introduce MOUNT_ATTR_IDMAP
    fs: add mount_setattr()
    fs: add attr_flags_to_mnt_flags helper
    fs: split out functions to hold writers
    namespace: only take read lock in do_reconfigure_mnt()
    mount: make {lock,unlock}_mount_hash() static
    namespace: take lock_mount_hash() directly when changing flags
    nfs: do not export idmapped mounts
    overlayfs: do not mount on top of idmapped mounts
    ecryptfs: do not mount on top of idmapped mounts
    ima: handle idmapped mounts
    apparmor: handle idmapped mounts
    fs: make helpers idmap mount aware
    exec: handle idmapped mounts
    would_dump: handle idmapped mounts
    ...

    Linus Torvalds
     

24 Jan, 2021

5 commits

  • For core file operations such as changing directories or chrooting,
    determining file access, changing mode or ownership the vfs will verify
    that the caller is privileged over the inode. Extend the various helpers
    to handle idmapped mounts. If the inode is accessed through an idmapped
    mount, map it into the mount's user namespace. Afterwards the permission
    checks are identical to non-idmapped mounts. When changing file
    ownership we need to map the uid and gid from the mount's user
    namespace. If the initial user namespace is passed nothing changes so
    non-idmapped mounts will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-17-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: James Morris
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • When truncating files the vfs will verify that the caller is privileged
    over the inode. Extend it to handle idmapped mounts. If the inode is
    accessed through an idmapped mount it is mapped according to the mount's
    user namespace. Afterwards the permission checks are identical to
    non-idmapped mounts. If the initial user namespace is passed nothing
    changes so non-idmapped mounts will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • When file attributes are changed most filesystems rely on the
    setattr_prepare(), setattr_copy(), and notify_change() helpers for
    initialization and permission checking. Let them handle idmapped mounts.
    If the inode is accessed through an idmapped mount, map it into the
    mount's user namespace. Afterwards the checks are identical to
    non-idmapped mounts. If the initial user namespace is passed nothing
    changes so non-idmapped mounts will see identical behavior as before.

    Helpers that perform checks on the ia_uid and ia_gid fields in struct
    iattr assume that ia_uid and ia_gid are intended values and have already
    been mapped correctly at the userspace-kernelspace boundary as we
    already do today. If the initial user namespace is passed nothing
    changes so non-idmapped mounts will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The two helpers inode_permission() and generic_permission() are used by
    the vfs to perform basic permission checking by verifying that the
    caller is privileged over an inode. In order to handle idmapped mounts
    we extend the two helpers with an additional user namespace argument.
    On idmapped mounts the two helpers will make sure to map the inode
    according to the mount's user namespace and then perform identical
    permission checks to inode_permission() and generic_permission(). If the
    initial user namespace is passed nothing changes so non-idmapped mounts
    will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Add two simple helpers to check permissions on a file and path
    respectively and convert over some callers. It simplifies quite a few
    codepaths and also reduces the churn in later patches quite a bit.
    Christoph also correctly points out that this makes codepaths (e.g.
    ioctls) way easier to follow that would otherwise have to do more
    complex argument passing than necessary.

    Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     

05 Jan, 2021

1 commit

  • Now that we support non-blocking path resolution internally, expose it
    via openat2() in the struct open_how ->resolve flags. This allows
    applications using openat2() to limit path resolution to the extent that
    it is already cached.

    If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
    will return -1/-EAGAIN.

    Cc: Al Viro
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro

    Jens Axboe
     

16 Dec, 2020

1 commit

  • …biederm/user-namespace

    Pull execve updates from Eric Biederman:
    "This set of changes ultimately fixes the interaction of posix file
    lock and exec. Fundamentally most of the change is just moving where
    unshare_files is called during exec, and tweaking the users of
    files_struct so that the count of files_struct is not unnecessarily
    played with.

    Along the way fcheck and related helpers were renamed to more
    accurately reflect what they do.

    There were also many other small changes that fell out, as this is the
    first time in a long time much of this code has been touched.

    Benchmarks haven't turned up any practical issues but Al Viro has
    observed a possibility for a lot of pounding on task_lock. So I have
    some changes in progress to convert put_files_struct to always rcu
    free files_struct. That wasn't ready for the merge window so that will
    have to wait until next time"

    * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    exec: Move io_uring_task_cancel after the point of no return
    coredump: Document coredump code exclusively used by cell spufs
    file: Remove get_files_struct
    file: Rename __close_fd_get_file close_fd_get_file
    file: Replace ksys_close with close_fd
    file: Rename __close_fd to close_fd and remove the files parameter
    file: Merge __alloc_fd into alloc_fd
    file: In f_dupfd read RLIMIT_NOFILE once.
    file: Merge __fd_install into fd_install
    proc/fd: In fdinfo seq_show don't use get_files_struct
    bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
    proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
    file: Implement task_lookup_next_fd_rcu
    kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
    proc/fd: In tid_fd_mode use task_lookup_fd_rcu
    file: Implement task_lookup_fd_rcu
    file: Rename fcheck lookup_fd_rcu
    file: Replace fcheck_files with files_lookup_fd_rcu
    file: Factor files_lookup_fd_locked out of fcheck_files
    file: Rename __fcheck_files to files_lookup_fd_raw
    ...

    Linus Torvalds
     

11 Dec, 2020

1 commit

  • The function __close_fd was added to support binder[1]. Now that
    binder has been fixed to no longer need __close_fd[2] all calls
    to __close_fd pass current->files.

    Therefore transform the files parameter into a local variable
    initialized to current->files, and rename __close_fd to close_fd to
    reflect this change, and keep it in sync with the similar changes to
    __alloc_fd, and __fd_install.

    This removes the need for callers to worry about the extra care that
    needs to be taken if anything except current->files is passed, by
    limiting the callers to only operating on current->files.

    [1] 483ce1d4b8c3 ("take descriptor-related part of close() to file.c")
    [2] 44d8047f1d87 ("binder: use standard functions to allocate fds")
    Acked-by: Christian Brauner
    v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

03 Dec, 2020

1 commit

  • This was an oversight in the original implementation, as it makes no
    sense to specify both scoping flags to the same openat2(2) invocation
    (before this patch, the result of such an invocation was equivalent to
    RESOLVE_IN_ROOT being ignored).

    This is a userspace-visible ABI change, but the only user of openat2(2)
    at the moment is LXC which doesn't specify both flags and so no
    userspace programs will break as a result.

    Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall")
    Signed-off-by: Aleksa Sarai
    Acked-by: Christian Brauner
    Cc: # v5.6+
    Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
    Signed-off-by: Christian Brauner

    Aleksa Sarai
     

13 Aug, 2020

1 commit

  • The execve(2)/uselib(2) syscalls have always rejected non-regular files.
    Recently, it was noticed that a deadlock was introduced when trying to
    execute pipes, as the S_ISREG() test was happening too late. This was
    fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
    during execve()"), but it was added after inode_permission() had already
    run, which meant LSMs could see bogus attempts to execute non-regular
    files.

    Move the test into the other inode type checks (which already look for
    other pathological conditions[1]). Since there is no need to use
    FMODE_EXEC while we still have access to "acc_mode", also switch the test
    to MAY_EXEC.

    Also include a comment with the redundant S_ISREG() checks at the end of
    execve(2)/uselib(2) to note that they are present to avoid any mistakes.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
        struct open_flags open_exec_flags = {
            .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
            .acc_mode = MAY_EXEC,
            ...
        do_filp_open(dfd, filename, open_flags)
            path_openat(nameidata, open_flags, flags)
                file = alloc_empty_file(open_flags, current_cred());
                do_open(nameidata, file, open_flags)
                    may_open(path, acc_mode, open_flag)
                        /* new location of MAY_EXEC vs S_ISREG() test */
                        inode_permission(inode, MAY_OPEN | acc_mode)
                            security_inode_permission(inode, acc_mode)
                    vfs_open(path, file)
                        do_dentry_open(file, path->dentry->d_inode, open)
                            /* old location of FMODE_EXEC vs S_ISREG() test */
                            security_file_open(f)
                            open()
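
    The rule that execve(2) rejects non-regular files can be observed
    from userspace with a small probe. This is a sketch and not part of
    the patch: exec_errno() is a hypothetical helper, and it must only be
    given paths that are known not to be executable, since a successful
    execve() would replace the calling process.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Attempt to execute 'path' and return the errno of the failed attempt.
 * Only call this with a path that cannot be executed (a directory, a
 * FIFO, ...): on success execve() does not return.
 */
static int exec_errno(const char *path)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };

	execve(path, argv, envp);
	return errno;
}
```

    For a directory such as "/" the attempt is expected to fail with
    EACCES, which execve(2) documents for files that are not regular
    files.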

    [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     

08 Aug, 2020

1 commit

  • Pull init and set_fs() cleanups from Al Viro:
    "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

    * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
    init: add an init_dup helper
    init: add an init_utimes helper
    init: add an init_stat helper
    init: add an init_mknod helper
    init: add an init_mkdir helper
    init: add an init_symlink helper
    init: add an init_link helper
    init: add an init_eaccess helper
    init: add an init_chmod helper
    init: add an init_chown helper
    init: add an init_chroot helper
    init: add an init_chdir helper
    init: add an init_rmdir helper
    init: add an init_unlink helper
    init: add an init_umount helper
    init: add an init_mount helper
    init: mark create_dev as __init
    init: mark console_on_rootfs as __init
    init: initialize ramdisk_execute_command at compile time
    devtmpfs: refactor devtmpfsd()
    ...

    Linus Torvalds
     

31 Jul, 2020

7 commits


16 Jul, 2020

2 commits


17 Jun, 2020

2 commits

  • One of the use-cases of close_range() is to drop file descriptors just before
    execve(). This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    As pointed out by Linus, it might be desirable to have this be a part of
    close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

    This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
    maximum number of file descriptors to copy from the old struct files. When the
    user requests that all file descriptors are supposed to be closed via
    close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
    to do any of the heavy fput() work for everything above min.

    The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
    fact currently share our file descriptor table we create a new private copy.
    We then close all fds in the requested range and finally after we're done we
    install the new fd table.
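
    From userspace, the two-call sequence quoted at the top of this
    message would collapse into one call. A minimal sketch, with several
    assumptions: close_range_unshared() is a hypothetical helper, the
    syscall number 436 and the flag value (1U << 1) are hard-coded
    fallbacks for older headers, and the fallback path simply replays the
    explicit unshare() plus close() loop.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_close_range
#define SYS_close_range 436		/* assumption: generic syscall number */
#endif
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)	/* assumption: uapi header value */
#endif
#ifndef CLONE_FILES
#define CLONE_FILES 0x00000400
#endif

/*
 * Give the caller a private fd table and close [first, last] in it.
 * Other tasks that shared the old table keep their descriptors.
 */
static int close_range_unshared(unsigned int first, unsigned int last)
{
	unsigned int fd;
	long limit;

	if (syscall(SYS_close_range, first, last, CLOSE_RANGE_UNSHARE) == 0)
		return 0;
	/* Some seccomp profiles report EPERM for syscalls they do not know. */
	if (errno != ENOSYS && errno != EPERM)
		return -1;

	/* Fallback: the explicit sequence, unshare first and then close. */
	if (syscall(SYS_unshare, (unsigned long)CLONE_FILES) != 0)
		return -1;
	limit = sysconf(_SC_OPEN_MAX);
	if (limit < 0)
		limit = 1024;
	for (fd = first; fd <= last && (long)fd < limit; fd++)
		close((int)fd);
	return 0;
}
```

    A caller about to execve() would typically pass (3, ~0U) so that
    only stdin, stdout and stderr survive.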

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • This adds the close_range() syscall. It allows a task to efficiently close a
    range of file descriptors, up to and including all of its file descriptors.

    I was contacted by FreeBSD as they wanted to have the same close_range()
    syscall as we proposed here. We've coordinated this and in the meantime, Kyle
    was fast enough to merge close_range() into FreeBSD already in April:
    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836
    and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
    once it's merged in Linux too. Python is in the process of switching to
    close_range() on FreeBSD and they are waiting on us to merge this to switch on
    Linux as well: https://bugs.python.org/issue38061

    The syscall came up in a recent discussion around the new mount API and
    making new file descriptor types cloexec by default. During this
    discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
    syscall in this manner has been requested by various people over time.

    First, it helps to close all file descriptors of an exec()ing task. This
    can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that file
    descriptors are not cloexec by default. This is aggravated by the fact that
    we can't just switch them over without massively regressing userspace. For
    a whole class of programs having an in-kernel method of closing all file
    descriptors is very helpful (e.g. daemons, service managers, programming
    language standard libraries, container managers etc.).
    (Please note, unshare(CLONE_FILES) should only be needed if the calling
    task is multi-threaded and shares the file descriptor table with another
    thread in which case two threads could race with one thread allocating file
    descriptors and the other one closing them via close_range(). For the
    general case close_range() before the execve() is sufficient.)
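
    For the single-threaded case, the pattern could be wrapped up as in
    the sketch below. The names close_past_stderr() and
    spawn_with_clean_fds() are hypothetical, the syscall number 436 is a
    hard-coded fallback for older headers, and the close() loop used when
    the syscall is unavailable is an assumption, not part of this patch.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_close_range
#define SYS_close_range 436	/* assumption: generic syscall number */
#endif

/* Runs in the child only: close everything past stderr. */
static void close_past_stderr(void)
{
	long fd, limit;

	if (syscall(SYS_close_range, 3U, ~0U, 0U) == 0)
		return;

	/* Assumed fallback when close_range() is unavailable: the slow loop. */
	limit = sysconf(_SC_OPEN_MAX);
	if (limit < 0)
		limit = 1024;
	for (fd = 3; fd < limit; fd++)
		close((int)fd);
}

/* Fork, drop inherited descriptors, exec, and return the exit status. */
static int spawn_with_clean_fds(const char *path, char *const argv[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		close_past_stderr();
		execv(path, argv);
		_exit(127);	/* exec failed */
	}
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

    Because the child created by fork() has its own copy of the fd
    table, no unshare(CLONE_FILES) is needed here.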

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
    file descriptor. From looking at various large(ish) userspace code bases
    this or similar patterns are very common in:
    - service managers (cf. [4])
    - libcs (cf. [6])
    - container runtimes (cf. [5])
    - programming language runtimes/standard libraries
      - Python (cf. [2])
      - Rust (cf. [7], [8])
    As Dmitry pointed out there's even a long-standing glibc bug about missing
    kernel support for this task (cf. [3]).
    In addition, the syscall will also work for tasks that do not have procfs
    mounted and on kernels that do not have procfs support compiled in. In such
    situations the only way to make sure that all file descriptors are closed
    is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
    OPEN_MAX trickery (cf. comment [8] on Rust).

    The performance is striking. For good measure, comparing the following
    simple close_all_fds() userspace implementation that is essentially just
    glibc's version in [6]:

    static int close_all_fds(void)
    {
            int dir_fd;
            DIR *dir;
            struct dirent *direntp;

            dir = opendir("/proc/self/fd");
            if (!dir)
                    return -1;
            dir_fd = dirfd(dir);
            while ((direntp = readdir(dir))) {
                    int fd;
                    if (strcmp(direntp->d_name, ".") == 0)
                            continue;
                    if (strcmp(direntp->d_name, "..") == 0)
                            continue;
                    fd = atoi(direntp->d_name);
                    if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                            continue;
                    close(fd);
            }
            closedir(dir);
            return 0;
    }

    to close_range() yields:
    1. closing 4 open files:
    - close_all_fds(): ~280 us
    - close_range(): ~24 us

    2. closing 1000 open files:
    - close_all_fds(): ~5000 us
    - close_range(): ~800 us

    close_range() is designed to allow for some flexibility. Specifically, it
    does not simply always close all open file descriptors of a task. Instead,
    callers can specify an upper bound.
    This is e.g. useful for scenarios where specific file descriptors are
    created with well-known numbers that are supposed to be excluded from
    getting closed.
    For extra paranoia close_range() comes with a flags argument. This can e.g.
    be used to implement extensions. One can imagine userspace wanting to stop
    at the first error instead of ignoring errors under certain circumstances.
    There might be other valid ideas in the future. In any case, a flag
    argument doesn't hurt and keeps us on the safe side.

    From an implementation side this is kept rather dumb. It saw some input
    from David and Jann but all nonsense is obviously my own!
    - Errors to close file descriptors are currently ignored. (Could be changed
      by setting a flag in the future if needed.)
    - __close_range() is a rather simplistic wrapper around __close_fd().
      My reasoning behind this is based on the nature of how __close_fd() needs
      to release an fd. But maybe I misunderstood specifics:
      We take the files_lock and rcu-dereference the fdtable of the calling
      task, we find the entry in the fdtable, get the file and need to release
      files_lock before calling filp_close().
      In the meantime the fdtable might have been altered so we can't just
      retake the spinlock and keep the old rcu-reference of the fdtable
      around. Instead we need to grab a fresh reference to the fdtable.
      If my reasoning is correct then there's really no point in fancyfying
      __close_range(): We just need to rcu-dereference the fdtable of the
      calling task once to cap the max_fd value correctly and then go on
      calling __close_fd() in a loop.

    /* References */
    [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
    [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
    [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
    [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
    [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
    [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
         Note that this is an internal implementation that is not exported.
         Currently, libc seems to not provide an exported version of this
         because of missing kernel support to do this.
         Note, in a recent patch series Florian made grantpt() a nop thereby
         removing the code referenced here.
    [7]: https://github.com/rust-lang/rust/issues/12148
    [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
         Rust's solution is slightly different but is equally unperformant.
         Rust calls getdtablesize() which is a glibc library function that
         simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
         goes on to call close() on each fd. That's obviously overkill for most
         tasks. Rarely, tasks - especially non-daemons - hit RLIMIT_NOFILE or
         OPEN_MAX.
         Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
         to 1024. Even in this case, there's a very high chance that in the
         common case Rust is calling the close() syscall 1021 times pointlessly
         if the task just has 0, 1, and 2 open.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Kyle Evans
    Cc: Jann Horn
    Cc: David Howells
    Cc: Dmitry V. Levin
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Florian Weimer
    Cc: linux-api@vger.kernel.org

    Christian Brauner
     

03 Jun, 2020

2 commits

  • Merge updates from Andrew Morton:
    "A few little subsystems and a start of a lot of MM patches.

    Subsystems affected by this patch series: squashfs, ocfs2, parisc,
    vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
    swap, memcg, pagemap, memory-failure, vmalloc, kasan"

    * emailed patches from Andrew Morton: (128 commits)
    kasan: move kasan_report() into report.c
    mm/mm_init.c: report kasan-tag information stored in page->flags
    ubsan: entirely disable alignment checks under UBSAN_TRAP
    kasan: fix clang compilation warning due to stack protector
    x86/mm: remove vmalloc faulting
    mm: remove vmalloc_sync_(un)mappings()
    x86/mm/32: implement arch_sync_kernel_mappings()
    x86/mm/64: implement arch_sync_kernel_mappings()
    mm/ioremap: track which page-table levels were modified
    mm/vmalloc: track which page-table levels were modified
    mm: add functions to track page directory modifications
    s390: use __vmalloc_node in stack_alloc
    powerpc: use __vmalloc_node in alloc_vm_stack
    arm64: use __vmalloc_node in arch_alloc_vmap_stack
    mm: remove vmalloc_user_node_flags
    mm: switch the test_vmalloc module to use __vmalloc_node
    mm: remove __vmalloc_node_flags_caller
    mm: remove both instances of __vmalloc_node_flags
    mm: remove the prot argument to __vmalloc_node
    mm: remove the pgprot argument to __vmalloc
    ...

    Linus Torvalds
     
  • Patch series "vfs: have syncfs() return error when there are writeback
    errors", v6.

    Currently, syncfs does not return errors when one of the inodes fails to
    be written back. It will return errors based on the legacy AS_EIO and
    AS_ENOSPC flags when syncing out the block device fails, but that's not
    particularly helpful for filesystems that aren't backed by a blockdev.
    It's also possible for a stray sync to lose those errors.

    The basic idea in this set is to track writeback errors at the
    superblock level, so that we can quickly and easily check whether
    something bad happened without having to fsync each file individually.
    syncfs is then changed to reliably report writeback errors after they
    occur, much in the same fashion as fsync does now.

    This patch (of 2):

    Usually we suggest that applications call fsync when they want to ensure
    that all data written to the file has made it to the backing store, but
    that can be inefficient when there are a lot of open files.

    Calling syncfs on the filesystem can be more efficient in some
    situations, but the error reporting doesn't currently work the way most
    people expect. If a single inode on a filesystem reports a writeback
    error, syncfs won't necessarily return an error. syncfs only returns an
    error if __sync_blockdev fails, and on some filesystems that's a no-op.

    It would be better if syncfs reported an error if there were any
    writeback failures. Then applications could call syncfs to see if there
    are any errors on any open files, and could then call fsync on all of
    the other descriptors to figure out which one failed.

    This patch adds a new errseq_t to struct super_block, and has
    mapping_set_error also record writeback errors there.

    To report those errors, we also need to keep an errseq_t in struct file
    to act as a cursor. This patch adds a dedicated field for that purpose,
    which slots nicely into 4 bytes of padding at the end of struct file on
    x86_64.

    An earlier version of this patch used an O_PATH file descriptor to cue
    the kernel that the open file should track the superblock error and not
    the inode's writeback error.

    I think that API is just too weird though. This is simpler and should
    make syncfs error reporting "just work" even if someone is multiplexing
    fsync and syncfs on the same fds.

    Signed-off-by: Jeff Layton
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Andres Freund
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: David Howells
    Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
    Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

14 May, 2020

2 commits

  • POSIX defines faccessat() as having a fourth "flags" argument, while the
    linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and
    AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

    Add a new faccessat2(2) syscall with the added flags argument and implement
    both flags.

    The value of AT_EACCESS is defined in glibc headers to be the same as
    AT_REMOVEDIR. Use this value for the kernel interface as well, together
    with the explanatory comment.

    Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
    be useful and is trivial to implement.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Split out a helper that overrides the credentials in preparation for
    actually doing the access check.

    This prepares for the next patch that optionally disables the creds
    override.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

03 Apr, 2020

1 commit

  • Pull vfs pathwalk sanitizing from Al Viro:
    "Massive pathwalk rewrite and cleanups.

    Several iterations have been posted; hopefully this thing is getting
    readable and understandable now. Pretty much all parts of pathname
    resolutions are affected...

    The branch is identical to what has sat in -next, except for commit
    message in "lift all calls of step_into() out of follow_dotdot/
    follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
    commit message changed there."

    * 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
    lookup_open(): don't bother with fallbacks to lookup+create
    atomic_open(): no need to pass struct open_flags anymore
    open_last_lookups(): move complete_walk() into do_open()
    open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
    open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
    open_last_lookups(): consolidate fsnotify_create() calls
    take post-lookup part of do_last() out of loop
    link_path_walk(): sample parent's i_uid and i_mode for the last component
    __nd_alloc_stack(): make it return bool
    reserve_stack(): switch to __nd_alloc_stack()
    pick_link(): take reserving space on stack into a new helper
    pick_link(): more straightforward handling of allocation failures
    fold path_to_nameidata() into its only remaining caller
    pick_link(): pass it struct path already with normal refcounting rules
    fs/namei.c: kill follow_mount()
    non-RCU analogue of the previous commit
    helper for mount rootwards traversal
    follow_dotdot(): be lazy about changing nd->path
    follow_dotdot_rcu(): be lazy about changing nd->path
    follow_dotdot{,_rcu}(): massage loops
    ...

    Linus Torvalds
     

13 Mar, 2020

1 commit

  • several iterations of ->atomic_open() calling conventions ago, we
    used to need fput() if ->atomic_open() failed at some point after
    successful finish_open(). Now (since 2016) it's not needed -
    struct file carries enough state to make fput() work regardless
    of the point in struct file lifecycle and discarding it on
    failure exits in open() got unified. Unfortunately, I'd missed
    the fact that we had an instance of ->atomic_open() (cifs one)
    that used to need that fput(), as well as the stale comment in
    finish_open() demanding such late failure handling. Trivially
    fixed...

    Fixes: fe9ec8291fca "do_last(): take fput() on error after opening to out:"
    Cc: stable@kernel.org # v4.7+
    Signed-off-by: Al Viro

    Al Viro