29 Dec, 2018

1 commit

  • commit 94f82008ce30e2624537d240d64ce718255e0b80 upstream.

    This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd.

    commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
    enabled mknod() in user namespaces for userns root if CAP_MKNOD is
    available. However, these device nodes are useless since any filesystem
    mounted from a non-initial user namespace will set the SB_I_NODEV flag on
    the filesystem. Now, when a device node is created in a non-initial user
    namespace a call to open() on said device node will fail due to:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                    !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

    The problem with this is that as of the aforementioned commit mknod()
    creates partially functional device nodes in non-initial user namespaces.
    In particular, it has the consequence that as of the aforementioned commit
    open() will be more privileged with respect to device nodes than mknod().
    Before it was the other way around. Specifically, if mknod() succeeded
    then it was transparent for any userspace application that a fatal error
    must have occurred when open() failed.
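
    The asymmetry described above can be sketched as a small userspace
    model. This is only an illustration: the flag values, struct
    demo_mount and every demo_* function are hypothetical stand-ins, not
    kernel definitions, and mknod() is reduced to a bare capability check.

```c
#include <stdbool.h>

/* Illustrative stand-ins; the kernel's real MNT_NODEV and SB_I_NODEV
 * values and structures differ. */
#define DEMO_MNT_NODEV  0x1u
#define DEMO_SB_I_NODEV 0x2u

struct demo_mount {
    unsigned int mnt_flags;  /* per-mount flags, e.g. an explicit nodev option */
    unsigned int sb_iflags;  /* internal superblock flags set by the kernel */
};

/* Mirrors the shape of may_open_dev(): either flag blocks device opens. */
bool demo_may_open_dev(const struct demo_mount *m)
{
    return !(m->mnt_flags & DEMO_MNT_NODEV) &&
           !(m->sb_iflags & DEMO_SB_I_NODEV);
}

/* Model of the reverted behaviour: mknod() succeeds whenever the caller
 * holds CAP_MKNOD, regardless of the flags above. */
bool demo_may_mknod(bool has_cap_mknod)
{
    return has_cap_mknod;
}

/* True when a node could be created but could never be opened. */
bool demo_node_is_unusable(const struct demo_mount *m, bool has_cap_mknod)
{
    return demo_may_mknod(has_cap_mknod) && !demo_may_open_dev(m);
}
```

    In this model a filesystem carrying the internal flag yields an
    unusable node for any caller that is allowed to create one, which is
    the partially functional state the revert removes.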

    All of this breaks multiple userspace workloads and a widespread assumption
    about how to handle mknod(). Basically, all container runtimes and systemd
    live by the slogan "ask for forgiveness not permission" when running user
    namespace workloads. For mknod() the assumption is that if the syscall
    succeeds the device nodes are useable irrespective of whether it succeeds
    in a non-initial user namespace or not. This logic was chosen explicitly
    to allow for the glorious day when mknod() will actually be able to create
    fully functional device nodes in user namespaces.
    A specific problem people are already running into when running 4.18 rc
    kernels is failing systemd services. For any distro that is run in a
    container systemd services started with the PrivateDevices= property set
    will fail to start since the device nodes in question cannot be
    opened (cf. the arguments in [1]).

    Full disclosure, Seth made the very sound argument that it is already
    possible to end up with partially functional device nodes. Any filesystem
    mounted with MS_NODEV set will allow mknod() to succeed but will not allow
    open() to succeed. The difference to the case here is that the MS_NODEV
    case is transparent to userspace since it is an explicitly set mount option
    while the SB_I_NODEV case is an implicit property enforced by the kernel
    and hence opaque to userspace.
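
    As a rough illustration of why the explicit mount option counts as
    transparent, userspace can query it with statvfs(). The helper name
    below is hypothetical, and the fallback value for ST_NODEV assumes
    Linux; the sketch makes no attempt to detect the internal superblock
    flag, which the message above describes as opaque to userspace.

```c
#define _GNU_SOURCE            /* glibc exposes ST_NODEV only with this */
#include <sys/statvfs.h>

#ifndef ST_NODEV
#define ST_NODEV 4             /* Linux value, used if the headers hide it */
#endif

/* Returns 1 if the mount holding path has the nodev option set,
 * 0 if it does not, and -1 if statvfs() fails. */
int demo_mount_is_nodev(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;
    return (sv.f_flag & ST_NODEV) ? 1 : 0;
}
```

    A runtime could call this before mknod() to predict the explicit
    nodev case; no comparable query is shown here for the implicit one.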

    [1]: https://github.com/systemd/systemd/pull/9483

    Signed-off-by: Christian Brauner
    Cc: "Eric W. Biederman"
    Cc: Seth Forshee
    Cc: Serge Hallyn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

24 Aug, 2018

1 commit

  • Disallows open of FIFOs or regular files not owned by the user in world
    writable sticky directories, unless the owner is the same as that of the
    directory or the file is opened without the O_CREAT flag. The purpose
    is to make data spoofing attacks harder. This protection can be turned
    on and off separately for FIFOs and regular files via sysctl, just like
    the symlinks/hardlinks protection. This patch is based on Openwall's
    "HARDEN_FIFO" feature by Solar Designer.

    This is a brief list of old vulnerabilities that could have been prevented
    by this feature, some of them even allow for privilege escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without any
    mention of this particular attack vector. In fact, before
    hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
    vehicle to exploit them.
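
    The open restriction described at the top of this message can be
    written as a predicate. This is a simplified model of the prose only,
    not the kernel's code: struct demo_inode, the function name and the
    single on/off switch (standing in for the per-type sysctls) are
    assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700      /* for S_ISVTX and the S_IF* constants */
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

struct demo_inode {
    mode_t mode;   /* file type and permission bits */
    uid_t  uid;    /* owner */
};

/* Returns true if the open should be refused under the rule as described. */
bool demo_sticky_open_denied(const struct demo_inode *dir,
                             const struct demo_inode *file,
                             uid_t opener, int open_flags, bool protection_on)
{
    if (!protection_on)
        return false;
    if (!S_ISFIFO(file->mode) && !S_ISREG(file->mode))
        return false;           /* only FIFOs and regular files are covered */
    if (!(open_flags & O_CREAT))
        return false;           /* opens without O_CREAT are unaffected */
    if (!(dir->mode & S_ISVTX) || !(dir->mode & S_IWOTH))
        return false;           /* not a world-writable sticky directory */
    if (file->uid == opener || file->uid == dir->uid)
        return false;           /* own file, or owned by the directory owner */
    return true;
}
```

    With a /tmp-like directory owned by uid 0 and a FIFO owned by uid
    1001, an open with O_CREAT by uid 1000 is refused in this model,
    while the same open without O_CREAT is allowed.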

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subject]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salvatore Mesoraca
     

22 Aug, 2018

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This contains two new features:

    - Stack file operations: this allows removal of several hacks from
    the VFS, proper interaction of read-only open files with copy-up,
    possibility to implement fs modifying ioctls properly, and others.

    - Metadata only copy-up: when file is on lower layer and only
    metadata is modified (except size) then only copy up the metadata
    and continue to use the data from the lower file"

    * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
    ovl: Enable metadata only feature
    ovl: Do not do metacopy only for ioctl modifying file attr
    ovl: Do not do metadata only copy-up for truncate operation
    ovl: add helper to force data copy-up
    ovl: Check redirect on index as well
    ovl: Set redirect on upper inode when it is linked
    ovl: Set redirect on metacopy files upon rename
    ovl: Do not set dentry type ORIGIN for broken hardlinks
    ovl: Add an inode flag OVL_CONST_INO
    ovl: Treat metacopy dentries as type OVL_PATH_MERGE
    ovl: Check redirects for metacopy files
    ovl: Move some dir related ovl_lookup_single() code in else block
    ovl: Do not expose metacopy only dentry from d_real()
    ovl: Open file with data except for the case of fsync
    ovl: Add helper ovl_inode_realdata()
    ovl: Store lower data inode in ovl_inode
    ovl: Fix ovl_getattr() to get number of blocks from lower
    ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
    ovl: Copy up meta inode data from lowest data inode
    ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • …ux/kernel/git/viro/vfs

    Pull misc vfs updates from Al Viro:
    "Misc cleanups from various folks all over the place

    I expected more fs/dcache.c cleanups this cycle, so that went into a
    separate branch. Said cleanups have missed the window, so in
    hindsight it could've gone into work.misc instead. Decided not to
    cherry-pick, thus the 'work.dcache' branch"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: dcache: Use true and false for boolean values
    fold generic_readlink() into its only caller
    fs: shave 8 bytes off of struct inode
    fs: Add more kernel-doc to the produced documentation
    fs: Fix attr.c kernel-doc
    removed extra extern file_fdatawait_range

    * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    kill dentry_update_name_case()

    Linus Torvalds
     

20 Jul, 2018

1 commit


18 Jul, 2018

1 commit


12 Jul, 2018

17 commits


16 Jun, 2018

1 commit

  • Pull AFS updates from Al Viro:
    "Assorted AFS stuff - ended up in vfs.git since most of that consists
    of David's AFS-related followups to Christoph's procfs series"

    * 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    afs: Optimise callback breaking by not repeating volume lookup
    afs: Display manually added cells in dynamic root mount
    afs: Enable IPv6 DNS lookups
    afs: Show all of a server's addresses in /proc/fs/afs/servers
    afs: Handle CONFIG_PROC_FS=n
    proc: Make inline name size calculation automatic
    afs: Implement network namespacing
    afs: Mark afs_net::ws_cell as __rcu and set using rcu functions
    afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus()
    proc: Add a way to make network proc files writable
    afs: Rearrange fs/afs/proc.c to remove remaining predeclarations.
    afs: Rearrange fs/afs/proc.c to move the show routines up
    afs: Rearrange fs/afs/proc.c by moving fops and open functions down
    afs: Move /proc management functions to the end of the file

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • Alter the dynroot mount so that cells created by manipulation of
    /proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root
    cell as a module parameter will cause directories for those cells to be
    created in the dynamic root superblock for the network namespace[*].

    To this end:

    (1) Only one dynamic root superblock is now created per network namespace
    and this is shared between all attempts to mount it. This makes it
    easier to find the superblock to modify.

    (2) When a dynamic root superblock is created, the list of cells is walked
    and directories created for each cell already defined.

    (3) When a new cell is added, if a dynamic root superblock exists, a
    directory is created for it.

    (4) When a cell is destroyed, the directory is removed.

    (5) These directories are created by calling lookup_one_len() on the root
    dir which automatically creates them if they don't exist.

    [*] Inasmuch as network namespaces are currently supported here.

    Signed-off-by: David Howells

    David Howells
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a, b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().
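
    The point of the two-argument form is that the multiplication is
    checked for overflow before any memory is requested. A userspace
    analogue, assuming a GCC or Clang compiler for
    __builtin_mul_overflow(), might look as follows; the demo_* names are
    hypothetical and malloc() stands in for kmalloc().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Saturating three-factor product, in the spirit of array3_size():
 * returns SIZE_MAX if the product does not fit in size_t. */
size_t demo_array3_size(size_t a, size_t b, size_t c)
{
    size_t ab, abc;

    if (__builtin_mul_overflow(a, b, &ab) ||
        __builtin_mul_overflow(ab, c, &abc))
        return SIZE_MAX;
    return abc;
}

/* Overflow-checked two-factor allocation, in the spirit of
 * kmalloc_array(): fails instead of allocating a truncated size. */
void *demo_malloc_array(size_t n, size_t size)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

    With an open-coded n * size, an overflowing product wraps around and
    the allocation silently comes back too small; here the same inputs
    fail cleanly, or saturate to a size no allocator will satisfy.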

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

05 Jun, 2018

3 commits

  • Pull userns updates from Eric Biederman:
    "This is the last couple of vfs bits to enable root in a user namespace
    to mount and manipulate a filesystem with backing store (AKA not a
    virtual filesystem like proc, but a filesystem where the unprivileged
    user controls the content). The target filesystem for this work is
    fuse, and Miklos should be sending you the pull request for the fuse
    bits this merge window.

    The two key patches are "evm: Don't update hmacs in user ns mounts"
    and "vfs: Don't allow changing the link count of an inode with an
    invalid uid or gid". Those close small gaps in the vfs that would be a
    problem if an unprivileged fuse filesystem is mounted.

    The rest of the changes are things that are now safe to allow a root
    user in a user namespace to do with a filesystem they have mounted.
    The most interesting development is that remount is now safe"

    * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
    capabilities: Allow privileged user in s_user_ns to set security.* xattrs
    fs: Allow superblock owner to access do_remount_sb()
    fs: Allow superblock owner to replace invalid owners of inodes
    vfs: Allow userns root to call mknod on owned filesystems.
    vfs: Don't allow changing the link count of an inode with an invalid uid or gid
    evm: Don't update hmacs in user ns mounts

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Misc bits and pieces not fitting into anything more specific"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: delete unnecessary assignment in vfs_listxattr
    Documentation: filesystems: update filesystem locking documentation
    vfs: namei: use path_equal() in follow_dotdot()
    fs.h: fix outdated comment about file flags
    __inode_security_revalidate() never gets NULL opt_dentry
    make xattr_getsecurity() static
    vfat: simplify checks in vfat_lookup()
    get rid of dead code in d_find_alias()
    it's SB_BORN, not MS_BORN...
    msdos_rmdir(): kill BS comment
    remove rpc_rmdir()
    fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()

    Linus Torvalds
     
  • Pull rmdir update from Al Viro:
    "More shrink_dcache_parent()-related stuff - killing the main source of
    potentially contended calls of that on large subtrees"

    * 'work.rmdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    rmdir(),rename(): do shrink_dcache_parent() only on success

    Linus Torvalds
     

04 Jun, 2018

1 commit

  • This reverts commit cab64df194667dc5d9d786f0a895f647f5501c0d.

    Having vfs_open() in some cases drop the reference to
    struct file combined with

            error = vfs_open(path, f, cred);
            if (error) {
                    put_filp(f);
                    return ERR_PTR(error);
            }
            return f;

    is flat-out wrong. It used to be

            error = vfs_open(path, f, cred);
            if (!error) {
                    /* from now on we need fput() to dispose of f */
                    error = open_check_o_direct(f);
                    if (error) {
                            fput(f);
                            f = ERR_PTR(error);
                    }
            } else {
                    put_filp(f);
                    f = ERR_PTR(error);
            }

    and sure, having that open_check_o_direct() boilerplate gotten rid of is
    nice, but not that way...

    Worse, another call chain (via finish_open()) is FUBAR now wrt
    FILE_OPENED handling - in that case we get error returned, with file
    already hit by fput() *AND* FILE_OPENED not set. Guess what happens in
    path_openat(), when it hits

            if (!(opened & FILE_OPENED)) {
                    BUG_ON(!error);
                    put_filp(file);
            }

    The root cause of all that crap is that the callers of do_dentry_open()
    have no way to tell which way it failed; while that could be fixed up
    (by passing something like int *opened to do_dentry_open() and have it
    marked if we'd called ->open()), it's probably much too late in the
    cycle to do so right now.
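
    The fix-up mentioned above (reporting whether ->open() was reached so
    the caller can pick the right disposal) can be modelled in userspace.
    Everything here is a hypothetical sketch: struct demo_file and the
    demo_* functions only imitate the roles of put_filp(), fput() and
    do_dentry_open(), and the error codes are arbitrary.

```c
#include <errno.h>

struct demo_file {
    int refs;       /* outstanding references */
    int releases;   /* how often the post-open teardown ran */
};

/* Stand-in for put_filp(): drop a reference on a never-opened file. */
void demo_put_filp(struct demo_file *f)
{
    f->refs--;
}

/* Stand-in for fput(): drop a reference on an opened file, with teardown. */
void demo_fput(struct demo_file *f)
{
    f->releases++;
    f->refs--;
}

/* Hypothetical open step. It never drops the caller's reference; it only
 * reports through *opened whether the ->open() stage was reached. */
int demo_do_open(struct demo_file *f, int fail_early, int fail_late,
                 int *opened)
{
    (void)f;
    *opened = 0;
    if (fail_early)
        return -EACCES;     /* failed before ->open() */
    *opened = 1;
    if (fail_late)
        return -EINVAL;     /* failed after ->open() */
    return 0;
}

/* The caller disposes of the file exactly once, matching its state. */
int demo_open_and_cleanup(struct demo_file *f, int fail_early, int fail_late)
{
    int opened;
    int err = demo_do_open(f, fail_early, fail_late, &opened);

    if (err) {
        if (opened)
            demo_fput(f);
        else
            demo_put_filp(f);
    }
    return err;
}
```

    Because only the caller drops the reference, a failure on either
    side of ->open() ends with the count at zero and the teardown run at
    most once, which is the invariant the reverted commit broke.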

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

28 May, 2018

1 commit

  • Once upon a time ->rmdir() instances used to check if victim inode
    had more than one (in-core) reference and failed with -EBUSY if it
    had. The reason was race avoidance - emptiness check is worthless
    if somebody could just go and create new objects in the victim
    directory afterwards.

    With introduction of dcache the checks had been replaced with
    checking the refcount of dentry. However, since a cached negative
    lookup leaves a negative child dentry, such a check had led to false
    positives - with empty foo/ doing stat foo/bar before rmdir foo
    ended up with -EBUSY unless the negative dentry of foo/bar happened
    to be evicted by the time of rmdir(2). That had been fixed by
    doing shrink_dcache_parent() just before the refcount check.
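
    The sequence that used to trigger the false positive is easy to
    reproduce from userspace. The sketch below assumes a writable /tmp;
    the function name and the directory template are made up for the
    example. It creates an empty directory, looks up a missing child and
    then removes the directory, which is expected to succeed on kernels
    that contain the fix described here.

```c
#define _POSIX_C_SOURCE 200809L   /* for mkdtemp() */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if rmdir() succeeds after a failed lookup inside the
 * directory, or a negative errno value otherwise. */
int demo_rmdir_after_negative_lookup(void)
{
    char dir[] = "/tmp/rmdir-demo-XXXXXX";
    char child[sizeof(dir) + sizeof("/bar")];
    struct stat st;

    if (mkdtemp(dir) == NULL)
        return -errno;

    /* Look up a name that does not exist; the kernel may cache this
     * miss as a negative dentry under the new directory. */
    snprintf(child, sizeof(child), "%s/bar", dir);
    if (stat(child, &st) == 0) {
        rmdir(dir);
        return -EEXIST;
    }

    if (rmdir(dir) != 0)
        return -errno;
    return 0;
}
```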

    At the same time, ext2_rmdir() has grown a private solution that
    eliminated those -EBUSY - it did something (setting ->i_size to 0)
    which made any subsequent ext2_add_entry() fail.

    Unfortunately, even with shrink_dcache_parent() the check had been
    racy - after all, the victim itself could be found by dcache lookup
    just after we'd checked its refcount. That got fixed by a new
    helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
    the sucker if its refcount ended up equal to 1. That got called before
    ->rmdir(), turning the checks in ->rmdir() instances into "if not
    unhashed fail with -EBUSY". Which reduced the boilerplate nicely, but
    had an unpleasant side effect - now shrink_dcache_parent() had been
    done before the emptiness checks, leading to easily triggerable calls
    of shrink_dcache_parent() on arbitrarily large subtrees, quite possibly
    nested into each other.

    Several years later the ext2-private trick had been generalized -
    (in-core) inodes of dead directories are flagged and calls of
    lookup, readdir and all directory-modifying methods were prevented
    in so marked directories. Remaining boilerplate in ->rmdir() instances
    became redundant and some instances got rid of it.

    In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
    and then killed off in all of them. That has led to another problem,
    though - in case of successful rmdir we *want* any (negative) child
    dentries dropped and the victim itself made negative. There's no point
    keeping cached negative lookups in foo when we can get the negative
    lookup of foo itself cached. So shrink_dcache_parent() call had been
    restored; unfortunately, it went into the place where dentry_unhash()
    used to be, i.e. before the ->rmdir() call. Note that we don't unhash
    anymore, so any "is it busy" checks would be racy; fortunately, all of
    them are gone.

    We should've done that call right *after* successful ->rmdir(). That
    reduces contention caused by tree-walking in shrink_dcache_parent()
    and, especially, contention caused by evictions in two nested subtrees
    going on in parallel. The same goes for directory-overwriting rename() -
    the story there had been parallel to that of rmdir().

    Signed-off-by: Al Viro

    Al Viro
     

25 May, 2018

2 commits


18 May, 2018

1 commit


10 Apr, 2018

1 commit

  • Pull vfs namei updates from Al Viro:

    - make lookup_one_len() safe with parent locked only shared (incoming
    afs series wants that)

    - fix of getname_kernel() regression from 2015 (-stable fodder, that
    one).

    * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    getname_kernel() needs to make sure that ->name != ->iname in long case
    make lookup_one_len() safe to use with directory locked shared
    new helper: __lookup_slow()
    merge common parts of lookup_one_len{,_unlocked} into common helper

    Linus Torvalds
     

08 Apr, 2018

1 commit


07 Apr, 2018

4 commits

  • Pull audit updates from Paul Moore:
    "We didn't have anything to send for v4.16, but we're back with a
    little more than usual for v4.17.

    Eleven patches in total, most fall into the small fix category, but
    there are four non-trivial changes worth calling out:

    - the audit entry filter is being removed after deprecating it for
    quite a while (years of no one really using it because it turns out
    to be not very practical)

    - created our own version of "__mutex_owner()" because the locking
    folks were upset we were using theirs

    - improved our handling of kernel command line parameters to make
    them more forgiving

    - we fixed auditing of symlink operations

    Everything passes the audit-testsuite and as of a few minutes ago it
    merges well with your tree"

    * tag 'audit-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: add refused symlink to audit_names
    audit: remove path param from link denied function
    audit: link denied should not directly generate PATH record
    audit: make ANOM_LINK obey audit_enabled and audit_dummy_context
    audit: do not panic on invalid boot parameter
    audit: track the owner of the command mutex ourselves
    audit: return on memory error to avoid null pointer dereference
    audit: bail before bug check if audit disabled
    audit: deprecate the AUDIT_FILTER_ENTRY filter
    audit: session ID should not set arch quick field pointer
    audit: update bugtracker and source URIs

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • lookup_slow() sans locking/unlocking the directory

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro