02 Feb, 2020

1 commit

  • Brown paperbag time: fetching ->i_uid/->i_mode really should've been
    done from nd->inode. I even suggested that, but the reason for that has
    slipped through the cracks and I went for dir->d_inode instead - made
    for more "obvious" patch.

    Analysis:

    - at the entry into do_last() and all the way to step_into(): dir (aka
    nd->path.dentry) is known not to have been freed; so's nd->inode and
    it's equal to dir->d_inode unless we are already doomed to -ECHILD.
    inode of the file to get opened is not known.

    - after step_into(): inode of the file to get opened is known; dir
    might be pointing to freed memory/be negative/etc.

    - at the call of may_create_in_sticky(): guaranteed to be out of RCU
    mode; inode of the file to get opened is known and pinned; dir might
    be garbage.

    The last was the reason for the original patch. Except that at the
    do_last() entry we can be in RCU mode and it is possible that
    nd->path.dentry->d_inode has already changed under us.

    In that case we are going to fail with -ECHILD, but we need to be
    careful; nd->inode is pointing to valid struct inode and it's the same
    as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we
    should use that.

    Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)"
    Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
    Wearing-brown-paperbag: Al Viro
    Cc: stable@kernel.org
    Fixes: d0cb50185ae9 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

30 Jan, 2020

1 commit

  • Pull openat2 support from Al Viro:
    "This is the openat2() series from Aleksa Sarai.

    I'm afraid that the rest of namei stuff will have to wait - it got
    zero review the last time I'd posted #work.namei, and there had been a
    leak in the posted series I'd caught only last weekend. I was going to
    repost it on Monday, but the window opened and the odds of getting any
    review during that... Oh, well.

    Anyway, openat2 part should be ready; that _did_ get sane amount of
    review and public testing, so here it comes"

    From Aleksa's description of the series:
    "For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown
    flags are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road
    to being added to openat(2).
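
    The O_TMPFILE trick above can be checked from userspace. This is a
    minimal sketch, assuming a Linux libc that exposes O_TMPFILE; the
    helper name old_kernel_view() and the fallback value are illustrative
    assumptions, not part of the series.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>

/* Fallback for headers that hide O_TMPFILE (assumption: generic-arch value). */
#ifndef O_TMPFILE
#define O_TMPFILE (020000000 | O_DIRECTORY)
#endif

/*
 * Model what a kernel that predates O_TMPFILE effectively acts on: the
 * tmpfile-only bit is silently ignored, but the O_DIRECTORY bit that is
 * folded into O_TMPFILE is still understood.
 */
static int old_kernel_view(int flags)
{
    int tmpfile_only_bit = O_TMPFILE & ~O_DIRECTORY;

    return flags & ~tmpfile_only_bit;
}
```

    An old kernel therefore sees O_DIRECTORY plus a write access mode,
    which cannot succeed on a directory, so the open fails instead of
    silently creating or opening the wrong thing.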

    Furthermore, the need for some sort of control over VFS's path
    resolution (to avoid malicious paths resulting in inadvertent
    breakouts) has been a very long-standing desire of many userspace
    applications.

    This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
    (which was a variant of David Drysdale's O_BENEATH patchset[4] which
    was a spin-off of the Capsicum project[5]) with a few additions and
    changes made based on the previous discussion within [6] as well as
    others I felt were useful.

    In line with the conclusions of the original discussion of
    AT_NO_JUMPS, the flag has been split up into separate flags. However,
    instead of being an openat(2) flag it is provided through a new
    syscall openat2(2) which provides several other improvements to the
    openat(2) interface (see the patch description for more details). The
    following new LOOKUP_* flags are added:

    LOOKUP_NO_XDEV:

    Blocks all mountpoint crossings (upwards, downwards, or through
    absolute links). Absolute pathnames alone in openat(2) do not
    trigger this. Magic-link traversal which implies a vfsmount jump is
    also blocked (though magic-link jumps on the same vfsmount are
    permitted).

    LOOKUP_NO_MAGICLINKS:

    Blocks resolution through /proc/$pid/fd-style links. This is done
    by blocking the usage of nd_jump_link() during resolution in a
    filesystem. The term "magic-links" is used to match with the only
    reference to these links in Documentation/, but I'm happy to change
    the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
    will *not* fail (assuming that no parent component was a
    magic-link), and you will have an fd for the magic-link.

    In order to correctly detect magic-links, the introduction of a new
    LOOKUP_MAGICLINK_JUMPED state flag was required.

    LOOKUP_BENEATH:

    Disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed.

    Conceptually this flag is to ensure you "stay below" a certain
    point in the filesystem tree -- but this requires some additional
    work to protect against various races that would allow escape using
    "..".

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
    can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks done
    as in LOOKUP_IN_ROOT, but that requires more discussion.

    In addition, two new flags are added that expand on the above ideas:

    LOOKUP_NO_SYMLINKS:

    Does what it says on the tin. No symlink resolution is allowed at
    all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
    can still be used with NOFOLLOW to open an fd for the symlink as
    long as no parent path had a symlink component.

    LOOKUP_IN_ROOT:

    This is an extension of LOOKUP_BENEATH that, rather than blocking
    attempts to move past the root, forces all such movements to be
    scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that
    chroot(2) is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to
    cross magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[7] when opening
    paths in a potentially malicious container.

    There is a long list of CVEs that could have been mitigated by
    having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
    CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
    few).

    In order to make all of the above more usable, I'm working on
    libpathrs[8] which is a C-friendly library for safe path resolution.
    It features a userspace-emulated backend if the kernel doesn't support
    openat2(2). Hopefully we can get userspace to switch to using it, and
    thus get openat2(2) support for free once it's ready.

    Future work would include implementing things like
    RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
    programs to be sure they don't hit DoSes through stale NFS handles)"
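
    For orientation, a userspace call might look roughly like the sketch
    below. It assumes the interface as it was eventually merged (a
    three-field struct open_how and RESOLVE_* bits corresponding to the
    LOOKUP_* flags above). The sketch_ names are local stand-ins so the
    example does not depend on <linux/openat2.h>, and the fallback
    syscall number 437 is an assumption that holds on most architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the uapi struct open_how (hypothetical name). */
struct sketch_open_how {
    uint64_t flags;   /* O_* flags */
    uint64_t mode;    /* only meaningful with O_CREAT / O_TMPFILE */
    uint64_t resolve; /* RESOLVE_* flags */
};

/* Local mirrors of the RESOLVE_* values (hypothetical names). */
#define SKETCH_RESOLVE_NO_XDEV       0x01
#define SKETCH_RESOLVE_NO_MAGICLINKS 0x02
#define SKETCH_RESOLVE_NO_SYMLINKS   0x04
#define SKETCH_RESOLVE_BENEATH       0x08
#define SKETCH_RESOLVE_IN_ROOT       0x10

#ifndef SYS_openat2
#define SYS_openat2 437 /* assumption: number used by most architectures */
#endif

/* Thin wrapper; returns an fd, or -1 with errno set (ENOSYS on old kernels). */
static int sketch_openat2(int dirfd, const char *path,
                          uint64_t flags, uint64_t resolve)
{
    struct sketch_open_how how;

    memset(&how, 0, sizeof(how));
    how.flags = flags;
    how.resolve = resolve;
    return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

    A caller that wants containment would pass SKETCH_RESOLVE_BENEATH or
    SKETCH_RESOLVE_IN_ROOT and treat any negative return as "not opened".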

    * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Documentation: path-lookup: include new LOOKUP flags
    selftests: add openat2(2) selftests
    open: introduce openat2(2) syscall
    namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
    namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
    namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
    namei: LOOKUP_NO_XDEV: block mountpoint crossing
    namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
    namei: LOOKUP_NO_SYMLINKS: block symlink resolution
    namei: allow set_root() to produce errors
    namei: allow nd_jump_link() to produce errors
    nsfs: clean-up ns_get_path() signature to return int
    namei: only return -ECHILD from follow_dotdot_rcu()

    Linus Torvalds
     

26 Jan, 2020

1 commit


15 Jan, 2020

2 commits

  • we need to reload ->d_flags after the call of ->d_manage() - the thing
    might've been called with dentry still negative and have the damn thing
    turned positive while we'd waited.

    Fixes: d41efb522e90 "fs/namei.c: pull positivity check into follow_managed()"
    Reported-by: Ian Kent
    Tested-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • ... and get rid of a bunch of bugs in it. Background:
    the reason for path_mountpoint() is that umount() really doesn't
    want attempts to revalidate the root of what it's trying to umount.
    The thing we want to avoid actually happens from complete_walk();
    solution was to do something parallel to normal path_lookupat()
    and it both went overboard and got the boilerplate subtly
    (and not so subtly) wrong.

    A better solution is to do pretty much what the normal path_lookupat()
    does, but instead of complete_walk() do unlazy_walk(). All it takes
    to avoid that ->d_weak_revalidate() call... mountpoint_last() goes
    away, along with everything it got wrong, and so does the magic around
    LOOKUP_NO_REVAL.

    Another source of bugs is that when we traverse mounts at the final
    location (and we need to do that - umount . expects to get whatever's
    overmounting ., if any, out of the lookup) we really ought to take
    care of ->d_manage() - as it is, manual umount of autofs automount
    in progress can lead to unpleasant surprises for the daemon. Easily
    solved by using handle_lookup_down() instead of follow_mount().

    Tested-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     

09 Dec, 2019

9 commits

  • Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
    (in the case of LOOKUP_BENEATH the resolution will still fail if ".."
    resolution would resolve a path outside of the root -- while
    LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
    still disallowed entirely[*].
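
    The difference between the two flags can be illustrated with a purely
    lexical model. This is only a sketch: it ignores symlinks, mounts and
    the races discussed below, and the names scope_mode and scoped_depth()
    are invented for the illustration.

```c
#include <errno.h>
#include <string.h>

enum scope_mode { SCOPE_BENEATH, SCOPE_IN_ROOT };

/*
 * Walk `path` component by component and return how deep below the
 * starting point the walk ends (0 == the starting point itself), or
 * -EXDEV if SCOPE_BENEATH would have to leave the starting point.
 */
static int scoped_depth(const char *path, enum scope_mode mode)
{
    const char *p = path;
    int depth = 0;

    /* Absolute paths: refused by BENEATH, re-rooted at the start by IN_ROOT. */
    if (*p == '/' && mode == SCOPE_BENEATH)
        return -EXDEV;

    while (*p) {
        size_t len;

        while (*p == '/')
            p++;
        len = strcspn(p, "/");
        if (len == 0)
            break;

        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth > 0)
                depth--;
            else if (mode == SCOPE_BENEATH)
                return -EXDEV; /* would escape the starting point */
            /* IN_ROOT: ".." at the root stays at the root */
        } else if (!(len == 1 && p[0] == '.')) {
            depth++;
        }
        p += len;
    }
    return depth;
}
```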

    As Jann explains[1,2], the need for this patch (and the original no-".."
    restriction) is explained by observing there is a fairly easy-to-exploit
    race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
    LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
    used to "skip over" nd->root and thus escape to the filesystem above
    nd->root.

    thread1 [attacker]:
      for (;;)
        renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
    thread2 [victim]:
      for (;;)
        openat2(dirb, "b/c/../../etc/shadow",
                { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

    With fairly significant regularity, thread2 will resolve to
    "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
    (though somewhat more privileged) attack using MS_MOVE.

    With this patch, such cases will be detected *during* ".." resolution
    and will return -EAGAIN for userspace to decide to either retry or abort
    the lookup. It should be noted that ".." is the weak point of chroot(2)
    -- walking *into* a subdirectory tautologically cannot result in you
    walking *outside* nd->root (except through a bind-mount or magic-link).
    There is also no other way for a directory's parent to change (which is
    the primary worry with ".." resolution here) other than a rename or
    MS_MOVE.

    The primary reason for deferring to userspace with -EAGAIN is that an
    in-kernel retry loop (or doing a path_is_under() check after re-taking
    the relevant seqlocks) can become unreasonably expensive on machines
    with lots of VFS activity (nfsd can cause lots of rename_lock updates).
    Thus it should be up to userspace how many times they wish to retry the
    lookup -- the selftests for this attack indicate that there is a ~35%
    chance of the lookup succeeding on the first try even with an attacker
    thrashing rename_lock.
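
    Since the retry policy is left to userspace, a caller might wrap the
    lookup in a bounded loop along these lines. This is a sketch under
    assumptions: retry_on_eagain() and the lookup_fn callback type are
    invented here, and flaky_lookup() merely stands in for a real
    openat2() call so the example runs on any kernel.

```c
#include <errno.h>

/* A lookup attempt: returns an fd, or -1 with errno set. */
typedef int (*lookup_fn)(void *ctx);

/* Retry only on EAGAIN, and only up to max_tries attempts. */
static int retry_on_eagain(lookup_fn fn, void *ctx, int max_tries)
{
    int fd = -1;
    int i;

    if (max_tries <= 0) {
        errno = EINVAL;
        return -1;
    }
    for (i = 0; i < max_tries; i++) {
        fd = fn(ctx);
        if (fd >= 0 || errno != EAGAIN)
            return fd; /* success, or an error retrying cannot fix */
    }
    return fd; /* still EAGAIN: the caller chooses to abort */
}

/* Stand-in for openat2(): fails with EAGAIN *ctx times, then "succeeds". */
static int flaky_lookup(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        errno = EAGAIN;
        return -1;
    }
    return 3; /* pretend file descriptor */
}
```

    Keeping the bound in the caller means a busy machine degrades to a
    visible EAGAIN rather than an unbounded spin.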

    A variant of the above attack is included in the selftests for
    openat2(2) later in this patch series. I've run this test on several
    machines for several days and no instances of a breakout were detected.
    While this is not concrete proof that this is safe, when combined with
    the above argument it should lend some trustworthiness to this
    construction.

    [*] It may be acceptable in the future to do a path_is_under() check for
    magic-links after they are resolved. However this seems unlikely to
    be a feature that people *really* need -- it can be added later if
    it turns out a lot of people want it.

    [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
    [2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/

    Cc: Christian Brauner
    Suggested-by: Jann Horn
    Suggested-by: Linus Torvalds
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • /* Background. */
    Container runtimes or other administrative management processes will
    often interact with root filesystems while in the host mount namespace,
    because the cost of doing a chroot(2) on every operation is too
    prohibitive (especially in Go, which cannot safely use vfork). However,
    a malicious program can trick the management process into doing
    operations on files outside of the root filesystem through careful
    crafting of symlinks.

    Most programs that need this feature have attempted to make this process
    safe, by doing all of the path resolution in userspace (with symlinks
    being scoped to the root of the malicious root filesystem).
    Unfortunately, this method is prone to foot-guns and usually such
    implementations have subtle security bugs.

    Thus, what userspace needs is a way to resolve a path as though it were
    in a chroot(2) -- with all absolute symlinks being resolved relative to
    the dirfd root (and ".." components being stuck under the dirfd root).
    It is much simpler and more straight-forward to provide this
    functionality in-kernel (because it can be done far more cheaply and
    correctly).

    More classical applications that also have this problem (which have
    their own potentially buggy userspace path sanitisation code) include
    web servers, archive extraction tools, network file servers, and so on.

    /* Userspace API. */
    LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).

    /* Semantics. */
    Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
    LOOKUP_IN_ROOT applies to all components of the path.

    With LOOKUP_IN_ROOT, any path component which attempts to cross the
    starting point of the pathname lookup (the dirfd passed to openat) will
    remain at the starting point. Thus, all absolute paths and symlinks will
    be scoped within the starting point.

    There is a slight change in behaviour regarding pathnames -- if the
    pathname is absolute then the dirfd is still used as the root of
    resolution if LOOKUP_IN_ROOT is specified (this is to avoid obvious
    foot-guns, at the cost of a minor API inconsistency).

    As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
    LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
    will be lifted in a future patch, but requires more work to ensure that
    permitting ".." is done safely.

    Magic-link jumps are also blocked, because they can beam the path lookup
    across the starting point. It would be possible to detect and block
    only the "bad" crossings with path_is_under() checks, but it's unclear
    whether it makes sense to permit magic-links at all. However, userspace
    is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
    magic-link crossing is entirely disabled.

    /* Testing. */
    LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.

    [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/

    Cc: Christian Brauner
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • /* Background. */
    There are many circumstances when userspace wants to resolve a path and
    ensure that it doesn't go outside of a particular root directory during
    resolution. Obvious examples include archive extraction tools, as well as
    other security-conscious userspace programs. FreeBSD spun out O_BENEATH
    from their Capsicum project[1,2], so it also seems reasonable to
    implement similar functionality for Linux.

    This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
    variation on David Drysdale's O_BENEATH patchset[4], which in turn was
    based on the Capsicum project[5]).

    /* Userspace API. */
    LOOKUP_BENEATH will be exposed to userspace through openat2(2).

    /* Semantics. */
    Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
    LOOKUP_BENEATH applies to all components of the path.

    With LOOKUP_BENEATH, any path component which attempts to "escape" the
    starting point of the filesystem lookup (the dirfd passed to openat)
    will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.

    Due to a security concern brought up by Jann[6], any ".." path
    components are also blocked. This restriction will be lifted in a future
    patch, but requires more work to ensure that permitting ".." is done
    safely.

    Magic-link jumps are also blocked, because they can beam the path lookup
    across the starting point. It would be possible to detect and block
    only the "bad" crossings with path_is_under() checks, but it's unclear
    whether it makes sense to permit magic-links at all. However, userspace
    is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
    magic-link crossing is entirely disabled.

    /* Testing. */
    LOOKUP_BENEATH is tested as part of the openat2(2) selftests.

    [1]: https://reviews.freebsd.org/D2808
    [2]: https://reviews.freebsd.org/D17547
    [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
    [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
    [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
    [6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/

    Cc: Christian Brauner
    Suggested-by: David Drysdale
    Suggested-by: Al Viro
    Suggested-by: Andy Lutomirski
    Suggested-by: Linus Torvalds
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • /* Background. */
    The need to contain path operations within a mountpoint has been a
    long-standing usecase that userspace has historically implemented
    manually with liberal usage of stat(). find, rsync, tar and
    many other programs implement these semantics -- but it'd be much
    simpler to have a fool-proof way of refusing to open a path if it
    crosses a mountpoint.

    This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
    variation on David Drysdale's O_BENEATH patchset[2], which in turn was
    based on the Capsicum project[3]).

    /* Userspace API. */
    LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).

    /* Semantics. */
    Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
    LOOKUP_NO_XDEV applies to all components of the path.

    With LOOKUP_NO_XDEV, any path component which crosses a mount-point
    during path resolution (including "..") will yield an -EXDEV. Absolute
    paths, absolute symlinks, and magic-links will only yield an -EXDEV if
    the jump involved changing mount-points.

    /* Testing. */
    LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.

    [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
    [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
    [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

    Cc: Christian Brauner
    Suggested-by: David Drysdale
    Suggested-by: Al Viro
    Suggested-by: Andy Lutomirski
    Suggested-by: Linus Torvalds
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • /* Background. */
    There has always been a special class of symlink-like objects in procfs
    (and a few other pseudo-filesystems) which allow for non-lexical
    resolution of paths using nd_jump_link(). These "magic-links" do not
    follow traditional mount namespace boundaries, and have been used
    consistently in container escape attacks because they can be used to
    trick unsuspecting privileged processes into resolving unexpected paths.

    It is also non-trivial for userspace to unambiguously avoid resolving
    magic-links, because they do not have a reliable indication that they
    are a magic-link (in order to verify them you'd have to manually open
    the path given by readlink(2) and then verify that the two file
    descriptors reference the same underlying file, which is plagued with
    possible race conditions or supplementary attack scenarios).
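
    The manual verification described above might be sketched as follows.
    The names same_file() and link_is_lexical() are invented for this
    example, relative link targets are simply not handled, and the check
    is racy by construction, which is exactly the problem being
    described.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <limits.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic-arch value */
#endif

/* 1 if both descriptors refer to the same file, 0 if not, -1 on error. */
static int same_file(int fd1, int fd2)
{
    struct stat a, b;

    if (fstat(fd1, &a) < 0 || fstat(fd2, &b) < 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/*
 * 1 if following `link` and opening its readlink(2) text reach the same
 * file, 0 if they do not (which hints at a magic-link), -1 if `link` is
 * not a symlink, has a relative target, or cannot be examined.
 */
static int link_is_lexical(const char *link)
{
    char target[PATH_MAX];
    ssize_t n = readlink(link, target, sizeof(target) - 1);
    int via_link, via_text, ret;

    if (n < 0)
        return -1;
    target[n] = '\0';
    if (target[0] != '/')
        return -1; /* relative targets are out of scope for this sketch */

    via_link = open(link, O_PATH | O_CLOEXEC);
    if (via_link < 0)
        return -1;
    via_text = open(target, O_PATH | O_CLOEXEC);
    if (via_text < 0) {
        close(via_link);
        return 0; /* the text does not name what the link reached */
    }
    ret = same_file(via_link, via_text);
    close(via_text);
    close(via_link);
    return ret;
}
```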

    It would therefore be very helpful for userspace to be able to avoid
    these symlinks easily, thus hopefully removing a tool from attackers'
    toolboxes.

    This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
    variation on David Drysdale's O_BENEATH patchset[2], which in turn was
    based on the Capsicum project[3]).

    /* Userspace API. */
    LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).

    /* Semantics. */
    Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
    LOOKUP_NO_MAGICLINKS applies to all components of the path.

    With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
    during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
    for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.

    LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.

    /* Testing. */
    LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.

    [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
    [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
    [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

    Cc: Christian Brauner
    Suggested-by: David Drysdale
    Suggested-by: Al Viro
    Suggested-by: Andy Lutomirski
    Suggested-by: Linus Torvalds
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • /* Background. */
    Userspace cannot easily resolve a path without resolving symlinks, and
    would have to manually resolve each path component with O_PATH and
    O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
    up (resulting in possible security bugs). Linus has mentioned that Git
    has a particular need for this kind of flag[1]. It also resolves a
    fairly long-standing perceived deficiency in O_NOFOLLOW -- that it only
    blocks the opening of trailing symlinks.
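
    The manual per-component resolution mentioned above could look
    something like this sketch. walk_no_symlinks() is a hypothetical
    helper; it treats a leading '/' as relative to dirfd, does nothing
    special for "..", and (unlike the flag described below) also rejects
    a trailing symlink.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic-arch value */
#endif

/* Returns an O_PATH fd for relpath, or -1 (errno ELOOP on any symlink). */
static int walk_no_symlinks(int dirfd, const char *relpath)
{
    const char *p = relpath;
    int cur = openat(dirfd, ".", O_PATH | O_DIRECTORY | O_CLOEXEC);

    if (cur < 0)
        return -1;
    while (*p) {
        char comp[NAME_MAX + 1];
        struct stat st;
        size_t len;
        int next, saved;

        while (*p == '/')
            p++;
        len = strcspn(p, "/");
        if (len == 0)
            break;
        if (len > NAME_MAX) {
            close(cur);
            errno = ENAMETOOLONG;
            return -1;
        }
        memcpy(comp, p, len);
        comp[len] = '\0';
        p += len;

        /* O_NOFOLLOW with O_PATH yields an fd for a symlink itself. */
        next = openat(cur, comp, O_PATH | O_NOFOLLOW | O_CLOEXEC);
        saved = errno;
        close(cur);
        if (next < 0) {
            errno = saved;
            return -1;
        }
        if (fstat(next, &st) < 0) {
            saved = errno;
            close(next);
            errno = saved;
            return -1;
        }
        if (S_ISLNK(st.st_mode)) {
            close(next);
            errno = ELOOP;
            return -1;
        }
        cur = next;
    }
    return cur;
}
```

    One openat() plus one fstat() per component is the inefficiency the
    text refers to; a single in-kernel flag avoids both the cost and the
    room for mistakes.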

    This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
    variation on David Drysdale's O_BENEATH patchset[3], which in turn was
    based on the Capsicum project[4]).

    /* Userspace API. */
    LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).

    /* Semantics. */
    Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
    LOOKUP_NO_SYMLINKS applies to all components of the path.

    With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
    path resolution will yield -ELOOP. If the trailing component is a
    symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
    will not error out and will instead provide a handle to the trailing
    symlink -- without resolving it.

    /* Testing. */
    LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.

    [1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
    [2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
    [3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
    [4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

    Cc: Christian Brauner
    Suggested-by: Al Viro
    Suggested-by: Linus Torvalds
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • For LOOKUP_BENEATH and LOOKUP_IN_ROOT it is necessary to ensure that
    set_root() is never called, and thus (for hardening purposes) it should
    return an error rather than permit a breakout from the root. In
    addition, move all of the repetitive set_root() calls to nd_jump_root().

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
    ability for nd_jump_link() to return an error which the corresponding
    get_link() caller must propagate back up to the VFS.

    Suggested-by: Al Viro
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     
  • It's over-zealous to return hard errors under RCU-walk here, given that
    a REF-walk will be triggered for all other cases handling ".." under
    RCU.

    The original purpose of this check was to ensure that if a rename occurs
    such that a directory is moved outside of the bind-mount which the
    resolution started in, it would be detected and blocked to avoid being
    able to mess with paths outside of the bind-mount. However, triggering a
    new REF-walk is just as effective a solution.

    Cc: "Eric W. Biederman"
    Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
    Suggested-by: Al Viro
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     

07 Dec, 2019

1 commit

  • Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
    "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
    Basically, the problem is that negative pinned dentries require
    careful treatment - unless ->d_lock is locked or parent is held at
    least shared, another thread can make them positive right under us.

    Most of the uses turned out to be safe - the main surprises as far as
    filesystems are concerned were

    - race in dget_parent() fastpath, that might end up with the caller
    observing the returned dentry _negative_, due to insufficient
    barriers. It is positive in memory, but we could end up seeing the
    wrong value of ->d_inode in CPU cache. Fixed.

    - manual checks that result of lookup_one_len_unlocked() is positive
    (and rejection of negatives). Again, insufficient barriers (we
    might end up with inconsistent observed values of ->d_inode and
    ->d_flags). Fixed by switching to a new primitive that does the
    checks itself and returns ERR_PTR(-ENOENT) instead of a negative
    dentry. That way we get rid of boilerplate converting negatives
    into ERR_PTR(-ENOENT) in the callers and have a single place to
    deal with the barrier-related mess - inside fs/namei.c rather than
    in every caller out there.

    The guts of pathname resolution *do* need to be careful - the race
    found by Ritesh is real, as well as several similar races.
    Fortunately, it turns out that we can take care of that with fairly
    local changes in there.

    The tree-wide audit had not been fun, and I hate the idea of repeating
    it. I think the right approach would be to annotate the places where
    we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
    catch regressions. But I'm still not sure what would be the least
    invasive way of doing that and it's clearly the next cycle fodder"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs/namei.c: fix missing barriers when checking positivity
    fix dget_parent() fastpath race
    new helper: lookup_positive_unlocked()
    fs/namei.c: pull positivity check into follow_managed()

    Linus Torvalds
     

16 Nov, 2019

3 commits

  • Pinned negative dentries can, generally, be made positive
    by another thread. Conditions that prevent that are
    * ->d_lock on dentry in question
    * parent directory held at least shared
    * nobody else could have observed the address of dentry
    Most of the places working with those fall into one of those
    categories; however, d_lookup() and friends need to be used
    with some care. Fortunately, there's not a lot of call sites,
    and with few exceptions all of those fall under one of the
    cases above.

    Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
    and mountpoint_last(). Another one is lookup_slow() - there
    dcache lookup is done with parent held shared, but the result
    is used after we'd drop the lock. The same happens in do_last() -
    the lookup (in lookup_one()) is done with parent locked, but
    result is used after unlocking.

    lookup_fast(), do_last() and mountpoint_last() flat-out reject
    negatives.

    Most of lookup_dcache() calls are made with parent locked at least
    shared; the only exception is lookup_one_len_unlocked(). It might
    return pinned negative, needs serious care from callers. Fortunately,
    almost nobody calls it directly anymore; all but two callers have
    converted to lookup_positive_unlocked(), which rejects negatives.

    lookup_slow() is called by the same lookup_one_len_unlocked() (see
    above), mountpoint_last() and walk_component(). In those two negatives
    are rejected.

    In other words, there is a small set of places where we need to
    check carefully if a pinned potentially negative dentry is, in
    fact, positive. After that check we want to be sure that both
    ->d_inode and type bits in ->d_flags are stable and observed.
    The set consists of follow_managed() (where the rejection happens
    for lookup_fast(), walk_component() and do_last()), mountpoint_last()
    and lookup_positive_unlocked().

    Solution:
    1) transition from negative to positive (in __d_set_inode_and_type())
    stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
    2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
    smp_load_acquire() and bugger off if its type bits say "negative".
    That way anyone downstream of those checks has dentry known positive pinned,
    with ->d_inode and type bits of ->d_flags stable and observed.
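
    The ordering in the two steps above can be modelled in userspace with
    C11 atomics standing in for smp_store_release()/smp_load_acquire().
    This is a toy analogue only: the toy_ structures, the flag values and
    the function names are all invented for the illustration and are not
    the kernel's definitions.

```c
#include <stdatomic.h>
#include <stddef.h>

#define TOY_TYPE_MASK    0x7u /* hypothetical "type bits" */
#define TOY_MISS_TYPE    0x0u /* negative dentry */
#define TOY_REGULAR_TYPE 0x4u /* some positive type */

struct toy_inode { int ino; };

struct toy_dentry {
    struct toy_inode *inode;    /* plain field, published via flags */
    _Atomic unsigned int flags; /* carries the type bits */
};

static void toy_dentry_init(struct toy_dentry *d)
{
    d->inode = NULL;
    atomic_init(&d->flags, TOY_MISS_TYPE);
}

/* Writer: store the inode first, then release-store the type bits. */
static void toy_make_positive(struct toy_dentry *d, struct toy_inode *inode,
                              unsigned int type)
{
    unsigned int flags = atomic_load_explicit(&d->flags, memory_order_relaxed);

    d->inode = inode;
    flags = (flags & ~TOY_TYPE_MASK) | type;
    atomic_store_explicit(&d->flags, flags, memory_order_release);
}

/* Reader: acquire-load the flags; only a positive result may read inode. */
static struct toy_inode *toy_inode_if_positive(struct toy_dentry *d)
{
    unsigned int flags = atomic_load_explicit(&d->flags, memory_order_acquire);

    if ((flags & TOY_TYPE_MASK) == TOY_MISS_TYPE)
        return NULL; /* negative: reject */
    return d->inode; /* ordered after the writer's inode store */
}
```

    A reader that observes positive type bits is guaranteed to also
    observe the inode store that preceded the release.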

    I considered splitting off d_lookup_positive(), so that the checks could
    be done right there, under ->d_lock. However, that leads to massive
    duplication of rather subtle code in fs/namei.c and fs/dcache.c. It's
    worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
    No matter what, autofs_d_manage()/autofs_d_automount() must live with
    the possibility of pinned negative dentry passed their way, becoming
    positive under them - that's the intended behaviour when lookup comes
    in the middle of automount in progress, so we can't keep them out of
    the area that has to deal with those, more's the pity...

    Reported-by: Ritesh Harjani
    Signed-off-by: Al Viro

    Al Viro
     
  • Most of the callers of lookup_one_len_unlocked() treat negatives as
    ERR_PTR(-ENOENT). Provide a helper that would do just that. Note
    that a pinned positive dentry remains positive - its ->d_inode is
    stable, etc.; a pinned _negative_ dentry can become positive at any
    point as long as you are not holding its parent at least shared.
    So using lookup_one_len_unlocked() needs to be careful;
    lookup_positive_unlocked() is safer and that's what the callers
    end up open-coding anyway.
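
    The shape of such a helper can be sketched in userspace roughly as
    follows. Everything here is a stand-in: toy_lookup() plays the role of
    the raw lookup, the toy_err_ptr() family imitates the kernel's
    ERR_PTR() convention, and the refcount field imitates pinning.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_ERRNO 4095

/* Userspace imitation of the ERR_PTR()/IS_ERR()/PTR_ERR() convention. */
static void *toy_err_ptr(long err) { return (void *)(intptr_t)err; }
static long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_dentry {
	const char *name;
	int positive;	/* 0 = negative dentry */
	int refcount;	/* stands in for the pin */
};

/* Raw lookup: returns a pinned dentry that may well be negative. */
static struct toy_dentry *toy_lookup(struct toy_dentry *tbl, size_t n,
				     const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tbl[i].name, name) == 0) {
			tbl[i].refcount++;
			return &tbl[i];
		}
	}
	return toy_err_ptr(-EINVAL);	/* arbitrary error for the sketch */
}

/* Helper: same lookup, but a negative result becomes -ENOENT. */
static struct toy_dentry *toy_lookup_positive(struct toy_dentry *tbl,
					      size_t n, const char *name)
{
	struct toy_dentry *d = toy_lookup(tbl, n, name);

	if (!toy_is_err(d) && !d->positive) {
		d->refcount--;		/* drop the pin before failing */
		d = toy_err_ptr(-ENOENT);
	}
	return d;
}
```

    Callers then only have to test for an error pointer; they never see a
    pinned negative dentry.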

    Signed-off-by: Al Viro

    Al Viro
     
  • There are 4 callers; two proceed to check if result is positive and
    fail with ENOENT if it isn't; one (in handle_lookup_down()) is
    guaranteed to yield positive and one (in lookup_fast()) is _preceded_
    by positivity check.

    However, follow_managed() on a negative dentry is a (fairly cheap)
    no-op on anything other than autofs. And negative autofs dentries
    are never hashed, so lookup_fast() is not going to run into one
    of those. Moreover, successful follow_managed() on a _positive_
    dentry never yields a negative one (and we significantly rely upon
    that in callers of lookup_fast()).

    In other words, we can easily transpose the positivity check and
    the call of follow_managed() in lookup_fast(). And that allows
    to fold the positivity check *into* follow_managed(), simplifying
    life for the code downstream of its calls.

    Signed-off-by: Al Viro

    Al Viro
     

04 Oct, 2019

1 commit

  • This renames the very specific audit_log_link_denied() to
    audit_log_path_denied() and adds the AUDIT_* type as an argument. This
    allows for the creation of the new AUDIT_ANOM_CREAT that can be used to
    report the fifo/regular file creation restrictions that were introduced
    in commit 30aba6656f61 ("namei: allow restricted O_CREAT of FIFOs and
    regular files").

    Signed-off-by: Kees Cook
    Signed-off-by: Paul Moore

    Kees Cook
     

03 Sep, 2019

1 commit

  • The rules for nd->root are messy:
    * if we have LOOKUP_ROOT, it doesn't contribute to refcounts
    * if we have LOOKUP_RCU, it doesn't contribute to refcounts
    * if nd->root.mnt is NULL, it doesn't contribute to refcounts
    * otherwise it does contribute

    terminate_walk() needs to drop the references if they are contributing.
    So everything else should be careful not to confuse it, leading to
    rather convoluted code.

    It's easier to keep track of whether we'd grabbed the reference(s)
    explicitly. Use a new flag for that. Don't bother with zeroing
    nd->root.mnt on unlazy failures and in terminate_walk - it's not
    needed anymore (terminate_walk() won't care and the next path_init()
    will zero nd->root in !LOOKUP_ROOT case anyway).

    Resulting rules for nd->root refcounts are much simpler: they are
    contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags.
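
    The resulting rule is easy to model outside the kernel. The sketch
    below uses invented names (TOY_ROOT_GRABBED, struct toy_nameidata) and
    a bare integer refcount; it only illustrates "drop the reference iff
    the flag says we took one".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for LOOKUP_ROOT_GRABBED. */
#define TOY_ROOT_GRABBED 0x1u

struct toy_root {
	int refcount;
};

struct toy_nameidata {
	struct toy_root *root;	/* may be borrowed or owned */
	unsigned int flags;
};

/* Take a reference on the root and remember that we did so. */
static void toy_grab_root(struct toy_nameidata *nd)
{
	assert(nd->root != NULL);
	if (!(nd->flags & TOY_ROOT_GRABBED)) {
		nd->root->refcount++;
		nd->flags |= TOY_ROOT_GRABBED;
	}
}

/* Cleanup consults only the flag, not how the root was obtained. */
static void toy_terminate_walk(struct toy_nameidata *nd)
{
	if (nd->flags & TOY_ROOT_GRABBED) {
		nd->root->refcount--;
		nd->flags &= ~TOY_ROOT_GRABBED;
	}
}
```

    With the ownership recorded explicitly, the cleanup path no longer has
    to infer it from other state.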

    Signed-off-by: Al Viro

    Al Viro
     

31 Aug, 2019

1 commit


22 Jul, 2019

3 commits


20 Jun, 2019

1 commit

  • We would like to move fsnotify_nameremove() calls from d_delete()
    into a higher layer where the hook makes more sense and so we can
    consider every d_delete() call site individually.

    Start by creating empty hook fsnotify_{unlink,rmdir}() and place
    them in the proper VFS call sites. After all d_delete() call sites
    will be converted to use the new hook, the new hook will generate the
    delete events and fsnotify_nameremove() hook will be removed.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

08 May, 2019

1 commit

  • Pull fscrypt updates from Ted Ts'o:
    "Clean up fscrypt's dcache revalidation support, and other
    miscellaneous cleanups"

    * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
    fscrypt: cache decrypted symlink target in ->i_link
    vfs: use READ_ONCE() to access ->i_link
    fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext
    fscrypt: only set dentry_operations on ciphertext dentries
    fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory
    fscrypt: fix race allowing rename() and link() of ciphertext dentries
    fscrypt: clean up and improve dentry revalidation
    fscrypt: use READ_ONCE() to access ->i_crypt_info
    fscrypt: remove WARN_ON_ONCE() when decryption fails
    fscrypt: drop inode argument from fscrypt_get_ctx()

    Linus Torvalds
     

27 Apr, 2019

2 commits


18 Apr, 2019

1 commit

  • Use 'READ_ONCE(inode->i_link)' to explicitly support filesystems caching
    the symlink target in ->i_link later if it was unavailable at iget()
    time, or wasn't easily available. I'll be doing this in fscrypt, to
    improve the performance of encrypted symlinks on ext4, f2fs, and ubifs.

    ->i_link will start NULL and may later be set to a non-NULL value by a
    smp_store_release() or cmpxchg_release(). READ_ONCE() is needed on the
    read side. smp_load_acquire() is unnecessary because only a data
    dependency barrier is required. (Thanks to Al for pointing this out.)
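
    A userspace approximation of that publish-once scheme, using C11
    atomics, might look like this. The names are invented, and
    memory_order_consume is used as the closest C11 analogue of
    READ_ONCE() plus a data dependency (compilers typically treat it as
    acquire).

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of an inode with a lazily cached symlink target. */
struct toy_inode {
	_Atomic(const char *) link;	/* starts NULL, set at most once */
};

/* Read side: one dependency-ordered load of the pointer. */
static const char *toy_get_link(struct toy_inode *inode)
{
	return atomic_load_explicit(&inode->link, memory_order_consume);
}

/*
 * Write side: publish a fully initialised target with release
 * semantics. If another thread won the race, return its pointer.
 */
static const char *toy_set_link(struct toy_inode *inode, const char *target)
{
	const char *expected = NULL;

	if (atomic_compare_exchange_strong_explicit(&inode->link, &expected,
						    target,
						    memory_order_release,
						    memory_order_relaxed))
		return target;
	return toy_get_link(inode);	/* re-read with proper ordering */
}
```

    The loser of the race is expected to discard its own copy and use the
    pointer that was published first.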

    Acked-by: Al Viro
    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o

    Eric Biggers
     

13 Mar, 2019

1 commit

  • Pull vfs mount infrastructure updates from Al Viro:
    "The rest of core infrastructure; no new syscalls in that pile, but the
    old parts are switched to new infrastructure. At that point
    conversions of individual filesystems can happen independently; some
    are done here (afs, cgroup, procfs, etc.), there's also a large series
    outside of that pile dealing with NFS (quite a bit of option-parsing
    stuff is getting used there - it's one of the most convoluted
    filesystems in terms of mount-related logics), but NFS bits are the
    next cycle fodder.

    It got seriously simplified since the last cycle; documentation is
    probably the weakest bit at the moment - I considered dropping the
    commit introducing Documentation/filesystems/mount_api.txt (cutting
    the size increase by quarter ;-), but decided that it would be better
    to fix it up after -rc1 instead.

    That pile allows to do followup work in independent branches, which
    should make life much easier for the next cycle. fs/super.c size
    increase is unpleasant; there's a followup series that allows to
    shrink it considerably, but I decided to leave that until the next
    cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
    afs: Use fs_context to pass parameters over automount
    afs: Add fs_context support
    vfs: Add some logging to the core users of the fs_context log
    vfs: Implement logging through fs_context
    vfs: Provide documentation for new mount API
    vfs: Remove kern_mount_data()
    hugetlbfs: Convert to fs_context
    cpuset: Use fs_context
    kernfs, sysfs, cgroup, intel_rdt: Support fs_context
    cgroup: store a reference to cgroup_ns into cgroup_fs_context
    cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper
    cgroup_do_mount(): massage calling conventions
    cgroup: stash cgroup_root reference into cgroup_fs_context
    cgroup2: switch to option-by-option parsing
    cgroup1: switch to option-by-option parsing
    cgroup: take options parsing into ->parse_monolithic()
    cgroup: fold cgroup1_mount() into cgroup1_get_tree()
    cgroup: start switching to fs_context
    ipc: Convert mqueue fs to fs_context
    proc: Add fs_context support to procfs
    ...

    Linus Torvalds
     

11 Mar, 2019

1 commit

  • …morris/linux-security

    Pull integrity updates from James Morris:
    "Mimi Zohar says:

    'Linux 5.0 introduced the platform keyring to allow verifying the IMA
    kexec kernel image signature using the pre-boot keys. This pull
    request similarly makes keys on the platform keyring accessible for
    verifying the PE kernel image signature.

    Also included in this pull request is a new IMA hook that tags tmp
    files, in policy, indicating the file hash needs to be calculated.
    The remaining patches are cleanup'"

    * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    evm: Use defined constant for UUID representation
    ima: define ima_post_create_tmpfile() hook and add missing call
    evm: remove set but not used variable 'xattr'
    encrypted-keys: fix Opt_err/Opt_error = -1
    kexec, KEYS: Make use of platform keyring for signature verify
    integrity, KEYS: add a reference to platform keyring

    Linus Torvalds
     

08 Mar, 2019

2 commits

  • Merge more updates from Andrew Morton:

    - some of the rest of MM

    - various misc things

    - dynamic-debug updates

    - checkpatch

    - some epoll speedups

    - autofs

    - rapidio

    - lib/, lib/lzo/ updates

    * emailed patches from Andrew Morton : (83 commits)
    samples/mic/mpssd/mpssd.h: remove duplicate header
    kernel/fork.c: remove duplicated include
    include/linux/relay.h: fix percpu annotation in struct rchan
    arch/nios2/mm/fault.c: remove duplicate include
    unicore32: stop printing the virtual memory layout
    MAINTAINERS: fix GTA02 entry and mark as orphan
    mm: create the new vm_fault_t type
    arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
    arch: simplify several early memory allocations
    openrisc: simplify pte_alloc_one_kernel()
    sh: prefer memblock APIs returning virtual address
    microblaze: prefer memblock API returning virtual address
    powerpc: prefer memblock APIs returning virtual address
    lib/lzo: separate lzo-rle from lzo
    lib/lzo: implement run-length encoding
    lib/lzo: fast 8-byte copy on arm64
    lib/lzo: 64-bit CTZ on arm64
    lib/lzo: tidy-up ifdefs
    ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
    ipc: annotate implicit fall through
    ...

    Linus Torvalds
     
  • Instead of doing this compile-time check in some slightly arbitrary user
    of struct filename, put it next to the definition.
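
    As an illustration of the pattern, assuming the check in question is
    an alignment constraint on a trailing member, a C11 static_assert can
    sit directly after the structure it guards. struct toy_filename below
    is a made-up layout, not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout, only loosely modelled on a name-holding struct. */
struct toy_filename {
	const char *name;	/* points at iname[] or a separate buffer */
	const char *uptr;	/* original user pointer */
	int refcnt;
	void *aname;
	const char iname[];	/* inline storage for short names */
};

/* The constraint lives next to the definition it is about. */
static_assert(offsetof(struct toy_filename, iname) % sizeof(long) == 0,
	      "iname must start on a long-aligned offset");
```

    Anyone changing the structure sees the constraint immediately, rather
    than tripping over it in an unrelated file.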

    Link: http://lkml.kernel.org/r/20190208203015.29702-3-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Cc: Alexander Viro
    Cc: Kees Cook
    Cc: Luc Van Oostenryck
    Cc: Masahiro Yamada
    Cc: Nick Desaulniers
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

28 Feb, 2019

1 commit

  • Because the new API passes in key,value parameters, match_token() cannot be
    used with it. Instead, provide three new helpers to aid with parsing:

    (1) fs_parse(). This takes a parameter and a simple static description of
    all the parameters and maps the key name to an ID. It returns 1 on a
    match, 0 on no match if unknowns should be ignored and some other
    negative error code on a parse error.

    The parameter description includes a list of key names to IDs, desired
    parameter types and a list of enumeration name -> ID mappings.

    [!] Note that for the moment I've required that the key->ID mapping
    array is expected to be sorted and unterminated. The size of the
    array is noted in the fsconfig_parser struct. This allows me to use
    bsearch(), but I'm not sure any performance gain is worth the hassle
    of requiring people to keep the array sorted.

    The parameter type array is sized according to the number of parameter
    IDs and is indexed directly. The optional enum mapping array is an
    unterminated, unsorted list and the size goes into the fsconfig_parser
    struct.

    The function can do some additional things:

    (a) If it's not ambiguous and no value is given, the prefix "no" on
    a key name is permitted to indicate that the parameter should
    be considered negatory.

    (b) If the desired type is a single simple integer, it will perform
    an appropriate conversion and store the result in a union in
    the parse result.

    (c) If the desired type is an enumeration, {key ID, name} will be
    looked up in the enumeration list and the matching value will
    be stored in the parse result union.

    (d) Optionally generate an error if the key is unrecognised.

    This is called something like:

    enum rdt_param {
            Opt_cdp,
            Opt_cdpl2,
            Opt_mba_mpbs,
            nr__rdt_params
    };

    const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
            [Opt_cdp] = { fs_param_is_bool },
            [Opt_cdpl2] = { fs_param_is_bool },
            [Opt_mba_mpbs] = { fs_param_is_bool },
    };

    const char *const rdt_param_keys[nr__rdt_params] = {
            [Opt_cdp] = "cdp",
            [Opt_cdpl2] = "cdpl2",
            [Opt_mba_mpbs] = "mba_mbps",
    };

    const struct fs_parameter_description rdt_parser = {
            .name = "rdt",
            .nr_params = nr__rdt_params,
            .keys = rdt_param_keys,
            .specs = rdt_param_specs,
            .no_source = true,
    };

    int rdt_parse_param(struct fs_context *fc,
                        struct fs_parameter *param)
    {
            struct fs_parse_result parse;
            struct rdt_fs_context *ctx = rdt_fc2context(fc);
            int ret;

            ret = fs_parse(fc, &rdt_parser, param, &parse);
            if (ret < 0)
                    return ret;

            switch (parse.key) {
            case Opt_cdp:
                    ctx->enable_cdpl3 = true;
                    return 0;
            case Opt_cdpl2:
                    ctx->enable_cdpl2 = true;
                    return 0;
            case Opt_mba_mpbs:
                    ctx->enable_mba_mbps = true;
                    return 0;
            }

            return -EINVAL;
    }

    (2) fs_lookup_param(). This takes a { dirfd, path, LOOKUP_EMPTY? } or
    string value and performs an appropriate path lookup to convert it
    into a path object, which it will then return.

    If the desired type was a blockdev, the type of the looked up inode
    will be checked to make sure it is one.

    This can be used like:

    enum foo_param {
            Opt_source,
            nr__foo_params
    };

    const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
            [Opt_source] = { fs_param_is_blockdev },
    };

    const char *const foo_param_keys[nr__foo_params] = {
            [Opt_source] = "source",
    };

    const struct constant_table foo_param_alt_keys[] = {
            { "device", Opt_source },
    };

    const struct fs_parameter_description foo_parser = {
            .name = "foo",
            .nr_params = nr__foo_params,
            .nr_alt_keys = ARRAY_SIZE(foo_param_alt_keys),
            .keys = foo_param_keys,
            .alt_keys = foo_param_alt_keys,
            .specs = foo_param_specs,
    };

    int foo_parse_param(struct fs_context *fc,
                        struct fs_parameter *param)
    {
            struct fs_parse_result parse;
            struct foo_fs_context *ctx = foo_fc2context(fc);
            int ret;

            ret = fs_parse(fc, &foo_parser, param, &parse);
            if (ret < 0)
                    return ret;

            switch (parse.key) {
            case Opt_source:
                    return fs_lookup_param(fc, &foo_parser, param,
                                           &parse, &ctx->source);
            default:
                    return -EINVAL;
            }
    }

    (3) lookup_constant(). This takes a table of named constants and looks up
    the given name within it. The table is expected to be sorted such
    that bsearch() can be used upon it.

    Possibly I should require the table be terminated and just use a
    for-loop to scan it instead of using bsearch() to reduce hassle.

    Tables look something like:

    static const struct constant_table bool_names[] = {
            { "0", false },
            { "1", true },
            { "false", false },
            { "no", false },
            { "true", true },
            { "yes", true },
    };

    and a lookup is done with something like:

    b = lookup_constant(bool_names, param->string, -1);
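
    For readers who want to try the idea outside the kernel, a sorted
    table plus the C library's bsearch() is enough. This is a standalone
    sketch with its own names (struct toy_constant, toy_lookup_constant())
    and an explicit table size; it is not the kernel helper itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct toy_constant {
	const char *name;
	int value;
};

/* bsearch() comparator: key is the name being looked up. */
static int toy_cmp(const void *key, const void *elem)
{
	const struct toy_constant *c = elem;

	return strcmp(key, c->name);
}

/* Return the value for 'name', or 'not_found' if it is not in the table. */
static int toy_lookup_constant(const struct toy_constant *tbl, size_t n,
			       const char *name, int not_found)
{
	const struct toy_constant *c;

	assert(tbl != NULL && name != NULL);
	c = bsearch(name, tbl, n, sizeof(*tbl), toy_cmp);
	return c ? c->value : not_found;
}

/* Must stay sorted by strcmp() order, or bsearch() may miss entries. */
static const struct toy_constant toy_bool_names[] = {
	{ "0", 0 },
	{ "1", 1 },
	{ "false", 0 },
	{ "no", 0 },
	{ "true", 1 },
	{ "yes", 1 },
};
```

    The sorting requirement is the cost of the binary search: an entry
    added out of order would silently become unreachable.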

    Additionally, optional validation routines for the parameter description
    are provided that can be enabled at compile time. A later patch will
    invoke these when a filesystem is registered.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

05 Feb, 2019

1 commit

  • If tmpfiles can be made persistent, then newly created tmpfiles need to
    be treated like any other new files in policy.

    This patch indicates which newly created tmpfiles are in policy, causing
    the file hash to be calculated on __fput().

    Reported-by: Ignaz Forster
    [rgoldwyn@suse.com: Call ima_post_create_tmpfile() in vfs_tmpfile() as
    opposed to do_tmpfile(). This will help the case for overlayfs where
    copy_up is denied while overwriting a file.]
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Mimi Zohar

    Mimi Zohar
     

31 Jan, 2019

1 commit

  • Don't fetch fcaps when umount2 is called to avoid a process hang while
    it waits for the missing resource to (possibly never) re-appear.

    Note the comment above user_path_mountpoint_at():
    * A umount is a special case for path walking. We're not actually interested
    * in the inode in this situation, and ESTALE errors can be a problem. We
    * simply want track down the dentry and vfsmount attached at the mountpoint
    * and avoid revalidating the last component.

    This can happen on ceph, cifs, 9p, lustre, fuse (gluster) or NFS.

    Please see the github issue tracker
    https://github.com/linux-audit/audit-kernel/issues/100

    Signed-off-by: Richard Guy Briggs
    [PM: merge fuzz in audit_log_fcaps()]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

23 Dec, 2018

1 commit

  • This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd.

    commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
    enabled mknod() in user namespaces for userns root if CAP_MKNOD is
    available. However, these device nodes are useless since any filesystem
    mounted from a non-initial user namespace will set the SB_I_NODEV flag on
    the filesystem. Now, when a device node is created in a non-initial user
    namespace a call to open() on said device node will fail due to:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                    !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

    The problem with this is that as of the aforementioned commit mknod()
    creates partially functional device nodes in non-initial user namespaces.
    In particular, it has the consequence that as of the aforementioned commit
    open() will be more privileged with respect to device nodes than mknod().
    Before it was the other way around. Specifically, if mknod() succeeded
    then it was transparent for any userspace application that a fatal error
    must have occurred when open() failed.

    All of this breaks multiple userspace workloads and a widespread assumption
    about how to handle mknod(). Basically, all container runtimes and systemd
    live by the slogan "ask for forgiveness not permission" when running user
    namespace workloads. For mknod() the assumption is that if the syscall
    succeeds the device nodes are useable irrespective of whether it succeeds
    in a non-initial user namespace or not. This logic was chosen explicitly
    to allow for the glorious day when mknod() will actually be able to create
    fully functional device nodes in user namespaces.
    A specific problem people are already running into when running 4.18 rc
    kernels is failing systemd services. For any distro that is run in a
    container systemd services started with the PrivateDevices= property set
    will fail to start since the device nodes in question cannot be
    opened (cf. the arguments in [1]).

    Full disclosure, Seth made the very sound argument that it is already
    possible to end up with partially functional device nodes. Any filesystem
    mounted with MS_NODEV set will allow mknod() to succeed but will not allow
    open() to succeed. The difference to the case here is that the MS_NODEV
    case is transparent to userspace since it is an explicitly set mount option
    while the SB_I_NODEV case is an implicit property enforced by the kernel
    and hence opaque to userspace.

    [1]: https://github.com/systemd/systemd/pull/9483

    Signed-off-by: Christian Brauner
    Cc: "Eric W. Biederman"
    Cc: Seth Forshee
    Cc: Serge Hallyn
    Signed-off-by: Linus Torvalds

    Christian Brauner
     

24 Aug, 2018

1 commit

  • Disallows open of FIFOs or regular files not owned by the user in world
    writable sticky directories, unless the owner is the same as that of the
    directory or the file is opened without the O_CREAT flag. The purpose
    is to make data spoofing attacks harder. This protection can be turned
    on and off separately for FIFOs and regular files via sysctl, just like
    the symlinks/hardlinks protection. This patch is based on Openwall's
    "HARDEN_FIFO" feature by Solar Designer.
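
    Read literally, the rule above boils down to a small predicate. The
    following userspace sketch follows only the prose of this message;
    struct toy_policy and toy_may_open_in_sticky() are invented names, and
    the two booleans stand in for the per-type sysctl switches.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical stand-in for the two sysctl knobs. */
struct toy_policy {
	bool protect_fifos;
	bool protect_regular;
};

/* Returns 0 if the open may proceed, -EACCES if the rule forbids it. */
static int toy_may_open_in_sticky(const struct toy_policy *p,
				  mode_t dir_mode, uid_t dir_uid,
				  mode_t file_mode, uid_t file_uid,
				  uid_t caller_uid, int open_flags)
{
	assert(p != NULL);

	/* Opens without O_CREAT are never restricted. */
	if (!(open_flags & O_CREAT))
		return 0;

	/* Only FIFOs and regular files, each with its own switch. */
	if (S_ISFIFO(file_mode)) {
		if (!p->protect_fifos)
			return 0;
	} else if (S_ISREG(file_mode)) {
		if (!p->protect_regular)
			return 0;
	} else {
		return 0;
	}

	/* Only world-writable sticky directories are covered. */
	if (!(dir_mode & S_ISVTX) || !(dir_mode & S_IWOTH))
		return 0;

	/* Files owned by the caller or by the directory owner are fine. */
	if (file_uid == caller_uid || file_uid == dir_uid)
		return 0;

	return -EACCES;
}
```

    For example, with both switches on, a user opening another user's
    regular file in a /tmp-like directory with O_CREAT is refused, while
    the same open without O_CREAT goes through.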

    This is a brief list of old vulnerabilities that could have been prevented
    by this feature, some of them even allow for privilege escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without any
    mention of this particular attack vector. In fact, before
    hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
    vehicle to exploit them.

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subject]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salvatore Mesoraca
     

22 Aug, 2018

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This contains two new features:

    - Stack file operations: this allows removal of several hacks from
    the VFS, proper interaction of read-only open files with copy-up,
    possibility to implement fs modifying ioctls properly, and others.

    - Metadata only copy-up: when file is on lower layer and only
    metadata is modified (except size) then only copy up the metadata
    and continue to use the data from the lower file"

    * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
    ovl: Enable metadata only feature
    ovl: Do not do metacopy only for ioctl modifying file attr
    ovl: Do not do metadata only copy-up for truncate operation
    ovl: add helper to force data copy-up
    ovl: Check redirect on index as well
    ovl: Set redirect on upper inode when it is linked
    ovl: Set redirect on metacopy files upon rename
    ovl: Do not set dentry type ORIGIN for broken hardlinks
    ovl: Add an inode flag OVL_CONST_INO
    ovl: Treat metacopy dentries as type OVL_PATH_MERGE
    ovl: Check redirects for metacopy files
    ovl: Move some dir related ovl_lookup_single() code in else block
    ovl: Do not expose metacopy only dentry from d_real()
    ovl: Open file with data except for the case of fsync
    ovl: Add helper ovl_inode_realdata()
    ovl: Store lower data inode in ovl_inode
    ovl: Fix ovl_getattr() to get number of blocks from lower
    ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
    ovl: Copy up meta inode data from lowest data inode
    ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • …ux/kernel/git/viro/vfs

    Pull misc vfs updates from Al Viro:
    "Misc cleanups from various folks all over the place

    I expected more fs/dcache.c cleanups this cycle, so that went into a
    separate branch. Said cleanups have missed the window, so in the
    hindsight it could've gone into work.misc instead. Decided not to
    cherry-pick, thus the 'work.dcache' branch"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: dcache: Use true and false for boolean values
    fold generic_readlink() into its only caller
    fs: shave 8 bytes off of struct inode
    fs: Add more kernel-doc to the produced documentation
    fs: Fix attr.c kernel-doc
    removed extra extern file_fdatawait_range

    * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    kill dentry_update_name_case()

    Linus Torvalds