15 Oct, 2016

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This update contains fixes to the "use mounter's permission to access
    underlying layers" area, and miscellaneous other fixes and cleanups.

    No new features this time"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: use vfs_get_link()
    vfs: add vfs_get_link() helper
    ovl: use generic_readlink
    ovl: explain error values when removing acl from workdir
    ovl: Fix info leak in ovl_lookup_temp()
    ovl: during copy up, switch to mounter's creds early
    ovl: lookup: do getxattr with mounter's permission
    ovl: copy_up_xattr(): use strnlen

    Linus Torvalds
     

14 Oct, 2016

1 commit

  • This helper is for filesystems that want to read the symlink and are better
    off with the get_link() interface (returning a char *) rather than the
    readlink() interface (copy into a userspace buffer).

    Also call the LSM hook for readlink (not get_link) since this is for
    symlink reading not following.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

11 Oct, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
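
    From userspace, the RENAME_NOREPLACE behaviour listed in this pull is
    reached through the renameat2() system call. The following is a minimal
    sketch, assuming a Linux system whose headers define SYS_renameat2; the
    wrapper name rename_noreplace() is made up for illustration and is not
    part of the merged series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD, open() */
#include <sys/syscall.h>  /* SYS_renameat2 */
#include <unistd.h>       /* syscall(), unlink() */

#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0)  /* value used by the kernel UAPI headers */
#endif

/*
 * Hypothetical helper: rename src to dst, but refuse to overwrite an
 * existing dst. Returns 0 on success or a negative errno value.
 */
static int rename_noreplace(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    if (syscall(SYS_renameat2, AT_FDCWD, src, AT_FDCWD, dst,
                RENAME_NOREPLACE) == 0)
        return 0;
    return -errno;
}
```

    When dst already exists the call is expected to fail with EEXIST instead
    of silently replacing the target; a filesystem that does not implement
    the flag reports EINVAL.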
     

27 Sep, 2016

2 commits


16 Sep, 2016

1 commit

  • On overlayfs relatime_need_update() needs inode times to be correct on
    overlay inode. But i_mtime and i_ctime are updated by filesystem code on
    underlying inode only, so they will be out-of-date on the overlay inode.

    This patch copies the times from the underlying inode if needed. This
    can't be done if called from RCU lookup (link following) but link m/ctime
    are not updated by fs, so this is all right.

    This patch doesn't change functionality for anything but overlayfs.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
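
    To see why stale times on the overlay inode matter, here is a userspace
    model of the relatime rule (update atime when it is not newer than mtime
    or ctime, or when it is at least a day old). It is only a sketch: the
    toy_inode struct and both function names are invented for illustration
    and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Toy stand-in for the three inode timestamps the relatime check reads. */
struct toy_inode {
    time_t i_atime;
    time_t i_mtime;
    time_t i_ctime;
};

/* relatime: is an atime update needed at time 'now'? */
static bool toy_relatime_need_update(const struct toy_inode *inode, time_t now)
{
    assert(inode != NULL);

    if (inode->i_mtime >= inode->i_atime)
        return true;                  /* modified since last access */
    if (inode->i_ctime >= inode->i_atime)
        return true;                  /* changed since last access */
    if (now - inode->i_atime >= 24 * 60 * 60)
        return true;                  /* atime is at least a day old */
    return false;
}

/* Refresh the overlay inode's m/ctime from the underlying inode. */
static void toy_copy_times(struct toy_inode *overlay,
                           const struct toy_inode *real)
{
    overlay->i_mtime = real->i_mtime;
    overlay->i_ctime = real->i_ctime;
}
```

    With an overlay inode whose mtime still predates its atime while the
    underlying inode has since been modified, the check gives the wrong
    answer until the times are copied over.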
     

07 Aug, 2016

1 commit

  • In most cases, EPERM is returned on immutable inode, and there are only a
    few places returning EACCES. I noticed this when running LTP on
    overlayfs: setxattr03 failed due to unexpected EACCES on immutable
    inode.

    So convert all EACCES to EPERM on immutable inode.

    Acked-by: Dave Chinner
    Signed-off-by: Eryu Guan
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Eryu Guan
     

30 Jul, 2016

2 commits

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounters credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, put up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and diligently tested and reviewed my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     
  • This reverts commit 3c9fe8cdff1b889a059a30d22f130372f2b3885f.

    As Miklos points out in commit c1b2cc1a765a, the "lookup_hash()" helper
    is now unused, and in fact, with the hash salting changes, since the
    hash of a dentry name now depends on the directory dentry it is in, the
    helper function isn't even really likely to be useful.

    So rather than keep it around in case somebody else might end up finding
    a use for it, let's just remove the helper and not trick people into
    thinking it might be a useful thing.

    For example, I had obviously completely missed how the helper didn't
    follow the normal dentry hashing patterns, and how the hash salting
    patch broke overlayfs. Things would quietly build and look sane, but
    not work.

    Suggested-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Jul, 2016

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     

25 Jul, 2016

1 commit

  • commit 6c51e513a3aa ("lookup_dcache(): lift d_alloc() into callers")
    removed the need_lookup argument from lookup_dcache(), but the
    comment was forgotten. Also it no longer allocates a new dentry
    if nothing was found.

    Signed-off-by: Oleg Drokin
    Signed-off-by: Al Viro

    Oleg Drokin
     

24 Jul, 2016

1 commit

  • Seth Forshee reported a mount regression in nfs automounts
    with "fs: Add user namespace member to struct super_block".

    It turns out that the assumption that current->cred is something
    reasonable during mount while necessary to improve support of
    unprivileged mounts is wrong in the automount path.

    To fix the existing filesystems override current->cred with the
    init_cred before calling d_automount and restore current->cred after
    d_automount completes.

    To support unprivileged mounts would require a more nuanced cred
    selection, so fail on unprivileged mounts for the time being. As none
    of the filesystems that currently set FS_USERNS_MOUNT implement
    d_automount this check is only good for preventing future problems.

    Fixes: 6e4eab577a0c ("fs: Add user namespace member to struct super_block")
    Tested-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 Jul, 2016

2 commits

  • It is expected that filesystems cannot represent uids and gids from
    outside of their user namespace. Keep things simple by not even
    trying to create filesystem nodes with nonsense uids and gids.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • When a filesystem outside of init_user_ns is mounted it could have
    uids and gids stored in it that do not map to init_user_ns.

    The plan is to allow those filesystems to set i_uid to INVALID_UID and
    i_gid to INVALID_GID for unmapped uids and gids and then to handle
    that strange case in the vfs to ensure there is consistent robust
    handling of the weirdness.

    Upon a careful review of the vfs and filesystems about the only case
    where there is any possibility of confusion or trouble is when the
    inode is written back to disk. In that case filesystems typically
    read the inode->i_uid and inode->i_gid and write them to disk even
    when just an inode timestamp is being updated.

    Which leads to a rule that is very simple to implement and understand:
    inodes whose i_uid or i_gid is not valid may not be written.

    In dealing with access times this means treat those inodes as if the
    inode flag S_NOATIME was set. Reads of the inodes appear safe and
    useful, but any write or modification is disallowed. The only inode
    write that is allowed is a chown that sets the uid and gid on the
    inode to valid values. After such a chown the inode is normal and may
    be treated as such.

    Denying all writes to inodes with uids or gids unknown to the vfs also
    prevents several oddball cases where corruption would have occurred
    because the vfs does not have complete information.

    One problem case that is prevented is attempting to use the gid of a
    directory for new inodes where the directory's sgid bit is set but the
    directory's gid is not mapped.

    Another problem case avoided is attempting to update the evm hash
    after setxattr, removexattr, and setattr. As the evm hash includes
    the inode->i_uid or inode->i_gid, not knowing the uid or gid prevents
    a correct evm hash from being computed. evm hash verification also
    fails when i_uid or i_gid is unknown but that is essentially harmless
    as it does not cause filesystem corruption.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 Jul, 2016

1 commit

  • Filesystem uids which don't map into a user namespace may result
    in inode->i_uid being INVALID_UID. A symlink and its parent that
    have different owners in the filesystem can both get
    mapped to INVALID_UID, which may result in following a symlink
    when this would not have otherwise been permitted when protected
    symlinks are enabled.

    Signed-off-by: Seth Forshee
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Seth Forshee
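
    The protected-symlinks decision and the extra validity check can be
    sketched as below. This is a simplified userspace model: the function
    name, the plain integer uids and the TOY_* mode constants are assumptions
    made for illustration, and the real check operates on kernel credential
    and inode structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_INVALID_UID ((uint32_t)-1)
#define TOY_S_ISVTX 01000u   /* sticky bit */
#define TOY_S_IWOTH 00002u   /* world-writable bit */

/*
 * Returns 0 if following the link is allowed, -EACCES otherwise.
 * fsuid:       uid of the process following the link
 * link_uid:    owner of the symlink
 * parent_uid:  owner of the directory containing the symlink
 * parent_mode: permission bits of that directory
 */
static int toy_may_follow_link(uint32_t fsuid, uint32_t link_uid,
                               uint32_t parent_uid, unsigned int parent_mode)
{
    assert(fsuid != TOY_INVALID_UID);

    /* Allowed if follower and link owner match. */
    if (fsuid == link_uid)
        return 0;

    /* Allowed if the parent is not both sticky and world-writable. */
    if ((parent_mode & (TOY_S_ISVTX | TOY_S_IWOTH)) !=
        (TOY_S_ISVTX | TOY_S_IWOTH))
        return 0;

    /*
     * Allowed if parent and link owner match, but only when the parent
     * owner is a valid uid: two unmapped owners must not compare equal.
     */
    if (parent_uid != TOY_INVALID_UID && parent_uid == link_uid)
        return 0;

    return -EACCES;
}
```

    Without the validity test in the last comparison, a link and a sticky
    world-writable parent that both carry an unmapped owner would be treated
    as having the same owner.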
     

30 Jun, 2016

1 commit

  • The two methods essentially do the same: find the real dentry/inode
    belonging to an overlay dentry. The difference is in the usage:

    vfs_open() uses ->d_select_inode() and expects the function to perform
    copy-up if necessary based on the open flags argument.

    file_dentry() uses ->d_real() passing in the overlay dentry as well as the
    underlying inode.

    vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real()
    with a zero inode would have worked just as well here.

    This patch merges the functionality of ->d_select_inode() into ->d_real()
    by adding an 'open_flags' argument to the latter.

    [Al Viro] Make the signature of d_real() match that of ->d_real() again.
    And constify the inode argument, while we are at it.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

24 Jun, 2016

1 commit

  • Introduce a function may_open_dev that tests MNT_NODEV and a new
    superblock flag SB_I_NODEV. Use this new function in all of the
    places where MNT_NODEV was previously tested.

    Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
    filesystems should never support device nodes, and a simple superblock
    flag makes that very hard to get wrong. With SB_I_NODEV set if any
    device nodes somehow manage to show up on a filesystem those
    device nodes will be unopenable.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

11 Jun, 2016

1 commit

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
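
    The idea of salting early can be illustrated with a toy string hash that
    takes the salt as its starting state. This is not the dcache hash: it
    uses 64-bit FNV-1a as a stand-in, and toy_salted_hash() is a name made
    up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy salted hash: the salt (for example a parent directory pointer) is
 * folded into the initial state, before any name bytes are processed.
 * Passing NULL means "no salt".
 */
static uint64_t toy_salted_hash(const void *salt, const char *name, size_t len)
{
    /* FNV-1a 64-bit offset basis, perturbed by the salt. */
    uint64_t h = 14695981039346656037ULL ^ (uint64_t)(uintptr_t)salt;

    assert(name != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)name[i];
        h *= 1099511628211ULL;        /* FNV-1a 64-bit prime */
    }
    return h;
}
```

    The same name hashed under two different parents starts from two
    different states, so no separate "mix in the parent" step is needed at
    lookup time.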
     

08 Jun, 2016

2 commits

  • Pull vfs fixes from Al Viro:
    "Fixes for crap of assorted ages: EOPENSTALE one is 4.2+, autofs one is
    4.6, d_walk - 3.2+.

    The atomic_open() and coredump ones are regressions from this window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coredump: fix dumping through pipes
    fix a regression in atomic_open()
    fix d_walk()/non-delayed __d_free() race
    autofs braino fix for do_last()
    fix EOPENSTALE bug in do_last()

    Linus Torvalds
     
  • open("/foo/no_such_file", O_RDONLY | O_CREAT) should fail with
    EACCES when /foo is not writable; failing with ENOENT is obviously
    wrong. That got broken by a braino introduced when moving the
    creat_error logics from atomic_open() to lookup_open(). Easy to
    fix, fortunately.

    Spotted-by: "Yan, Zheng"
    Tested-by: "Yan, Zheng"
    Signed-off-by: Al Viro

    Al Viro
     

06 Jun, 2016

1 commit

  • The /dev/ptmx device node is changed to lookup the directory entry "pts"
    in the same directory as the /dev/ptmx device node was opened in. If
    there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx
    uses that filesystem. Otherwise the open of /dev/ptmx fails.

    The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
    userspace can now safely depend on each mount of devpts creating a new
    instance of the filesystem.

    Each mount of devpts is now a separate and equal filesystem.

    Reserved ttys are now available to all instances of devpts where the
    mounter is in the initial mount namespace.

    A new vfs helper path_pts is introduced that finds a directory entry
    named "pts" in the directory of the passed in path, and changes the
    passed in path to point to it. The helper path_pts uses a function
    path_parent_directory that was factored out of follow_dotdot.

    In the implementation of devpts:
    - devpts_mnt is killed as it is no longer meaningful if all mounts of
    devpts are equal.
    - pts_sb_from_inode is replaced by just inode->i_sb as all cached
    inodes in the tty layer are now from the devpts filesystem.
    - devpts_add_ref is rolled into the new function devpts_ptmx. And the
    unnecessary inode hold is removed.
    - devpts_del_ref is renamed devpts_release and reduced to just a
    deactivate_super.
    - The newinstance mount option continues to be accepted but is now
    ignored.

    In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as
    they are never used.

    Documentation/filesystems/devices.txt is updated to describe the current
    situation.

    This has been verified to work properly on openwrt-15.05, centos5,
    centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3,
    ubuntu-15.10, fedora23, mageia-5, mint-17.3, opensuse-42.1,
    slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. With the
    caveat that on centos6 and on slackware-14.1 there wind up being
    two instances of the devpts filesystem mounted on /dev/pts; the lower
    copy does not end up getting used.

    Signed-off-by: "Eric W. Biederman"
    Cc: Greg KH
    Cc: Peter Hurley
    Cc: Peter Anvin
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: Serge Hallyn
    Cc: Willy Tarreau
    Cc: Aurelien Jarno
    Cc: One Thousand Gnomes
    Cc: Jann Horn
    Cc: Jiri Slaby
    Cc: Florian Weimer
    Cc: Konstantin Khlebnikov
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

05 Jun, 2016

1 commit

  • It's an analogue of commit 7500c38a (fix the braino in "namei:
    massage lookup_slow() to be usable by lookup_one_len_unlocked()").
    The same problem (->lookup()-returned unhashed negative dentry
    just might be an autofs one with ->d_manage() that would wait
    until the daemon makes it positive) applies in do_last() - we
    need to do follow_managed() first.

    Fortunately, remaining callers of follow_managed() are OK - only
    autofs has that weirdness (negative dentry that does not mean
    an instant -ENOENT) and autofs never has its negative dentries
    hashed, so we can't pick one from a dcache lookup.

    ->d_manage() is a bloody mess ;-/

    Cc: stable@vger.kernel.org # v4.6
    Spotted-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     

04 Jun, 2016

1 commit

  • EOPENSTALE occurring at the last component of a trailing symlink ends up
    with do_last() retrying its lookup. After the symlink body has been
    discarded. The thing is, all this retry_lookup logics in there is not
    needed at all - the upper layers will do the right thing if we simply
    return that -EOPENSTALE as we would with any other error. Trying to
    microoptimize in do_last() is a lot of headache for no good reason.

    Cc: stable@vger.kernel.org # v4.2+
    Tested-by: Oleg Drokin
    Reviewed-and-Tested-by: Jeff Layton
    Signed-off-by: Al Viro

    Al Viro
     

29 May, 2016

6 commits

  • The self-test was updated to cover zero-length strings; the function
    needs to be updated, too.

    Reported-by: Geert Uytterhoeven
    Signed-off-by: George Spelvin
    Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function")
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • The original name was simply hash_string(), but that conflicted with a
    function with that name in drivers/base/power/trace.c, and I decided
    that calling it "hashlen_" was better anyway.

    But you have to do it in two places.

    [ This caused build errors for architectures that don't define
    CONFIG_DCACHE_WORD_ACCESS - Linus ]

    Signed-off-by: George Spelvin
    Reported-by: Guenter Roeck
    Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function")
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • Pull string hash improvements from George Spelvin:
    "This series does several related things:

    - Makes the dcache hash (fs/namei.c) useful for general kernel use.

    (Thanks to Bruce for noticing the zero-length corner case)

    - Converts the string hashes in to use the
    above.

    - Avoids 64-bit multiplies in hash_64() on 32-bit platforms. Two
    32-bit multiplies will do well enough.

    - Rids the world of the bad hash multipliers in hash_32.

    This finishes the job started in commit 689de1d6ca95 ("Minimal
    fix-up of bad hashing behavior of hash_64()")

    The vast majority of Linux architectures have hardware support for
    32x32-bit multiply and so derive no benefit from "simplified"
    multipliers.

    The few processors that do not (68000, h8/300 and some models of
    Microblaze) have arch-specific implementations added. Those
    patches are last in the series.

    - Overhauls the dcache hash mixing.

    The patch in commit 0fed3ac866ea ("namei: Improve hash mixing if
    CONFIG_DCACHE_WORD_ACCESS") was an off-the-cuff suggestion.
    Replaced with a much more careful design that's simultaneously
    faster and better. (My own invention, as there was nothing suitable
    in the literature I could find. Comments welcome!)

    - Modify the hash_name() loop to skip the initial HASH_MIX(). This
    would let us salt the hash if we ever wanted to.

    - Sort out partial_name_hash().

    The hash function is declared as using a long state, even though
    it's truncated to 32 bits at the end and the extra internal state
    contributes nothing to the result. And some callers do odd things:

    - fs/hfs/string.c only allocates 32 bits of state
    - fs/hfsplus/unicode.c uses it to hash 16-bit unicode symbols not bytes

    - Modify bytemask_from_count to handle inputs of 1..sizeof(long)
    rather than 0..sizeof(long)-1. This would simplify users other
    than full_name_hash"

    Special thanks to Bruce Fields for testing and finding bugs in v1. (I
    learned some humbling lessons about "obviously correct" code.)

    On the arch-specific front, the m68k assembly has been tested in a
    standalone test harness, I've been in contact with the Microblaze
    maintainers who mostly don't care, as the hardware multiplier is never
    omitted in real-world applications, and I haven't heard anything from
    the H8/300 world"

    * 'hash' of git://ftp.sciencehorizons.net/linux:
    h8300: Add
    microblaze: Add
    m68k: Add
    : Add support for architecture-specific functions
    fs/namei.c: Improve dcache hash function
    Eliminate bad hash multipliers from hash_32() and hash_64()
    Change hash_64() return value to 32 bits
    : Define hash_str() in terms of hashlen_string()
    fs/namei.c: Add hashlen_string() function
    Pull out string hash to

    Linus Torvalds
     
  • This is just the infrastructure; there are no users yet.

    This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares
    the existence of .

    That file may define its own versions of various functions, and define
    HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones.

    Included is a self-test (in lib/test_hash.c) that verifies the basics.
    It is NOT in general required that the arch-specific functions compute
    the same thing as the generic, but if a HAVE_* symbol is defined with
    the value 1, then equality is tested.

    Signed-off-by: George Spelvin
    Cc: Geert Uytterhoeven
    Cc: Greg Ungerer
    Cc: Andreas Schwab
    Cc: Philippe De Muyter
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: Alistair Francis
    Cc: Michal Simek
    Cc: Yoshinori Sato
    Cc: uclinux-h8-devel@lists.sourceforge.jp

    George Spelvin
     
  • Patch 0fed3ac866 improved the hash mixing, but the function is slower
    than necessary; there's a 7-instruction dependency chain (10 on x86)
    each loop iteration.

    Word-at-a-time access is a very tight loop (which is good, because
    link_path_walk() is one of the hottest code paths in the entire kernel),
    and the hash mixing function must not have a longer latency to avoid
    slowing it down.

    There do not appear to be any published fast hash functions that:
    1) Operate on the input a word at a time, and
    2) Don't need to know the length of the input beforehand, and
    3) Have a single iterated mixing function, not needing conditional
    branches or unrolling to distinguish different loop iterations.

    One of the algorithms which comes closest is Yann Collet's xxHash, but
    that's two dependent multiplies per word, which is too much.

    The key insights in this design are:

    1) Barring expensive ops like multiplies, to diffuse one input bit
    across 64 bits of hash state takes at least log2(64) = 6 sequentially
    dependent instructions. That is more cycles than we'd like.
    2) An operation like "hash ^= hash << 13" requires a second temporary
    register anyway, and on a 2-operand machine like x86, it's three
    instructions.
    3) A better use of a second register is to hold a two-word hash state.
    With careful design, no temporaries are needed at all, so it doesn't
    increase register pressure. And this gets rid of register copying
    on 2-operand machines, so the code is smaller and faster.
    4) Using two words of state weakens the requirement for one-round mixing;
    we now have two rounds of mixing before cancellation is possible.
    5) A two-word hash state also allows operations on both halves to be
    done in parallel, so on a superscalar processor we get more mixing
    in fewer cycles.

    I ended up using a mixing function inspired by the ChaCha and Speck
    round functions. It is 6 simple instructions and 3 cycles per iteration
    (assuming multiply by 9 can be done by an "lea" instruction):

    x ^= *input++;
    y ^= x; x = ROL(x, K1);
    x += y; y = ROL(y, K2);
    y *= 9;

    Not only is this reversible, two consecutive rounds are reversible:
    if you are given the initial and final states, but not the intermediate
    state, it is possible to compute both input words. This means that at
    least 3 words of input are required to create a collision.

    (It also has the property, used by hash_name() to avoid a branch, that
    it hashes all-zero to all-zero.)

    The rotate constants K1 and K2 were found by experiment. The search took
    a sample of random initial states (I used 1023) and considered the effect
    of flipping each of the 64 input bits on each of the 128 output bits two
    rounds later. Each of the 8192 pairs can be considered a biased coin, and
    adding up the Shannon entropy of all of them produces a score.

    The best-scoring shifts also did well in other tests (flipping bits in y,
    trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
    so the choice was made with the additional constraint that the sum of the
    shifts is odd and not too close to the word size.

    The final state is then folded into a 32-bit hash value by a less carefully
    optimized multiply-based scheme. This also has to be fast, as pathname
    components tend to be short (the most common case is one iteration!), but
    there's some room for latency, as there is a fair bit of intervening logic
    before the hash value is used for anything.

    (Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
    a better benchmark; the numbers seem to show a slight dip in performance
    between 4.6.0 and this patch, but they're too noisy to quote.)

    Special thanks to Bruce Fields for diligent testing which uncovered a
    nasty fencepost error in an earlier version of this patch.

    [checkpatch.pl formatting complaints noted and respectfully disagreed with.]

    Signed-off-by: George Spelvin
    Tested-by: J. Bruce Fields

    George Spelvin
     
  • We'd like to make more use of the highly-optimized dcache hash functions
    throughout the kernel, rather than have every subsystem create its own,
    and a function that hashes basic null-terminated strings is required
    for that.

    (The name is to emphasize that it returns both hash and length.)

    It's actually useful in the dcache itself, specifically d_alloc_name().
    Other uses in the next patch.

    full_name_hash() is also tweaked to make it more generally useful:
    1) Take a "char *" rather than "unsigned char *" argument, to
    be consistent with hash_name().
    2) Handle zero-length inputs. If we want more callers, we don't want
    to make them worry about corner cases.
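
    As a rough illustration of "returns both hash and length", here is a
    minimal userspace sketch. All `toy_` names are hypothetical, the
    packing (length in the high 32 bits, hash in the low 32) is an
    assumption of this sketch, and the hash is plain FNV-1a standing in
    for the dcache hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a 32-bit hash and a 32-bit length into one 64-bit value. */
static inline uint64_t toy_hashlen_create(uint32_t hash, uint32_t len)
{
	return ((uint64_t)len << 32) | hash;
}

/* Extract the hash (low 32 bits). */
static inline uint32_t toy_hashlen_hash(uint64_t hl)
{
	return (uint32_t)hl;
}

/* Extract the length (high 32 bits). */
static inline uint32_t toy_hashlen_len(uint64_t hl)
{
	return (uint32_t)(hl >> 32);
}

/*
 * Hash a NUL-terminated string and return hash and length together.
 * FNV-1a is used only as a stand-in hash. A zero-length string is
 * handled without any special case: the loop simply does not run.
 */
uint64_t toy_hashlen_string(const char *s)
{
	uint32_t hash = 2166136261u;	/* FNV-1a 32-bit offset basis */
	uint32_t len = 0;

	assert(s != NULL);
	while (s[len] != '\0') {
		hash ^= (unsigned char)s[len];
		hash *= 16777619u;	/* FNV-1a 32-bit prime */
		len++;
	}
	return toy_hashlen_create(hash, len);
}
```

    Returning both values from one pass means the caller does not need a
    separate strlen() over the same bytes.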

    Signed-off-by: George Spelvin

    George Spelvin
     

27 May, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "A pretty average collection of fixes, cleanups and improvements in
    this request.

    Summary:
    - fixes for mount line parsing, sparse warnings, read-only compat
    feature remount behaviour
    - allow fast path symlink lookups for inline symlinks.
    - attribute listing cleanups
    - writeback goes direct to bios rather than indirecting through
    bufferheads
    - transaction allocation cleanup
    - optimised kmem_realloc
    - added configurable error handling for metadata write errors,
    changed default error handling behaviour from "retry forever" to
    "retry until unmount then fail"
    - fixed several inode cluster writeback lookup vs reclaim race
    conditions
    - fixed inode cluster writeback checking wrong inode after lookup
    - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
    - cleaned up inode reclaim tagging"

    * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
    xfs: fix warning in xfs_finish_page_writeback for non-debug builds
    xfs: move reclaim tagging functions
    xfs: simplify inode reclaim tagging interfaces
    xfs: rename variables in xfs_iflush_cluster for clarity
    xfs: xfs_iflush_cluster has range issues
    xfs: mark reclaimed inodes invalid earlier
    xfs: xfs_inode_free() isn't RCU safe
    xfs: optimise xfs_iext_destroy
    xfs: skip stale inodes in xfs_iflush_cluster
    xfs: fix inode validity check in xfs_iflush_cluster
    xfs: xfs_iflush_cluster fails to abort on error
    xfs: remove xfs_fs_evict_inode()
    xfs: add "fail at unmount" error handling configuration
    xfs: add configuration handlers for specific errors
    xfs: add configuration of error failure speed
    xfs: introduce table-based init for error behaviors
    xfs: add configurable error support to metadata buffers
    xfs: introduce metadata IO error class
    xfs: configurable error behavior via sysfs
    xfs: buffer ->bi_end_io function requires irq-safe lock
    ...

    Linus Torvalds
     

20 May, 2016

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
    of modules and firmware to be loaded from a specific device (this
    is from ChromeOS, where the device as a whole is verified
    cryptographically via dm-verity).

    This is disabled by default but can be configured to be enabled by
    default (don't do this if you don't know what you're doing).

    - Keys: allow authentication data to be stored in an asymmetric key.
    Lots of general fixes and updates.

    - SELinux: add restrictions for loading of kernel modules via
    finit_module(). Distinguish non-init user namespace capability
    checks. Apply execstack check on thread stacks"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
    LSM: LoadPin: provide enablement CONFIG
    Yama: use atomic allocations when reporting
    seccomp: Fix comment typo
    ima: add support for creating files using the mknodat syscall
    ima: fix ima_inode_post_setattr
    vfs: forbid write access when reading a file into memory
    fs: fix over-zealous use of "const"
    selinux: apply execstack check on thread stacks
    selinux: distinguish non-init user namespace capability checks
    LSM: LoadPin for kernel file loading restrictions
    fs: define a string representation of the kernel_read_file_id enumeration
    Yama: consolidate error reporting
    string_helpers: add kstrdup_quotable_file
    string_helpers: add kstrdup_quotable_cmdline
    string_helpers: add kstrdup_quotable
    selinux: check ss_initialized before revalidating an inode label
    selinux: delay inode label lookup as long as possible
    selinux: don't revalidate an inode's label when explicitly setting it
    selinux: Change bool variable name to index.
    KEYS: Add KEYCTL_DH_COMPUTE command
    ...

    Linus Torvalds
     

18 May, 2016

2 commits

  • Pull 'struct path' constification update from Al Viro:
    "'struct path' is passed by reference to a bunch of Linux security
    methods; in theory, there's nothing to stop them from modifying the
    damn thing and LSM community being what it is, sooner or later some
    enterprising soul is going to decide that it's a good idea.

    Let's remove the temptation and constify all of those..."
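
    What constification buys can be shown with a small sketch. The types
    and the hook name below are hypothetical stand-ins, not the real LSM
    interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for 'struct path'; the real one holds a vfsmount and a dentry. */
struct toy_path {
	const void *mnt;
	const void *dentry;
};

/*
 * A hook that takes a pointer to const can inspect the path but cannot
 * change it. Returns 0 if the path has a dentry, -1 otherwise.
 */
int toy_security_path_check(const struct toy_path *path)
{
	assert(path != NULL);
	/*
	 * path->mnt = NULL;
	 * Uncommenting the line above makes the compiler reject the
	 * function, because *path is read-only here.
	 */
	return path->dentry != NULL ? 0 : -1;
}
```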

    * 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    constify ima_d_path()
    constify security_sb_pivotroot()
    constify security_path_chroot()
    constify security_path_{link,rename}
    apparmor: remove useless checks for NULL ->mnt
    constify security_path_{mkdir,mknod,symlink}
    constify security_path_{unlink,rmdir}
    apparmor: constify common_perm_...()
    apparmor: constify aa_path_link()
    apparmor: new helper - common_path_perm()
    constify chmod_common/security_path_chmod
    constify security_sb_mount()
    constify chown_common/security_path_chown
    tomoyo: constify assorted struct path *
    apparmor_path_truncate(): path->mnt is never NULL
    constify vfs_truncate()
    constify security_path_truncate()
    [apparmor] constify struct path * in a bunch of helpers

    Linus Torvalds
     
  • Pull parallel filesystem directory handling update from Al Viro.

    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.

    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by an "in-progress" dentry marker.
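
    The marker idea can be modelled in a few lines. This is a
    single-threaded toy, not the dcache code: every `toy_` name is
    hypothetical, and "must wait" is just a return value where the
    kernel would actually sleep.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS	8
#define TOY_NAME_MAX	32

enum toy_lookup_state { TOY_FREE = 0, TOY_IN_LOOKUP, TOY_DONE };
enum toy_lookup_result { TOY_STARTED, TOY_MUST_WAIT, TOY_FOUND, TOY_FULL };

struct toy_dentry {
	char name[TOY_NAME_MAX];
	enum toy_lookup_state state;
};

/* A zero-initialised toy_dir has every slot free. */
struct toy_dir {
	struct toy_dentry slot[TOY_SLOTS];
};

/*
 * Begin a lookup of name. If another lookup of the same name is still in
 * progress the caller must wait; a different name can start right away.
 */
enum toy_lookup_result toy_lookup_begin(struct toy_dir *dir, const char *name)
{
	struct toy_dentry *free_slot = NULL;
	int i;

	assert(strlen(name) < TOY_NAME_MAX);
	for (i = 0; i < TOY_SLOTS; i++) {
		struct toy_dentry *d = &dir->slot[i];

		if (d->state == TOY_FREE) {
			if (!free_slot)
				free_slot = d;
			continue;
		}
		if (strcmp(d->name, name) == 0)
			return d->state == TOY_IN_LOOKUP ? TOY_MUST_WAIT
							 : TOY_FOUND;
	}
	if (!free_slot)
		return TOY_FULL;
	strcpy(free_slot->name, name);
	free_slot->state = TOY_IN_LOOKUP;	/* set the in-progress marker */
	return TOY_STARTED;
}

/* Finish a lookup: clear the marker so later lookups see the result. */
void toy_lookup_done(struct toy_dir *dir, const char *name)
{
	int i;

	for (i = 0; i < TOY_SLOTS; i++) {
		struct toy_dentry *d = &dir->slot[i];

		if (d->state == TOY_IN_LOOKUP && strcmp(d->name, name) == 0)
			d->state = TOY_DONE;
	}
}
```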

    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).
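
    The gradual switch-over can be pictured as a dispatch on which method
    a filesystem provides. A minimal sketch, assuming a simplified
    operations table: the `toy_` types are hypothetical and no real
    locking is done here, only the choice of lock mode.

```c
#include <assert.h>
#include <stddef.h>

enum toy_lock_mode { TOY_LOCK_NONE, TOY_LOCK_SHARED, TOY_LOCK_EXCLUSIVE };

/* Simplified stand-in for a filesystem's directory operations. */
struct toy_file_ops {
	int (*iterate)(void *ctx);		/* old method: exclusive lock */
	int (*iterate_shared)(void *ctx);	/* new method: shared lock is enough */
};

/* A do-nothing readdir implementation, used only to fill in the table. */
int toy_noop_iterate(void *ctx)
{
	(void)ctx;
	return 0;
}

/*
 * Decide how the directory has to be locked for a readdir call.
 * Filesystems already converted get the shared (read) lock; the rest
 * keep the exclusive (write) lock until they are fixed or confirmed.
 */
enum toy_lock_mode toy_readdir_lock_mode(const struct toy_file_ops *ops)
{
	assert(ops != NULL);
	if (ops->iterate_shared)
		return TOY_LOCK_SHARED;
	if (ops->iterate)
		return TOY_LOCK_EXCLUSIVE;
	return TOY_LOCK_NONE;	/* no readdir method at all */
}
```

    Keeping both methods side by side is what allows filesystems to be
    converted one at a time rather than all in one patch.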

    A more detailed explanation of the process from Al Viro:
    "The xattr work starts with some acl fixes, then switches ->getxattr to
    passing inode and dentry separately. This is the point where the
    things start to get tricky - that got merged into the very beginning
    of the -rc3-based #work.lookups, to allow untangling the
    security_d_instantiate() mess. The xattr work itself proceeds to
    switch a lot of filesystems to generic_...xattr(); no complications
    there.

    After that initial xattr work, the series then does the following:

    - untangle security_d_instantiate()

    - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
    that thing; one such place (in overlayfs) actually yields a trivial
    conflict with overlayfs fixes later in the cycle - overlayfs ended
    up switching to a variant of lookup_one_len_unlocked() sans the
    permission checks. I would've dropped that commit (it gets
    overridden on merge from #ovl-fixes in #for-next; proper resolution
    is to use the variant in mainline fs/overlayfs/super.c), but I
    didn't want to rebase the damn thing - it was fairly late in the
    cycle...

    - some filesystems had managed to depend on lookup/lookup exclusion
    for *fs-internal* data structures in a way that would break if we
    relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

    - core of that series - parallel lookup machinery, replacing
    ->i_mutex with rwsem, making lookup_slow() take it only shared. At
    that point lookups happen in parallel; lookups on the same name
    wait for the in-progress one to be done with that dentry.

    Surprisingly little code, at that - almost all of it is in
    fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
    making it use the new primitive and actually switching to locking
    shared.

    - parallel readdir stuff - first of all, we provide the exclusion on
    per-struct file basis, same as we do for read() vs lseek() for
    regular files. That takes care of most of the needed exclusion in
    readdir/readdir; however, these guys are trickier than lookups, so
    I went for switching them one-by-one. To do that, a new method
    '->iterate_shared()' is added and filesystems are switched to it
    as they are either confirmed to be OK with shared lock on directory
    or fixed to be OK with that. I hope to kill the original method
    come next cycle (almost all in-tree filesystems are switched
    already), but it's still not quite finished.

    - several filesystems get switched to parallel readdir. The
    interesting part here is dealing with dcache preseeding by readdir;
    that needs minor adjustment to be safe with directory locked only
    shared.

    Most of the filesystems doing that got switched over in those
    commits. Important exception: NFS. Turns out that NFS folks, with
    their, er, insistence on VFS getting the fuck out of the way of the
    Smart Filesystem Code That Knows How And What To Lock(tm) have
    grown the locking of their own. They had their own homegrown
    rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
    is the reader there). Of course, with VFS getting the fuck out of
    the way, as requested, the actual smarts of the smart filesystem
    code etc. had become exposed...

    - do_last/lookup_open/atomic_open cleanups. As the result, open()
    without O_CREAT locks the directory only shared. Including the
    ->atomic_open() case. Backmerge from #for-linus in the middle of
    that - atomic_open() fix got brought in.

    - then comes NFS switch to saner (VFS-based ;-) locking, killing the
    homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
    exclusion for sillyunlink/lookup is done by the parallel lookups
    mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
    now - rmdir being the writer.

    Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
    now.

    - the rest of the series consists of switching a lot of filesystems
    to parallel readdir; in a lot of cases ->llseek() gets simplified
    as well. One backmerge in there (again, #for-linus - rockridge
    fix)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
    ext4: switch to ->iterate_shared()
    hfs: switch to ->iterate_shared()
    hfsplus: switch to ->iterate_shared()
    hostfs: switch to ->iterate_shared()
    hpfs: switch to ->iterate_shared()
    hpfs: handle allocation failures in hpfs_add_pos()
    gfs2: switch to ->iterate_shared()
    f2fs: switch to ->iterate_shared()
    afs: switch to ->iterate_shared()
    befs: switch to ->iterate_shared()
    befs: constify stuff a bit
    isofs: switch to ->iterate_shared()
    get_acorn_filename(): deobfuscate a bit
    btrfs: switch to ->iterate_shared()
    logfs: no need to lock directory in lseek
    switch ecryptfs to ->iterate_shared
    9p: switch to ->iterate_shared()
    fat: switch to ->iterate_shared()
    romfs, squashfs: switch to ->iterate_shared()
    more trivial ->iterate_shared conversions
    ...

    Linus Torvalds
     

17 May, 2016

2 commits


11 May, 2016

3 commits

  • Al Viro
     
  • Overlayfs needs lookup without inode_permission() and already has the name
    hash (in the form of dentry->d_name on the overlayfs dentry). It also doesn't
    support filesystems with d_op->d_hash(), so basically it only needs
    the actual hashed lookup from lookup_one_len_unlocked().

    So add a new helper that does unlocked lookup of a hashed name.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • If a file is renamed to a hardlink of itself, POSIX specifies that rename(2)
    should do nothing and return success.

    This condition is checked in vfs_rename(). However, it won't detect hard
    links on overlayfs, where these are given separate inodes on the overlayfs
    layer.

    Overlayfs itself detects this condition and returns success without doing
    anything, but then vfs_rename() will proceed as if this was a successful
    rename (detach_mounts(), d_move()).

    The correct thing to do is to detect this condition before even calling
    into overlayfs. This patch does this by calling vfs_select_inode() to get
    the underlying inodes.
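
    The shape of that check, as a toy model rather than the actual
    patch: `toy_select_inode()` is a hypothetical stand-in for looking
    up the underlying inode, and the `moved` flag stands in for the
    side effects of a real rename.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
	unsigned long ino;
};

/* An overlay-style dentry: its own inode plus the underlying one. */
struct toy_ovl_dentry {
	struct toy_inode *inode;	/* inode on the overlay layer */
	struct toy_inode *real;		/* underlying inode, or NULL if none */
	int moved;			/* set when a real rename happened */
};

/* Pick the underlying inode when there is one, else the dentry's own. */
static inline struct toy_inode *toy_select_inode(const struct toy_ovl_dentry *d)
{
	return d->real ? d->real : d->inode;
}

/*
 * Rename source onto target. If both resolve to the same underlying
 * inode they are hard links to one file, so succeed without doing
 * anything, before the filesystem is ever called.
 */
int toy_vfs_rename(struct toy_ovl_dentry *source, struct toy_ovl_dentry *target)
{
	assert(source != NULL && target != NULL);
	if (toy_select_inode(source) == toy_select_inode(target))
		return 0;
	source->moved = 1;	/* stands in for detach_mounts(), d_move() */
	return 0;
}
```

    Comparing only the overlay-layer `inode` pointers would miss the
    hard-link case here, because each overlay dentry has its own inode
    even when both share one underlying file.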

    Signed-off-by: Miklos Szeredi
    Cc: # v4.2+

    Miklos Szeredi
     

03 May, 2016

2 commits