27 Oct, 2016

1 commit


15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

11 Oct, 2016

3 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check

    Linus Torvalds
     

08 Oct, 2016

2 commits


07 Oct, 2016

1 commit


28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     

27 Sep, 2016

2 commits

  • Generated patch:

    sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
    sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Converting the remaining filesystems to .rename2 is trivial to do:

    - add flags argument to foo_rename()
    - check if flags is zero
    - assign foo_rename() to .rename2 instead of .rename

    This doesn't mean it's impossible to support RENAME_NOREPLACE for these
    filesystems, but it is not trivial, like for local filesystems.
    RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
    for a file to be created on one host while it is overwritten by rename on
    another host).

    Filesystems converted:

    9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.

    After this, we can get rid of the duplicate interfaces for rename.

    Signed-off-by: Miklos Szeredi
    Acked-by: Greg Kroah-Hartman
    Acked-by: David Howells [AFS]
    Acked-by: Mike Marshall
    Cc: Eric Van Hensbergen
    Cc: Ilya Dryomov
    Cc: Jan Harkes
    Cc: Tyler Hicks
    Cc: Oleg Drokin
    Cc: Trond Myklebust
    Cc: Mark Fasheh

    Miklos Szeredi
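
The conversion steps listed in this commit amount to a handler that accepts a flags argument and refuses anything it cannot guarantee. A minimal Python sketch of that shape follows. The name `foo_rename`, the dict standing in for a directory, and the choice of `EINVAL` as the error are illustrative assumptions, not code taken from the patch.

```python
import errno

# Flag value as exposed to userspace by renameat2(2).
RENAME_NOREPLACE = 1 << 0


def foo_rename(directory, old_name, new_name, flags=0):
    """Hypothetical rename handler for a filesystem without flag support.

    `directory` is a plain dict standing in for a directory. The function
    returns 0 on success or a negative errno, mimicking kernel convention.
    """
    # Refuse any flag whose semantics we cannot guarantee.
    if flags != 0:
        return -errno.EINVAL
    if old_name not in directory:
        return -errno.ENOENT
    # Plain rename semantics: silently replace an existing target.
    directory[new_name] = directory.pop(old_name)
    return 0
```

Rejecting the flag outright, rather than ignoring it, means a caller that asked for RENAME_NOREPLACE never gets a silent overwrite.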
     

22 Sep, 2016

1 commit

  • inode_change_ok() will be responsible for clearing capabilities and IMA
    extended attributes and as such will need dentry. Give it as an argument
    to inode_change_ok() instead of an inode. Also rename inode_change_ok()
    to setattr_prepare() to better reflect that it also does some
    modifications in addition to checks.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

31 Aug, 2016

1 commit

  • kernfs_notify_workfn() sends out file modified events for the
    scheduled kernfs_nodes. Because the modifications aren't from
    userland, it doesn't have the matching file struct at hand and can't
    use fsnotify_modify(). Instead, it looked up the inode and then used
    d_find_any_alias() to find the dentry and used fsnotify_parent() and
    fsnotify() directly to generate notifications.

    The assumption was that the relevant dentries would have been pinned
    if there are listeners, which isn't true as inotify doesn't pin
    dentries at all and watching the parent doesn't pin the child dentries
    even for dnotify. This led to, for example, inotify watchers not
    getting notifications if the system is under memory pressure and the
    matching dentries got reclaimed. It can also be triggered through
    /proc/sys/vm/drop_caches or a remount attempt which involves shrinking
    dcache.

    fsnotify_parent() only uses the dentry to access the parent inode,
    which kernfs can do easily. Update kernfs_notify_workfn() so that it
    uses fsnotify() directly for both the parent and target inodes without
    going through d_find_any_alias(). While at it, supply the target file
    name to fsnotify() from kernfs_node->name.

    Signed-off-by: Tejun Heo
    Reported-by: Evgeny Vereshchagin
    Fixes: d911d9874801 ("kernfs: make kernfs_notify() trigger inotify events too")
    Cc: John McCutchan
    Cc: Robert Love
    Cc: Eric Paris
    Cc: stable@vger.kernel.org # v3.16+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

10 Aug, 2016

2 commits

  • kernfs_path_len() doesn't have any in-kernel user and the same result
    can be obtained from kernfs_path(@kn, NULL, 0). Remove it.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Cc: Serge Hallyn

    Tejun Heo
     
  • kernfs_path*() functions always return the length of the full path but
    the path content is undefined if the length is larger than the
    provided buffer. This makes its behavior different from strlcpy() and
    requires error handling in all its users even when they don't care
    about truncation. In addition, the implementation can actually be
    simplified by making it behave properly in strlcpy() style.

    * Update kernfs_path_from_node_locked() to always fill up the buffer
    with path. If the buffer is not large enough, the output is
    truncated and terminated.

    * kernfs_path() no longer needs error handling. Make it a simple
    inline wrapper around kernfs_path_from_node().

    * sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
    Updated accordingly.

    * cgroup_path()'s use of kernfs_path() updated to retain the old
    behavior.

    Signed-off-by: Tejun Heo
    Acked-by: Greg Kroah-Hartman
    Acked-by: Serge Hallyn

    Tejun Heo
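
The strlcpy()-style contract described in this commit can be sketched in a few lines of Python. This is only an illustration of the semantics (always fill the buffer, truncate if needed, always report the full length). `path_copy` and `was_truncated` are hypothetical helpers, not kernfs functions, and a Python string stands in for the C buffer.

```python
def path_copy(path, buflen):
    """Return (buffer, full_length) in the style of strlcpy().

    `buffer` holds at most buflen - 1 characters of `path` (one slot is
    reserved for the terminator a C implementation would write), and
    `full_length` is always the length of the untruncated path.
    """
    full_length = len(path)
    if buflen <= 0:
        # No buffer at all: only the required length is reported.
        return "", full_length
    return path[: buflen - 1], full_length


def was_truncated(full_length, buflen):
    """The path was cut short if it did not fit along with its terminator."""
    return full_length >= buflen
```

Callers that do not care about truncation can ignore the returned length. Callers that do care compare it against the buffer size. Passing a zero-sized buffer gives a pure length query, which is the use the removed kernfs_path_len() served.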
     

30 Jul, 2016

1 commit

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounters credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and for putting up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
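
To make the "ids that do not map" case from this pull message concrete, here is a small Python sketch of an extent-based id translation. It is a simplified model under stated assumptions: the tuple layout follows the three columns of a uid_map line, `INVALID_ID` is a stand-in for INVALID_UID/INVALID_GID, and `inode_is_writable` is a hypothetical helper that only captures the "reads only" rule stated in the message.

```python
INVALID_ID = -1  # stand-in for the kernel's INVALID_UID / INVALID_GID


def map_id_down(extents, ns_id):
    """Translate an id as seen inside a user namespace to a global id.

    `extents` is a list of (first, lower_first, count) tuples, in the
    spirit of the lines of a uid_map file. Returns INVALID_ID when no
    extent covers `ns_id`.
    """
    for first, lower_first, count in extents:
        if first <= ns_id < first + count:
            return lower_first + (ns_id - first)
    return INVALID_ID


def inode_is_writable(kuid, kgid):
    """Inodes whose owner or group is unmapped are treated as read-only."""
    return kuid != INVALID_ID and kgid != INVALID_ID
```

With a single extent `(0, 100000, 65536)`, id 1000 inside the namespace maps to 101000 outside it, while id 65536 has no mapping and is reported as invalid.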
     

24 Jun, 2016

3 commits

  • Introduce a function may_open_dev that tests MNT_NODEV and a new
    superblock flag SB_I_NODEV. Use this new function in all of the
    places where MNT_NODEV was previously tested.

    Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
    filesystems should never support device nodes, and a simple superblock
    flag makes that very hard to get wrong. With SB_I_NODEV set, if any
    device nodes somehow manage to show up on a filesystem, those
    device nodes will be unopenable.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
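
The check this commit describes combines a per-mount flag with a per-superblock flag. A minimal Python sketch of that logic follows. The bit values are illustrative assumptions, and the function takes two plain integers rather than the kernel's path structure.

```python
# Illustrative bit values; treat them as assumptions rather than ABI.
MNT_NODEV = 0x02   # per-mount flag
SB_I_NODEV = 0x04  # per-superblock internal flag


def may_open_dev(mnt_flags, sb_iflags):
    """Device nodes may be opened only if neither level forbids it."""
    return not (mnt_flags & MNT_NODEV) and not (sb_iflags & SB_I_NODEV)
```

Because the superblock flag is checked independently of the mount flags, a filesystem marked with SB_I_NODEV stays protected however an individual mount was configured.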
     
  • The cgroup filesystem is in the same boat as sysfs. No one ever
    permits executables of any kind on the cgroup filesystem, and there is
    no reasonable case for supporting them in the future.

    Therefore move the setting of SB_I_NOEXEC, which makes the code proof
    against future mistakes of accidentally creating executables, from
    sysfs to kernfs itself. This makes the code simpler and covers the
    sysfs, cgroup, and cgroup2 filesystems.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Start marking filesystems with a user namespace owner, s_user_ns. In
    this change this is only used for permission checks of who may mount a
    filesystem. Ultimately s_user_ns will be used for translating ids and
    checking capabilities for filesystems mounted from user namespaces.

    The default policy for setting s_user_ns is implemented in sget(),
    which sets s_user_ns to current_user_ns() and ensures that the
    mounter of the filesystem has CAP_SYS_ADMIN in that user_ns.

    The guts of sget are split out into another function sget_userns().
    The function sget_userns calls alloc_super with the specified user
    namespace or it verifies that the existing superblock that was found
    has the expected user namespace, and fails with EBUSY when it does
    not. This failure prevents users with the wrong privileges from
    mounting a filesystem.

    The reason for the split of sget_userns from sget is that in some
    cases such as mount_ns and kernfs_mount_ns a different policy for
    permission checking of mounts and setting s_user_ns is necessary, and
    the existence of sget_userns() allows those policies to be
    implemented.

    The helper mount_ns is expected to be used for filesystems such as
    proc and mqueuefs which present per namespace information. The
    function mount_ns is modified to call sget_userns instead of sget to
    ensure the user namespace owner of the namespace whose information is
    presented by the filesystem is used on the superblock.

    For sysfs and cgroup the appropriate permission checks are already in
    place, and kernfs_mount_ns is modified to call sget_userns so that
    the init_user_ns is the only user namespace used.

    For the cgroup filesystem cgroup namespace mounts are bind mounts of a
    subset of the full cgroup filesystem and as such s_user_ns must be the
    same for all of them as there is only a single superblock.

    Mounts of sysfs that vary based on the network namespace could in principle
    change s_user_ns but it keeps the analysis and implementation of kernfs
    simpler if that is not supported, and at present there appear to be no
    benefits from supporting a different s_user_ns on any sysfs mount.

    Getting the details of setting s_user_ns correct has been
    a long process. Thanks to Pavel Tikhomirov who spotted a leak
    in sget_userns. Thanks to Seth Forshee who has kept the work alive.

    Thanks-to: Seth Forshee
    Thanks-to: Pavel Tikhomirov
    Acked-by: Seth Forshee
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
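
The split between sget() and sget_userns() described in this commit can be modelled with a toy registry. The Python sketch below is an assumption-laden illustration: the `SuperBlock` and `SuperBlockTable` classes, the string keys, and the use of EPERM for the capability failure are all hypothetical. Only the EBUSY behaviour and the "default policy lives in sget()" structure come from the commit message.

```python
import errno


class SuperBlock:
    def __init__(self, key, user_ns):
        self.key = key
        self.s_user_ns = user_ns


class SuperBlockTable:
    """Toy registry standing in for the kernel's set of superblocks."""

    def __init__(self):
        self._by_key = {}

    def sget_userns(self, key, user_ns):
        """Return the superblock for `key`, creating it if needed.

        Raises OSError(EBUSY) if a matching superblock already exists
        but is owned by a different user namespace.
        """
        existing = self._by_key.get(key)
        if existing is not None:
            if existing.s_user_ns != user_ns:
                raise OSError(errno.EBUSY, "superblock owned by another user namespace")
            return existing
        sb = SuperBlock(key, user_ns)
        self._by_key[key] = sb
        return sb

    def sget(self, key, current_user_ns, has_cap_sys_admin):
        """Default policy: the mounter's namespace owns the superblock."""
        if not has_cap_sys_admin:
            raise PermissionError(errno.EPERM, "CAP_SYS_ADMIN required")
        return self.sget_userns(key, current_user_ns)
```

Callers that need a different ownership policy, as mount_ns and kernfs_mount_ns do in the commit, would call `sget_userns` directly with the namespace they want instead of going through `sget`.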
     

11 Jun, 2016

1 commit

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
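
Salting early simply means seeding the hash state with the salt before any name bytes are mixed in, instead of combining the two afterwards. A minimal Python sketch follows. The mixing step is a simple multiplicative hash chosen for illustration and should not be read as the kernel's actual function. `salted_name_hash` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def salted_name_hash(salt, name):
    """Hash `name` starting from `salt` instead of from zero.

    `salt` plays the role of the parent pointer. Passing 0 corresponds
    to the "no salt" case.
    """
    h = salt & MASK32  # early salting: the salt seeds the state
    for byte in name.encode():
        h = ((h + (byte << 4) + (byte >> 4)) * 11) & MASK32
    return h
```

In this toy model, `salted_name_hash(id(parent), "foo")` would give the same name a different hash under each parent object, which is the property the lookup path relies on.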
     

28 May, 2016

1 commit

  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we'd hashed the new dentry and attached it to inode, we need ->setxattr()
    instances getting the inode as an explicit argument rather than obtaining
    it from dentry.

    Similar change for ->getxattr() had been done in commit ce23e64. Unlike
    ->getxattr() (which is used by both selinux and smack instances of
    ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
    it got missed back then.

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro
     

21 May, 2016

1 commit

  • Pull driver core updates from Greg KH:
    "Here's the "big" driver core update for 4.7-rc1.

    Mostly just debugfs changes, the long-known and messy races with
    removing debugfs files should be fixed thanks to the great work of
    Nicolai Stange. We also have some isa updates in here (the x86
    maintainers told me to take it through this tree), a new warning when
    we run out of dynamic char major numbers, and a few other assorted
    changes, details in the shortlog.

    All have been in linux-next for some time with no reported issues"

    * tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
    Revert "base: dd: don't remove driver_data in -EPROBE_DEFER case"
    gpio: ws16c48: Utilize the ISA bus driver
    gpio: 104-idio-16: Utilize the ISA bus driver
    gpio: 104-idi-48: Utilize the ISA bus driver
    gpio: 104-dio-48e: Utilize the ISA bus driver
    watchdog: ebc-c384_wdt: Utilize the ISA bus driver
    iio: stx104: Utilize the module_isa_driver and max_num_isa_dev macros
    iio: stx104: Add X86 dependency to STX104 Kconfig option
    Documentation: Add ISA bus driver documentation
    isa: Implement the max_num_isa_dev macro
    isa: Implement the module_isa_driver macro
    pnp: pnpbios: Add explicit X86_32 dependency to PNPBIOS
    isa: Decouple X86_32 dependency from the ISA Kconfig option
    driver-core: use 'dev' argument in dev_dbg_ratelimited stub
    base: dd: don't remove driver_data in -EPROBE_DEFER case
    kernfs: Move faulting copy_user operations outside of the mutex
    devcoredump: add scatterlist support
    debugfs: unproxify files created through debugfs_create_u32_array()
    debugfs: unproxify files created through debugfs_create_blob()
    debugfs: unproxify files created through debugfs_create_bool()
    ...

    Linus Torvalds
     

18 May, 2016

1 commit

  • Pull parallel filesystem directory handling update from Al Viro.

    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.

    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by an "in-progress" dentry marker.

    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).

    A more detailed explanation of the process from Al Viro:
    "The xattr work starts with some acl fixes, then switches ->getxattr to
    passing inode and dentry separately. This is the point where the
    things start to get tricky - that got merged into the very beginning
    of the -rc3-based #work.lookups, to allow untangling the
    security_d_instantiate() mess. The xattr work itself proceeds to
    switch a lot of filesystems to generic_...xattr(); no complications
    there.

    After that initial xattr work, the series then does the following:

    - untangle security_d_instantiate()

    - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
    that thing; one such place (in overlayfs) actually yields a trivial
    conflict with overlayfs fixes later in the cycle - overlayfs ended
    up switching to a variant of lookup_one_len_unlocked() sans the
    permission checks. I would've dropped that commit (it gets
    overridden on merge from #ovl-fixes in #for-next; proper resolution
    is to use the variant in mainline fs/overlayfs/super.c), but I
    didn't want to rebase the damn thing - it was fairly late in the
    cycle...

    - some filesystems had managed to depend on lookup/lookup exclusion
    for *fs-internal* data structures in a way that would break if we
    relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.

    - core of that series - parallel lookup machinery, replacing
    ->i_mutex with rwsem, making lookup_slow() take it only shared. At
    that point lookups happen in parallel; lookups on the same name
    wait for the in-progress one to be done with that dentry.

    Surprisingly little code, at that - almost all of it is in
    fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
    making it use the new primitive and actually switching to locking
    shared.

    - parallel readdir stuff - first of all, we provide the exclusion on
    per-struct file basis, same as we do for read() vs lseek() for
    regular files. That takes care of most of the needed exclusion in
    readdir/readdir; however, these guys are trickier than lookups, so
    I went for switching them one-by-one. To do that, a new method
    '->iterate_shared()' is added and filesystems are switched to it
    as they are either confirmed to be OK with shared lock on directory
    or fixed to be OK with that. I hope to kill the original method
    come next cycle (almost all in-tree filesystems are switched
    already), but it's still not quite finished.

    - several filesystems get switched to parallel readdir. The
    interesting part here is dealing with dcache preseeding by readdir;
    that needs minor adjustment to be safe with directory locked only
    shared.

    Most of the filesystems doing that got switched over in those
    commits. Important exception: NFS. Turns out that NFS folks, with
    their, er, insistence on VFS getting the fuck out of the way of the
    Smart Filesystem Code That Knows How And What To Lock(tm) have
    grown the locking of their own. They had their own homegrown
    rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
    is the reader there). Of course, with VFS getting the fuck out of
    the way, as requested, the actual smarts of the smart filesystem
    code etc. had become exposed...

    - do_last/lookup_open/atomic_open cleanups. As the result, open()
    without O_CREAT locks the directory only shared. Including the
    ->atomic_open() case. Backmerge from #for-linus in the middle of
    that - atomic_open() fix got brought in.

    - then comes NFS switch to saner (VFS-based ;-) locking, killing the
    homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
    exclusion for sillyunlink/lookup is done by the parallel lookups
    mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
    now - rmdir being the writer.

    Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
    now.

    - the rest of the series consists of switching a lot of filesystems
    to parallel readdir; in a lot of cases ->llseek() gets simplified
    as well. One backmerge in there (again, #for-linus - rockridge
    fix)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
    ext4: switch to ->iterate_shared()
    hfs: switch to ->iterate_shared()
    hfsplus: switch to ->iterate_shared()
    hostfs: switch to ->iterate_shared()
    hpfs: switch to ->iterate_shared()
    hpfs: handle allocation failures in hpfs_add_pos()
    gfs2: switch to ->iterate_shared()
    f2fs: switch to ->iterate_shared()
    afs: switch to ->iterate_shared()
    befs: switch to ->iterate_shared()
    befs: constify stuff a bit
    isofs: switch to ->iterate_shared()
    get_acorn_filename(): deobfuscate a bit
    btrfs: switch to ->iterate_shared()
    logfs: no need to lock directory in lseek
    switch ecryptfs to ->iterate_shared
    9p: switch to ->iterate_shared()
    fat: switch to ->iterate_shared()
    romfs, squashfs: switch to ->iterate_shared()
    more trivial ->iterate_shared conversions
    ...

    Linus Torvalds
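
The "in-progress" marker idea from this merge can be illustrated with a small threaded Python model: lookups of different names proceed independently, while a second lookup of the same name waits for the first to finish. This is a sketch under stated assumptions. `ParallelDirectory` is a hypothetical class, a short-lived lock protects only the bookkeeping dictionaries, and the rwsem that replaces the inode mutex is not modelled.

```python
import threading


class ParallelDirectory:
    """Toy directory: same-name lookups wait on an in-progress marker."""

    def __init__(self, slow_lookup):
        self._slow_lookup = slow_lookup  # callable: name -> result
        self._lock = threading.Lock()    # protects the two dicts only
        self._cache = {}                 # name -> finished result
        self._in_progress = {}           # name -> threading.Event

    def lookup(self, name):
        while True:
            with self._lock:
                if name in self._cache:
                    return self._cache[name]
                marker = self._in_progress.get(name)
                owner = marker is None
                if owner:
                    # We are first: publish the marker, then do the work.
                    marker = threading.Event()
                    self._in_progress[name] = marker
            if not owner:
                marker.wait()  # someone else is resolving this name
                continue       # re-check the cache
            try:
                result = self._slow_lookup(name)
                with self._lock:
                    self._cache[name] = result
                return result
            finally:
                with self._lock:
                    del self._in_progress[name]
                marker.set()   # wake every waiter
```

The slow lookup runs outside the bookkeeping lock, so resolving one name never blocks the resolution of another. Waiters re-check the cache after waking, which also covers the case where the owner failed and a waiter has to take over.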
     

12 May, 2016

1 commit


10 May, 2016

1 commit

  • Patch summary:

    When showing a cgroupfs entry in mountinfo, show the path of the mount
    root dentry relative to the reader's cgroup namespace root.

    Short explanation (courtesy of mkerrisk):

    If we create a new cgroup namespace, then we want both /proc/self/cgroup
    and /proc/self/mountinfo to show cgroup paths that are correctly
    virtualized with respect to the cgroup mount point. Previous to this
    patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
    does not.

    Long version:

    When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
    namespace, and then mounts a new instance of the freezer cgroup, the new
    mount will be rooted at /a/b. The root dentry field of the mountinfo
    entry will show '/a/b'.

    cat > /tmp/do1 << EOF
    mount -t cgroup -o freezer freezer /mnt
    grep freezer /proc/self/mountinfo
    EOF

    unshare -Gm bash /tmp/do1
    > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
    > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

    The task's freezer cgroup entry in /proc/self/cgroup will simply show
    '/':

    grep freezer /proc/self/cgroup
    9:freezer:/

    If instead the same task simply bind mounts the /a/b cgroup directory,
    the resulting mountinfo entry will again show /a/b for the dentry root.
    However in this case the task will find its own cgroup at /mnt/a/b,
    not at /mnt:

    mount --bind /sys/fs/cgroup/freezer/a/b /mnt
    130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

    In other words, there is no way for the task to know, based on what is
    in mountinfo, which cgroup directory is its own.

    Example (by mkerrisk):

    First, a little script to save some typing and verbiage:

    echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
    cat /proc/self/mountinfo | grep freezer |
    awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

    Create cgroup, place this shell into the cgroup, and look at the state
    of the /proc files:

    2653
    2653 # Our shell
    14254 # cat(1)
    /proc/self/cgroup: 10:freezer:/a/b
    mountinfo: / /sys/fs/cgroup/freezer

    Create a shell in new cgroup and mount namespaces. The act of creating
    a new cgroup namespace causes the process's current cgroups directories
    to become its cgroup root directories. (Here, I'm using my own version
    of the "unshare" utility, which takes the same options as the util-linux
    version):

    Look at the state of the /proc files:

    /proc/self/cgroup: 10:freezer:/
    mountinfo: / /sys/fs/cgroup/freezer

    The third entry in /proc/self/cgroup (the pathname of the cgroup inside
    the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
    is rooted at /a/b in the outer namespace.

    However, the info in /proc/self/mountinfo is not for this cgroup
    namespace, since we are seeing a duplicate of the mount from the
    old mount namespace, and the info there does not correspond to the
    new cgroup namespace. However, trying to create a new mount still
    doesn't show us the right information in mountinfo:

    # propagating to other mountns
    /proc/self/cgroup: 7:freezer:/
    mountinfo: /a/b /mnt/freezer

    The act of creating a new cgroup namespace caused the process's
    current freezer directory, "/a/b", to become its cgroup freezer root
    directory. In other words, the pathname of the root directory
    within the newly mounted cgroup filesystem should be "/",
    but mountinfo wrongly shows us "/a/b". The consequence of this is
    that the process in the cgroup namespace cannot correctly construct
    the pathname of its cgroup root directory from the information in
    /proc/PID/mountinfo.

    With this patch, the dentry root field in mountinfo is shown relative
    to the reader's cgroup namespace. So the same steps as above:

    /proc/self/cgroup: 10:freezer:/a/b
    mountinfo: / /sys/fs/cgroup/freezer
    /proc/self/cgroup: 10:freezer:/
    mountinfo: /../.. /sys/fs/cgroup/freezer
    /proc/self/cgroup: 10:freezer:/
    mountinfo: / /mnt/freezer

    cgroup.clone_children freezer.parent_freezing freezer.state tasks
    cgroup.procs freezer.self_freezing notify_on_release
    3164
    2653 # First shell that placed in this cgroup
    3164 # Shell started by 'unshare'
    14197 # cat(1)

    Signed-off-by: Serge Hallyn
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Signed-off-by: Tejun Heo

    Serge E. Hallyn
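
The patched behaviour, showing the mount root relative to the reader's cgroup namespace root, can be reproduced with plain path arithmetic. The Python sketch below is only an illustration of the resulting strings. `mountinfo_root` is a hypothetical helper and does not reflect how the kernel computes the value.

```python
import posixpath


def mountinfo_root(mount_root, ns_root):
    """Express `mount_root` relative to the reader's cgroup namespace root.

    Both arguments are absolute paths inside the cgroup hierarchy.
    """
    rel = posixpath.relpath(mount_root, ns_root)
    return "/" if rel == "." else "/" + rel
```

For a reader whose namespace is rooted at /a/b, the original mount of the whole hierarchy comes out as "/../.." and a fresh mount rooted at /a/b comes out as "/", matching the post-patch output quoted in the commit message.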
     

09 May, 2016

1 commit


03 May, 2016

3 commits


01 May, 2016

1 commit

  • A fault in a user provided buffer may lead anywhere, and lockdep warns
    that we have a potential deadlock between the mm->mmap_sem and the
    kernfs file mutex:

    [ 82.811702] ======================================================
    [ 82.811705] [ INFO: possible circular locking dependency detected ]
    [ 82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
    [ 82.811711] -------------------------------------------------------
    [ 82.811714] kms_setmode/5859 is trying to acquire lock:
    [ 82.811717] (&dev->struct_mutex){+.+.+.}, at: [] drm_gem_mmap+0x1a1/0x270
    [ 82.811731]
    but task is already holding lock:
    [ 82.811734] (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
    [ 82.811745]
    which lock already depends on the new lock.

    [ 82.811749]
    the existing dependency chain (in reverse order) is:
    [ 82.811752]
    -> #3 (&mm->mmap_sem){++++++}:
    [ 82.811761] [] lock_acquire+0xc3/0x1d0
    [ 82.811766] [] __might_fault+0x75/0xa0
    [ 82.811771] [] kernfs_fop_write+0x8a/0x180
    [ 82.811787] [] __vfs_write+0x23/0xe0
    [ 82.811792] [] vfs_write+0xa4/0x190
    [ 82.811797] [] SyS_write+0x44/0xb0
    [ 82.811801] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.811807]
    -> #2 (s_active#6){++++.+}:
    [ 82.811814] [] lock_acquire+0xc3/0x1d0
    [ 82.811819] [] __kernfs_remove+0x210/0x2f0
    [ 82.811823] [] kernfs_remove_by_name_ns+0x40/0xa0
    [ 82.811828] [] sysfs_remove_file_ns+0x10/0x20
    [ 82.811832] [] device_del+0x124/0x250
    [ 82.811837] [] device_unregister+0x19/0x60
    [ 82.811841] [] cpu_cache_sysfs_exit+0x51/0xb0
    [ 82.811846] [] cacheinfo_cpu_callback+0x38/0x70
    [ 82.811851] [] notifier_call_chain+0x39/0xa0
    [ 82.811856] [] __raw_notifier_call_chain+0x9/0x10
    [ 82.811860] [] cpu_notify+0x1e/0x40
    [ 82.811865] [] cpu_notify_nofail+0x9/0x20
    [ 82.811869] [] _cpu_down+0x233/0x340
    [ 82.811874] [] disable_nonboot_cpus+0xc9/0x350
    [ 82.811878] [] suspend_devices_and_enter+0x5a1/0xb50
    [ 82.811883] [] pm_suspend+0x543/0x8d0
    [ 82.811888] [] state_store+0x77/0xe0
    [ 82.811892] [] kobj_attr_store+0xf/0x20
    [ 82.811897] [] sysfs_kf_write+0x40/0x50
    [ 82.811902] [] kernfs_fop_write+0x13c/0x180
    [ 82.811906] [] __vfs_write+0x23/0xe0
    [ 82.811910] [] vfs_write+0xa4/0x190
    [ 82.811914] [] SyS_write+0x44/0xb0
    [ 82.811918] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.811923]
    -> #1 (cpu_hotplug.lock){+.+.+.}:
    [ 82.811929] [] lock_acquire+0xc3/0x1d0
    [ 82.811933] [] mutex_lock_nested+0x62/0x3b0
    [ 82.811940] [] get_online_cpus+0x61/0x80
    [ 82.811944] [] stop_machine+0x1b/0xe0
    [ 82.811949] [] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
    [ 82.812009] [] ggtt_bind_vma+0x46/0x70 [i915]
    [ 82.812045] [] i915_vma_bind+0x140/0x290 [i915]
    [ 82.812081] [] i915_gem_object_do_pin+0x899/0xb00 [i915]
    [ 82.812117] [] i915_gem_object_pin+0x35/0x40 [i915]
    [ 82.812154] [] intel_init_pipe_control+0xbe/0x210 [i915]
    [ 82.812192] [] intel_logical_rings_init+0xe2/0xde0 [i915]
    [ 82.812232] [] i915_gem_init+0xf3/0x130 [i915]
    [ 82.812278] [] i915_driver_load+0xf2d/0x1770 [i915]
    [ 82.812318] [] drm_dev_register+0xa4/0xb0
    [ 82.812323] [] drm_get_pci_dev+0xce/0x1e0
    [ 82.812328] [] i915_pci_probe+0x2f/0x50 [i915]
    [ 82.812360] [] pci_device_probe+0x87/0xf0
    [ 82.812366] [] driver_probe_device+0x229/0x450
    [ 82.812371] [] __driver_attach+0x83/0x90
    [ 82.812375] [] bus_for_each_dev+0x61/0xa0
    [ 82.812380] [] driver_attach+0x19/0x20
    [ 82.812384] [] bus_add_driver+0x1ef/0x290
    [ 82.812388] [] driver_register+0x5b/0xe0
    [ 82.812393] [] __pci_register_driver+0x5b/0x60
    [ 82.812398] [] drm_pci_init+0xd6/0x100
    [ 82.812402] [] 0xffffffffa027c094
    [ 82.812406] [] do_one_initcall+0xae/0x1d0
    [ 82.812412] [] do_init_module+0x5b/0x1cb
    [ 82.812417] [] load_module+0x1c20/0x2480
    [ 82.812422] [] SyS_finit_module+0x7e/0xa0
    [ 82.812428] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.812433]
    -> #0 (&dev->struct_mutex){+.+.+.}:
    [ 82.812439] [] __lock_acquire+0x1fc9/0x20f0
    [ 82.812443] [] lock_acquire+0xc3/0x1d0
    [ 82.812456] [] drm_gem_mmap+0x1c7/0x270
    [ 82.812460] [] mmap_region+0x334/0x580
    [ 82.812466] [] do_mmap+0x364/0x410
    [ 82.812470] [] vm_mmap_pgoff+0x6d/0xa0
    [ 82.812474] [] SyS_mmap_pgoff+0x184/0x220
    [ 82.812479] [] SyS_mmap+0x1d/0x20
    [ 82.812484] [] entry_SYSCALL_64_fastpath+0x16/0x73
    [ 82.812489]
    other info that might help us debug this:

    [ 82.812493] Chain exists of:
    &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

    [ 82.812502] Possible unsafe locking scenario:

    [ 82.812506] CPU0 CPU1
    [ 82.812508] ---- ----
    [ 82.812510] lock(&mm->mmap_sem);
    [ 82.812514] lock(s_active#6);
    [ 82.812519] lock(&mm->mmap_sem);
    [ 82.812522] lock(&dev->struct_mutex);
    [ 82.812526]
    *** DEADLOCK ***

    [ 82.812531] 1 lock held by kms_setmode/5859:
    [ 82.812533] #0: (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
    [ 82.812541]
    stack backtrace:
    [ 82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
    [ 82.812550] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
    [ 82.812553] 0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
    [ 82.812560] ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
    [ 82.812566] ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
    [ 82.812573] Call Trace:
    [ 82.812578] [] dump_stack+0x67/0x92
    [ 82.812582] [] print_circular_bug+0x1fc/0x310
    [ 82.812586] [] __lock_acquire+0x1fc9/0x20f0
    [ 82.812590] [] lock_acquire+0xc3/0x1d0
    [ 82.812594] [] ? drm_gem_mmap+0x1a1/0x270
    [ 82.812599] [] drm_gem_mmap+0x1c7/0x270
    [ 82.812603] [] ? drm_gem_mmap+0x1a1/0x270
    [ 82.812608] [] mmap_region+0x334/0x580
    [ 82.812612] [] do_mmap+0x364/0x410
    [ 82.812616] [] vm_mmap_pgoff+0x6d/0xa0
    [ 82.812629] [] SyS_mmap_pgoff+0x184/0x220
    [ 82.812633] [] SyS_mmap+0x1d/0x20
    [ 82.812637] [] entry_SYSCALL_64_fastpath+0x16/0x73

    Highly unlikely though this scenario is, we can avoid the issue entirely
    by moving the copy operation out from under the kernfs_get_active()
    tracking by assigning the preallocated buffer its own mutex. The
    temporary buffer allocation doesn't require mutex locking as it is
    entirely local.
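
    The resulting lock ordering can be sketched in userspace with pthread
    mutexes. This is only an analogy under stated assumptions: struct
    demo_file, demo_write() and demo_record() are hypothetical names,
    memcpy() stands in for the user copy that may fault, and `mutex`
    stands in for the section covered by the active reference. The point
    is that the copy finishes before `mutex` is taken, with the
    preallocated buffer guarded by its own lock.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_LEN 15

struct demo_file {
    pthread_mutex_t mutex;          /* stands in for the active section */
    pthread_mutex_t prealloc_mutex; /* protects prealloc_buf only */
    char *prealloc_buf;             /* DEMO_MAX_LEN + 1 bytes, or NULL */
    char stored[DEMO_MAX_LEN + 1];  /* what the handler last received */
    long (*write)(struct demo_file *f, const char *buf, size_t len);
};

/* Sample handler: remember the data it was given. */
long demo_record(struct demo_file *f, const char *buf, size_t len)
{
    memcpy(f->stored, buf, len + 1);   /* includes the trailing NUL */
    return (long)len;
}

long demo_write(struct demo_file *f, const char *user_buf, size_t count)
{
    char *buf;
    long ret;

    if (count > DEMO_MAX_LEN)
        count = DEMO_MAX_LEN;

    /* Pick the buffer; only the shared preallocated one needs a lock. */
    buf = f->prealloc_buf;
    if (buf) {
        pthread_mutex_lock(&f->prealloc_mutex);
    } else {
        buf = malloc(count + 1);
        if (!buf)
            return -1;
    }

    /* The copy that may fault happens before f->mutex is acquired. */
    memcpy(buf, user_buf, count);
    buf[count] = '\0';

    pthread_mutex_lock(&f->mutex);
    ret = f->write(f, buf, count);
    pthread_mutex_unlock(&f->mutex);

    if (buf == f->prealloc_buf)
        pthread_mutex_unlock(&f->prealloc_mutex);
    else
        free(buf);
    return ret;
}
```

    A temporary buffer from malloc() is private to the caller, so it needs
    no lock at all, which mirrors the remark above about the temporary
    allocation being entirely local.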

    The locked section was extended by the addition of the preallocated buf
    to speed up md user operations in

    commit 2b75869bba676c248d8d25ae6d2bd9221dfffdb6
    Author: NeilBrown
    Date: Mon Oct 13 16:41:28 2014 +1100

    sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

    Reported-by: Ville Syrjälä
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
    Signed-off-by: Chris Wilson
    Reviewed-by: Joonas Lahtinen
    Cc: Ville Syrjälä
    Cc: Joonas Lahtinen
    Cc: NeilBrown
    Acked-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Chris Wilson
     

19 Apr, 2016

1 commit


11 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
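
    Why these rewrites are behavior-preserving can be seen in a small
    standalone sketch. The DEMO_* macros below are stand-ins defined only
    for illustration (a 4 KiB page is an assumption), and the function
    names are hypothetical; they mirror the case where the page cache
    shift equals the page shift.

```c
#include <stddef.h>

#define DEMO_PAGE_SHIFT        12                      /* assumed 4 KiB pages */
#define DEMO_PAGE_SIZE         (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_CACHE_SHIFT  DEMO_PAGE_SHIFT         /* old alias */
#define DEMO_PAGE_CACHE_SIZE   DEMO_PAGE_SIZE          /* old alias */

/* Before: convert a page cache index to a page index. */
unsigned long old_to_page_index(unsigned long index)
{
    /* The shift count is zero, so this is the identity. */
    return index << (DEMO_PAGE_CACHE_SHIFT - DEMO_PAGE_SHIFT);
}

/* After: the shift is dropped. */
unsigned long new_to_page_index(unsigned long index)
{
    return index;
}

/* Before: offset of a file position within its page cache page. */
unsigned long old_offset_in_page(unsigned long pos)
{
    return pos & (DEMO_PAGE_CACHE_SIZE - 1);
}

/* After: same computation with the plain page size. */
unsigned long new_offset_in_page(unsigned long pos)
{
    return pos & (DEMO_PAGE_SIZE - 1);
}
```

    Both pairs return identical values for every input, which is what
    makes a mechanical, script-driven conversion safe.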

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

30 Mar, 2016

1 commit

  • This is in preparation for the series that transitions
    filesystem timestamps to use 64 bit time and hence make
    them y2038 safe.

    CURRENT_TIME macro will be deleted before merging the
    aforementioned series.

    Use current_fs_time() instead of CURRENT_TIME for inode
    timestamps.

    struct kernfs_node is associated with a sysfs file/directory.
    Truncate the values to appropriate time granularity when
    writing to inode timestamps of the files.

    ktime_get_real_ts() is used to obtain times for
    struct kernfs_iattrs. Since these times are later assigned to
    inode times using timespec_truncate() for all filesystem based
    operations, we can save the supers list traversal time here by
    using ktime_get_real_ts() directly.
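
    Truncating a timestamp to a filesystem's time granularity can be
    sketched as follows. This is a simplified userspace illustration, not
    the kernel's timespec_truncate(): struct demo_timespec and
    demo_truncate() are hypothetical names, and `gran` is assumed to be a
    granularity in nanoseconds.

```c
#define DEMO_NSEC_PER_SEC 1000000000L

struct demo_timespec {
    long long tv_sec;
    long tv_nsec;
};

/* Round t down to a multiple of gran nanoseconds. */
struct demo_timespec demo_truncate(struct demo_timespec t, long gran)
{
    if (gran <= 1) {
        /* Nanosecond granularity: nothing to drop. */
    } else if (gran >= DEMO_NSEC_PER_SEC) {
        /* Whole-second granularity: drop the nanoseconds. */
        t.tv_nsec = 0;
    } else {
        t.tv_nsec -= t.tv_nsec % gran;
    }
    return t;
}
```

    For example, a timestamp with tv_nsec = 123456789 becomes 123456000 at
    a granularity of 1000 ns and 0 at a granularity of one second.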

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Greg Kroah-Hartman

    Deepa Dinamani
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

17 Feb, 2016

2 commits


08 Feb, 2016

1 commit

  • kernfs_walk_ns() uses a static path_buf[PATH_MAX] to separate out path
    components. Keeping around the 4k buffer just for kernfs_walk_ns() is
    wasteful. This patch makes it piggyback on kernfs_pr_cont_buf[]
    instead. This requires kernfs_walk_ns() to hold kernfs_rename_lock.

    Signed-off-by: Tejun Heo
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

23 Jan, 2016

1 commit

  • Wrappers for ->i_mutex access, parallel to
    mutex_{lock,unlock,trylock,is_locked,lock_nested}:
    inode_foo(inode) is mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
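
    The shape of such wrappers can be shown with a userspace analogy
    built on pthread mutexes. struct demo_inode and the demo_inode_*()
    functions are hypothetical stand-ins, not the kernel definitions; the
    idea is that callers stop naming the lock field directly, so its type
    can change later without touching them.

```c
#include <pthread.h>

struct demo_inode {
    pthread_mutex_t i_mutex;
};

static inline void demo_inode_lock(struct demo_inode *inode)
{
    pthread_mutex_lock(&inode->i_mutex);
}

static inline void demo_inode_unlock(struct demo_inode *inode)
{
    pthread_mutex_unlock(&inode->i_mutex);
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static inline int demo_inode_trylock(struct demo_inode *inode)
{
    return pthread_mutex_trylock(&inode->i_mutex) == 0;
}
```

    Code that calls demo_inode_lock(inode) instead of locking
    inode->i_mutex itself keeps working if the field is later replaced by
    a different lock type.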

    Signed-off-by: Al Viro

    Al Viro
     

15 Jan, 2016

1 commit

  • Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
    alloc_kmem_pages call) are accounted to memory cgroup automatically.
    Callers have to explicitly opt out if they don't want/need accounting
    for some reason. Such a design decision leads to several problems:

    - kmalloc users are highly sensitive to failures, many of them
    implicitly rely on the fact that kmalloc never fails, while memcg
    makes failures quite plausible.

    - A lot of objects are shared among different containers by design.
    Accounting such objects to one of the containers is just unfair.
    Moreover, it might lead to pinning a dead memcg along with its kmem
    caches, which aren't tiny, which might result in noticeable increase
    in memory consumption for no apparent reason in the long run.

    - There are tons of short-lived objects. Accounting them to memcg will
    only result in slight noise and won't change the overall picture, but
    we still have to pay accounting overhead.

    For more info, see

    - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
    - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

    Therefore this patchset switches to the white list policy. Now kmalloc
    users have to explicitly opt in by passing __GFP_ACCOUNT flag.
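
    The difference an opt-in flag makes can be sketched in userspace.
    Everything here is illustrative: DEMO_GFP_ACCOUNT is a hypothetical
    stand-in for __GFP_ACCOUNT, demo_alloc() wraps malloc(), and
    demo_charged plays the role of the amount charged to a memory cgroup.

```c
#include <stdlib.h>

#define DEMO_GFP_ACCOUNT 0x1u   /* hypothetical stand-in for __GFP_ACCOUNT */

/* Bytes charged so far; stands in for a memcg counter. */
size_t demo_charged;

/* Allocate size bytes; charge them only if the caller opted in. */
void *demo_alloc(size_t size, unsigned int flags)
{
    void *p = malloc(size);

    if (p && (flags & DEMO_GFP_ACCOUNT))
        demo_charged += size;
    return p;
}
```

    An allocation made with flags of 0 leaves demo_charged untouched, so
    callers that never asked for accounting are unaffected by it.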

    Currently, the list of accounted objects is quite limited and only
    includes those allocations that (1) are known to be easily triggered
    from userspace and (2) can fail gracefully (for the full list see patch
    no. 6) and it still misses many object types. However, accounting only
    those objects should be a satisfactory approximation of the behavior we
    used to have for most sane workloads.

    This patch (of 6):

    Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
    to memcg").

    Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch reverts
    bits introducing the black-list policy. The white-list policy will be
    introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov