27 Aug, 2018

1 commit

  • Pull IDA updates from Matthew Wilcox:
    "A better IDA API:

    id = ida_alloc(ida, GFP_xxx);
    ida_free(ida, id);

    rather than the cumbersome ida_simple_get(), ida_simple_remove().

    The new IDA API is similar to ida_simple_get() but better named. The
    internal restructuring of the IDA code removes the bitmap
    preallocation nonsense.

    I hope the net -200 lines of code is convincing"

    * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
    ida: Change ida_get_new_above to return the id
    ida: Remove old API
    test_ida: check_ida_destroy and check_ida_alloc
    test_ida: Convert check_ida_conv to new API
    test_ida: Move ida_check_max
    test_ida: Move ida_check_leaf
    idr-test: Convert ida_check_nomem to new API
    ida: Start new test_ida module
    target/iscsi: Allocate session IDs from an IDA
    iscsi target: fix session creation failure handling
    drm/vmwgfx: Convert to new IDA API
    dmaengine: Convert to new IDA API
    ppc: Convert vas ID allocation to new IDA API
    media: Convert entity ID allocation to new IDA API
    ppc: Convert mmu context allocation to new IDA API
    Convert net_namespace to new IDA API
    cb710: Convert to new IDA API
    rsxx: Convert to new IDA API
    osd: Convert to new IDA API
    sd: Convert to new IDA API
    ...

    Linus Torvalds
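
The alloc/free pairing described in this pull request can be pictured with a small userspace model. This is only a sketch: `toy_ida`, `toy_ida_alloc()` and `toy_ida_free()` are hypothetical names, the single 64-bit bitmap is an assumption made for brevity, and the real ida_alloc() takes a GFP mask and reports failure with a negative errno rather than -1.

```c
#include <stdint.h>

/* Toy stand-in for struct ida: one 64-bit bitmap, so only IDs 0..63 exist. */
struct toy_ida {
    uint64_t used;
};

/* Return the smallest free ID, or -1 when all 64 are taken. */
static int toy_ida_alloc(struct toy_ida *ida)
{
    for (int id = 0; id < 64; id++) {
        if (!(ida->used & (UINT64_C(1) << id))) {
            ida->used |= UINT64_C(1) << id;
            return id;
        }
    }
    return -1;
}

/* Release an ID so that a later allocation can hand it out again. */
static void toy_ida_free(struct toy_ida *ida, int id)
{
    if (id >= 0 && id < 64)
        ida->used &= ~(UINT64_C(1) << id);
}
```

A caller would zero-initialise a `struct toy_ida`, call `toy_ida_alloc()` to obtain an ID, and return it with `toy_ida_free()` when done; a freed ID becomes the next candidate if it is the lowest one available.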
     

22 Aug, 2018

1 commit


18 Aug, 2018

3 commits

  • We need to distinguish the situations when shrinker has very small
    amount of objects (see vfs_pressure_ratio() called from
    super_cache_count()), and when it has no objects at all. Currently, in
    both of these cases, shrinker::count_objects() returns 0.

    The patch introduces a new SHRINK_EMPTY return value, which will be used
    for the "no objects at all" case. It is mostly a refactoring, as
    SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
    patch, and all the magic will happen in later patches.

    Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
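
The distinction this commit draws can be sketched in userspace as follows. The `toy_*` names and the cache structure are hypothetical, and the numeric values of SHRINK_STOP and SHRINK_EMPTY are assumed here rather than taken from the commit message; only the idea of a sentinel that callers fold back to 0 comes from the text above.

```c
/* Sentinel values; assumed for illustration, not quoted from the patch. */
#define SHRINK_STOP  (~0UL)
#define SHRINK_EMPTY (~0UL - 1)

/* Hypothetical cache: nr_total objects exist, nr_reported is the count the
 * shrinker would report after scaling (it may round down to 0). */
struct toy_cache {
    unsigned long nr_total;
    unsigned long nr_reported;
};

/* count_objects() analogue: "nothing at all" is reported as SHRINK_EMPTY,
 * while "a few objects that scale down to 0" is still reported as 0. */
static unsigned long toy_count_objects(const struct toy_cache *c)
{
    if (c->nr_total == 0)
        return SHRINK_EMPTY;
    return c->nr_reported;
}

/* Caller analogue: for now the sentinel is simply folded back to 0, so
 * behaviour is unchanged, as the commit message describes. */
static unsigned long toy_shrink_slab(const struct toy_cache *c)
{
    unsigned long ret = toy_count_objects(c);

    if (ret == SHRINK_EMPTY)
        ret = 0;
    return ret;
}
```

Both an empty cache and a nearly empty one still yield 0 from `toy_shrink_slab()`, but the intermediate result now tells them apart, which is what later patches can build on.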
     
  • Add list_lru::shrinker_id field and populate it by registered shrinker
    id.

    This will be used by the lru code in the next patches to set the correct
    bit in the memcg shrinkers map once the first memcg-related element
    appears in list_lru.

    Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Do the two list_lru_init_memcg() calls after prealloc_super().
    destroy_unused_super() in the failure path is OK with this. The next
    patch needs this ordering.

    Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

05 Jun, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "Misc bits and pieces not fitting into anything more specific"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: delete unnecessary assignment in vfs_listxattr
    Documentation: filesystems: update filesystem locking documentation
    vfs: namei: use path_equal() in follow_dotdot()
    fs.h: fix outdated comment about file flags
    __inode_security_revalidate() never gets NULL opt_dentry
    make xattr_getsecurity() static
    vfat: simplify checks in vfat_lookup()
    get rid of dead code in d_find_alias()
    it's SB_BORN, not MS_BORN...
    msdos_rmdir(): kill BS comment
    remove rpc_rmdir()
    fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()

    Linus Torvalds
     

12 May, 2018

1 commit

  • We recently had an oops reported on a 4.14 kernel in
    xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
    and so the m_perag_tree lookup walked into lala land. It produces
    an oops down this path during the failed mount:

    radix_tree_gang_lookup_tag+0xc4/0x130
    xfs_perag_get_tag+0x37/0xf0
    xfs_reclaim_inodes_count+0x32/0x40
    xfs_fs_nr_cached_objects+0x11/0x20
    super_cache_count+0x35/0xc0
    shrink_slab.part.66+0xb1/0x370
    shrink_node+0x7e/0x1a0
    try_to_free_pages+0x199/0x470
    __alloc_pages_slowpath+0x3a1/0xd20
    __alloc_pages_nodemask+0x1c3/0x200
    cache_grow_begin+0x20b/0x2e0
    fallback_alloc+0x160/0x200
    kmem_cache_alloc+0x111/0x4e0

    The problem is that the superblock shrinker is running before the
    filesystem structures it depends on have been fully set up. i.e.
    the shrinker is registered in sget(), before ->fill_super() has been
    called, and the shrinker can call into the filesystem before
    fill_super() does its setup work. Essentially we are exposed to
    both use-after-free and use-before-initialisation bugs here.

    To fix this, add a check for the SB_BORN flag in super_cache_count.
    In general, this flag is not set until ->fs_mount() completes
    successfully, so we know that it is set after the filesystem
    setup has completed. This matches the trylock_super() behaviour
    which will not let super_cache_scan() run if SB_BORN is not set, and
    hence will not allow the superblock shrinker to enter the
    filesystem while it is being set up or after it has failed setup
    and is being torn down.

    Cc: stable@kernel.org
    Signed-Off-By: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
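
The guard described above can be modelled in a few lines of userspace C. Everything prefixed `toy_` or `TOY_` is hypothetical, and the bit position used for the born flag is an assumption; the sketch only shows why checking the flag first keeps the count path away from filesystem-private data that is not yet valid.

```c
#include <stddef.h>

#define TOY_SB_BORN (1UL << 29)  /* assumed bit position, illustrative only */

/* Stand-in for filesystem-private data hanging off the superblock. */
struct toy_fs_info {
    unsigned long nr_cached;
};

struct toy_super_block {
    unsigned long s_flags;
    struct toy_fs_info *s_fs_info;  /* not valid until setup has finished */
};

/* super_cache_count() analogue: report nothing until the superblock is
 * marked born, so s_fs_info is never dereferenced during setup or teardown. */
static unsigned long toy_super_cache_count(const struct toy_super_block *sb)
{
    if (!(sb->s_flags & TOY_SB_BORN))
        return 0;
    return sb->s_fs_info->nr_cached;
}
```

With the flag clear the function returns 0 without touching `s_fs_info`, which in the sketch may still be NULL; once setup sets the flag, the cached-object count is read normally.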
     

11 May, 2018

1 commit


16 Apr, 2018

1 commit

  • syzbot is catching so many bugs triggered by commit 9ee332d99e4d5a97
    ("sget(): handle failures of register_shrinker()"). That commit expected
    that calling kill_sb() from deactivate_locked_super() without successful
    fill_super() is safe, but the reality was different; some callers assign
    attributes which are needed for kill_sb() after sget() succeeds.

    For example, [1] is a report where sb->s_mode (which seems to be either
    FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
    assigned unless sget() succeeds. But it is not worth complicating sget()
    so that register_shrinker() failure path can safely call
    kill_block_super() via kill_sb(). Making alloc_super() fail if memory
    allocation for register_shrinker() failed is much simpler. Let's avoid
    calling deactivate_locked_super() from sget_userns() by preallocating
    memory for the shrinker and making register_shrinker() in sget_userns()
    never fail.

    [1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Cc: Al Viro
    Cc: Michal Hocko
    Signed-off-by: Al Viro

    Tetsuo Handa
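
The approach in this commit is a two-phase pattern: do the allocation that can fail early, then make the later registration step infallible. A minimal userspace sketch, assuming hypothetical `toy_*` helpers and a simplified shrinker that only needs one array:

```c
#include <stdbool.h>
#include <stdlib.h>

struct toy_shrinker {
    long *nr_deferred;  /* memory that registration depends on */
    bool registered;
};

/* Phase 1: perform the only step that can fail (the allocation).
 * Assumes nodes > 0. Returns 0 on success, -1 on allocation failure. */
static int toy_prealloc_shrinker(struct toy_shrinker *s, size_t nodes)
{
    s->registered = false;
    s->nr_deferred = calloc(nodes, sizeof(*s->nr_deferred));
    return s->nr_deferred ? 0 : -1;
}

/* Phase 2: publish the shrinker. Nothing here can fail, so the caller
 * needs no error path after this point. */
static void toy_register_shrinker_prepared(struct toy_shrinker *s)
{
    s->registered = true;
}

/* Release the preallocated memory, e.g. when the object is discarded. */
static void toy_free_prealloced_shrinker(struct toy_shrinker *s)
{
    free(s->nr_deferred);
    s->nr_deferred = NULL;
}
```

In this model the allocation failure surfaces while the superblock is still private to its creator, so no teardown callback has to run on a half-initialised object.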
     

13 Apr, 2018

1 commit


19 Mar, 2018

1 commit

  • There are 2 distinct freezing mechanisms - one operates on block
    devices and another one directly on super blocks. Both end up with the
    same result, but thaw of only one of these does not thaw the other.

    In particular fsfreeze --freeze uses the ioctl variant going to the
    super block. Since prior to this patch emergency thaw was not doing
    a relevant thaw, filesystems frozen with this method remained
    unaffected.

    The patch is a hack which adds blind unfreezing.

    In order to keep the super block write-locked the whole time the code
    is shuffled around and the newly introduced __iterate_supers is
    employed.

    Signed-off-by: Mateusz Guzik
    Signed-off-by: Al Viro

    Mateusz Guzik
     

01 Feb, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of misc stuff, without any unifying topic, from various
    people.

    Neil's d_anon patch, several bugfixes, introduction of kvmalloc
    analogue of kmemdup_user(), extending bitfield.h to deal with
    fixed-endians, assorted cleanups all over the place..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
    alpha: osf_sys.c: use timespec64 where appropriate
    alpha: osf_sys.c: fix put_tv32 regression
    jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
    dcache: delete unused d_hash_mask
    dcache: subtract d_hash_shift from 32 in advance
    fs/buffer.c: fold init_buffer() into init_page_buffers()
    fs: fold __inode_permission() into inode_permission()
    fs: add RWF_APPEND
    sctp: use vmemdup_user() rather than badly open-coding memdup_user()
    snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
    replace_user_tlv(): switch to vmemdup_user()
    new primitive: vmemdup_user()
    memdup_user(): switch to GFP_USER
    eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
    eventfd: fold eventfd_ctx_read() into eventfd_read()
    eventfd: convert to use anon_inode_getfd()
    nfs4file: get rid of pointless include of btrfs.h
    uvc_v4l2: clean copyin/copyout up
    vme_user: don't use __copy_..._user()
    usx2y: don't bother with memdup_user() for 16-byte structure
    ...

    Linus Torvalds
     

26 Dec, 2017

1 commit

  • The original purpose of the per-superblock d_anon list was to
    keep disconnected dentries in the cache between consecutive
    requests to the NFS server. Dentries can be disconnected if
    a client holds a file open and repeatedly performs IO on it,
    and if the server drops the dentry, whether due to memory
    pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches".

    This purpose was thwarted by commit 75a6f82a0d10 ("freeing unlinked
    file indefinitely delayed") which caused disconnected dentries
    to be freed as soon as their refcount reached zero.

    This means that, when a dentry being used by nfsd gets disconnected, a
    new one needs to be allocated for every request (unless requests
    overlap). As the dentry has no name, no parent, and no children,
    there is little of value to cache. As small memory allocations are
    typically fast (from per-cpu free lists) this likely has little cost.

    This means that the original purpose of s_anon is no longer relevant:
    there is no longer any need to keep disconnected dentries on a list so
    they appear to be hashed.

    However, s_anon now has a new use. When you mount an NFS filesystem,
    the dentry stored in s_root is just a placebo. The "real" root dentry
    is allocated using d_obtain_root() and so is kept on the s_anon list.
    I don't know the reason for this, but suspect it is related to NFSv4
    where a mount of "server:/some/path" requires NFS to look up the root
    filehandle on the server, then walk down "/some" and "/path" to get
    the filehandle to mount.

    Whatever the reason, NFS depends on the s_anon list and on
    shrink_dcache_for_umount() pruning all dentries on this list. So we
    cannot simply remove s_anon.

    We could just leave the code unchanged, but apart from that being
    potentially confusing, the (unfair) bit-spin-lock which protects
    s_anon can become a bottleneck when lots of disconnected dentries are
    being created.

    So this patch renames s_anon to s_roots, and stops storing
    disconnected dentries on the list. Only dentries obtained with
    d_obtain_root() are now stored on this list. There are many fewer of
    these (only NFS and NILFS2 use the call, and only during filesystem
    mount) so contention on the bit-lock will not be a problem.

    Possibly an alternate solution should be found for NFS and NILFS2, but
    that would require understanding their needs first.

    Signed-off-by: NeilBrown
    Signed-off-by: Al Viro

    NeilBrown
     

19 Dec, 2017

1 commit


05 Dec, 2017

1 commit


18 Nov, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, really no common topic here"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: grab the lock instead of blocking in __fd_install during resizing
    vfs: stop clearing close on exec when closing a fd
    include/linux/fs.h: fix comment about struct address_space
    fs: make fiemap work from compat_ioctl
    coda: fix 'kernel memory exposure attempt' in fsync
    pstore: remove unneeded unlikely()
    vfs: remove unneeded unlikely()
    stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
    make vfs_ustat() static
    do_handle_open() should be static
    elf_fdpic: fix unused variable warning
    fold destroy_super() into __put_super()
    new helper: destroy_unused_super()
    fix address space warnings in ipc/
    acct.h: get rid of detritus

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

12 Oct, 2017

2 commits

  • There's only one caller of destroy_super() left now. Fold it there,
    and replace those list_lru_destroy() calls with checks that they
    had already been done (as they should have, when we were dropping
    the last active reference).

    Signed-off-by: Al Viro

    Al Viro
     
  • Used for disposal of super_block instances that had never been reachable
    via any shared data structures. No need for RCU delay in there, everything
    can be called directly.

    Signed-off-by: Al Viro

    Al Viro
     

05 Oct, 2017

1 commit


15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

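    sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
    -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
    -e 's/\<MS_NODEV\>/SB_NODEV/g' \
    -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
    -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
    -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
    -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
    -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
    -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
    -e 's/\<MS_SILENT\>/SB_SILENT/g' \
    -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
    -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
    -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
    -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \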
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

18 Aug, 2017

1 commit


17 Jul, 2017

2 commits

  • Differentiate the MS_* flags passed to mount(2) from the internal flags set
    in the super_block's s_flags. s_flags are now called SB_*, with the names
    and the values for the moment mirroring the MS_* flags that they're
    equivalent to.

    In this patch, just the headers are altered and some kernel code where
    blind automated conversion isn't necessarily correct.

    Note that this shows up some interesting issues:

    (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
    MNT_NODEV) without passing this on to the filesystem, but some
    filesystems set such flags anyway.

    (2) The ->remount_fs() methods of some filesystems adjust the *flags
    argument by setting MS_* flags in it, such as MS_NOATIME - but these
    flags are then scrubbed by do_remount_sb() (only the occupants of
    MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
    MS_I_VERSION and MS_LAZYTIME)

    I'm not sure what's the best way to solve all these cases.

    Suggested-by: Al Viro
    Signed-off-by: David Howells

    David Howells
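
Issue (2) above can be illustrated with a small model of the remount masking step. The `TOY_` constants are hypothetical stand-ins whose bit positions are assumed to mirror the mount(2) flag values, and `toy_remount_flags()` is only a sketch of the scrubbing the message attributes to do_remount_sb().

```c
/* Bit positions assumed to mirror the mount(2) ABI; illustrative only. */
#define TOY_MS_RDONLY      (1UL << 0)
#define TOY_MS_SYNCHRONOUS (1UL << 4)
#define TOY_MS_MANDLOCK    (1UL << 6)
#define TOY_MS_NOATIME     (1UL << 10)
#define TOY_MS_I_VERSION   (1UL << 23)
#define TOY_MS_LAZYTIME    (1UL << 25)

/* The flags a remount is allowed to change, per the list in the message. */
#define TOY_MS_RMT_MASK (TOY_MS_RDONLY | TOY_MS_SYNCHRONOUS | \
                         TOY_MS_MANDLOCK | TOY_MS_I_VERSION | TOY_MS_LAZYTIME)

/* Bits outside the mask keep their old value from s_flags; bits inside
 * the mask are taken from the remount request. */
static unsigned long toy_remount_flags(unsigned long s_flags,
                                       unsigned long requested)
{
    return (s_flags & ~TOY_MS_RMT_MASK) | (requested & TOY_MS_RMT_MASK);
}
```

A request carrying `TOY_MS_NOATIME` has that bit dropped because it lies outside the mask, which is the scrubbing effect the commit message points out for flags set by ->remount_fs().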
     
  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
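
The last semantic patch guards against comparing a raw masked value with a bool. The hazard is easiest to see with a flag whose bit value is not 1, so this sketch uses a hypothetical `TOY_FLAG`; all names here are illustrative. If MS_RDONLY is bit 0, as assumed here, the uncast comparison would happen to give the same answer for that particular flag.

```c
#include <stdbool.h>

#define TOY_FLAG 0x10UL  /* hypothetical flag whose bit value is not 1 */

struct toy_sb {
    unsigned long s_flags;
};

/* Helper in the style of sb_rdonly(): returns a bool, not the raw bit. */
static bool toy_flag_set(const struct toy_sb *sb)
{
    return sb->s_flags & TOY_FLAG;
}

/* Uncast form: compares the masked value (0 or 0x10) with a bool (0 or 1),
 * so it reports a difference even when both sides have the flag set. */
static bool toy_differs_raw(unsigned long a, const struct toy_sb *sb)
{
    return (a & TOY_FLAG) != toy_flag_set(sb);
}

/* Cast form: the masked value is normalised to 0 or 1 before comparing. */
static bool toy_differs_cast(unsigned long a, const struct toy_sb *sb)
{
    return (bool)(a & TOY_FLAG) != toy_flag_set(sb);
}
```

With the flag set on both sides, `toy_differs_raw()` wrongly reports a mismatch (0x10 against 1) while `toy_differs_cast()` reports none.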
     

11 Jul, 2017

1 commit

  • Kill off s_options, save/replace_mount_options() and generic_show_options()
    as all filesystems now implement ->show_options() for themselves. This
    should make it easier to implement a context-based mount where the mount
    options can be passed individually over a file descriptor.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

06 Jul, 2017

1 commit


21 Apr, 2017

4 commits

  • Drop 'parent' argument of bdi_register() and bdi_register_va(). It is
    always NULL.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Now that all bdi structures filesystems use are properly refcounted, we
    can remove the SB_I_DYNBDI flag.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • So far we just relied on block device to hold a bdi reference for us
    while the filesystem is mounted. While that works perfectly fine, it is
    a bit awkward that we have a pointer to a refcounted structure in the
    superblock without proper reference. So make s_bdi hold a proper
    reference to block device's BDI. No filesystem using mount_bdev()
    actually changes s_bdi so this is safe and will make bdev filesystems
    work the same way as filesystems needing to set up their private bdi.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Provide helper functions for setting up dynamically allocated
    backing_dev_info structures for filesystems and cleaning them up on
    superblock destruction.

    CC: linux-mtd@lists.infradead.org
    CC: linux-nfs@vger.kernel.org
    CC: Petr Vandrovec
    CC: linux-nilfs@vger.kernel.org
    CC: cluster-devel@redhat.com
    CC: osd-dev@open-osd.org
    CC: codalist@coda.cs.cmu.edu
    CC: linux-afs@lists.infradead.org
    CC: ecryptfs@vger.kernel.org
    CC: linux-cifs@vger.kernel.org
    CC: ceph-devel@vger.kernel.org
    CC: linux-btrfs@vger.kernel.org
    CC: v9fs-developer@lists.sourceforge.net
    CC: lustre-devel@lists.lustre.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

24 Feb, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "There is a lot here. A lot of these changes result in subtle user
    visible differences in kernel behavior. I don't expect anything will
    care but I will revert/fix things immediately if any regressions show
    up.

    From Seth Forshee there is a continuation of the work to make the vfs
    ready for unprivileged mounts. We had thought the previous changes
    prevented the creation of files outside of s_user_ns of a filesystem,
    but it turns out we missed the O_CREAT path. Ooops.

    Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
    standing bug in the implementation of PR_SET_CHILD_SUBREAPER where only
    children that are forked after the prctl are considered and not
    children forked before the prctl. The only known user of this prctl,
    systemd, forks all children after the prctl, so no userspace
    regressions will occur. Holding earlier forked children to the same
    rules as later forked children creates a semantic that is sane enough
    to allow checkpointing of processes that use this feature.

    There is a long delayed change by Nikolay Borisov to limit inotify
    instances inside a user namespace.

    Michael Kerrisk extends the API for files used to manipulate
    namespaces with two new trivial ioctls to allow discovery of the
    hierarchy and properties of namespaces.

    Konstantin Khlebnikov with the help of Al Viro adds code that when a
    network namespace exits purges its sysctl entries from the dcache, as
    in some circumstances this could use a lot of memory.

    Vivek Goyal fixed a bug with stacked filesystems where the permissions
    on the wrong inode were being checked.

    I continue previous work on ptracing across exec. Allowing a file to
    be setuid across exec while being ptraced if the tracer has enough
    credentials in the user namespace, and if the process has CAP_SETUID
    in its own namespace. Proc files for setuid or otherwise undumpable
    executables are now owned by the root in the user namespace of their
    mm. Allowing debugging of setuid applications in containers to work
    better.

    A bug I introduced with permission checking and automount is now
    fixed. The big change is to mark the mounts that the kernel initiates
    as a result of an automount. This allows the permission checks in sget
    to be safely suppressed for this kind of mount. As the permission
    check happened when the original filesystem was mounted.

    Finally a special case in the mount namespace is removed preventing
    unbounded chains in the mount hash table, and making the semantics
    simpler which benefits CRIU.

    The vfs fix along with related work in ima and evm I believe makes us
    ready to finish developing and merge fully unprivileged mounts of the
    fuse filesystem. The cleanups of the mount namespace makes discussing
    how to fix the worst case complexity of umount. The stacked filesystem
    fixes pave the way for adding multiple mappings for the filesystem
    uids so that efficient and safer containers can be implemented"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc/sysctl: Don't grab i_lock under sysctl_lock.
    vfs: Use upper filesystem inode in bprm_fill_uid()
    proc/sysctl: prune stale dentries during unregistering
    mnt: Tuck mounts under others instead of creating shadow/side mounts.
    prctl: propagate has_child_subreaper flag to every descendant
    introduce the walk_process_tree() helper
    nsfs: Add an ioctl() to return owner UID of a userns
    fs: Better permission checking for submounts
    exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
    vfs: open() with O_CREAT should not create inodes with unknown ids
    nsfs: Add an ioctl() to return the namespace type
    proc: Better ownership of files for non-dumpable tasks in user namespaces
    exec: Remove LSM_UNSAFE_PTRACE_CAP
    exec: Test the ptracer's saved cred to see if the tracee can gain caps
    exec: Don't reset euid and egid when the tracee has CAP_SETUID
    inotify: Convert to using per-namespace limits

    Linus Torvalds
     

02 Feb, 2017

1 commit

  • We will want to have struct backing_dev_info allocated separately from
    struct request_queue. As the first step add pointer to backing_dev_info
    to request_queue and convert all users touching it. No functional
    changes in this patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

01 Feb, 2017

1 commit

  • To support unprivileged users mounting filesystems two permission
    checks have to be performed: a test to see if the user is allowed to
    create a mount in the mount namespace, and a test to see if
    the user is allowed to access the specified filesystem.

    The automount case is special in that mounting the original filesystem
    grants permission to mount the sub-filesystems, to any user who
    happens to stumble across their mountpoint and satisfies the
    ordinary filesystem permission checks.

    Attempting to handle the automount case by using override_creds
    almost works. It preserves the idea that permission to mount
    the original filesystem is permission to mount the sub-filesystem.
    Unfortunately using override_creds messes up the filesystem's
    ordinary permission checks.

    Solve this by being explicit that a mount is a submount by introducing
    vfs_submount, and using it where appropriate.

    vfs_submount uses a new internal mount flag, MS_SUBMOUNT, to let
    sget and friends know that a mount is a submount so they can take appropriate
    action.

    sget and sget_userns are modified to not perform any permission checks
    on submounts.

    follow_automount is modified to stop using override_creds as that
    has proven problematic.

    do_mount is modified to always remove the new MS_SUBMOUNT flag so
    that we know userspace will never be able to specify it.

    autofs4 is modified to stop using current_real_cred that was put in
    there to handle the previous version of submount permission checking.

    cifs is modified to pass the mountpoint all of the way down to vfs_submount.

    debugfs is modified to pass the mountpoint all of the way down to
    trace_automount by adding a new parameter. To make this change easier
    a new typedef debugfs_automount_t is introduced to capture the type of
    the debugfs automount function.

    Cc: stable@vger.kernel.org
    Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid")
    Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
    Reviewed-by: Trond Myklebust
    Reviewed-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
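
Two of the rules in this commit lend themselves to a compact model: the internal submount flag is stripped from anything userspace passes in, and the superblock lookup skips its permission check when the flag is present. The sketch below uses hypothetical `toy_*` names and an assumed bit position for the flag; it is not the kernel's code.

```c
#include <stdbool.h>

#define TOY_MS_SUBMOUNT (1UL << 26)  /* assumed bit position, illustrative */

/* do_mount() analogue: flags arriving from userspace never keep the
 * kernel-internal submount bit. */
static unsigned long toy_sanitize_user_flags(unsigned long flags)
{
    return flags & ~TOY_MS_SUBMOUNT;
}

/* sget() analogue: a submount was already authorised when the parent
 * filesystem was mounted, so only non-submounts consult the caller's
 * permission. */
static bool toy_sget_allowed(unsigned long flags, bool caller_may_mount)
{
    if (flags & TOY_MS_SUBMOUNT)
        return true;
    return caller_may_mount;
}
```

Because the flag is always cleared on the userspace path, an unprivileged caller cannot use it to bypass the check; only kernel-initiated automounts reach `toy_sget_allowed()` with the bit set.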
     

30 Nov, 2016

1 commit

  • The only places that were grabbing dqonoff_mutex are functions turning
    quotas on and off and these are properly serialized using s_umount
    semaphore. Remove dqonoff_mutex.

    Signed-off-by: Jan Kara

    Jan Kara
     

23 Nov, 2016

1 commit


15 Oct, 2016

2 commits

  • sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
    false-positives. Now that xfs was fixed by Dave's commit dbad7c993053
    ("xfs: stop holding ILOCK over filldir callbacks") we can remove it and
    change freeze_super() and thaw_super() to run with s_writers.rw_sem locks
    held; we add two trivial helpers for that, lockdep_sb_freeze_release()
    and lockdep_sb_freeze_acquire().

    xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any
    warning from lockdep.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • Change thaw_super() to check frozen != SB_FREEZE_COMPLETE rather than
    frozen == SB_UNFROZEN, otherwise it can race with freeze_super() which
    drops sb->s_umount after SB_FREEZE_WRITE to preserve the lock ordering.

    In this case thaw_super() will wrongly call s_op->unfreeze_fs() before
    it was actually frozen, and call sb_freeze_unlock() which leads to the
    unbalanced percpu_up_write(). Unfortunately lockdep can't detect this,
    so this triggers misc BUG_ON()'s in kernel/rcu/sync.c.
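
    The difference between the two checks can be sketched in a small
    userspace model. This is an illustration, not the kernel code: the
    enum values are assumed to mirror the freeze levels in
    include/linux/fs.h, and the helper names (old_thaw_may_proceed,
    new_thaw_may_proceed, count_racy_states) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Freeze levels in the order freeze_super() advances through them
 * (values assumed to match include/linux/fs.h). */
enum model_freeze_state {
    SB_UNFROZEN = 0,
    SB_FREEZE_WRITE = 1,
    SB_FREEZE_PAGEFAULT = 2,
    SB_FREEZE_FS = 3,
    SB_FREEZE_COMPLETE = 4,
};

/* Old test: thaw bails out only when nothing has been frozen at all,
 * so it also proceeds for a freeze that is still in progress. */
static bool old_thaw_may_proceed(enum model_freeze_state frozen)
{
    return frozen != SB_UNFROZEN;
}

/* New test: thaw proceeds only once the freeze has fully completed. */
static bool new_thaw_may_proceed(enum model_freeze_state frozen)
{
    return frozen == SB_FREEZE_COMPLETE;
}

/* Hypothetical helper: count the states in which the old test would let
 * a thaw run against a partially completed freeze. */
static int count_racy_states(void)
{
    int n = 0;

    for (int s = SB_UNFROZEN; s <= SB_FREEZE_COMPLETE; s++) {
        enum model_freeze_state st = (enum model_freeze_state)s;

        /* The new test is strictly tighter than the old one. */
        assert(!new_thaw_may_proceed(st) || old_thaw_may_proceed(st));
        if (old_thaw_may_proceed(st) != new_thaw_may_proceed(st))
            n++;
    }
    return n;
}
```

    In this model the two tests disagree exactly for the intermediate
    states (SB_FREEZE_WRITE, SB_FREEZE_PAGEFAULT, SB_FREEZE_FS), which
    is the window in which freeze_super() has dropped sb->s_umount.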

    Reported-and-tested-by: Nikolay Borisov
    Signed-off-by: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Oleg Nesterov
     

30 Jul, 2016

1 commit

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment, that the resolution of those concerns
    would not be fuse specific, and that sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner, filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).
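
    A minimal userspace sketch of the idea, assuming a namespace with a
    single mapping extent in the style of /proc/[pid]/uid_map. All names
    here (model_extent, model_make_kid, model_may_write,
    MODEL_INVALID_ID) are hypothetical and only model the behaviour
    described above; they are not the kernel's interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for INVALID_UID / INVALID_GID, following the (uid_t)-1
 * convention. */
#define MODEL_INVALID_ID UINT32_MAX

/* One mapping extent: ids [first, first + count) inside the namespace
 * map to [lower_first, lower_first + count) outside it. */
struct model_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Translate a filesystem id through the owning namespace's extent.
 * Ids outside the extent have no mapping and come back as invalid. */
static uint32_t model_make_kid(const struct model_extent *map, uint32_t id)
{
    assert(map != NULL);
    if (id >= map->first && id - map->first < map->count)
        return map->lower_first + (id - map->first);
    return MODEL_INVALID_ID;
}

struct model_inode {
    uint32_t kuid;
    uint32_t kgid;
};

/* Inodes whose owner or group did not map are treated as read-only, so
 * nothing can trigger a write-back of ids the vfs cannot represent. */
static bool model_may_write(const struct model_inode *inode)
{
    return inode->kuid != MODEL_INVALID_ID &&
           inode->kgid != MODEL_INVALID_ID;
}
```

    With an extent of { 0, 100000, 65536 }, id 1000 maps to 101000 while
    id 65536 falls outside the extent, so an inode owned by it would be
    readable but not writable in this model.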

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then, when things are clean enough,
    it adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounter's credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and put up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     

27 Jul, 2016

1 commit

  • wait_sb_inodes() currently does a walk of all inodes in the filesystem
    to find dirty ones to wait on during sync. This is highly inefficient
    and wastes a lot of CPU when there are lots of clean cached inodes that
    we don't need to wait on.

    To avoid this "all inode" walk, we need to track inodes that are
    currently under writeback that we need to wait for. We do this by
    adding inodes to a writeback list on the sb when the mapping is first
    tagged as having pages under writeback. wait_sb_inodes() can then walk
    this list of "inodes under IO" and wait specifically just for the inodes
    that the current sync(2) needs to wait for.

    Define a couple of helpers to add/remove an inode from the writeback list
    and call them when the overall mapping is tagged for or cleared from
    writeback. Update wait_sb_inodes() to walk only the inodes under
    writeback due to the sync.
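
    The bookkeeping can be sketched in userspace as follows. This is a
    simplified model under stated assumptions, not the patch itself: the
    types and function names (model_sb, model_inode,
    model_start_page_writeback, model_end_page_writeback,
    model_count_inodes_to_wait) are hypothetical, a plain page counter
    stands in for the mapping's writeback tag, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list node. */
struct wb_node {
    struct wb_node *prev, *next;
};

struct model_sb {
    struct wb_node wb_list;     /* sentinel: inodes under writeback */
};

struct model_inode {
    struct wb_node wb_link;     /* self-linked while off the list */
    int wb_pages;               /* pages currently under writeback */
};

static void model_sb_init(struct model_sb *sb)
{
    sb->wb_list.prev = sb->wb_list.next = &sb->wb_list;
}

static void model_inode_init(struct model_inode *inode)
{
    inode->wb_link.prev = inode->wb_link.next = &inode->wb_link;
    inode->wb_pages = 0;
}

/* Add the inode to the sb list only when its first page goes under
 * writeback; later pages leave the list untouched. */
static void model_start_page_writeback(struct model_sb *sb,
                                       struct model_inode *inode)
{
    if (inode->wb_pages++ == 0) {
        struct wb_node *tail = sb->wb_list.prev;

        inode->wb_link.prev = tail;
        inode->wb_link.next = &sb->wb_list;
        tail->next = &inode->wb_link;
        sb->wb_list.prev = &inode->wb_link;
    }
}

/* Remove the inode once its last page has finished writeback. */
static void model_end_page_writeback(struct model_inode *inode)
{
    assert(inode->wb_pages > 0);
    if (--inode->wb_pages == 0) {
        inode->wb_link.prev->next = inode->wb_link.next;
        inode->wb_link.next->prev = inode->wb_link.prev;
        inode->wb_link.prev = inode->wb_link.next = &inode->wb_link;
    }
}

/* What a sync would have to wait on: only the listed inodes, however
 * many clean inodes are cached elsewhere. */
static size_t model_count_inodes_to_wait(const struct model_sb *sb)
{
    size_t n = 0;

    for (const struct wb_node *p = sb->wb_list.next;
         p != &sb->wb_list; p = p->next)
        n++;
    return n;
}
```

    The cost of the walk in this model depends on the number of inodes
    with I/O in flight rather than on the size of the inode cache, which
    is the property the change relies on.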

    With this change, filesystem sync times are significantly reduced for
    fs' with largely populated inode caches and otherwise no other work to
    do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
    with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
    than 0.1s when the filesystem is fully clean.

    Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Tested-by: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

24 Jun, 2016

1 commit