17 Jul, 2018

1 commit

  • commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7 upstream.

    sgid directories have special semantics, making newly created files in
    the directory belong to the group of the directory, and newly created
    subdirectories will also become sgid. This is historically used for
    group-shared directories.

    But group directories writable by non-group members should not imply
    that such non-group members can magically join the group, so make sure
    to clear the sgid bit on non-directories for non-members (but remember
    that sgid without group execute means "mandatory locking", just to
    confuse things even more).

    Reported-by: Jann Horn
    Cc: Andy Lutomirski
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

08 Jul, 2018

1 commit

  • [ Upstream commit 829bc787c1a0403e4d886296dd4d90c5f9c1744a ]

    In inode_init_always(), we clear the inode mapping flags, which clears
    any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not
    also clear wb_err, which means that old mapping errors can leak through
    to new inodes.

    This is crucial for the XFS inode allocation path because we recycle old
    in-core inodes and we do not want error state from an old file to leak
    into the new file. This bug was discovered by running generic/036 and
    generic/047 in a loop and noticing that the EIOs generated by the
    collision of direct and buffered writes in generic/036 would survive the
    remount between 036 and 047, and get reported to the fsyncs (on
    different files!) in generic/047.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jeff Layton
    Reviewed-by: Brian Foster
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     

14 Sep, 2017

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This fixes d_ino correctness in readdir, which brings overlayfs on par
    with normal filesystems regarding inode number semantics, as long as
    all layers are on the same filesystem.

    There are also some bug fixes, one in particular (random ioctl's
    shouldn't be able to modify lower layers) that touches some vfs code,
    but of course no-op for non-overlay fs"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix false positive ESTALE on lookup
    ovl: don't allow writing ioctl on lower layer
    ovl: fix relatime for directories
    vfs: add flags to d_real()
    ovl: cleanup d_real for negative
    ovl: constant d_ino for non-merge dirs
    ovl: constant d_ino across copy up
    ovl: fix readdir error value
    ovl: check snprintf return

    Linus Torvalds
     

09 Sep, 2017

1 commit

  • Allow interval trees to quickly check for overlaps to avoid unnecesary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Sep, 2017

1 commit

  • Need to treat non-regular overlayfs files the same as regular files when
    checking for an atime update.

    Add a d_real() flag to make it return the upper dentry for all file types.

    Reported-by: "zhangyi (F)"
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

02 Sep, 2017

1 commit

  • When we introduced the bmap redo log items, we set MS_ACTIVE on the
    mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
    from being truncated prematurely during log recovery. This also had the
    effect of putting linked inodes on the lru instead of evicting them.

    Unfortunately, we neglected to find all those unreferenced lru inodes
    and evict them after finishing log recovery, which means that we leak
    them if anything goes wrong in the rest of xfs_mountfs, because the lru
    is only cleaned out on unmount.

    Therefore, evict unreferenced inodes in the lru list immediately
    after clearing MS_ACTIVE.

    Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
    Signed-off-by: Darrick J. Wong
    Cc: viro@ZenIV.linux.org.uk
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

09 Jul, 2017

1 commit

  • Pull misc filesystem updates from Al Viro:
    "Assorted normal VFS / filesystems stuff..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    dentry name snapshots
    Make statfs properly return read-only state after emergency remount
    fs/dcache: init in_lookup_hashtable
    minix: Deinline get_block, save 2691 bytes
    fs: Reorder inode_owner_or_capable() to avoid needless
    fs: warn in case userspace lied about modprobe return

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Update dcache, inode, pid, mountpoint, and mount hash tables to use
    HASH_ZERO, and remove initialization after allocations. In case of
    places where HASH_EARLY was used such as in __pv_init_lock_hash the
    zeroed hash table was already assumed, because memblock zeroes the
    memory.

    CPU: SPARC M6, Memory: 7T
    Before fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 11.798s

    After fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 3.198s

    CPU: Intel Xeon E5-2630, Memory: 2.2T:
    Before fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.245s

    After fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.244s

    Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Babu Moger
    Cc: David Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

04 Jul, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from into
    sched/wait: Re-adjust macro line continuation backslashes in
    ...

    Linus Torvalds
     

30 Jun, 2017

1 commit

  • Checking for capabilities should be the last operation when performing
    access control tests so that PF_SUPERPRIV is set only when it was required
    for success (implying that the capability was needed for the operation).

    Reported-by: Solar Designer
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Reviewed-by: Andy Lutomirski
    Signed-off-by: Al Viro

    Kees Cook
     

28 Jun, 2017

1 commit

  • Define a set of write life time hints:

    RWH_WRITE_LIFE_NOT_SET No hint information set
    RWH_WRITE_LIFE_NONE No hints about write life time
    RWH_WRITE_LIFE_SHORT Data written has a short life time
    RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
    RWH_WRITE_LIFE_LONG Data written has a long life time
    RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time

    The intent is for these values to be relative to each other, no
    absolute meaning should be attached to these flag names.

    Add an fcntl interface for querying these flags, and also for
    setting them as well:

    F_GET_RW_HINT Returns the read/write hint set on the
    underlying inode.

    F_SET_RW_HINT Set one of the above write hints on the
    underlying inode.

    F_GET_FILE_RW_HINT Returns the read/write hint set on the
    file descriptor.

    F_SET_FILE_RW_HINT Set one of the above write hints on the
    file descriptor.

    The user passes in a 64-bit pointer to get/set these values, and
    the interface returns 0/-1 on success/error.

    Sample program testing/implementing basic setting/getting of write
    hints is below.

    Add support for storing the write life time hint in the inode flags
    and in struct file as well, and pass them to the kiocb flags. If
    both a file and its corresponding inode has a write hint, then we
    use the one in the file, if available. The file hint can be used
    for sync/direct IO, for buffered writeback only the inode hint
    is available.

    This is in preparation for utilizing these hints in the block layer,
    to guide on-media data placement.

    /*
    * writehint.c: get or set an inode write hint
    */
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef F_GET_RW_HINT
    #define F_LINUX_SPECIFIC_BASE 1024
    #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
    #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
    #endif

    static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
    "RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
    "RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

    int main(int argc, char *argv[])
    {
    uint64_t hint;
    int fd, ret;

    if (argc < 2) {
    fprintf(stderr, "%s: file \n", argv[0]);
    return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
    perror("open");
    return 2;
    }

    if (argc > 2) {
    hint = atoi(argv[2]);
    ret = fcntl(fd, F_SET_RW_HINT, &hint);
    if (ret < 0) {
    perror("fcntl: F_SET_RW_HINT");
    return 4;
    }
    }

    ret = fcntl(fd, F_GET_RW_HINT, &hint);
    if (ret < 0) {
    perror("fcntl: F_GET_RW_HINT");
    return 3;
    }

    printf("%s: hint %s\n", argv[1], str[hint]);
    close(fd);
    return 0;
    }

    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Jun, 2017

1 commit


10 May, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted bits and pieces from various people. No common topic in this
    pile, sorry"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs/affs: add rename exchange
    fs/affs: add rename2 to prepare multiple methods
    Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
    fs: don't set *REFERENCED on single use objects
    fs: compat: Remove warning from COMPATIBLE_IOCTL
    remove pointless extern of atime_need_update_rcu()
    fs: completely ignore unknown open flags
    fs: add a VALID_OPEN_FLAGS
    fs: remove _submit_bh()
    fs: constify tree_descr arrays passed to simple_fill_super()
    fs: drop duplicate header percpu-rwsem.h
    fs/affs: bugfix: Write files greater than page size on OFS
    fs/affs: bugfix: enable writes on OFS disks
    fs/affs: remove node generation check
    fs/affs: import amigaffs.h
    fs/affs: bugfix: make symbolic links work again

    Linus Torvalds
     

09 May, 2017

1 commit

  • Fix typos and add the following to the scripts/spelling.txt:

    intialisation||initialisation
    intialised||initialised
    intialise||initialise

    This commit does not intend to change the British spelling itself.

    Link: http://lkml.kernel.org/r/1481573103-11329-18-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

03 May, 2017

1 commit

  • By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
    inode we create. This is problematic as this means that it takes two
    trips through the LRU for any of these objects to be reclaimed,
    regardless of their actual lifetime. With enough pressure from these
    caches we can easily evict our working set from page cache with single
    use objects. So instead only set *REFERENCED if we've already been
    added to the LRU list. This means that we've been touched since the
    first time we were accessed, and so more likely to need to hang out in
    cache.

    To illustrate this issue I wrote the following scripts

    https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure

    on my test box. It is a single socket 4 core CPU with 16gib of RAM and
    I tested on an Intel 2tib NVME drive. The cache-pressure.sh script
    creates a new file system and creates 2 6.5gib files in order to take up
    13gib of the 16gib of ram with pagecache. Then it runs a test program
    that reads these 2 files in a loop, and keeps track of how often it has
    to read bytes for each loop. On an ideal system with no pressure we
    should have to read 0 bytes indefinitely. The second thing this script
    does is start a fs_mark job that creates a ton of 0 length files,
    putting pressure on the system with slab only allocations. On exit the
    script prints out how many bytes were read by the read-file program.
    The results are as follows

    Without patch:
    /mnt/btrfs-test/reads/file1: total read during loops 27262988288
    /mnt/btrfs-test/reads/file2: total read during loops 27262976000

    With patch:
    /mnt/btrfs-test/reads/file2: total read during loops 18640457728
    /mnt/btrfs-test/reads/file1: total read during loops 9565376512

    This patch results in a 50% reduction of the amount of pages evicted
    from our working set.

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

10 Apr, 2017

2 commits

  • Currently we free fsnotify_mark_connector structure only when inode /
    vfsmount is getting freed. This can however impose noticeable memory
    overhead when marks get attached to inodes only temporarily. So free the
    connector structure once the last mark is detached from the object.
    Since notification infrastructure can be working with the connector
    under the protection of fsnotify_mark_srcu, we have to be careful and
    free the fsnotify_mark_connector only after SRCU period passes.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently notification marks are attached to object (inode or vfsmnt) by
    a hlist_head in the object. The list is also protected by a spinlock in
    the object. So while there is any mark attached to the list of marks,
    the object must be pinned in memory (and thus e.g. last iput() deleting
    inode cannot happen). Also for list iteration in fsnotify() to work, we
    must hold fsnotify_mark_srcu lock so that mark itself and
    mark->obj_list.next cannot get freed. Thus we are required to wait for
    response to fanotify events from userspace process with
    fsnotify_mark_srcu lock held. That causes issues when userspace process
    is buggy and does not reply to some event - basically the whole
    notification subsystem gets eventually stuck.

    So to be able to drop fsnotify_mark_srcu lock while waiting for
    response, we have to pin the mark in memory and make sure it stays in
    the object list (as removing the mark waiting for response could lead to
    lost notification events for groups later in the list). However we don't
    want inode reclaim to block on such mark as that would lead to system
    just locking up elsewhere.

    This commit is the first in the series that paves way towards solving
    these conflicting lifetime needs. Instead of anchoring the list of marks
    directly in the object, we anchor it in a dedicated structure
    (fsnotify_mark_connector) and just point to that structure from the
    object. The following commits will also add spinlock protecting the list
    and object pointer to the structure.

    Reviewed-by: Miklos Szeredi
    Reviewed-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Jan Kara
     

11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check

    Linus Torvalds
     

08 Oct, 2016

3 commits


28 Sep, 2016

2 commits

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     
  • current_fs_time() is used for inode timestamps.

    Change the signature of the function to take inode pointer
    instead of superblock as per Linus's suggestion.

    Also, move the api under vfs as per the discussion on the
    thread: https://lkml.org/lkml/2016/6/9/36 . As per Arnd's
    suggestion on the thread, changing the function name.

    current_fs_time() will be deleted after all the references
    to it are replaced by current_time().

    There was a bug reported by kbuild test bot with the change
    as some of the calls to current_time() were made before the
    super_block was initialized. Catch these accidental assignments
    as timespec_trunc() does for wrong granularities. This allows
    for the function to work right even in these circumstances.
    But, adds a warning to make the user aware of the bug.

    A coccinelle script was used to identify all the current
    .alloc_inode super_block callbacks that updated inode timestamps.
    proc filesystem was the only one that was modifying inode times
    as part of this callback. The series includes a patch to fix that.

    Note that timespec_trunc() will also be moved to fs/inode.c
    in a separate patch when this will need to be revamped for
    bounds checking purposes.

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Deepa Dinamani
     

16 Sep, 2016

1 commit

  • On overlayfs relatime_need_update() needs inode times to be correct on
    overlay inode. But i_mtime and i_ctime are updated by filesystem code on
    underlying inode only, so they will be out-of-date on the overlay inode.

    This patch copies the times from the underlying inode if needed. This
    can't be done if called from RCU lookup (link following) but link m/ctime
    are not updated by fs, so this is all right.

    This patch doesn't change functionality for anything but overlayfs.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

07 Aug, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    In the "trivial API change" department - ->d_compare() losing 'parent'
    argument"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    cachefiles: Fix race between inactivating and culling a cache object
    9p: use clone_fid()
    9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
    vfs: make dentry_needs_remove_privs() internal
    vfs: remove file_needs_remove_privs()
    vfs: fix deadlock in file_remove_privs() on overlayfs
    get rid of 'parent' argument of ->d_compare()
    cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
    affs ->d_compare(): don't bother with ->d_inode
    fold _d_rehash() and __d_rehash() together
    fold dentry_rcuwalk_invalidate() into its only remaining caller

    Linus Torvalds
     

04 Aug, 2016

1 commit


03 Aug, 2016

3 commits

  • Only used by the vfs.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • file_remove_privs() is called with inode lock on file_inode(), which
    proceeds to calling notify_change() on file->f_path.dentry. Which triggers
    the WARN_ON_ONCE(!inode_is_locked(inode)) in addition to deadlocking later
    when ovl_setattr tries to lock the underlying inode again.

    Fix this mess by not mixing the layers, but doing everything on underlying
    dentry/inode.

    Signed-off-by: Miklos Szeredi
    Fixes: 07a2daab49c5 ("ovl: Copy up underlying inode's ->i_mode to overlay inode")
    Cc:

    Miklos Szeredi
     
  • Radix trees may be used not only for storing page cache pages, so
    unconditionally accounting radix tree nodes to the current memory cgroup
    is bad: if a radix tree node is used for storing data shared among
    different cgroups we risk pinning dead memory cgroups forever.

    So let's only account radix tree nodes if it was explicitly requested by
    passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to
    account page cache entries, so mark mapping->page_tree so.

    Fixes: 58e698af4c63 ("radix-tree: account radix_tree_node to memory cgroup")
    Link: http://lkml.kernel.org/r/1470057188-7864-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

30 Jul, 2016

1 commit

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level what this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    to kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner casees this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permirted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounters credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it make sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and putting up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     

27 Jul, 2016

1 commit

  • wait_sb_inodes() currently does a walk of all inodes in the filesystem
    to find dirty one to wait on during sync. This is highly inefficient
    and wastes a lot of CPU when there are lots of clean cached inodes that
    we don't need to wait on.

    To avoid this "all inode" walk, we need to track inodes that are
    currently under writeback that we need to wait for. We do this by
    adding inodes to a writeback list on the sb when the mapping is first
    tagged as having pages under writeback. wait_sb_inodes() can then walk
    this list of "inodes under IO" and wait specifically just for the inodes
    that the current sync(2) needs to wait for.

    Define a couple helpers to add/remove an inode from the writeback list
    and call them when the overall mapping is tagged for or cleared from
    writeback. Update wait_sb_inodes() to walk only the inodes under
    writeback due to the sync.

    With this change, filesystem sync times are significantly reduced for
    fs' with largely populated inode caches and otherwise no other work to
    do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
    with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
    than 0.1s when the filesystem is fully clean.

    Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Tested-by: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

06 Jul, 2016

1 commit

  • When a filesystem outside of init_user_ns is mounted it could have
    uids and gids stored in it that do not map to init_user_ns.

    The plan is to allow those filesystems to set i_uid to INVALID_UID and
    i_gid to INVALID_GID for unmapped uids and gids and then to handle
    that strange case in the vfs to ensure there is consistent robust
    handling of the weirdness.

    Upon a careful review of the vfs and filesystems about the only case
    where there is any possibility of confusion or trouble is when the
    inode is written back to disk. In that case filesystems typically
    read the inode->i_uid and inode->i_gid and write them to disk even
    when just an inode timestamp is being updated.

    Which leads to a rule that is very simple to implement and understand
    inodes whose i_uid or i_gid is not valid may not be written.

    In dealing with access times this means treat those inodes as if the
    inode flag S_NOATIME was set. Reads of the inodes appear safe and
    useful, but any write or modification is disallowed. The only inode
    write that is allowed is a chown that sets the uid and gid on the
    inode to valid values. After such a chown the inode is normal and may
    be treated as such.

    Denying all writes to inodes with uids or gids unknown to the vfs also
    prevents several oddball cases where corruption would have occurred
    because the vfs does not have complete information.

    One problem case that is prevented is attempting to use the gid of a
    directory for new inodes where the directories sgid bit is set but the
    directories gid is not mapped.

    Another problem case avoided is attempting to update the evm hash
    after setxattr, removexattr, and setattr. As the evm hash includeds
    the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
    a correct evm hash from being computed. evm hash verification also
    fails when i_uid or i_gid is unknown but that is essentially harmless
    as it does not cause filesystem corruption.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2016

1 commit

  • If one thread does iget_locked(), proceeds to try and set
    the new inode up and fails, inode will be unhashed and dropped.
    However, another thread doing ilookup/iget_locked in the middle
    of that would end up finding a half-set-up inode, grabbing
    a reference, waiting for it to come unlocked and getting the
    resulting bad inode. It's a race (if that ilookup had been
    called just after the failure of setup attempt it wouldn't
    have found the sucker at all), particularly unpleasant in
    cases when failure is transient/caller-dependent/etc.

    While it can be dealt with in the callers, there's no reason
    not to handle it in fs/inode.c primitives, especially since
    the cost is trivial.

    Signed-off-by: Al Viro

    Al Viro
     

03 May, 2016

2 commits

  • ta-da!

    The main issue is the lack of down_write_killable(), so the places
    like readdir.c switched to plain inode_lock(); once killable
    variants of rwsem primitives appear, that'll be dealt with.

    lockdep side also might need more work

    Signed-off-by: Al Viro

    Al Viro
     
  • We'll need to verify that there's neither a hashed nor in-lookup
    dentry with desired parent/name before adding to in-lookup set.

    One possible solution would be to hold the parent's ->d_lock through
    both checks, but while the in-lookup set is relatively small at any
    time, dcache is not. And holding the parent's ->d_lock through
    something like __d_lookup_rcu() would suck too badly.

    So we leave the parent's ->d_lock alone, which means that we watch
    out for the following scenario:
    * we verify that there's no hashed match
    * existing in-lookup match gets hashed by another process
    * we verify that there's no in-lookup matches and decide
    that everything's fine.

    Solution: per-directory kinda-sorta seqlock, bumped around the times
    we hash something that used to be in-lookup or move (and hash)
    something in place of in-lookup. Then the above would turn into
    * read the counter
    * do dcache lookup
    * if no matches found, check for in-lookup matches
    * if there had been none of those either, check if the
    counter has changed; repeat if it has.

    The "kinda-sorta" part is due to the fact that we don't have much spare
    space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe),
    so the counter part is not a problem, but spinlock is a different story.

    We could use the parent's ->d_lock, and it would be less painful in
    terms of contention, for __d_add() it would be rather inconvenient to
    grab; we could do that (using lock_parent()), but...

    Fortunately, we can get serialization on the counter itself, and it
    might be a good idea in general; we can use cmpxchg() in a loop to
    get from even to odd and smp_store_release() from odd to even.

    This commit adds the counter and updating logics; the readers will be
    added in the next commit.

    Signed-off-by: Al Viro

    Al Viro
     

31 Mar, 2016

1 commit

  • When get_acl() is called for an inode whose ACL is not cached yet, the
    get_acl inode operation is called to fetch the ACL from the filesystem.
    The inode operation is responsible for updating the cached acl with
    set_cached_acl(). This is done without locking at the VFS level, so
    another task can call set_cached_acl() or forget_cached_acl() before the
    get_acl inode operation gets to calling set_cached_acl(), and then
    get_acl's call to set_cached_acl() results in caching an outdate ACL.

    Prevent this from happening by setting the cached ACL pointer to a
    task-specific sentinel value before calling the get_acl inode operation.
    Move the responsibility for updating the cached ACL from the get_acl
    inode operations to get_acl(). There, only set the cached ACL if the
    sentinel value hasn't changed.

    The sentinel values are chosen to have odd values. Likewise, the value
    of ACL_NOT_CACHED is odd. In contrast, ACL object pointers always have
    an even value (ACLs are aligned in memory). This allows to distinguish
    uncached ACLs values from ACL objects.

    In addition, switch from guarding inode->i_acl and inode->i_default_acl
    upates by the inode->i_lock spinlock to using xchg() and cmpxchg().

    Filesystems that do not want ACLs returned from their get_acl inode
    operations to be cached must call forget_cached_acl() to prevent the VFS
    from doing so.

    (Patch written by Al Viro and Andreas Gruenbacher.)

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     

17 Feb, 2016

1 commit

  • inode struct members that track cgroup writeback information
    should be reinitialized when inode gets allocated from
    kmem_cache. Otherwise, their values remain and get used by the
    new inode.

    Signed-off-by: Tahsin Erdogan
    Acked-by: Tejun Heo
    Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

1 commit

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler