07 Feb, 2019

1 commit

  • commit 1dbd449c9943e3145148cc893c2461b72ba6fef0 upstream.

    The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
    lists and the shrink lists where the DCACHE_LRU_LIST bit is set.

    The shrink_dcache_sb() function moves dentries from the LRU list to a
    shrink list and subtracts the dentry count from nr_dentry_unused. This
    is incorrect as the nr_dentry_unused count will also be decremented in
    shrink_dentry_list() via d_shrink_del().

    To fix this double decrement, the decrement in the shrink_dcache_sb()
    function is taken out.
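
    A rough sketch of where the two decrements come from (simplified, not the
    verbatim upstream code); the patch removes the first one:

        void shrink_dcache_sb(struct super_block *sb)
        {
                do {
                        LIST_HEAD(dispose);
                        long freed;

                        freed = list_lru_walk(&sb->s_dentry_lru,
                                        dentry_lru_isolate_shrink, &dispose, 1024);

                        this_cpu_sub(nr_dentry_unused, freed);  /* decrement #1 - removed */
                        shrink_dentry_list(&dispose);           /* decrement #2 happens in here */
                        cond_resched();
                } while (list_lru_count(&sb->s_dentry_lru) > 0);
        }

        /* ...because shrink_dentry_list() ends up calling d_shrink_del(),
         * which already does this_cpu_dec(nr_dentry_unused) for each dentry
         * it takes off the shrink list. */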

    Fixes: 4e717f5c1083 ("list_lru: remove special case function list_lru_dispose_all.")
    Cc: stable@kernel.org
    Signed-off-by: Waiman Long
    Reviewed-by: Dave Chinner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

18 Oct, 2018

1 commit

  • commit f1782c9bc547754f4bd3043fe8cfda53db85f13f upstream.

    I received a report about suspicious growth of unreclaimable slabs on
    some machines. I've found that it happens on machines with low memory
    pressure, and these unreclaimable slabs are external names attached to
    dentries.

    External names are allocated using generic kmalloc() function, so they
    are accounted as unreclaimable. But they are held by dentries, which
    are reclaimable, and they will be reclaimed under the memory pressure.

    In particular, this breaks the MemAvailable calculation, as it doesn't take
    unreclaimable slabs into account. This leads to a silly situation where a
    machine is almost idle, has no memory pressure and therefore has a big
    dentry cache, yet the resulting MemAvailable is too low to start a new
    workload.

    To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
    used to track the amount of memory consumed by external names. The
    counter is increased in the dentry allocation path when an external name
    structure is allocated, and decreased in the dentry freeing path.
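
    A rough sketch of the accounting, assuming hooks in __d_alloc() and the
    external-name freeing path (simplified; the exact byte computation in the
    real patch differs between the two paths):

        /* __d_alloc(): the name is too long for the inline buffer */
        size_t size = offsetof(struct external_name, name[1]);
        struct external_name *p = kmalloc(size + name->len, GFP_KERNEL_ACCOUNT);
        if (!p)
                return NULL;
        mod_node_page_state(page_pgdat(virt_to_page(p)),
                            NR_INDIRECTLY_RECLAIMABLE_BYTES, ksize(p));

        /* freeing path for an external name */
        mod_node_page_state(page_pgdat(virt_to_page(p)),
                            NR_INDIRECTLY_RECLAIMABLE_BYTES, -ksize(p));
        kfree(p);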

    To reproduce the problem I've used the following Python script:

    import os

    for iter in range(0, 10000000):
        try:
            name = ("/some_long_name_%d" % iter) + "_" * 220
            os.stat(name)
        except Exception:
            pass

    Without this patch:
    $ cat /proc/meminfo | grep MemAvailable
    MemAvailable: 7811688 kB
    $ python indirect.py
    $ cat /proc/meminfo | grep MemAvailable
    MemAvailable: 2753052 kB

    With the patch:
    $ cat /proc/meminfo | grep MemAvailable
    MemAvailable: 7809516 kB
    $ python indirect.py
    $ cat /proc/meminfo | grep MemAvailable
    MemAvailable: 7749144 kB

    [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
    Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
    [guro@fb.com: fix indirectly reclaimable memory accounting]
    Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

15 Sep, 2018

1 commit

  • [ Upstream commit 6cd00a01f0c1ae6a852b09c59b8dd55cc6c35d1d ]

    Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
    are initialized at __d_alloc(), we can't copy the whole size
    unconditionally.
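
    The gist of the fix, as a simplified sketch of the inline-name branch of
    take_dentry_name_snapshot() (not the full upstream function):

        spin_lock(&dentry->d_lock);
        if (likely(!dname_external(dentry))) {
                /* was: memcpy(..., DNAME_INLINE_LEN) - reads past the
                 * initialized part of d_iname */
                memcpy(name->inline_name, dentry->d_iname,
                       dentry->d_name.len + 1);        /* name plus trailing NUL */
                spin_unlock(&dentry->d_lock);
                name->name = name->inline_name;
        } else {
                /* external name: just grab an extra reference (unchanged) */
                ...
        }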

    WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
    636f6e66696766732e746d70000000000010000000000000020000000188ffff
    i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
    ^
    RIP: 0010:take_dentry_name_snapshot+0x28/0x50
    RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
    RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
    RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
    RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
    R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
    R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
    FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
    take_dentry_name_snapshot+0x28/0x50
    vfs_rename+0x128/0x870
    SyS_rename+0x3b2/0x3d0
    entry_SYSCALL_64_fastpath+0x1a/0xa4
    0xffffffffffffffff

    Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Cc: Vegard Nossum
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

16 Aug, 2018

2 commits

  • commit 4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab upstream.

    RCU pathwalk relies upon the assumption that anything that changes
    ->d_inode of a dentry will invalidate its ->d_seq. That's almost
    true - the one exception is that the final dput() of already unhashed
    dentry does *not* touch ->d_seq at all. Unhashing does, though,
    so for anything we'd found by RCU dcache lookup we are fine.
    Unfortunately, we can *start* with an unhashed dentry or jump into
    it.

    We could try and be careful in the (few) places where that could
    happen. Or we could just make the final dput() invalidate the damn
    thing, unhashed or not. The latter is much simpler and easier to
    backport, so let's do it that way.
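
    The invariant at stake, roughly as RCU pathwalk uses it (a simplified
    sketch of the lookup-side pattern, not this patch's diff); the fix
    guarantees that the final __dentry_kill() bumps ->d_seq even for an
    already-unhashed dentry, so the retry below always fires:

        unsigned seq = raw_seqcount_begin(&dentry->d_seq);
        struct inode *inode = d_backing_inode(dentry);
        /* ... use dentry and inode locklessly under rcu_read_lock() ... */
        if (read_seqcount_retry(&dentry->d_seq, seq))
                goto unlazy;    /* ->d_inode changed (or dentry died) - fall back */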

    Reported-by: "Dae R. Jeong"
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream.

    Since mountpoint crossing can happen without leaving lazy mode,
    root dentries do need the same protection against having their
    memory freed without RCU delay as everything else in the tree.

    It's partially hidden by RCU delay between detaching from the
    mount tree and dropping the vfsmount reference, but the starting
    point of pathwalk can be on an already detached mount, in which
    case umount-caused RCU delay has already passed by the time the
    lazy pathwalk grabs rcu_read_lock(). If the starting point
    happens to be at the root of that vfsmount *and* that vfsmount
    covers the entire filesystem, we get trouble.

    Fixes: 48a066e72d97 ("RCU'd vfsmounts")
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

30 May, 2018

3 commits

  • [ Upstream commit 8cc07c808c9d595e81cbe5aad419b7769eb2e5c9 ]

    i_dir_seq is subject to concurrent modification by a cmpxchg or
    store-release operation, so ensure that the relaxed access in
    d_alloc_parallel uses READ_ONCE.
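
    The change amounts to one line in d_alloc_parallel() (sketch):

        /* before: plain (relaxed) load racing with cmpxchg/store-release writers */
        if (unlikely(parent->d_inode->i_dir_seq != seq)) { ... }

        /* after: the lockless re-check goes through READ_ONCE() */
        if (unlikely(READ_ONCE(parent->d_inode->i_dir_seq) != seq)) {
                hlist_bl_unlock(b);
                rcu_read_unlock();
                goto retry;
        }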

    Reported-by: Peter Zijlstra
    Signed-off-by: Will Deacon
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • [ Upstream commit 015555fd4d2930bc0c86952c46ad88b3392f66e4 ]

    If d_alloc_parallel runs concurrently with __d_add, it is possible for
    d_alloc_parallel to continuously retry whilst i_dir_seq has been
    incremented to an odd value by __d_add:

    CPU0:
      __d_add
        n = start_dir_add(dir);
        cmpxchg(&dir->i_dir_seq, n, n + 1) == n

    CPU1:
      d_alloc_parallel
      retry:
        seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
        hlist_bl_lock(b);
          bit_spin_lock(0, (unsigned long *)b); // Always succeeds

    CPU0:
      __d_lookup_done(dentry)
        hlist_bl_lock
          bit_spin_lock(0, (unsigned long *)b); // Never succeeds

    CPU1:
        if (unlikely(parent->d_inode->i_dir_seq != seq)) {
            hlist_bl_unlock(b);
            goto retry;
        }

    Since the simple bit_spin_lock used to implement hlist_bl_lock does not
    provide any fairness guarantees, CPU1 can starve CPU0 of the lock and
    prevent it from reaching end_dir_add(dir); CPU1 therefore cannot exit
    its retry loop because the sequence number always has the bottom bit
    set.

    This patch resolves the livelock by not taking hlist_bl_lock in
    d_alloc_parallel if the sequence counter is odd, since any subsequent
    masked comparison with i_dir_seq will fail anyway.
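
    A sketch of the resulting retry loop in d_alloc_parallel() (simplified;
    the lookup steps in between are elided):

        retry:
                rcu_read_lock();
                seq = smp_load_acquire(&parent->d_inode->i_dir_seq);
                /* ... __d_lookup_rcu(), rename_lock checks ... */
                if (unlikely(seq & 1)) {
                        /* __d_add() is mid-update: back off *without* taking
                         * hlist_bl_lock(), so the writer can finish and make
                         * i_dir_seq even again */
                        rcu_read_unlock();
                        goto retry;
                }
                hlist_bl_lock(b);
                if (unlikely(READ_ONCE(parent->d_inode->i_dir_seq) != seq)) {
                        hlist_bl_unlock(b);
                        rcu_read_unlock();
                        goto retry;
                }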

    Cc: Peter Zijlstra
    Cc: Al Viro
    Reported-by: Naresh Madhusudana
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Will Deacon
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.

    For anything NFS-exported we do _not_ want to unlock new inode
    before it has grown an alias; original set of fixes got the
    ordering right, but missed the nasty complication in case of
    lockdep being enabled - unlock_new_inode() does
    lockdep_annotate_inode_mutex_key(inode)
    which can only be done before anyone gets a chance to touch
    ->i_mutex. Unfortunately, flipping the order and doing
    unlock_new_inode() before d_instantiate() opens a window when
    mkdir can race with open-by-fhandle on a guessed fhandle, leading
    to multiple aliases for a directory inode and all the breakage
    that follows from that.

    Correct solution: a new primitive (d_instantiate_new())
    combining these two in the right order - lockdep annotate, then
    d_instantiate(), then the rest of unlock_new_inode(). All
    combinations of d_instantiate() with unlock_new_inode() should
    be converted to that.
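
    A sketch of the new primitive, close to (but simplified from) the
    upstream helper:

        void d_instantiate_new(struct dentry *entry, struct inode *inode)
        {
                BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
                BUG_ON(!inode);
                lockdep_annotate_inode_mutex_key(inode); /* 1: annotate while still private */
                security_d_instantiate(entry, inode);
                spin_lock(&inode->i_lock);
                __d_instantiate(entry, inode);           /* 2: grow the alias */
                WARN_ON(!(inode->i_state & I_NEW));
                inode->i_state &= ~I_NEW;                /* 3: the rest of unlock_new_inode() */
                smp_mb();
                wake_up_bit(&inode->i_state, __I_NEW);
                spin_unlock(&inode->i_lock);
        }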

    Cc: stable@kernel.org # 2.6.29 and later
    Tested-by: Mike Marshall
    Reviewed-by: Andreas Dilger
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

12 Apr, 2018

1 commit

  • [ Upstream commit 61647823aa920e395afcce4b57c32afb51456cab ]

    d_move() will call __d_drop() and then __d_rehash()
    on the dentry being moved. This creates a small window
    when the dentry appears to be unhashed. Many tests
    of d_unhashed() are made under ->d_lock and so are safe
    from racing with this window, but some aren't.
    In particular, getcwd() calls d_unlinked() (which calls
    d_unhashed()) without d_lock protection, so it can race.

    This race has been seen in practice with lustre, which uses d_move() as
    part of name lookup. See:
    https://jira.hpdd.intel.com/browse/LU-9735
    It could race with a regular rename(), and result in ENOENT instead
    of either the 'before' or 'after' name.

    The race can be demonstrated with a simple program which
    has two threads, one renaming a directory back and forth
    while another calls getcwd() within that directory: it should never
    fail, but does. See:
    https://patchwork.kernel.org/patch/9455345/

    We could fix this race by taking d_lock and rechecking when
    d_unhashed() reports true. Alternatively, we can remove the window
    entirely, which is the approach this patch takes.

    ___d_drop() is introduced, which does *not* clear d_hash.pprev,
    so the dentry still appears to be hashed. __d_drop() calls
    ___d_drop(), then clears d_hash.pprev.
    __d_move() now uses ___d_drop() and only clears d_hash.pprev
    when not rehashing.
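
    Roughly what the split looks like (simplified; the IS_ROOT special case
    and the seqcount invalidation are elided):

        static void ___d_drop(struct dentry *dentry)
        {
                if (!d_unhashed(dentry)) {
                        struct hlist_bl_head *b = d_hash(dentry->d_name.hash);

                        hlist_bl_lock(b);
                        __hlist_bl_del(&dentry->d_hash);
                        /* d_hash.pprev deliberately NOT cleared, so
                         * d_unhashed() keeps returning false for now */
                        hlist_bl_unlock(b);
                }
        }

        void __d_drop(struct dentry *dentry)
        {
                ___d_drop(dentry);
                dentry->d_hash.pprev = NULL;    /* now it really is unhashed */
        }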

    Signed-off-by: NeilBrown
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

21 Mar, 2018

1 commit

  • commit 3b821409632ab778d46e807516b457dfa72736ed upstream.

    In case when dentry passed to lock_parent() is protected from freeing only
    by the fact that it's on a shrink list and trylock of parent fails, we
    could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
    between unlocking dentry and locking presumed parent. We need to recheck
    that dentry is alive once we lock both it and parent *and* postpone
    rcu_read_unlock() until after that point. Otherwise we could return
    a pointer to struct dentry that already is rcu-scheduled for freeing, with
    ->d_lock held on it; caller's subsequent attempt to unlock it can end
    up with memory corruption.
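
    Roughly what the resulting slow path of lock_parent() looks like
    (a simplified sketch, not the exact upstream code):

        /* inside lock_parent(), after the fast-path trylock has failed: */
        rcu_read_lock();                /* keeps the presumed parent's memory around */
        spin_unlock(&dentry->d_lock);
        again:
                parent = READ_ONCE(dentry->d_parent);
                spin_lock(&parent->d_lock);
                if (unlikely(parent != dentry->d_parent)) {
                        spin_unlock(&parent->d_lock);
                        goto again;
                }
                if (parent != dentry) {
                        spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
                        if (unlikely(dentry->d_lockref.count < 0)) {
                                /* dentry was __dentry_kill'ed while we were unlocked */
                                spin_unlock(&parent->d_lock);
                                parent = NULL;
                        }
                } else {
                        parent = NULL;
                }
                rcu_read_unlock();      /* only after the aliveness recheck */
                return parent;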

    Cc: stable@vger.kernel.org # 3.12+, counting backports
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

25 Dec, 2017

1 commit

  • commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

    ... for easier x86 PTI code testing and back-porting. ]

    READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.

    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data

    Linus Torvalds
     

11 Jul, 2017

1 commit

    __list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for a long
    time when there are many items on the lru list. As the code stands, it
    can hold the spin lock for up to UINT_MAX entries at a time. So when the
    lru list contains a large number of items, a "BUG: spinlock lockup
    suspected" is observed in the below path:

    spin_bug+0x90
    do_raw_spin_lock+0xfc
    _raw_spin_lock+0x28
    list_lru_add+0x28
    dput+0x1c8
    path_put+0x20
    terminate_walk+0x3c
    path_lookupat+0x100
    filename_lookup+0x6c
    user_path_at_empty+0x54
    SyS_faccessat+0xd0
    el0_svc_naked+0x24

    This nlru->lock is acquired by another CPU in this path -

    d_lru_shrink_move+0x34
    dentry_lru_isolate_shrink+0x48
    __list_lru_walk_one.isra.10+0x94
    list_lru_walk_node+0x40
    shrink_dcache_sb+0x60
    do_remount_sb+0xbc
    do_emergency_remount+0xb0
    process_one_work+0x228
    worker_thread+0x2e0
    kthread+0xf4
    ret_from_fork+0x10

    Fix this lockup by reducing the number of entries to be shrunk from the
    lru list to 1024 at once. Also, add cond_resched() before processing the
    lru list again.
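
    Roughly how the resulting loop looks (a simplified sketch of
    shrink_dcache_sb() after the change):

        void shrink_dcache_sb(struct super_block *sb)
        {
                long freed;

                do {
                        LIST_HEAD(dispose);

                        /* walk at most 1024 entries per nlru->lock hold */
                        freed = list_lru_walk(&sb->s_dentry_lru,
                                        dentry_lru_isolate_shrink, &dispose, 1024);

                        this_cpu_sub(nr_dentry_unused, freed);
                        shrink_dentry_list(&dispose);
                        cond_resched();   /* let nlru->lock waiters in before the next pass */
                } while (list_lru_count(&sb->s_dentry_lru) > 0);
        }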

    Link: http://marc.info/?t=149722864900001&r=1&w=2
    Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Suggested-by: Jan Kara
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Alexander Polakov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sahitya Tummala
     

09 Jul, 2017

1 commit

  • Pull misc filesystem updates from Al Viro:
    "Assorted normal VFS / filesystems stuff..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    dentry name snapshots
    Make statfs properly return read-only state after emergency remount
    fs/dcache: init in_lookup_hashtable
    minix: Deinline get_block, save 2691 bytes
    fs: Reorder inode_owner_or_capable() to avoid needless
    fs: warn in case userspace lied about modprobe return

    Linus Torvalds
     

08 Jul, 2017

1 commit

  • take_dentry_name_snapshot() takes a safe snapshot of dentry name;
    if the name is a short one, it gets copied into caller-supplied
    structure, otherwise an extra reference to external name is grabbed
    (those are never modified). In either case the pointer to stable
    string is stored into the same structure.

    dentry must be held by the caller of take_dentry_name_snapshot(),
    but may be freely dropped afterwards - the snapshot will stay
    until destroyed by release_dentry_name_snapshot().

    Intended use:

        struct name_snapshot s;

        take_dentry_name_snapshot(&s, dentry);
        ...
        access s.name
        ...
        release_dentry_name_snapshot(&s);

    Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
    to pass down with event.

    Signed-off-by: Al Viro

    Al Viro
     

07 Jul, 2017

1 commit

  • Update dcache, inode, pid, mountpoint, and mount hash tables to use
    HASH_ZERO, and remove initialization after allocations. In places where
    HASH_EARLY was used, such as __pv_init_lock_hash, the zeroed hash table
    was already assumed because memblock zeroes the memory.
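
    For the dentry cache, the change boils down to passing HASH_ZERO and
    deleting the per-bucket init loop (sketch of the dcache_init() call site):

        dentry_hashtable =
                alloc_large_system_hash("Dentry cache",
                                        sizeof(struct hlist_bl_head),
                                        dhash_entries,
                                        13,
                                        HASH_ZERO,      /* table comes back zeroed... */
                                        &d_hash_shift,
                                        &d_hash_mask,
                                        0,
                                        0);
        /* ...so the old "for each bucket: INIT_HLIST_BL_HEAD()" loop goes away */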

    CPU: SPARC M6, Memory: 7T
    Before fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 11.798s

    After fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 3.198s

    CPU: Intel Xeon E5-2630, Memory: 2.2T:
    Before fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.245s

    After fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.244s

    Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Babu Moger
    Cc: David Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

06 Jul, 2017

1 commit


30 Jun, 2017

1 commit

    in_lookup_hashtable was introduced in commit 94bdd655caba ("parallel
    lookups machinery, part 3") and never explicitly initialized; since it
    sits in zeroed static data it is all zeros anyway. But we need the
    explicit initialization for -RT.
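
    The added initialization is essentially a loop like this (sketch):

        for (loop = 0; loop < (1 << IN_LOOKUP_SHIFT); loop++)
                INIT_HLIST_BL_HEAD(&in_lookup_hashtable[loop]);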

    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Al Viro

    Sebastian Andrzej Siewior
     

15 Jun, 2017

1 commit

  • It's not hard to trigger a bunch of d_invalidate() on the same
    dentry in parallel. They end up fighting each other - any
    dentry picked for removal by one will be skipped by the rest
    and we'll go for the next iteration through the entire
    subtree, even if everything is being skipped. Moreover, we
    immediately go back to scanning the subtree. The only thing
    we really need is to dissolve all mounts in the subtree and
    as soon as we've nothing left to do, we can just unhash the
    dentry and bugger off.

    Signed-off-by: Al Viro

    Al Viro
     

03 May, 2017

1 commit

  • By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
    inode we create. This is problematic as this means that it takes two
    trips through the LRU for any of these objects to be reclaimed,
    regardless of their actual lifetime. With enough pressure from these
    caches we can easily evict our working set from page cache with single
    use objects. So instead only set *REFERENCED if we've already been
    added to the LRU list. This means that we've been touched since the
    first time we were accessed, and so are more likely to need to hang out in
    cache.
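
    For the dentry side, the described behaviour roughly corresponds to this
    (a sketch; the inode side gets the analogous I_REFERENCED treatment):

        static void dentry_lru_add(struct dentry *dentry)
        {
                if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
                        d_lru_add(dentry);                    /* first trip: no REFERENCED bit */
                else if (unlikely(!(dentry->d_flags & DCACHE_REFERENCED)))
                        dentry->d_flags |= DCACHE_REFERENCED; /* touched again while on the LRU */
        }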

    To illustrate this issue I wrote the following scripts

    https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure

    on my test box. It is a single socket 4 core CPU with 16gib of RAM and
    I tested on an Intel 2tib NVME drive. The cache-pressure.sh script
    creates a new file system and creates 2 6.5gib files in order to take up
    13gib of the 16gib of ram with pagecache. Then it runs a test program
    that reads these 2 files in a loop, and keeps track of how often it has
    to read bytes for each loop. On an ideal system with no pressure we
    should have to read 0 bytes indefinitely. The second thing this script
    does is start a fs_mark job that creates a ton of 0 length files,
    putting pressure on the system with slab only allocations. On exit the
    script prints out how many bytes were read by the read-file program.
    The results are as follows

    Without patch:
    /mnt/btrfs-test/reads/file1: total read during loops 27262988288
    /mnt/btrfs-test/reads/file2: total read during loops 27262976000

    With patch:
    /mnt/btrfs-test/reads/file2: total read during loops 18640457728
    /mnt/btrfs-test/reads/file1: total read during loops 9565376512

    This patch results in a 50% reduction of the amount of pages evicted
    from our working set.

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

10 Jan, 2017

1 commit

    Protecting the mountpoint hashtable with namespace_sem was sufficient
    until a call to umount_mnt was added to mntput_no_expire, at which point
    it became possible for multiple calls of put_mountpoint on the same hash
    chain to happen at the same time.

    Krister Johansen reported:
    > This can cause a panic when simultaneous callers of put_mountpoint
    > attempt to free the same mountpoint. This occurs because some callers
    > hold the mount_hash_lock, while others hold the namespace lock. Some
    > even hold both.
    >
    > In this submitter's case, the panic manifested itself as a GP fault in
    > put_mountpoint() when it called hlist_del() and attempted to dereference
    > a m_hash.pprev that had been poisoned by another thread.

    Al Viro observed that the simple fix is to switch from using the namespace_sem
    to the mount_lock to protect the mountpoint hash table.

    I have taken Al's suggested patch, moved put_mountpoint in pivot_root
    (instead of taking mount_lock an additional time), and have replaced
    new_mountpoint with get_mountpoint, a function that does the hash table
    lookup and addition under the mount_lock. The introduction of get_mountpoint
    ensures that only the mount_lock is needed to manipulate the mountpoint
    hashtable.

    d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
    already set. This allows get_mountpoint to use the setting of
    DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
    happens exactly once.
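
    The dcache half of that is roughly (a simplified sketch of the tail of
    d_set_mounted(); the ancestor walk is elided):

        spin_lock(&dentry->d_lock);
        if (!d_unlinked(dentry)) {
                ret = -EBUSY;
                if (!d_mountpoint(dentry)) {
                        dentry->d_flags |= DCACHE_MOUNTED;  /* set exactly once... */
                        ret = 0;                            /* ...get_mountpoint() keys off this */
                }
        }
        spin_unlock(&dentry->d_lock);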

    Cc: stable@vger.kernel.org
    Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
    Reported-by: Krister Johansen
    Suggested-by: Al Viro
    Acked-by: Al Viro
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

25 Dec, 2016

1 commit


04 Dec, 2016

2 commits

    Now that path_has_submounts() has been added, have_submounts() is no
    longer used, so remove it.

    Link: http://lkml.kernel.org/r/20161011053428.27645.12310.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: Omar Sandoval
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Ian Kent
     
  • d_mountpoint() can only be used reliably to establish if a dentry is
    not mounted in any namespace. It isn't aware of the possibility there
    may be multiple mounts using the given dentry, possibly in a different
    namespace.

    Add a function, path_has_submounts(), that checks whether a struct path
    contains mounts (or is a mountpoint itself), to handle this case.
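
    A caller would use it along these lines (illustrative sketch):

        struct path path = { .mnt = mnt, .dentry = dentry };

        if (path_has_submounts(&path)) {
                /* something is mounted on this dentry, or below it,
                 * within the mount tree that 'mnt' belongs to */
        }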

    Link: http://lkml.kernel.org/r/20161011053403.27645.55242.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: Omar Sandoval
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Ian Kent
     

07 Aug, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    In the "trivial API change" department - ->d_compare() losing 'parent'
    argument"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    cachefiles: Fix race between inactivating and culling a cache object
    9p: use clone_fid()
    9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
    vfs: make dentry_needs_remove_privs() internal
    vfs: remove file_needs_remove_privs()
    vfs: fix deadlock in file_remove_privs() on overlayfs
    get rid of 'parent' argument of ->d_compare()
    cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
    affs ->d_compare(): don't bother with ->d_inode
    fold _d_rehash() and __d_rehash() together
    fold dentry_rcuwalk_invalidate() into its only remaining caller

    Linus Torvalds
     

06 Aug, 2016

1 commit

  • Pull qstr constification updates from Al Viro:
    "Fairly self-contained bunch - surprising lot of places passes struct
    qstr * as an argument when const struct qstr * would suffice; it
    complicates analysis for no good reason.

    I'd prefer to feed that separately from the assorted fixes (those are
    in #for-linus and with somewhat trickier topology)"

    * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    qstr: constify instances in adfs
    qstr: constify instances in lustre
    qstr: constify instances in f2fs
    qstr: constify instances in ext2
    qstr: constify instances in vfat
    qstr: constify instances in procfs
    qstr: constify instances in fuse
    qstr constify instances in fs/dcache.c
    qstr: constify instances in nfs
    qstr: constify instances in ocfs2
    qstr: constify instances in autofs4
    qstr: constify instances in hfs
    qstr: constify instances in hfsplus
    qstr: constify instances in logfs
    qstr: constify dentry_init_security

    Linus Torvalds
     

01 Aug, 2016

1 commit


30 Jul, 2016

2 commits


29 Jul, 2016

2 commits

  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     
  • This changes the vfs dentry hashing to mix in the parent pointer at the
    _beginning_ of the hash, rather than at the end.

    That actually improves both the hash and the code generation, because we
    can move more of the computation to the "static" part of the dcache
    setup, and do less at lookup runtime.

    It turns out that a lot of other hash users also really wanted to mix in
    a base pointer as a 'salt' for the hash, and so the slightly extended
    interface ends up working well for other cases too.

    Users that want a string hash that is purely about the string pass in a
    'salt' pointer of NULL.
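
    In practice the salt is just an extra first argument to the name-hash
    helpers; a sketch of the resulting calling convention:

        /* old: hash depends only on the string */
        hash = full_name_hash(name, len);

        /* new: the parent dentry pointer is folded in up front as the salt */
        hash = full_name_hash(parent, name, len);

        /* callers that want a pure string hash pass NULL as the salt */
        hash = full_name_hash(NULL, name, len);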

    * merge branch 'salted-string-hash':
    fs/dcache.c: Save one 32-bit multiply in dcache lookup
    vfs: make the string hashes salt the hash

    Linus Torvalds
     

25 Jul, 2016

3 commits

    We triggered a soft lockup under a stress test that
    open/access/write/closes one file concurrently on more than
    five different CPUs:

    WARN: soft lockup - CPU#0 stuck for 11s! [who:30631]
    ...
    [] dput+0x100/0x298
    [] terminate_walk+0x4c/0x60
    [] path_lookupat+0x5cc/0x7a8
    [] filename_lookup+0x38/0xf0
    [] user_path_at_empty+0x78/0xd0
    [] user_path_at+0x1c/0x28
    [] SyS_faccessat+0xb4/0x230

    ->d_lock trylock may fail many times because of concurrent
    operations, and dput() may execute for a long time.

    Fix this by replacing cpu_relax() with cond_resched().
    dput() used to be sleepable, so making it sleepable again
    should be safe.
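
    The shape of the fix (sketch): when dentry_kill() cannot make progress
    and hands the same dentry back, reschedule before retrying instead of
    spinning:

        /* in dput(), simplified: */
        kill_it:
                dentry = dentry_kill(dentry);
                if (dentry) {
                        cond_resched();   /* previously a tight retry that could spin */
                        goto repeat;
                }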

    Cc:
    Signed-off-by: Wei Fang
    Signed-off-by: Al Viro

    Wei Fang
     
  • Allow filesystem to initialize dentry at allocation time.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Al Viro
     

21 Jul, 2016

1 commit


01 Jul, 2016

4 commits