21 Aug, 2015

1 commit


18 Aug, 2015

2 commits

  • When competing sync(2) calls walk the same filesystem, they need to
    walk the list of inodes on the superblock to find all the inodes
    that we need to wait for IO completion on. However, when multiple
    wait_sb_inodes() calls do this at the same time, they contend on the
    the inode_sb_list_lock and the contention causes system wide
    slowdowns. In effect, concurrent sync(2) calls can take longer and
    burn more CPU than if they were serialised.

    Stop the worst of the contention by adding a per-sb mutex to wrap
    around wait_sb_inodes() so that we only execute one sync(2) IO
    completion walk per superblock superblock at a time and hence avoid
    contention being triggered by concurrent sync(2) calls.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     
  • The process of reducing contention on per-superblock inode lists
    starts with moving the locking to match the per-superblock inode
    list. This takes the global lock out of the picture and reduces the
    contention problems to within a single filesystem. This doesn't get
    rid of contention as the locks still have global CPU scope, but it
    does isolate operations on different superblocks form each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     

15 Aug, 2015

4 commits

  • We can remove everything from struct sb_writers except frozen
    and add the array of percpu_rw_semaphore's instead.

    This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
    it for get_super_thawed(). We will probably remove it later.

    This change tries to address the following problems:

    - Firstly, __sb_start_write() looks simply buggy. It does
    __sb_end_write() if it sees ->frozen, but if it migrates
    to another CPU before percpu_counter_dec(), sb_wait_write()
    can wrongly succeed if there is another task which holds
    the same "semaphore": sb_wait_write() can miss the result
    of the previous percpu_counter_inc() but see the result
    of this percpu_counter_dec().

    - As Dave Hansen reports, it is suboptimal. The trivial
    microbenchmark that writes to a tmpfs file in a loop runs
    12% faster if we change this code to rely on RCU and kill
    the memory barriers.

    - This code doesn't look simple. It would be better to rely
    on the generic locking code.

    According to Dave, this change adds the same performance
    improvement.

    Note: with this change both freeze_super() and thaw_super() will do
    synchronize_sched_expedited() 3 times. This is just ugly. But:

    - This will be "fixed" by the rcu_sync changes we are going
    to merge. After that freeze_super()->percpu_down_write()
    will use synchronize_sched(), and thaw_super() won't use
    synchronize() at all.

    This doesn't need any changes in fs/super.c.

    - Once we merge rcu_sync changes, we can also change super.c
    so that all wb_write->rw_sem's will share the single ->rss
    in struct sb_writes, then freeze_super() will need only one
    synchronize_sched().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Jan Kara

    Oleg Nesterov
     
  • Of course, this patch is ugly as hell. It will be (partially)
    reverted later. We add it to ensure that other WIP changes in
    percpu_rw_semaphore won't break fs/super.c.

    We do not even need this change right now, percpu_free_rwsem()
    is fine in atomic context. But we are going to change this, it
    will be might_sleep() after we merge the rcu_sync() patches.

    And even after that we do not really need destroy_super_work(),
    we will kill it in any case. Instead, destroy_super_rcu() should
    just check that rss->cb_state == CB_IDLE and do call_rcu() again
    in the (very unlikely) case this is not true.

    So this is just the temporary kludge which helps us to avoid the
    conflicts with the changes which will be (hopefully) routed via
    rcu tree.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Jan Kara

    Oleg Nesterov
     
  • Not only we need to avoid the warning from lockdep_sys_exit(), the
    caller of freeze_super() can never release this lock. Another thread
    can do this, so there is another reason for rwsem_release().

    Plus the comment should explain why we have to fool lockdep.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Jan Kara

    Oleg Nesterov
     
  • 1. wait_event(frozen < level) without rwsem_acquire_read() is just
    wrong from lockdep perspective. If we are going to deadlock
    because the caller is buggy, lockdep can't detect this problem.

    2. __sb_start_write() can race with thaw_super() + freeze_super(),
    and after "goto retry" the 2nd acquire_freeze_lock() is wrong.

    3. The "tell lockdep we are doing trylock" hack doesn't look nice.

    I think this is correct, but this logic should be more explicit.
    Yes, the recursive read_lock() is fine if we hold the lock on a
    higher level. But we do not need to fool lockdep. If we can not
    deadlock in this case then try-lock must not fail and we can use
    use wait == F throughout this code.

    Note: as Dave Chinner explains, the "trylock" hack and the fat comment
    can be probably removed. But this needs a separate change and it will
    be trivial: just kill __sb_start_write() and rename do_sb_start_write()
    back to __sb_start_write().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Jan Kara

    Oleg Nesterov
     

01 Jul, 2015

1 commit


15 Apr, 2015

1 commit

  • The limit equals 32 and is imposed by the number of entries in the
    fs_poolid_map and shared_fs_poolid_map. Nowadays it is insufficient,
    because with containers on board a Linux host can have hundreds of
    active fs mounts.

    These maps were introduced by commit 49a9ab815acb8 ("mm: cleancache:
    lazy initialization to allow tmem backends to build/run as modules") in
    order to allow compiling cleancache drivers as modules. Real pool ids
    are stored in these maps while super_block->cleancache_poolid points to
    an entry in the map, so that on cleancache registration we can walk over
    all (if there are s_umount held for writing, while iterate_supers takes this
    semaphore for reading, so if we call iterate_supers after setting
    cleancache_ops, all super blocks that had been created before
    cleancache_register_ops was called will be assigned pool ids by the
    action function of iterate_supers while all newer super blocks will
    receive it in cleancache_init_fs.

    This patch therefore removes the maps and hence the artificial limit on
    the number of cleancache enabled filesystems.

    Signed-off-by: Vladimir Davydov
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Stefan Hengelein
    Cc: Florian Schmaus
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Feb, 2015

1 commit

  • I've noticed significant locking contention in memory reclaimer around
    sb_lock inside grab_super_passive(). Grab_super_passive() is called from
    two places: in icache/dcache shrinkers (function super_cache_scan) and
    from writeback (function __writeback_inodes_wb). Both are required for
    progress in memory allocator.

    Grab_super_passive() acquires sb_lock to increment sb->s_count and check
    sb->s_instances. It seems sb->s_umount locked for read is enough here:
    super-block deactivation always runs under sb->s_umount locked for write.
    Protecting super-block itself isn't a problem: in super_cache_scan() sb
    is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
    are still active. Inside writeback super-block comes from inode from bdi
    writeback list under wb->list_lock.

    This patch removes locking sb_lock and checks s_instances under s_umount:
    generic_shutdown_super() unlinks it under sb->s_umount locked for write.
    New variant is called trylock_super() and since it only locks semaphore,
    callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
    they're done.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Al Viro

    Konstantin Khlebnikov
     

18 Feb, 2015

1 commit

  • Pull misc VFS updates from Al Viro:
    "This cycle a lot of stuff sits on topical branches, so I'll be sending
    more or less one pull request per branch.

    This is the first pile; more to follow in a few. In this one are
    several misc commits from early in the cycle (before I went for
    separate branches), plus the rework of mntput/dput ordering on umount,
    switching to use of fs_pin instead of convoluted games in
    namespace_unlock()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch the IO-triggering parts of umount to fs_pin
    new fs_pin killing logics
    allow attaching fs_pin to a group not associated with some superblock
    get rid of the second argument of acct_kill()
    take count and rcu_head out of fs_pin
    dcache: let the dentry count go down to zero without taking d_lock
    pull bumping refcount into ->kill()
    kill pin_put()
    mode_t whack-a-mole: chelsio
    file->f_path.dentry is pinned down for as long as the file is open...
    get rid of lustre_dump_dentry()
    gut proc_register() a bit
    kill d_validate()
    ncpfs: get rid of d_validate() nonsense
    selinuxfs: don't open-code d_genocide()

    Linus Torvalds
     

13 Feb, 2015

6 commits

  • Merge third set of updates from Andrew Morton:

    - the rest of MM

    [ This includes getting rid of the numa hinting bits, in favor of
    just generic protnone logic. Yay. - Linus ]

    - core kernel

    - procfs

    - some of lib/ (lots of lib/ material this time)

    * emailed patches from Andrew Morton : (104 commits)
    lib/lcm.c: replace include
    lib/percpu_ida.c: remove redundant includes
    lib/strncpy_from_user.c: replace module.h include
    lib/stmp_device.c: replace module.h include
    lib/sort.c: move include inside #if 0
    lib/show_mem.c: remove redundant include
    lib/radix-tree.c: change to simpler include
    lib/plist.c: remove redundant include
    lib/nlattr.c: remove redundant include
    lib/kobject_uevent.c: remove redundant include
    lib/llist.c: remove redundant include
    lib/md5.c: simplify include
    lib/list_sort.c: rearrange includes
    lib/genalloc.c: remove redundant include
    lib/idr.c: remove redundant include
    lib/halfmd4.c: simplify includes
    lib/dynamic_queue_limits.c: simplify includes
    lib/sort.c: use simpler includes
    lib/interval_tree.c: simplify includes
    hexdump: make it return number of bytes placed in buffer
    ...

    Linus Torvalds
     
  • In super_cache_scan() we divide the number of objects of particular type
    by the total number of objects in order to distribute pressure among As a
    result, in some corner cases we can get nr_to_scan=0 even if there are
    some objects to reclaim, e.g. dentries=1, inodes=1, fs_objects=1,
    nr_to_scan=1/3=0.

    This is unacceptable for per memcg kmem accounting, because this means
    that some objects may never get reclaimed after memcg death, preventing it
    from being freed.

    This patch therefore assures that super_cache_scan() will scan at least
    one object of each type if any.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Alexander Viro
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Now, to make any list_lru-based shrinker memcg aware we should only
    initialize its list_lru as memcg aware. Let's do it for the general FS
    shrinker (super_block::s_shrink).

    There are other FS-specific shrinkers that use list_lru for storing
    objects, such as XFS and GFS2 dquot cache shrinkers, but since they
    reclaim objects that are shared among different cgroups, there is no point
    making them memcg aware. It's a big question whether we should account
    them to memcg at all.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • To make list_lru memcg aware, we need all list_lrus to be kept on a list
    protected by a mutex, so that we could sleep while walking over the
    list.

    Therefore after this change list_lru_destroy may sleep. Fortunately,
    there is only one user that calls it from an atomic context - it's
    put_super - and we can easily fix it by calling list_lru_destroy before
    put_super in destroy_locked_super - anyway we don't longer need lrus by
    that time.

    Another point that should be noted is that list_lru_destroy is allowed
    to be called on an uninitialized zeroed-out object, in which case it is
    a no-op. Before this patch this was guaranteed by kfree, but now we
    need an explicit check there.

    Signed-off-by: Vladimir Davydov
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We are going to make FS shrinkers memcg-aware. To achieve that, we will
    have to pass the memcg to scan to the nr_cached_objects and
    free_cached_objects VFS methods, which currently take only the NUMA node
    to scan. Since the shrink_control structure already holds the node, and
    the memcg to scan will be added to it when we introduce memcg-aware
    vmscan, let us consolidate the methods' arguments in this structure to
    keep things clean.

    Signed-off-by: Vladimir Davydov
    Suggested-by: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Kmem accounting of memcg is unusable now, because it lacks slab shrinker
    support. That means when we hit the limit we will get ENOMEM w/o any
    chance to recover. What we should do then is to call shrink_slab, which
    would reclaim old inode/dentry caches from this cgroup. This is what
    this patch set is intended to do.

    Basically, it does two things. First, it introduces the notion of
    per-memcg slab shrinker. A shrinker that wants to reclaim objects per
    cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
    passed the memory cgroup to scan from in shrink_control->memcg. For
    such shrinkers shrink_slab iterates over the whole cgroup subtree under
    the target cgroup and calls the shrinker for each kmem-active memory
    cgroup.

    Secondly, this patch set makes the list_lru structure per-memcg. It's
    done transparently to list_lru users - everything they have to do is to
    tell list_lru_init that they want memcg-aware list_lru. Then the
    list_lru will automatically distribute objects among per-memcg lists
    basing on which cgroup the object is accounted to. This way to make FS
    shrinkers (icache, dcache) memcg-aware we only need to make them use
    memcg-aware list_lru, and this is what this patch set does.

    As before, this patch set only enables per-memcg kmem reclaim when the
    pressure goes from memory.limit, not from memory.kmem.limit. Handling
    memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
    it is still unclear whether we will have this knob in the unified
    hierarchy.

    This patch (of 9):

    NUMA aware slab shrinkers use the list_lru structure to distribute
    objects coming from different NUMA nodes to different lists. Whenever
    such a shrinker needs to count or scan objects from a particular node,
    it issues commands like this:

    count = list_lru_count_node(lru, sc->nid);
    freed = list_lru_walk_node(lru, sc->nid, isolate_func,
    isolate_arg, &sc->nr_to_scan);

    where sc is an instance of the shrink_control structure passed to it
    from vmscan.

    To simplify this, let's add special list_lru functions to be used by
    shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
    consolidate the nid and nr_to_scan arguments in the shrink_control
    structure.

    This will also allow us to avoid patching shrinkers that use list_lru
    when we make shrink_slab() per-memcg - all we will have to do is extend
    the shrink_control structure to include the target memcg and make
    list_lru_shrink_{count,walk} handle this appropriately.

    Signed-off-by: Vladimir Davydov
    Suggested-by: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

03 Feb, 2015

1 commit


26 Jan, 2015

1 commit


21 Jan, 2015

1 commit

  • Now that default_backing_dev_info is not used for writeback purposes we can
    git rid of it easily:

    - instead of using it's name for tracing unregistered bdi we just use
    "unknown"
    - btrfs and ceph can just assign the default read ahead window themselves
    like several other filesystems already do.
    - we can assign noop_backing_dev_info as the default one in alloc_super.
    All filesystems already either assigned their own or
    noop_backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

13 Oct, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The big thing in this pile is Eric's unmount-on-rmdir series; we
    finally have everything we need for that. The final piece of prereqs
    is delayed mntput() - now filesystem shutdown always happens on
    shallow stack.

    Other than that, we have several new primitives for iov_iter (Matt
    Wilcox, culled from his XIP-related series) pushing the conversion to
    ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
    cleanups and fixes (including the external name refcounting, which
    gives consistent behaviour of d_move() wrt procfs symlinks for long
    and short names alike) and assorted cleanups and fixes all over the
    place.

    This is just the first pile; there's a lot of stuff from various
    people that ought to go in this window. Starting with
    unionmount/overlayfs mess... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
    fs/file_table.c: Update alloc_file() comment
    vfs: Deduplicate code shared by xattr system calls operating on paths
    reiserfs: remove pointless forward declaration of struct nameidata
    don't need that forward declaration of struct nameidata in dcache.h anymore
    take dname_external() into fs/dcache.c
    let path_init() failures treated the same way as subsequent link_path_walk()
    fix misuses of f_count() in ppp and netlink
    ncpfs: use list_for_each_entry() for d_subdirs walk
    vfs: move getname() from callers to do_mount()
    gfs2_atomic_open(): skip lookups on hashed dentry
    [infiniband] remove pointless assignments
    gadgetfs: saner API for gadgetfs_create_file()
    f_fs: saner API for ffs_sb_create_file()
    jfs: don't hash direct inode
    [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
    ecryptfs: ->f_op is never NULL
    android: ->f_op is never NULL
    nouveau: __iomem misannotations
    missing annotation in fs/file.c
    fs: namespace: suppress 'may be used uninitialized' warnings
    ...

    Linus Torvalds
     

09 Oct, 2014

1 commit

  • total_objects could be 0 and is used as a denom.

    While total_objects is a "long", total_objects == 0 unlikely happens for
    3.12 and later kernels because 32-bit architectures would not be able to
    hold (1 << 32) objects. However, total_objects == 0 may happen for kernels
    between 3.1 and 3.11 because total_objects in prune_super() was an "int"
    and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects.

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Cc: stable # 3.1+
    Signed-off-by: Al Viro

    Tetsuo Handa
     

08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

14 Aug, 2014

1 commit

  • Pull quota, reiserfs, UDF updates from Jan Kara:
    "Scalability improvements for quota, a few reiserfs fixes, and couple
    of misc cleanups (udf, ext2)"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    reiserfs: Fix use after free in journal teardown
    reiserfs: fix corruption introduced by balance_leaf refactor
    udf: avoid redundant memcpy when writing data in ICB
    fs/udf: re-use hex_asc_upper_{hi,lo} macros
    fs/quota: kernel-doc warning fixes
    udf: use linux/uaccess.h
    fs/ext2/super.c: Drop memory allocation cast
    quota: remove dqptr_sem
    quota: simplify remove_inode_dquot_ref()
    quota: avoid unnecessary dqget()/dqput() calls
    quota: protect Q_GETFMT by dqonoff_mutex

    Linus Torvalds
     

08 Aug, 2014

3 commits


16 Jul, 2014

1 commit

  • Remove dqptr_sem to make quota code scalable: Remove the dqptr_sem,
    accessing inode->i_dquot now protected by dquot_srcu, and changing
    inode->i_dquot is now serialized by dq_data_lock.

    Signed-off-by: Lai Siyao
    Signed-off-by: Niu Yawei
    Signed-off-by: Jan Kara

    Niu Yawei
     

05 Jun, 2014

2 commits

  • We remove the call to grab_super_passive in call to super_cache_count.
    This becomes a scalability bottleneck as multiple threads are trying to do
    memory reclamation, e.g. when we are doing large amount of file read and
    page cache is under pressure. The cached objects quickly got reclaimed
    down to 0 and we are aborting the cache_scan() reclaim. But counting
    creates a log jam acquiring the sb_lock.

    We are holding the shrinker_rwsem which ensures the safety of call to
    list_lru_count_node() and s_op->nr_cached_objects. The shrinker is
    unregistered now before ->kill_sb() so the operation is safe when we are
    doing unmount.

    The impact will depend heavily on the machine and the workload but for a
    small machine using postmark tuned to use 4xRAM size the results were

    3.15.0-rc5 3.15.0-rc5
    vanilla shrinker-v1r1
    Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%)

    ffsb running in a configuration that is meant to simulate a mail server showed

    3.15.0-rc5 3.15.0-rc5
    vanilla shrinker-v1r1
    Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%)
    Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%)
    Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%)
    Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%)
    Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%)

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • This series is aimed at regressions noticed during reclaim activity. The
    first two patches are shrinker patches that were posted ages ago but never
    merged for reasons that are unclear to me. I'm posting them again to see
    if there was a reason they were dropped or if they just got lost. Dave?
    Time? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
    retest the vm scalability test cases on a larger machine? Hugh, does this
    work for you on the memcg test cases?

    Based on ext4, I get the following results but unfortunately my larger
    test machines are all unavailable so this is based on a relatively small
    machine.

    postmark
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%)

    ffsb (mail server simulator)
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%)
    Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%)
    Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%)
    Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%)
    Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%)

    dd of a large file
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%)
    WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%)
    WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%)

    stutter (times mmap latency during large amounts of IO)

    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%)
    Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%)
    Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%)
    Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%)
    Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%)
    Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%)
    Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%)
    Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%)
    Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%)
    Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%)
    Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%)
    Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%)
    Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%)
    Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%)

    This patch (of 3):

    We will like to unregister the sb shrinker before ->kill_sb(). This will
    allow cached objects to be counted without call to grab_super_passive() to
    update ref count on sb. We want to avoid locking during memory
    reclamation especially when we are skipping the memory reclaim when we are
    out of cached objects.

    This is safe because grab_super_passive does a try-lock on the
    sb->s_umount now, and so if we are in the unmount process, it won't ever
    block. That means what used to be a deadlock and races we were avoiding
    by using grab_super_passive() is now:

    shrinker umount

    down_read(shrinker_rwsem)
    down_write(sb->s_umount)
    shrinker_unregister
    down_write(shrinker_rwsem)

    grab_super_passive(sb)
    down_read_trylock(sb->s_umount)


    ....

    up_read(shrinker_rwsem)


    up_write(shrinker_rwsem)
    ->kill_sb()
    ....

    So it is safe to deregister the shrinker before ->kill_sb().

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

17 Apr, 2014

1 commit

  • Commit 9e30cc9595303b27b48 removed an internal mount. This
    has the side-effect that rootfs now has FSID 0. Many
    userspace utilities assume that st_dev in struct stat
    is never 0, so this change breaks a number of tools in
    early userspace.

    Since we don't know how many userspace programs are affected,
    make sure that FSID is at least 1.

    References: http://article.gmane.org/gmane.linux.kernel/1666905
    References: http://permalink.gmane.org/gmane.linux.utilities.util-linux-ng/8557
    Cc: 3.14
    Signed-off-by: Thomas Bächler
    Acked-by: Tejun Heo
    Acked-by: H. Peter Anvin
    Tested-by: Alexandre Demers
    Signed-off-by: Greg Kroah-Hartman

    Thomas Bächler
     

13 Mar, 2014

1 commit

  • Previously, the no-op "mount -o mount /dev/xxx" operation when the
    file system is already mounted read-write causes an implied,
    unconditional syncfs(). This seems pretty stupid, and it's certainly
    documented or guaraunteed to do this, nor is it particularly useful,
    except in the case where the file system was mounted rw and is getting
    remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this is
    not needed at all (for example, for a pseudo-filesystem or something
    like romfs).

    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: Evgeniy Dushistov
    Cc: Jan Kara
    Cc: OGAWA Hirofumi
    Cc: Anders Larsen
    Cc: Phillip Lougher
    Cc: Kees Cook
    Cc: Mikulas Patocka
    Cc: Petr Vandrovec
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org

    Theodore Ts'o
     

01 Feb, 2014

1 commit

  • Move sync_filesystem() after sb_prepare_remount_readonly(). If writers
    sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly()
    it can cause inodes to be dirtied and writeback to occur well after
    sys_mount() has completely successfully.

    This was spotted by corrupted ubifs filesystems on reboot, but appears
    that it can cause issues with any filesystem using writeback.

    Cc: Artem Bityutskiy
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    CC: Richard Weinberger
    Co-authored-by: Richard Weinberger
    Signed-off-by: Andrew Ruder
    Signed-off-by: Al Viro

    Andrew Ruder
     

22 Jan, 2014

1 commit

  • On fail path alloc_super() calls destroy_super(), which issues a warning
    if the sb's s_mounts list is not empty, in particular if it has not been
    initialized. That said s_mounts must be initialized in alloc_super()
    before any possible failure, but currently it is initialized close to
    the end of the function leading to a useless warning dumped to log if
    either percpu_counter_init() or list_lru_init() fails. Let's fix this.

    Signed-off-by: Vladimir Davydov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Nov, 2013

1 commit


25 Oct, 2013

2 commits


02 Oct, 2013

1 commit

  • Freeing ->s_{inode,dentry}_lru in deactivate_locked_super() is wrong;
    the right place is destroy_super(). As it is, we leak them if sget()
    decides that new superblock it has allocated (and never shown to
    anybody) isn't needed and should be freed.

    Signed-off-by: Al Viro

    Al Viro
     

11 Sep, 2013

2 commits

  • This patch adds the missing call to list_lru_destroy (spotted by Li Zhong)
    and moves the deletion to after the shrinker is unregistered, as correctly
    spotted by Dave

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • We currently use a compile-time constant to size the node array for the
    list_lru structure. Due to this, we don't need to allocate any memory at
    initialization time. But as a consequence, the structures that contain
    embedded list_lru lists can become way too big (the superblock for
    instance contains two of them).

    This patch aims at ameliorating this situation by dynamically allocating
    the node arrays with the firmware provided nr_node_ids.

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa