13 Oct, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The big thing in this pile is Eric's unmount-on-rmdir series; we
    finally have everything we need for that. The final piece of prereqs
    is delayed mntput() - now filesystem shutdown always happens on
    shallow stack.

    Other than that, we have several new primitives for iov_iter (Matt
    Wilcox, culled from his XIP-related series) pushing the conversion to
    ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
    cleanups and fixes (including the external name refcounting, which
    gives consistent behaviour of d_move() wrt procfs symlinks for long
    and short names alike) and assorted cleanups and fixes all over the
    place.

    This is just the first pile; there's a lot of stuff from various
    people that ought to go in this window. Starting with
    unionmount/overlayfs mess... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
    fs/file_table.c: Update alloc_file() comment
    vfs: Deduplicate code shared by xattr system calls operating on paths
    reiserfs: remove pointless forward declaration of struct nameidata
    don't need that forward declaration of struct nameidata in dcache.h anymore
    take dname_external() into fs/dcache.c
    let path_init() failures treated the same way as subsequent link_path_walk()
    fix misuses of f_count() in ppp and netlink
    ncpfs: use list_for_each_entry() for d_subdirs walk
    vfs: move getname() from callers to do_mount()
    gfs2_atomic_open(): skip lookups on hashed dentry
    [infiniband] remove pointless assignments
    gadgetfs: saner API for gadgetfs_create_file()
    f_fs: saner API for ffs_sb_create_file()
    jfs: don't hash direct inode
    [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
    ecryptfs: ->f_op is never NULL
    android: ->f_op is never NULL
    nouveau: __iomem misannotations
    missing annotation in fs/file.c
    fs: namespace: suppress 'may be used uninitialized' warnings
    ...

    Linus Torvalds
     

09 Oct, 2014

1 commit

  • total_objects could be 0 and is used as a denom.

    While total_objects is a "long", total_objects == 0 unlikely happens for
    3.12 and later kernels because 32-bit architectures would not be able to
    hold (1 << 32) objects. However, total_objects == 0 may happen for kernels
    between 3.1 and 3.11 because total_objects in prune_super() was an "int"
    and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects.

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Cc: stable # 3.1+
    Signed-off-by: Al Viro

    Tetsuo Handa
     

08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

14 Aug, 2014

1 commit

  • Pull quota, reiserfs, UDF updates from Jan Kara:
    "Scalability improvements for quota, a few reiserfs fixes, and couple
    of misc cleanups (udf, ext2)"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    reiserfs: Fix use after free in journal teardown
    reiserfs: fix corruption introduced by balance_leaf refactor
    udf: avoid redundant memcpy when writing data in ICB
    fs/udf: re-use hex_asc_upper_{hi,lo} macros
    fs/quota: kernel-doc warning fixes
    udf: use linux/uaccess.h
    fs/ext2/super.c: Drop memory allocation cast
    quota: remove dqptr_sem
    quota: simplify remove_inode_dquot_ref()
    quota: avoid unnecessary dqget()/dqput() calls
    quota: protect Q_GETFMT by dqonoff_mutex

    Linus Torvalds
     

08 Aug, 2014

3 commits


16 Jul, 2014

1 commit

  • Remove dqptr_sem to make quota code scalable: Remove the dqptr_sem,
    accessing inode->i_dquot now protected by dquot_srcu, and changing
    inode->i_dquot is now serialized by dq_data_lock.

    Signed-off-by: Lai Siyao
    Signed-off-by: Niu Yawei
    Signed-off-by: Jan Kara

    Niu Yawei
     

05 Jun, 2014

2 commits

  • We remove the call to grab_super_passive in call to super_cache_count.
    This becomes a scalability bottleneck as multiple threads are trying to do
    memory reclamation, e.g. when we are doing large amount of file read and
    page cache is under pressure. The cached objects quickly got reclaimed
    down to 0 and we are aborting the cache_scan() reclaim. But counting
    creates a log jam acquiring the sb_lock.

    We are holding the shrinker_rwsem which ensures the safety of call to
    list_lru_count_node() and s_op->nr_cached_objects. The shrinker is
    unregistered now before ->kill_sb() so the operation is safe when we are
    doing unmount.

    The impact will depend heavily on the machine and the workload but for a
    small machine using postmark tuned to use 4xRAM size the results were

    3.15.0-rc5 3.15.0-rc5
    vanilla shrinker-v1r1
    Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%)

    ffsb running in a configuration that is meant to simulate a mail server showed

    3.15.0-rc5 3.15.0-rc5
    vanilla shrinker-v1r1
    Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%)
    Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%)
    Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%)
    Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%)
    Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%)

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • This series is aimed at regressions noticed during reclaim activity. The
    first two patches are shrinker patches that were posted ages ago but never
    merged for reasons that are unclear to me. I'm posting them again to see
    if there was a reason they were dropped or if they just got lost. Dave?
    Time? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
    retest the vm scalability test cases on a larger machine? Hugh, does this
    work for you on the memcg test cases?

    Based on ext4, I get the following results but unfortunately my larger
    test machines are all unavailable so this is based on a relatively small
    machine.

    postmark
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%)

    ffsb (mail server simulator)
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%)
    Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%)
    Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%)
    Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%)
    Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%)

    dd of a large file
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%)
    WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%)
    WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%)

    stutter (times mmap latency during large amounts of IO)

    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%)
    Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%)
    Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%)
    Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%)
    Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%)
    Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%)
    Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%)
    Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%)
    Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%)
    Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%)
    Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%)
    Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%)
    Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%)
    Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%)

    This patch (of 3):

    We will like to unregister the sb shrinker before ->kill_sb(). This will
    allow cached objects to be counted without call to grab_super_passive() to
    update ref count on sb. We want to avoid locking during memory
    reclamation especially when we are skipping the memory reclaim when we are
    out of cached objects.

    This is safe because grab_super_passive does a try-lock on the
    sb->s_umount now, and so if we are in the unmount process, it won't ever
    block. That means what used to be a deadlock and races we were avoiding
    by using grab_super_passive() is now:

    shrinker umount

    down_read(shrinker_rwsem)
    down_write(sb->s_umount)
    shrinker_unregister
    down_write(shrinker_rwsem)

    grab_super_passive(sb)
    down_read_trylock(sb->s_umount)


    ....

    up_read(shrinker_rwsem)


    up_write(shrinker_rwsem)
    ->kill_sb()
    ....

    So it is safe to deregister the shrinker before ->kill_sb().

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

17 Apr, 2014

1 commit

  • Commit 9e30cc9595303b27b48 removed an internal mount. This
    has the side-effect that rootfs now has FSID 0. Many
    userspace utilities assume that st_dev in struct stat
    is never 0, so this change breaks a number of tools in
    early userspace.

    Since we don't know how many userspace programs are affected,
    make sure that FSID is at least 1.

    References: http://article.gmane.org/gmane.linux.kernel/1666905
    References: http://permalink.gmane.org/gmane.linux.utilities.util-linux-ng/8557
    Cc: 3.14
    Signed-off-by: Thomas Bächler
    Acked-by: Tejun Heo
    Acked-by: H. Peter Anvin
    Tested-by: Alexandre Demers
    Signed-off-by: Greg Kroah-Hartman

    Thomas Bächler
     

13 Mar, 2014

1 commit

  • Previously, the no-op "mount -o mount /dev/xxx" operation when the
    file system is already mounted read-write causes an implied,
    unconditional syncfs(). This seems pretty stupid, and it's certainly
    documented or guaraunteed to do this, nor is it particularly useful,
    except in the case where the file system was mounted rw and is getting
    remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this is
    not needed at all (for example, for a pseudo-filesystem or something
    like romfs).

    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: Evgeniy Dushistov
    Cc: Jan Kara
    Cc: OGAWA Hirofumi
    Cc: Anders Larsen
    Cc: Phillip Lougher
    Cc: Kees Cook
    Cc: Mikulas Patocka
    Cc: Petr Vandrovec
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org

    Theodore Ts'o
     

01 Feb, 2014

1 commit

  • Move sync_filesystem() after sb_prepare_remount_readonly(). If writers
    sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly()
    it can cause inodes to be dirtied and writeback to occur well after
    sys_mount() has completely successfully.

    This was spotted by corrupted ubifs filesystems on reboot, but appears
    that it can cause issues with any filesystem using writeback.

    Cc: Artem Bityutskiy
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    CC: Richard Weinberger
    Co-authored-by: Richard Weinberger
    Signed-off-by: Andrew Ruder
    Signed-off-by: Al Viro

    Andrew Ruder
     

22 Jan, 2014

1 commit

  • On fail path alloc_super() calls destroy_super(), which issues a warning
    if the sb's s_mounts list is not empty, in particular if it has not been
    initialized. That said s_mounts must be initialized in alloc_super()
    before any possible failure, but currently it is initialized close to
    the end of the function leading to a useless warning dumped to log if
    either percpu_counter_init() or list_lru_init() fails. Let's fix this.

    Signed-off-by: Vladimir Davydov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Nov, 2013

1 commit


25 Oct, 2013

2 commits


02 Oct, 2013

1 commit

  • Freeing ->s_{inode,dentry}_lru in deactivate_locked_super() is wrong;
    the right place is destroy_super(). As it is, we leak them if sget()
    decides that new superblock it has allocated (and never shown to
    anybody) isn't needed and should be freed.

    Signed-off-by: Al Viro

    Al Viro
     

11 Sep, 2013

8 commits

  • This patch adds the missing call to list_lru_destroy (spotted by Li Zhong)
    and moves the deletion to after the shrinker is unregistered, as correctly
    spotted by Dave

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • We currently use a compile-time constant to size the node array for the
    list_lru structure. Due to this, we don't need to allocate any memory at
    initialization time. But as a consequence, the structures that contain
    embedded list_lru lists can become way too big (the superblock for
    instance contains two of them).

    This patch aims at ameliorating this situation by dynamically allocating
    the node arrays with the firmware provided nr_node_ids.

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • Now that the shrinker is passing a node in the scan control structure, we
    can pass this to the the generic LRU list code to isolate reclaim to the
    lists on matching nodes.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • [glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • [glommer@openvz.org: adapted for new LRU return codes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert superblock shrinker to use the new count/scan API, and propagate
    the API changes through to the filesystem callouts. The filesystem
    callouts already use a count/scan API, so it's just changing counters to
    longs to match the VM API.

    This requires the dentry and inode shrinker callouts to be converted to
    the count/scan API. This is mainly a mechanical change.

    [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • With the dentry LRUs being per-sb structures, there is no real need for
    a global dentry_lru_lock. The locking can be made more fine-grained by
    moving to a per-sb LRU lock, isolating the LRU operations of different
    filesytsems completely from each other. The need for this is independent
    of any performance consideration that may arise: in the interest of
    abstracting the lru operations away, it is mandatory that each lru works
    around its own lock instead of a global lock for all of them.

    [glommer@openvz.org: updated changelog ]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Reviewed-by: Christoph Hellwig
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • The sysctl knob sysctl_vfs_cache_pressure is used to determine which
    percentage of the shrinkable objects in our cache we should actively try
    to shrink.

    It works great in situations in which we have many objects (at least more
    than 100), because the aproximation errors will be negligible. But if
    this is not the case, specially when total_objects < 100, we may end up
    concluding that we have no objects at all (total / 100 = 0, if total <
    100).

    This is certainly not the biggest killer in the world, but may matter in
    very low kernel memory situations.

    Signed-off-by: Glauber Costa
    Reviewed-by: Carlos Maiolino
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     

08 Sep, 2013

1 commit


04 Sep, 2013

1 commit

  • Add support to the core direct-io code to defer AIO completions to user
    context using a workqueue. This replaces opencoded and less efficient
    code in XFS and ext4 (we save a memory allocation for each direct IO)
    and will be needed to properly support O_(D)SYNC for AIO.

    The communication between the filesystem and the direct I/O code requires
    a new buffer head flag, which is a bit ugly but not avoidable until the
    direct I/O code stops abusing the buffer_head structure for communicating
    with the filesystems.

    Currently this creates a per-superblock unbound workqueue for these
    completions, which is taken from an earlier patch by Jan Kara. I'm
    not really convinced about this use and would prefer a "normal" global
    workqueue with a high concurrency limit, but this needs further discussion.

    JK: Fixed ext4 part, dynamic allocation of the workqueue.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
     

20 Jul, 2013

1 commit

  • Eric Sandeen has found a nasty livelock in sget() - take a mount(2) about
    to fail. The superblock is on ->fs_supers, ->s_umount is held exclusive,
    ->s_active is 1. Along comes two more processes, trying to mount the same
    thing; sget() in each is picking that superblock, bumping ->s_count and
    trying to grab ->s_umount. ->s_active is 3 now. Original mount(2)
    finally gets to deactivate_locked_super() on failure; ->s_active is 2,
    superblock is still ->fs_supers because shutdown will *not* happen until
    ->s_active hits 0. ->s_umount is dropped and now we have two processes
    chasing each other:
    s_active = 2, A acquired ->s_umount, B blocked
    A sees that the damn thing is stillborn, does deactivate_locked_super()
    s_active = 1, A drops ->s_umount, B gets it
    A restarts the search and finds the same superblock. And bumps it ->s_active.
    s_active = 2, B holds ->s_umount, A blocked on trying to get it
    ... and we are in the earlier situation with A and B switched places.

    The root cause, of course, is that ->s_active should not grow until we'd
    got MS_BORN. Then failing ->mount() will have deactivate_locked_super()
    shut the damn thing down. Fortunately, it's easy to do - the key point
    is that grab_super() is called only for superblocks currently on ->fs_supers,
    so it can bump ->s_count and grab ->s_umount first, then check MS_BORN and
    bump ->s_active; we must never increment ->s_count for superblocks past
    ->kill_sb(), but grab_super() is never called for those.

    The bug is pretty old; we would've caught it by now, if not for accidental
    exclusion between sget() for block filesystems; the things like cgroup or
    e.g. mtd-based filesystems don't have anything of that sort, so they get
    bitten. The right way to deal with that is obviously to fix sget()...

    Signed-off-by: Al Viro

    Al Viro
     

28 Feb, 2013

2 commits

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • MAX_IDR_MASK is another weirdness in the idr interface. As idr covers
    whole positive integer range, it's defined as 0x7fffffff or INT_MAX.

    Its usage in idr_find(), idr_replace() and idr_remove() is bizarre.
    They basically mask off the sign bit and operate on the rest, so if
    the caller, by accident, passes in a negative number, the sign bit
    will be masked off and the remaining part will be used as if that was
    the input, which is worse than crashing.

    The constant is visible in idr.h and there are several users in the
    kernel.

    * drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()

    Basically used to test if adap->nr is a negative number which isn't
    -1 and returns -EINVAL if so. idr_alloc() already has negative
    @start checking (w/ WARN_ON_ONCE), so this can go away.

    * drivers/infiniband/core/cm.c:cm_alloc_id()
    drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()

    Used to wrap cyclic @start. Can be replaced with max(next, 0).
    Note that this type of cyclic allocation using idr is buggy. These
    are prone to spurious -ENOSPC failure after the first wraparound.

    * fs/super.c:get_anon_bdev()

    The ID allocated from ida is masked off before being tested whether
    it's inside valid range. ida allocated ID can never be a negative
    number and the masking is unnecessary.

    Update idr_*() functions to fail with -EINVAL when negative @id is
    specified and update other MAX_IDR_MASK users as described above.

    This leaves MAX_IDR_MASK without any user, remove it and relocate
    other MAX_IDR_* constants to lib/idr.c.

    Signed-off-by: Tejun Heo
    Cc: Jean Delvare
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hal Rosenstock
    Cc: "Marciniszyn, Mike"
    Cc: Jack Morgenstein
    Cc: Or Gerlitz
    Cc: Al Viro
    Acked-by: Wolfram Sang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

10 Oct, 2012

1 commit


06 Oct, 2012

1 commit

  • To avoid name conflicts:

    drivers/video/riva/fbdev.c:281:9: sparse: preprocessor token MAX_LEVEL redefined

    While at it, also make the other names more consistent and add
    parentheses.

    [akpm@linux-foundation.org: repair fallout]
    [sfr@canb.auug.org.au: IB/mlx4: fix for MAX_ID_MASK to MAX_IDR_MASK name change]
    Signed-off-by: Fengguang Wu
    Cc: Bernd Petrovitsch
    Cc: walter harms
    Cc: Glauber Costa
    Signed-off-by: Stephen Rothwell
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

03 Oct, 2012

1 commit

  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     

04 Aug, 2012

1 commit

  • Finally we can kill the 'sync_supers' kernel thread along with the
    '->write_super()' superblock operation because all the users are gone.
    Now every file-system is supposed to self-manage own superblock and
    its dirty state.

    The nice thing about killing this thread is that it improves power management.
    Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
    every 5 seconds no matter what - even if there were no dirty superblocks and
    even if there were no file-systems using this service (e.g., btrfs and
    journalled ext4 do not need it). So it was wasting power most of the time. And
    because the thread was in the core of the kernel, all systems had to have it.
    So I am quite happy to make it go away.

    Interestingly, this thread is a left-over from the pdflush kernel thread which
    was a self-forking kernel thread responsible for all the write-back in old
    Linux kernels. It was turned into per-block device BDI threads, and
    'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

1 commit

  • 09f363c7 ("vmscan: fix shrinker callback bug in fs/super.c") fixed a
    shrinker callback which was returning -1 when nr_to_scan is zero, which
    caused excessive slab scanning. But 635697c6 ("vmscan: fix initial
    shrinker size handling") fixed the problem, again so we can freely return
    -1 although nr_to_scan is zero. So let's revert 09f363c7 because the
    comment added in 09f363c7 made an unnecessary rule.

    Signed-off-by: Minchan Kim
    Cc: Al Viro
    Cc: Mikulas Patocka
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

31 Jul, 2012

3 commits

  • Now that all users are converted, we can remove functions, variables, and
    constants defined by the old freezing mechanism.

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • vfs_check_frozen() tests are racy since the filesystem can be frozen just after
    the test is performed. Thus in write paths we can end up marking some pages or
    inodes dirty even though the file system is already frozen. This creates
    problems with flusher thread hanging on frozen filesystem.

    Another problem is that exclusion between ->page_mkwrite() and filesystem
    freezing has been handled by setting page dirty and then verifying s_frozen.
    This guaranteed that either the freezing code sees the faulted page, writes it,
    and writeprotects it again or we see s_frozen set and bail out of page fault.
    This works to protect from page being marked writeable while filesystem
    freezing is running but has an unpleasant artefact of leaving dirty (although
    unmodified and writeprotected) pages on frozen filesystem resulting in similar
    problems with flusher thread as the first problem.

    This patch aims at providing exclusion between write paths and filesystem
    freezing. We implement a writer-freeze read-write semaphore in the superblock.
    Actually, there are three such semaphores because of lock ranking reasons - one
    for page fault handlers (->page_mkwrite), one for all other writers, and one of
    internal filesystem purposes (used e.g. to track running transactions). Write
    paths which should block freezing (e.g. directory operations, ->aio_write(),
    ->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
    takes the writer side.

    Only that we don't really want to bounce cachelines of the semaphores between
    CPUs for each write happening. So we implement the reader side of the semaphore
    as a per-cpu counter and the writer side is implemented using s_writers.frozen
    superblock field.

    [AV: microoptimize sb_start_write(); we want it fast in normal case]

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Pull writeback updates from Wu Fengguang:
    "Use time based periods to age the writeback proportions, which can
    adapt equally well to fast/slow devices."

    Fix up trivial conflict in comment in fs/sync.c

    * tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Fix some comment errors
    block: Convert BDI proportion calculations to flexible proportions
    lib: Fix possible deadlock in flexible proportion code
    lib: Proportions with flexible period

    Linus Torvalds