18 Aug, 2010

1 commit

  • fs: scale files_lock

    Improve scalability of files_lock by adding per-cpu, per-sb files lists,
    protected with an lglock. The lglock provides fast access to the per-cpu lists
    to add and remove files. It also provides a snapshot of all the per-cpu lists
    (although this is very slow).

    One difficulty with this approach is that a file can be removed from the list
    by another CPU. We must track which per-cpu list the file is on with a new
    variable in the file struct (packed into a hole on 64-bit archs). Scalability
    could suffer if files are frequently removed from a different CPU's list.

    However, loads with frequent removal of files imply a short interval between
    adding and removing the files, and the scheduler attempts to avoid moving
    processes too far away. Also, even in the case of cross-CPU removal, the
    hardware has much more opportunity to parallelise cacheline transfers with N
    cachelines than with 1.

    A worst case is 1 CPU allocating files that are subsequently freed by N CPUs;
    this degenerates to contending on a single lock, which is no worse than before.
    When more than one CPU is allocating files, even if they are always freed by
    different CPUs, there will be more parallelism than in the single-lock case.
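
    As a rough model of the pattern (a minimal user-space sketch, not the
    kernel patch: per-cpu lists become per-index lists, the lglock becomes
    one mutex per cpu, and file_node/file_list_* are made-up names):

        #include <pthread.h>

        #define NR_CPUS 8

        struct file_node {
            struct file_node *next, *prev;
            int on_cpu;                        /* which per-cpu list we are on */
        };

        static struct file_node heads[NR_CPUS];  /* per-cpu circular list heads */
        static pthread_mutex_t locks[NR_CPUS];   /* the "lglock": one lock per cpu */

        static void file_lists_init(void)
        {
            for (int i = 0; i < NR_CPUS; i++) {
                heads[i].next = heads[i].prev = &heads[i];
                pthread_mutex_init(&locks[i], NULL);
            }
        }

        static void file_list_add(struct file_node *f, int cpu)
        {
            pthread_mutex_lock(&locks[cpu]);   /* fast path: local lock only */
            f->on_cpu = cpu;                   /* remember our list for removal */
            f->next = heads[cpu].next;
            f->prev = &heads[cpu];
            heads[cpu].next->prev = f;
            heads[cpu].next = f;
            pthread_mutex_unlock(&locks[cpu]);
        }

        static void file_list_del(struct file_node *f)
        {
            int cpu = f->on_cpu;               /* stable once set; may be remote */
            pthread_mutex_lock(&locks[cpu]);
            f->prev->next = f->next;
            f->next->prev = f->prev;
            pthread_mutex_unlock(&locks[cpu]);
        }

        /* Snapshot of all lists: take every lock, as lg_global_lock() does.
         * Correct, but very slow compared to the per-cpu fast path. */
        static void file_list_walk(void (*fn)(struct file_node *))
        {
            for (int i = 0; i < NR_CPUS; i++)
                pthread_mutex_lock(&locks[i]);
            for (int i = 0; i < NR_CPUS; i++)
                for (struct file_node *f = heads[i].next; f != &heads[i]; f = f->next)
                    fn(f);
            for (int i = NR_CPUS - 1; i >= 0; i--)
                pthread_mutex_unlock(&locks[i]);
        }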

    Testing results:

    On a 2-socket, 8-core Opteron, I measure the number of times the lock is taken
    to remove the file, the number of times it is removed by the same CPU that
    added it, and the number of times it is removed by the same node that added it.

    Booting:     locks=  25049  cpu-hits=  23174 (92.5%)  node-hits=  23945 (95.6%)
    kbuild -j16  locks=2281913  cpu-hits=2208126 (96.8%)  node-hits=2252674 (98.7%)
    dbench 64    locks=4306582  cpu-hits=4287247 (99.6%)  node-hits=4299527 (99.8%)

    So a file is removed by the same CPU that added it over 90% of the time,
    and it remains within the same node over 95% of the time.

    Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                 throughput
    2.6.34-rc2   24.5
    +patch       24.9

                 us     sys    idle   IO wait  (in %)
    2.6.34-rc2   51.25  28.25  17.25  3.25
    +patch       53.75  18.50  19.00  8.75

    So significantly less CPU time spent in kernel code, higher idle time and
    slightly higher throughput.

    Single-threaded performance difference was within the noise of microbenchmarks.
    That is not to say a penalty does not exist: the code is larger and requires
    more memory accesses, so it will be slightly slower.

    Cc: linux-kernel@vger.kernel.org
    Cc: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

10 Aug, 2010

3 commits

  • just delay __put_super() a bit

    Signed-off-by: Al Viro

    Al Viro
     
  • If sget() finds a matching superblock being set up, it'll
    grab an active reference to it and grab s_umount. That's
    fine - we'll wait for completion of foofs_get_sb() that way.
    However, if said foofs_get_sb() fails we'll end up holding
    the halfway-created superblock. deactivate_locked_super()
    called by foofs_get_sb() will just unlock the sucker since
    we are holding another active reference to it.

    What we need is a way to tell if the superblock has been successfully
    set up. Unfortunately, neither ->s_root nor the check for
    MS_ACTIVE quite fits. A cheap and easy way, suitable for backport:
    a new flag set by the (only) caller of ->get_sb(). If that flag
    isn't present by the time sget() has grabbed s_umount on a preexisting
    superblock it has found, it is seeing a stillborn and should
    just bury it with deactivate_locked_super() (and repeat the search).

    Longer term we want to set that flag in ->get_sb() instances (and
    check for it to distinguish between "sget() found us a live sb"
    and "sget() has allocated an sb, we need to set it up" in there,
    instead of checking ->s_root as we do now).
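
    The shape of the resulting check in sget() is roughly this (a sketch
    reconstructed from the description above, not the verbatim patch;
    MS_BORN is the flag set by the caller of ->get_sb()):

        retry:
            spin_lock(&sb_lock);
            list_for_each_entry(old, &type->fs_supers, s_instances) {
                if (!test(old, data))
                    continue;
                if (!grab_super(old))          /* takes s_umount, drops sb_lock */
                    goto retry;
                if (!(old->s_flags & MS_BORN)) {
                    /* ->get_sb() never completed: bury the stillborn
                     * superblock and repeat the search */
                    deactivate_locked_super(old);
                    goto retry;
                }
                return old;
            }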

    Signed-off-by: Al Viro
    Cc: stable@kernel.org

    Al Viro
     
  • Fix an obscure AB-BA deadlock in get_sb_bdev().

    When a superblock is mounted more than once get_sb_bdev() calls
    close_bdev_exclusive() to drop the extra bdev reference while holding
    s_umount. However, sb->s_umount nests inside bd_mutex during
    __invalidate_device() and close_bdev_exclusive() acquires bd_mutex during
    blkdev_put(); thus creating an AB-BA deadlock.

    This condition doesn't trigger frequently. For this condition to be
    visible to lockdep, the filesystem must occupy the whole device (as
    __invalidate_device() only grabs bd_mutex for the whole device), the FS
    must be mounted more than once, and a partition rescan must be issued while
    the FS is still mounted.

    Fix it by dropping s_umount over close_bdev_exclusive().
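
    The fix itself is small; schematically (dropping the lock is safe here
    because an active reference to the superblock is still held):

        /* in get_sb_bdev(), when the superblock already existed:
         * s_umount nests inside bd_mutex via __invalidate_device(), and
         * blkdev_put() takes bd_mutex, so don't hold s_umount across it */
        up_write(&s->s_umount);
        close_bdev_exclusive(bdev, mode);
        down_write(&s->s_umount);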

    Signed-off-by: Tejun Heo
    Reported-by: Ciprian Docan
    Cc: Al Viro
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Tejun Heo
     

30 Jun, 2010

1 commit

  • list_for_each_entry_safe is not suitable to protect against concurrent
    modification of the list. 6754af6 introduced a race in sb walking.

    list_for_each_entry can use the trick of pinning the current entry in
    the list before we drop and retake the lock, because it subsequently
    follows cur->next. However, list_for_each_entry_safe saves n = cur->next
    to follow before entering the loop body, so when the lock is
    dropped, n may be deleted.
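
    To make the difference concrete (a schematic fragment in the style of
    the sb-walking loops; do_something() is illustrative):

        /* OK: cur is pinned before the lock is dropped, and cur->next is
         * only followed after the lock is retaken */
        list_for_each_entry(sb, &super_blocks, s_list) {
            sb->s_count++;                /* pin cur so it stays on the list */
            spin_unlock(&sb_lock);
            do_something(sb);             /* may sleep */
            spin_lock(&sb_lock);
            __put_super(sb);
        }

        /* Racy: n = cur->next was sampled before the body ran; once the
         * lock is dropped, n may be freed under us */
        list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
            spin_unlock(&sb_lock);
            do_something(sb);
            spin_lock(&sb_lock);          /* too late: n may already be gone */
        }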

    Signed-off-by: Nick Piggin
    Cc: Christoph Hellwig
    Cc: John Stultz
    Cc: Frank Mayhar
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    npiggin@suse.de
     

31 May, 2010

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
    quota: Convert quota statistics to generic percpu_counter
    ext3 uses rb_node = NULL; to zero rb_root.
    quota: Fixup dquot_transfer
    reiserfs: Fix resuming of quotas on remount read-write
    pohmelfs: Remove dead quota code
    ufs: Remove dead quota code
    udf: Remove dead quota code
    quota: rename default quotactl methods to dquot_
    quota: explicitly set ->dq_op and ->s_qcop
    quota: drop remount argument to ->quota_on and ->quota_off
    quota: move unmount handling into the filesystem
    quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
    quota: move remount handling into the filesystem
    ocfs2: Fix use after free on remount read-only

    Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c

    Linus Torvalds
     

28 May, 2010

1 commit


24 May, 2010

3 commits

  • Only set the quota operation vectors if the filesystem actually supports
    quota instead of doing it for all filesystems in alloc_super().

    [Jan Kara: Export dquot_operations and vfs_quotactl_ops]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
    Currently the VFS calls into the quotactl interface for unmounting
    filesystems. This means filesystems with their own quota handling
    can't easily distinguish between a user-space-originated quotaoff
    and an unmount. Instead, move the responsibility for the unmount handling
    into the filesystem to be consistent with all other dquot handling.

    Note that we do call dquot_disable a lot later now, e.g. after
    a sync_filesystem. But this is fine as the quota code does all its
    writes via blockdev's mapping and that is synced even later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
    Currently do_remount_sb calls into the dquot code to tell it about going
    from rw to ro and ro to rw. Move this code into the filesystem so as not
    to depend on the dquot code in the VFS - note that ocfs2 already ignores
    these calls and handles remount by itself. This gets rid of overloading
    the quotactl calls and allows us to unify the VFS and XFS codepaths in
    that area later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

22 May, 2010

17 commits

  • > =============================================
    > [ INFO: possible recursive locking detected ]
    > 2.6.31-2-generic #14~rbd3
    > ---------------------------------------------
    > firefox-3.5/4162 is trying to acquire lock:
    > (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
    >
    > but task is already holding lock:
    > (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
    >
    > other info that might help us debug this:
    > 3 locks held by firefox-3.5/4162:
    > #0: (&s->s_vfs_rename_mutex){+.+.+.}, at: [] lock_rename+0x41/0xf0
    > #1: (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [] lock_rename+0x6a/0xf0
    > #2: (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [] lock_rename+0x7f/0xf0
    >
    > stack backtrace:
    > Pid: 4162, comm: firefox-3.5 Tainted: G C 2.6.31-2-generic #14~rbd3
    > Call Trace:
    > [] print_deadlock_bug+0xf4/0x100
    > [] validate_chain+0x4c6/0x750
    > [] __lock_acquire+0x237/0x430
    > [] lock_acquire+0xa5/0x150
    > [] ? lock_rename+0x41/0xf0
    > [] __mutex_lock_common+0x4d/0x3d0
    > [] ? lock_rename+0x41/0xf0
    > [] ? lock_rename+0x41/0xf0
    > [] ? ecryptfs_rename+0x99/0x170
    > [] mutex_lock_nested+0x46/0x60
    > [] lock_rename+0x41/0xf0
    > [] ecryptfs_rename+0xca/0x170
    > [] vfs_rename_dir+0x13e/0x160
    > [] vfs_rename+0xee/0x290
    > [] ? __lookup_hash+0x102/0x160
    > [] sys_renameat+0x252/0x280
    > [] ? cp_new_stat+0xe4/0x100
    > [] ? sysret_check+0x2e/0x69
    > [] ? trace_hardirqs_on_caller+0x14d/0x190
    > [] sys_rename+0x1b/0x20
    > [] system_call_fastpath+0x16/0x1b

    The trace above is totally reproducible by doing a cross-directory
    rename on an ecryptfs directory.

    The issue seems to be that sys_renameat() does lock_rename() then calls
    into the filesystem; if the filesystem is ecryptfs, then
    ecryptfs_rename() again does lock_rename() on the lower filesystem, and
    lockdep can't tell that the two s_vfs_rename_mutexes are different. It
    seems an annotation like the following is sufficient to fix this (it
    does get rid of the lockdep trace in my simple tests); however I would
    like to make sure I'm not misunderstanding the locking, hence the CC
    list...
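
    The annotation amounts to giving each filesystem type its own lockdep
    key for s_vfs_rename_mutex, so the upper and lower mutexes no longer
    share a class; roughly (a sketch, not the verbatim patch):

        /* in alloc_super(): key the rename mutex per file_system_type, so
         * ecryptfs's s_vfs_rename_mutex and the lower filesystem's one are
         * distinct lockdep classes */
        lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);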

    Signed-off-by: Roland Dreier
    Cc: Tyler Hicks
    Cc: Dustin Kirkland
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Roland Dreier
     
    Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and
    then letting it do all the work. But freezing is more of an fs thing, and
    doesn't really have much to do with the bdev at all; all the work gets done
    with the super. In btrfs we do not populate s_bdev, since we can have multiple
    bdevs for one fs and setting s_bdev makes removing devices from a pool kind of
    tricky. This means that freezing a btrfs filesystem fails, which causes
    corruption with things like tux-on-ice which use the fsfreeze mechanism. So
    instead of populating sb->s_bdev with a random bdev in our pool, I've broken
    the actual fs freezing stuff into freeze_super and thaw_super. These just take
    the super_block that we're freezing and do the appropriate work. It's
    basically just copy and pasted from freeze_bdev. I've then converted
    freeze_bdev over to use the new super helpers. I've tested this with ext4 and
    btrfs and verified everything continues to work the same as before.

    The only new gotcha is that multiple calls to the fsfreeze ioctl will return
    EBUSY if the fs is already frozen. I thought this was a better solution than
    adding a freeze counter to the super_block, but if everybody hates this idea
    I'm open to suggestions. Thanks,
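
    From userspace the new behaviour is visible through the fsfreeze
    ioctls; a minimal demonstration (the mount point path is hypothetical,
    and freezing requires privilege):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>              /* FIFREEZE, FITHAW */

        int main(void)
        {
            int fd = open("/mnt/test", O_RDONLY);   /* any fd on the fs */
            if (fd < 0) { perror("open"); return 1; }

            if (ioctl(fd, FIFREEZE, 0) < 0)         /* first freeze: ok */
                perror("FIFREEZE");
            if (ioctl(fd, FIFREEZE, 0) < 0)         /* second: EBUSY now */
                perror("FIFREEZE (expected EBUSY)");

            if (ioctl(fd, FITHAW, 0) < 0)
                perror("FITHAW");
            close(fd);
            return 0;
        }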

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • ... and switch the simple "loop over superblocks and do something"
    loops to it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
    If the superblock had still been alive, we would've returned it...

    Signed-off-by: Al Viro

    Al Viro
     
  • This one needs restarts...

    Signed-off-by: Al Viro

    Al Viro
     
    need list_for_each_entry_safe() here. Original didn't even
    have restart logic, so if you race with umount() it blew up.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • At the same time we can kill s_need_restart and local mutex in there.
    __put_super() made public for a while; will be gone later.

    Signed-off-by: Al Viro

    Al Viro
     
  • We used to remove from s_list and s_instances at the same
    time. So let's *not* do the former and skip superblocks
    that have empty s_instances in the loops over s_list.

    The next step, of course, will be to get rid of rescan logics
    in those loops.

    Signed-off-by: Al Viro

    Al Viro
     
  • Make sure that s_umount is acquired *before* we drop the final
    active reference; we still have the fast path (atomic_dec_unless)
    and we have gotten rid of the window between the moment when
    s_active hits zero and s_umount is acquired. Which simplifies
    the living hell out of grab_super() and inotify pin_to_kill()
    stuff.
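
    The resulting logic looks roughly like this (a sketch of the described
    change; the fast path is spelled with atomic_add_unless):

        void deactivate_super(struct super_block *s)
        {
            /* fast path: this is not the last active reference */
            if (!atomic_add_unless(&s->s_active, -1, 1)) {
                /* may be the last one: take s_umount *before* s_active
                 * can hit zero, then drop the reference for real */
                down_write(&s->s_umount);
                deactivate_locked_super(s);
            }
        }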

    Signed-off-by: Al Viro

    Al Viro
     
  • use atomic_inc_not_zero(&sb->s_active) instead of playing games with
    checking ->s_count > S_BIAS

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

30 Apr, 2010

1 commit


25 Apr, 2010

1 commit

  • noop_backing_dev_info is used only as a flag to mark filesystems that
    don't have any backing store, like tmpfs, procfs, spufs, etc.

    Signed-off-by: Joern Engel

    Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
    to the noop_backing_dev_info is not legal and will not result in
    them being flushed, but we already catch this condition in
    __mark_inode_dirty() when checking for a registered bdi.

    Signed-off-by: Jens Axboe

    Jörn Engel
     

04 Mar, 2010

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Invalidate sb->s_bdev on remount,ro.

    Fixes a problem reported by Jorge Boncompte who is seeing corruption
    trying to snapshot a minix filesystem image. Some filesystems modify
    their metadata via a path other than the bdev buffer cache (eg. they may
    use a private linear mapping for their metadata, or implement directories
    in pagecache, etc). Also, file data modifications usually go to the bdev
    via their own mappings.

    These updates are not coherent with buffercache IO (eg. via /dev/bdev)
    and never have been. However, there is a reasonable expectation that
    after a mount -oremount,ro operation the buffercache should
    subsequently be coherent with previous filesystem modifications.

    So invalidate the bdev mappings on a remount,ro operation to provide a
    coherency point.

    The problem was exposed when we switched the old rd to brd because old rd
    didn't really function like a normal block device and updates to rd via
    mappings other than the buffercache would still end up going into its
    buffercache. But the same problem has always affected other "normal"
    block devices, including loop.
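
    The coherency point amounts to flushing the bdev's buffercache once the
    remount has succeeded; schematically (a sketch, not the verbatim patch):

        /* in do_remount_sb(), after a successful rw -> ro transition */
        if (remount_ro && sb->s_bdev)
            invalidate_bdev(sb->s_bdev);   /* drop stale buffercache pages */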

    [akpm@linux-foundation.org: repair comment layout]
    Reported-by: "Jorge Boncompte [DTI2]"
    Tested-by: "Jorge Boncompte [DTI2]"
    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Nick Piggin
     

24 Dec, 2009

1 commit

    Filesystem code usually destroys the option buffer while
    parsing it. This leads to errors when the same buffer is
    passed twice. In case we fill a new superblock, do not call
    remount.

    This is needed to quiet a warning that the debugfs code
    causes on every boot.

    Cc: Miklos Szeredi
    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

24 Sep, 2009

3 commits

    Currently we hold s_umount while a filesystem is frozen, despite the fact
    that we might return to userspace and unlock it from a different process.
    Instead, grab an active reference to keep the file system busy and add an
    explicit check for frozen filesystems in remount, rejecting the remount
    instead of blocking on s_umount.

    Add a new get_active_super helper to super.c for use by freeze_bdev that
    grabs an active reference to a superblock from a given block device.
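
    The remount side of this reduces to an early rejection; roughly (sketch):

        /* in do_remount_sb(): fail instead of blocking on s_umount */
        if (sb->s_frozen != SB_UNFROZEN)
            return -EBUSY;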

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
    Now that we have the freeze count there is not much reason for bd_mount_sem
    anymore. The actual freeze/thaw operations are serialized using the
    bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
    get_sb_bdev, which tries to prevent mounting a filesystem while the block
    device is frozen. Instead, add a check for bd_fsfreeze_count and
    return -EBUSY if a filesystem is frozen. While that is a change in
    user-visible behaviour, a failing mount is much better for this case than
    having the mount process stuck uninterruptibly for a long time.
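
    Schematically, the mount path becomes (a sketch reconstructed from the
    description above):

        /* in get_sb_bdev(): check the freeze count under bd_fsfreeze_mutex
         * instead of sleeping on bd_mount_sem */
        mutex_lock(&bdev->bd_fsfreeze_mutex);
        if (bdev->bd_fsfreeze_count > 0) {
            mutex_unlock(&bdev->bd_fsfreeze_mutex);
            error = -EBUSY;
            goto error_bdev;
        }
        s = sget(fs_type, test_bdev_super, set_bdev_super, bdev);
        mutex_unlock(&bdev->bd_fsfreeze_mutex);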

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • sb->s_maxbytes is supposed to indicate the maximum size of a file that can
    exist on the filesystem. It's declared as an unsigned long long.

    Even if a filesystem has no inherent limit that prevents it from using
    every bit in that unsigned long long, it's still problematic to set it to
    anything larger than MAX_LFS_FILESIZE. There are places in the kernel
    that cast s_maxbytes to a signed value. If it's set too large then this
    cast makes it a negative number and generally breaks the comparison.

    Change s_maxbytes to be loff_t instead. That should help eliminate the
    temptation to set it too large by making it a signed value.

    Also, add a warning for a couple of releases to help catch filesystems that
    set s_maxbytes too large. Eventually we can either convert this to a
    BUG() or just remove it, in the hope that no one will get it wrong now
    that it's a signed value.
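
    The warning is a one-liner at mount time; roughly (sketch):

        /* in vfs_kern_mount(): catch filesystems that overflow loff_t */
        WARN(mnt->mnt_sb->s_maxbytes < 0,
             "%s set sb->s_maxbytes to negative value (%lld)\n",
             type->name, mnt->mnt_sb->s_maxbytes);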

    Signed-off-by: Jeff Layton
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Robert Love
    Cc: Mandeep Singh Baines
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Jeff Layton
     

22 Sep, 2009

1 commit


16 Sep, 2009

1 commit

  • We do this automatically in get_sb_bdev() from the set_bdev_super()
    callback. Filesystems that have their own private backing_dev_info
    must assign that in ->fill_super().

    Note that ->s_bdi assignment is required for proper writeback!
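
    For a filesystem with a private bdi that means something like the
    following in its ->fill_super() (the myfs names are illustrative; the
    bdi is assumed to be initialised and registered elsewhere):

        static int myfs_fill_super(struct super_block *sb, void *data, int silent)
        {
            sb->s_bdi = &myfs_bdi;   /* no bdev, so set_bdev_super() can't do it */
            /* ... rest of superblock setup ... */
            return 0;
        }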

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Sep, 2009

2 commits

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT more smooth in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10-disk test with btrfs performs 26% faster with per-bdi flushing. An
    SSD-based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as do random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a first step at introducing per-bdi flusher threads. We should
    have no change in behaviour, although sb_has_dirty_inodes() is now
    ridiculously expensive, as there's no easy way to answer that question.
    Not a huge problem, since it'll be deleted in subsequent patches.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Jun, 2009

1 commit