24 Sep, 2009

3 commits

  • Currently we held s_umount while a filesystem is frozen, despite that we
    might return to userspace and unlock it from a different process. Instead
    grab an active reference to keep the file system busy and add an explicit
    check for frozen filesystems in remount and reject the remount instead
    of blocking on s_umount.

    Add a new get_active_super helper to super.c for use by freeze_bdev that
    grabs an active reference to a superblock from a given block device.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that we have the freeze count there is not much reason for bd_mount_sem
    anymore. The actual freeze/thaw operations are serialized using the
    bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
    get_sb_bdev which tries to prevent mounting a filesystem while the block
    device is frozen. Instead of add a check for bd_fsfreeze_count and
    return -EBUSY if a filesystem is frozen. While that is a change in user
    visible behaviour a failing mount is much better for this case rather
    than having the mount process stuck uninterruptible for a long time.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • sb->s_maxbytes is supposed to indicate the maximum size of a file that can
    exist on the filesystem. It's declared as an unsigned long long.

    Even if a filesystem has no inherent limit that prevents it from using
    every bit in that unsigned long long, it's still problematic to set it to
    anything larger than MAX_LFS_FILESIZE. There are places in the kernel
    that cast s_maxbytes to a signed value. If it's set too large then this
    cast makes it a negative number and generally breaks the comparison.

    Change s_maxbytes to be loff_t instead. That should help eliminate the
    temptation to set it too large by making it a signed value.

    Also, add a warning for couple of releases to help catch filesystems that
    set s_maxbytes too large. Eventually we can either convert this to a
    BUG() or just remove it and in the hope that no one will get it wrong now
    that it's a signed value.

    Signed-off-by: Jeff Layton
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Robert Love
    Cc: Mandeep Singh Baines
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Jeff Layton
     

22 Sep, 2009

1 commit


16 Sep, 2009

1 commit

  • We do this automatically in get_sb_bdev() from the set_bdev_super()
    callback. Filesystems that have their own private backing_dev_info
    must assign that in ->fill_super().

    Note that ->s_bdi assignment is required for proper writeback!

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Sep, 2009

2 commits

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT more smooth in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as does random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a first step at introducing per-bdi flusher threads. We should
    have no change in behaviour, although sb_has_dirty_inodes() is now
    ridiculously expensive, as there's no easy way to answer that question.
    Not a huge problem, since it'll be deleted in subsequent patches.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Jun, 2009

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Standard trick - add a new variable (start) such that
    for each n < start n is known to be busy. Allocation can
    skip checking everything in [0..start) and if it returns
    n, we can set start to n + 1. Freeing below start sets
    start to what we'd just freed.

    Of course, it still sucks if we do something like
    free 0
    allocate
    allocate
    in a loop - still O(n^2) time. However, on saner loads it
    improves the things a lot and the entire thing is not worth
    the trouble of switching to something with better worst-case
    behaviour.

    Signed-off-by: Al Viro

    Al Viro
     

17 Jun, 2009

1 commit

  • commit 337eb00a2c3a421999c39c94ce7e33545ee8baa7
    Push BKL down into ->remount_fs()
    and
    commit 4aa98cf768b6f2ea4b204620d949a665959214f6
    Push BKL down into do_remount_sb()

    were uncorrectly merged.
    The former removes one pair of lock/unlock_kernel(), but the latter adds
    several unlock_kernel(). Finally a few unlock_kernel() calls left.

    Signed-off-by: J. R. Okajima
    Signed-off-by: Al Viro

    J. R. Okajima
     

12 Jun, 2009

17 commits

  • [xfs, btrfs, capifs, shmem don't need BKL, exempt]

    Signed-off-by: Alessio Igor Bogani
    Signed-off-by: Al Viro

    Alessio Igor Bogani
     
  • Push down lock_super into ->write_super instances and remove it from the
    caller.

    Following filesystem don't need ->s_lock in ->write_super and are skipped:

    * bfs, nilfs2 - no other uses of s_lock and have internal locks in
    ->write_super
    * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
    * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
    ->write_super
    * xfs - no other uses of s_lock and uses internal lock (buffer lock on
    superblock buffer) to serialize ->write_super. Also xfs_fs_write_super
    is superflous and will go away in the next merge window

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • [folded fix from Jiri Slaby]

    Signed-off-by: Al Viro

    Al Viro
     
  • Note that since we can't run into contention between remount_fs and write_super
    (due to exclusion on s_umount), we have to care only about filesystems that
    touch lock_super() on their own. Out of those ext3, ext4, hpfs, sysv and ufs
    do need it; fat doesn't since its ->remount_fs() only accesses assign-once
    data (basically, it's "we have no atime on directories and only have atime on
    files for vfat; force nodiratime and possibly noatime into *flags").

    [folded a build fix from hch]

    Signed-off-by: Al Viro

    Al Viro
     
  • Move BKL into ->put_super from the only caller. A couple of
    filesystems had trivial enough ->put_super (only kfree and NULLing of
    s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
    hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most
    of them probably don't need it, but I'd rather sort that out individually.
    Preferably after all the other BKL pushdowns in that area.

    [AV: original used to move lock_super() down as well; these changes are
    removed since we don't do lock_super() at all in generic_shutdown_super()
    now]
    [AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • We can't run into contention on it. All other callers of lock_super()
    either hold s_umount (and we have it exclusive) or hold an active
    reference to superblock in question, which prevents the call of
    generic_shutdown_super() while the reference is held. So we can
    replace lock_super(s) with get_fs_excl() in generic_shutdown_super()
    (and corresponding change for unlock_super(), of course).

    Since ext4 expects s_lock held for its put_super, take lock_super()
    into it. The rest of filesystems do not care at all.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Merge the write_super helper into sync_super and move the check for
    ->write_super earlier so that we can avoid grabbing a reference to
    a superblock that doesn't have it.

    While we're at it also add a little comment documenting sync_supers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • We just did a full fs writeout using sync_filesystem before, and if
    that's not enough for the filesystem it can perform it's own writeout
    in ->put_super, which many filesystems already do.

    Move a call to foofs_write_super into every foofs_put_super for now to
    guarantee identical behaviour until it's cleaned up by the individual
    filesystem maintainers.

    Exceptions:

    - affs already has identical copy & pasted code at the beginning of
    affs_put_super so no need to do it twice.
    - xfs does the right thing without it and I have changes pending for
    the xfs tree touching this are so I don't really need conflicts
    here..

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Rename the function so that it better describe what it really does. Also
    remove the unnecessary include of buffer_head.h.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Move sync_filesystems(), __fsync_super(), fsync_super() from
    super.c to sync.c where it fits better.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

    Nice bonus is that we get a complete livelock avoidance and write_supers()
    is now only used for periodic writeback of superblocks.

    sync_blockdevs() introduced a couple of patches ago is gone now.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • __fsync_super() does the same thing as fsync_super(). So change the only
    caller to use fsync_super() and make __fsync_super() static. This removes
    unnecessarily duplicated call to sync_blockdev() and prepares ground
    for the changes to __fsync_super() in the following patches.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • sync_filesystems() has a condition that if wait == 0 and s_dirt == 0, then
    ->sync_fs() isn't called. This does not really make much sence since s_dirt is
    generally used by a filesystem to mean that ->write_super() needs to be called.
    But ->sync_fs() does different things. I even suspect that some filesystems
    (btrfs?) sets s_dirt just to fool this logic.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • So far, do_sync() called:
    sync_inodes(0);
    sync_supers();
    sync_filesystems(0);
    sync_filesystems(1);
    sync_inodes(1);

    This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
    submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
    transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
    not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
    and others are hit by this) when racing e.g. with background writeback. A
    similar problem hits also other filesystems (e.g. ext2) because of
    write_supers() being called before the sync_inodes(1).

    Change the ordering of calls in do_sync() - this requires a new function
    sync_blockdevs() to preserve the property that block devices are always synced
    after write_super() / sync_fs() call.

    The same issue is fixed in __fsync_super() function used on umount /
    remount read-only.

    [AV: build fixes]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Remove the unused s_async_list in the superblock, a leftover of the
    broken async inode deletion code that leaked into mainline. Having this
    in the middle of the sync/unmount path is not helpful for the following
    cleanups.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • This function walks the s_files lock, and operates primarily on the
    files in a superblock, so it better belongs here (eg. see also
    fs_may_remount_ro).

    [AV: ... and it shouldn't be static after that move]

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

09 May, 2009

2 commits

  • Signed-off-by: H Hartley Sweeten
    Cc: Subrata Modak
    Signed-off-by: Al Viro

    H Hartley Sweeten
     
  • Does equivalent of up_write(&s->s_umount); deactivate_super(s);
    However, it does not does not unlock it until it's all over.
    As the result, it's safe to use to dispose of new superblock on ->get_sb()
    failure exits - nobody will see the sucker until it's all over.
    Equivalent using up_write/deactivate_super is safe for that purpose
    if superblock is either safe to use or has NULL ->s_root when we unlock.
    Normally filesystems take the required precautions, but
    a) we do have bugs in that area in some of them.
    b) up_write/deactivate_super sequence is extremely common,
    so the helper makes sense anyway.

    Signed-off-by: Al Viro

    Al Viro
     

07 Apr, 2009

1 commit


03 Apr, 2009

1 commit


28 Mar, 2009

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits)
    fs: avoid I_NEW inodes
    Merge code for single and multiple-instance mounts
    Remove get_init_pts_sb()
    Move common mknod_ptmx() calls into caller
    Parse mount options just once and copy them to super block
    Unroll essentials of do_remount_sb() into devpts
    vfs: simple_set_mnt() should return void
    fs: move bdev code out of buffer.c
    constify dentry_operations: rest
    constify dentry_operations: configfs
    constify dentry_operations: sysfs
    constify dentry_operations: JFS
    constify dentry_operations: OCFS2
    constify dentry_operations: GFS2
    constify dentry_operations: FAT
    constify dentry_operations: FUSE
    constify dentry_operations: procfs
    constify dentry_operations: ecryptfs
    constify dentry_operations: CIFS
    constify dentry_operations: AFS
    ...

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits)
    ext2: Zero our b_size in ext2_quota_read()
    trivial: fix typos/grammar errors in fs/Kconfig
    quota: Coding style fixes
    quota: Remove superfluous inlines
    quota: Remove uppercase aliases for quota functions.
    nfsd: Use lowercase names of quota functions
    jfs: Use lowercase names of quota functions
    udf: Use lowercase names of quota functions
    ufs: Use lowercase names of quota functions
    reiserfs: Use lowercase names of quota functions
    ext4: Use lowercase names of quota functions
    ext3: Use lowercase names of quota functions
    ext2: Use lowercase names of quota functions
    ramfs: Remove quota call
    vfs: Use lowercase names of quota functions
    quota: Remove dqbuf_t and other cleanups
    quota: Remove NODQUOT macro
    quota: Make global quota locks cacheline aligned
    quota: Move quota files into separate directory
    ext4: quota reservation for delayed allocation
    ...

    Linus Torvalds
     
  • simple_set_mnt() is defined as returning 'int' but always returns 0.
    Callers assume simple_set_mnt() never fails and don't properly cleanup if
    it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev()
    should:

    up_write(sb->s_unmount);
    deactivate_super(sb);

    if simple_set_mnt() fails.

    Since simple_set_mnt() never fails, would be cleaner if it did not
    return anything.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Sukadev Bhattiprolu
     

26 Mar, 2009

2 commits


13 Mar, 2009

1 commit

  • In sget(), destroy_super(s) is called with s->s_umount held, which makes
    lockdep unhappy.

    Signed-off-by: Li Zefan
    Cc: Al Viro
    Acked-by: Peter Zijlstra
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

19 Feb, 2009

1 commit

  • Li Zefan said:

    Thread 1:
    for ((; ;))
    {
    mount -t cpuset xxx /mnt > /dev/null 2>&1
    cat /mnt/cpus > /dev/null 2>&1
    umount /mnt > /dev/null 2>&1
    }

    Thread 2:
    for ((; ;))
    {
    mount -t cpuset xxx /mnt > /dev/null 2>&1
    umount /mnt > /dev/null 2>&1
    }

    (Note: It is irrelevant which cgroup subsys is used.)

    After a while a lockdep warning showed up:

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.28 #479
    ---------------------------------------------
    mount/13554 is trying to acquire lock:
    (&type->s_umount_key#19){--..}, at: [] sget+0x5e/0x321

    but task is already holding lock:
    (&type->s_umount_key#19){--..}, at: [] sget+0x1e2/0x321

    other info that might help us debug this:
    1 lock held by mount/13554:
    #0: (&type->s_umount_key#19){--..}, at: [] sget+0x1e2/0x321

    stack backtrace:
    Pid: 13554, comm: mount Not tainted 2.6.28-mc #479
    Call Trace:
    [] validate_chain+0x4c6/0xbbd
    [] __lock_acquire+0x676/0x700
    [] lock_acquire+0x5d/0x7a
    [] ? sget+0x5e/0x321
    [] down_write+0x34/0x50
    [] ? sget+0x5e/0x321
    [] sget+0x5e/0x321
    [] ? cgroup_set_super+0x0/0x3e
    [] ? cgroup_test_super+0x0/0x2f
    [] cgroup_get_sb+0x98/0x2e7
    [] cpuset_get_sb+0x4a/0x5f
    [] vfs_kern_mount+0x40/0x7b
    [] do_kern_mount+0x37/0xbf
    [] do_mount+0x5c3/0x61a
    [] ? copy_mount_options+0x2c/0x111
    [] sys_mount+0x69/0xa0
    [] sysenter_do_call+0x12/0x31

    The cause is after alloc_super() and then retry, an old entry in list
    fs_supers is found, so grab_super(old) is called, but both functions hold
    s_umount lock:

    struct super_block *sget(...)
    {
    ...
    retry:
    spin_lock(&sb_lock);
    if (test) {
    list_for_each_entry(old, &type->fs_supers, s_instances) {
    if (!test(old, data))
    continue;
    if (!grab_super(old)) s_umount);
    goto retry;
    if (s)
    destroy_super(s);
    return old;
    }
    }
    if (!s) {
    spin_unlock(&sb_lock);
    s = alloc_super(type); s_umount)
    if (!s)
    return ERR_PTR(-ENOMEM);
    goto retry;
    }
    ...
    }

    It seems like a false positive, and seems like VFS but not cgroup needs to
    be fixed.

    Peter said:

    We can simply put the new s_umount instance in a but lockdep doesn't
    particularly cares about subclass order.

    If there's any issue with the callers of sget() assuming the s_umount lock
    being of sublcass 0, then there is another annotation we can use to fix
    that, but lets not bother with that if this is sufficient.

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12673

    Signed-off-by: Peter Zijlstra
    Tested-by: Li Zefan
    Reported-by: Li Zefan
    Cc: Al Viro
    Cc: Paul Menage
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

09 Feb, 2009

1 commit


14 Jan, 2009

1 commit