26 Sep, 2009

12 commits

  • Sometimes we only want to write pages from a specific super_block,
    so allow that to be passed in.

    This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
    causing writeback on all super_blocks on a bdi, where we only really
    want to sync a specific sb from writeback_inodes_sb().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pointless to iterate other devices looking for a super, when
    we have a bdi mapping.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Debug traces show that in per-bdi writeback, the inode under writeback
    almost always get redirtied by a busy dirtier. We used to call
    redirty_tail() in this case, which could delay inode for up to 30s.

    This is unacceptable because it now happens so frequently for plain cp/dd,
    that the accumulated delays could make writeback of big files very slow.

    So let's distinguish between data redirty and metadata only redirty.
    The first one is caused by a busy dirtier, while the latter one could
    happen in XFS, NFS, etc. when they are doing delalloc or updating isize.

    The inode being busy dirtied will now be requeued for next io, while
    the inode being redirtied by fs will continue to be delayed to avoid
    repeated IO.

    CC: Jan Kara
    CC: Theodore Ts'o
    CC: Dave Chinner
    CC: Chris Mason
    CC: Christoph Hellwig
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     
  • Currently we pin the inode->i_sb for every single inode. This
    increases cache traffic on sb->s_umount sem. Lets instead
    cache the inode sb pin state and keep the super_block pinned
    for as long as keep writing out inodes from the same
    super_block.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If we only moved inodes from a single super_block to the temporary
    list, there's no point in doing a resort for multiple super_blocks.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • __mark_inode_dirty adds inode to wb dirty list in random order. If a disk has
    several partitions, writeback might keep spindle moving between partitions.
    To reduce the move, better write big chunk of one partition and then move to
    another. Inodes from one fs usually are in one partion, so idealy move indoes
    from one fs together should reduce spindle move. This patch tries to address
    this. Before per-bdi writeback is added, the behavior is write indoes
    from one fs first and then another, so the patch restores previous behavior.
    The loop in the patch is a bit ugly, should we add a dirty list for each
    superblock in bdi_writeback?

    Test in a two partition disk with attached fio script shows about 3% ~ 6%
    improvement.

    Signed-off-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • And throw some comments in there, too.

    Reviewed-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Make the if-else straight in writeback_single_inode().
    No behavior change.

    Cc: Jan Kara
    Cc: Michael Rubin
    Cc: Peter Zijlstra
    Signed-off-by: Fengguang Wu
    Signed-off-by: Jens Axboe

    Wu Fengguang
     
  • Fix the kupdate case, which disregards wbc.more_io and stop writeback
    prematurely even when there are more inodes to be synced.

    wbc.more_io should always be respected.

    Also remove the pages_skipped check. It will set when some page(s) of some
    inode(s) cannot be written for now. Such inodes will be delayed for a while.
    This variable has nothing to do with whether there are other writeable inodes.

    CC: Jan Kara
    CC: Dave Chinner
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     
  • Treat bdi_start_writeback(0) as a special request to do background write,
    and stop such work when we are below the background dirty threshold.

    Also simplify the (nr_pages
    CC: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     
  • If all inodes are under writeback (e.g. in case when there's only one inode
    with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
    to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
    waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
    write anything.

    Tested-by: Wu Fengguang
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

16 Sep, 2009

12 commits


14 Sep, 2009

1 commit


11 Sep, 2009

5 commits

  • Also a debugging aid. We want to catch dirty inodes being added to
    backing devices that don't do writeback.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It is now unused, so kill it off.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT more smooth in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as does random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a first step at introducing per-bdi flusher threads. We should
    have no change in behaviour, although sb_has_dirty_inodes() is now
    ridiculously expensive, as there's no easy way to answer that question.
    Not a huge problem, since it'll be deleted in subsequent patches.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This adds two new exported functions:

    - writeback_inodes_sb(), which only attempts to writeback dirty inodes on
    this super_block, for WB_SYNC_NONE writeout.
    - sync_inodes_sb(), which writes out all dirty inodes on this super_block
    and also waits for the IO to complete.

    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Jun, 2009

1 commit

  • There is no reason to for the split between __writeback_single_inode and
    __sync_single_inode, the former just does a couple of checks before
    tail-calling the latter. So merge the two, and while we're at it split
    out the I_SYNC waiting case for data integrity writers, as it's
    logically separate function. Finally rename __writeback_single_inode to
    writeback_single_inode.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

17 Jun, 2009

1 commit

  • 1) I_FREEING tests should be coupled with I_CLEAR

    The two I_FREEING tests are racy because clear_inode() can set i_state to
    I_CLEAR between the clear of I_SYNC and the test of I_FREEING.

    2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
    races with generic_forget_inode()

    generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
    generic_sync_sb_inodes() shall not try to step in and create possible races:

    generic_forget_inode
    inode->i_state |= I_WILL_FREE;
    spin_unlock(&inode_lock);
    generic_sync_sb_inodes()
    spin_lock(&inode_lock);
    __iget(inode);
    __writeback_single_inode
    // see non zero i_count
    may WARN here ==> WARN_ON(inode->i_state & I_WILL_FREE);
    spin_unlock(&inode_lock);
    may call generic_forget_inode again ==> iput(inode);

    The above race and warning didn't turn up because writeback_inodes() holds
    the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
    early. But we are not sure the UBIFS calls and future callers will
    guarantee that. So skip I_WILL_FREE inodes for the sake of safety.

    Cc: Eric Sandeen
    Acked-by: Jeff Layton
    Cc: Masayoshi MIZUMA
    Signed-off-by: Wu Fengguang
    Cc: Artem Bityutskiy
    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

12 Jun, 2009

3 commits

  • I think the block_dump output in __mark_inode_dirty is missing dentry locking.
    Surely the i_dentry list can change any time, so we may not even *get* a
    dentry there. If we do get one by chance, then it would appear to be able to
    go away or get renamed at any time...

    Signed-off-by: Al Viro

    Nick Piggin
     
  • Some filesystems can call in to sync an inode that is still in the
    I_NEW state (eg. ext family, when mounted with -osync). This is OK
    because the filesystem has sole access to the new inode, so it can
    modify i_state without races (because no other thread should be
    modifying it, by definition of I_NEW). Ie. a false positive, so
    remove the warnings.

    The races are described here 7ef0d7377cb287e08f3ae94cebc919448e1f5dff,
    which is also where the warnings were introduced.

    Reported-by: Stephen Hemminger
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

    Nice bonus is that we get a complete livelock avoidance and write_supers()
    is now only used for periodic writeback of superblocks.

    sync_blockdevs() introduced a couple of patches ago is gone now.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

04 Apr, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
    trivial: Update my email address
    trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
    trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
    trivial: Fix misspelling of "Celsius".
    trivial: remove unused variable 'path' in alloc_file()
    trivial: fix a pdlfush -> pdflush typo in comment
    trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
    trivial: wusb: Storage class should be before const qualifier
    trivial: drivers/char/bsr.c: Storage class should be before const qualifier
    trivial: h8300: Storage class should be before const qualifier
    trivial: fix where cgroup documentation is not correctly referred to
    trivial: Give the right path in Documentation example
    trivial: MTD: remove EOL from MODULE_DESCRIPTION
    trivial: Fix typo in bio_split()'s documentation
    trivial: PWM: fix of #endif comment
    trivial: fix typos/grammar errors in Kconfig texts
    trivial: Fix misspelling of firmware
    trivial: cgroups: documentation typo and spelling corrections
    trivial: Update contact info for Jochen Hein
    trivial: fix typo "resgister" -> "register"
    ...

    Linus Torvalds
     

03 Apr, 2009

2 commits

  • The dirtied_when value on an inode is supposed to represent the first time
    that an inode has one of its pages dirtied. This value is in units of
    jiffies. It's used in several places in the writeback code to determine
    when to write out an inode.

    The problem is that these checks assume that dirtied_when is updated
    periodically. If an inode is continuously being used for I/O it can be
    persistently marked as dirty and will continue to age. Once the time
    compared to is greater than or equal to half the maximum of the jiffies
    type, the logic of the time_*() macros inverts and the opposite of what is
    needed is returned. On 32-bit architectures that's just under 25 days
    (assuming HZ == 1000).

    As the least-recently dirtied inode, it'll end up being the first one that
    pdflush will try to write out. sync_sb_inodes does this check:

    /* Was this inode dirtied after sync_sb_inodes was called? */
    if (time_after(inode->dirtied_when, start))
    break;

    ...but now dirtied_when appears to be in the future. sync_sb_inodes bails
    out without attempting to write any dirty inodes. When this occurs,
    pdflush will stop writing out inodes for this superblock. Nothing can
    unwedge it until jiffies moves out of the problematic window.

    This patch fixes this problem by changing the checks against dirtied_when
    to also check whether it appears to be in the future. If it does, then we
    consider the value to be far in the past.

    This should shrink the problematic window of time to such a small period
    (30s) as not to matter.

    Signed-off-by: Jeff Layton
    Signed-off-by: Wu Fengguang
    Acked-by: Ian Kent
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so
    _outside_ of inode_lock. So any I_FREEING testing is incomplete without a
    coupled testing of I_CLEAR.

    So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
    add_dquot_ref().

    Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
    reminds fixing the other two cases.

    Masayoshi MIZUMA has a nice panic flow:

    =====================================================================
    [process A] | [process B]
    | |
    | prune_icache() | drop_pagecache()
    | spin_lock(&inode_lock) | drop_pagecache_sb()
    | inode->i_state |= I_FREEING; | |
    | spin_unlock(&inode_lock) | V
    | | | spin_lock(&inode_lock)
    | V | |
    | dispose_list() | |
    | list_del() | |
    | clear_inode() | |
    | inode->i_state = I_CLEAR | |
    | | | V
    | | | if (inode->i_state & (I_FREEING|I_WILL_FREE))
    | | | continue;
    Reviewed-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

30 Mar, 2009

1 commit


13 Mar, 2009

1 commit

  • There was a report of a data corruption
    http://lkml.org/lkml/2008/11/14/121. There is a script included to
    reproduce the problem.

    During testing, I encountered a number of strange things with ext3, so I
    tried ext2 to attempt to reduce complexity of the problem. I found that
    fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
    cleared, even though instrumentation showed that unlock_new_inode had
    already been called for that inode. This points to memory scribble, or
    synchronisation problme.

    i_state of I_NEW inodes is not protected by inode_lock because other
    processes are not supposed to touch them until I_LOCK (and I_NEW) is
    cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
    i_state revealed that generic_sync_sb_inodes is picking up new inodes from
    the inode lists and passing them to __writeback_single_inode without
    waiting for I_NEW. Subsequently modifying i_state causes corruption. In
    my case it would look like this:

    CPU0 CPU1
    unlock_new_inode() __sync_single_inode()
    reg i_state
    reg -> reg & ~(I_LOCK|I_NEW) reg i_state
    reg -> inode->i_state reg -> reg | I_SYNC
    reg -> inode->i_state

    Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.

    Fix for this is rather than wait for I_NEW inodes, just skip over them:
    inodes concurrently being created are not subject to data integrity
    operations, and should not significantly contribute to dirty memory
    either.

    After this change, I'm unable to reproduce any of the added warnings or
    hangs after ~1hour of running. Previously, the new warnings would start
    immediately and hang would happen in under 5 minutes.

    I'm also testing on ext3 now, and so far no problems there either. I
    don't know whether this fixes the problem reported above, but it fixes a
    real problem for me.

    Cc: "Jorge Boncompte [DTI2]"
    Reported-by: Adrian Hunter
    Cc: Jan Kara
    Cc:
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin