12 Aug, 2010

1 commit

  • Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
    that the latter can be avoided when under global dirty background
    threshold (which is the normal state for most systems).

    Signed-off-by: Wu Fengguang
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

08 Aug, 2010

17 commits

  • Fix a bug where a lock is _bh nested within another _bh lock,
    but forgets to use the _bh variant for unlock.

    Further more, it's not necessary to test _bh locks, the inner lock
    can just use spin_lock(). So fix up the bug by making that change.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch makes sure we first initialize everything and set the BDI_registered
    flag, and only after this we add the bdi to 'bdi_list'. Current code adds the
    bdi to the list too early, and as a result I the

    WARN(!test_bit(BDI_registered, &bdi->state)

    in bdi forker is triggered. Also, it is in general good practice to make things
    visible only when they are fully initialized.

    Also, this patch does few micro clean-ups:
    1. Removes the 'exit' label which does not do anything, just returns. This
    allows to get rid of few braces and 'ret' variable and make the code smaller.
    2. If 'kthread_run()' fails, remove the error code it returns, not hard-coded
    '-ENOMEM'. Theoretically, some day 'kthread_run()' can return something
    else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to
    NULL.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Add 2 new trace points to the periodic write-back wake up case, just like we do
    in the 'bdi_queue_work()' function. Namely, introduce:

    1. trace_writeback_wake_thread(bdi)
    2. trace_writeback_wake_forker_thread(bdi)

    The first event is triggered every time we wake up a bdi thread to start
    periodic background write-out. The second event is triggered only when the bdi
    thread does not exist and should be created by the forker thread.

    This patch was suggested by Dave Chinner and Christoph Hellwig.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • The 'setup_timer()' function also calls 'init_timer()', so the extra
    'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically
    'init_timer()' plus callback function and data pointers initialization.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which
    should take care of the periodic background write-out. However, the write-out
    will actually start only 'dirty_writeback_interval' centisecs later, so we can
    delay the wake-up.

    This change was requested by Nick Piggin who pointed out that if we delay the
    wake-up, we weed out 2 unnecessary contex switches, which matters because
    '__mark_inode_dirty()' is a hot-path function.

    This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
    sets up a timer to wake-up the bdi thread and returns. So the wake-up is
    delayed.

    We also delete the timer in bdi threads just before writing-back. And
    synchronously delete it when unregistering bdi. At the unregister point the bdi
    does not have any users, so no one can arm it again.

    Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
    context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
    this change as well.

    This patch also moves the 'bdi_wb_init()' function down in the file to avoid
    forward-declaration of 'bdi_wakeup_thread_delayed()'.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
    bad for battery-driven devices.

    There are two types of activities bdi threads do:
    1. process bdi works from the 'bdi->work_list'
    2. periodic write-back

    So there are 2 sources of wake-up events for bdi threads:

    1. 'bdi_queue_work()' - submits bdi works
    2. '__mark_inode_dirty()' - adds dirty I/O to bdi's

    The former already has bdi wake-up code. The latter does not, and this patch
    adds it.

    '__mark_inode_dirty()' is hot-path function, but this patch adds another
    'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
    the bdi has no dirty inodes. So adding this spinlock should be fine and should
    not affect performance.

    This patch makes sure bdi threads and the forker thread do not wake-up if there
    is nothing to do. The forker thread will nevertheless wake up at least every
    5 min. to check whether it has to kill a bdi thread. This can also be optimized,
    but is not worth it.

    This patch also tidies up the warning about unregistered bid, and turns it from
    an ugly crocodile to a simple 'WARN()' statement.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently, bdi threads can decide to exit if there were no useful activities
    for 5 minutes. However, this causes nasty races: we can easily oops in the
    'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.

    And even if we do not oops, but the bdi tread exits immediately after we wake
    it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
    in the bdi work processing.

    This patch makes the forker thread to be the central place which not only
    creates bdi threads, but also kills them if they were inactive long enough.
    This better design-wise.

    Another reason why this change was done is to prepare for the further changes
    which will prevent the bdi threads from waking up every 5 sec and wasting
    power. Indeed, when the task does not wake up periodically anymore, it won't be
    able to exit either.

    This patch also moves the the 'wake_up_bit()' call from the bdi thread to the
    forker thread as well. So now the forker thread sets the BDI_pending bit, then
    forks the task or kills it, then clears the bit and wakes up the waiting
    process.

    The only process which may wain on the bit is 'bdi_wb_shutdown()'. This
    function was changed as well - now it first removes the bdi from the
    'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
    guaranteed that the forker thread won't race with it, because the bdi is not
    visible. Note, the forker thread sets the 'BDI_pending' bit under the
    'bdi->wb_lock' which is essential for proper serialization.

    And additionally, when we change 'bdi->wb.task', we now take the
    'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
    would when raced with, say, 'bdi_queue_work()'.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • This patch re-structures the bdi forker a little:
    1. Add 'bdi_cap_flush_forker(bdi)' condition check to the bdi loop. The reason
    for this is that the forker thread can start _before_ the 'BDI_registered'
    flag is set (see 'bdi_register()'), so the WARN() statement will fire for
    the default bdi. I observed this warning at boot-up.

    2. Introduce an enum 'action' and use "switch" statement in the outer loop.
    This is a preparation to the further patch which will teach the forker
    thread killing bdi threads, so we'll have another case in the "switch"
    statement. This change was suggested by Christoph Hellwig.

    This patch is just a small step towards the coming change where the forker
    thread will kill the bdi threads. It should simplify reviewing the following
    changes, which would otherwise be larger.

    This patch also amends comments a little.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • The forker thread removes bdis from 'bdi_list' before forking the bdi thread.
    But this is wrong for at least 2 reasons.

    Reason #1: if we temporary remove a bdi from the list, we may miss works which
    would otherwise be given to us.

    Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are
    always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when
    it races with the forker thread, it can shut down the bdi thread
    at the same time as the forker creates it.

    This patch makes sure the forker thread never removes bdis from 'bdi_list'
    (which was suggested by Christoph Hellwig).

    In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to
    hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending'
    flag.

    NOTE! The error path is interesting. Currently, when we fail to create a bdi
    thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the
    bdi from the list, we cannot move it to the tail either, because then we can
    mess up the RCU readers which walk the list. And also, we'll have the race
    described above in "Reason #2".

    But I not think that adding to the tail is any important so I just do not do
    that.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • This patch simplifies bdi code a little by removing the 'pending_list' which is
    redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is
    working like this:

    1. In a loop, fetch all bdi's which have works but have no writeback thread and
    move them to the 'pending_list'.
    2. If the list is empty, sleep for 5 sec.
    3. Otherwise, take one bdi from the list, fork the writeback thread for this
    bdi, and repeat the loop.

    IOW, it first moves everything to the 'pending_list', then process only one
    element, and so on. This patch simplifies the algorithm, which is now as
    follows.

    1. Find the first bdi which has a work and remove it from the global list of
    bdi's (bdi_list).
    2. If there was not such bdi, sleep 5 sec.
    3. Fork the writeback thread for this bdi and repeat the loop.

    IOW, now we find the first bdi to process, process it, and so on. This is
    simpler and involves less lists.

    The bonus now is that we can get rid of a couple of functions, as well as
    remove complications which involve 'rcu_call()' and 'bdi->rcu_head'.

    This patch also makes sure we use 'list_add_tail_rcu()', instead of plain
    'list_add_tail()', but this piece of code is going to be removed in the next
    patch anyway.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently, if someone submits jobs for the default bdi, we can lose wake-up
    events. E.g., this can happen if 'bdi_queue_work()' is called when
    'bdi_forker_thread()' is executing code after 'wb_do_writeback(me, 0)', but
    before 'set_current_state(TASK_INTERRUPTIBLE)'.

    This situation is unlikely, and the result is not very severe - we'll just
    delay the execution of the work, but this is still not very nice.

    This patch fixes the issue by checking whether the default bdi has works before
    the forker thread goes sleep.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently the forker thread can lose wake-ups which may lead to unnecessary
    delays in processing bdi works. E.g., consider the following scenario.

    1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to
    do, and is about to finish the loop.
    2. A bdi thread decides to exit because it was inactive for long time.
    3. 'bdi_queue_work()' adds a work to the bdi which just exited, so it wakes up
    the forker thread.
    4. but 'bdi_forker_thread()' executes 'set_current_state(TASK_INTERRUPTIBLE)'
    and goes sleep. We lose a wake-up.

    Losing the wake-up is not fatal, but this means that the bdi work processing
    will be delayed by up to 5 sec. This race is theoretical, I never hit it, but
    it is worth fixing.

    The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking
    'bdi_list', not after.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • This patch fixes a very unlikely race condition on the bdi forker thread error
    path: when bdi thread creation fails, 'bdi->wb.task' may contain the error code
    for a short period of time. If at the same time someone submits a work to this
    bdi, we can end up with an oops 'bdi_queue_work()' while executing
    'wake_up_process(wb->task)'.

    This patch fixes the issue by introducing a temporary variable 'task' and
    storing the possible error code there, so that 'wb->task' would never take
    erroneous values.

    Note, this race is very unlikely and I never hit it, so it is theoretical, but
    nevertheless worth fixing.

    This patch also merges 2 comments which were previously separate.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • The write-back code mixes words "thread" and "task" for the same things. This
    is not a big deal, but still an inconsistency.

    hch: a convention I tend to use and I've seen in various places
    is to always use _task for the storage of the task_struct pointer,
    and thread everywhere else. This especially helps with having
    foo_thread for the actual thread and foo_task for a global
    variable keeping the task_struct pointer

    This patch renames:
    * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
    * 'bdi_forker_task()' -> 'bdi_forker_thread()'

    because bdi threads are 'bdi_writeback_thread()', so these names are more
    consistent.

    This patch also amends commentaries and makes them refer the forker and bdi
    threads as "thread", not "task".

    Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
    'static void' instead of 'void static' and make checkpatch.pl happy.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Trace queue/sched/exec parts of the writeback loop. This provides
    insight into when and why flusher threads are scheduled to run. e.g
    a sync invocation leaves traces like:

    sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
    flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0

    This also lays the foundation for adding more writeback tracing to
    provide deeper insight into the whole writeback path.

    The original tracing code is from Jens Axboe, though this version is
    a rewrite as a result of the code being traced changing
    significantly.

    Signed-off-by: Dave Chinner
    Signed-off-by: Jens Axboe

    Dave Chinner
     
  • Move all code for the writeback thread into fs/fs-writeback.c instead of
    splitting it over two functions in two files.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The wb_list member of struct backing_device_info always has exactly one
    element. Just use the direct bdi->wb pointer instead and simplify some
    code.

    Also remove bdi_task_init which is now trivial to prepare for the next
    patch.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Jul, 2010

2 commits

  • First remove items from work_list as soon as we start working on them. This
    means we don't have to track any pending or visited state and can get
    rid of all the RCU magic freeing the work items - we can simply free
    them once the operation has finished. Second use a real completion for
    tracking synchronous requests - if the caller sets the completion pointer
    we complete it, otherwise use it as a boolean indicator that we can free
    the work item directly. Third unify struct wb_writeback_args and struct
    bdi_work into a single data structure, wb_writeback_work. Previous we
    set all parameters into a struct wb_writeback_args, copied it into
    struct bdi_work, copied it again on the stack to use it there. Instead
    of just allocate one structure dynamically or on the stack and use it
    all the way through the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This was just an odd wrapper around writeback_inodes_wb. Removing this
    also allows to get rid of the bdi member of struct writeback_control
    which was rather out of place there.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

22 May, 2010

1 commit

  • Commit 69b62d01 fixed up most of the places where we would enter
    busy schedule() spins when disabling the periodic background
    writeback. This fixes up the sb timer so that it doesn't get
    hammered on with the delay disabled, and ensures that it gets
    rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
    gets modified.

    bdi_forker_task() also needs to check for !dirty_writeback_centisecs
    and use schedule() appropriately, fix that up too.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Apr, 2010

1 commit

  • noop_backing_dev_info is used only as a flag to mark filesystems that
    don't have any backing store, like tmpfs, procfs, spufs, etc.

    Signed-off-by: Joern Engel

    Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
    to the noop_backing_dev_info is not legal and will not result in
    them being flushed, but we already catch this condition in
    __mark_inode_dirty() when checking for a registered bdi.

    Signed-off-by: Jens Axboe

    Jörn Engel
     

22 Apr, 2010

1 commit


02 Apr, 2010

1 commit

  • I hit this when we had a bug in IDR for a few days. Basically sysfs would
    fail to create new inodes since it uses an IDR and therefore class_create would
    fail.

    While we are unlikely to see this fail we may as well handle it instead of
    oopsing.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Jens Axboe

    Anton Blanchard
     

03 Dec, 2009

1 commit

  • To touch task->flags directly is racy. thaw_process() still has race
    (changing non_current->flags, but this is another issue) though, I think
    it's much better off.

    So, use thaw_process() instead.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Jens Axboe

    OGAWA Hirofumi
     

12 Nov, 2009

1 commit

  • Unfreezes the bdi flusher task when the said task needs to exit.

    Steps to reproduce this.
    1) Mount a file system from MMC/SD card.
    2) Unmount the file system. This creates a flusher task.
    3) Attempt suspend to RAM. System is unresponsive.

    This is because the bdi flusher thread is already in the refrigerator and will
    remain so until it is thawed. The MMC driver suspend routine call stack will
    ultimately issue a 'kthread_stop' on the bdi flusher thread and will block
    until the flusher thread is exited. Since the bdi flusher thread is in the
    refrigerator it never cleans up until thawed.

    Signed-off-by: Romit Dasgupta
    Signed-off-by: Jens Axboe

    Romit Dasgupta
     

04 Nov, 2009

1 commit


29 Oct, 2009

1 commit

  • When the bdi is being removed, we have to ensure that no super_blocks
    currently have that cached in sb->s_bdi. Normally this is ensured by
    the sb having a longer life span than the bdi, but if the device is
    suddenly yanked, we have to kill this reference. sb->s_bdi is pointed
    to freed memory at that point.

    This fixes a problem with sync(1) hanging when a USB stick is pulled
    without cleanly umounting it first.

    Reported-by: Pavel Machek
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Oct, 2009

1 commit


16 Sep, 2009

2 commits


11 Sep, 2009

5 commits

  • Also a debugging aid. We want to catch dirty inodes being added to
    backing devices that don't do writeback.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This enables us to track who does what and print info. Its main use
    is catching dirty inodes on the default_backing_dev_info, so we can
    fix that up.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Add some debug entries to be able to inspect the internal state of
    the writeback details.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT more smooth in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as does random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a first step at introducing per-bdi flusher threads. We should
    have no change in behaviour, although sb_has_dirty_inodes() is now
    ridiculously expensive, as there's no easy way to answer that question.
    Not a huge problem, since it'll be deleted in subsequent patches.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jul, 2009

1 commit


06 Apr, 2009

1 commit


26 Mar, 2009

1 commit


07 Jan, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    rcu: fix rcutorture bug
    rcu: eliminate synchronize_rcu_xxx macro
    rcu: make treercu safe for suspend and resume
    rcu: fix rcutree grace-period-latency bug on small systems
    futex: catch certain assymetric (get|put)_futex_key calls
    futex: make futex_(get|put)_key() calls symmetric
    locking, percpu counters: introduce separate lock classes
    swiotlb: clean up EXPORT_SYMBOL usage
    swiotlb: remove unnecessary declaration
    swiotlb: replace architecture-specific swiotlb.h with linux/swiotlb.h
    swiotlb: add support for systems with highmem
    swiotlb: store phys address in io_tlb_orig_addr array
    swiotlb: add hwdev to swiotlb_phys_to_bus() / swiotlb_sg_to_bus()

    Linus Torvalds