08 Nov, 2013

1 commit


12 Sep, 2013

1 commit

  • The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
    unprivileged users) from growing a large number of dirty pages before
    throttling. For such filesystems balance_dirty_pages always checks bdi
    counters against bdi limits. I.e. even if global "nr_dirty" is under
    "freerun", it's not allowed to skip bdi checks. The only use case for now
    is fuse: it sets bdi max_ratio to 1% by default and system administrators
    are supposed to expect that this limit won't be exceeded.

    The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
    filesystem may set the flag when it initializes its BDI.
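
    The check described above can be sketched as follows. This is a
    simplified illustration, not the kernel code: struct bdi_sketch, its
    fields and may_skip_throttling() are hypothetical names, and the per-bdi
    threshold is assumed to be precomputed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one backing device (not the kernel struct). */
struct bdi_sketch {
    bool strictlimit;       /* stands in for BDI_CAP_STRICTLIMIT */
    unsigned long dirty;    /* dirty pages accounted to this bdi */
    unsigned long freerun;  /* per-bdi threshold below which writers run free */
};

/* Returns true if a writer to this bdi may skip throttling for now. */
static bool may_skip_throttling(const struct bdi_sketch *bdi,
                                unsigned long global_dirty,
                                unsigned long global_freerun)
{
    assert(bdi != NULL);
    if (bdi->strictlimit)
        return bdi->dirty <= bdi->freerun;  /* global slack does not help */
    return global_dirty <= global_freerun;  /* ordinary bdi: global check only */
}
```

    With the flag set, a bdi that is over its own threshold is throttled
    even when the global dirty count is far below the global freerun.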

    The problematic scenario comes from the fact that nobody pays attention to
    the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
    writeback). The implementation of fuse writeback releases original page
    (by calling end_page_writeback) almost immediately. A fuse request queued
    for real processing bears a copy of original page. Hence, if userspace
    fuse daemon doesn't finalize write requests in timely manner, an
    aggressive mmap writer can pollute virtually all memory by those temporary
    fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but
    nobody cares.

    To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
    problem" as a shortcut for "a possibility of uncontrolled growth of the
    amount of RAM consumed by temporary pages allocated by kernel fuse to
    process writeback".

    The problem was very easy to reproduce. There is a trivial example
    filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I
    added "sleep(1);" to the write methods, then recompiled and mounted it.
    Then created a huge file on the mount point and ran a simple program which
    mmap-ed the file to a memory region, then wrote data to the region. An
    hour later I observed almost all RAM consumed by fuse writeback. Since
    then some unrelated changes in kernel fuse made it more difficult to
    reproduce, but it is still possible now.

    Putting this theoretical happens-in-the-lab thing aside, there is another
    thing that really hurts real world (FUSE) users. This is write-through
    page cache policy FUSE currently uses. I.e. handling write(2), kernel
    fuse populates page cache and flushes user data to the server
    synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
    ("writeback cache policy") solve the problem, but they also make resolving
    NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
    a huge file to a fuse mount would result in memory starvation. Miklos,
    the maintainer of FUSE, believes the strictlimit feature is the way to go.

    And finally, putting FUSE topics aside, there is one more use-case for
    the strictlimit feature. Using a slow USB stick (mass storage) in a
    machine with a huge amount of RAM installed is a well-known pain. Let's
    do a simple computation. Assuming 64GB of RAM installed, the existing
    implementation of balance_dirty_pages will start throttling only after
    9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the command
    "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a
    subsequent "umount /media/my-usb-storage/" will take more than two hours
    if the effective throughput of the storage is, say, 1MB/sec.
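
    The arithmetic above can be checked with a small sketch. The helper
    names freerun_bytes() and drain_seconds() are made up for illustration,
    and GB/MB are taken as binary units.

```c
#include <assert.h>
#include <stdint.h>

/* Dirty memory allowed before throttling starts, as a percentage of RAM. */
static uint64_t freerun_bytes(uint64_t ram_bytes, unsigned percent)
{
    assert(percent <= 100);
    return ram_bytes / 100 * percent;
}

/* Seconds needed to write back `bytes` at a sustained `bytes_per_sec`. */
static uint64_t drain_seconds(uint64_t bytes, uint64_t bytes_per_sec)
{
    assert(bytes_per_sec > 0);
    return bytes / bytes_per_sec;
}
```

    For 64GB of RAM and a 15% freerun, freerun_bytes(64ULL << 30, 15) is
    above 9GB, so the whole 9GB copy fits in dirty memory without
    throttling; drain_seconds(9ULL << 30, 1ULL << 20) is 9216 seconds,
    i.e. a bit over two and a half hours.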

    After inclusion of the strictlimit feature, it will be trivial to add a
    knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
    demand, manually or via a udev rule. Maybe I'm wrong, but it seems to be
    quite a natural desire to limit the amount of dirty memory for some
    devices we do not fully trust (in the sense of sustainable throughput).

    [akpm@linux-foundation.org: fix warning in page-writeback.c]
    Signed-off-by: Maxim Patlasov
    Cc: Jan Kara
    Cc: Miklos Szeredi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
     

02 Apr, 2013

2 commits

  • Writeback implements its own worker pool - each bdi can be associated
    with a worker thread which is created and destroyed dynamically. The
    worker thread for the default bdi is always present and serves as the
    "forker" thread which forks off worker threads for other bdis.

    There's no reason for writeback to implement its own worker pool when
    using unbound workqueue instead is much simpler and more efficient.
    This patch replaces custom worker pool implementation in writeback
    with an unbound workqueue.

    The conversion isn't too complicated but the following points are worth
    mentioning.

    * bdi_writeback->last_active, task and wakeup_timer are removed.
    delayed_work ->dwork is added instead. Explicit timer handling is
    no longer necessary. Everything works by either queueing / modding
    / flushing / canceling the delayed_work item.

    * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
    bdi_writeback->dwork. On each execution, it processes
    bdi->work_list and reschedules itself if there are more things to
    do.

    The function also handles low-mem condition, which used to be
    handled by the forker thread. If the function is running off a
    rescuer thread, it only writes out limited number of pages so that
    the rescuer can serve other bdis too. This preserves the flusher
    creation failure behavior of the forker thread.

    * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
    bdi_writeback_workfn() about on-going bdi unregistration so that it
    always drains work_list even if it's running off the rescuer. Note
    that the original code was broken in this regard. Under memory
    pressure, a bdi could finish unregistration with non-empty
    work_list.

    * The default bdi is no longer special. It now is treated the same as
    any other bdi and bdi_cap_flush_forker() is removed.

    * BDI_pending is no longer used. Removed.

    * Some tracepoints become non-applicable. The following TPs are
    removed - writeback_nothread, writeback_wake_thread,
    writeback_wake_forker_thread, writeback_thread_start,
    writeback_thread_stop.

    Everything, including devices coming and going away and rescuer
    operation under simulated memory pressure, seems to work fine in my
    test setup.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu
    Cc: Jeff Moyer

    Tejun Heo
     
  • There's no user left. Remove it.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu

    Tejun Heo
     

22 Feb, 2013

1 commit

  • This patchset ("stable page writes, part 2") makes some key
    modifications to the original 'stable page writes' patchset. First, it
    provides creators (devices and filesystems) of a backing_dev_info a flag
    that declares whether or not it is necessary to ensure that page
    contents cannot change during writeout. It is no longer assumed that
    this is true of all devices (which was never true anyway). Second, the
    flag is used to relax the wait_on_page_writeback calls so that wait
    only occurs if the device needs it. Third, it fixes up the remaining
    disk-backed filesystems to use this improved conditional-wait logic to
    provide stable page writes on those filesystems.

    It is hoped that (for people not using checksumming devices, anyway)
    this patchset will win back the performance that has been lost
    unnecessarily since the original stable page write patchset went into
    3.0. Sorry about not fixing it sooner.

    Complaints were registered by several people about the long write
    latencies introduced by the original stable page write patchset.
    Generally speaking, the kernel ought to allocate as little extra memory
    as possible to facilitate writeout, but for people who simply cannot
    wait, a second page stability strategy is (re)introduced: snapshotting
    page contents. The waiting behavior is still the default strategy; to
    enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
    set. This flag is used to bandaid^Henable stable page writeback on
    ext3[1], and is not used anywhere else.

    Given that there are already a few storage devices and network FSes that
    have rolled their own page stability wait/page snapshot code, it would
    be nice to move towards consolidating all of these. It seems possible
    that iscsi and raid5 may wish to use the new stable page write support
    to enable zero-copy writeout.

    Thank you to Jan Kara for helping fix a couple more filesystems.

    Per Andrew Morton's request, here are the results of using dbench to
    measure latencies on ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        109347     0.028    59.817
    ReadX         347180     0.004     3.391
    Flush          15514    29.828   287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX        105556     0.029     4.273
    ReadX         335004     0.005     4.112
    Flush          14982    30.540   298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, for ext2 the maximum write latency decreases from ~60ms
    on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
    increase, though I suspect that being able to dirty pages faster gives
    the flusher more work to do.

    On ext4, the average write latency decreases as well as all the maximum
    latencies:

    3.8.0-rc3:
    WriteX         85624     0.152    33.078
    ReadX         272090     0.010    61.210
    Flush          12129    36.219   168.260

    Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms

    3.8.0-rc3 + patches:
    WriteX         86082     0.141    30.928
    ReadX         273358     0.010    36.124
    Flush          12214    34.800   165.689

    Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms

    XFS seems to exhibit similar latency improvements as ext2:

    3.8.0-rc3:
    WriteX        125739     0.028   104.343
    ReadX         399070     0.005     4.115
    Flush          17851    25.004   131.390

    Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms

    3.8.0-rc3 + patches:
    WriteX        123529     0.028     6.299
    ReadX         392434     0.005     4.287
    Flush          17549    25.120   188.687

    Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms

    ...and btrfs, just to round things out, also shows some latency
    decreases:

    3.8.0-rc3:
    WriteX         67122     0.083    82.355
    ReadX         212719     0.005     2.828
    Flush           9547    47.561   147.418

    Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms

    3.8.0-rc3 + patches:
    WriteX         64898     0.101    71.631
    ReadX         206673     0.005     7.123
    Flush           9190    47.963   219.034

    Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own wait code, or they don't block at all. The blocking
    behavior is back to what it was before 3.0 if you don't have a disk
    requiring stable page writes.

    This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
    xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
    results as -rc3.

    [1] The alternative fixes to ext3 include fixing the locking order and
    page bit handling like we did for ext4 (but then why not just use
    ext4?), or setting PG_writeback so early that ext3 becomes extremely
    slow. I tried that, but the number of write()s I could initiate dropped
    by nearly an order of magnitude. That was a bit much even for the
    author of the stable page series! :)

    This patch:

    Creates a per-backing-device flag that tracks whether or not pages must
    be held immutable during writeout. Eventually it will be used to waive
    wait_on_page_writeback() if nothing requires stable pages.
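
    A minimal sketch of the conditional wait this flag is meant to enable.
    The type and function names below (struct stable_bdi,
    before_modifying_page) are hypothetical and only model the decision; the
    real code operates on struct page and the bdi capability bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a backing device with the new flag. */
struct stable_bdi {
    bool stable_pages_required; /* device needs immutable pages during writeout */
};

enum wait_action { MODIFY_IMMEDIATELY, WAIT_FOR_WRITEBACK };

/* Decide what a writer must do before touching a page of this device. */
static enum wait_action before_modifying_page(const struct stable_bdi *bdi,
                                              bool page_under_writeback)
{
    assert(bdi != NULL);
    if (page_under_writeback && bdi->stable_pages_required)
        return WAIT_FOR_WRITEBACK;  /* e.g. a checksumming device */
    return MODIFY_IMMEDIATELY;      /* nothing requires stable pages */
}
```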

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

13 Dec, 2012

1 commit


04 Aug, 2012

1 commit

  • Finally we can kill the 'sync_supers' kernel thread along with the
    '->write_super()' superblock operation because all the users are gone.
    Now every file-system is supposed to manage its own superblock and
    its dirty state itself.

    The nice thing about killing this thread is that it improves power management.
    Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
    every 5 seconds no matter what - even if there were no dirty superblocks and
    even if there were no file-systems using this service (e.g., btrfs and
    journalled ext4 do not need it). So it was wasting power most of the time. And
    because the thread was in the core of the kernel, all systems had to have it.
    So I am quite happy to make it go away.

    Interestingly, this thread is a left-over from the pdflush kernel thread which
    was a self-forking kernel thread responsible for all the write-back in old
    Linux kernels. It was turned into per-block device BDI threads, and
    'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     

01 Aug, 2012

1 commit

  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backward compatibility, print a warning via printk and return 2 to
    notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

09 Jun, 2012

1 commit


31 Oct, 2011

1 commit

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_stats parameter added to them, so callers can specify
    why writeback is being started.
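
    The idea can be sketched as below. The enumerators, the struct and
    reason_name() are illustrative stand-ins, not the definitions this
    commit adds to writeback.h.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative subset of reasons; the real enumeration differs. */
enum wb_reason_sketch {
    REASON_BACKGROUND,
    REASON_SYNC,
    REASON_PERIODIC,
    REASON_MAX
};

/* A work item carries the reason so tracing can report who started it. */
struct wb_work_sketch {
    long nr_pages;
    enum wb_reason_sketch reason;
};

/* Map a reason to the symbolic name a trace event would print. */
static const char *reason_name(enum wb_reason_sketch r)
{
    static const char *const names[REASON_MAX] = {
        [REASON_BACKGROUND] = "background",
        [REASON_SYNC]       = "sync",
        [REASON_PERIODIC]   = "periodic",
    };
    assert((unsigned)r < REASON_MAX);
    return names[r];
}
```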

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

03 Oct, 2011

3 commits

  • There are some imperfections in balanced_dirty_ratelimit.

    1) large fluctuations

    The dirty_rate used for computing balanced_dirty_ratelimit is merely
    averaged over the past 200ms (very small compared to the 3s estimation
    period for write_bw), which gives a rather dispersed distribution of
    balanced_dirty_ratelimit.

    It's pretty hard to average out the singular points by increasing the
    estimation period. Considering that the averaging technique will
    introduce very undesirable time lags, I give it up totally. (btw, the 3s
    write_bw averaging time lag is much more acceptable because its impact
    is one-way and therefore won't lead to oscillations.)

    The more practical way is filtering -- most singular
    balanced_dirty_ratelimit points can be filtered out by remembering some
    prev_balanced_rate and prev_prev_balanced_rate. However the more
    reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

    2) due to truncates and fs redirties, the match between write_bw and
    dirty_rate could become unbalanced, which may lead to large systematic
    errors in balanced_dirty_ratelimit. The truncates, due to their possibly
    bumpy nature, can hardly be compensated smoothly. So let's face it. When
    some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
    high, dirty pages will go higher than the setpoint. task_ratelimit will
    in turn become lower than dirty_ratelimit. So if we consider both
    balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
    only when they are on the same side of dirty_ratelimit, the systematic
    errors in balanced_dirty_ratelimit won't be able to bring
    dirty_ratelimit far away.

    The balanced_dirty_ratelimit estimation may also be inaccurate near
    @limit or @freerun, but that is less of an issue.

    3) since we ultimately want to

    - keep the fluctuations of task ratelimit as small as possible
    - keep the dirty pages around the setpoint as long time as possible

    the update policy used for (2) also serves the above goals nicely:
    if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
    and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
    there is no point to bring up dirty_ratelimit in a hurry only to hurt
    both the above two goals.

    So, we make use of task_ratelimit to limit the update of dirty_ratelimit
    in two ways:

    1) avoid changing dirty rate when it's against the position control target
    (the adjusted rate will slow down the progress of dirty pages going
    back to setpoint).

    2) limit the step size. task_ratelimit is changing values step by step,
    leaving a consistent trace compared to the randomly jumping
    balanced_dirty_ratelimit. task_ratelimit also nicely has smaller
    errors in stable state and typically larger errors when there are big
    errors in rate. So it's a pretty good limiting factor for the step
    size of dirty_ratelimit.

    Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
    task_ratelimit is merely used as a limiting factor.
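
    The two rules above can be sketched as a single update function. This is
    a simplified model under stated assumptions: next_dirty_ratelimit() and
    its helpers are hypothetical names, and any additional smoothing of the
    step is left out.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

/*
 * Move dirty_ratelimit towards balanced_dirty_ratelimit, but only when
 * task_ratelimit lies on the same side, and never further than the
 * nearer of the two values (task_ratelimit limits the step size).
 */
static unsigned long next_dirty_ratelimit(unsigned long dirty_ratelimit,
                                          unsigned long balanced_dirty_ratelimit,
                                          unsigned long task_ratelimit)
{
    assert(dirty_ratelimit > 0);
    if (balanced_dirty_ratelimit > dirty_ratelimit &&
        task_ratelimit > dirty_ratelimit)
        return min_ul(balanced_dirty_ratelimit, task_ratelimit);
    if (balanced_dirty_ratelimit < dirty_ratelimit &&
        task_ratelimit < dirty_ratelimit)
        return max_ul(balanced_dirty_ratelimit, task_ratelimit);
    return dirty_ratelimit; /* they disagree: leave the rate alone */
}
```

    For example, with dirty_ratelimit at 100, a balanced estimate of 150 and
    a task_ratelimit of 120 move the rate to 120, while a task_ratelimit of
    80 (dirty pages above the setpoint) leaves it at 100.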

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
    when there are N dd tasks.

    On write() syscall, use bdi->dirty_ratelimit
    ============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

    On every 200ms, update bdi->dirty_ratelimit
    ===========================================

    bdi_update_dirty_ratelimit()
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
        bdi->dirty_ratelimit = balanced_dirty_ratelimit;
    }

    Estimation of balanced bdi->dirty_ratelimit
    ===========================================

    balanced task_ratelimit
    -----------------------

    balance_dirty_pages() needs to throttle tasks dirtying pages such that
    the total amount of dirty pages stays below the specified dirty limit in
    order to avoid memory deadlocks. Furthermore we desire fairness in that
    tasks get throttled proportionally to the amount of pages they dirty.

    IOW we want to throttle tasks such that we match the dirty rate to the
    writeout bandwidth, this yields a stable amount of dirty pages:

    dirty_rate == write_bw (1)

    The fairness requirement gives us:

    task_ratelimit = balanced_dirty_ratelimit
    == write_bw / N (2)

    where N is the number of dd tasks. We don't know N beforehand, but
    still can estimate balanced_dirty_ratelimit within 200ms.

    Start by throttling each dd task at rate

    task_ratelimit = task_ratelimit_0 (3)
    (any non-zero initial value is OK)

    After 200ms, we measured

    dirty_rate = # of pages dirtied by all dd's / 200ms
    write_bw = # of pages written to the disk / 200ms

    For the aggressive dd dirtiers, the equality holds

    dirty_rate == N * task_rate
    == N * task_ratelimit_0 (4)
    Or
    task_ratelimit_0 == dirty_rate / N (5)

    Now we conclude that the balanced task ratelimit can be estimated by

        balanced_dirty_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)    (6)

    Because with (4) and (5) we can get the desired equality (2):

        balanced_dirty_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
                                 == write_bw / N

    Then using the balanced task ratelimit we can compute task pause times like:

    task_pause = task->nr_dirtied / task_ratelimit
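
    As a worked instance of the estimate and the pause computation, consider
    the following sketch. The function names are hypothetical, rates are in
    pages per second, and the pause is returned in milliseconds.

```c
#include <assert.h>

/* Estimate (6): scale the initial per-task limit by write_bw / dirty_rate. */
static unsigned long estimate_balanced_ratelimit(unsigned long task_ratelimit_0,
                                                 unsigned long write_bw,
                                                 unsigned long dirty_rate)
{
    assert(dirty_rate > 0);
    return task_ratelimit_0 * write_bw / dirty_rate;
}

/* Pause, in milliseconds, for a task that dirtied nr_dirtied pages. */
static unsigned long task_pause_ms(unsigned long nr_dirtied,
                                   unsigned long task_ratelimit)
{
    assert(task_ratelimit > 0);
    return nr_dirtied * 1000 / task_ratelimit;
}
```

    With N = 4 dd tasks each throttled at 50 pages/s, dirty_rate is 200
    pages/s. If the disk writes 80 pages/s, the estimate is
    50 * 80 / 200 = 20 pages/s, which equals write_bw / N, and a task that
    dirtied 10 pages pauses for 500ms.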

    task_ratelimit with position control
    ------------------------------------

    However, while the above gives us means of matching the dirty rate to
    the writeout bandwidth, it at best provides us with a stable dirty page
    count (assuming a static system). In order to control the dirty page
    count such that it is high enough to provide performance, but does not
    exceed the specified limit we need another control.

    The dirty position control works by extending (2) to

    task_ratelimit = balanced_dirty_ratelimit * pos_ratio (7)

    where pos_ratio is a negative feedback function that is subject to

    1) f(setpoint) = 1.0
    2) df/dx < 0

    That is, if the dirty pages are ABOVE the setpoint, we throttle each
    task a bit more HEAVILY than balanced_dirty_ratelimit, so that the dirty
    pages are created less fast than they are cleaned, thus DROP to the
    setpoint (and the reverse).

    Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
    remain CONSTANT for the past 200ms, we get

    task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio (8)

    Putting (8) into (6), we get the formula used in
    bdi_update_dirty_ratelimit():

        balanced_dirty_ratelimit *= pos_ratio * (write_bw / dirty_rate)    (9)
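
    One function satisfying the two conditions on pos_ratio is a straight
    line through (setpoint, 1.0) that reaches 0 at the limit. The sketch
    below assumes that linear shape purely for illustration; the text only
    requires f(setpoint) = 1.0 and df/dx < 0, and pos_ratio_sketch() is a
    hypothetical name.

```c
#include <assert.h>

/* Linear negative feedback: 1.0 at the setpoint, 0.0 at (and above) the limit. */
static double pos_ratio_sketch(double dirty, double setpoint, double limit)
{
    assert(limit > setpoint);
    double ratio = 1.0 + (setpoint - dirty) / (limit - setpoint);
    return ratio < 0.0 ? 0.0 : ratio;
}
```

    Above the setpoint the ratio drops below 1.0, so tasks are throttled
    harder than balanced_dirty_ratelimit; below the setpoint it rises above
    1.0 and tasks are allowed to dirty pages faster.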

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce the BDI_DIRTIED counter. It will be used for estimating the
    bdi's dirty bandwidth.

    CC: Jan Kara
    CC: Michael Rubin
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

10 Jul, 2011

2 commits

  • The estimation value will start from 100MB/s and adapt to the real
    bandwidth in seconds.

    It tries to update the bandwidth only when disk is fully utilized.
    Any inactive period of more than one second will be skipped.

    The estimated bandwidth will be reflecting how fast the device can
    writeout when _fully utilized_, and won't drop to 0 when it goes idle.
    The value will remain constant at disk idle time. At busy write time, if
    not considering fluctuations, it will also remain high unless it is
    knocked down by possible concurrent reads that compete for the disk time
    and bandwidth with async writes.

    The estimation is not done purely in the flusher because there is no
    guarantee for write_cache_pages() to return timely to update bandwidth.

    The bdi->avg_write_bandwidth smoothing is very effective for filtering
    out sudden spikes, however may be a little biased in long term.

    The overheads are low because the bdi bandwidth update only occurs at
    200ms intervals.

    The 200ms update interval is suitable, because it's not possible to get
    the real instantaneous bandwidth at all, due to large fluctuations.

    The NFS commits can be as large as seconds worth of data. One XFS
    completion may be as large as half second worth of data if we are going
    to increase the write chunk to half second worth of data. In ext4,
    fluctuations with a time period of around 5 seconds are observed. And
    there is another pattern of irregular periods of up to 20 seconds on SSD
    tests.

    That's why we are not only doing the estimation at 200ms intervals, but
    also averaging them over a period of 3 seconds and then going further to
    do another level of smoothing in avg_write_bandwidth.
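
    The first averaging level can be pictured as a time-weighted blend of
    the old estimate and the newest sample. The sketch below is an
    assumption about the general shape, not the kernel's exact arithmetic,
    and smooth_write_bw() is a hypothetical name.

```c
#include <assert.h>

/*
 * Blend the previous estimate with a new sample, weighting the sample by
 * the fraction of the averaging period that it covers.
 */
static unsigned long smooth_write_bw(unsigned long old_bw,
                                     unsigned long sample_bw,
                                     unsigned long elapsed_ms,
                                     unsigned long period_ms)
{
    assert(period_ms > 0);
    if (elapsed_ms > period_ms)
        elapsed_ms = period_ms; /* a long gap replaces the old estimate */
    return (sample_bw * elapsed_ms + old_bw * (period_ms - elapsed_ms))
           / period_ms;
}
```

    With a 3000ms period and 200ms samples, a single sample of 200 against
    an old estimate of 100 moves the estimate only to 106, which is how
    short spikes get filtered out.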

    CC: Li Shaohua
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce the BDI_WRITTEN counter. It will be used for estimating the
    bdi's write bandwidth.

    Peter Zijlstra:
    Move BDI_WRITTEN accounting into __bdi_writeout_inc().
    This will cover and fix fuse, which only calls bdi_writeout_inc().

    CC: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
    akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Oct, 2010

2 commits

  • Declare 'bdi_pending_list' and 'tag_pages_for_writeback()' to remove
    following sparse warnings:

    mm/backing-dev.c:46:1: warning: symbol 'bdi_pending_list' was not declared. Should it be static?
    mm/page-writeback.c:825:6: warning: symbol 'tag_pages_for_writeback' was not declared. Should it be static?

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • …r if significant congestion is not being encountered in the current zone

    If congestion_wait() is called with no BDI congested, the caller will
    sleep for the full timeout and this may be an unnecessary sleep. This
    patch adds a wait_iff_congested() that checks congestion and only sleeps
    if a BDI is congested; otherwise, it calls cond_resched() to ensure the
    caller is not hogging the CPU longer than its quota but will not sleep.

    This is aimed at reducing some of the major desktop stalls reported during
    IO. For example, while kswapd is operating, it calls congestion_wait()
    but it could just have been reclaiming clean page cache pages with no
    congestion. Without this patch, it would sleep for a full timeout but
    after this patch, it'll just call schedule() if it has been on the CPU too
    long. Similar logic applies to direct reclaimers that are not making
    enough progress.
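
    The decision can be modelled as below. This only sketches the branch
    described above; the enum, the function name and the congestion counter
    passed in are hypothetical, and no real sleeping or scheduling is done.

```c
#include <assert.h>

enum reclaim_action {
    YIELD_CPU,          /* models cond_resched(): give up the CPU if needed */
    SLEEP_FULL_TIMEOUT  /* models the sleep congestion_wait() would do */
};

/* Sleep only if at least one BDI is actually congested. */
static enum reclaim_action wait_iff_congested_sketch(int nr_congested_bdis)
{
    assert(nr_congested_bdis >= 0);
    if (nr_congested_bdis == 0)
        return YIELD_CPU;
    return SLEEP_FULL_TIMEOUT;
}
```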

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

12 Aug, 2010

1 commit

  • Commit 83ba7b071f3 ("writeback: simplify the write back thread queue")
    broke writeback_in_progress() as in that commit we started to remove work
    items from the list at the moment we start working on them and not at the
    moment they are finished. Thus if the flusher thread was doing some work
    but there was no other work queued, writeback_in_progress() returned
    false. This could in particular cause unnecessary queueing of background
    writeback from balance_dirty_pages() or writeout work from
    writeback_inodes_sb_if_idle().

    This patch fixes the problem by introducing a bit in the bdi state which
    indicates that the flusher thread is processing some work and uses this
    bit for writeback_in_progress() test.
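
    The difference between the broken test and the fixed one can be
    sketched like this. The struct, the bit name and the helpers are
    hypothetical, and plain (non-atomic) bit operations are used for
    brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define WB_RUNNING_BIT (1UL << 0) /* hypothetical "flusher is working" bit */

struct flusher_sketch {
    unsigned long state; /* bdi state bits */
    int queued_work;     /* items still sitting on the work list */
};

/* The flusher dequeues an item when it STARTS working on it. */
static void start_work(struct flusher_sketch *f)
{
    assert(f->queued_work > 0);
    f->queued_work--;
    f->state |= WB_RUNNING_BIT;
}

static void finish_work(struct flusher_sketch *f)
{
    f->state &= ~WB_RUNNING_BIT;
}

/* Broken test: an empty list looks idle even while work is running. */
static bool in_progress_by_list(const struct flusher_sketch *f)
{
    return f->queued_work > 0;
}

/* Fixed test: consult the state bit instead of the list. */
static bool in_progress_by_bit(const struct flusher_sketch *f)
{
    return (f->state & WB_RUNNING_BIT) != 0;
}
```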

    NOTE: Both callsites of writeback_in_progress() (namely,
    writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
    need different information than what writeback_in_progress() provides.
    They would need to know whether *the kind of writeback they are going to
    submit* is already queued. But this information isn't that simple to
    provide so let's fix writeback_in_progress() for the time being.

    Signed-off-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Aug, 2010

6 commits

  • When the first inode for a bdi is marked dirty, we wake up the bdi thread which
    should take care of the periodic background write-out. However, the write-out
    will actually start only 'dirty_writeback_interval' centisecs later, so we can
    delay the wake-up.

    This change was requested by Nick Piggin who pointed out that if we delay the
    wake-up, we weed out 2 unnecessary context switches, which matters because
    '__mark_inode_dirty()' is a hot-path function.

    This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
    sets up a timer to wake-up the bdi thread and returns. So the wake-up is
    delayed.

    We also delete the timer in bdi threads just before writing-back. And
    synchronously delete it when unregistering bdi. At the unregister point the bdi
    does not have any users, so no one can arm it again.

    Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
    context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
    this change as well.

    This patch also moves the 'bdi_wb_init()' function down in the file to avoid
    forward-declaration of 'bdi_wakeup_thread_delayed()'.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently bdi threads use a local variable 'last_active', which stores the last
    time the bdi thread did some useful work. Move this local variable to 'struct
    bdi_writeback'. This is just a preparation for the further patches which will
    make the forker thread decide when bdi threads should be killed.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • This patch simplifies bdi code a little by removing the 'pending_list' which is
    redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is
    working like this:

    1. In a loop, fetch all bdi's which have work but have no writeback thread and
    move them to the 'pending_list'.
    2. If the list is empty, sleep for 5 sec.
    3. Otherwise, take one bdi from the list, fork the writeback thread for this
    bdi, and repeat the loop.

    IOW, it first moves everything to the 'pending_list', then processes only one
    element, and so on. This patch simplifies the algorithm, which is now as
    follows.

    1. Find the first bdi which has work and remove it from the global list of
    bdi's (bdi_list).
    2. If there is no such bdi, sleep for 5 sec.
    3. Fork the writeback thread for this bdi and repeat the loop.

    IOW, now we find the first bdi to process, process it, and so on. This is
    simpler and involves fewer lists.
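The simplified loop can be sketched in userspace C as follows. This is a rough model, not the kernel code: `sketch_bdi`, `sketch_find_first()` and `sketch_forker_run()` are hypothetical names, an array stands in for 'bdi_list', setting `has_thread` stands in for both removing the bdi from the list and forking its thread, and the 5 sec sleep is replaced by returning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bdi as the forker sees it. */
struct sketch_bdi {
    const char *name;
    bool has_work;
    bool has_thread;
};

/* Step 1: first bdi that has work but no writeback thread, or NULL. */
static struct sketch_bdi *sketch_find_first(struct sketch_bdi *list, size_t n)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (list[i].has_work && !list[i].has_thread)
            return &list[i];
    return NULL;
}

/* Run the loop until it would sleep; return how many threads were "forked". */
static int sketch_forker_run(struct sketch_bdi *list, size_t n)
{
    int forked = 0;

    for (;;) {
        struct sketch_bdi *bdi = sketch_find_first(list, n);

        if (!bdi)
            break;              /* step 2: the real forker sleeps here */
        bdi->has_thread = true; /* step 3: stand-in for forking the thread */
        forked++;
    }
    return forked;
}
```

Each iteration handles exactly one bdi, so no intermediate list is needed to remember which ones are pending.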

    The bonus now is that we can get rid of a couple of functions, as well as
    remove complications which involve 'call_rcu()' and 'bdi->rcu_head'.

    This patch also makes sure we use 'list_add_tail_rcu()', instead of plain
    'list_add_tail()', but this piece of code is going to be removed in the next
    patch anyway.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • The write-back code mixes words "thread" and "task" for the same things. This
    is not a big deal, but still an inconsistency.

    hch: a convention I tend to use and I've seen in various places
    is to always use _task for the storage of the task_struct pointer,
    and thread everywhere else. This especially helps with having
    foo_thread for the actual thread and foo_task for a global
    variable keeping the task_struct pointer

    This patch renames:
    * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
    * 'bdi_forker_task()' -> 'bdi_forker_thread()'

    because bdi threads are 'bdi_writeback_thread()', so these names are more
    consistent.

    This patch also amends comments and makes them refer to the forker and bdi
    threads as "thread", not "task".

    Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
    'static void' instead of 'void static' and make checkpatch.pl happy.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Move all code for the writeback thread into fs/fs-writeback.c instead of
    splitting it over two functions in two files.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The wb_list member of struct backing_dev_info always has exactly one
    element. Just use the direct bdi->wb pointer instead and simplify some
    code.

    Also remove bdi_task_init which is now trivial to prepare for the next
    patch.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Jul, 2010

1 commit

  • First remove items from work_list as soon as we start working on them. This
    means we don't have to track any pending or visited state and can get
    rid of all the RCU magic freeing the work items - we can simply free
    them once the operation has finished. Second use a real completion for
    tracking synchronous requests - if the caller sets the completion pointer
    we complete it, otherwise use it as a boolean indicator that we can free
    the work item directly. Third unify struct wb_writeback_args and struct
    bdi_work into a single data structure, wb_writeback_work. Previously we
    set all parameters into a struct wb_writeback_args, copied it into
    struct bdi_work, then copied it again on the stack to use it there.
    Instead, just allocate one structure dynamically or on the stack and use
    it all the way through the stack.
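The completion-pointer convention can be sketched in userspace C. The names below (`sketch_work`, `sketch_completion`, `sketch_process()`) are hypothetical and only model the idea described above: a non-NULL completion marks a synchronous request owned by the caller, while NULL means the worker may free the item itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a completion; just a flag here. */
struct sketch_completion {
    bool done;
};

/* One structure carries both the parameters and the sync/async marker. */
struct sketch_work {
    long nr_pages;
    struct sketch_completion *done; /* NULL: async, worker frees the item */
};

/* Allocate an async item on the heap; the worker will free it. */
static struct sketch_work *sketch_alloc_async(long nr_pages)
{
    struct sketch_work *w = malloc(sizeof(*w));

    if (!w)
        return NULL;
    w->nr_pages = nr_pages;
    w->done = NULL;
    return w;
}

/* Process one item; returns true if the worker freed it. */
static bool sketch_process(struct sketch_work *work, long *written)
{
    assert(work != NULL && written != NULL);
    *written += work->nr_pages;  /* stand-in for the actual writeback */
    if (work->done) {
        work->done->done = true; /* sync: signal the waiting caller */
        return false;            /* caller owns the item (e.g. on stack) */
    }
    free(work);                  /* async: nobody else references it */
    return true;
}
```

A synchronous caller can keep the item on its own stack and wait on the completion, while an asynchronous caller hands over a heap-allocated item and never touches it again.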

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jun, 2010

1 commit


01 Jun, 2010

1 commit


22 May, 2010

1 commit

  • Commit 69b62d01 fixed up most of the places where we would enter
    busy schedule() spins when disabling the periodic background
    writeback. This fixes up the sb timer so that it doesn't get
    hammered on with the delay disabled, and ensures that it gets
    rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
    gets modified.

    bdi_forker_task() also needs to check for !dirty_writeback_centisecs
    and use schedule() appropriately, fix that up too.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 May, 2010

1 commit

  • When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
    writeback to kick off writeback of pending dirty inodes, then follow
    that up with a WB_SYNC_ALL to wait for it. Since umount already holds
    the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
    writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
    since WB_SYNC_ALL writeback is a data integrity operation and thus
    a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
    it's a lot slower.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Apr, 2010

1 commit


25 Apr, 2010

1 commit

  • noop_backing_dev_info is used only as a flag to mark filesystems that
    don't have any backing store, like tmpfs, procfs, spufs, etc.

    Signed-off-by: Joern Engel

    Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
    to the noop_backing_dev_info is not legal and will not result in
    them being flushed, but we already catch this condition in
    __mark_inode_dirty() when checking for a registered bdi.

    Signed-off-by: Jens Axboe

    Jörn Engel
     

22 Apr, 2010

1 commit


06 Apr, 2010

1 commit

  • One of the features of laptop-mode is that it forces a writeout of dirty
    pages if something else triggers a physical read or write from a device.
    The current implementation flushes pages on all devices, rather than only
    the one that triggered the flush. This patch alters the behaviour so that
    only the recently accessed block device is flushed, preventing other
    disks being spun up for no terribly good reason.

    Signed-off-by: Matthew Garrett
    Signed-off-by: Jens Axboe

    Matthew Garrett
     

29 Oct, 2009

1 commit


26 Sep, 2009

1 commit

  • Sometimes we only want to write pages from a specific super_block,
    so allow that to be passed in.

    This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
    causing writeback on all super_blocks on a bdi, where we only really
    want to sync a specific sb from writeback_inodes_sb().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Sep, 2009

2 commits

  • bdi_start_writeback() is currently split into two paths, one for
    WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
    for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
    only WB_SYNC_NONE.

    Push down the writeback_control allocation and only accept the
    parameters that make sense for each function. This cleans up
    the API considerably.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Now that bdi_writeback_all() no longer handles integrity writeback,
    it doesn't have to block anymore. This means that we can switch
    bdi_list reader side protection to RCU.

    Signed-off-by: Jens Axboe

    Jens Axboe