08 Dec, 2011

3 commits

  • Some traces show lots of bdi_dirty=0 lines where the value would
    actually be some small non-zero number, were it not for the accounting
    errors in the per-cpu bdi stats.

    In this case the max pause time should really be set to the smallest
    (non-zero) value to avoid IO queue underruns and improve throughput.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • On a system with 1 local mount and 1 NFS mount, if the NFS server
    stops responding while dd is writing to the NFS mount, the NFS dirty
    pages may exceed the global dirty limit and _every_ writing task will be
    blocked. The whole system appears unresponsive.

    The workaround is to let through the bdi's that only have a small
    number of dirty pages. The number chosen (bdi_stat_error pages) is not
    enough to let the local disk run at optimal throughput, however it is
    enough to make the system responsive on a broken NFS mount. The user can
    then kill the dirtiers on the NFS mount and increase the global dirty
    limit to bring the local disk's throughput back up.

    It risks allowing dirty pages to grow much larger than the global dirty
    limit when there are 1000+ mounts, however that's very unlikely to happen,
    especially in low memory profiles.
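
    A minimal sketch of the check described above, written as a fragment
    inside the balance_dirty_pages() throttle loop (the exact placement in
    the tree may differ):

        /*
         * bdi_stat_error() bounds the per-cpu counter error, so a
         * bdi_dirty at or below it may well be zero in reality: let such
         * a bdi pass instead of blocking its tasks behind a stuck mount.
         */
        if (bdi_dirty <= bdi_stat_error(bdi))
                break;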

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • We do "floating proportions" to let active devices to grow its target
    share of dirty pages and stalled/inactive devices to decrease its target
    share over time.

    It works well except in the case of "an inactive disk suddenly goes
    busy", where the initial target share may be too small. To mitigate
    this, bdi_position_ratio() has the below line to raise a small
    bdi_thresh when it's safe to do so, so that the disk be feed with enough
    dirty pages for efficient IO and in turn fast rampup of bdi_thresh:

    bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

    balance_dirty_pages() normally does negative feedback control, which
    adjusts the ratelimit to balance the bdi dirty pages around the target.
    In some extreme cases when that is not enough, it will have to block
    the tasks completely until the bdi dirty pages drop below bdi_thresh.

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

17 Nov, 2011

2 commits

  • They are not used any more.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
    for every single 4KB page, which means it cannot throttle a task below
    4KB/200ms=20KB/s. So when there are more than 512 dd tasks writing to a
    10MB/s USB stick, its bdi dirty pages can grow out of control.

    Even if we could increase MAX_PAUSE, the minimal (task_ratelimit = 1)
    still means a limit of 4KB/s.

    They can eventually be safeguarded by the global limit check
    (nr_dirty < dirty_thresh). However if someone is also writing to an
    HDD at the same time, it'll get poor HDD write performance.

    We at least want to maintain good write performance for the other
    devices when one device is attacked by some "massively parallel"
    workload, suffers from slow write bandwidth, or somehow gets stalled
    due to some error condition (eg. NFS server not responding).

    For a stalled device, we need to completely block its dirtiers, too,
    before its bdi dirty pages grow all the way up to the global limit and
    leave no space for the other functional devices.

    So change the loop exit condition to

    /*
     * Always enforce global dirty limit; also enforce bdi dirty limit
     * if the normal max_pause sleeps cannot keep things under control.
     */
    if (nr_dirty < dirty_thresh &&
        (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
            break;

    which can be further simplified to

    if (task_ratelimit)
            break;

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

16 Nov, 2011

1 commit

  • There is no reason why a task in balance_dirty_pages() shouldn't be
    killable, and it helps in recovering from some error conditions (like
    when a filesystem goes into an error state and cannot accept writeback
    anymore, but we still want to kill the processes using it so that it can
    be unmounted).

    There will be follow-up patches to further abort generic_perform_write()
    and other filesystem write loops, to avoid the large write + SIGKILL
    combination exceeding the dirty limit and possibly causing a strange OOM.
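
    A rough sketch of the killable pause, as a fragment of the throttle
    loop (the exact surrounding code in the tree may differ):

        __set_current_state(TASK_KILLABLE);
        io_schedule_timeout(pause);

        /* a task hit by SIGKILL should not be kept throttling */
        if (fatal_signal_pending(current))
                break;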

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Reviewed-by: Neil Brown
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

07 Nov, 2011

3 commits

  • In balance_dirty_pages() task_ratelimit may be used uninitialized
    (its initialization is skipped by the goto pause) when calling the
    tracing hook.

    Fix it by moving the task_ratelimit assignment before the goto pause.

    Reported-by: Witold Baryluk
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

01 Nov, 2011

1 commit


31 Oct, 2011

4 commits


11 Oct, 2011

1 commit

  • Fix powerpc compile warnings

    mm/page-writeback.c: In function 'bdi_position_ratio':
    mm/page-writeback.c:622:3: warning: comparison of distinct pointer types lacks a cast [enabled by default]
    page-writeback.c:635:4: warning: comparison of distinct pointer types lacks a cast [enabled by default]

    Also fix gcc "uninitialized var" warnings.

    Reported-by: Stephen Rothwell
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

03 Oct, 2011

10 commits

  • Keep a minimal pool of dirty pages for each bdi, so that the disk IO
    queues won't underrun. Also gently increase a small bdi_thresh to avoid
    it getting stuck at 0 for lightly dirtied bdi's.

    It's particularly useful for JBOD and small memory systems.

    It may result in (pos_ratio > 1) at the setpoint and push the dirty
    pages high. This is more or less intended because the bdi is in
    danger of IO queue underrun.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The dirty pause time shall ultimately be controlled by adjusting
    nr_dirtied_pause, since there is the relationship

    pause = pages_dirtied / task_ratelimit

    Assuming

    pages_dirtied ~= nr_dirtied_pause
    task_ratelimit ~= dirty_ratelimit

    We get

    nr_dirtied_pause ~= dirty_ratelimit * desired_pause

    Here dirty_ratelimit is preferred over task_ratelimit because it's
    more stable.

    It's also important to limit possible large transitional errors:

    - bw is changing quickly
    - pages_dirtied << nr_dirtied_pause on entering the dirty exceeded area
    - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
    separate fix, but still expect non-trivial errors)

    So we end up using the above formula inside clamp_val().
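
    A minimal sketch of that formula-plus-clamp (names and bounds are
    illustrative, not the in-tree values):

        /*
         * Derive the next nr_dirtied_pause from the stable dirty_ratelimit
         * and the pause we would like to see next time, then clamp it to
         * bound the transitional errors listed above.
         */
        static unsigned long next_nr_dirtied_pause(unsigned long dirty_ratelimit, /* pages/s */
                                                   unsigned long desired_pause_ms,
                                                   unsigned long lo, unsigned long hi)
        {
                unsigned long pages = dirty_ratelimit * desired_pause_ms / 1000;

                if (pages < lo)
                        return lo;
                if (pages > hi)
                        return hi;
                return pages;
        }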

    The best test case for this code is to run 100 "dd bs=4M" tasks on
    btrfs and check its pause time distribution.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Apply two policies to scale down the max pause time for

    1) a small number of concurrent dirtiers
    2) small memory systems (compared to the storage bandwidth)

    MAX_PAUSE=200ms may only be suitable for high end servers with lots of
    concurrent dirtiers, where the large pause time can cut a lot of
    overhead.

    Otherwise, a smaller pause time is desirable whenever possible, so as
    to get good responsiveness and a smooth user experience. It's actually
    required for good disk utilization in the case when all the dirty pages
    can be synced to disk within MAX_PAUSE=200ms.
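
    An illustrative sketch of the two scale-down policies (the constants
    and the dirtier estimate are examples, not the in-tree heuristics):

        /* shrink MAX_PAUSE for few dirtiers and for small dirty pools */
        static unsigned int scaled_max_pause_ms(unsigned long nr_dirtiers,
                                                unsigned long bdi_dirty_pages,
                                                unsigned long write_bw_pps)
        {
                unsigned int t = 200;                   /* MAX_PAUSE, in ms */

                /* 1) few dirtiers: long pauses buy little, keep them short */
                if (nr_dirtiers && t > 10 * nr_dirtiers)
                        t = 10 * nr_dirtiers;

                /*
                 * 2) small memory vs. bandwidth: never pause longer than it
                 * takes the disk to write out ~half of this bdi's dirty pool
                 */
                if (write_bw_pps) {
                        unsigned long long drain =
                                1000ULL * bdi_dirty_pages / (2 * write_bw_pps);
                        if (drain < t)
                                t = (unsigned int)drain;
                }
                return t;
        }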

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • As proposed by Chris, Dave and Jan, don't start foreground writeback IO
    inside balance_dirty_pages(). Instead, simply let it sleep for some
    time to throttle the dirtying task. In the meanwhile, kick off the
    per-bdi flusher thread to do background writeback IO.

    RATIONALE
    =========

    - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

    If every thread doing writes and being throttled starts foreground
    writeback, we get N IO submitters from at least N different inodes at
    the same time, and end up with N different sets of IO being issued
    with potentially zero locality to each other, resulting in much lower
    elevator sort/merge efficiency; hence we seek the disk all over the
    place to service the different sets of IO.
    OTOH, if there is only one submission thread, it doesn't jump between
    inodes in the same way when congestion clears - it keeps writing to
    the same inode, resulting in large related chunks of sequential IO
    being issued to the disk. This is more efficient than the above
    foreground writeback because the elevator works better and the disk
    seeks less.

    - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

    With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
    from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

    * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

    * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

    * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

    * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

    - IO size too small for fast arrays and too large for slow USB sticks

    The write_chunk used by the current balance_dirty_pages() cannot be
    directly set to some large value (eg. 128MB) for better IO efficiency,
    because that could lead to user-perceivable stalls of more than 1
    second. Even the current 4MB write size may be too large for slow USB
    sticks. The fact that balance_dirty_pages() starts IO itself couples
    the IO size to the wait time, which makes it hard to choose a suitable
    IO size while keeping the wait time under control.

    Now it's possible to increase the writeback chunk size in proportion to
    the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
    the larger writeback size dramatically reduces the seek count to 1/10
    (far beyond my expectation) and improves the write throughput by 24%.

    - long block time in balance_dirty_pages() hurts desktop responsiveness

    Many of us may have had the experience: it often takes a couple of
    seconds or even longer to stop a heavily writing dd/cp/tar command
    with Ctrl-C or "kill -9".

    - IO pipeline broken by bumpy write() progress

    There is a broad class of "loop {read(buf); write(buf);}" applications
    whose read() pipeline will be under-utilized or even come to a stop if
    the write()s have long latencies _or_ don't progress at a constant rate.
    The current threshold based throttling inherently transfers the large
    low level IO completion fluctuations to bumpy application write()s,
    and this further deteriorates with an increasing number of dirtiers
    and/or bdi's.

    For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
    the rsync progress is very bumpy on the legacy kernel, and its
    throughput is improved by 67% by this patchset (with the larger write
    chunk size, it becomes a 93% speedup).

    The new rate based throttling can support 1000+ dd's with excellent
    smoothness, low latency and low overheads.

    For the above reasons, it's much better to do IO-less and low latency
    pauses in balance_dirty_pages().

    Jan Kara, Dave Chinner and I explored a scheme to let
    balance_dirty_pages() wait for enough writeback IO completions to
    safeguard the dirty limit. However it was found to have two problems:

    - in large NUMA systems, the per-cpu counters may have big accounting
    errors, leading to big throttle wait times and jitters.

    - NFS may kill a large amount of unstable pages with one single COMMIT.
    Because the NFS server serves COMMITs with expensive fsync() IOs, it is
    desirable to delay and reduce the number of COMMITs. So such bursty IO
    completions are not likely to be optimized away, nor are the resulting
    large (and tiny) stall times in IO completion based throttling.

    So here is a pause time oriented approach, which tries to control the
    pause time of each balance_dirty_pages() invocation, by controlling
    the number of pages dirtied before calling balance_dirty_pages(), for
    smooth and efficient dirty throttling:

    - avoid useless (eg. zero pause time) balance_dirty_pages() calls
    - avoid too small pause time (less than 4ms, which burns CPU power)
    - avoid too large pause time (more than 200ms, which hurts responsiveness)
    - avoid big fluctuations of pause times

    It can control pause times at will. The default policy (in a followup
    patch) will be to do ~10ms pauses in the 1-dd case, and increase to
    ~100ms in the 1000-dd case.
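
    A user-space sketch of one pause-time oriented throttle step as
    described above (all names and numbers are illustrative, not the
    in-tree code): the dirtier never submits IO itself, it only sleeps in
    proportion to the pages it dirtied and relies on the flusher thread.

        #include <unistd.h>

        struct throttle_state {
                unsigned long dirty_ratelimit;  /* pages/s, per-bdi base rate */
                double pos_ratio;               /* position feedback, ~1.0 at setpoint */
        };

        static void balance_dirty_pages_sketch(const struct throttle_state *st,
                                               unsigned long pages_dirtied)
        {
                double task_ratelimit = st->dirty_ratelimit * st->pos_ratio;
                double pause_s = pages_dirtied / (task_ratelimit + 1);

                if (pause_s > 0.2)              /* MAX_PAUSE = 200ms */
                        pause_s = 0.2;
                if (pause_s > 0.004)            /* skip sub-4ms pauses: they burn CPU */
                        usleep((useconds_t)(pause_s * 1e6));
        }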

    BEHAVIOR CHANGE
    ===============

    (1) dirty threshold

    Users will notice that applications get throttled once they cross the
    global (background + dirty)/2 = 15% threshold, and are then balanced
    around 17.5%. Before this patch, the behavior was to just throttle at
    20% dirtyable memory in the 1-dd case.

    Since a task will be soft throttled earlier than before, it may be
    perceived by end users as a performance "slow down" if their
    application happens to dirty more than 15% of dirtyable memory.

    (2) smoothness/responsiveness

    Users will notice a more responsive system during heavy writeback.
    "killall dd" will take effect instantly.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Add two fields to task_struct.

    1) account dirtied pages in the individual tasks, for accuracy
    2) per-task balance_dirty_pages() call intervals, for flexibility

    The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
    scale roughly as the square root of the safety gap between the dirty
    pages and the threshold.

    The main problem of a per-task nr_dirtied is that if 1k+ tasks start
    dirtying pages at exactly the same time, each task will be assigned a
    large initial nr_dirtied_pause, so that the dirty threshold will be
    exceeded long before each task reaches its nr_dirtied_pause and hence
    calls balance_dirty_pages().

    The solution is to watch for the number of pages dirtied on each CPU in
    between the calls into balance_dirty_pages(). If it exceeds
    ratelimit_pages (3% of the dirty threshold), force a call to
    balance_dirty_pages() for a chance to set bdi->dirty_exceeded. In normal
    situations, this safeguarding condition is not expected to trigger at
    all.

    On the sqrt in dirty_poll_interval():

    It will serve as an initial guess when dirty pages are still in the
    freerun area.

    When dirty pages are floating inside the dirty control scope [freerun,
    limit], a followup patch will use some refined dirty poll interval to
    get the desired pause time.

    thresh-dirty (MB)    sqrt
                    1      16
                    2      22
                    4      32
                    8      45
                   16      64
                   32      90
                   64     128
                  128     181
                  256     256
                  512     362
                 1024     512

    The above table means that, given a 1MB (or 1GB) gap and the dd tasks
    polling balance_dirty_pages() every 16 (or 512) pages, the dirty limit
    won't be exceeded as long as there are fewer than 16 (or 512)
    concurrent dd's.

    So sqrt naturally leads to less overhead and more safe concurrent tasks
    for large memory servers, which have large (thresh - freerun) gaps.
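
    A minimal sketch of how such a near-sqrt poll interval could be
    computed (the in-tree helper may use a cheaper approximation; the
    values below match the table when the gap is expressed in 4KB pages):

        #include <math.h>

        /* pages to dirty before the next balance_dirty_pages() call */
        static unsigned long dirty_poll_interval(unsigned long dirty,
                                                 unsigned long thresh)
        {
                if (thresh > dirty)
                        return (unsigned long)sqrt((double)(thresh - dirty));
                return 1;
        }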

    peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

    CC: Peter Zijlstra
    Reviewed-by: Andrea Righi
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • There are some imperfections in balanced_dirty_ratelimit.

    1) large fluctuations

    The dirty_rate used for computing balanced_dirty_ratelimit is merely
    averaged over the past 200ms (very short compared to the 3s estimation
    period for write_bw), which makes for a rather dispersed distribution
    of balanced_dirty_ratelimit.

    It's pretty hard to average out the singular points by increasing the
    estimation period. Considering that the averaging technique will
    introduce very undesirable time lags, I gave it up entirely. (btw, the
    3s write_bw averaging time lag is much more acceptable because its
    impact is one-way and therefore won't lead to oscillations.)

    The more practical way is filtering -- most singular
    balanced_dirty_ratelimit points can be filtered out by remembering some
    prev_balanced_rate and prev_prev_balanced_rate. However the more
    reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

    2) Due to truncates and fs redirties, the match between write_bw and
    dirty_rate could become unbalanced, which may lead to large systematic
    errors in balanced_dirty_ratelimit. The truncates, due to their
    possibly bumpy nature, can hardly be compensated for smoothly. So
    let's face it. When some over-estimated balanced_dirty_ratelimit
    brings dirty_ratelimit high, dirty pages will go higher than the
    setpoint. task_ratelimit will in turn become lower than
    dirty_ratelimit. So if we consider both balanced_dirty_ratelimit and
    task_ratelimit and update dirty_ratelimit only when they are on the
    same side of dirty_ratelimit, the systematic errors in
    balanced_dirty_ratelimit won't be able to drag dirty_ratelimit far
    away.

    The balanced_dirty_ratelimit estimation may also be inaccurate near
    @limit or @freerun, however that is less of an issue.

    3) since we ultimately want to

    - keep the fluctuations of the task ratelimit as small as possible
    - keep the dirty pages around the setpoint for as long as possible

    the update policy used for (2) also serves the above goals nicely:
    if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
    and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
    there is no point in bringing up dirty_ratelimit in a hurry, only to
    hurt both of the above goals.

    So, we make use of task_ratelimit to limit the update of dirty_ratelimit
    in two ways:

    1) avoid changing dirty rate when it's against the position control target
    (the adjusted rate will slow down the progress of dirty pages going
    back to setpoint).

    2) limit the step size. task_ratelimit changes its value step by step,
    leaving a consistent trail compared to the randomly jumping
    balanced_dirty_ratelimit. task_ratelimit also has nicely small errors
    in the stable state and typically larger errors when there are big
    errors in the rate. So it's a pretty good limiting factor for the step
    size of dirty_ratelimit.

    Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
    task_ratelimit is merely used as a limiting factor.
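
    A heavily simplified sketch of that update policy (illustrative, not
    the in-tree code, which also scales the step): only move
    dirty_ratelimit towards balanced_dirty_ratelimit when task_ratelimit
    agrees on the direction, and let task_ratelimit bound the step.

        static unsigned long next_dirty_ratelimit(unsigned long dirty_ratelimit,
                                                  unsigned long balanced_rate,
                                                  unsigned long task_ratelimit)
        {
                if (balanced_rate > dirty_ratelimit &&
                    task_ratelimit > dirty_ratelimit)       /* both say: go up */
                        return balanced_rate < task_ratelimit ?
                               balanced_rate : task_ratelimit;

                if (balanced_rate < dirty_ratelimit &&
                    task_ratelimit < dirty_ratelimit)       /* both say: go down */
                        return balanced_rate > task_ratelimit ?
                               balanced_rate : task_ratelimit;

                return dirty_ratelimit;                     /* disagreement: hold */
        }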

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
    when there are N dd tasks.

    On write() syscall, use bdi->dirty_ratelimit
    ============================================

    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }

    On every 200ms, update bdi->dirty_ratelimit
    ===========================================

    bdi_update_dirty_ratelimit()
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
        bdi->dirty_ratelimit = balanced_dirty_ratelimit;
    }

    Estimation of balanced bdi->dirty_ratelimit
    ===========================================

    balanced task_ratelimit
    -----------------------

    balance_dirty_pages() needs to throttle tasks dirtying pages such that
    the total amount of dirty pages stays below the specified dirty limit in
    order to avoid memory deadlocks. Furthermore we desire fairness in that
    tasks get throttled proportionally to the amount of pages they dirty.

    IOW we want to throttle tasks such that we match the dirty rate to the
    writeout bandwidth; this yields a stable amount of dirty pages:

    dirty_rate == write_bw (1)

    The fairness requirement gives us:

    task_ratelimit = balanced_dirty_ratelimit
    == write_bw / N (2)

    where N is the number of dd tasks. We don't know N beforehand, but
    still can estimate balanced_dirty_ratelimit within 200ms.

    Start by throttling each dd task at rate

    task_ratelimit = task_ratelimit_0 (3)
    (any non-zero initial value is OK)

    After 200ms, we measured

    dirty_rate = # of pages dirtied by all dd's / 200ms
    write_bw = # of pages written to the disk / 200ms

    For the aggressive dd dirtiers, the equality holds

    dirty_rate == N * task_rate
               == N * task_ratelimit_0                                  (4)
    Or
    task_ratelimit_0 == dirty_rate / N                                  (5)

    Now we conclude that the balanced task ratelimit can be estimated by

                                                    write_bw
    balanced_dirty_ratelimit = task_ratelimit_0 * ----------            (6)
                                                   dirty_rate

    Because with (4) and (5) we can get the desired equality (1):

                                                     write_bw
    balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                    dirty_rate
                             == write_bw / N

    Then using the balanced task ratelimit we can compute task pause times like:

    task_pause = task->nr_dirtied / task_ratelimit
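
    A self-contained sketch of formula (6) (names follow this changelog,
    not the kernel; all rates are in pages per second and the real code
    uses 64-bit fixed-point arithmetic with extra guards):

        #include <stdint.h>

        /* estimate the balanced per-task rate from the previous ratelimit
         * and the write_bw / dirty_rate measured over the last 200ms */
        static uint64_t balanced_ratelimit(uint64_t task_ratelimit_0,
                                           uint64_t write_bw, uint64_t dirty_rate)
        {
                if (!dirty_rate)
                        return task_ratelimit_0;   /* nothing dirtied: keep rate */
                return task_ratelimit_0 * write_bw / dirty_rate;
        }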

    task_ratelimit with position control
    ------------------------------------

    However, while the above gives us a means of matching the dirty rate
    to the writeout bandwidth, it at best provides us with a stable dirty
    page count (assuming a static system). In order to control the dirty
    page count such that it is high enough to provide performance, but
    does not exceed the specified limit, we need another control.

    The dirty position control works by extending (2) to

    task_ratelimit = balanced_dirty_ratelimit * pos_ratio (7)

    where pos_ratio is a negative feedback function that is subject to

    1) f(setpoint) = 1.0
    2) df/dx < 0

    That is, if the dirty pages are ABOVE the setpoint, we throttle each
    task a bit more HEAVILY than balanced_dirty_ratelimit, so that the
    dirty pages are created more slowly than they are cleaned, and thus
    DROP towards the setpoint (and vice versa).

    Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
    remain CONSTANT for the past 200ms, we get

    task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio (8)

    Putting (8) into (6), we get the formula used in
    bdi_update_dirty_ratelimit():

                                             write_bw
    balanced_dirty_ratelimit *= pos_ratio * ----------                  (9)
                                            dirty_rate

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • No behavior change.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
    that the resulting task rate limit can drive the dirty pages back to the
    global/bdi setpoints.

    Old scheme is,
                                              |
                 free run area                |         throttle area
      ----------------------------------------+---------------------------->
                                        thresh^          dirty pages

    New scheme is,

    ^ task rate limit
    |
    |            *
    |             *
    |              *
    |[free run]      *      [smooth throttled]
    |                  *
    |                     *
    |                         *
    ..bdi->dirty_ratelimit..........*
    |                               .    *
    |                               .         *
    |                               .             *
    |                               .                 *
    |                               .                     *
    +-------------------------------.-----------------------*------------>
                            setpoint^                  limit^  dirty pages

    The slope of the bdi control line should be

    1) large enough to pull the dirty pages to setpoint reasonably fast

    2) small enough to avoid big fluctuations in the resulting pos_ratio and
    hence the task ratelimit

    Since the fluctuation range of the bdi dirty pages is typically observed
    to be within 1-second worth of data, the bdi control line's slope is
    selected to be a linear function of bdi write bandwidth, so that it can
    adapt to slow/fast storage devices well.

    Assume the bdi control line

    pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

    where k is the negative slope.

    If we target a 12.5% fluctuation range in pos_ratio when the dirty
    pages fluctuate within the range

    [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

    we get the slope

    k = - 1 / (8 * write_bw)

    Setting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

    x_intercept = bdi_setpoint + 8 * write_bw
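
    A floating-point sketch of this bdi control line (the in-tree code
    uses fixed-point arithmetic; write_bw is roughly "one second worth of
    pages", matching the fluctuation range above):

        /* pos_ratio = 1 + k * (bdi_dirty - bdi_setpoint), k = -1/(8*write_bw) */
        static double bdi_pos_ratio_sketch(double bdi_dirty, double bdi_setpoint,
                                           double write_bw)
        {
                double x_intercept = bdi_setpoint + 8 * write_bw;
                double ratio = (x_intercept - bdi_dirty) /
                               (x_intercept - bdi_setpoint);

                return ratio > 0 ? ratio : 0;   /* fully throttle past x_intercept */
        }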

    The global/bdi slopes are nicely complementing each other when the
    system has only one major bdi (indicated by bdi_thresh ~= thresh):

    1) slope of global control line => scaling to the control scope size
    2) slope of main bdi control line => scaling to the writeout bandwidth

    so that

    - in memory tight systems, (1) becomes strong enough to squeeze dirty
    pages inside the control scope

    - in large memory systems where the "gravity" of (1) for pulling the
    dirty pages to setpoint is too weak, (2) can back (1) up and drive
    dirty pages to bdi_setpoint ~= setpoint reasonably fast.

    Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
    is related to memory size due to the interferences between disks. In
    this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.

    Given equations

    span = x_intercept - bdi_setpoint
    k = df/dx = - 1 / span

    and the extremum values

    span = bdi_thresh
    dx = bdi_thresh

    we get

    df = - dx / span = - 1.0

    That means, when bdi_dirty deviates upwards by a full bdi_thresh,
    pos_ratio and hence the task ratelimit will fluctuate by -100%.

    peter: use 3rd order polynomial for the global control line

    CC: Peter Zijlstra
    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce the BDI_DIRTIED counter. It will be used for estimating the
    bdi's dirty bandwidth.

    CC: Jan Kara
    CC: Michael Rubin
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

19 Aug, 2011

1 commit

  • Revert the pass-good area introduced in ffd1f609ab10 ("writeback:
    introduce max-pause and pass-good dirty limits") and make the max-pause
    area smaller and safe.

    This fixes a ~30% performance regression in the ext3 data=writeback
    fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
    12 JBOD disks and each disk runs 8 concurrent tasks doing reads+writes.

    Using the deadline scheduler also shows a regression, but not as big as
    with CFQ, so this suggests we have some write starvation.

    The test logs show that

    - the disks are sometimes underutilized

    - global dirty pages sometimes rush high into the pass-good area for
    several hundred seconds, while in the mean time some bdi dirty pages
    drop to a very low value (bdi_dirty << bdi_thresh). Then suddenly the
    global dirty pages drop under the global dirty threshold and bdi_dirty
    rushes very high (for example, 2 times higher than bdi_thresh). During
    this time balance_dirty_pages() is not called at all.

    So the problems are

    1) The random writes progress so slowly that they break the assumption
    of the max-pause logic that "8 pages per 200ms is typically more than
    enough to curb heavy dirtiers".

    2) The max-pause logic ignores task_bdi_thresh and thus opens up the
    possibility for some bdi's to dirty too many pages, leading to
    (bdi_dirty >> bdi_thresh) for them and then (bdi_thresh >> bdi_dirty)
    for the others.

    3) The higher max-pause/pass-good thresholds somehow lead to a bad
    swing of dirty pages.

    The fix is to allow the task to slightly dirty over task_bdi_thresh, but
    no way to exceed bdi_dirty and/or global dirty_thresh.

    Tests show that it fixed the JBOD regression completely (both behavior
    and performance), while still being able to cut down large pause times
    in balance_dirty_pages() for single-disk cases.

    Reported-by: Li Shaohua
    Tested-by: Li Shaohua
    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

2 commits


24 Jul, 2011

1 commit

  • We set bdi->dirty_exceeded (and thus the ratelimiting code starts to
    call balance_dirty_pages() every 8 pages) when a per-bdi limit is
    exceeded or the global limit is exceeded. But the per-bdi limit also
    depends on the task. Thus different tasks reach the limit on that bdi
    at different levels of dirty pages. The result is that with the current
    code bdi->dirty_exceeded ping-pongs between 1 and 0 depending on which
    task just got into balance_dirty_pages().

    We fix the issue by clearing bdi->dirty_exceeded only when the per-bdi
    amount of dirty pages drops below the threshold (7/8 * bdi_dirty_limit),
    where the task limits no longer have any influence.
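
    A sketch of that clearing condition, as a fragment of
    balance_dirty_pages() (illustrative; the in-tree code also factors in
    the global threshold):

        if (bdi->dirty_exceeded &&
            bdi_dirty <= bdi_dirty_limit - bdi_dirty_limit / 8)
                bdi->dirty_exceeded = 0;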

    Impact: the end result is that the dirty pages are kept more tightly
    under control, with the average number slightly lower than before. This
    reduces the risk of throttling light dirtiers and hence makes the system
    more responsive. However it may add overhead by enforcing
    balance_dirty_pages() calls every 8 pages when there are 2+ heavy
    dirtiers.

    CC: Andrew Morton
    CC: Christoph Hellwig
    CC: Dave Chinner
    CC: Peter Zijlstra
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

10 Jul, 2011

7 commits

  • Add trace event balance_dirty_state for showing the global dirty page
    counts and thresholds at each global_dirty_limits() invocation. This
    will cover the callers throttle_vm_writeout(), over_bground_thresh()
    and each balance_dirty_pages() loop.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The max-pause limit helps to keep the sleep time inside
    balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
    a per-task rate limit of 8 pages/200ms = 160KB/s when dirty exceeded,
    which normally is enough to stop dirtiers from continuing to push the
    dirty pages high, unless there is a sufficiently large number of slow
    dirtiers (eg. 500 tasks each doing 160KB/s will still sum up to 80MB/s,
    exceeding the write bandwidth of a slow disk and hence accumulating
    more and more dirty pages).

    The pass-good limit helps to let go of the good bdi's in the presence of
    a blocked bdi (ie. NFS server not responding) or a slow USB disk which
    for some reason builds up a large number of initial dirty pages that
    refuse to go away anytime soon.

    For example, given two bdi's A and B and the initial state

    bdi_thresh_A = dirty_thresh / 2
    bdi_thresh_B = dirty_thresh / 2
    bdi_dirty_A = dirty_thresh / 2
    bdi_dirty_B = dirty_thresh / 2

    Then A gets blocked; after a dozen seconds

    bdi_thresh_A = 0
    bdi_thresh_B = dirty_thresh
    bdi_dirty_A = dirty_thresh / 2
    bdi_dirty_B = dirty_thresh / 2

    The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
    will be effectively throttled by the condition (nr_dirty < dirty_thresh).
    This has two problems:
    (1) we lose the protection for light dirtiers
    (2) balance_dirty_pages() effectively becomes IO-less because the
    (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
    for IO, but balance_dirty_pages() loses an important way to break
    out of the loop, which leads to more spread out throttle delays.

    DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is
    that DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the
    above example, while this patch uses the more conservative value 8 so as
    not to surprise people with more dirty pages than expected.
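
    A sketch of the pass-good idea, as a fragment of the throttle loop
    (the exact in-tree condition may differ; DIRTY_PASSGOOD_AREA divides
    dirty_thresh as described above):

        /*
         * pass-good area: when the global dirty count is only moderately
         * over the limit, let tasks on a healthy bdi (bdi_dirty well below
         * bdi_thresh) break out instead of being held back by a stuck bdi.
         */
        if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_PASSGOOD_AREA &&
            bdi_dirty < bdi_thresh)
                break;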

    The max-pause limit won't noticeably impact the speed at which dirty
    pages are knocked down when there is a sudden drop of the global/bdi
    dirty thresholds, because the heavy dirtiers will be throttled below
    160KB/s, which is slow enough. It does help to avoid long dirty throttle
    delays and especially will make light dirtiers more responsive.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The start of a heavy weight application (eg. KVM) may instantly knock
    down determine_dirtyable_memory() if swap is not enabled or is full.
    global_dirty_limits() and bdi_dirty_limit() will in turn compute
    global/bdi dirty thresholds that are _much_ lower than the current
    global/bdi dirty pages.

    balance_dirty_pages() will then heavily throttle all dirtiers including
    the light ones, until the dirty pages drop below the new dirty thresholds.
    During this _deep_ dirty-exceeded state, the system may appear rather
    unresponsive to the users.

    About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
    threshold to heavy dirtiers than light ones, and the dirty pages will
    be throttled around the heavy dirtiers' dirty threshold and reasonably
    below the light dirtiers' dirty threshold. In this state, only the heavy
    dirtiers will be throttled and the dirty pages are carefully controlled
    to not exceed the light dirtiers' dirty threshold. However if the
    threshold itself suddenly drops below the number of dirty pages, the
    light dirtiers will get heavily throttled.

    So introduce global_dirty_limit for tracking the global dirty threshold
    with policies

    - follow downwards slowly
    - follow up in one shot

    global_dirty_limit can effectively mask out the impact of a sudden drop
    in dirtyable memory. It will be used in the next patch for two new types
    of dirty limits. Note that the new dirty limits are not going to avoid
    throttling the light dirtiers, but they could limit their sleep time to
    200ms.
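
    Roughly the shape of the tracking policy described above, as a
    kernel-style sketch (assuming it is called from the periodic bandwidth
    update path):

        static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
        {
                unsigned long limit = global_dirty_limit;

                if (limit < thresh) {           /* follow up in one shot */
                        limit = thresh;
                        goto update;
                }

                /* follow down slowly, and never go below the dirty pages */
                thresh = max(thresh, dirty);
                if (limit > thresh) {
                        limit -= (limit - thresh) >> 5;
                        goto update;
                }
                return;
        update:
                global_dirty_limit = limit;
        }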

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce

    nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS

    in order to simplify many tests in the following patches.

    balance_dirty_pages() will eventually care only about the dirty sums
    besides nr_writeback.

    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The estimation value starts at 100MB/s and adapts to the real
    bandwidth within seconds.

    It tries to update the bandwidth only when the disk is fully utilized.
    Any inactive period of more than one second will be skipped.

    The estimated bandwidth reflects how fast the device can write out when
    _fully utilized_, and won't drop to 0 when it goes idle. The value
    remains constant while the disk is idle. At busy write time, if not
    considering fluctuations, it will also remain high unless knocked down
    by possible concurrent reads that compete with the async writes for
    disk time and bandwidth.

    The estimation is not done purely in the flusher because there is no
    guarantee that write_cache_pages() will return in time to update the
    bandwidth.

    The bdi->avg_write_bandwidth smoothing is very effective for filtering
    out sudden spikes, however it may be a little biased in the long term.

    The overheads are low because the bdi bandwidth update only occurs at
    200ms intervals.

    The 200ms update interval is suitable because it's not possible to get
    the real bandwidth at any given instant anyway, due to large
    fluctuations.

    The NFS commits can be as large as seconds worth of data. One XFS
    completion may be as large as half a second worth of data if we are
    going to increase the write chunk to half a second worth of data. In
    ext4, fluctuations with a time period of around 5 seconds are observed.
    And there is another pattern of irregular periods of up to 20 seconds
    in SSD tests.

    That's why we are not only doing the estimation at 200ms intervals, but
    also averaging the samples over a period of 3 seconds and then going
    further to do another level of smoothing in avg_write_bandwidth.
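
    A simplified illustration of the two-level smoothing (not the exact
    in-tree algorithm; the struct and helper names are assumed, and rates
    are in pages per second):

        struct bdi_bw_sketch {
                unsigned long write_bandwidth;      /* ~3s weighted average */
                unsigned long avg_write_bandwidth;  /* extra long-term smoothing */
        };

        static void update_write_bandwidth(struct bdi_bw_sketch *bdi,
                                           unsigned long elapsed_ms,
                                           unsigned long written_pages)
        {
                const unsigned long period = 3000;      /* ~3s, in ms */
                unsigned long bw;

                if (!elapsed_ms)
                        return;                         /* nothing to fold in yet */
                bw = written_pages * 1000 / elapsed_ms; /* pages per second */

                /* fold the ~200ms sample into the 3s weighted average */
                if (elapsed_ms >= period)
                        bdi->write_bandwidth = bw;
                else
                        bdi->write_bandwidth =
                                (bdi->write_bandwidth * (period - elapsed_ms) +
                                 bw * elapsed_ms) / period;

                /* second level of smoothing, to filter out sudden spikes */
                bdi->avg_write_bandwidth =
                        (bdi->avg_write_bandwidth * 7 + bdi->write_bandwidth) / 8;
        }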

    CC: Li Shaohua
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce the BDI_WRITTEN counter. It will be used for estimating the
    bdi's write bandwidth.

    Peter Zijlstra :
    Move BDI_WRITTEN accounting into __bdi_writeout_inc().
    This will cover and fix fuse, which only calls bdi_writeout_inc().

    CC: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     
  • Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
    and initialize the struct writeback_control there.

    struct writeback_control is basically designed to control writeback of a
    single file, but we keep abusing it for writing multiple files in
    writeback_sb_inodes() and its callers.

    This immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
    work->nr_pages starts to make sense, and instead of saving and restoring
    pages_skipped in writeback_sb_inodes() it can always start with a clean
    zero value.

    It also makes a neat IO pattern change: large dirty files are now
    written in the full 4MB writeback chunk size, rather than whatever
    quota remained in wbc->nr_to_write.

    Acked-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

20 Jun, 2011

1 commit


08 Jun, 2011

2 commits

  • This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

    Notes about the tmpfs/ramfs behavior changes:

    For 2.6.36 and older kernels, tmpfs writes will sleep inside
    balance_dirty_pages() as long as we are over the (dirty+background)/2
    global throttle threshold. This is because both the dirty pages and the
    threshold will be 0 for tmpfs/ramfs, hence this test will always
    evaluate to TRUE:

    dirty_exceeded =
        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
        || (nr_reclaimable + nr_writeback >= dirty_thresh);

    For 2.6.37, someone complained that the current logic does not allow
    users to set vm.dirty_ratio=0. So commit 4cbec4c8b9 changed the test to

    dirty_exceeded =
        (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
        || (nr_reclaimable + nr_writeback > dirty_thresh);

    So 2.6.37 behaves differently for tmpfs/ramfs: they will never get
    throttled unless the global dirty threshold is exceeded (which is very
    unlikely to happen; once it happens, it will block many tasks).

    I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
    that on a busy writing server, tmpfs write()s may get livelocked! The
    "inadvertent" throttling can hardly help any workload because of its
    "either no throttling, or get throttled to death" property.

    So based on 2.6.37, this patch won't bring any more noticeable changes.

    CC: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Reviewed-by: Minchan Kim
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Clarify the bdi_dirty_limit() comment.

    Acked-by: Peter Zijlstra
    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang