06 May, 2012

1 commit

  • This prevents global_dirty_limit from remaining 0 (the initial value)
    for a long time, since it's only updated in update_dirty_limit() when
    above the dirty freerun area.

    It will avoid unexpected consequences when some random code uses it as a
    convenient approximation of the global dirty threshold.

    Signed-off-by: Fengguang Wu

    Fengguang Wu
     

14 Apr, 2012

1 commit


29 Mar, 2012

1 commit

  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hoped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended up dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accunting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     

22 Mar, 2012

2 commits

  • Export 'dirty_writeback_interval' to make it visible to
    file-systems. We are going to push superblock management down to
    file-systems and get rid of the 'sync_supers' kernel thread completely.

    Signed-off-by: Artem Bityutskiy
    Cc: Al Viro
    Signed-off-by: "Theodore Ts'o"

    Artem Bityutskiy
     
  • When starting a memory hog task, a desktop box w/o swap is found to go
    unresponsive for a long time. It's solely caused by lots of congestion
    waits in throttle_vm_writeout():

    gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000

    The root cause is that the dirty threshold is knocked down a lot by the
    memory hog task. Fixed by using global_dirty_limit, which decreases
    gradually on such events and guarantees we stay above (the also
    decreasing) nr_dirty in the process of following down to the new dirty
    threshold.
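
    The "decreases gradually" behaviour can be sketched as below. This is
    an illustration only, not the kernel's update_dirty_limit(): the
    function name track_dirty_limit() and the 1/32 step size are
    assumptions made for the example.

```c
#include <assert.h>

/*
 * Hypothetical sketch: follow the dirty threshold upward at once, but
 * downward only in small steps, and never aim below nr_dirty.
 * The 1/32 step (>> 5) is an assumption chosen for illustration.
 */
static unsigned long track_dirty_limit(unsigned long limit,
				       unsigned long thresh,
				       unsigned long nr_dirty)
{
	/* Never aim below the pages that are already dirty. */
	unsigned long target = thresh > nr_dirty ? thresh : nr_dirty;

	if (limit < thresh)
		return thresh;			/* follow up immediately */
	if (limit > target)
		limit -= (limit - target) >> 5;	/* follow down slowly */
	return limit;
}
```

    Called once per update period, a limit of 3200 pages with a threshold
    that has collapsed to 0 drops by only 100 pages on the first step, so
    a caller comparing nr_dirty against it is not suddenly over the limit.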

    Signed-off-by: Fengguang Wu
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

11 Jan, 2012

5 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.
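
    A minimal sketch of the per-zone limit described above, assuming a
    simplified zone with four page counters. The struct and function names
    are hypothetical and only mirror the arithmetic in the text: dirtyable
    memory excludes the reserve, and the limit is a percentage of it.

```c
#include <assert.h>

/* Hypothetical per-zone counters, all in pages. */
struct zone_counts {
	unsigned long free;
	unsigned long reclaimable;	/* file pages on the LRU lists */
	unsigned long reserve;		/* lowmem reserve + high watermark */
	unsigned long dirty;
};

/* Pages in this zone that may be dirtied at all. */
static unsigned long zone_dirtyable(const struct zone_counts *z)
{
	unsigned long total = z->free + z->reclaimable;

	return total > z->reserve ? total - z->reserve : 0;
}

/* The zone's share of the dirty limit, given a ratio in percent. */
static unsigned long zone_dirty_limit(const struct zone_counts *z,
				      unsigned int dirty_ratio)
{
	return zone_dirtyable(z) * dirty_ratio / 100;
}

/* Non-zero while the zone has not exceeded its dirty limit. */
static int zone_dirty_ok(const struct zone_counts *z, unsigned int dirty_ratio)
{
	return z->dirty <= zone_dirty_limit(z, dirty_ratio);
}
```

    With 1000 free and 3000 reclaimable pages, a reserve of 500 and a 40%
    ratio, the zone may hold up to 1400 dirty pages.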

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.
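
    The placement decision and its fallback might look roughly like this.
    It is a sketch under strong simplifications: pick_zone() and struct
    zone_state are hypothetical, free-page watermarks are ignored, and
    "ignore the dirty limits" is modelled as falling back to the first
    zone of the zonelist.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of a zone: dirty pages and dirty limit. */
struct zone_state {
	unsigned long dirty;
	unsigned long dirty_limit;
};

/*
 * Return the index of the zone to allocate from, or -1 for an empty
 * zonelist. For write allocations the first pass skips zones that are
 * over their dirty limit; if every zone is over, the limits are ignored.
 */
static int pick_zone(const struct zone_state *zonelist, size_t nr_zones,
		     int gfp_write)
{
	size_t i;

	if (nr_zones == 0)
		return -1;
	if (!gfp_write)
		return 0;	/* dirty limits only apply to write allocations */

	for (i = 0; i < nr_zones; i++)
		if (zonelist[i].dirty <= zonelist[i].dirty_limit)
			return (int)i;

	return 0;		/* all zones over limit: ignore dirty limits */
}
```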

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

               seconds (stddev)   nr_vmscan_write
                                          min |      median |         max
    xfs
    vanilla:   549.747 ( 3.492)         0.000 |       0.000 |       0.000
    patched:   550.996 ( 3.802)         0.000 |       0.000 |       0.000

    fuse-ntfs
    vanilla:  1183.094 (53.178)     54349.000 |   59341.000 |   65163.000
    patched:   558.049 (17.914)         0.000 |       0.000 |      43.000

    btrfs
    vanilla:   573.679 (14.015)    156657.000 |  460178.000 |  606926.000
    patched:   563.365 (11.368)         0.000 |       0.000 |    1362.000

    ext4
    vanilla:   561.197 (15.782)         0.000 | 2725438.000 | 4143837.000
    patched:   568.806 (17.496)         0.000 |       0.000 |       0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The next patch will introduce per-zone dirty limiting functions in
    addition to the traditional global dirty limiting.

    Rename determine_dirtyable_memory() to global_dirtyable_memory() before
    adding the zone-specific version, and fix up its documentation.

    Also, move the functions to determine the dirtyable memory and the
    function to calculate the dirty limit based on that together so that their
    relationship is more apparent and that they can be commented on as a
    group.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |                        |
    +----------+                       --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.
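
    As a rough sketch of the accounting described here, assuming a reduced
    zone description: the struct, field and function names below are
    hypothetical, and only the arithmetic (reserve = lowmem reserve + high
    watermark, summed over zones and subtracted globally) comes from the
    text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced zone description (all values in pages). */
struct zone_reserve {
	unsigned long lowmem_reserve;	/* unavailable to page cache */
	unsigned long high_wmark;	/* kswapd tries to keep this free */
};

/* Pages of one zone that should never be counted as dirtyable. */
static unsigned long zone_dirty_reserve(const struct zone_reserve *z)
{
	return z->lowmem_reserve + z->high_wmark;
}

/* Global dirtyable memory: free + reclaimable minus the summed reserves. */
static unsigned long global_dirtyable(unsigned long free_pages,
				      unsigned long reclaimable_pages,
				      const struct zone_reserve *zones,
				      size_t nr_zones)
{
	unsigned long total = free_pages + reclaimable_pages;
	unsigned long reserve = 0;
	size_t i;

	for (i = 0; i < nr_zones; i++)
		reserve += zone_dirty_reserve(&zones[i]);

	return total > reserve ? total - reserve : 0;
}
```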

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The tracing ring-buffer used this function briefly, but not anymore.
    Make it local to the writeback code again.

    Also, move the function so that no forward declaration needs to be
    reintroduced.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

18 Dec, 2011

8 commits

  • Add an upper limit to balanced_rate according to the below inequality.
    This filters out some rare but huge singular points, which at least
    enables more readable gnuplot figures.

    When there are N dd dirtiers,

    balanced_dirty_ratelimit = write_bw / N

    So it holds that

    balanced_dirty_ratelimit <= write_bw
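
    A sketch of such an upper limit follows. The estimate
    task_ratelimit * write_bw / dirty_rate used here is an assumption for
    illustration (it does not appear in the text above), as is the
    function name; the point is only that a near-zero measured dirty rate
    produces a huge estimate, which the cap at write_bw filters out.

```c
#include <assert.h>

/*
 * Hypothetical sketch: estimate the balanced rate and cap it at the
 * device write bandwidth. All rates are in pages per second.
 */
static unsigned long balanced_rate(unsigned long task_ratelimit,
				   unsigned long write_bw,
				   unsigned long dirty_rate)
{
	/* "| 1" avoids a division by zero when no pages were dirtied. */
	unsigned long long rate =
		(unsigned long long)task_ratelimit * write_bw / (dirty_rate | 1);

	/* With N >= 1 dirtiers, the balanced rate cannot exceed write_bw. */
	return rate > write_bw ? write_bw : (unsigned long)rate;
}
```
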
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • This helps to reduce dirty throttling polls and hence CPU overheads.

    bdi->dirty_exceeded typically only helps when suddenly starting 100+
    dd's on a disk, in which case the dd's may need to poll
    balance_dirty_pages() earlier than tsk->nr_dirtied_pause.

    CC: Jan Kara
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The LKP tests see a big 56% regression for the case fio_mmap_randwrite_64k.
    Shaohua managed to root-cause it to the much smaller dirty pause times
    and hence much more frequent invocations of the IO-less balance_dirty_pages().
    Since fio_mmap_randwrite_64k effectively contains both reads and writes,
    the more frequent pauses triggered more idling in the cfq IO scheduler.

    The solution is to increase the pause time all the way up to the max 200ms
    in this case, which is found to restore most of the performance. This will
    help reduce CPU overheads in other cases, too.

    Note that I don't expect many performance-critical workloads to run this
    access pattern: the mmap read-on-write is rather inefficient and could
    be avoided by doing normal write syscalls.

    CC: Jan Kara
    CC: Peter Zijlstra
    Reported-by: Li Shaohua
    Tested-by: Li Shaohua
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Control the pause time and the call intervals to balance_dirty_pages()
    with three parameters:

    1) max_pause, limited by bdi_dirty and MAX_PAUSE

    2) the target pause time, grows with the number of dd tasks
    and is normally limited by max_pause/2

    3) the minimal pause, set to half the target pause
    and is used to skip short sleeps and accumulate them into bigger ones

    The typical behaviors after the patch:

    - if task_ratelimit is ever far below dirty_ratelimit, the pause time
      will remain constant at max_pause and nr_dirtied_pause will
      fluctuate with task_ratelimit

    - in the normal cases, nr_dirtied_pause will remain stable (keeping
      pace with dirty_ratelimit) and the pause time will fluctuate
      with task_ratelimit

    In summary, someone has to fluctuate with task_ratelimit, because

    task_ratelimit = nr_dirtied_pause / pause

    We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
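
    The trade-off between the two quantities can be sketched as follows.
    This is not the kernel code: plan_pause() and struct throttle are
    hypothetical, rates are in pages per second and times in
    milliseconds, and the min/target pause handling is left out.

```c
#include <assert.h>

/* Hypothetical result: how many pages to dirty, then how long to sleep. */
struct throttle {
	unsigned long nr_dirtied_pause;	/* pages between two pauses */
	unsigned long pause_ms;		/* sleep time after those pages */
};

/*
 * Keep nr_dirtied_pause stable and let the pause absorb changes in
 * task_ratelimit, until the pause would exceed max_pause_ms. From then
 * on the pause is pinned and nr_dirtied_pause has to fluctuate instead.
 */
static struct throttle plan_pause(unsigned long task_ratelimit,
				  unsigned long nr_dirtied_pause,
				  unsigned long max_pause_ms)
{
	struct throttle t;

	assert(task_ratelimit > 0);
	t.nr_dirtied_pause = nr_dirtied_pause;
	t.pause_ms = nr_dirtied_pause * 1000 / task_ratelimit;

	if (t.pause_ms > max_pause_ms) {
		t.pause_ms = max_pause_ms;
		t.nr_dirtied_pause = task_ratelimit * max_pause_ms / 1000;
		if (t.nr_dirtied_pause == 0)
			t.nr_dirtied_pause = 1;	/* always allow some progress */
	}
	return t;
}
```

    At 1000 pages/s a task with nr_dirtied_pause = 50 sleeps 50ms per
    round; at 100 pages/s the same task would need 500ms, so the pause is
    pinned at 200ms and nr_dirtied_pause shrinks to 20.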

    The notable behavior changes are:

    - in stable workloads, there will no longer be sudden big trajectory
    switching of nr_dirtied_pause as concerned by Peter. It will be as
    smooth as dirty_ratelimit and changing proportionally with it (as
    always, assuming bdi bandwidth does not fluctuate across 2^N lines,
    otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)

    - in the rare cases when something keeps task_ratelimit far below
    dirty_ratelimit, the smoothness can no longer be retained and
    nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
    (not that destructive but still not good) bug that
    dirty_ratelimit gets brought down undesirably

    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Compensate the task's think time when computing the final pause time,
    so that ->dirty_ratelimit can be executed accurately.

    think time := time spent outside of balance_dirty_pages()

    In the rare case that the task slept longer than the 200ms period time
    (resulting in a negative pause time), the sleep time will be compensated in
    the following periods, too, if it's less than 1 second.

    Accumulated errors are carefully avoided as long as the max pause area
    is not hit.

    Pseudo code:

    period = pages_dirtied / task_ratelimit;
    think = jiffies - dirty_paused_when;
    pause = period - think;

    1) normal case: period > think

        pause = period - think
        dirty_paused_when = jiffies + pause
        nr_dirtied = 0

                         period time
          |===============================>|
              think time      pause time
          |===============>|==============>|
    ------|----------------|---------------|------------------------
    dirty_paused_when   jiffies

    2) no pause case: period <= think

                         period time
          |===============================>|
                              think time
          |===================================================>|
    ------|--------------------------------+-------------------|----
    dirty_paused_when                                       jiffies
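
    The pseudo code can be turned into a small compilable sketch. The
    struct, the function name and the tick rate are hypothetical. How the
    no-pause case carries the over-long think time into later periods
    (advancing dirty_paused_when by one period, and giving up beyond one
    second) is an assumption based on the description above.

```c
#include <assert.h>

#define TICKS_PER_SEC 1000L	/* hypothetical tick rate */

struct task_throttle {
	long dirty_paused_when;		/* tick at which the last period ended */
	unsigned long nr_dirtied;	/* pages dirtied since then */
};

/* Return the number of ticks to sleep now (0 means: do not pause). */
static long compute_pause(struct task_throttle *t, long now, long period)
{
	long think = now - t->dirty_paused_when;
	long pause = period - think;

	assert(period >= 0);
	if (pause > 0) {
		/* 1) normal case: sleep for the rest of the period */
		t->dirty_paused_when = now + pause;
	} else if (pause < -TICKS_PER_SEC) {
		/* slept far too long: forget the debt and start over */
		t->dirty_paused_when = now;
		pause = 0;
	} else {
		/* 2) no pause: the extra think time shortens later pauses */
		t->dirty_paused_when += period;
		pause = 0;
	}
	t->nr_dirtied = 0;
	return pause;
}
```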

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • De-account the accumulative dirty counters on page redirty.

    Page redirties (very common in ext4) will introduce mismatch between
    counters (a) and (b)

    a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
    b) NR_WRITTEN, BDI_WRITTEN

    This will introduce systematic errors in balanced_rate and result in
    dirty page position errors (ie. the dirty pages are no longer balanced
    around the global/bdi setpoints).

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • When dd writes in 512-byte chunks, generic_perform_write() calls
    balance_dirty_pages_ratelimited() 8 times for the same page, but
    obviously the page is only dirtied once.

    Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It's a years-old problem that a large number of short-lived dirtiers
    (eg. gcc instances in a fast kernel build) may starve long-running dirtiers
    (eg. dd) as well as push the dirty pages to the global hard limit.

    The solution is to charge the pages dirtied by the exited gcc to the
    other random dirtying tasks. It is not perfect, but it should
    behave well enough in practice: throttled tasks aren't
    actually running, so those that are running are more likely to pick it up
    and get throttled, therefore promoting an equal spread.
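
    A minimal sketch of the charging scheme, assuming a single shared
    pool of leaked page counts. The function names and the explicit pool
    pointer are hypothetical, chosen to keep the example testable; only
    the idea of handing an exited task's dirtied pages to whichever task
    dirties next comes from the text.

```c
#include <assert.h>

/* An exiting task leaves its not-yet-throttled dirtied pages behind. */
static void leak_on_exit(unsigned long *leak_pool, unsigned long nr_dirtied)
{
	*leak_pool += nr_dirtied;
}

/*
 * A running dirtier picks up leaked pages, but only as many as fit
 * below its own ratelimit. Returns the task's new nr_dirtied count.
 */
static unsigned long charge_leaks(unsigned long *leak_pool,
				  unsigned long nr_dirtied,
				  unsigned long ratelimit)
{
	if (*leak_pool > 0 && nr_dirtied < ratelimit) {
		unsigned long room = ratelimit - nr_dirtied;
		unsigned long take = *leak_pool < room ? *leak_pool : room;

		*leak_pool -= take;
		nr_dirtied += take;
	}
	return nr_dirtied;
}
```

    A task that reaches its ratelimit this way gets throttled sooner, on
    behalf of the short-lived tasks that exited before they could be.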

    Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Randy Dunlap
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

08 Dec, 2011

3 commits

  • Some traces show lots of bdi_dirty=0 lines where the real value would be
    small but non-zero if not for the accounting errors in the per-cpu bdi stats.

    In this case the max pause time should really be set to the smallest
    (non-zero) value to avoid IO queue underrun and improve throughput.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • On a system with 1 local mount and 1 NFS mount, if the NFS server
    stops responding while dd is writing to the NFS mount, the NFS dirty pages
    may exceed the global dirty limit and _every_ task involved in writing will
    be blocked. The whole system appears unresponsive.

    The workaround is to let through the bdi's that only have a small
    number of dirty pages. The number chosen (bdi_stat_error pages) is not
    enough to let the local disk run at optimal throughput, but it is
    enough to keep the system responsive on a broken NFS mount. The user can
    then kill the dirtiers on the NFS mount and increase the global dirty
    limit to bring up the local disk's throughput.

    It risks allowing dirty pages to grow much larger than the global dirty
    limit when there are 1000+ mounts, however that's very unlikely to happen,
    especially in low memory profiles.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • We do "floating proportions" to let active devices grow their target
    share of dirty pages and stalled/inactive devices decrease their target
    share over time.

    It works well except in the case of "an inactive disk suddenly goes
    busy", where the initial target share may be too small. To mitigate
    this, bdi_position_ratio() has the below line to raise a small
    bdi_thresh when it's safe to do so, so that the disk is fed with enough
    dirty pages for efficient IO and in turn fast rampup of bdi_thresh:

    bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

    balance_dirty_pages() normally does negative feedback control which
    adjusts ratelimit to balance the bdi dirty pages around the target.
    In some extreme cases when that is not enough, it will have to block
    the tasks completely until the bdi dirty pages drop below bdi_thresh.
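
    The quoted line can be wrapped into a small function to see its
    effect. The function name is hypothetical, and the guard against
    dirty exceeding limit is an addition for the unsigned arithmetic of
    this standalone example.

```c
#include <assert.h>

/*
 * Raise a small bdi_thresh to 1/8 of the remaining global headroom
 * (limit - dirty), so a newly busy disk gets enough dirty pages.
 */
static unsigned long raise_bdi_thresh(unsigned long bdi_thresh,
				      unsigned long limit,
				      unsigned long dirty)
{
	unsigned long headroom = limit > dirty ? limit - dirty : 0;
	unsigned long floor = headroom / 8;

	return bdi_thresh > floor ? bdi_thresh : floor;
}
```

    With a global limit of 1000 pages and 200 already dirty, a bdi whose
    share had decayed to 10 pages is lifted to 100; once the global dirty
    count approaches the limit the boost vanishes.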

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

17 Nov, 2011

2 commits

  • They are not used any more.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
    for every 4KB page, which means it cannot throttle a task below
    4KB/200ms=20KB/s. So when there are more than 512 dd tasks writing to a
    10MB/s USB stick, its bdi dirty pages could grow out of control.

    Even if we can increase MAX_PAUSE, the minimal (task_ratelimit = 1)
    means a limit of 4KB/s.
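
    The figures above follow from simple arithmetic, reproduced here as a
    sketch. The helper names are hypothetical; sizes are in KB, times in
    milliseconds and rates in KB/s.

```c
#include <assert.h>

/* Lowest rate a task can be throttled to: one page per maximum pause. */
static unsigned long min_throttle_rate_kbs(unsigned long page_kb,
					   unsigned long max_pause_ms)
{
	assert(max_pause_ms > 0);
	return page_kb * 1000 / max_pause_ms;
}

/* Number of such tasks a device can absorb before falling behind. */
static unsigned long max_throttled_tasks(unsigned long dev_bw_kbs,
					 unsigned long page_kb,
					 unsigned long max_pause_ms)
{
	return dev_bw_kbs / min_throttle_rate_kbs(page_kb, max_pause_ms);
}
```

    A 4KB page per 200ms gives 20KB/s, and a 10MB/s (10240KB/s) stick
    keeps up with at most 512 tasks at that rate. Stretching the pause to
    a full second only lowers the floor to 4KB/s.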

    They can eventually be safeguarded by the global limit check
    (nr_dirty < dirty_thresh). However if someone is also writing to an
    HDD at the same time, it'll get poor HDD write performance.

    We at least want to maintain good write performance for other devices
    when one device is attacked by some "massive parallel" workload, or
    suffers from slow write bandwidth, or somehow get stalled due to some
    error condition (eg. NFS server not responding).

    For a stalled device, we need to completely block its dirtiers, too,
    before its bdi dirty pages grow all the way up to the global limit and
    leave no space for the other functional devices.

    So change the loop exit condition to

        /*
         * Always enforce global dirty limit; also enforce bdi dirty limit
         * if the normal max_pause sleeps cannot keep things under control.
         */
        if (nr_dirty < dirty_thresh &&
            (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
                break;

    which can be further simplified to

        if (task_ratelimit)
                break;

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

16 Nov, 2011

1 commit

  • There is no reason why a task in balance_dirty_pages() shouldn't be killable,
    and it helps in recovering from some error conditions (like when a filesystem
    goes into an error state and cannot accept writeback anymore but we still
    want to kill processes using it to be able to unmount it).

    There will be follow-up patches to further abort the generic_perform_write()
    and other filesystem write loops, to avoid a large write + SIGKILL combination
    exceeding the dirty limit and possibly causing a strange OOM.

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Reviewed-by: Neil Brown
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

07 Nov, 2011

3 commits

  • In balance_dirty_pages() task_ratelimit may not be initialized
    (initialization skipped by goto pause), and then used when calling the
    tracing hook.

    Fix it by moving the task_ratelimit assignment before goto pause.

    Reported-by: Witold Baryluk
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <linux/module.h>
    net: inet_timewait_sock doesnt need <linux/module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

01 Nov, 2011

1 commit


31 Oct, 2011

4 commits


11 Oct, 2011

1 commit

  • Fix powerpc compile warnings

    mm/page-writeback.c: In function 'bdi_position_ratio':
    mm/page-writeback.c:622:3: warning: comparison of distinct pointer types lacks a cast [enabled by default]
    page-writeback.c:635:4: warning: comparison of distinct pointer types lacks a cast [enabled by default]

    Also fix gcc "uninitialized var" warnings.

    Reported-by: Stephen Rothwell
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

03 Oct, 2011

6 commits

  • Keep a minimal pool of dirty pages for each bdi, so that the disk IO
    queues won't underrun. Also gently increase a small bdi_thresh to keep
    it from getting stuck at 0 for some lightly dirtied bdi.

    It's particularly useful for JBOD and small memory systems.

    It may result in (pos_ratio > 1) at the setpoint and push the dirty
    pages high. This is more or less intended because the bdi is in
    danger of IO queue underflow.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The dirty pause time shall ultimately be controlled by adjusting
    nr_dirtied_pause, since there is the relationship

    pause = pages_dirtied / task_ratelimit

    Assuming

    pages_dirtied ~= nr_dirtied_pause
    task_ratelimit ~= dirty_ratelimit

    We get

    nr_dirtied_pause ~= dirty_ratelimit * desired_pause

    Here dirty_ratelimit is preferred over task_ratelimit because it's
    more stable.

    It's also important to limit possible large transitional errors:

    - bw is changing quickly
    - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
    - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
    separate fix, but still expect non-trivial errors)

    So we end up using the above formula inside clamp_val().
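
    A sketch of the clamped formula follows. The clamp bounds used here
    (between 1/8 and 4 times the pages actually dirtied) are assumptions
    for illustration, as are the helper names; the text only states that
    the formula is applied inside clamp_val() to limit transitional
    errors.

```c
#include <assert.h>

static unsigned long clamp_ul(unsigned long v, unsigned long lo,
			      unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * nr_dirtied_pause ~= dirty_ratelimit * desired_pause, kept within a
 * window around the number of pages the task really dirtied last time.
 * dirty_ratelimit is in pages per second, desired_pause_ms in ms.
 */
static unsigned long next_nr_dirtied_pause(unsigned long dirty_ratelimit,
					   unsigned long desired_pause_ms,
					   unsigned long pages_dirtied)
{
	unsigned long want = dirty_ratelimit * desired_pause_ms / 1000;

	/* Hypothetical bounds: do not move too far in a single step. */
	return clamp_ul(want, pages_dirtied / 8, pages_dirtied * 4);
}
```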

    The best test case for this code is to run 100 "dd bs=4M" tasks on
    btrfs and check its pause time distribution.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Apply two policies to scale down the max pause time for

    1) small number of concurrent dirtiers
    2) small memory system (compared to storage bandwidth)

    MAX_PAUSE=200ms may only be suitable for high end servers with lots of
    concurrent dirtiers, where the large pause time can reduce overheads
    considerably.

    Otherwise, a smaller pause time is desirable whenever possible, so as to
    get good responsiveness and a smooth user experience. It's actually
    required for good disk utilization in the case when all the dirty pages
    can be synced to disk within MAX_PAUSE=200ms.
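
    The second policy can be sketched as below. The function name and
    the choice of one eighth of the drain time are assumptions for
    illustration; the text only requires that the pause stays well below
    the time the disk needs to write out the bdi's dirty pages, so the IO
    queue does not run empty while dirtiers sleep. The first policy (few
    concurrent dirtiers) is not modelled.

```c
#include <assert.h>

#define MAX_PAUSE_MS 200ULL

/*
 * Hypothetical sketch: bdi_dirty is in pages, write_bw in pages per
 * second. The result is a pause limit in milliseconds, at least 1.
 */
static unsigned long bdi_max_pause_ms(unsigned long bdi_dirty,
				      unsigned long write_bw)
{
	unsigned long long drain_ms, t;

	assert(write_bw > 0);
	/* Time the disk needs to write back all dirty pages of this bdi. */
	drain_ms = (unsigned long long)bdi_dirty * 1000 / write_bw;

	t = drain_ms / 8;	/* assumed safety fraction */
	if (t < 1)
		t = 1;
	if (t > MAX_PAUSE_MS)
		t = MAX_PAUSE_MS;
	return (unsigned long)t;
}
```

    A bdi holding one second's worth of dirty pages gets a 125ms limit,
    while a bdi with plenty of dirty pages keeps the full 200ms.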

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • As proposed by Chris, Dave and Jan, don't start foreground writeback IO
    inside balance_dirty_pages(). Instead, simply let it idle sleep for some
    time to throttle the dirtying task. In the meantime, kick off the
    per-bdi flusher thread to do background writeback IO.

    RATIONALE
    =========

    - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

    If every thread doing writes and being throttled starts foreground
    writeback, it leads to N IO submitters from at least N different
    inodes at the same time, and we end up with N different sets of IO being
    issued with potentially zero locality to each other, resulting in
    much lower elevator sort/merge efficiency and hence we seek the disk
    all over the place to service the different sets of IO.
    OTOH, if there is only one submission thread, it doesn't jump between
    inodes in the same way when congestion clears - it keeps writing to
    the same inode, resulting in large related chunks of sequential IOs
    being issued to the disk. This is more efficient than the above
    foreground writeback because the elevator works better and the disk
    seeks less.

    - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

    With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
    from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

    * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

    * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

    * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

    * On a simple test of 100 dd's, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

    - IO size too small for fast arrays and too large for slow USB sticks

    The write_chunk used by current balance_dirty_pages() cannot be
    directly set to some large value (eg. 128MB) for better IO efficiency,
    because it could lead to user-perceivable stalls of more than 1 second.
    Even the current 4MB write size may be too large for slow USB sticks.
    The fact that balance_dirty_pages() starts IO itself couples the
    IO size to the wait time, which makes it hard to pick a suitable IO
    size while keeping the wait time under control.

    Now it's possible to increase the writeback chunk size in proportion to
    the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB RAM,
    the larger writeback size dramatically reduces the seek count to 1/10
    (far beyond my expectation) and improves the write throughput by 24%.

    - long block time in balance_dirty_pages() hurts desktop responsiveness

    Many of us may have had the experience: it often takes a couple of
    seconds or even longer to stop a heavy writing dd/cp/tar command with
    Ctrl-C or "kill -9".

    - IO pipeline broken by bumpy write() progress

    There is a broad class of "loop {read(buf); write(buf);}" applications
    whose read() pipeline will be under-utilized or even come to a stop if
    the write()s have long latencies _or_ don't progress at a constant rate.
    The current threshold based throttling inherently transfers the large
    low level IO completion fluctuations to bumpy application write()s,
    and deteriorates further with an increasing number of dirtiers and/or
    bdi's.

    For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
    the rsync progress is very bumpy in the legacy kernel, and throughput
    is improved by 67% by this patchset. (With the larger write chunk size
    added, the speedup is 93%.)

    The new rate based throttling can support 1000+ dd's with excellent
    smoothness, low latency and low overheads.

    For the above reasons, it's much better to do IO-less and low latency
    pauses in balance_dirty_pages().

    Jan Kara, Dave Chinner and I explored a scheme that lets
    balance_dirty_pages() wait for enough writeback IO completions to
    safeguard the dirty limit. However, it was found to have two problems:

    - in large NUMA systems, the per-cpu counters may have big accounting
    errors, leading to long throttle wait times and jitter.

    - NFS may kill a large number of unstable pages with a single COMMIT.
    Because the NFS server serves COMMIT with expensive fsync() IOs, it is
    desirable to delay and reduce the number of COMMITs. So it's unlikely
    that such bursty IO completions can be optimized away, nor the
    resulting large (and tiny) stall times in IO completion based throttling.

    So here is a pause time oriented approach, which tries to control the
    pause time in each balance_dirty_pages() invocation, by controlling
    the number of pages dirtied before calling balance_dirty_pages(), for
    smooth and efficient dirty throttling:

    - avoid useless (eg. zero pause time) balance_dirty_pages() calls
    - avoid too small pause time (less than 4ms, which burns CPU power)
    - avoid too large pause time (more than 200ms, which hurts responsiveness)
    - avoid big fluctuations of pause times

    It can control pause times at will. The default policy (in a followup
    patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
    in 1000-dd case.
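
    A rough sketch of the pause-time idea above, written in Python rather
    than kernel C. The function names, the millisecond units and the
    pages-per-second rate are illustrative assumptions; only the 4ms and
    200ms bounds come from the text.

```python
MIN_PAUSE_MS = 4      # shorter sleeps mostly burn CPU (bound taken from the text)
MAX_PAUSE_MS = 200    # longer sleeps hurt responsiveness (bound taken from the text)


def pause_ms(pages_dirtied, task_ratelimit):
    """Sleep time that holds a task to task_ratelimit (pages/s), capped at 200ms."""
    pause = 1000.0 * pages_dirtied / task_ratelimit
    return min(pause, MAX_PAUSE_MS)


def pages_before_next_call(task_ratelimit, target_pause_ms):
    """Pages a task may dirty before calling in again, so the pause hits the target."""
    target = max(MIN_PAUSE_MS, min(target_pause_ms, MAX_PAUSE_MS))
    return max(1, int(task_ratelimit * target / 1000))
```

    With a hypothetical task rate of 25600 pages/s, a 10ms target gives a
    call interval of 256 pages, and dirtying those 256 pages yields a 10ms
    pause. Choosing the interval from the target pause is what lets the
    pause time be controlled "at will".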

    BEHAVIOR CHANGE
    ===============

    (1) dirty threshold

    Users will notice that applications get throttled once they cross the
    global (background + dirty)/2=15% threshold, and are then balanced
    around 17.5%. Before the patch, the behavior was to just throttle at
    20% of dirtyable memory in the 1-dd case.

    Since the task will be soft throttled earlier than before, end users
    may perceive a performance "slow down" if their application happens to
    dirty more than 15% of dirtyable memory.
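
    The percentages above can be reproduced with a little arithmetic. This
    sketch assumes the background and dirty ratios are 10% and 20% (which
    matches the 15% and 20% figures quoted), and assumes the balance point
    is the midpoint between the throttling start and the dirty limit (which
    matches the 17.5% figure). The function name is hypothetical.

```python
def throttle_points(background_ratio=10, dirty_ratio=20):
    """Return (start of throttling, balance point) as % of dirtyable memory."""
    # Throttling starts halfway between the background and dirty thresholds.
    freerun = (background_ratio + dirty_ratio) / 2
    # Assumed: dirty pages are balanced halfway between that and the limit.
    setpoint = (freerun + dirty_ratio) / 2
    return freerun, setpoint
```

    With the assumed defaults this returns (15.0, 17.5).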

    (2) smoothness/responsiveness

    Users will notice a more responsive system during heavy writeback.
    "killall dd" will take effect instantly.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Add two fields to task_struct.

    1) account dirtied pages in the individual tasks, for accuracy
    2) per-task balance_dirty_pages() call intervals, for flexibility

    The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
    scale near-sqrt to the safety gap between dirty pages and threshold.

    The main problem of per-task nr_dirtied is that, if 1k+ tasks start
    dirtying pages at exactly the same time, each task will be assigned a
    large initial nr_dirtied_pause, so the dirty threshold will be exceeded
    long before each task reaches its nr_dirtied_pause and hence calls
    balance_dirty_pages().

    The solution is to watch for the number of pages dirtied on each CPU in
    between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
    (3% dirty threshold), force call balance_dirty_pages() for a chance to
    set bdi->dirty_exceeded. In normal situations, this safeguarding
    condition is not expected to trigger at all.
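
    A minimal sketch of the two triggers described above, in Python rather
    than kernel C. The function name, its parameters and the example
    numbers are illustrative assumptions, not the kernel interface.

```python
def must_call_balance_dirty_pages(task_nr_dirtied, nr_dirtied_pause,
                                  cpu_nr_dirtied, ratelimit_pages):
    """Decide whether a dirtying task should enter balance_dirty_pages() now."""
    # Normal path: the task has used up its own per-task call interval.
    if task_nr_dirtied >= nr_dirtied_pause:
        return True
    # Safeguard: too many pages were dirtied on this CPU since the last call,
    # for example by many tasks that are each still below their own interval.
    return cpu_nr_dirtied >= ratelimit_pages
```

    For example, 1000 tasks that have each dirtied 100 pages against a
    per-task interval of 512 would never trigger the first check, but a
    CPU that has seen 100000 of those pages exceeds a hypothetical
    ratelimit_pages of 1500 and forces the call.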

    On the sqrt in dirty_poll_interval():

    It will serve as an initial guess when dirty pages are still in the
    freerun area.

    When dirty pages are floating inside the dirty control scope [freerun,
    limit], a followup patch will use some refined dirty poll interval to
    get the desired pause time.

    thresh-dirty (MB)    sqrt (pages)
                    1              16
                    2              22
                    4              32
                    8              45
                   16              64
                   32              90
                   64             128
                  128             181
                  256             256
                  512             362
                 1024             512

    The above table means, given a 1MB (or 1GB) gap and the dd tasks polling
    balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't
    be exceeded as long as there are fewer than 16 (or 512) concurrent dd's.

    So sqrt naturally leads to lower overhead and a larger number of safe
    concurrent tasks for large memory servers, which have large
    (thresh-freerun) gaps.
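
    The table can be reproduced with a few lines of Python, assuming 4KB
    pages and rounding the square root down. The names below are
    illustrative and this is not the kernel's dirty_poll_interval() code.

```python
import math

PAGE_SIZE = 4096  # assumed page size in bytes


def poll_interval_pages(gap_mb):
    """Floor of sqrt(gap in pages) for a thresh-dirty gap given in MB."""
    gap_pages = gap_mb * 1024 * 1024 // PAGE_SIZE
    return math.isqrt(gap_pages)


def worst_case_overshoot_pages(n_tasks, interval_pages):
    """Pages that n tasks can dirty in total before any of them polls again."""
    return n_tasks * interval_pages


TABLE = {mb: poll_interval_pages(mb) for mb in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)}
```

    With a 1MB gap (256 pages) the interval is 16 pages, so 15 concurrent
    tasks can dirty at most 15 * 16 = 240 pages between polls, which still
    fits inside the gap.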

    peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case

    CC: Peter Zijlstra
    Reviewed-by: Andrea Righi
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • There are some imperfections in balanced_dirty_ratelimit.

    1) large fluctuations

    The dirty_rate used for computing balanced_dirty_ratelimit is averaged
    over only the past 200ms (very small compared to the 3s estimation
    period for write_bw), which makes the distribution of
    balanced_dirty_ratelimit rather dispersed.

    It's pretty hard to average out the singular points by increasing the
    estimation period. Considering that the averaging technique would
    introduce very undesirable time lags, I gave up on it entirely. (btw,
    the 3s write_bw averaging time lag is much more acceptable because its
    impact is one-way and therefore won't lead to oscillations.)

    The more practical way is filtering -- most singular
    balanced_dirty_ratelimit points can be filtered out by remembering some
    prev_balanced_rate and prev_prev_balanced_rate. However the more
    reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

    2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
    match could become unbalanced, which may lead to large systematic
    errors in balanced_dirty_ratelimit. The truncates, due to their possibly
    bumpy nature, can hardly be compensated smoothly. So let's face it. When
    some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
    high, dirty pages will go higher than the setpoint. task_ratelimit will
    in turn become lower than dirty_ratelimit. So if we consider both
    balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
    only when they are on the same side of dirty_ratelimit, the systematic
    errors in balanced_dirty_ratelimit won't be able to bring
    dirty_ratelimit far away.

    The balanced_dirty_ratelimit estimation may also be inaccurate near
    @limit or @freerun, but that is less of an issue.

    3) since we ultimately want to

    - keep the fluctuations of task ratelimit as small as possible
    - keep the dirty pages around the setpoint for as long as possible

    the update policy used for (2) also serves the above goals nicely:
    if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
    and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
    there is no point in bringing up dirty_ratelimit in a hurry only to
    hurt both of the above goals.

    So, we make use of task_ratelimit to limit the update of dirty_ratelimit
    in two ways:

    1) avoid changing dirty rate when it's against the position control target
    (the adjusted rate will slow down the progress of dirty pages going
    back to setpoint).

    2) limit the step size. task_ratelimit changes its value step by step,
    leaving a consistent trace compared to the randomly jumping
    balanced_dirty_ratelimit. task_ratelimit also has nicely small errors
    in the stable state and typically larger errors when there are big
    errors in rate. So it's a pretty good limiting factor for the step
    size of dirty_ratelimit.

    Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
    task_ratelimit is merely used as a limiting factor.
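
    The update policy can be sketched as follows. This is a simplified
    illustration in Python, not the kernel code: the function name is
    hypothetical, and dividing the step by 8 is an assumed damping choice
    made only to show that each update moves part of the way.

```python
def update_dirty_ratelimit(dirty_ratelimit, balanced_dirty_ratelimit, task_ratelimit):
    """Move dirty_ratelimit only when both reference rates agree on the direction."""
    low = min(balanced_dirty_ratelimit, task_ratelimit)
    high = max(balanced_dirty_ratelimit, task_ratelimit)

    if dirty_ratelimit < low:
        # Both rates are above: step up, bounded by the nearer of the two.
        step = low - dirty_ratelimit
        return dirty_ratelimit + (step + 7) // 8
    if dirty_ratelimit > high:
        # Both rates are below: step down, bounded by the nearer of the two.
        step = dirty_ratelimit - high
        return dirty_ratelimit - (step + 7) // 8
    # The two rates disagree (they straddle dirty_ratelimit): leave it alone,
    # since moving would work against the position control target.
    return dirty_ratelimit
```

    For instance, with dirty_ratelimit at 100, a balanced rate of 200 and a
    task rate of 150, the rate moves up to 107. With a task rate of 80
    instead, the two references disagree and the rate stays at 100.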

    Signed-off-by: Wu Fengguang

    Wu Fengguang