22 Feb, 2014

1 commit

  • This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
    Chinner has reported that the commit may cause some inodes to be left
    out from sync(2). This is because we can call redirty_tail() for some
    inode (which sets i_dirtied_when to the current time) after sync(2)
    has started, or similarly requeue_inode() can set i_dirtied_when to
    the current time if writeback had to skip some pages. The real
    problem is in the functions clobbering i_dirtied_when, but fixing
    that isn't trivial, so a revert is the safer choice for now.

    CC: stable@vger.kernel.org # >= 3.13
    Signed-off-by: Jan Kara

    Jan Kara
     

13 Nov, 2013

1 commit

  • When there are processes heavily creating small files while sync(2) is
    running, it can easily happen that quite a few new files are created
    between the WB_SYNC_NONE and WB_SYNC_ALL passes of sync(2). That can
    happen especially if there are several busy filesystems (remember that
    sync traverses filesystems sequentially and, in the WB_SYNC_ALL phase,
    waits on one fs before starting on another fs). Because the WB_SYNC_ALL
    pass is slow (e.g. it causes a transaction commit and cache flush for
    each inode in ext3), the resulting sync(2) times are rather long.

    The following script reproduces the problem:

    function run_writers
    {
            for (( i = 0; i < 10; i++ )); do
                    mkdir $1/dir$i
                    for (( j = 0; j < 40000; j++ )); do
                            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                    done &
            done
    }

    for dir in "$@"; do
            run_writers $dir
    done

    sleep 40
    time sync

    Fix the problem by disregarding inodes dirtied after sync(2) was called
    in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
    a timestamp of when sync started, which is used when setting up work
    for the flusher threads.
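
    As a rough sketch of the approach (simplified from the description
    above; the exact plumbing in the patch may differ), sync(2) records
    one jiffies stamp up front and every WB_SYNC_ALL work item carries it:

        void sync_inodes_sb(struct super_block *sb, unsigned long older_than_this)
        {
                struct wb_writeback_work work = {
                        .sb              = sb,
                        .sync_mode       = WB_SYNC_ALL,
                        .nr_pages        = LONG_MAX,
                        /* inodes dirtied after this stamp are not queued: */
                        .older_than_this = older_than_this,
                };
                ...
        }

        /* sync(2) takes the stamp once, before writeback starts: */
        unsigned long start = jiffies;
        ...
        sync_inodes_sb(sb, start);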

    To give some numbers: when the above script is run on two ext4
    filesystems on a simple SATA drive, the average sync time from 10 runs
    is 267.549 seconds with a standard deviation of 104.799426. With the
    patched kernel, the average sync time from 10 runs is 2.995 seconds
    with a standard deviation of 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

12 Sep, 2013

1 commit

  • It's not used globally and could be static.

    Signed-off-by: Wanpeng Li
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Fengguang Wu
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Jiri Kosina
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

03 Jul, 2013

1 commit

  • When sync(2) does its WB_SYNC_ALL writeback, it issues data IO and
    then immediately waits for IO completion. This is done in the
    context of the flusher thread, and hence completely ties up the
    flusher thread for the backing device until all the dirty inodes
    have been synced. On filesystems that are dirtying inodes constantly
    and quickly, this means the flusher thread can be tied up for
    minutes per sync call and hence badly affect system-level write IO
    performance, as the page cache cannot be cleaned quickly.

    We already have a wait loop for IO completion for sync(2), so cut
    this out of the flusher thread and delegate it to wait_sb_inodes().
    Hence we can do rapid IO submission, and then wait for it all to
    complete.
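
    In code terms, a sketch of the split (the .for_sync flag is the one
    this series adds; surrounding code abbreviated): the flusher stops
    waiting for data IO on behalf of sync(2):

        /* in __writeback_single_inode(): wait for data IO before writing
         * metadata -- except for sync(2), which has its own external
         * completion path in wait_sb_inodes() and ->sync_fs */
        if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
                int err = filemap_fdatawait(mapping);
                if (ret == 0)
                        ret = err;
        }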

    Effect of sync on fsmark before the patch:

    FSUse%    Count   Size   Files/sec   App Overhead
    .....
         0   640000   4096     35154.6        1026984
         0   720000   4096     36740.3        1023844
         0   800000   4096     36184.6         916599
         0   880000   4096      1282.7        1054367
         0   960000   4096      3951.3         918773
         0  1040000   4096     40646.2         996448
         0  1120000   4096     43610.1         895647
         0  1200000   4096     40333.1         921048

    And a single sync pass took:

    real 0m52.407s
    user 0m0.000s
    sys 0m0.090s

    After the patch, there is no impact on the fsmark results, and each
    individual sync(2) operation, run concurrently with the same fsmark
    workload, takes roughly 7s:

    real 0m6.930s
    user 0m0.000s
    sys 0m0.039s

    IOWs, sync is 7-8x faster on a busy filesystem and does not have an
    adverse impact on ongoing async data write operations.

    Signed-off-by: Dave Chinner
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

12 Jan, 2013

1 commit

  • writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing
    down_read() with down_read_trylock() because

    - If ->s_umount is write-locked, then the sb is not idle. That is,
    writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.

    - writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it
    wants to start writeback, which may lead to a deadlock when doing
    umount. In order to avoid that problem, ext4 and btrfs implemented
    their own writeback functions instead of
    writeback_inodes_sb(_nr)_if_idle(), but that introduced redundant
    code; it is better to implement a new
    writeback_inodes_sb(_nr)_if_idle().

    The names of these two functions are cumbersome, so rename them to
    try_to_writeback_inodes_sb(_nr).
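
    The resulting helper is small; a sketch (close to the merged code,
    though details such as the _nr variant are elided here):

        int try_to_writeback_inodes_sb(struct super_block *sb,
                                       enum wb_reason reason)
        {
                /* sb not idle (or being unmounted): don't wait, just skip */
                if (!down_read_trylock(&sb->s_umount))
                        return 0;

                writeback_inodes_sb(sb, reason);
                up_read(&sb->s_umount);
                return 1;
        }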

    This idea came from Christoph Hellwig.
    Some code is from the patch of Kamal Mostafa.

    Reviewed-by: Jan Kara
    Signed-off-by: Miao Xie
    Signed-off-by: Fengguang Wu

    Miao Xie
     

01 Aug, 2012

1 commit

  • Since per-BDI flusher threads were introduced in 2.6.32, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backwards compatibility, print warning information and return 2 to
    notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

06 May, 2012

1 commit

  • Doing iput() from the flusher thread (writeback_sb_inodes()) can create
    problems, because iput() can do a lot of work - for example truncate
    the inode if it's the last iput on an unlinked file. Some filesystems
    depend on the flusher thread making progress (e.g. because they need to
    flush delayed allocated blocks to reduce allocation uncertainty), so
    the flusher thread doing truncates creates interesting dependencies and
    possibilities for deadlocks.

    We get rid of iput() in the flusher thread by using the fact that the
    I_SYNC inode flag effectively pins the inode in memory. So if we take
    care to either hold i_lock or have I_SYNC set, we can get away without
    taking an inode reference in writeback_sb_inodes().
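
    The pattern is roughly the following (a sketch of the idea, not the
    literal diff): the writeback loop marks the inode I_SYNC under i_lock,
    writes it back without holding a reference, and only then lets
    eviction proceed:

        spin_lock(&inode->i_lock);
        inode->i_state |= I_SYNC;       /* eviction now has to wait on us */
        spin_unlock(&inode->i_lock);

        __writeback_single_inode(inode, wb, &wbc);  /* no igrab()/iput() */

        spin_lock(&inode->i_lock);
        inode->i_state &= ~I_SYNC;
        smp_mb();       /* waiters must see I_SYNC cleared before waking */
        wake_up_bit(&inode->i_state, __I_SYNC);
        spin_unlock(&inode->i_lock);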

    As a side effect of these changes, we also fix a possible
    use-after-free in wb_writeback(), because the inode_wait_for_writeback()
    call could try to reacquire i_lock on an inode that was already freed.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

11 Jan, 2012

3 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.
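
    In allocator terms the check looks roughly like this (a sketch of the
    merged code; the helpers zone_dirty_ok()/zone_dirty_limit() are from
    this series, surrounding code abbreviated):

        /* in get_page_from_freelist(), while iterating the zonelist: */
        if ((alloc_flags & ALLOC_WMARK_LOW) &&
            (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
                goto this_zone_full;    /* try the next zone instead */

        bool zone_dirty_ok(struct zone *zone)
        {
                unsigned long limit = zone_dirty_limit(zone);

                return zone_page_state(zone, NR_FILE_DIRTY) +
                       zone_page_state(zone, NR_UNSTABLE_NFS) +
                       zone_page_state(zone, NR_WRITEBACK) <= limit;
        }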

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                   seconds                    nr_vmscan_write
                  (stddev)              min |     median |        max
    xfs
    vanilla:   549.747(  3.492)       0.000 |      0.000 |      0.000
    patched:   550.996(  3.802)       0.000 |      0.000 |      0.000

    fuse-ntfs
    vanilla:  1183.094( 53.178)   54349.000 |  59341.000 |  65163.000
    patched:   558.049( 17.914)       0.000 |      0.000 |     43.000

    btrfs
    vanilla:   573.679( 14.015)  156657.000 | 460178.000 | 606926.000
    patched:   563.365( 11.368)       0.000 |      0.000 |   1362.000

    ext4
    vanilla:   561.197( 15.782)       0.000 |2725438.000 |4143837.000
    patched:   568.806( 17.496)       0.000 |      0.000 |      0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The tracing ring-buffer used this function briefly, but not anymore.
    Make it local to the writeback code again.

    Also, move the function so that no forward declaration needs to be
    reintroduced.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Dec, 2011

2 commits

  • De-account the accumulated dirty counters on page redirty.

    Page redirties (very common in ext4) will introduce a mismatch between
    counters (a) and (b)

    a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
    b) NR_WRITTEN, BDI_WRITTEN

    This will introduce systematic errors in balanced_rate and result in
    dirty page position errors (i.e. the dirty pages are no longer balanced
    around the global/bdi setpoints).
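
    The fix is a helper along these lines (a sketch close to the merged
    account_page_redirty(); filesystems call it when re-dirtying a page
    they could not write out):

        void account_page_redirty(struct page *page)
        {
                struct address_space *mapping = page->mapping;

                if (mapping && mapping_cap_account_dirty(mapping)) {
                        /* undo the dirtied-side accounting only */
                        current->nr_dirtied--;
                        dec_zone_page_state(page, NR_DIRTIED);
                        dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
                }
        }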

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It's a years-long problem that a large number of short-lived dirtiers
    (e.g. gcc instances in a fast kernel build) may starve long-running
    dirtiers (e.g. dd) as well as push the dirty pages up to the global
    hard limit.

    The solution is to charge the pages dirtied by exited gcc instances to
    the other, randomly chosen dirtying tasks. This sounds imperfect, but
    should behave well enough in practice, seeing as throttled tasks aren't
    actually running, so the tasks that are running are more likely to pick
    up the charge and get throttled, thereby promoting an even spread.
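
    Mechanically (a sketch of the idea; dirty_throttle_leaks is the
    per-cpu counter this patch introduces): an exiting task leaks its
    uncommitted tsk->nr_dirtied into dirty_throttle_leaks, and running
    tasks pick the leaked pages up in the ratelimit path:

        /* in balance_dirty_pages_ratelimited(), before throttling: */
        int *p;

        p = &__get_cpu_var(dirty_throttle_leaks);
        if (*p > 0 && current->nr_dirtied < ratelimit) {
                nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
                *p -= nr_pages_dirtied;
                current->nr_dirtied += nr_pages_dirtied;
        }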

    Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Randy Dunlap
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

31 Oct, 2011

1 commit

  • This creates a new 'reason' field in the wb_writeback_work
    structure, which unambiguously identifies who initiated the
    writeback activity. A 'wb_reason' enumeration has been added to
    writeback.h to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the symbolic
    'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had a
    'reason' parameter added, so callers can specify why writeback is
    being started.
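
    The enumeration added to writeback.h looks roughly like this (a
    sketch; the exact enumerator list may differ slightly from what was
    merged):

        enum wb_reason {
                WB_REASON_BACKGROUND,
                WB_REASON_TRY_TO_FREE_PAGES,
                WB_REASON_SYNC,
                WB_REASON_PERIODIC,
                WB_REASON_LAPTOP_TIMER,
                WB_REASON_FREE_MORE_MEM,
                WB_REASON_FS_FREE_SPACE,
                WB_REASON_FORGOTTEN,

                WB_REASON_MAX,
        };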

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

19 Aug, 2011

1 commit

  • Revert the pass-good area introduced in ffd1f609ab10 ("writeback:
    introduce max-pause and pass-good dirty limits") and make the max-pause
    area smaller and safe.

    This fixes ~30% performance regression in the ext3 data=writeback
    fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
    12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.

    Using the deadline scheduler also shows a regression, though not as
    big as with CFQ, so this suggests we have some write starvation.

    The test logs show that

    - the disks are sometimes under utilized

    - global dirty pages sometimes rush high into the pass-good area for
    several hundred seconds, while in the meantime some bdi dirty pages
    drop to a very low value (bdi_dirty << bdi_thresh). Then suddenly the
    global dirty pages drop under the global dirty threshold and bdi_dirty
    rushes very high (for example, 2 times higher than bdi_thresh). During
    this time balance_dirty_pages() is not called at all.

    So the problems are

    1) The random writes progress so slowly that they break the assumption
    of the max-pause logic that "8 pages per 200ms is typically more than
    enough to curb heavy dirtiers".

    2) The max-pause logic ignored task_bdi_thresh and thus opened the
    possibility for some bdi's to accumulate too many dirty pages, leading
    to (bdi_dirty >> bdi_thresh) for them and then (bdi_thresh >>
    bdi_dirty) for others.

    3) The higher max-pause/pass-good thresholds somehow lead to the bad
    swings of dirty pages.

    The fix is to allow the task to dirty slightly over task_bdi_thresh,
    but never to exceed bdi_dirty and/or the global dirty_thresh.

    Tests show that it fixed the JBOD regression completely (both behavior
    and performance), while still being able to cut down large pause times
    in balance_dirty_pages() for single-disk cases.

    Reported-by: Li Shaohua
    Tested-by: Li Shaohua
    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

10 Jul, 2011

5 commits

  • Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
    concern of not holding I_SYNC for too long. (At least, that was the
    comment previously.) This doesn't make sense now because the only
    time we wait for I_SYNC is if we are calling sync or fsync, and in
    that case we need to write out all of the data anyway. Previously
    there may have been other code paths that waited on I_SYNC, but not
    any more. -- Theodore Ts'o

    So remove the MAX_WRITEBACK_PAGES constraint. The writeback chunk size
    will adapt to be as large as the storage device can write within 500ms.

    XFS is observed to do IO completions in batches, and the batch size is
    equal to the write chunk size. To avoid dirty pages suddenly dropping
    out of balance_dirty_pages()'s dirty control scope and creating large
    fluctuations, the chunk size is also limited to half the control scope.

    The balance_dirty_pages() control scope is

    [(background_thresh + dirty_thresh) / 2, dirty_thresh]

    which is by default [15%, 20%] of global dirty pages, whose range size
    is dirty_thresh / DIRTY_FULL_SCOPE.

    The adaptive write chunk size will be rounded to the nearest 4MB
    boundary.
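
    Put together, the chunk size computation looks roughly like this (a
    sketch modeled on the merged writeback_chunk_size(); constant names
    are from this series):

        static long writeback_chunk_size(struct backing_dev_info *bdi,
                                         struct wb_writeback_work *work)
        {
                long pages;

                if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
                        return LONG_MAX;        /* sync: write everything */

                /* half a second worth of IO, capped to half the control
                 * scope, rounded to whole 4MB (MIN_WRITEBACK_PAGES) units */
                pages = min(bdi->avg_write_bandwidth / 2,
                            global_dirty_limit / DIRTY_SCOPE);
                pages = min(pages, work->nr_pages);
                return round_down(pages + MIN_WRITEBACK_PAGES,
                                  MIN_WRITEBACK_PAGES);
        }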

    http://bugzilla.kernel.org/show_bug.cgi?id=13930

    CC: Theodore Ts'o
    CC: Dave Chinner
    CC: Chris Mason
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The max-pause limit helps to keep the sleep time inside
    balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
    a per-task rate limit of 8 pages/200ms = 160KB/s when the dirty limit
    is exceeded, which is normally enough to stop dirtiers from continuing
    to push the dirty pages high, unless there is a sufficiently large
    number of slow dirtiers (e.g. 500 tasks doing 160KB/s will still sum up
    to 80MB/s, exceeding the write bandwidth of a slow disk and hence
    accumulating more and more dirty pages).

    The pass-good limit helps to let the good bdi's go in the presence of
    a blocked bdi (i.e. an NFS server not responding) or a slow USB disk
    which for some reason builds up a large number of initial dirty pages
    that refuse to go away anytime soon.

    For example, given two bdi's A and B and the initial state

    bdi_thresh_A = dirty_thresh / 2
    bdi_thresh_B = dirty_thresh / 2
    bdi_dirty_A = dirty_thresh / 2
    bdi_dirty_B = dirty_thresh / 2

    Then A gets blocked; after a dozen seconds

    bdi_thresh_A = 0
    bdi_thresh_B = dirty_thresh
    bdi_dirty_A = dirty_thresh / 2
    bdi_dirty_B = dirty_thresh / 2

    The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty
    pages will be effectively throttled by the condition (nr_dirty <
    dirty_thresh). This has two problems:
    (1) we lose the protections for light dirtiers
    (2) balance_dirty_pages() effectively becomes IO-less because the
    (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
    for IO, but balance_dirty_pages() loses an important way to break
    out of the loop, which leads to more spread-out throttle delays.

    DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem
    is, DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the
    above example, while this patch uses the more conservative value 8 so
    as not to surprise people with more dirty pages than expected.

    The max-pause limit won't noticeably impact the speed at which dirty
    pages are knocked down when there is a sudden drop of the global/bdi
    dirty thresholds, because the heavy dirtiers will be throttled below
    160KB/s, which is slow enough. It does help to avoid long dirty
    throttle delays, and especially will make light dirtiers more
    responsive.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The start of a heavyweight application (e.g. KVM) may instantly knock
    down determine_dirtyable_memory() if swap is not enabled or is full.
    global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
    dirty thresholds that are _much_ lower than the global/bdi dirty pages.

    balance_dirty_pages() will then heavily throttle all dirtiers, including
    the light ones, until the dirty pages drop below the new dirty
    thresholds. During this _deep_ dirty-exceeded state, the system may
    appear rather unresponsive to users.

    About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
    threshold to heavy dirtiers than light ones, and the dirty pages will
    be throttled around the heavy dirtiers' dirty threshold and reasonably
    below the light dirtiers' dirty threshold. In this state, only the heavy
    dirtiers will be throttled and the dirty pages are carefully controlled
    to not exceed the light dirtiers' dirty threshold. However if the
    threshold itself suddenly drops below the number of dirty pages, the
    light dirtiers will get heavily throttled.

    So introduce global_dirty_limit for tracking the global dirty threshold
    with policies

    - follow downwards slowly
    - follow up in one shot

    global_dirty_limit can effectively mask out the impact of a sudden drop
    in dirtyable memory. It will be used in the next patch for two new
    types of dirty limits. Note that the new dirty limits are not going to
    avoid throttling the light dirtiers, but could limit their sleep time
    to 200ms.
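
    The two policies translate into a small update function (a sketch
    close to the merged update_dirty_limit()):

        static void update_dirty_limit(unsigned long thresh,
                                       unsigned long dirty)
        {
                unsigned long limit = global_dirty_limit;

                /* follow up in one shot */
                if (limit < thresh) {
                        limit = thresh;
                        goto update;
                }

                /* follow down slowly; track the higher of thresh and
                 * dirty, since thresh may drop below the dirty count */
                thresh = max(thresh, dirty);
                if (limit > thresh) {
                        limit -= (limit - thresh) >> 5;
                        goto update;
                }
                return;
        update:
                global_dirty_limit = limit;
        }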

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The estimate starts at 100MB/s and adapts to the real bandwidth within
    seconds.

    It tries to update the bandwidth only when the disk is fully utilized.
    Any inactive period of more than one second will be skipped.

    The estimated bandwidth reflects how fast the device can write out when
    _fully utilized_, and won't drop to 0 when it goes idle. The value
    remains constant while the disk is idle. At busy write time, if not
    considering fluctuations, it will also remain high unless knocked down
    by possible concurrent reads that compete with the async writes for
    disk time and bandwidth.

    The estimation is not done purely in the flusher because there is no
    guarantee for write_cache_pages() to return timely to update bandwidth.

    The bdi->avg_write_bandwidth smoothing is very effective for filtering
    out sudden spikes, however may be a little biased in long term.

    The overheads are low because the bdi bandwidth update only occurs at
    200ms intervals.

    The 200ms update interval is suitable because it's not possible to get
    the real bandwidth at any given instant, due to large fluctuations.

    The NFS commits can be as large as several seconds' worth of data. One
    XFS completion may be as large as half a second's worth of data, if we
    are going to increase the write chunk to half a second's worth of data.
    In ext4, fluctuations with a time period of around 5 seconds are
    observed. And there is another pattern of irregular periods of up to 20
    seconds in SSD tests.

    That's why we not only do the estimation at 200ms intervals, but also
    average it over a period of 3 seconds, and then go further and do
    another level of smoothing in avg_write_bandwidth.
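
    The core of the estimation, as a sketch (modeled on the merged
    bdi_update_write_bandwidth(); the elapsed > period corner case and
    the stamp updates are left out):

        static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
                                               unsigned long elapsed,
                                               unsigned long written)
        {
                const unsigned long period = roundup_pow_of_two(3 * HZ);
                unsigned long avg = bdi->avg_write_bandwidth;
                unsigned long old = bdi->write_bandwidth;
                u64 bw;

                /*
                 * bw = written * HZ / elapsed, averaged into the previous
                 * estimate over a ~3 second period:
                 *
                 *      bw * elapsed + write_bandwidth * (period - elapsed)
                 * bw = ---------------------------------------------------
                 *                           period
                 */
                bw = written - bdi->written_stamp;
                bw *= HZ;
                bw += (u64)bdi->write_bandwidth * (period - elapsed);
                bw >>= ilog2(period);

                /* one more level of smoothing, filtering sudden spikes */
                if (avg > old && old >= (unsigned long)bw)
                        avg -= (avg - old) >> 3;
                if (avg < old && old <= (unsigned long)bw)
                        avg += (old - avg) >> 3;

                bdi->write_bandwidth = bw;
                bdi->avg_write_bandwidth = avg;
        }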

    CC: Li Shaohua
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
    and initialize the struct writeback_control there.

    struct writeback_control is basically designed to control writeback of
    a single file, but we keep abusing it for writing multiple files in
    writeback_sb_inodes() and its callers.

    This immediately cleans things up - e.g. suddenly wbc.nr_to_write vs
    work->nr_pages starts to make sense, and instead of saving and
    restoring pages_skipped in writeback_sb_inodes() it can always start
    with a clean zero value.

    It also makes a neat IO pattern change: large dirty files are now
    written in the full 4MB writeback chunk size, rather than whatever
    quota remained in wbc->nr_to_write.

    Acked-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

08 Jun, 2011

6 commits

  • Remove two unused struct writeback_control fields:

    .encountered_congestion (completely unused)
    .nonblocking (never set, only checked/shown in XFS, NFS and btrfs)

    The .for_background check in nfs_write_inode() is also removed btw,
    as .for_background implies WB_SYNC_NONE.

    Reviewed-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • When wbc.more_io was first introduced, it indicated whether there was
    at least one superblock whose s_more_io contained more IO work. Now,
    with per-bdi writeback, it can be replaced with a simple b_more_io
    test.

    Acked-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • This removes writeback_control.wb_start and does more straightforward
    sync livelock prevention by setting .older_than_this to prevent extra
    inodes from being enqueued in the first place.

    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    [hughd@google.com: fix recursive lock when bdi_lock_two() is called
    with new the same as old]
    [akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     
  • The flusher works on dirty inodes in batches, and may quit prematurely
    if the batch of inodes happens to be metadata-only dirtied: in this
    case wbc->nr_to_write won't be decreased at all, which stands for "no
    pages written" but is also misinterpreted as "no progress".

    So introduce writeback_control.inodes_written to count the inodes that
    get cleaned from the VFS point of view. A non-zero value means there
    is some progress on writeback, in which case more writeback can be
    tried.
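
    In the main loop this becomes an extra retry condition, roughly (a
    sketch of the wb_writeback() loop after this change):

        /* did we write something? try for more */
        if (wbc.nr_to_write < write_chunk)
                continue;
        /* wrote no pages, but cleaned some inodes? keep going */
        if (wbc.inodes_written)
                continue;
        /* no more inodes for IO, bail */
        if (list_empty(&wb->b_more_io))
                break;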

    Acked-by: Jan Kara
    Acked-by: Mel Gorman
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
    WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
    do livelock prevention for it, too.

    Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
    using page tagging") is a partial fix in that it only fixed the
    WB_SYNC_ALL phase livelock.

    Although ext4 is tested to no longer livelock with commit f446daaea9,
    that may be due to some "redirty_tail() after pages_skipped" effect,
    which is by no means a guarantee for _all_ the file systems.

    Note that writeback_inodes_sb() is not called only by sync(); its
    other callers are treated the same because they also need livelock
    prevention.

    Impact: It changes the order in which pages/inodes are synced to disk.
    Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
    until finished with the current inode.
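
    The mechanism is the tag-based livelock avoidance in
    write_cache_pages(); a sketch (.tagged_writepages is the flag this
    patch adds):

        /* in write_cache_pages(): snapshot the dirty pages up front by
         * re-tagging them TOWRITE, then only write pages bearing that
         * tag -- pages dirtied afterwards cannot livelock this pass */
        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                tag = PAGECACHE_TAG_TOWRITE;
        else
                tag = PAGECACHE_TAG_DIRTY;

        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                tag_pages_for_writeback(mapping, index, end);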

    Acked-by: Jan Kara
    CC: Dave Chinner
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

25 Mar, 2011

2 commits

  • All that remains of the inode_lock is protecting the inode hash list
    manipulation and traversals. Rename the inode_lock to inode_hash_lock
    to reflect its actual function.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

31 Oct, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
    Btrfs: deal with errors from updating the tree log
    Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
    Btrfs: make SNAP_DESTROY async
    Btrfs: add SNAP_CREATE_ASYNC ioctl
    Btrfs: add START_SYNC, WAIT_SYNC ioctls
    Btrfs: async transaction commit
    Btrfs: fix deadlock in btrfs_commit_transaction
    Btrfs: fix lockdep warning on clone ioctl
    Btrfs: fix clone ioctl where range is adjacent to extent
    Btrfs: fix delalloc checks in clone ioctl
    Btrfs: drop unused variable in block_alloc_rsv
    Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
    Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
    Btrfs: Use ERR_CAST helpers
    Btrfs: use memdup_user helpers
    Btrfs: fix raid code for removing missing drives
    Btrfs: Switch the extent buffer rbtree into a radix tree
    Btrfs: restructure try_release_extent_buffer()
    Btrfs: use the flusher threads for delalloc throttling
    Btrfs: tune the chunk allocation to 5% of the FS as metadata
    ...

    Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
    remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
    useless and removed in commit 5e8067adfdba: "rcu head remove init")

    Linus Torvalds
     

29 Oct, 2010

1 commit

  • When btrfs is running low on metadata space, it needs to force delayed
    allocation pages to disk. It currently does this with a suboptimal walk
    of a private list of inodes with delayed allocation, and it would be
    much better if we used the generic flusher threads.

    writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
    thread to start IO on all the dirty pages in the FS before it returns.
    This adds variants of writeback_inodes_sb* that allow the caller to
    control how many pages get sent down.
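
    Shape of the new API (a sketch; signatures as of this 2010 series,
    before the later wb_reason argument was added):

        /* write back up to nr pages of dirty data on sb: */
        void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr);

        /* same, but only if no writeback is already in progress;
         * returns whether writeback was started: */
        int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
                                           unsigned long nr);

        /* btrfs delalloc throttling can then do, e.g.: */
        writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages);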

    Signed-off-by: Chris Mason

    Chris Mason