11 Jan, 2012

3 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • The maximum number of dirty pages that can exist in the system at any
    time is determined by the number of pages considered dirtyable and a
    user-configured percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.
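
    In code, the per-zone check boils down to something like the following
    sketch of the zone_dirty_limit()/zone_dirty_ok() helpers (simplified;
    the actual patch also boosts the limit for PF_LESS_THROTTLE and
    real-time tasks):

        static unsigned long zone_dirty_limit(struct zone *zone)
        {
                unsigned long zone_memory = zone_dirtyable_memory(zone);

                /* the zone gets the same share of its dirtyable memory
                 * that the system as a whole is allowed */
                if (vm_dirty_bytes)
                        return DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
                                zone_memory / global_dirtyable_memory();
                return vm_dirty_ratio * zone_memory / 100;
        }

        bool zone_dirty_ok(struct zone *zone)
        {
                unsigned long limit = zone_dirty_limit(zone);

                return zone_page_state(zone, NR_FILE_DIRTY) +
                       zone_page_state(zone, NR_UNSTABLE_NFS) +
                       zone_page_state(zone, NR_WRITEBACK) <= limit;
        }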

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.
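
    In the page allocator, the placement decision then amounts to one extra
    test in the zonelist scan, roughly (simplified from the
    get_page_from_freelist() change in this patch):

        for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
                /* writers spill over to the next zone once this one
                 * holds its fair share of dirty pages */
                if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
                        continue;
                /* ... watermark checks and the actual allocation ... */
        }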

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the number of pages that
    "spill over" is itself limited by the lower zones' dirty constraints,
    and is thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

              seconds (stddev)     nr_vmscan_write: min |      median |         max

    xfs
    vanilla:  549.747 ( 3.492)                    0.000 |       0.000 |       0.000
    patched:  550.996 ( 3.802)                    0.000 |       0.000 |       0.000

    fuse-ntfs
    vanilla: 1183.094 (53.178)                54349.000 |   59341.000 |   65163.000
    patched:  558.049 (17.914)                    0.000 |       0.000 |      43.000

    btrfs
    vanilla:  573.679 (14.015)               156657.000 |  460178.000 |  606926.000
    patched:  563.365 (11.368)                    0.000 |       0.000 |    1362.000

    ext4
    vanilla:  561.197 (15.782)                    0.000 | 2725438.000 | 4143837.000
    patched:  568.806 (17.496)                    0.000 |       0.000 |       0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The tracing ring-buffer used this function briefly, but not anymore.
    Make it local to the writeback code again.

    Also, move the function so that no forward declaration needs to be
    reintroduced.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Dec, 2011

2 commits

  • De-account the accumulative dirty counters on page redirty.

    Page redirties (very common in ext4) will introduce a mismatch between
    counters (a) and (b):

    a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
    b) NR_WRITTEN, BDI_WRITTEN

    This will introduce systematic errors in balanced_rate and result in
    dirty page position errors (ie. the dirty pages are no longer balanced
    around the global/bdi setpoints).
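
    The fix is to undo the dirtied-page accounting when a page is
    redirtied; a sketch close to the account_page_redirty() helper this
    patch introduces (called from redirty_page_for_writepage()):

        void account_page_redirty(struct page *page)
        {
                struct address_space *mapping = page->mapping;

                if (mapping && mapping_cap_account_dirty(mapping)) {
                        current->nr_dirtied--;
                        dec_zone_page_state(page, NR_DIRTIED);
                        dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
                }
        }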

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It's a long-standing problem that a large number of short-lived dirtiers
    (eg. gcc instances in a fast kernel build) may starve long-running
    dirtiers (eg. dd) as well as push the dirty pages to the global hard
    limit.

    The solution is to charge the pages dirtied by the exited gcc to the
    other, still-running dirtying tasks. It's not perfect, but it should
    behave well enough in practice: throttled tasks aren't actually
    running, so the tasks that are running are more likely to pick up the
    charge and get throttled, thereby promoting an equal spread.
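
    Mechanically, exiting tasks dump their outstanding dirty count into a
    per-cpu counter, and running dirtiers pick it up on their next
    ratelimit check; a simplified sketch:

        /* on task exit */
        if (tsk->nr_dirtied)
                __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);

        /* in balance_dirty_pages_ratelimited(): pick up the leaks */
        p = &__get_cpu_var(dirty_throttle_leaks);
        if (*p > 0 && current->nr_dirtied < ratelimit) {
                nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
                *p -= nr_pages_dirtied;
                current->nr_dirtied += nr_pages_dirtied;
        }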

    Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Randy Dunlap
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

31 Oct, 2011

1 commit

  • This creates a new 'reason' field in the wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a 'reason' parameter added to them, so callers can specify
    why writeback is being started.
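
    The enumeration looks roughly like this (reason names as merged around
    this time; treat the exact list as approximate):

        enum wb_reason {
                WB_REASON_BACKGROUND,
                WB_REASON_TRY_TO_FREE_PAGES,
                WB_REASON_SYNC,
                WB_REASON_PERIODIC,
                WB_REASON_LAPTOP_TIMER,
                WB_REASON_FREE_MORE_MEM,
                WB_REASON_FS_FREE_SPACE,
                WB_REASON_FORKER_THREAD,

                WB_REASON_MAX,
        };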

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

19 Aug, 2011

1 commit

  • Revert the pass-good area introduced in ffd1f609ab10 ("writeback:
    introduce max-pause and pass-good dirty limits") and make the max-pause
    area smaller and safe.

    This fixes ~30% performance regression in the ext3 data=writeback
    fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
    12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.

    The deadline scheduler also shows a regression, though not as big as
    with CFQ, which suggests we have some write starvation.

    The test logs show that

    - the disks are sometimes under-utilized

    - global dirty pages sometimes rush high into the pass-good area for
    several hundred seconds, while in the meantime some bdi dirty pages
    drop to very low values (bdi_dirty << bdi_thresh); then suddenly the
    global dirty pages drop under the global dirty threshold and bdi_dirty
    rushes very high (for example, 2 times higher than bdi_thresh), during
    which time balance_dirty_pages() is not called at all.

    So the problems are

    1) The random writes progress so slowly that they break the assumption
    of the max-pause logic that "8 pages per 200ms is typically more than
    enough to curb heavy dirtiers".

    2) The max-pause logic ignores task_bdi_thresh and thus opens the
    possibility for some bdi's to over-dirty pages, leading to
    (bdi_dirty >> bdi_thresh) for them and (bdi_thresh >> bdi_dirty) for
    others.

    3) The higher max-pause/pass-good thresholds somehow lead to the bad
    swing of dirty pages.

    The fix is to allow the task to dirty slightly over task_bdi_thresh,
    but never to exceed bdi_dirty and/or the global dirty_thresh.

    Tests show that it fixed the JBOD regression completely (both behavior
    and performance), while still being able to cut down large pause times
    in balance_dirty_pages() for single-disk cases.

    Reported-by: Li Shaohua
    Tested-by: Li Shaohua
    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

10 Jul, 2011

5 commits

  • Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
    concern of not holding I_SYNC for too long. (At least, that was the
    comment previously.) This doesn't make sense now because the only
    time we wait for I_SYNC is if we are calling sync or fsync, and in
    that case we need to write out all of the data anyway. Previously
    there may have been other code paths that waited on I_SYNC, but not
    any more. -- Theodore Ts'o

    So remove the MAX_WRITEBACK_PAGES constraint. The write chunk will then
    grow to as much as the storage device can write within 500ms.

    XFS is observed to do IO completions in batches, with the batch size
    equal to the write chunk size. To keep dirty pages from suddenly
    dropping out of balance_dirty_pages()'s dirty control scope and
    creating large fluctuations, the chunk size is also limited to half
    the control scope.

    The balance_dirty_pages() control scope is

    [(background_thresh + dirty_thresh) / 2, dirty_thresh]

    which is by default [15%, 20%] of global dirty pages, whose range size
    is dirty_thresh / DIRTY_FULL_SCOPE.

    The adaptive write chunk size will be rounded to the nearest 4MB
    boundary.
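
    Putting the two limits together, the chunk size computation ends up
    roughly as follows (a sketch of writeback_chunk_size();
    MIN_WRITEBACK_PAGES is 4MB worth of pages):

        static long writeback_chunk_size(struct backing_dev_info *bdi,
                                         struct wb_writeback_work *work)
        {
                long pages;

                if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
                        pages = LONG_MAX;       /* sync: write everything */
                else {
                        /* 500ms worth of bandwidth, capped to half the
                         * balance_dirty_pages() control scope */
                        pages = min(bdi->avg_write_bandwidth / 2,
                                    global_dirty_limit / DIRTY_SCOPE);
                        pages = min(pages, work->nr_pages);
                        pages = round_down(pages + MIN_WRITEBACK_PAGES,
                                           MIN_WRITEBACK_PAGES);
                }
                return pages;
        }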

    http://bugzilla.kernel.org/show_bug.cgi?id=13930

    CC: Theodore Ts'o
    CC: Dave Chinner
    CC: Chris Mason
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The max-pause limit helps to keep the sleep time inside
    balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
    a per-task rate limit of 8 pages/200ms = 160KB/s when dirty exceeded,
    which is normally enough to stop dirtiers from pushing the dirty pages
    higher, unless there is a sufficiently large number of slow dirtiers
    (eg. 500 tasks doing 160KB/s each still sum up to 80MB/s, exceeding the
    write bandwidth of a slow disk and hence accumulating more and more
    dirty pages).

    The pass-good limit helps to let go of the good bdi's in the presence
    of a blocked bdi (eg. an NFS server not responding) or a slow USB disk
    that for some reason has built up a large number of dirty pages which
    refuse to go away anytime soon.

    For example, given two bdi's A and B and the initial state

    bdi_thresh_A = dirty_thresh / 2
    bdi_thresh_B = dirty_thresh / 2
    bdi_dirty_A = dirty_thresh / 2
    bdi_dirty_B = dirty_thresh / 2

    Then A gets blocked; after a dozen seconds

    bdi_thresh_A = 0
    bdi_thresh_B = dirty_thresh
    bdi_dirty_A = dirty_thresh / 2
    bdi_dirty_B = dirty_thresh / 2

    The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
    will be effectively throttled by condition (nr_dirty < dirty_thresh).
    This has two problems:
    (1) we lose the protections for light dirtiers
    (2) balance_dirty_pages() effectively becomes IO-less because the
    (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
    for IO, but balance_dirty_pages() loses an important way to break
    out of the loop, which leads to more spread-out throttle delays.

    DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is
    that DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the
    above example, while this patch uses the more conservative value 8, so
    as not to surprise people with more dirty pages than expected.

    The max-pause limit won't noticeably impact the speed at which dirty
    pages are knocked down when there is a sudden drop of the global/bdi
    dirty thresholds, because the heavy dirtiers will be throttled below
    160KB/s, which is slow enough. It does help to avoid long dirty
    throttle delays, and especially makes light dirtiers more responsive.
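
    In balance_dirty_pages(), the two limits amounted to two early-break
    tests, roughly (a simplified sketch; the pass-good one is what the
    19 Aug entry above reverts):

        /* max-pause area: dirty exceeded, but no need to sleep for
         * more than 200ms per throttle round */
        if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MAXPAUSE_AREA &&
            time_after(jiffies, start_time + MAX_PAUSE))
                break;

        /* pass-good area: let a good bdi go while global dirty pages
         * are still reasonably bounded */
        if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_PASSGOOD_AREA &&
            bdi_dirty < bdi_thresh)
                break;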

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The start of a heavyweight application (ie. KVM) may instantly knock
    down determine_dirtyable_memory() if swap is not enabled or is full.
    global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
    dirty thresholds that are _much_ lower than the global/bdi dirty pages.

    balance_dirty_pages() will then heavily throttle all dirtiers including
    the light ones, until the dirty pages drop below the new dirty thresholds.
    During this _deep_ dirty-exceeded state, the system may appear rather
    unresponsive to the users.

    About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
    threshold to heavy dirtiers than light ones, and the dirty pages will
    be throttled around the heavy dirtiers' dirty threshold and reasonably
    below the light dirtiers' dirty threshold. In this state, only the heavy
    dirtiers will be throttled and the dirty pages are carefully controlled
    to not exceed the light dirtiers' dirty threshold. However if the
    threshold itself suddenly drops below the number of dirty pages, the
    light dirtiers will get heavily throttled.

    So introduce global_dirty_limit for tracking the global dirty threshold
    with policies

    - follow downwards slowly
    - follow up in one shot

    global_dirty_limit can effectively mask out the impact of a sudden drop
    of dirtyable memory. It will be used in the next patch for two new
    types of dirty limits. Note that the new dirty limits are not going to
    avoid throttling the light dirtiers, but could limit their sleep time
    to 200ms.
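
    A sketch of the tracking logic, close to the update_dirty_limit()
    helper this patch adds:

        static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
        {
                unsigned long limit = global_dirty_limit;

                /* follow up in one shot */
                if (limit < thresh) {
                        limit = thresh;
                        goto update;
                }

                /* follow downwards slowly, and never below the number
                 * of dirty pages itself */
                thresh = max(thresh, dirty);
                if (limit > thresh) {
                        limit -= (limit - thresh) >> 5;
                        goto update;
                }
                return;
        update:
                global_dirty_limit = limit;
        }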

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The estimated value starts from 100MB/s and adapts to the real
    bandwidth within seconds.

    It tries to update the bandwidth only when the disk is fully utilized.
    Any inactive period of more than one second will be skipped.

    The estimated bandwidth reflects how fast the device can write out
    when _fully utilized_, and won't drop to 0 when it goes idle: the
    value remains constant at disk idle time. At busy write time,
    fluctuations aside, it will also remain high unless knocked down by
    concurrent reads that compete with the async writes for disk time and
    bandwidth.

    The estimation is not done purely in the flusher because there is no
    guarantee for write_cache_pages() to return timely to update bandwidth.

    The bdi->avg_write_bandwidth smoothing is very effective for filtering
    out sudden spikes, however may be a little biased in long term.

    The overheads are low because the bdi bandwidth update only occurs at
    200ms intervals.

    The 200ms update interval is suitable because it's not possible to get
    the real instantaneous bandwidth anyway, given the large fluctuations.

    NFS commits can be as large as several seconds worth of data. One XFS
    completion may be as large as half a second worth of data, if we are
    going to increase the write chunk to half a second worth of data. In
    ext4, fluctuations with a time period of around 5 seconds are
    observed. And there is another pattern of irregular periods of up to
    20 seconds on SSD tests.

    That's why we not only do the estimation at 200ms intervals, but also
    average the estimates over a period of 3 seconds, and then go further
    and do another level of smoothing in avg_write_bandwidth.
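
    The estimate itself is a period-weighted moving average; a sketch
    close to the bdi_update_write_bandwidth() in this patch:

        static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
                                               unsigned long elapsed,
                                               unsigned long written)
        {
                const unsigned long period = roundup_pow_of_two(3 * HZ);
                unsigned long avg = bdi->avg_write_bandwidth;
                unsigned long old = bdi->write_bandwidth;
                u64 bw;

                /* bw = written * HZ / elapsed, blended with the old value
                 * in proportion to how much of the 3s period has elapsed */
                bw = written - bdi->written_stamp;
                bw *= HZ;
                if (unlikely(elapsed > period)) {
                        do_div(bw, elapsed);
                        avg = bw;
                        goto out;
                }
                bw += (u64)bdi->write_bandwidth * (period - elapsed);
                bw >>= ilog2(period);

                /* second level of smoothing: avg only chases bw when the
                 * latest sample agrees with the direction of the move */
                if (avg > old && old >= (unsigned long)bw)
                        avg -= (avg - old) >> 3;
                if (avg < old && old <= (unsigned long)bw)
                        avg += (old - avg) >> 3;
        out:
                bdi->write_bandwidth = bw;
                bdi->avg_write_bandwidth = avg;
        }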

    CC: Li Shaohua
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
    and initialize the struct writeback_control there.

    struct writeback_control is basically designed to control writeback of
    a single file, but we keep abusing it for writing multiple files in
    writeback_sb_inodes() and its callers.

    This immediately cleans things up: e.g. suddenly wbc.nr_to_write vs
    work->nr_pages starts to make sense, and instead of saving and
    restoring pages_skipped in writeback_sb_inodes() it can always start
    with a clean zero value.

    It also makes a neat IO pattern change: large dirty files are now
    written in the full 4MB writeback chunk size, rather than whatever
    quota remained in wbc->nr_to_write.

    Acked-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

08 Jun, 2011

6 commits

  • Remove two unused struct writeback_control fields:

    .encountered_congestion (completely unused)
    .nonblocking (never set, checked/showed in XFS,NFS/btrfs)

    The .for_background check in nfs_write_inode() is also removed btw,
    as .for_background implies WB_SYNC_NONE.

    Reviewed-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • When wbc.more_io was first introduced, it indicated whether there was
    at least one superblock whose s_more_io contained more IO work. Now,
    with per-bdi writeback, it can be replaced with a simple b_more_io
    test.

    Acked-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • This removes writeback_control.wb_start and does more straightforward
    sync livelock prevention by setting .older_than_this to prevent extra
    inodes from being enqueued in the first place.

    Acked-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
    akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     
  • The flusher works on dirty inodes in batches, and may quit prematurely
    if the batch of inodes happens to be metadata-only dirtied: in this
    case wbc->nr_to_write won't be decreased at all, which stands for "no
    pages written" but is also misinterpreted as "no progress".

    So introduce writeback_control.inodes_written to count the inodes that
    get cleaned from the VFS point of view. A non-zero value means
    writeback made some progress, in which case more writeback can be
    tried.
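
    A sketch of how the counter is maintained and consumed (simplified):

        /* writeback_single_inode(): the inode came back clean */
        if (!(inode->i_state & I_DIRTY))
                wbc->inodes_written++;

        /* wb_writeback(): either kind of progress justifies another round */
        if (wbc.nr_to_write < write_chunk)
                continue;               /* wrote some pages */
        if (wbc.inodes_written)
                continue;               /* cleaned some inodes */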

    Acked-by: Jan Kara
    Acked-by: Mel Gorman
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
    WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
    do livelock prevention for it, too.
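
    With the new flag, write_cache_pages() applies the same tag-then-write
    scheme to both stages, roughly:

        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                tag_pages_for_writeback(mapping, index, end);
        /* ... */
        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                tag = PAGECACHE_TAG_TOWRITE;    /* only pages tagged up front */
        else
                tag = PAGECACHE_TAG_DIRTY;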

    Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
    using page tagging") is a partial fix in that it only fixed the
    WB_SYNC_ALL phase livelock.

    Although ext4 is tested to no longer livelock with commit f446daaea9,
    that may be due to some "redirty_tail() after pages_skipped" effect,
    which is by no means a guarantee for _all_ the file systems.

    Note that writeback_inodes_sb() is called not only by sync(); the
    other callers are treated the same because they also need livelock
    prevention.

    Impact: It changes the order in which pages/inodes are synced to disk.
    Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
    until finished with the current inode.

    Acked-by: Jan Kara
    CC: Dave Chinner
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

25 Mar, 2011

2 commits

  • All that remains of the inode_lock is protecting the inode hash list
    manipulation and traversals. Rename the inode_lock to
    inode_hash_lock to reflect it's actual function.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

31 Oct, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
    Btrfs: deal with errors from updating the tree log
    Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
    Btrfs: make SNAP_DESTROY async
    Btrfs: add SNAP_CREATE_ASYNC ioctl
    Btrfs: add START_SYNC, WAIT_SYNC ioctls
    Btrfs: async transaction commit
    Btrfs: fix deadlock in btrfs_commit_transaction
    Btrfs: fix lockdep warning on clone ioctl
    Btrfs: fix clone ioctl where range is adjacent to extent
    Btrfs: fix delalloc checks in clone ioctl
    Btrfs: drop unused variable in block_alloc_rsv
    Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
    Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
    Btrfs: Use ERR_CAST helpers
    Btrfs: use memdup_user helpers
    Btrfs: fix raid code for removing missing drives
    Btrfs: Switch the extent buffer rbtree into a radix tree
    Btrfs: restructure try_release_extent_buffer()
    Btrfs: use the flusher threads for delalloc throttling
    Btrfs: tune the chunk allocation to 5% of the FS as metadata
    ...

    Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
    remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
    useless and removed in commit 5e8067adfdba: "rcu head remove init")

    Linus Torvalds
     

29 Oct, 2010

1 commit

  • When btrfs is running low on metadata space, it needs to force delayed
    allocation pages to disk. It currently does this with a suboptimal walk
    of a private list of inodes with delayed allocation, and it would be
    much better if we used the generic flusher threads.

    writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
    thread to start IO on all the dirty pages in the FS before it returns.
    This adds variants of writeback_inodes_sb* that allow the caller to
    control how many pages get sent down.
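
    The new entry point is essentially writeback_inodes_sb() with a
    caller-supplied page budget; a sketch close to the merged version:

        void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr)
        {
                DECLARE_COMPLETION_ONSTACK(done);
                struct wb_writeback_work work = {
                        .sb             = sb,
                        .sync_mode      = WB_SYNC_NONE,
                        .done           = &done,
                        .nr_pages       = nr,   /* caller-controlled budget */
                };

                WARN_ON(!rwsem_is_locked(&sb->s_umount));
                bdi_queue_work(sb->s_bdi, &work);
                wait_for_completion(&done);
        }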

    Signed-off-by: Chris Mason

    Chris Mason
     

28 Oct, 2010

2 commits

  • Conflicts:
    fs/ext4/inode.c
    fs/ext4/mballoc.c
    include/trace/events/ext4.h

    Theodore Ts'o
     
  • This is analogous to Jan Kara's commit,
    f446daaea9d4a420d16c606f755f3689dcb2d0ce
    mm: implement writeback livelock avoidance using page tagging

    but since we forked write_cache_pages, we need to reimplement
    it there (and in ext4_da_writepages, since range_cyclic handling
    was moved there).

    If you start a large buffered IO to a file, and then set
    fsync after it, you'll find that fsync does not complete
    until the other IO stops.

    If you continue re-dirtying the file (say, putting dd
    with conv=notrunc in a loop), when fsync finally completes
    (after all IO is done), it reports via tracing that
    it has written many more pages than the file contains;
    in other words it has synced and re-synced pages in
    the file multiple times.

    This then leads to problems with our writeback_index
    update, since it advances it by pages written, and
    essentially sets writeback_index off the end of the
    file...

    With the following patch, we only sync as much as was
    dirty at the time of the sync.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     

27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • Declare 'bdi_pending_list' and 'tag_pages_for_writeback()' to remove
    following sparse warnings:

    mm/backing-dev.c:46:1: warning: symbol 'bdi_pending_list' was not declared. Should it be static?
    mm/page-writeback.c:825:6: warning: symbol 'tag_pages_for_writeback' was not declared. Should it be static?

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

26 Oct, 2010

1 commit

  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.
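
    The reclaim-time check then looks something like this (a simplified
    sketch of the prune_icache() change; list and field names shifted
    slightly over this patch series):

        /* recently referenced inodes get one more trip around the LRU
         * instead of being moved at every iget()/iput() */
        if (inode->i_state & I_REFERENCED) {
                inode->i_state &= ~I_REFERENCED;
                list_move(&inode->i_list, &inode_unused);
                continue;
        }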

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under their own lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     

12 Aug, 2010

1 commit

  • Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
    that the latter can be avoided when under the global dirty background
    threshold (which is the normal state for most systems).
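
    The split lets balance_dirty_pages() bail out before computing any
    bdi threshold; roughly:

        global_dirty_limits(&background_thresh, &dirty_thresh);

        /* throttle only when background writeback cannot catch up,
         * i.e. skip the bdi calculation in the common case */
        if (nr_reclaimable + nr_writeback <=
                        (background_thresh + dirty_thresh) / 2)
                break;

        bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);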

    Signed-off-by: Wu Fengguang
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

06 Jul, 2010

2 commits

  • The case where we have a superblock doesn't require a loop here, as we
    scan over all inodes in writeback_sb_inodes. Split it out into a
    separate helper to make the code simpler. This also allows us to get
    rid of the sb member in struct writeback_control, which was rather out
    of place there.

    Also update the comments in writeback_sb_inodes that explain the handling
    of inodes from wrong superblocks.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This was just an odd wrapper around writeback_inodes_wb. Removing it
    also allows us to get rid of the bdi member of struct
    writeback_control, which was rather out of place there.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jun, 2010

1 commit

  • If a filesystem writes more than one page in ->writepage, write_cache_pages
    fails to notice this and continues to attempt writeback when wbc->nr_to_write
    has gone negative - this trace was captured from XFS:

    wbc_writeback_start: towrt=1024
    wbc_writepage: towrt=1024
    wbc_writepage: towrt=0
    wbc_writepage: towrt=-1
    wbc_writepage: towrt=-5
    wbc_writepage: towrt=-21
    wbc_writepage: towrt=-85

    This has adverse effects on filesystem writeback behaviour. write_cache_pages()
    needs to terminate after a certain number of pages are written, not after a
    certain number of calls to ->writepage are made. This is a regression
    introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add
    no_nrwrite_index_update writeback control flag"), but cannot be reverted
    directly due to subsequent bug fixes that have gone in on top of it.
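
    The fix makes write_cache_pages() track the budget on
    wbc->nr_to_write itself, which the filesystem also decrements for the
    extra pages it writes, and stop once it is spent; roughly:

        /* stop writing back only for non-integrity sync; WB_SYNC_ALL
         * must keep going until everything tagged at entry is written */
        if (--wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE) {
                done = 1;
                break;
        }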

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

22 May, 2010

1 commit

  • When CONFIG_BLOCK isn't enabled:

    mm/page-writeback.c: In function 'laptop_mode_timer_fn':
    mm/page-writeback.c:708: error: dereferencing pointer to incomplete type
    mm/page-writeback.c:709: error: dereferencing pointer to incomplete type

    Fix this by essentially eliminating the laptop sync handlers when
    CONFIG_BLOCK isn't set, as most are only used from the block layer code.
    The exception is laptop_sync_completion(), which is used from
    sys_sync(); make that an empty stub in that case.
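
    The header then reads roughly:

        #ifdef CONFIG_BLOCK
        void laptop_io_completion(struct backing_dev_info *info);
        void laptop_sync_completion(void);
        void laptop_mode_timer_fn(unsigned long data);
        #else
        static inline void laptop_sync_completion(void) { }
        #endif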

    Reported-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 May, 2010

1 commit

  • When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
    writeback to kick off writeback of pending dirty inodes, then follow
    that up with a WB_SYNC_ALL to wait for it. Since umount already holds
    the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
    writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
    since WB_SYNC_ALL writeback is a data integrity operation and thus
    a bigger hammer than simple WB_SYNC_NONE. For barrier-aware file
    systems it's a lot slower.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 Apr, 2010

1 commit

  • One of the features of laptop-mode is that it forces a writeout of dirty
    pages if something else triggers a physical read or write from a device.
    The current implementation flushes pages on all devices, rather than only
    the one that triggered the flush. This patch alters the behaviour so that
    only the recently accessed block device is flushed, preventing other
    disks from being spun up for no terribly good reason.
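
    The key change is that I/O completion now re-arms a timer on the bdi
    that actually saw the I/O, instead of kicking every device; a sketch:

        void laptop_io_completion(struct backing_dev_info *info)
        {
                mod_timer(&info->laptop_mode_wb_timer, jiffies + laptop_mode);
        }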

    Signed-off-by: Matthew Garrett
    Signed-off-by: Jens Axboe

    Matthew Garrett
     

12 Mar, 2010

1 commit

  • Do not pin/unpin the superblock for every inode in
    writeback_inodes_wb(); pin it once for the whole group of inodes that
    belong to the same superblock, and call the writeback_sb_inodes()
    handler for them.

    Signed-off-by: Edward Shishkin
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Edward Shishkin
     

23 Dec, 2009

1 commit

  • ext4, at least, would like to start pushing on writeback if it starts
    to get close to ENOSPC when reserving worst-case blocks for delalloc
    writes. Writing out delalloc data will convert those worst-case
    predictions into usually smaller actual usage, freeing up space
    before we hit ENOSPC based on this speculation.

    Thanks to Jens for the suggestion for the helper function,
    & the naming help.

    I've made the helper return status on whether writeback was
    started even though I don't plan to use it in the ext4 patch;
    it seems like it would be potentially useful to test this
    in some cases.
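
    The helper is a thin wrapper, close to the version merged:

        int writeback_inodes_sb_if_idle(struct super_block *sb)
        {
                if (!writeback_in_progress(sb->s_bdi)) {
                        writeback_inodes_sb(sb);
                        return 1;
                } else
                        return 0;
        }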

    Signed-off-by: Eric Sandeen
    Acked-by: Jan Kara

    Eric Sandeen
     

18 Dec, 2009

1 commit

  • After I_SYNC was split from I_LOCK, the leftover is always used
    together with I_NEW and is thus superfluous.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig