04 Apr, 2014

1 commit

  • Pull tracing updates from Steven Rostedt:
    "Most of the changes were largely clean ups, and some documentation.
    But there were a few features that were added:

    Uprobes now work with event triggers and multi buffers and have
    support under ftrace and perf.

    The big feature is that the function tracer can now be used within the
    multi buffer instances. That is, you can now trace some functions in
    one buffer, others in another buffer, all functions in a third buffer
    and so on. They are basically agnostic from each other. This only
    works for the function tracer and not for the function graph tracer,
    although you can have the function graph tracer running in the top
    level buffer (or any tracer for that matter) and have different
    function tracing going on in the sub buffers"

    * tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (45 commits)
    tracing: Add BUG_ON when stack end location is over written
    tracepoint: Remove unused API functions
    Revert "tracing: Move event storage for array from macro to standalone function"
    ftrace: Constify ftrace_text_reserved
    tracepoints: API doc update to tracepoint_probe_register() return value
    tracepoints: API doc update to data argument
    ftrace: Fix compilation warning about control_ops_free
    ftrace/x86: BUG when ftrace recovery fails
    ftrace: Warn on error when modifying ftrace function
    ftrace: Remove freelist from struct dyn_ftrace
    ftrace: Do not pass data to ftrace_dyn_arch_init
    ftrace: Pass retval through return in ftrace_dyn_arch_init()
    ftrace: Inline the code from ftrace_dyn_table_alloc()
    ftrace: Cleanup of global variables ftrace_new_pgs and ftrace_update_cnt
    tracing: Evaluate len expression only once in __dynamic_array macro
    tracing: Correctly expand len expressions from __dynamic_array macro
    tracing/module: Replace include of tracepoint.h with jump_label.h in module.h
    tracing: Fix event header migrate.h to include tracepoint.h
    tracing: Fix event header writeback.h to include tracepoint.h
    tracing: Warn if a tracepoint is not set via debugfs
    ...

    Linus Torvalds
     

22 Feb, 2014

1 commit

  • This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
    Chinner has reported the commit may cause some
    inodes to be left out from sync(2). This is because we can call
    redirty_tail() for some inode (which sets i_dirtied_when to current time)
    after sync(2) has started or similarly requeue_inode() can set
    i_dirtied_when to current time if writeback had to skip some pages. The
    real problem is in the functions clobbering i_dirtied_when but fixing
    that isn't trivial so revert is a safer choice for now.

    CC: stable@vger.kernel.org # >= 3.13
    Signed-off-by: Jan Kara

    Jan Kara
     

13 Nov, 2013

1 commit

  • When there are processes heavily creating small files while sync(2) is
    running, it can easily happen that quite some new files are created
    between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
    especially if there are several busy filesystems (remember that sync
    traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
    fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
    (e.g. causes a transaction commit and cache flush for each inode in
    ext3), resulting sync(2) times are rather large.

    The following script reproduces the problem:

    function run_writers
    {
        for (( i = 0; i < 10; i++ )); do
            mkdir $1/dir$i
            for (( j = 0; j < 40000; j++ )); do
                dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
            done &
        done
    }

    for dir in "$@"; do
        run_writers $dir
    done

    sleep 40
    time sync

    Fix the problem by disregarding inodes dirtied after sync(2) was called
    in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
    a time stamp when sync has started which is used for setting up work for
    flusher threads.
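
    A minimal userspace sketch of that idea (plain C; fake_inode and
    sync_all_pass are illustrative stand-ins, not the kernel code): the
    WB_SYNC_ALL pass simply skips anything dirtied after the recorded
    start time, so late-created files cannot prolong sync(2).

        #include <stdio.h>
        #include <time.h>

        struct fake_inode { long dirtied_when; int ino; };

        /* Wait only on inodes dirtied before sync(2) was called. */
        static void sync_all_pass(struct fake_inode *inodes, int n, long sync_start)
        {
            for (int i = 0; i < n; i++) {
                if (inodes[i].dirtied_when > sync_start)
                    continue;                 /* dirtied after sync started: skip */
                printf("waiting on inode %d\n", inodes[i].ino);
            }
        }

        int main(void)
        {
            long sync_start = time(NULL);
            struct fake_inode inodes[] = {
                { sync_start - 5, 1 },        /* dirtied before sync(2) started   */
                { sync_start + 3, 2 },        /* dirtied while sync(2) is running */
            };
            sync_all_pass(inodes, 2, sync_start);   /* only waits on inode 1 */
            return 0;
        }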

    To give some numbers, when the above script is run on two ext4 filesystems
    on a simple SATA drive, the average sync time over 10 runs is 267.549
    seconds with a standard deviation of 104.799426. With the patched kernel,
    the average sync time over 10 runs is 2.995 seconds with a standard
    deviation of 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

02 Apr, 2013

1 commit

  • Writeback implements its own worker pool - each bdi can be associated
    with a worker thread which is created and destroyed dynamically. The
    worker thread for the default bdi is always present and serves as the
    "forker" thread which forks off worker threads for other bdis.

    There's no reason for writeback to implement its own worker pool when
    using an unbound workqueue instead is much simpler and more efficient.
    This patch replaces the custom worker pool implementation in writeback
    with an unbound workqueue.

    The conversion isn't too complicated, but the following points are worth
    mentioning.

    * bdi_writeback->last_active, task and wakeup_timer are removed.
    delayed_work ->dwork is added instead. Explicit timer handling is
    no longer necessary. Everything works by either queueing / modding
    / flushing / canceling the delayed_work item.

    * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
    bdi_writeback->dwork. On each execution, it processes
    bdi->work_list and reschedules itself if there are more things to
    do.

    The function also handles the low-mem condition, which used to be
    handled by the forker thread. If the function is running off a
    rescuer thread, it only writes out a limited number of pages so that
    the rescuer can serve other bdis too. This preserves the flusher
    creation failure behavior of the forker thread.

    * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
    bdi_writeback_workfn() about on-going bdi unregistration so that it
    always drains work_list even if it's running off the rescuer. Note
    that the original code was broken in this regard. Under memory
    pressure, a bdi could finish unregistration with non-empty
    work_list.

    * The default bdi is no longer special. It now is treated the same as
    any other bdi and bdi_cap_flush_forker() is removed.

    * BDI_pending is no longer used. Removed.

    * Some tracepoints become non-applicable. The following TPs are
    removed - writeback_nothread, writeback_wake_thread,
    writeback_wake_forker_thread, writeback_thread_start,
    writeback_thread_stop.

    Everything, including devices coming and going away and rescuer
    operation under simulated memory pressure, seems to work fine in my
    test setup.
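
    A hedged sketch of the pattern described above, using the stock
    workqueue API rather than the actual writeback code (example_wb_wq,
    example_wb and example_wb_workfn are made-up names): one unbound
    workqueue shared by all bdis, each carrying a delayed_work in place
    of its own thread and wakeup timer.

        #include <linux/workqueue.h>
        #include <linux/jiffies.h>
        #include <linux/errno.h>

        static struct workqueue_struct *example_wb_wq;   /* shared, unbound */

        struct example_wb {
            struct delayed_work dwork;    /* replaces ->task + ->wakeup_timer */
        };

        static struct example_wb ewb;

        static void example_wb_workfn(struct work_struct *work)
        {
            /* drain this wb's work list here, then requeue if more remains */
            queue_delayed_work(example_wb_wq, &ewb.dwork, HZ);
        }

        static int example_wb_init(void)
        {
            example_wb_wq = alloc_workqueue("example_wb",
                                            WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
            if (!example_wb_wq)
                return -ENOMEM;
            INIT_DELAYED_WORK(&ewb.dwork, example_wb_workfn);
            mod_delayed_work(example_wb_wq, &ewb.dwork, 0);  /* kick it off now */
            return 0;
        }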

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu
    Cc: Jeff Moyer

    Tejun Heo
     

14 Jan, 2013

1 commit

  • Add tracepoints for page dirtying, writeback_single_inode start, inode
    dirtying and writeback. For the latter two inode events, a pair of
    events are defined to denote start and end of the operations (the
    starting one has _start suffix and the one w/o suffix happens after
    the operation is complete). These inode ops are FS specific and can
    be non-trivial and having enclosing tracepoints is useful for external
    tracers.

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for the io tracer and userland.

    v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL
    before dereferencing.

    v3: buffer dirtying moved to a block TP.
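
    A small userspace model of the enclosing-tracepoint pattern (the
    trace_* and fs_* names below are illustrative, not the real kernel
    tracepoints): a _start event fires before the FS-specific work and
    the unsuffixed event fires after it, so an external tracer can time
    the whole operation.

        #include <stdio.h>

        static void trace_dirty_inode_start(int ino) { printf("dirty_inode_start ino=%d\n", ino); }
        static void trace_dirty_inode(int ino)       { printf("dirty_inode ino=%d\n", ino); }

        /* Stand-in for the potentially slow, FS-specific ->dirty_inode work. */
        static void fs_dirty_inode(int ino) { (void)ino; }

        int main(void)
        {
            int ino = 42;
            trace_dirty_inode_start(ino);   /* start of the operation        */
            fs_dirty_inode(ino);
            trace_dirty_inode(ino);         /* end: the event without suffix */
            return 0;
        }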

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 May, 2012

1 commit

    When writeback_single_inode() is called on an inode which has I_SYNC already
    set while doing WB_SYNC_NONE, the inode is moved to the b_more_io list. However
    this makes sense only if the caller is the flusher thread. For other callers of
    writeback_single_inode() it doesn't really make sense and may even be wrong -
    the flusher thread may be doing WB_SYNC_ALL writeback in parallel.

    So we move requeueing from writeback_single_inode() to writeback_sb_inodes().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

25 Mar, 2012

1 commit

  • Pull <linux/device.h> avoidance patches from Paul Gortmaker:
    "Nearly every subsystem has some kind of header with a proto like:

    void foo(struct device *dev);

    and yet there is no reason for most of these guys to care about the
    sub fields within the device struct. This allows us to significantly
    reduce the scope of headers including headers. For this instance, a
    reduction of about 40% is achieved by replacing the include with the
    simple fact that the device is some kind of a struct.

    Unlike the much larger module.h cleanup, this one is simply two
    commits. One to fix the implicit users, and then one
    to delete the device.h includes from the linux/include/ dir wherever
    possible."

    * tag 'device-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    device.h: audit and cleanup users in main include dir
    device.h: cleanup users outside of linux/include (C files)

    Linus Torvalds
     

16 Mar, 2012

1 commit

  • The <linux/device.h> header includes a lot of stuff, and
    it in turn gets a lot of use just for the basic "struct device"
    which appears so often.

    Clean up the users as follows:

    1) For those headers only needing "struct device" as a pointer
    in fcn args, replace the include with exactly that.

    2) For headers not really using anything from device.h, simply
    delete the include altogether.

    3) For headers relying on getting device.h implicitly before
    being included themselves, now explicitly include device.h

    4) For files in which doing #1 or #2 uncovers an implicit
    dependency on some other header, fix by explicitly adding
    the required header(s).

    Any C files that were implicitly relying on device.h to be
    present have already been dealt with in advance.

    Total removals from #1 and #2: 51. Total additions coming
    from #3: 9. Total other implicit dependencies from #4: 7.

    As of 3.3-rc1, there were 110, so a net removal of 42 gives
    about a 38% reduction in device.h presence in include/*
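
    As an illustration of change #1 above (foo.h and foo_probe() are
    made-up names, not from the patch), a header that only passes a
    device around by pointer needs nothing more than a forward
    declaration:

        /* foo.h -- before: #include <linux/device.h> */

        struct device;                       /* opaque forward declaration */

        int foo_probe(struct device *dev);   /* prototype still compiles   */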

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

06 Feb, 2012

1 commit

  • When an SD card is hot removed without umount, del_gendisk() will call
    bdi_unregister() without destroying/freeing it. This leaves the bdi in
    a state where bdi->dev = NULL, bdi->wb.task = NULL, and bdi->bdi_list has
    been removed.

    When sync(2) gets the bdi before bdi_unregister() and calls
    bdi_queue_work() after the unregister, trace_writeback_queue will be
    dereferencing the NULL bdi->dev. Fix it with a simple test for NULL.
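
    A userspace model of the fix (bdi, dev and bdi_dev_name here are
    simplified stand-ins, not the kernel structures): guard the name
    lookup with a NULL test so a bdi caught mid-unregister cannot be
    dereferenced by the trace event.

        #include <stdio.h>

        struct dev { const char *name; };
        struct bdi { struct dev *dev; };

        static const char *bdi_dev_name(const struct bdi *bdi)
        {
            return bdi->dev ? bdi->dev->name : "(unknown)";   /* the NULL test */
        }

        int main(void)
        {
            struct bdi unregistered = { .dev = NULL };   /* hot-removed SD card */
            printf("writeback_queue: bdi %s\n", bdi_dev_name(&unregistered));
            return 0;
        }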

    LKML-reference: http://lkml.org/lkml/2012/1/18/346
    Cc: stable@kernel.org
    Reported-by: Rabin Vincent
    Tested-by: Namjae Jeon
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

18 Dec, 2011

2 commits

  • Compensate the task's think time when computing the final pause time,
    so that ->dirty_ratelimit can be executed accurately.

    think time := time spent outside of balance_dirty_pages()

    In the rare case that the task slept longer than the 200ms period time
    (resulting in a negative pause time), the sleep time will be compensated in
    the following periods, too, if it's less than 1 second.

    Accumulated errors are carefully avoided as long as the max pause area
    is not hit.

    Pseudo code:

        period = pages_dirtied / task_ratelimit;
        think = jiffies - dirty_paused_when;
        pause = period - think;

    1) normal case: period > think

        pause = period - think
        dirty_paused_when = jiffies + pause
        nr_dirtied = 0

                        period time
          |===============================>|
              think time      pause time
          |===============>|==============>|
    ------|----------------|---------------|------------------------
    dirty_paused_when   jiffies

    2) no pause case: period <= think

        don't pause; reduce future pause time by:
        dirty_paused_when += period
        nr_dirtied = 0

                        period time
          |===============================>|
                                think time
          |===================================================>|
    ------|--------------------------------+-------------------|----
    dirty_paused_when                       jiffies
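
    A userspace model of the pseudo code above (all values in
    milliseconds for readability; compute_pause is a made-up helper,
    not the kernel function):

        #include <stdio.h>

        static long compute_pause(long pages_dirtied, long task_ratelimit_pps,
                                  long now, long dirty_paused_when)
        {
            long period = pages_dirtied * 1000 / task_ratelimit_pps; /* ms the quota is worth */
            long think  = now - dirty_paused_when;                   /* ms spent outside      */
            return period - think;            /* <= 0 means "don't pause this round" */
        }

        int main(void)
        {
            /* 32 pages dirtied at 128 pages/s => 250ms period; 60ms already "thought" */
            printf("pause = %ldms\n", compute_pause(32, 128, 1060, 1000));  /* 190ms */
            return 0;
        }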

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • This makes the binary trace understandable by trace-cmd.

    CC: Dave Chinner
    CC: Curt Wohlgemuth
    CC: Steven Rostedt
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

31 Oct, 2011

4 commits

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_reason parameter added to them, so callers can specify
    why writeback is being started.
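
    A hedged sketch of the shape of this change (the enum values and the
    trimmed-down work struct below are illustrative, not the full set
    added by the patch):

        /* Who asked for writeback; carried in the work item and emitted
         * symbolically by the trace events. */
        enum wb_reason {
            WB_REASON_BACKGROUND,        /* background threshold exceeded  */
            WB_REASON_SYNC,              /* sync(2)                        */
            WB_REASON_PERIODIC,          /* periodic (kupdate-style) flush */
            WB_REASON_TRY_TO_FREE_PAGES, /* memory reclaim asked for it    */
        };

        struct wb_writeback_work_sketch {
            long           nr_pages;
            enum wb_reason reason;       /* new: who initiated this work   */
        };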

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     
  • Instead of sending ->older_than_this to queue_io() and
    move_expired_inodes(), send the entire wb_writeback_work
    structure. There are other fields of a work item that are
    useful in these routines and in tracepoints.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     
  • Useful for analyzing the dynamics of the throttling algorithms and
    debugging user reported problems.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It helps understand how various throttle bandwidths are updated.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

03 Oct, 2011

1 commit

  • As proposed by Chris, Dave and Jan, don't start foreground writeback IO
    inside balance_dirty_pages(). Instead, simply let it idle sleep for some
    time to throttle the dirtying task. Meanwhile, kick off the
    per-bdi flusher thread to do background writeback IO.

    RATIONALE
    =========

    - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

    If every thread doing writes and being throttled starts foreground
    writeback, it leads to N IO submitters from at least N different
    inodes at the same time, ending up with N different sets of IO being
    issued with potentially zero locality to each other, resulting in
    much lower elevator sort/merge efficiency; hence we seek the disk
    all over the place to service the different sets of IO.
    OTOH, if there is only one submission thread, it doesn't jump between
    inodes in the same way when congestion clears - it keeps writing to
    the same inode, resulting in large related chunks of sequential IOs
    being issued to the disk. This is more efficient than the above
    foreground writeback because the elevator works better and the disk
    seeks less.

    - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

    With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
    from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

    * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

    * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

    * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

    * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

    - IO size too small for fast arrays and too large for slow USB sticks

    The write_chunk used by the current balance_dirty_pages() cannot be
    directly set to some large value (eg. 128MB) for better IO efficiency,
    because that could lead to more than 1 second of user-perceivable stalls.
    Even the current 4MB write size may be too large for slow USB sticks.
    The fact that balance_dirty_pages() starts IO on itself couples the
    IO size to the wait time, which makes it hard to pick a suitable IO size
    while keeping the wait time under control.

    Now it's possible to increase writeback chunk size proportional to the
    disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
    the larger writeback size dramatically reduces the seek count to 1/10
    (far beyond my expectation) and improves the write throughput by 24%.

    - long block time in balance_dirty_pages() hurts desktop responsiveness

    Many of us may have had the experience: it often takes a couple of seconds
    or even longer to stop a heavily writing dd/cp/tar command with
    Ctrl-C or "kill -9".

    - IO pipeline broken by bumpy write() progress

    There is a broad class of "loop {read(buf); write(buf);}" applications
    whose read() pipeline will be under-utilized or even come to a stop if
    the write()s have long latencies _or_ don't progress at a constant rate.
    The current threshold-based throttling inherently transfers the large
    low-level IO completion fluctuations to bumpy application write()s,
    and further deteriorates with an increasing number of dirtiers and/or bdi's.

    For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
    the rsync progresses very bumpily in the legacy kernel, and throughput is
    improved by 67% by this patchset. (Plus the larger write chunk size,
    it becomes a 93% speedup.)

    The new rate based throttling can support 1000+ dd's with excellent
    smoothness, low latency and low overheads.

    For the above reasons, it's much better to do IO-less and low latency
    pauses in balance_dirty_pages().

    Jan Kara, Dave Chinner and I explored the scheme to let
    balance_dirty_pages() wait for enough writeback IO completions to
    safeguard the dirty limit. However it was found to have two problems:

    - in large NUMA systems, the per-cpu counters may have big accounting
    errors, leading to big throttle wait times and jitters.

    - NFS may kill a large amount of unstable pages with one single COMMIT.
    Because the NFS server serves COMMIT with expensive fsync() IOs, it is
    desirable to delay and reduce the number of COMMITs. So it's not
    likely that such bursty IO completions, and the resulting large (and
    tiny) stall times in IO-completion-based throttling, can be optimized away.

    So here is a pause time oriented approach, which tries to control the
    pause time in each balance_dirty_pages() invocation, by controlling
    the number of pages dirtied before calling balance_dirty_pages(), for
    smooth and efficient dirty throttling:

    - avoid useless (eg. zero pause time) balance_dirty_pages() calls
    - avoid too small pause time (less than 4ms, which burns CPU power)
    - avoid too large pause time (more than 200ms, which hurts responsiveness)
    - avoid big fluctuations of pause times

    It can control pause times at will. The default policy (in a followup
    patch) will be to do ~10ms pauses in the 1-dd case, and increase to ~100ms
    in the 1000-dd case.
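
    A userspace model of that pause policy (the 4ms/200ms bounds are the
    ones named in the list above; the clamp helper itself is
    illustrative, not the kernel code):

        #include <stdio.h>

        #define MIN_PAUSE_MS   4     /* below this, sleeping mostly burns CPU */
        #define MAX_PAUSE_MS 200     /* above this, it hurts responsiveness   */

        static long clamp_pause(long pause_ms)
        {
            if (pause_ms < MIN_PAUSE_MS)
                return 0;            /* skip useless / too-small pauses */
            if (pause_ms > MAX_PAUSE_MS)
                return MAX_PAUSE_MS;
            return pause_ms;
        }

        int main(void)
        {
            /* prints "0 50 200" */
            printf("%ld %ld %ld\n", clamp_pause(2), clamp_pause(50), clamp_pause(900));
            return 0;
        }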

    BEHAVIOR CHANGE
    ===============

    (1) dirty threshold

    Users will notice that the applications will get throttled once they cross
    the global (background + dirty)/2 = 15% threshold, and then balanced around
    17.5%. Before the patch, the behavior was to just throttle at 20% dirtyable
    memory in the 1-dd case.

    Since the task will be soft throttled earlier than before, it may be
    perceived by end users as a performance "slow down" if their application
    happens to dirty more than 15% of dirtyable memory.

    (2) smoothness/responsiveness

    Users will notice a more responsive system during heavy writeback.
    "killall dd" will take effect instantly.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

10 Jul, 2011

2 commits

  • Add trace event balance_dirty_state for showing the global dirty page
    counts and thresholds at each global_dirty_limits() invocation. This
    will cover the callers throttle_vm_writeout(), over_bground_thresh()
    and each balance_dirty_pages() loop.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
    and initialize the struct writeback_control there.

    struct writeback_control is basically designed to control writeback of a
    single file, but we keep abusing it for writing multiple files in
    writeback_sb_inodes() and its callers.

    It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
    work->nr_pages starts to make sense, and instead of saving and restoring
    pages_skipped in writeback_sb_inodes() it can always start with a clean
    zero value.

    It also makes a neat IO pattern change: large dirty files are now
    written in the full 4MB writeback chunk size, rather than whatever
    quota remained in wbc->nr_to_write.

    Acked-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

08 Jun, 2011

3 commits

  • Note that it adds a little overhead to account for the moved/enqueued
    inodes from b_dirty to b_io. The "moved" accounting may later be used to
    limit the number of inodes that can be moved in one shot, in order to
    keep the spinlock hold time under control.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It is valuable to know how the dirty inodes are iterated and their IO size.

    "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

    - "state" reflects inode->i_state at the end of writeback_single_inode()
    - "index" reflects mapping->writeback_index after the ->writepages() call
    - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
    - "wrote" is the number of pages actually written

    v2: add trace event writeback_single_inode_requeue as proposed by Dave.

    CC: Dave Chinner
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • When wbc.more_io was first introduced, it indicated whether there was
    at least one superblock whose s_more_io contained more IO work. Now with
    the per-bdi writeback, it can be replaced with a simple b_more_io test.

    Acked-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

14 Jan, 2011

1 commit

  • This tracks when balance_dirty_pages() tries to wake up the flusher thread
    for background writeback (if it was not started already).

    Suggested-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

27 Oct, 2010

3 commits

  • …r if significant congestion is not being encountered in the current zone

    If congestion_wait() is called with no BDI congested, the caller will
    sleep for the full timeout and this may be an unnecessary sleep. This
    patch adds a wait_iff_congested() that checks congestion and only sleeps
    if a BDI is congested; otherwise it calls cond_resched() to ensure the caller
    is not hogging the CPU longer than its quota, but it will not sleep.

    This is aimed at reducing some of the major desktop stalls reported during
    IO. For example, while kswapd is operating, it calls congestion_wait()
    but it could just have been reclaiming clean page cache pages with no
    congestion. Without this patch, it would sleep for a full timeout but
    after this patch, it'll just call schedule() if it has been on the CPU too
    long. Similar logic applies to direct reclaimers that are not making
    enough progress.
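
    A userspace model of the wait_iff_congested() idea (the congestion
    flag, the yield and the sleep are stand-ins for the kernel
    machinery): only sleep when a BDI really is congested, otherwise
    just give up the CPU briefly.

        #include <unistd.h>
        #include <sched.h>

        static int nr_congested_bdis;          /* stand-in for global congestion state */

        static void wait_iff_congested_model(int timeout_ms)
        {
            if (nr_congested_bdis == 0) {
                sched_yield();                 /* ~cond_resched(): no pointless sleep */
                return;
            }
            usleep(timeout_ms * 1000);         /* congestion is real: back off */
        }

        int main(void)
        {
            wait_iff_congested_model(100);     /* returns almost immediately */
            nr_congested_bdis = 1;
            wait_iff_congested_model(100);     /* sleeps the full timeout    */
            return 0;
        }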

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • There is strong evidence to indicate a lot of time is being spent in
    congestion_wait(), some of it unnecessarily. This patch adds a tracepoint
    for congestion_wait to record when congestion_wait() was called, how long
    the timeout was for and how long it actually slept.
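
    A userspace model of what such a tracepoint records (the printf and
    congestion_wait_model are stand-ins, not the real trace event): the
    requested timeout versus the time actually spent sleeping, so
    unnecessary full-timeout sleeps stand out.

        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        static long now_ms(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
        }

        static void congestion_wait_model(long timeout_ms)
        {
            long start = now_ms();
            usleep(timeout_ms * 1000);         /* the kernel may return early */
            printf("congestion_wait: timeout=%ldms slept=%ldms\n",
                   timeout_ms, now_ms() - start);
        }

        int main(void)
        {
            congestion_wait_model(100);
            return 0;
        }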

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This removes more dead code that was somehow missed by commit 0d99519efef
    (writeback: remove unused nonblocking and congestion checks). There is
    no behavior change except for the removal of two entries from one of the
    ext4 tracing interfaces.

    The nonblocking checks in ->writepages are no longer used because the
    flusher now prefers to block on get_request_wait() rather than skip inodes on
    IO congestion. The latter would lead to more seeky IO.

    The nonblocking checks in ->writepage are no longer used because they are
    redundant with the WB_SYNC_NONE check.

    We no longer set ->nonblocking in VM page out and page migration, because
    a) it's effectively redundant with WB_SYNC_NONE in the current code
    b) its old semantic of "Don't get stuck on request queues" is a misbehavior:
    it would skip some dirty inodes on congestion and page out others, which
    is unfair in terms of LRU age.

    Inspired by Christoph Hellwig. Thanks!

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Cc: David Howells
    Cc: Sage Weil
    Cc: Steve French
    Cc: Chris Mason
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

08 Aug, 2010

5 commits

  • Add 2 new trace points to the periodic write-back wake up case, just like we do
    in the 'bdi_queue_work()' function. Namely, introduce:

    1. trace_writeback_wake_thread(bdi)
    2. trace_writeback_wake_forker_thread(bdi)

    The first event is triggered every time we wake up a bdi thread to start
    periodic background write-out. The second event is triggered only when the bdi
    thread does not exist and should be created by the forker thread.

    This patch was suggested by Dave Chinner and Christoph Hellwig.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • include/trace/events/writeback.h uses dev_name(), so it needs to
    include linux/device.h.

    include/trace/events/writeback.h:12: error: implicit declaration of function 'dev_name'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Randy Dunlap
     
  • Add a trace event to the ->writepage loop in write_cache_pages to give
    visibility into how the ->writepage call is changing variables within the
    writeback control structure. Of most interest is how wbc->nr_to_write changes
    from call to call, especially with filesystems that write multiple pages
    in ->writepage.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dave Chinner
     
  • Tracing high level background writeback events is good, but it doesn't
    give the entire picture. Add visibility into write throttling to catch IO
    dispatched by foreground throttling of processes dirtying lots of pages.

    Signed-off-by: Dave Chinner
    Signed-off-by: Jens Axboe

    Dave Chinner
     
  • Trace queue/sched/exec parts of the writeback loop. This provides
    insight into when and why flusher threads are scheduled to run, e.g.
    a sync invocation leaves traces like:

    sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
    flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0

    This also lays the foundation for adding more writeback tracing to
    provide deeper insight into the whole writeback path.

    The original tracing code is from Jens Axboe, though this version is
    a rewrite as a result of the code being traced changing
    significantly.

    Signed-off-by: Dave Chinner
    Signed-off-by: Jens Axboe

    Dave Chinner