22 Nov, 2011

1 commit

  • Writeback and thinkpad_acpi have been using thaw_process() to prevent
    deadlock between the freezer and kthread_stop(); unfortunately, this
    is inherently racy - nothing prevents freezing from happening between
    thaw_process() and kthread_stop().

    This patch implements kthread_freezable_should_stop() which enters
    refrigerator if necessary but is guaranteed to return if
    kthread_stop() is invoked. Both thaw_process() users are converted to
    use the new function.

    Note that this deadlock condition exists for many of freezable
    kthreads. They need to be converted to use the new should_stop or
    freezable workqueue.

    Tested with synthetic test case.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Cc: Jens Axboe
    Cc: Oleg Nesterov

    Tejun Heo
     

11 Nov, 2011

1 commit

  • bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links
    from all super_blocks and then del_timer_sync() the writeback timer.

    However, this can race with __mark_inode_dirty(), leading to
    bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
    we're unregistering, after we've called del_timer_sync().

    This can end up with the bdi being freed with an active timer inside it,
    as in the case of the following dump after the removal of an SD card.

    Fix this by redoing the del_timer_sync() in bdi_destory().

    ------------[ cut here ]------------
    WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
    ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
    Modules linked in:
    Backtrace:
    [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
    r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
    [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
    [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_fmt+0x38/0x40)
    r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
    r3:00000009
    [] (warn_slowpath_fmt+0x0/0x40) from [] (debug_print_object+0x9c/0xc8)
    r3:c02b1b29 r2:c02bc662
    [] (debug_print_object+0x0/0xc8) from [] (debug_check_no_obj_freed+0xac/0x1dc)
    r6:c7964000 r5:00000001 r4:c7964000
    [] (debug_check_no_obj_freed+0x0/0x1dc) from [] (kmem_cache_free+0x88/0x1f8)
    [] (kmem_cache_free+0x0/0x1f8) from [] (blk_release_queue+0x70/0x78)
    [] (blk_release_queue+0x0/0x78) from [] (kobject_release+0x70/0x84)
    r5:c79641f0 r4:c796420c
    [] (kobject_release+0x0/0x84) from [] (kref_put+0x68/0x80)
    r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
    [] (kref_put+0x0/0x80) from [] (kobject_put+0x48/0x5c)
    r5:c79643b4 r4:c79641f0
    [] (kobject_put+0x0/0x5c) from [] (blk_cleanup_queue+0x68/0x74)
    r4:c7964000
    [] (blk_cleanup_queue+0x0/0x74) from [] (mmc_blk_put+0x78/0xe8)
    r5:00000000 r4:c794c400
    [] (mmc_blk_put+0x0/0xe8) from [] (mmc_blk_release+0x24/0x38)
    r5:c794c400 r4:c0322824
    [] (mmc_blk_release+0x0/0x38) from [] (__blkdev_put+0xe8/0x170)
    r5:c78d5e00 r4:c74083c0
    [] (__blkdev_put+0x0/0x170) from [] (blkdev_put+0x11c/0x12c)
    r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
    r3:00000000
    [] (blkdev_put+0x0/0x12c) from [] (kill_block_super+0x60/0x6c)
    r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
    [] (kill_block_super+0x0/0x6c) from [] (deactivate_locked_super+0x44/0x70)
    r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
    [] (deactivate_locked_super+0x0/0x70) from [] (deactivate_super+0x6c/0x70)
    r5:c794dc00 r4:c794dc00
    [] (deactivate_super+0x0/0x70) from [] (mntput_no_expire+0x188/0x194)
    r5:c794dc00 r4:c7942300
    [] (mntput_no_expire+0x0/0x194) from [] (sys_umount+0x2e4/0x310)
    r6:c7942300 r5:00000000 r4:00000000 r3:00000000
    [] (sys_umount+0x0/0x310) from [] (ret_fast_syscall+0x0/0x30)
    ---[ end trace e5c83c92ada51c76 ]---

    Cc: stable@kernel.org
    Signed-off-by: Rabin Vincent
    Signed-off-by: Linus Walleij
    Signed-off-by: Jens Axboe

    Rabin Vincent
     

07 Nov, 2011

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

01 Nov, 2011

1 commit


31 Oct, 2011

1 commit

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' and tracepoint event class and
    'writeback_queue_io' tracepoints are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_stats parameter added to them, so callers can specify
    why writeback is being started.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

03 Oct, 2011

3 commits

  • There are some imperfections in balanced_dirty_ratelimit.

    1) large fluctuations

    The dirty_rate used for computing balanced_dirty_ratelimit is merely
    averaged in the past 200ms (very small comparing to the 3s estimation
    period for write_bw), which makes rather dispersed distribution of
    balanced_dirty_ratelimit.

    It's pretty hard to average out the singular points by increasing the
    estimation period. Considering that the averaging technique will
    introduce very undesirable time lags, I give it up totally. (btw, the 3s
    write_bw averaging time lag is much more acceptable because its impact
    is one-way and therefore won't lead to oscillations.)

    The more practical way is filtering -- most singular
    balanced_dirty_ratelimit points can be filtered out by remembering some
    prev_balanced_rate and prev_prev_balanced_rate. However the more
    reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.

    2) due to truncates and fs redirties, the (write_bw dirty_rate)
    match could become unbalanced, which may lead to large systematical
    errors in balanced_dirty_ratelimit. The truncates, due to its possibly
    bumpy nature, can hardly be compensated smoothly. So let's face it. When
    some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
    high, dirty pages will go higher than the setpoint. task_ratelimit will
    in turn become lower than dirty_ratelimit. So if we consider both
    balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
    only when they are on the same side of dirty_ratelimit, the systematical
    errors in balanced_dirty_ratelimit won't be able to bring
    dirty_ratelimit far away.

    The balanced_dirty_ratelimit estimation may also be inaccurate near
    @limit or @freerun, however is less an issue.

    3) since we ultimately want to

    - keep the fluctuations of task ratelimit as small as possible
    - keep the dirty pages around the setpoint as long time as possible

    the update policy used for (2) also serves the above goals nicely:
    if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
    and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
    there is no point to bring up dirty_ratelimit in a hurry only to hurt
    both the above two goals.

    So, we make use of task_ratelimit to limit the update of dirty_ratelimit
    in two ways:

    1) avoid changing dirty rate when it's against the position control target
    (the adjusted rate will slow down the progress of dirty pages going
    back to setpoint).

    2) limit the step size. task_ratelimit is changing values step by step,
    leaving a consistent trace comparing to the randomly jumping
    balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
    errors in stable state and typically larger errors when there are big
    errors in rate. So it's a pretty good limiting factor for the step
    size of dirty_ratelimit.

    Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
    task_ratelimit is merely used as a limiting factor.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
    when there are N dd tasks.

    On write() syscall, use bdi->dirty_ratelimit
    ============================================

    balance_dirty_pages(pages_dirtied)
    {
    task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
    pause = pages_dirtied / task_ratelimit;
    sleep(pause);
    }

    On every 200ms, update bdi->dirty_ratelimit
    ===========================================

    bdi_update_dirty_ratelimit()
    {
    task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
    balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
    bdi->dirty_ratelimit = balanced_dirty_ratelimit
    }

    Estimation of balanced bdi->dirty_ratelimit
    ===========================================

    balanced task_ratelimit
    -----------------------

    balance_dirty_pages() needs to throttle tasks dirtying pages such that
    the total amount of dirty pages stays below the specified dirty limit in
    order to avoid memory deadlocks. Furthermore we desire fairness in that
    tasks get throttled proportionally to the amount of pages they dirty.

    IOW we want to throttle tasks such that we match the dirty rate to the
    writeout bandwidth, this yields a stable amount of dirty pages:

    dirty_rate == write_bw (1)

    The fairness requirement gives us:

    task_ratelimit = balanced_dirty_ratelimit
    == write_bw / N (2)

    where N is the number of dd tasks. We don't know N beforehand, but
    still can estimate balanced_dirty_ratelimit within 200ms.

    Start by throttling each dd task at rate

    task_ratelimit = task_ratelimit_0 (3)
    (any non-zero initial value is OK)

    After 200ms, we measured

    dirty_rate = # of pages dirtied by all dd's / 200ms
    write_bw = # of pages written to the disk / 200ms

    For the aggressive dd dirtiers, the equality holds

    dirty_rate == N * task_rate
    == N * task_ratelimit_0 (4)
    Or
    task_ratelimit_0 == dirty_rate / N (5)

    Now we conclude that the balanced task ratelimit can be estimated by

    write_bw
    balanced_dirty_ratelimit = task_ratelimit_0 * ---------- (6)
    dirty_rate

    Because with (4) and (5) we can get the desired equality (1):

    write_bw
    balanced_dirty_ratelimit == (dirty_rate / N) * ----------
    dirty_rate
    == write_bw / N

    Then using the balanced task ratelimit we can compute task pause times like:

    task_pause = task->nr_dirtied / task_ratelimit

    task_ratelimit with position control
    ------------------------------------

    However, while the above gives us means of matching the dirty rate to
    the writeout bandwidth, it at best provides us with a stable dirty page
    count (assuming a static system). In order to control the dirty page
    count such that it is high enough to provide performance, but does not
    exceed the specified limit we need another control.

    The dirty position control works by extending (2) to

    task_ratelimit = balanced_dirty_ratelimit * pos_ratio (7)

    where pos_ratio is a negative feedback function that subjects to

    1) f(setpoint) = 1.0
    2) df/dx < 0

    That is, if the dirty pages are ABOVE the setpoint, we throttle each
    task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty
    pages are created less fast than they are cleaned, thus DROP to the
    setpoints (and the reverse).

    Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
    remains CONSTANT for the past 200ms, we get

    task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio (8)

    Putting (8) into (6), we get the formula used in
    bdi_update_dirty_ratelimit():

    write_bw
    balanced_dirty_ratelimit *= pos_ratio * ---------- (9)
    dirty_rate

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce the BDI_DIRTIED counter. It will be used for estimating the
    bdi's dirty bandwidth.

    CC: Jan Kara
    CC: Michael Rubin
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

03 Sep, 2011

2 commits


27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

26 Jul, 2011

2 commits

  • * Merge akpm patch series: (122 commits)
    drivers/connector/cn_proc.c: remove unused local
    Documentation/SubmitChecklist: add RCU debug config options
    reiserfs: use hweight_long()
    reiserfs: use proper little-endian bitops
    pnpacpi: register disabled resources
    drivers/rtc/rtc-tegra.c: properly initialize spinlock
    drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time()
    drivers/rtc: add support for Qualcomm PMIC8xxx RTC
    drivers/rtc/rtc-s3c.c: support clock gating
    drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200
    init: skip calibration delay if previously done
    misc/eeprom: add eeprom access driver for digsy_mtc board
    misc/eeprom: add driver for microwire 93xx46 EEPROMs
    checkpatch.pl: update $logFunctions
    checkpatch: make utf-8 test --strict
    checkpatch.pl: add ability to ignore various messages
    checkpatch: add a "prefer __aligned" check
    checkpatch: validate signature styles and To: and Cc: lines
    checkpatch: add __rcu as a sparse modifier
    checkpatch: suggest using min_t or max_t
    ...

    Did this as a merge because of (trivial) conflicts in
    - Documentation/feature-removal-schedule.txt
    - arch/xtensa/include/asm/uaccess.h
    that were just easier to fix up in the merge than in the patch series.

    Linus Torvalds
     
  • Vito said:

    : The system has many usb disks coming and going day to day, with their
    : respective bdi's having min_ratio set to 1 when inserted. It works for
    : some time until eventually min_ratio can no longer be set, even when the
    : active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to
    : anywhere near 100.
    :
    : This then leads to an unrelated starvation problem caused by write-heavy
    : fuse mounts being used atop the usb disks, a problem the min_ratio setting
    : at the underlying devices bdi effectively prevents.

    Fix this leakage by resetting the bdi min_ratio when unregistering the
    BDI.

    Signed-off-by: Peter Zijlstra
    Reported-by: Vito Caputo
    Cc: Wu Fengguang
    Cc: Miklos Szeredi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

24 Jul, 2011

1 commit

  • backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu

    synchronize_rcu sleeps several timer ticks. synchronize_rcu_expedited is
    much faster.

    With 100Hz timer frequency, when we remove 10000 block devices with
    "dmsetup remove_all" command, it takes 27 minutes. With this patch,
    removing 10000 block devices takes only 15 seconds.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     

10 Jul, 2011

4 commits

  • Add a "BdiWriteBandwidth" entry and indent others in /debug/bdi/*/stats.

    btw, increase digital field width to 10, for keeping the possibly
    huge BdiWritten number aligned at least for desktop systems.

    Impact: this could break user space tools if they are dumb enough to
    depend on the number of white spaces.

    CC: Theodore Ts'o
    CC: Jan Kara
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The estimation value will start from 100MB/s and adapt to the real
    bandwidth in seconds.

    It tries to update the bandwidth only when disk is fully utilized.
    Any inactive period of more than one second will be skipped.

    The estimated bandwidth will be reflecting how fast the device can
    writeout when _fully utilized_, and won't drop to 0 when it goes idle.
    The value will remain constant at disk idle time. At busy write time, if
    not considering fluctuations, it will also remain high unless be knocked
    down by possible concurrent reads that compete for the disk time and
    bandwidth with async writes.

    The estimation is not done purely in the flusher because there is no
    guarantee for write_cache_pages() to return timely to update bandwidth.

    The bdi->avg_write_bandwidth smoothing is very effective for filtering
    out sudden spikes, however may be a little biased in long term.

    The overheads are low because the bdi bandwidth update only occurs at
    200ms intervals.

    The 200ms update interval is suitable, because it's not possible to get
    the real bandwidth for the instance at all, due to large fluctuations.

    The NFS commits can be as large as seconds worth of data. One XFS
    completion may be as large as half second worth of data if we are going
    to increase the write chunk to half second worth of data. In ext4,
    fluctuations with time period of around 5 seconds is observed. And there
    is another pattern of irregular periods of up to 20 seconds on SSD tests.

    That's why we are not only doing the estimation at 200ms intervals, but
    also averaging them over a period of 3 seconds and then go further to do
    another level of smoothing in avg_write_bandwidth.

    CC: Li Shaohua
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Introduce the BDI_WRITTEN counter. It will be used for estimating the
    bdi's write bandwidth.

    Peter Zijlstra :
    Move BDI_WRITTEN accounting into __bdi_writeout_inc().
    This will cover and fix fuse, which only calls bdi_writeout_inc().

    CC: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     
  • Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
    and initialize the struct writeback_control there.

    struct writeback_control is basically designed to control writeback of a
    single file, but we keep abuse it for writing multiple files in
    writeback_sb_inodes() and its callers.

    It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
    work->nr_pages starts to make sense, and instead of saving and restoring
    pages_skipped in writeback_sb_inodes it can always start with a clean
    zero value.

    It also makes a neat IO pattern change: large dirty files are now
    written in the full 4MB writeback chunk size, rather than whatever
    remained quota in wbc->nr_to_write.

    Acked-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
    akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

21 May, 2011

1 commit


31 Mar, 2011

1 commit


25 Mar, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

17 Mar, 2011

1 commit


10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Oct, 2010

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • PF_FLUSHER is only ever set, not tested, remove it.

    Signed-off-by: Peter Zijlstra
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • …r if significant congestion is not being encountered in the current zone

    If congestion_wait() is called with no BDI congested, the caller will
    sleep for the full timeout and this may be an unnecessary sleep. This
    patch adds a wait_iff_congested() that checks congestion and only sleeps
    if a BDI is congested else, it calls cond_resched() to ensure the caller
    is not hogging the CPU longer than its quota but otherwise will not sleep.

    This is aimed at reducing some of the major desktop stalls reported during
    IO. For example, while kswapd is operating, it calls congestion_wait()
    but it could just have been reclaiming clean page cache pages with no
    congestion. Without this patch, it would sleep for a full timeout but
    after this patch, it'll just call schedule() if it has been on the CPU too
    long. Similar logic applies to direct reclaimers that are not making
    enough progress.

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • There is strong evidence to indicate a lot of time is being spent in
    congestion_wait(), some of it unnecessarily. This patch adds a tracepoint
    for congestion_wait to record when congestion_wait() was called, how long
    the timeout was for and how long it actually slept.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

26 Oct, 2010

1 commit

  • The use of the same inode list structure (inode->i_list) for two
    different list constructs with different lifecycles and purposes
    makes it impossible to separate the locking of the different
    operations. Therefore, to enable the separation of the locking of
    the writeback and reclaim lists, split the inode->i_list into two
    separate lists dedicated to their specific tracking functions.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     

22 Sep, 2010

1 commit


27 Aug, 2010

1 commit

  • This patch fixes the following issue:

    INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000
    ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
    ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
    00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
    Call Trace:
    [] schedule_timeout+0x34/0xf1
    [] ? wait_for_common+0x3f/0x130
    [] ? trace_hardirqs_on+0xd/0xf
    [] wait_for_common+0xd2/0x130
    [] ? default_wake_function+0x0/0xf
    [] ? _raw_spin_unlock+0x26/0x2a
    [] wait_for_completion+0x18/0x1a
    [] sync_inodes_sb+0xca/0x1bc
    [] __sync_filesystem+0x47/0x7e
    [] sync_filesystem+0x47/0x4b
    [] generic_shutdown_super+0x22/0xd2
    [] kill_anon_super+0x11/0x4f
    [] nfs4_kill_super+0x3f/0x72 [nfs]
    [] deactivate_locked_super+0x21/0x41
    [] deactivate_super+0x40/0x45
    [] mntput_no_expire+0xb8/0xed
    [] release_mounts+0x9a/0xb0
    [] put_mnt_ns+0x6a/0x7b
    [] nfs_follow_remote_path+0x19a/0x296 [nfs]
    [] nfs4_try_mount+0x75/0xaf [nfs]
    [] nfs4_get_sb+0x276/0x2ff [nfs]
    [] vfs_kern_mount+0xb8/0x196
    [] do_kern_mount+0x48/0xe8
    [] do_mount+0x771/0x7e8
    [] sys_mount+0x83/0xbd
    [] system_call_fastpath+0x16/0x1b

    The reason of this hang was a race condition: when the flusher thread is
    forking a bdi thread, we use 'kthread_run()', so we run it _before_ we make it
    visible in 'bdi->wb.task'. The bdi thread runs, does all works, and goes sleep.
    'bdi->wb.task' is still NULL. And this is a dangerous time window.

    If at this time someone queues a work for this bdi, he does not see the bdi
    thread and wakes up the forker thread instead! But the forker has already
    forked this bdi thread, but just did not make it visible yet!

    The result is that we lose the wake up event for this bdi thread and the NFS4
    code waits forever.

    To fix the problem, we should use 'ktrhead_create()' for creating bdi threads,
    then make them visible in 'bdi->wb.task', and only after this wake them up.
    This is exactly what this patch does.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     

12 Aug, 2010

1 commit

  • Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
    that the latter can be avoided when under global dirty background
    threshold (which is the normal state for most systems).

    Signed-off-by: Wu Fengguang
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

08 Aug, 2010

7 commits

  • Fix a bug where a lock is _bh nested within another _bh lock,
    but forgets to use the _bh variant for unlock.

    Further more, it's not necessary to test _bh locks, the inner lock
    can just use spin_lock(). So fix up the bug by making that change.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch makes sure we first initialize everything and set the BDI_registered
    flag, and only after this we add the bdi to 'bdi_list'. Current code adds the
    bdi to the list too early, and as a result I the

    WARN(!test_bit(BDI_registered, &bdi->state)

    in bdi forker is triggered. Also, it is in general good practice to make things
    visible only when they are fully initialized.

    Also, this patch does few micro clean-ups:
    1. Removes the 'exit' label which does not do anything, just returns. This
    allows to get rid of few braces and 'ret' variable and make the code smaller.
    2. If 'kthread_run()' fails, remove the error code it returns, not hard-coded
    '-ENOMEM'. Theoretically, some day 'kthread_run()' can return something
    else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to
    NULL.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Add 2 new trace points to the periodic write-back wake up case, just like we do
    in the 'bdi_queue_work()' function. Namely, introduce:

    1. trace_writeback_wake_thread(bdi)
    2. trace_writeback_wake_forker_thread(bdi)

    The first event is triggered every time we wake up a bdi thread to start
    periodic background write-out. The second event is triggered only when the bdi
    thread does not exist and should be created by the forker thread.

    This patch was suggested by Dave Chinner and Christoph Hellwig.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • The 'setup_timer()' function also calls 'init_timer()', so the extra
    'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically
    'init_timer()' plus callback function and data pointers initialization.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which
    should take care of the periodic background write-out. However, the write-out
    will actually start only 'dirty_writeback_interval' centisecs later, so we can
    delay the wake-up.

    This change was requested by Nick Piggin who pointed out that if we delay the
    wake-up, we weed out 2 unnecessary contex switches, which matters because
    '__mark_inode_dirty()' is a hot-path function.

    This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
    sets up a timer to wake-up the bdi thread and returns. So the wake-up is
    delayed.

    We also delete the timer in bdi threads just before writing-back. And
    synchronously delete it when unregistering bdi. At the unregister point the bdi
    does not have any users, so no one can arm it again.

    Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
    context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
    this change as well.

    This patch also moves the 'bdi_wb_init()' function down in the file to avoid
    forward-declaration of 'bdi_wakeup_thread_delayed()'.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
    bad for battery-driven devices.

    There are two types of activities bdi threads do:
    1. process bdi works from the 'bdi->work_list'
    2. periodic write-back

    So there are 2 sources of wake-up events for bdi threads:

    1. 'bdi_queue_work()' - submits bdi works
    2. '__mark_inode_dirty()' - adds dirty I/O to bdi's

    The former already has bdi wake-up code. The latter does not, and this patch
    adds it.

    '__mark_inode_dirty()' is hot-path function, but this patch adds another
    'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
    the bdi has no dirty inodes. So adding this spinlock should be fine and should
    not affect performance.

    This patch makes sure bdi threads and the forker thread do not wake-up if there
    is nothing to do. The forker thread will nevertheless wake up at least every
    5 min. to check whether it has to kill a bdi thread. This can also be optimized,
    but is not worth it.

    This patch also tidies up the warning about unregistered bid, and turns it from
    an ugly crocodile to a simple 'WARN()' statement.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently, bdi threads can decide to exit if there were no useful activities
    for 5 minutes. However, this causes nasty races: we can easily oops in the
    'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.

    And even if we do not oops, but the bdi tread exits immediately after we wake
    it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
    in the bdi work processing.

    This patch makes the forker thread to be the central place which not only
    creates bdi threads, but also kills them if they were inactive long enough.
    This better design-wise.

    Another reason why this change was done is to prepare for the further changes
    which will prevent the bdi threads from waking up every 5 sec and wasting
    power. Indeed, when the task does not wake up periodically anymore, it won't be
    able to exit either.

    This patch also moves the the 'wake_up_bit()' call from the bdi thread to the
    forker thread as well. So now the forker thread sets the BDI_pending bit, then
    forks the task or kills it, then clears the bit and wakes up the waiting
    process.

    The only process which may wain on the bit is 'bdi_wb_shutdown()'. This
    function was changed as well - now it first removes the bdi from the
    'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
    guaranteed that the forker thread won't race with it, because the bdi is not
    visible. Note, the forker thread sets the 'BDI_pending' bit under the
    'bdi->wb_lock' which is essential for proper serialization.

    And additionally, when we change 'bdi->wb.task', we now take the
    'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
    would when raced with, say, 'bdi_queue_work()'.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy