01 Aug, 2012

1 commit

  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backward compatibility, print a warning via printk and return 2 to
    notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
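
    A sketch of what such a compatibility handler could look like (hypothetical
    handler name; only the warn-and-return-2 behaviour is taken from the entry
    above, the rest is illustrative):

        static int nr_pdflush_threads_obsolete(struct ctl_table *table, int write,
                                               void __user *buffer, size_t *lenp,
                                               loff_t *ppos)
        {
                char kbuf[] = "0\n";            /* the knob always reads as 0 */

                if (*ppos) {                    /* nothing left to read */
                        *lenp = 0;
                        return 0;
                }
                if (copy_to_user(buffer, kbuf, sizeof(kbuf)))
                        return -EFAULT;
                printk_once(KERN_WARNING
                            "%s exported in /proc is scheduled for removal\n",
                            table->procname);
                *lenp = 2;
                *ppos += 2;
                return 2;                       /* the "return 2" described above */
        }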
     

31 Jul, 2012

1 commit

  • Pull writeback updates from Wu Fengguang:
    "Use time based periods to age the writeback proportions, which can
    adapt equally well to fast/slow devices."

    Fix up trivial conflict in comment in fs/sync.c

    * tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Fix some comment errors
    block: Convert BDI proportion calculations to flexible proportions
    lib: Fix possible deadlock in flexible proportion code
    lib: Proportions with flexible period

    Linus Torvalds
     

23 Jul, 2012

1 commit

  • In principle, a filesystem may want to have ->sync_fs() called during sync(1)
    even though it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
    Only the writeback code really needs the bdi set to something reasonable, so
    move the checks to where they are more logical.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
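
    A hypothetical sketch of the resulting shape (simplified; the exact helpers
    and call sites in the real patch may differ): the bdi check moves into the
    writeback helper itself, so the sync path can call ->sync_fs() regardless of
    whether the filesystem has a real bdi.

        void sync_inodes_sb(struct super_block *sb)
        {
                /* Writeback needs a real bdi; bail out early if there is none. */
                if (sb->s_bdi == &noop_backing_dev_info)
                        return;
                /* ... queue WB_SYNC_ALL work for the flusher and wait for it ... */
        }

        static int __sync_filesystem(struct super_block *sb, int wait)
        {
                if (wait)
                        sync_inodes_sb(sb);
                else
                        writeback_inodes_sb(sb, WB_REASON_SYNC);

                if (sb->s_op->sync_fs)
                        sb->s_op->sync_fs(sb, wait);    /* runs even without a bdi */
                return __sync_blockdev(sb->s_bdev, wait);
        }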
     

09 Jun, 2012

2 commits

  • Signed-off-by: Wanpeng Li
    Signed-off-by: Fengguang Wu

    Wanpeng Li
     
  • Fix a bug introduced by commit 169ebd90: we have to have wb->list_lock locked
    when restarting the writeback loop after having waited for inode writeback.

    Bug description by Ted Ts'o:

    I can reproduce this fairly easily by using ext4 w/o a journal, running
    under KVM with 1024megs memory, with fsstress (xfstests #13):

    [ 45.153294] =====================================
    [ 45.154784] [ BUG: bad unlock balance detected! ]
    [ 45.155591] 3.5.0-rc1-00002-gb22b1f1 #124 Not tainted
    [ 45.155591] -------------------------------------
    [ 45.155591] flush-254:16/2499 is trying to release lock (&(&wb->list_lock)->rlock) at:
    [ 45.155591] [] writeback_sb_inodes+0x160/0x327
    [ 45.155591] but there are no more locks to release!

    Reported-by: Theodore Ts'o
    Tested-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
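
    The fix boils down to re-taking wb->list_lock before restarting the loop,
    roughly (a sketch of the relevant branch in writeback_sb_inodes(), assuming
    the inode_sleep_on_writeback() helper from that series; not the verbatim
    diff):

        if (inode->i_state & I_SYNC) {
                /* Wait for I_SYNC; this drops i_lock, and wb->list_lock was
                 * already dropped before we got here. */
                inode_sleep_on_writeback(inode);
                /* The inode may be gone by now; re-take the list lock before
                 * restarting the scan. */
                spin_lock(&wb->list_lock);
                continue;
        }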
     

06 May, 2012

7 commits

  • Doing iput() from the flusher thread (writeback_sb_inodes()) can create
    problems because iput() can do a lot of work - for example truncate the inode
    if it's the last iput on an unlinked file. Some filesystems depend on the
    flusher thread making progress (e.g. because they need to flush delayed
    allocated blocks to reduce allocation uncertainty), so the flusher thread
    doing truncates creates interesting dependencies and possibilities for
    deadlocks.

    We get rid of iput() in the flusher thread by using the fact that the I_SYNC
    inode flag effectively pins the inode in memory. So if we take care to either
    hold i_lock or have I_SYNC set, we can get away without taking an inode
    reference in writeback_sb_inodes().

    As a side effect of these changes, we also fix a possible use-after-free in
    wb_writeback(), because the inode_wait_for_writeback() call could try to
    reacquire i_lock on an inode that had already been freed.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • The code in writeback_single_inode() is relatively complex. The list requeuing
    logic makes sense only for the flusher thread, not really for sync_inode() or
    write_inode_now() callers. Also, when we want to get rid of the inode
    references held by the flusher thread, we will need special I_SYNC handling
    there.

    So separate the part of writeback_single_inode() that does the real writeback
    work into __writeback_single_inode(), and make writeback_single_inode() do
    only the things necessary for callers writing a single inode, moving the
    special list handling into writeback_sb_inodes(). As a side effect this fixes
    a possible race where we could skip some inode during sync(2) because another
    writer refiled it from the b_io to the b_dirty list. Also, I_SYNC handling is
    moved into the callers of __writeback_single_inode() to make locking easier.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • writeback_single_inode() doesn't need wb->list_lock for anything on entry now.
    So remove the requirement. This makes locking of writeback_single_inode()
    temporarily awkward (entering with i_lock, returning with i_lock and
    wb->list_lock) but it will be sanitized in the next patch.

    Also, inode_wait_for_writeback() doesn't need wb->list_lock for anything. It
    was just taking it to make usage convenient for callers, but with
    writeback_single_inode() changing, it's not very convenient anymore. So remove
    the lock from that function.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Move inode requeueing after inode has been written out into a separate
    function.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
    writing them all, clear the bit only when we have succeeded in writing all the
    dirty pages. We also move the clearing of the bit close to the other i_state
    handling to separate it from the writeback list handling. This is desirable
    because list handling will differ for the flusher thread and other
    writeback_single_inode() callers in the future. No filesystem plays any tricks
    with I_DIRTY_PAGES (like checking it in its ->writepages or ->write_inode
    implementation), so this movement is safe.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
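
    The resulting logic is roughly the following fragment (a sketch of the
    relevant lines, assuming the usual dirty-tag check on the mapping; not the
    verbatim patch):

        spin_lock(&inode->i_lock);
        /* Clear I_DIRTY_PAGES only if every dirty page got written out;
         * otherwise the flag stays set and the inode remains dirty. */
        if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                inode->i_state &= ~I_DIRTY_PAGES;
        dirty = inode->i_state & I_DIRTY;
        inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
        spin_unlock(&inode->i_lock);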
     
  • When writeback_single_inode() is called on an inode which already has I_SYNC
    set while doing WB_SYNC_NONE, the inode is moved to the b_more_io list.
    However, this makes sense only if the caller is the flusher thread. For other
    callers of writeback_single_inode() it doesn't really make sense and may even
    be wrong - the flusher thread may be doing WB_SYNC_ALL writeback in parallel.

    So move the requeueing from writeback_single_inode() to writeback_sb_inodes().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Move clearing of I_SYNC into inode_sync_complete(). It is more logical to have
    clearing of I_SYNC bit and waking of waiters in one place. Also later we will
    have two places needing to clear I_SYNC and wake up waiters so this allows them
    to use the common helper. Moving of I_SYNC clearing to a later stage of
    writeback_single_inode() is safe since we hold i_lock all the time.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
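
    The helper then has roughly this shape (a sketch; the ordering of the clear,
    the barrier and the wakeup is the important part):

        static void inode_sync_complete(struct inode *inode)
        {
                inode->i_state &= ~I_SYNC;
                /* Waiters must see I_SYNC cleared before being woken up. */
                smp_mb();
                wake_up_bit(&inode->i_state, __I_SYNC);
        }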
     

25 Mar, 2012

1 commit

  • Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker:
    "Fix up files in fs/ and lib/ dirs to only use module.h if they really
    need it.

    These are trivial in scope vs the work done previously. We now have
    things where any few remaining cleanups can be farmed out to arch or
    subsystem maintainers, and I have done so when possible. What is
    remaining here represents the bits that don't clearly lie within a
    single arch/subsystem boundary, like the fs dir and the lib dir.

    Some duplicate includes arising from overlapping fixes from
    independent subsystem maintainer submissions are also quashed."

    Fix up trivial conflicts due to clashes with other include file cleanups
    (including some due to the previous bug.h cleanup pull).

    * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    lib: reduce the use of module.h wherever possible
    fs: reduce the use of module.h wherever possible
    includecheck: delete any duplicate instances of module.h

    Linus Torvalds
     

21 Mar, 2012

3 commits

  • The comment is hopelessly outdated and misplaced. We no longer have a 'bdi'
    part of writeback work, the comment about the blockdev super is outdated, and
    so is the comment about throttling. Information about list handling is covered
    in more detail at queue_io(). So just move the bit about older_than_this close
    to move_expired_inodes() and remove the rest.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • inode_sync_wait() in write_inode_now() is just bogus. That function waits for
    the I_SYNC bit to be cleared, but writeback_single_inode() clears the bit on
    return, so the wait is effectively a nop unless someone else submits the inode
    for writeback again. All the waiting write_inode_now() needs is achieved by
    using the WB_SYNC_ALL writeback mode.

    Signed-off-by: Jan Kara
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Fengguang Wu

    Jan Kara
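
    The waiting that write_inode_now() needs comes from the writeback_control it
    builds, roughly (a simplified fragment; locking and the exact
    writeback_single_inode() signature of that kernel version are elided):

        struct writeback_control wbc = {
                .nr_to_write = LONG_MAX,
                .sync_mode   = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
                .range_start = 0,
                .range_end   = LLONG_MAX,
        };

        /* WB_SYNC_ALL makes writeback_single_inode() itself wait on I_SYNC and
         * on the pages, so the trailing inode_sync_wait() can simply go away. */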
     
  • Pull trivial tree from Jiri Kosina:
    "It's indeed trivial -- mostly documentation updates and a bunch of
    typo fixes from Masanari.

    There are also several linux/version.h include removals from Jesper."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
    kcore: fix spelling in read_kcore() comment
    constify struct pci_dev * in obvious cases
    Revert "char: Fix typo in viotape.c"
    init: fix wording error in mm_init comment
    usb: gadget: Kconfig: fix typo for 'different'
    Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
    writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
    writeback: fix typo in the writeback_control comment
    Documentation: Fix multiple typo in Documentation
    tpm_tis: fix tis_lock with respect to RCU
    Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
    Doc: Update numastat.txt
    qla4xxx: Add missing spaces to error messages
    compiler.h: Fix typo
    security: struct security_operations kerneldoc fix
    Documentation: broken URL in libata.tmpl
    Documentation: broken URL in filesystems.tmpl
    mtd: simplify return logic in do_map_probe()
    mm: fix comment typo of truncate_inode_pages_range
    power: bq27x00: Fix typos in comment
    ...

    Linus Torvalds
     

11 Jan, 2012

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
    PM / Hibernate: Implement compat_ioctl for /dev/snapshot
    PM / Freezer: fix return value of freezable_schedule_timeout_killable()
    PM / shmobile: Allow the A4R domain to be turned off at run time
    PM / input / touchscreen: Make st1232 use device PM QoS constraints
    PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
    PM / shmobile: Remove the stay_on flag from SH7372's PM domains
    PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
    PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    ARM: S3C64XX: Implement basic power domain support
    PM / shmobile: Use common always on power domain governor
    ...

    Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
    XBT_FORCE_SLEEP bit

    Linus Torvalds
     

04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

26 Dec, 2011

1 commit

  • * pm-sleep: (51 commits)
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
    PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
    PM / Sleep: Make [un]lock_system_sleep() generic
    PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
    PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
    PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
    Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
    PM / Sleep: Unify diagnostic messages from device suspend/resume
    ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
    PM / Hibernate: Remove deprecated hibernation test modes
    PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
    ...

    Conflicts:
    kernel/kmod.c

    Rafael J. Wysocki
     

22 Dec, 2011

1 commit

  • * master: (848 commits)
    SELinux: Fix RCU deref check warning in sel_netport_insert()
    binary_sysctl(): fix memory leak
    mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
    ipmi_watchdog: restore settings when BMC reset
    oom: fix integer overflow of points in oom_badness
    memcg: keep root group unchanged if creation fails
    nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
    nilfs2: unbreak compat ioctl
    cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
    evm: prevent racing during tfm allocation
    evm: key must be set once during initialization
    mmc: vub300: fix type of firmware_rom_wait_states module parameter
    Revert "mmc: enable runtime PM by default"
    mmc: sdhci: remove "state" argument from sdhci_suspend_host
    x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
    IB/qib: Correct sense on freectxts increment and decrement
    RDMA/cma: Verify private data length
    cgroups: fix a css_set not found bug in cgroup_attach_proc
    oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
    Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
    ...

    Conflicts:
    kernel/cgroup_freezer.c

    Rafael J. Wysocki
     

18 Dec, 2011

2 commits

  • The current livelock avoidance code makes background work include only inodes
    that were dirtied before background writeback started. However, background
    writeback can run for a long time, and excluding newly dirtied inodes can
    eventually exclude a significant portion of the dirty inodes, making
    background writeback inefficient. Since background writeback avoids
    livelocking the flusher thread by yielding to any other work, there is no real
    reason why background work should not include all dirty inodes, so change the
    logic in wb_writeback().

    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
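
    In wb_writeback() this amounts to refreshing the cut-off time on every loop
    iteration for background work, roughly (a sketch, not the verbatim diff):

        if (work->for_kupdate) {
                oldest_jif = jiffies -
                        msecs_to_jiffies(dirty_expire_interval * 10);
        } else if (work->for_background) {
                /* Re-evaluate every iteration so that inodes dirtied after
                 * background writeback started are included as well. */
                oldest_jif = jiffies;
        }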
     
  • This makes the binary trace understandable by trace-cmd.

    CC: Dave Chinner
    CC: Curt Wohlgemuth
    CC: Steven Rostedt
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

22 Nov, 2011

1 commit

  • Writeback and thinkpad_acpi have been using thaw_process() to prevent
    deadlock between the freezer and kthread_stop(); unfortunately, this
    is inherently racy - nothing prevents freezing from happening between
    thaw_process() and kthread_stop().

    This patch implements kthread_freezable_should_stop() which enters
    refrigerator if necessary but is guaranteed to return if
    kthread_stop() is invoked. Both thaw_process() users are converted to
    use the new function.

    Note that this deadlock condition exists for many freezable kthreads. They
    need to be converted to use either the new should_stop helper or a freezable
    workqueue.

    Tested with synthetic test case.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Cc: Jens Axboe
    Cc: Oleg Nesterov

    Tejun Heo
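
    Typical usage in a freezable kthread looks roughly like this (a sketch of the
    calling convention, not code from the patch; the thread name is made up):

        static int my_flusher_thread(void *data)
        {
                bool was_frozen;

                set_freezable();
                while (!kthread_freezable_should_stop(&was_frozen)) {
                        if (was_frozen)
                                continue;       /* re-check state after a freeze */

                        /* ... do one round of work, then sleep ... */
                }
                return 0;
        }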
     

31 Oct, 2011

2 commits

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_stats parameter added to them, so callers can specify
    why writeback is being started.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
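
    The enumeration looks roughly like this (values abridged; see writeback.h for
    the full list), and callers pass the reason when kicking writeback:

        enum wb_reason {
                WB_REASON_BACKGROUND,
                WB_REASON_TRY_TO_FREE_PAGES,
                WB_REASON_SYNC,
                WB_REASON_PERIODIC,
                WB_REASON_LAPTOP_TIMER,
                /* ... */
        };

        /* e.g. the sync path announces itself: */
        writeback_inodes_sb(sb, WB_REASON_SYNC);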
     
  • Instead of sending ->older_than_this to queue_io() and
    move_expired_inodes(), send the entire wb_writeback_work
    structure. There are other fields of a work item that are
    useful in these routines and in tracepoints.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

03 Oct, 2011

2 commits

  • One thing that puzzled me is that in the JBOD case, the per-disk writeout
    performance is lower than in the corresponding single-disk case even when
    they have comparable bdi_thresh. Tracing shows that in the single-disk case,
    bdi_writeback is always kept high, while in the JBOD case it could drop low
    from time to time, and correspondingly bdi_reclaimable could sometimes rush
    high.

    The fix is to watch bdi_reclaimable and kick background writeback as
    soon as it goes high. This resembles the global background threshold
    but in per-bdi manner. The trick is, as long as bdi_reclaimable does
    not go high, bdi_writeback naturally won't go low because
    bdi_reclaimable+bdi_writeback ~= bdi_thresh.

    With less fluctuated writeback pages, JBOD performance is observed to
    increase noticeably in various cases.

    vmstat:nr_written values before/after patch:

    before (3.1.0-rc4-wo-underrun+)   change    after (3.1.0-rc4-bgthresh3+)   test case
    -------------------------------   ------    ----------------------------   ---------
    125596480 +25.9% 158179363 JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
    61790815 +110.4% 130032231 JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
    58853546 -0.1% 58823828 JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
    110159811 +24.7% 137355377 JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
    69544762 +10.8% 77080047 JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
    50644862 +0.5% 50890006 JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
    42677090 +28.0% 54643527 JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
    47491324 +13.3% 53785605 JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
    52548986 +0.9% 53001031 JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
    26783091 +36.8% 36650248 JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
    35526347 +14.0% 40492312 JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
    44670723 -1.1% 44177606 JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
    127996037 +22.4% 156719990 JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
    57518856 +3.8% 59677625 JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
    51919909 +12.2% 58269894 JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
    86410514 +79.0% 154660433 JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
    40132519 +38.6% 55617893 JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
    48423248 +7.5% 52042927 JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
    206041046 +44.1% 296846536 JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
    72312903 -19.4% 58272885 JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
    50635672 -0.5% 50384787 JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
    68308534 +115.7% 147324758 JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
    57882933 +14.5% 66269621 JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
    52183472 +12.8% 58855181 JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
    53788956 +94.2% 104460352 JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
    44493342 +35.5% 60298210 JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
    42641209 +18.9% 50681038 JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X

    Signed-off-by: Wu Fengguang

    Wu Fengguang
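
    The per-bdi check ends up next to the existing global one, roughly as below
    (a sketch of the background-threshold check described above, using field and
    helper names of that era; details may differ from the real patch):

        static bool over_bground_thresh(struct backing_dev_info *bdi)
        {
                unsigned long background_thresh, dirty_thresh;

                global_dirty_limits(&background_thresh, &dirty_thresh);

                /* Global background threshold, as before. */
                if (global_page_state(NR_FILE_DIRTY) +
                    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
                        return true;

                /* New: kick writeback when this bdi's reclaimable pages alone
                 * exceed its share of the background threshold. */
                if (bdi_stat(bdi, BDI_RECLAIMABLE) >
                    bdi_dirty_limit(bdi, background_thresh))
                        return true;

                return false;
        }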
     
  • No behavior change.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

31 Jul, 2011

1 commit

  • This fixes a soft lockup under the following conditions:

    a) the flusher is working on a work by __bdi_start_writeback(), while

    b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
    grabs sb->s_umount and enqueues a new work item for the flusher to execute

    The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
    Then if the inode is requeued, wb_writeback() will busy-retry on it.
    As a result, wb_writeback() loops forever without releasing
    wb->list_lock, which further blocks other tasks.

    Fix the busy loop by redirtying the inode. This may undesirably delay
    the writeback of the inode, however most likely it will be picked up
    soon by the queued work by writeback_inodes_sb*(), sync_inodes_sb() or
    even writeback_inodes_wb().

    bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html

    Reported-by: Christoph Hellwig
    Tested-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
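
    The fix, roughly, is a redirty_tail() in the branch where the superblock
    cannot be pinned (a sketch of writeback_sb_inodes(), not the verbatim diff):

        if (!grab_super_passive(sb)) {
                /*
                 * grab_super_passive() may fail consistently while s_umount is
                 * held by the sync_inodes_sb()/writeback_inodes_sb*() caller.
                 * Using requeue_io() here would busy-retry the inode forever
                 * with wb->list_lock held, so redirty it and move on instead.
                 */
                redirty_tail(inode, wb);
                continue;
        }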
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

24 Jul, 2011

1 commit

  • Fix a system hang bug introduced by commit b7a2441f9966 ("writeback:
    remove writeback_control.more_io") and e8dfc3058 ("writeback: elevate
    queue_io() into wb_writeback()") easily reproducible with high memory
    pressure and lots of file creation/deletions, for example, a kernel
    build in limited memory.

    It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE
    state: the flusher gets stuck busy-retrying that inode, never
    releasing wb->list_lock. The lock in turn blocks all kinds of other
    tasks when they try to grab it.

    As put by Jan, it's a safe change regarding data integrity. I_FREEING or
    I_WILL_FREE inodes are written back by iput_final() and it is reclaim
    code that is responsible for eventually removing them. So writeback code
    can safely ignore them. I_NEW inodes should move out of this state when
    they are fully set up and in the writeback round following that, we will
    consider them for writeback. So the change makes sense.

    CC: Jan Kara
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Signed-off-by: Wu Fengguang

    Wu Fengguang
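
    The fix is essentially an early skip for those inode states in
    writeback_sb_inodes(), roughly (a sketch, not the verbatim diff):

        spin_lock(&inode->i_lock);
        if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                spin_unlock(&inode->i_lock);
                /* Don't requeue_io(): that would busy-retry the inode while
                 * holding wb->list_lock. Redirty it and look at it again in
                 * a later writeback round. */
                redirty_tail(inode, wb);
                continue;
        }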
     

20 Jul, 2011

1 commit

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
    rename it to grab_super_passive() and export it via fs/internal.h
    so that all the VFS code can use it.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
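
    The helper's shape is roughly the following (a sketch; the exact checks
    differ between kernel versions):

        bool grab_super_passive(struct super_block *sb)
        {
                /* Take a passive reference so the sb cannot go away. */
                spin_lock(&sb_lock);
                if (list_empty(&sb->s_instances)) {
                        spin_unlock(&sb_lock);
                        return false;
                }
                sb->s_count++;
                spin_unlock(&sb_lock);

                /* Only proceed if no unmount is in progress. */
                if (down_read_trylock(&sb->s_umount)) {
                        if (sb->s_root)
                                return true;
                        up_read(&sb->s_umount);
                }

                put_super(sb);
                return false;
        }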
     

10 Jul, 2011

2 commits

  • Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
    concern of not holding I_SYNC for too long. (At least, that was the
    comment previously.) This doesn't make sense now because the only
    time we wait for I_SYNC is if we are calling sync or fsync, and in
    that case we need to write out all of the data anyway. Previously
    there may have been other code paths that waited on I_SYNC, but not
    any more. -- Theodore Ts'o

    So remove the MAX_WRITEBACK_PAGES constraint. The write chunk size will
    adapt to as much as the storage device can write within 500ms.

    XFS is observed to do IO completions in batches, and the batch size is
    equal to the write chunk size. To avoid dirty pages suddenly dropping out
    of balance_dirty_pages()'s dirty control scope and creating large
    fluctuations, the chunk size is also limited to half the control scope.

    The balance_dirty_pages() control scope is

    [(background_thresh + dirty_thresh) / 2, dirty_thresh]

    which is by default [15%, 20%] of global dirty pages, whose range size
    is dirty_thresh / DIRTY_FULL_SCOPE.

    The adaptive write chunk size will be rounded to the nearest 4MB
    boundary.

    http://bugzilla.kernel.org/show_bug.cgi?id=13930

    CC: Theodore Ts'o
    CC: Dave Chinner
    CC: Chris Mason
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
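
    The calculation described above roughly becomes the following (a sketch,
    assuming a helper like writeback_chunk_size(); constants and field names as
    of that era, details may differ from the real patch):

        static long writeback_chunk_size(struct backing_dev_info *bdi,
                                         struct wb_writeback_work *work)
        {
                long pages;

                if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
                        /* Integrity writeback must write out everything. */
                        pages = LONG_MAX;
                } else {
                        /* What the device can write in ~500ms (half the
                         * per-second bandwidth), but no more than half the
                         * dirty-control scope ... */
                        pages = min(bdi->avg_write_bandwidth / 2,
                                    global_dirty_limit / DIRTY_SCOPE);
                        /* ... capped at the work's budget ... */
                        pages = min(pages, work->nr_pages);
                        /* ... and rounded to a 4MB (MIN_WRITEBACK_PAGES) boundary. */
                        pages = round_down(pages + MIN_WRITEBACK_PAGES,
                                           MIN_WRITEBACK_PAGES);
                }

                return pages;
        }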
     
  • The start of a heavyweight application (i.e. KVM) may instantly knock
    down determine_dirtyable_memory() if the swap is not enabled or is full.
    global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
    dirty thresholds that are _much_ lower than the global/bdi dirty pages.

    balance_dirty_pages() will then heavily throttle all dirtiers including
    the light ones, until the dirty pages drop below the new dirty thresholds.
    During this _deep_ dirty-exceeded state, the system may appear rather
    unresponsive to the users.

    About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
    threshold to heavy dirtiers than light ones, and the dirty pages will
    be throttled around the heavy dirtiers' dirty threshold and reasonably
    below the light dirtiers' dirty threshold. In this state, only the heavy
    dirtiers will be throttled and the dirty pages are carefully controlled
    to not exceed the light dirtiers' dirty threshold. However if the
    threshold itself suddenly drops below the number of dirty pages, the
    light dirtiers will get heavily throttled.

    So introduce global_dirty_limit for tracking the global dirty threshold
    with policies

    - follow downwards slowly
    - follow up in one shot

    global_dirty_limit can effectively mask out the impact of a sudden drop in
    dirtyable memory. It will be used in the next patch for two new types of
    dirty limits. Note that the new dirty limits are not going to avoid
    throttling the light dirtiers, but could limit their sleep time to 200ms.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
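
    The two policies translate into an update step of roughly this shape (a
    sketch of the tracking logic, assuming a helper like update_dirty_limit();
    not the verbatim patch):

        /* Called whenever the dirty thresholds are recomputed. */
        static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
        {
                unsigned long limit = global_dirty_limit;

                /* Follow up in one shot. */
                if (limit < thresh) {
                        limit = thresh;
                        goto update;
                }

                /* Follow down slowly, and stay above the number of dirty pages. */
                thresh = max(thresh, dirty);
                if (limit > thresh) {
                        limit -= (limit - thresh) >> 5;
                        goto update;
                }
                return;
        update:
                global_dirty_limit = limit;
        }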