01 Apr, 2009

2 commits

  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838

    On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange
    way from the user's point of view:

    # echo 500 >/proc/sys/vm/dirty_writeback_centisecs
    # cat /proc/sys/vm/dirty_writeback_centisecs
    499

    So, we have 5000 jiffies converted to only 499 clock ticks and reported
    back.

    TICK_NSEC = 999848
    ACTHZ = 256039
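
    The arithmetic can be double-checked in userspace. A minimal sketch
    (not kernel code; it merely mimics the jiffies_to_clock_t()
    truncation, assuming HZ=1000, USER_HZ=100 and the TICK_NSEC quoted
    above):

    #include <stdio.h>

    int main(void)
    {
            unsigned long long tick_nsec = 999848;        /* ns per jiffy */
            unsigned long long user_tick_nsec = 10000000; /* ns per USER_HZ tick */
            unsigned long long ns = 5000 * tick_nsec;     /* 4999240000 ns */

            /* integer division truncates: 4999240000 / 10000000 = 499 */
            printf("%llu\n", ns / user_tick_nsec);
            return 0;
    }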

    Keeping the in-kernel variable in the units passed from userspace
    would of course fix the issue, but this probably won't be right for
    every sysctl.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Add a helper function, account_page_dirtied(), and use it from the
    two existing callsites. reiser4 adds a function which becomes a
    third callsite.
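
    Roughly the shape of the consolidated helper, paraphrasing the
    mm/page-writeback.c of that era (a sketch, not a verbatim copy):

    void account_page_dirtied(struct page *page, struct address_space *mapping)
    {
            if (mapping_cap_account_dirty(mapping)) {
                    /* all the stats a newly dirtied page must bump */
                    __inc_zone_page_state(page, NR_FILE_DIRTY);
                    __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
                    task_dirty_inc(current);
                    task_io_account_write(PAGE_CACHE_SIZE);
            }
    }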

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     

27 Mar, 2009

1 commit

  • Enlarge default dirty ratios from 5/10 to 10/20. This fixes [Bug
    #12809] iozone regression with 2.6.29-rc6.

    The iozone benchmarks are performed on a 1200M file, with 8GB RAM.

    iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
    iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls

    The performance regression is triggered by commit 1cf6e7d83bf3
    ("mm: task dirty accounting fix"), which makes dirty accounting more
    correct and thorough.

    The default 5/10 dirty ratios were picked (a) with the old dirty
    logic, (b) largely at random, and (c) designed to be aggressive. In
    particular, (a) means that now that some of the dirty accounting has
    been fixed, the real bug may be that the ratios were always too
    aggressive, just hidden by the accounting issue.

    The enlarged 10/20 dirty ratios are just about enough to fix the regression.

    [ We will have to look at how this affects the old fsync() latency issue,
    but that probably will need independent work. - Linus ]
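
    The change itself is just the two defaults in mm/page-writeback.c
    (a sketch):

    int dirty_background_ratio = 10;        /* was 5 */
    int vm_dirty_ratio = 20;                /* was 10 */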

    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Reported-by: "Lin, Ming M"
    Tested-by: "Lin, Ming M"
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

19 Feb, 2009

1 commit

  • YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called
    properly in cases where set_page_dirty is not used to dirty a page
    (e.g. mark_buffer_dirty).

    Additionally, there is some inconsistency about when task_dirty_inc is
    called. It is used for dirty balancing, however it even gets called for
    __set_page_dirty_no_writeback.

    So rather than increment it in a set_page_dirty wrapper, move it down to
    exactly where the dirty page accounting stats are incremented.

    Cc: YAMAMOTO Takashi
    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

13 Feb, 2009

1 commit

  • A bug was introduced into write_cache_pages cyclic writeout by commit
    31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
    fix"). The intention (and comments) is that we should cycle back and
    look for more dirty pages at the beginning of the file if there is no
    more work to be done.

    But the !done condition was dropped from the test. This means that
    any time the page writeout loop breaks (e.g. due to nr_to_write ==
    0), we set index to 0 and goto again. This sets done_index to index,
    then finds done set, and so proceeds to the end of the function.
    When updating mapping->writeback_index for cyclic writeout, we now
    use done_index == 0, so we are always cycling back to 0.

    This seemed to cause random mmap writes (slapadd and iozone) to
    start writing more pages from the LRU, writeout to slow down, and
    led to bugzilla entry

    http://bugzilla.kernel.org/show_bug.cgi?id=12604

    about Berkeley DB slowing down dramatically.

    With this patch, iozone random write performance is increased nearly
    5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
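
    The fix restores the !done test, so the cyclic retry looks roughly
    like this (a sketch of the write_cache_pages() of that era):

    if (!cycled && !done) {
            /*
             * range_cyclic: we hit the last page and there is more
             * work to be done: wrap back to the start of the file
             */
            cycled = 1;
            index = 0;
            end = writeback_index - 1;
            goto retry;
    }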

    Signed-off-by: Nick Piggin
    Reported-and-tested-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2009

2 commits

  • Commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa ("write-back: fix
    nr_to_write counter") fixed nr_to_write counter, but didn't set the break
    condition properly.

    If nr_to_write == 0 after being decremented, the loop runs one more
    time before setting done = 1 and breaking out.
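
    The fix checks the counter right where it is decremented (a sketch):

    if (nr_to_write > 0) {
            nr_to_write--;
            if (nr_to_write == 0 && wbc->sync_mode == WB_SYNC_NONE) {
                    /* best-effort writeback: stop as soon as the quota
                     * is used up; integrity syncs must keep going */
                    done = 1;
                    break;
            }
    }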

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Artem Bityutskiy
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Federico Cuello
     
  • We need to pass an unsigned long as the minimum, because it gets
    cast to an unsigned long in the sysctl handler. If we pass an int,
    we'll access four more bytes on 64-bit arches, resulting in a random
    minimum value.
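
    A sketch of the pattern (the variable name here is hypothetical; the
    point is the type): proc_doulongvec_minmax() dereferences .extra1 as
    an unsigned long, so the minimum must be declared as one.

    static unsigned long dirty_bytes_min = PAGE_SIZE; /* not "static int" */

    /* ... in the corresponding ctl_table entry ... */
    .proc_handler   = &proc_doulongvec_minmax,
    .extra1         = &dirty_bytes_min,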

    [rientjes@google.com: fix type of `old_bytes']
    Signed-off-by: Sven Wegener
    Cc: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sven Wegener
     

04 Feb, 2009

1 commit

  • Commit 05fe478dd04e02fa230c305ab9b5616669821dd3 introduced some
    @wbc->nr_to_write breakage.

    It made the following changes:
    1. Decrement wbc->nr_to_write instead of nr_to_write
    2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
    3. If synced nr_to_write pages, stop only if wbc->sync_mode ==
    WB_SYNC_NONE, otherwise keep going.

    However, according to the commit message, the intention was to only make
    change 3. Change 1 is a bug. Change 2 does not seem to be necessary,
    and it breaks UBIFS expectations, so if needed, it should be done
    separately later. And change 2 does not seem to be documented in the
    commit message.

    This patch does the following:
    1. Undo changes 1 and 2
    2. Add a comment explaining change 3 (it is very useful to have
    comments in _code_, not only in the commit).

    Signed-off-by: Artem Bityutskiy
    Acked-by: Nick Piggin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     

07 Jan, 2009

10 commits

  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is
    written as 8192, 8K of memory is required to commence direct
    writeback. dirty_ratio is then functionally equivalent to 8K divided
    by the amount of dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.
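
    In get_dirty_limits() terms, whichever knob is nonzero wins (a
    sketch under those assumptions, not verbatim kernel code):

    if (vm_dirty_bytes)
            dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);   /* pages */
    else
            dirty = (vm_dirty_ratio * available_memory) / 100;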

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The background dirty and dirty limits are better defined with type
    specifiers of unsigned long since negative writeback thresholds are not
    possible.

    These values, as returned by get_dirty_limits(), are normally compared
    with ZVC values to determine whether writeback shall commence or be
    throttled. Such page counts cannot be negative, so declaring the page
    limits as signed is unnecessary.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Now that we have the early-termination logic in place, it makes sense to
    bail out early in all other cases where done is set to 1.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Terminate the write_cache_pages loop upon encountering the first
    page past end, without locking the page. A page's index cannot
    change while we have a reference on it (truncate, e.g.
    truncate_inode_pages_range, performs the same check without the page
    lock).
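
    The check itself is then lock-free (a sketch):

    /* page->index is stable while we hold a reference: no lock needed */
    if (page->index > end) {
            done = 1;
            break;
    }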

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if we get stuck behind another process that is
    cleaning pages, we will be forced to wait for them to finish, then perform
    our own writeout (if it was redirtied during the long wait), then wait for
    that.

    If a page under writeout is still clean, we can skip waiting for it (if
    we're part of a data integrity sync, we'll be waiting for all writeout
    pages afterwards, so we'll still be waiting for the other guy's write
    that's cleaned the page).
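
    Sketched against the write_cache_pages() of that era, the skip looks
    like this:

    if (wbc->sync_mode != WB_SYNC_NONE)
            wait_on_page_writeback(page);   /* integrity sync: must wait */

    if (PageWriteback(page) || !clear_page_dirty_for_io(page)) {
            /* someone else is writing it, or it is already clean */
            unlock_page(page);
            continue;
    }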

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Get rid of some complex expressions from flow control statements, add a
    comment, remove some duplicate code.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
    so the function will return success after writing out nr_to_write pages,
    even if that was not sufficient to guarantee data integrity.

    The callers tend to set it to values that could break data
    integrity semantics easily in practice. For example, nr_to_write can
    be set to mapping->nrpages * 2; however, if a file has a single
    dirty page and fsync is then called, subsequent pages might be
    concurrently added and dirtied, and write_cache_pages might write
    out two of those newly dirty pages while not writing out the old
    page that should have been written out.

    Fix this by ignoring nr_to_write if it is a data integrity sync.

    This is a data integrity bug.

    The reason this has been done in the past is to avoid stalling sync
    operations behind page dirtiers.

    "If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow."

    What we do today is return success after an arbitrary amount of pages are
    written, whether or not we have provided the data-integrity semantics that
    the caller has asked for. Even this doesn't actually fix all stall cases
    completely: in the above situation, if the file has a huge number of pages
    in pagecache (but not dirty), then mapping->nrpages is going to be huge,
    even if pages are being dirtied.

    This change does indeed make the possibility of long stalls larger,
    and that's not a good thing, but lying about data integrity is even
    worse. We have to either perform the sync, or return -ELINUXISLAME
    so at least the caller knows what has happened.

    There are subsequent competing approaches in the works to solve the stall
    problems properly, without compromising data integrity.
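
    The gist of the fix (a sketch): only best-effort writeback honours
    the quota.

    if (wbc->sync_mode == WB_SYNC_NONE) {
            wbc->nr_to_write--;
            if (wbc->nr_to_write <= 0) {
                    done = 1;
                    break;
            }
    }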

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if ret signals a real error but we still have
    some pages left in the pagevec, done would be set to 1, yet the
    remaining pages would continue to be processed and ret would be
    overwritten in the process.

    It could easily be overwritten with success, and thus success would
    be returned even if there was an error. The caller would then be
    told all writes succeeded, whereas in reality some did not.

    Fix this by bailing immediately if there is an error, and retaining the
    first error code.

    This is a data integrity bug.
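
    Sketched against the loop of that era, the error path becomes:

    ret = (*writepage)(page, wbc, data);
    if (unlikely(ret)) {
            if (ret == AOP_WRITEPAGE_ACTIVATE) {
                    /* not a real error: the fs kept the page dirty */
                    unlock_page(page);
                    ret = 0;
            } else {
                    /* real error: keep this first error code and stop
                     * submitting any further pages */
                    done = 1;
                    break;
            }
    }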

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • We'd like to break out of the loop early in many situations, however the
    existing code has been setting mapping->writeback_index past the final
    page in the pagevec lookup for cyclic writeback. This is a problem if we
    don't process all pages up to the final page.

    Currently the code mostly keeps writeback_index reasonable, hacking
    around the problem by not breaking out of the loop and not writing
    pages outside the range in these cases. Keep track of a real "done
    index" that enables us to terminate the loop in a much more flexible
    manner.

    Needed by the subsequent patch to preserve writepage errors, and then
    further patches to break out of the loop early for other reasons. However
    there are no functional changes with this patch alone.
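
    A sketch of the bookkeeping: record how far we actually got, and use
    that, rather than the pagevec lookup position, for cyclic writeback.

    /* inside the loop, as each page is taken: */
    done_index = page->index + 1;

    /* after the loop, for cyclic writeback: */
    if (wbc->range_cyclic)
            mapping->writeback_index = done_index;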

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, scanned == 1 is supposed to mean that cyclic
    writeback has circled through zero, thus we should not circle again.
    However it gets set to 1 after the first successful pagevec lookup. This
    leads to cases where not enough data gets written.

    Counterexample: a file with its first 10 pages dirty,
    writeback_index == 5, nr_to_write == 10. The last 5 pages will be
    found, and scanned will be set to 1; after writing those out, we
    will not cycle back to get the first 5.

    Rework this logic: now we always cycle unless we started off from
    index 0. When cycling, only write out as far as one page before the
    start page of the first cycle (so we don't write parts of the file
    twice).

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

20 Oct, 2008

1 commit

  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Oct, 2008

1 commit


16 Oct, 2008

1 commit

  • If no_nrwrite_index_update is set, we don't update nr_to_write or
    the address space's writeback_index in write_cache_pages. This
    change enables a file system to skip these updates in
    write_cache_pages and do them in its writepages() callback. This
    patch will be followed by an ext4 patch that makes use of these new
    flags.
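
    The flag is a writeback_control bit (a sketch of the field this
    patch adds):

    struct writeback_control {
            /* ... existing fields ... */
            unsigned no_nrwrite_index_update:1; /* caller maintains
                                                   nr_to_write and
                                                   writeback_index */
    };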

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    CC: linux-fsdevel@vger.kernel.org

    Aneesh Kumar K.V
     

14 Oct, 2008

1 commit


27 Jul, 2008

1 commit

  • mapping->tree_lock has no read lockers. Convert the lock from an
    rwlock to a spinlock.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

15 Jul, 2008

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
    ext4: Documention update for new ordered mode and delayed allocation
    ext4: do not set extents feature from the kernel
    ext4: Don't allow nonextenst mount option for large filesystem
    ext4: Enable delalloc by default.
    ext4: delayed allocation i_blocks fix for stat
    ext4: fix delalloc i_disksize early update issue
    ext4: Handle page without buffers in ext4_*_writepage()
    ext4: Add ordered mode support for delalloc
    ext4: Invert lock ordering of page_lock and transaction start in delalloc
    mm: Add range_cont mode for writeback
    ext4: delayed allocation ENOSPC handling
    percpu_counter: new function percpu_counter_sum_and_set
    ext4: Add delayed allocation support in data=writeback mode
    vfs: add hooks for ext4's delayed allocation support
    jbd2: Remove data=ordered mode support using jbd buffer heads
    ext4: Use new framework for data=ordered mode in JBD2
    jbd2: Implement data=ordered mode handling via inodes
    vfs: export filemap_fdatawrite_range()
    ext4: Fix lock inversion in ext4_ext_truncate()
    ext4: Invert the locking order of page_lock and transaction start
    ...

    Linus Torvalds
     

12 Jul, 2008

1 commit

  • Filesystems like ext4 need to start a new transaction in writepages
    for block allocation. This happens with delayed allocation, and
    there is a limit to how many credits we can request from the journal
    layer. So we call write_cache_pages multiple times, with
    wbc->nr_to_write set to the maximum possible value limited by the
    max journal credits available.

    Add a new mode to writeback that enables us to handle this
    behaviour. In the new mode we update wbc->range_start to point to
    the new offset to be written; the next call to write_cache_pages
    will start writeout from the specified range_start offset. In the
    new mode we also limit writing to the specified wbc->range_end.
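
    At the end of write_cache_pages the new mode remembers where it
    stopped (a sketch):

    if (wbc->range_cont)
            wbc->range_start = (loff_t)index << PAGE_CACHE_SHIFT;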

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Acked-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     

24 May, 2008

1 commit

  • Currently there is no protection to stop the root user from using
    up all of memory for trace buffers. If the root user allocates too
    many entries, the OOM killer might start killing off all tasks.

    This patch adds an algorithm to check the following condition:

    pages_requested > (freeable_memory + current_trace_buffer_pages) / 4

    If the above condition is met, the allocation fails. This prevents
    more than 1/4th of freeable memory from being used by trace buffers.

    To determine the freeable_memory, I made determine_dirtyable_memory in
    mm/page-writeback.c global.

    Special thanks goes to Peter Zijlstra for suggesting the above calculation.
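
    A sketch of the check, with hypothetical names (the patch itself
    only names determine_dirtyable_memory()):

    static int trace_buffer_request_ok(unsigned long pages_requested,
                                       unsigned long current_pages)
    {
            unsigned long freeable = determine_dirtyable_memory();

            /* refuse more than 1/4 of freeable memory for trace buffers */
            return pages_requested <= (freeable + current_pages) / 4;
    }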

    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Steven Rostedt
     

30 Apr, 2008

6 commits

  • Fuse will use temporary buffers to write back dirty data from
    memory mappings (normal writes are done synchronously). This is
    needed because there can be no guarantee about the time in which a
    write will complete.

    By using temporary buffers, from the MM's point of view the page is
    written back immediately. If the writeout was due to memory
    pressure, this effectively migrates data from a full zone to a less
    full zone.

    This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
    as temporary buffers.

    [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Fuse needs this for writable mmap support.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline
    functions
    - create a flag that includes all three dirty/writeback related
    flags, since almost all users will want to have them together
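
    The combined flag is just the union of the three (a sketch, names as
    in the commit):

    #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
            (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)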

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in max_ratio_store().
    - export bdi_set_max_ratio() to modules
    - limit bdi_dirty with bdi->max_ratio
    - document new sysfs attribute

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Under normal circumstances each device is given a part of the total
    write-back cache that relates to its current average writeout speed
    in relation to the other devices.

    min_ratio - allows one to assign a minimum portion of the
    write-back cache to a particular device. This is useful in
    situations where you might want to provide a minimum QoS. (One
    request for this feature came from flash-based storage people who
    wanted to avoid writing out at all costs - they of course needed
    some pdflush hacks as well.)

    max_ratio - allows one to assign a maximum portion of the dirty
    limit to a particular device. This is useful in situations where you
    want to avoid one device taking all or most of the write-back cache,
    e.g. an NFS mount that is prone to get stuck, or a FUSE mount which
    you don't trust to play fair.

    Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in min_ratio_store()
    - document new sysfs attribute
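
    From kernel code the two knobs are driven by the exported setters (a
    usage sketch):

    /* guarantee this bdi at least 1%, and cap it at 10%, of the global
     * dirty threshold */
    err = bdi_set_min_ratio(bdi, 1);
    if (!err)
            err = bdi_set_max_ratio(bdi, 10);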

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
    This allows us to see and set the various BDI specific variables.

    In particular this properly exposes the read-ahead window for all
    relevant users, and /sys/block/<dev>/queue/read_ahead_kb should be
    deprecated.

    With patient help from Kay Sievers and Greg KH

    [mszeredi@suse.cz]

    - split off NFS and FUSE changes into separate patches
    - document new sysfs attributes under Documentation/ABI
    - do bdi_class_init as a core_initcall, otherwise the "default" BDI
    won't be initialized
    - remove bdi_init_fmt macro, it's not used very much

    [akpm@linux-foundation.org: fix ia64 warning]
    Signed-off-by: Peter Zijlstra
    Cc: Kay Sievers
    Acked-by: Greg KH
    Cc: Trond Myklebust
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

06 Feb, 2008

4 commits

  • After making a 100M file dirty, the normal behavior is to start
    writeback for all data after a 30s delay. But sometimes the
    following happens instead:

    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: all remaining 92M

    Some analysis shows that the internal io dispatch queues go like
    this:

         s_io       s_more_io
         ---------------------
      1) 100M,1K    0
      2) 1K         96M
      3) 0          96M

    1) initial state with a 100M file and a 1K file

    2) 4M written, nr_to_write 0, no more writes (BUG)

    nr_to_write > 0 in (3) fools the upper layer into thinking that all
    data have been written out. The big dirty file is actually still
    sitting in s_more_io. We cannot simply splice s_more_io back to s_io
    as soon as s_io becomes empty and let the loop in
    generic_sync_sb_inodes() continue: this may starve newly expired
    inodes in s_dirty. It is also not an option to draw inodes from both
    s_more_io and s_dirty and let the loop go on: this might lead to
    livelocks, and might also starve other superblocks in sync time
    (well, kupdate may still starve some superblocks, but that's another
    bug).

    We have to return when a full scan of s_io completes. So nr_to_write
    > 0 does not necessarily mean that "all data are written". This
    patch introduces a flag, writeback_control.more_io, to indicate that
    more io should be done. With it, the big dirty file no longer has to
    wait for the next kupdate invocation 5s later.

    In sync_sb_inodes() we only set more_io on super_blocks we actually
    visited. This avoids interaction between two pdflush daemons.

    Also in __sync_single_inode() we don't blindly keep requeuing the io if the
    filesystem cannot progress. Failing to do so may lead to 100% iowait.
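
    The more_io flag itself is raised when a full scan of s_io completes
    with work still queued (a sketch of the generic_sync_sb_inodes()
    tail of that era):

    if (!list_empty(&sb->s_more_io))
            wbc->more_io = 1;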

    Tested-by: Mike Snitzer
    Signed-off-by: Fengguang Wu
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • fastcall is always defined to be empty; remove it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Add vm.highmem_is_dirtyable toggle

    A 32-bit machine with HIGHMEM64 enabled running DCC has an MMAPed
    file of approximately 2GB which contains a hash format that is
    written randomly by the dbclean process. On 2.6.16 this process took
    a few minutes. With lowmem-only accounting of dirty ratios, it takes
    about 12 hours of 100% disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.
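
    With the toggle off (the default), highmem is subtracted back out of
    the dirtyable total (a sketch of determine_dirtyable_memory()):

    if (!vm_highmem_is_dirtyable)
            x -= highmem_dirtyable_memory(x);  /* lowmem-only accounting */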

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     
  • task_dirty_limit() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

15 Jan, 2008

1 commit

  • This reverts commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, as
    requested by Fengguang Wu. It's not quite fully baked yet, and while
    there are patches around to fix the problems it caused, they should get
    more testing. Says Fengguang: "I'll resend them both for -mm later on,
    in a more complete patchset".

    See

    http://bugzilla.kernel.org/show_bug.cgi?id=9738

    for some of this discussion.

    Requested-by: Fengguang Wu
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Nov, 2007

1 commit

  • This code harks back to the days when we didn't count dirty mapped
    pages, which led us to try to balance the number of dirty unmapped pages
    by how much unmapped memory there was in the system.

    That makes no sense any more, since now the dirty counts include the
    mapped pages. Not to mention that the math doesn't work with HIGHMEM
    machines anyway, and causes the unmapped_ratio to potentially turn
    negative (which we do catch thanks to clamping it at a minimum value,
    but I mention that as an indication of how broken the code is).

    The code also was written at a time when the default dirty ratio was
    much larger, and the unmapped_ratio logic effectively capped that large
    dirty ratio a bit. Again, we've since lowered the dirty ratio rather
    aggressively, further lessening the point of that code.

    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Nov, 2007

1 commit

  • We allow violation of bdi limits if there is a lot of room on the system.
    Once we hit half the total limit we start enforcing bdi limits and bdi
    ramp-up should happen. Doing it this way avoids many small writeouts on an
    otherwise idle system and should also speed up the ramp-up.
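
    The condition is a single early bail in balance_dirty_pages() (a
    sketch):

    /* don't enforce bdi limits while global dirty data is still below
     * half the total threshold; this lets the bdi limits ramp up */
    if (nr_reclaimable + global_page_state(NR_WRITEBACK) <
                    (background_thresh + dirty_thresh) / 2)
            break;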

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

20 Oct, 2007

1 commit