09 Oct, 2009

1 commit

  • It makes sense to do IOWAIT when someone is blocked
    due to IO throttle, as suggested by Kame and Peter.

    There is an old comment saying not to do IOWAIT on throttle;
    however, it has not matched the code for a long time.

    If we stop accounting IOWAIT in 2.6.32, it could be an
    undesirable behavior change. So restore the io_schedule() call.
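
    A minimal sketch of the distinction being restored (illustrative,
    not the exact throttle path):

        /* plain schedule(): the sleep is not accounted as iowait */
        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule();

        /* io_schedule(): bumps the runqueue's nr_iowait across the
         * sleep, so the time shows up as iowait ("wa" in vmstat) */
        set_current_state(TASK_UNINTERRUPTIBLE);
        io_schedule();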

    CC: KAMEZAWA Hiroyuki
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     

26 Sep, 2009

5 commits

  • Sometimes we only want to write pages from a specific super_block,
    so allow that to be passed in.

    This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
    causing writeback on all super_blocks on a bdi, where we only really
    want to sync a specific sb from writeback_inodes_sb().
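
    A sketch of the kind of interface change described (signatures
    simplified and hedged; treat as illustrative):

        /* before: writes back all super_blocks on the bdi */
        void bdi_start_writeback(struct backing_dev_info *bdi,
                                 long nr_pages);

        /* after: an optional sb restricts writeback to that
         * super_block; NULL keeps the old behaviour */
        void bdi_start_writeback(struct backing_dev_info *bdi,
                                 struct super_block *sb,
                                 long nr_pages);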

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid of incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Treat bdi_start_writeback(0) as a special request to do background write,
    and stop such work when we are below the background dirty threshold.

    Also simplify the (nr_pages <= 0) checks.
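
    A sketch of the special-casing described (hedged reconstruction of
    the wb_writeback() loop of that era):

        /* inside the wb_writeback() loop: */
        if (args->nr_pages <= 0)
                break;          /* requested pages are written */
        if (args->for_background && !over_bground_thresh())
                break;          /* below the background threshold */
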
    CC: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     
  • Some filesystems may choose to write much more than ratelimit_pages
    before calling balance_dirty_pages_ratelimited_nr(). So it is safer
    to determine the number to write based on the real number of
    dirtied pages.

    Otherwise it is possible that

        loop {
            btrfs_file_write():    dirty 1024 pages
            balance_dirty_pages(): write up to 48 pages
                                   (= ratelimit_pages * 1.5)
        }

    in which the writeback rate cannot keep up with the dirty rate, and
    the dirty pages go all the way beyond dirty_thresh.

    The increased write_chunk may make the dirtier more bumpy.
    So filesystems shall take care not to dirty too much at
    a time (eg. > 4MB) without checking the ratelimit.
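
    A sketch of the calling pattern implied above (hedged; helper name
    and signature are from the era's API as I recall it):

        /* tell the ratelimit logic how many pages were really
         * dirtied, instead of assuming one page per call */
        for (i = 0; i < nr_dirtied; i++)
                set_page_dirty(pages[i]);
        balance_dirty_pages_ratelimited_nr(mapping, nr_dirtied);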

    Signed-off-by: Wu Fengguang
    Acked-by: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Wu Fengguang
     

24 Sep, 2009

2 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.
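
    A sketch of the signature change described (prototypes hedged from
    memory of that interface):

        /* before: */
        int proc_dointvec(struct ctl_table *table, int write,
                          struct file *filp, void __user *buffer,
                          size_t *lenp, loff_t *ppos);

        /* after: the unused struct file * is gone */
        int proc_dointvec(struct ctl_table *table, int write,
                          void __user *buffer, size_t *lenp,
                          loff_t *ppos);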

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

1 commit

  • global_lru_pages() / zone_lru_pages() can be used in two ways:
    - to estimate max reclaimable pages in determine_dirtyable_memory()
    - to calculate the slab scan ratio

    When swap is full or not present, the anon lru lists are not
    reclaimable and also won't be scanned. So the anon pages shall not
    be counted in either usage scenario. Also rename to
    _reclaimable_pages: now they are counting the possibly reclaimable
    lru pages.

    It can greatly (and correctly) increase the slab scan rate under high
    memory pressure (when most file pages have been reclaimed and swap is
    full/absent), thus reduce false OOM kills.
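
    A sketch of the counting rule described (hedged reconstruction):

        unsigned long zone_reclaimable_pages(struct zone *zone)
        {
                unsigned long nr;

                nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                     zone_page_state(zone, NR_INACTIVE_FILE);

                /* anon pages are only reclaimable if swap
                 * space is available */
                if (nr_swap_pages > 0)
                        nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                              zone_page_state(zone, NR_INACTIVE_ANON);

                return nr;
        }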

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Wu Fengguang
    Acked-by: Johannes Weiner
    Reviewed-by: Jesse Barnes
    Cc: David Howells
    Cc: "Li, Ming Chun"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

16 Sep, 2009

5 commits

  • bdi_start_writeback() is currently split into two paths, one for
    WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
    for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
    only WB_SYNC_NONE.

    Push down the writeback_control allocation and only accept the
    parameters that make sense for each function. This cleans up
    the API considerably.
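
    A sketch of the resulting split (prototypes simplified and hedged):

        /* WB_SYNC_ALL: data integrity writeback, waits for completion */
        static void bdi_sync_writeback(struct backing_dev_info *bdi,
                                       struct super_block *sb);

        /* WB_SYNC_NONE: opportunistic writeback, fire and forget */
        void bdi_start_writeback(struct backing_dev_info *bdi,
                                 long nr_pages);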

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Now that bdi_writeback_all() no longer handles integrity writeback,
    it doesn't have to block anymore. This means that we can switch
    bdi_list reader side protection to RCU.
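
    The reader side then looks roughly like this (hedged sketch):

        rcu_read_lock();
        list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
                if (!bdi_has_dirty_io(bdi))
                        continue;
                /* kick off WB_SYNC_NONE writeback for this bdi */
        }
        rcu_read_unlock();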

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's only set, it's never checked. Kill it.

    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The dirtying of the page and the set_page_dirty() call can both be
    moved inside the page lock.

    - In shmem_write_end(), the page data is dirtied while the page
    lock is held, but the page is marked dirty just after dropping the
    page lock.
    - In shmem_symlink(), both the dirtying and the marking can be
    moved inside the page lock.

    It's valuable for the hwpoison code to know whether one bad page
    can be dropped without losing data. It mainly judges by testing the
    PG_dirty bit after taking the page lock. So it becomes important
    that the dirtying of the page and the marking of dirtiness are both
    done inside the page lock. This is common practice, but sadly not
    a rule.

    The noticeable exceptions are
    - mapped pages
    - pages with buffer_heads
    The above pages could go dirty at any time. Fortunately the
    hwpoison code will unmap the page and release the buffer_heads
    beforehand anyway.

    Many other types of pages (eg. metadata pages) can also be dirtied
    at will by their owners; the hwpoison code cannot do meaningful
    things to them anyway. Only the dirtiness of pagecache pages owned
    by regular files is of interest.
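
    A sketch of the reordering in shmem_write_end() (hedged):

        /* before: dirtied under the lock, marked dirty after unlock */
        unlock_page(page);
        set_page_dirty(page);
        page_cache_release(page);

        /* after: PG_dirty is set while the page is still locked */
        set_page_dirty(page);
        unlock_page(page);
        page_cache_release(page);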

    v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)

    Acked-by: Hugh Dickins
    Reviewed-by: WANG Cong
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
    powerpc64: convert to dynamic percpu allocator
    sparc64: use embedding percpu first chunk allocator
    percpu: kill lpage first chunk allocator
    x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
    percpu: update embedding first chunk allocator to handle sparse units
    percpu: use group information to allocate vmap areas sparsely
    vmalloc: implement pcpu_get_vm_areas()
    vmalloc: separate out insert_vmalloc_vm()
    percpu: add chunk->base_addr
    percpu: add pcpu_unit_offsets[]
    percpu: introduce pcpu_alloc_info and pcpu_group_info
    percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
    percpu: add @align to pcpu_fc_alloc_fn_t
    percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
    percpu: drop @static_size from first chunk allocators
    percpu: generalize first chunk allocator selection
    percpu: build first chunk allocators selectively
    percpu: rename 4k first chunk allocator to page
    percpu: improve boot messages
    percpu: fix pcpu_reclaim() locking
    ...

    Fix trivial conflict in kernel/sched.c, as done by Tejun Heo.

    Linus Torvalds
     

11 Sep, 2009

2 commits

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT more smooth in
    vmstat:

    r b swpd   free buff  cache si so bi     bo  in  cs us sy id wa
    0 1    0 608848 2652 375372  0  0  0  71024 604  24  1 10 48 42
    0 1    0 549644 2712 433736  0  0  0  60692 505  27  1  8 48 44
    1 0    0 476928 2784 505192  0  0  4  29540 553  24  0  9 53 37
    0 1    0 457972 2808 524008  0  0  0  54876 331  16  0  4 38 58
    0 1    0 366128 2928 614284  0  0  4  92168 710  58  0 13 53 34
    0 1    0 295092 3000 684140  0  0  0  62924 572  23  0  9 53 37
    0 1    0 236592 3064 741704  0  0  4  58256 523  17  0  8 48 44
    0 1    0 165608 3132 811464  0  0  0  57460 560  21  0  8 54 38
    0 1    0 102952 3200 873164  0  0  4  74748 540  29  1 10 48 41
    0 1    0  48604 3252 926472  0  0  0  53248 469  29  0  7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd   free buff  cache si so bi     bo  in  cs us sy id wa
    1 1    0 678716 5792 303380  0  0  0  74064 565  50  1 11 52 36
    1 0    0 662488 5864 319396  0  0  4    352 302 329  0  2 47 51
    0 1    0 599312 5924 381468  0  0  0  78164 516  55  0  9 51 40
    0 1    0 519952 6008 459516  0  0  4  78156 622  56  1 11 52 37
    1 1    0 436640 6092 541632  0  0  0  82244 622  54  0 11 48 41
    0 1    0 436640 6092 541660  0  0  0      8 152  39  0  0 51 49
    0 1    0 332224 6200 644252  0  0  4 102800 728  46  1 13 49 36
    1 0    0 274492 6260 701056  0  0  4  12328 459  49  0  7 50 43
    0 1    0 211220 6324 763356  0  0  0 106940 515  37  1 10 51 39
    1 0    0 160412 6376 813468  0  0  0   8224 415  43  0  6 49 45
    1 1    0  85980 6452 886556  0  0  4 113516 575  39  1 11 54 34
    0 2    0  85968 6452 886620  0  0  0   1640 158 211  0  0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as do random
    mmap'ed writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a first step at introducing per-bdi flusher threads. We should
    have no change in behaviour, although sb_has_dirty_inodes() is now
    ridiculously expensive, as there's no easy way to answer that question.
    Not a huge problem, since it'll be deleted in subsequent patches.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Aug, 2009

1 commit

  • Conflicts:
    arch/sparc/kernel/smp_64.c
    arch/x86/kernel/cpu/perf_counter.c
    arch/x86/kernel/setup_percpu.c
    drivers/cpufreq/cpufreq_ondemand.c
    mm/percpu.c

    Conflicts in the core and arch percpu code are mostly from commit
    ed78e1e078dd44249f88b1dd8c76dafb39567161, which replaced many
    num_possible_cpus() calls with nr_cpu_ids. As the for-next branch
    has moved all the first chunk allocators into mm/percpu.c, the
    changes are moved from arch code to mm/percpu.c.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

04 Jul, 2009

1 commit

  • Pull linus#master to merge the PER_CPU_DEF_ATTRIBUTES and alpha
    build fix changes. As alpha in the percpu tree uses the 'weak'
    attribute instead of inline assembly, there's no need for the
    __used attribute.

    Conflicts:
    arch/alpha/include/asm/percpu.h
    arch/mn10300/kernel/vmlinux.lds.S
    include/linux/percpu-defs.h

    Tejun Heo
     

01 Jul, 2009

1 commit

  • balance_dirty_pages can overreact and move all of the dirty pages to
    writeback unnecessarily.

    balance_dirty_pages makes its decision to throttle based on the
    number of dirty plus writeback pages that are over the calculated
    limit, so it will continue to move pages even when there are plenty
    of pages in writeback and fewer than the threshold still dirty.

    This allows it to overshoot its limits and move all the dirty pages to
    writeback while waiting for the drives to catch up and empty the writeback
    list.

    A simple fio test easily demonstrates this problem.

    fio --name=f1 --directory=/disk1 --size=2G --rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10

    This is the simplest fix I could find, but I'm not entirely sure that it
    alone will be enough for all cases. But it certainly is an improvement on
    my desktop machine writing to 2 disks.

    Do we need something more for machines with large arrays, where
    bdi_threshold * number_of_drives is greater than the dirty_ratio?
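
    One possible shape of such a fix (illustrative only, not
    necessarily the committed patch): stop throttling once the pages
    that are still dirty, excluding those already queued for writeback,
    drop back under the threshold:

        if (nr_reclaimable <= dirty_thresh)
                break;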

    Signed-off-by: Richard Kennedy
    Acked-by: Peter Zijlstra
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     

24 Jun, 2009

1 commit

  • Percpu variable definition is about to be updated such that all percpu
    symbols including the static ones must be unique. Update percpu
    variable definitions accordingly.

    * as,cfq: rename ioc_count uniquely

    * cpufreq: rename cpu_dbs_info uniquely

    * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it

    * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
    rename it (sketched below the list)

    * ipv4,6: rename cookie_scratch uniquely

    * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
    pmc_irq_entry and nmi_entry to pmc_nmi_entry

    * perf_counter: rename disable_count to perf_disable_count

    * ftrace: rename test_event_disable to ftrace_test_event_disable

    * kmemleak: rename test_pointer to kmemleak_test_pointer

    * mce: rename next_interval to mce_next_interval

    [ Impact: percpu usage cleanups, no duplicate static percpu var names ]
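
    A sketch of the mm/ rename mentioned in the list above (hedged
    reconstruction): the static percpu counter moves from function
    scope in balance_dirty_pages_ratelimited_nr() to file scope and
    gains a unique prefix:

        /* before (function-local): */
        static DEFINE_PER_CPU(unsigned long, ratelimits) = 0;

        /* after (file scope, unique name): */
        static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;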

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Cc: Ivan Kokshaysky
    Cc: Jens Axboe
    Cc: Dave Jones
    Cc: Jeremy Fitzhardinge
    Cc: linux-mm
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Li Zefan
    Cc: Catalin Marinas
    Cc: Andi Kleen

    Tejun Heo
     

17 Jun, 2009

1 commit

  • get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
    with variable pbdi_dirty as one of the arguments. This variable is an
    unsigned long * but both functions expect it to be a long *. This causes
    the following sparse warnings:

    warning: incorrect type in argument 3 (different signedness)
        expected long *pbdi_dirty
        got unsigned long *pbdi_dirty
    warning: incorrect type in argument 2 (different signedness)
        expected long *pdirty
        got unsigned long *pbdi_dirty

    Fix the warnings by changing the long * to unsigned long * in both
    functions.
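
    The resulting prototypes, roughly (hedged):

        static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
                                         unsigned long dirty,
                                         unsigned long *pbdi_dirty);
        static void task_dirty_limit(struct task_struct *tsk,
                                     unsigned long *pdirty);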

    Signed-off-by: H Hartley Sweeten
    Cc: Johannes Weiner
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     

18 May, 2009

1 commit

  • The wb_kupdate() function has a bug in linux-2.6.30-rc5. This bug
    causes generic_sync_sb_inodes() to start writing inodes back much
    earlier than expected, because oldest_jif is miscalculated in
    wb_kupdate().

    This bug was introduced in 704503d836042d4a4c7685b7036e7de0418fbc0f
    ('mm: fix proc_dointvec_userhz_jiffies "breakage"').
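
    My understanding of the miscalculation (hedged):
    dirty_expire_interval is kept in centiseconds, but was passed to
    msecs_to_jiffies() as if it were milliseconds, shrinking the expire
    window by a factor of 10:

        /* buggy: treats centiseconds as milliseconds */
        oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval);

        /* fixed: convert centiseconds to milliseconds first */
        oldest_jif = jiffies -
                        msecs_to_jiffies(dirty_expire_interval * 10);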

    Signed-off-by: Toshiyuki Okajima
    Cc: Alexey Dobriyan
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshiyuki Okajima
     

01 Apr, 2009

2 commits

  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838

    On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange
    way from the user's point of view:

    # echo 500 >/proc/sys/vm/dirty_writeback_centisecs
    # cat /proc/sys/vm/dirty_writeback_centisecs
    499

    So, we have 5000 jiffies converted to only 499 clock ticks and reported
    back.

    TICK_NSEC = 999848
    ACTHZ = 256039
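
    The arithmetic behind 5000 -> 499, as I read it (hedged):

        /* each jiffy is TICK_NSEC = 999848 ns, slightly under 1 ms:
         * 5000 jiffies = 5000 * 999848 ns ~= 4.99924 s,
         * which truncates to 499 ticks at USER_HZ = 100 */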

    Keeping the in-kernel variable in the units passed from userspace
    would fix the issue, of course, but that probably won't be right
    for every sysctl.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Add a helper function account_page_dirtied(). Use that from two
    callsites. reiser4 adds a function which adds a third callsite.
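
    A hedged reconstruction of what such a helper would look like:

        void account_page_dirtied(struct page *page,
                                  struct address_space *mapping)
        {
                if (mapping_cap_account_dirty(mapping)) {
                        __inc_zone_page_state(page, NR_FILE_DIRTY);
                        __inc_bdi_stat(mapping->backing_dev_info,
                                       BDI_RECLAIMABLE);
                        task_dirty_inc(current);
                        task_io_account_write(PAGE_CACHE_SIZE);
                }
        }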

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     

27 Mar, 2009

1 commit

  • Enlarge default dirty ratios from 5/10 to 10/20. This fixes [Bug
    #12809] iozone regression with 2.6.29-rc6.

    The iozone benchmarks are performed on a 1200M file, with 8GB ram.

    iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
    iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls

    The performance regression is triggered by commit 1cf6e7d83bf3
    ("mm: task dirty accounting fix"), which makes dirty accounting
    more correct and thorough.

    The default 5/10 dirty ratios were picked (a) with the old dirty logic
    and (b) largely at random and (c) designed to be aggressive. In
    particular, that (a) means that having fixed some of the dirty
    accounting, maybe the real bug is now that it was always too aggressive,
    just hidden by an accounting issue.

    The enlarged 10/20 dirty ratios are just about enough to fix the regression.
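
    In effect (hedged reconstruction of the change):

        int dirty_background_ratio = 10;        /* was 5 */
        int vm_dirty_ratio = 20;                /* was 10 */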

    [ We will have to look at how this affects the old fsync() latency issue,
    but that probably will need independent work. - Linus ]

    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Reported-by: "Lin, Ming M"
    Tested-by: "Lin, Ming M"
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

19 Feb, 2009

1 commit

  • YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
    cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

    Additionally, there is some inconsistency about when task_dirty_inc
    is called. It is used for dirty balancing; however, it even gets
    called for __set_page_dirty_no_writeback.

    So rather than increment it in a set_page_dirty wrapper, move it down to
    exactly where the dirty page accounting stats are incremented.
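
    For contrast, a hedged sketch of the no-writeback variant that was
    being over-counted: it only sets PG_dirty and does no accounting,
    so it should not bump the per-task dirty count:

        int __set_page_dirty_no_writeback(struct page *page)
        {
                if (!PageDirty(page))
                        SetPageDirty(page);
                return 0;
        }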

    Cc: YAMAMOTO Takashi
    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

13 Feb, 2009

1 commit

  • A bug was introduced into write_cache_pages cyclic writeout by commit
    31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
    fix"). The intention (and comments) is that we should cycle back and
    look for more dirty pages at the beginning of the file if there is no
    more work to be done.

    But the !done condition was dropped from the test. This means that any
    time the page writeout loop breaks (eg. due to nr_to_write == 0), we
    will set index to 0, then goto again. This will set done_index to
    index, then find done is set, so will proceed to the end of the
    function. When updating mapping->writeback_index for cyclic writeout,
    we now use done_index == 0, so we're always cycling back to 0.
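
    The fix, roughly (hedged): only wrap around when the scan truly ran
    out of work, i.e. restore the !done condition:

        if (!cycled && !done) {
                /* hit the last page with more work to do:
                 * wrap back to the start of the file */
                cycled = 1;
                index = 0;
                end = writeback_index - 1;
                goto retry;
        }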

    This seemed to be causing random mmap writes (slapadd and iozone) to
    start writing more pages from the LRU and writeout would slowdown, and
    caused bugzilla entry

    http://bugzilla.kernel.org/show_bug.cgi?id=12604

    about Berkeley DB slowing down dramatically.

    With this patch, iozone random write performance is increased nearly
    5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).

    Signed-off-by: Nick Piggin
    Reported-and-tested-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2009

2 commits

  • Commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa ("write-back: fix
    nr_to_write counter") fixed the nr_to_write counter but didn't set
    the break condition properly.

    If nr_to_write == 0 after being decremented it will loop one more time
    before setting done = 1 and breaking the loop.
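
    A hedged sketch of the corrected break condition:

        if (nr_to_write > 0) {
                nr_to_write--;
                if (nr_to_write == 0 &&
                    wbc->sync_mode == WB_SYNC_NONE) {
                        /* done our share of non-integrity writeout */
                        done = 1;
                        break;
                }
        }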

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Artem Bityutskiy
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Federico Cuello
     
  • We need to pass an unsigned long as the minimum, because it gets
    cast to an unsigned long in the sysctl handler. If we pass an int,
    we'll access four more bytes on 64bit arches, resulting in a random
    minimum value.
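
    A hedged sketch of the type fix (the initial value is
    illustrative):

        /* before: int min, read as unsigned long by the handler */
        static int dirty_bytes_min = PAGE_SIZE;

        /* after: matches proc_doulongvec_minmax()'s expectations */
        static unsigned long dirty_bytes_min = PAGE_SIZE;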

    [rientjes@google.com: fix type of `old_bytes']
    Signed-off-by: Sven Wegener
    Cc: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sven Wegener
     

04 Feb, 2009

1 commit

  • Commit 05fe478dd04e02fa230c305ab9b5616669821dd3 introduced some
    @wbc->nr_to_write breakage.

    It made the following changes:
    1. Decrement wbc->nr_to_write instead of nr_to_write
    2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
    3. If synced nr_to_write pages, stop only if wbc->sync_mode ==
    WB_SYNC_NONE, otherwise keep going.

    However, according to the commit message, the intention was to make
    only change 3. Change 1 is a bug. Change 2 does not seem to be
    necessary, breaks UBIFS expectations, and is not documented in the
    commit message; if needed, it should be done separately later.

    This patch does the following:
    1. Undo changes 1 and 2
    2. Add a comment explaining change 3 (it is very useful to have
    comments in _code_, not only in the commit message).

    Signed-off-by: Artem Bityutskiy
    Acked-by: Nick Piggin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     

07 Jan, 2009

7 commits

  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is
    written as 8192, 8K of memory is required to commence direct
    writeback. dirty_ratio is then functionally equivalent to 8K / the
    amount of dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.
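
    A hedged sketch of how the coupling can be implemented in the ratio
    handler (signature abbreviated to the later file-less form):

        int dirty_ratio_handler(struct ctl_table *table, int write,
                                void __user *buffer, size_t *lenp,
                                loff_t *ppos)
        {
                int old_ratio = vm_dirty_ratio;
                int ret = proc_dointvec_minmax(table, write, buffer,
                                               lenp, ppos);

                /* writing a ratio invalidates the bytes counterpart */
                if (ret == 0 && write && vm_dirty_ratio != old_ratio)
                        vm_dirty_bytes = 0;
                return ret;
        }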

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The background dirty and dirty limits are better defined with type
    specifiers of unsigned long since negative writeback thresholds are not
    possible.

    These values, as returned by get_dirty_limits(), are normally compared
    with ZVC values to determine whether writeback shall commence or be
    throttled. Such page counts cannot be negative, so declaring the page
    limits as signed is unnecessary.
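
    The resulting prototype, roughly (hedged):

        void get_dirty_limits(unsigned long *pbackground,
                              unsigned long *pdirty,
                              unsigned long *pbdi_dirty,
                              struct backing_dev_info *bdi);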

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Now that we have the early-termination logic in place, it makes sense to
    bail out early in all other cases where done is set to 1.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Terminate the write_cache_pages loop upon encountering the first page past
    end, without locking the page. Pages cannot have their index change when
    we have a reference on them (truncate, eg truncate_inode_pages_range
    performs the same check without the page lock).
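
    A hedged sketch of the check, done before lock_page():

        /* the page index cannot change while we hold a
         * reference, so this is safe without the page lock */
        if (page->index > end) {
                done = 1;
                break;
        }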

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if we get stuck behind another process that is
    cleaning pages, we will be forced to wait for them to finish, then perform
    our own writeout (if it was redirtied during the long wait), then wait for
    that.

    If a page under writeout is still clean, we can skip waiting for it (if
    we're part of a data integrity sync, we'll be waiting for all writeout
    pages afterwards, so we'll still be waiting for the other guy's write
    that's cleaned the page).
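
    A hedged sketch of the skip:

        if (PageWriteback(page)) {
                /* integrity sync must wait; opportunistic
                 * writeback can skip pages another thread is
                 * already cleaning */
                if (wbc->sync_mode != WB_SYNC_NONE)
                        wait_on_page_writeback(page);
                else
                        goto continue_unlock;
        }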

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Get rid of some complex expressions from flow control statements, add a
    comment, remove some duplicate code.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
    so the function will return success after writing out nr_to_write pages,
    even if that was not sufficient to guarantee data integrity.

    The callers tend to set it to values that could break data
    integrity semantics easily in practice. For example, nr_to_write
    can be set to mapping->nrpages * 2; however, if a file has a single
    dirty page and fsync is then called, subsequent pages might be
    concurrently added and dirtied, and write_cache_pages might write
    out two of these newly dirty pages while not writing out the old
    page that should have been written out.

    Fix this by ignoring nr_to_write if it is a data integrity sync.

    This is a data integrity bug.

    The reason this has been done in the past is to avoid stalling sync
    operations behind page dirtiers.

    "If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow."

    What we do today is return success after an arbitrary amount of pages are
    written, whether or not we have provided the data-integrity semantics that
    the caller has asked for. Even this doesn't actually fix all stall cases
    completely: in the above situation, if the file has a huge number of pages
    in pagecache (but not dirty), then mapping->nrpages is going to be huge,
    even if pages are being dirtied.

    This change does indeed make the possibility of long stalls larger,
    and that's not a good thing, but lying about data integrity is even
    worse. We have to either perform the sync, or return -ELINUXISLAME
    so at least the caller knows what has happened.

    There are subsequent competing approaches in the works to solve the stall
    problems properly, without compromising data integrity.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin