10 Aug, 2010

1 commit

  • We try to avoid livelocks of writeback when someone steadily creates dirty
    pages in a mapping we are writing out. For memory-cleaning writeback,
    using nr_to_write works reasonably well, but we cannot really use it for
    data integrity writeback. This patch tries to solve the problem.

    The idea is simple: Tag all pages that should be written back with a
    special tag (TOWRITE) in the radix tree. This can be done rather quickly
    and thus livelocks should not happen in practice. Then we start doing the
    hard work of locking pages and sending them to disk only for those pages
    that have the TOWRITE tag set.

    Note: Adding a new radix tree tag grows the radix tree node from 288 to
    296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
    However, the number of slab/slub items per page remains the same (13 and 7
    respectively).

    Signed-off-by: Jan Kara
    Cc: Dave Chinner
    Cc: Nick Piggin
    Cc: Chris Mason
    Cc: Theodore Ts'o
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
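
    A minimal standalone C model of the tag-then-write idea above. It is a
    sketch only: the struct, array and helper names here are illustrative
    stand-ins, not the kernel's radix-tree code. The point is that the cheap
    tagging pass snapshots the work, so pages dirtied afterwards cannot
    extend the data integrity scan.

        #include <stdbool.h>
        #include <stdio.h>

        #define NPAGES 8

        struct page_slot {
            bool dirty;
            bool towrite;   /* models the new TOWRITE radix-tree tag */
        };

        static struct page_slot mapping[NPAGES];

        /* Pass 1: cheaply tag every page that this sync must write. */
        static void tag_pages_for_writeback(void)
        {
            for (int i = 0; i < NPAGES; i++)
                if (mapping[i].dirty)
                    mapping[i].towrite = true;
        }

        /* Pass 2: do the expensive locking and I/O only for tagged pages. */
        static void write_tagged_pages(void)
        {
            for (int i = 0; i < NPAGES; i++) {
                if (!mapping[i].towrite)
                    continue;
                mapping[i].dirty = false;
                mapping[i].towrite = false;
                printf("wrote page %d\n", i);
            }
        }

        int main(void)
        {
            mapping[1].dirty = mapping[3].dirty = true;
            tag_pages_for_writeback();
            mapping[5].dirty = true;   /* dirtied concurrently: left for later */
            write_tagged_pages();
            return 0;
        }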
     

09 Jun, 2010

2 commits

  • sync can currently take a really long time if a concurrent writer is
    extending a file. The problem is that the dirty pages on the address
    space grow in the same direction as write_cache_pages scans, so if
    the writer keeps ahead of writeback, the writeback will not
    terminate until the writer stops adding dirty pages.

    For a data integrity sync, we only need to write the pages dirty at
    the time we start the writeback, so we can stop scanning once we get
    to the page that was at the end of the file at the time the scan
    started.

    This prevents operations like copying a large file from blocking sync
    indefinitely, since sync will not write back pages that were dirtied
    after it was started. This does not impact the existing integrity
    guarantees, as any dirty page (old or new) within the EOF range at the
    start of the scan will still be captured.

    This patch will not prevent sync from blocking on large writes into
    holes. That requires more complex intervention; this patch only
    addresses the common append case of this sync holdoff (a combined
    sketch of this and the next entry's fix follows after that entry).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Dave Chinner
     
  • If a filesystem writes more than one page in ->writepage, write_cache_pages
    fails to notice this and continues to attempt writeback when wbc->nr_to_write
    has gone negative - this trace was captured from XFS:

    wbc_writeback_start: towrt=1024
    wbc_writepage: towrt=1024
    wbc_writepage: towrt=0
    wbc_writepage: towrt=-1
    wbc_writepage: towrt=-5
    wbc_writepage: towrt=-21
    wbc_writepage: towrt=-85

    This has adverse effects on filesystem writeback behaviour. write_cache_pages()
    needs to terminate after a certain number of pages are written, not after a
    certain number of calls to ->writepage are made. This is a regression
    introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add
    no_nrwrite_index_update writeback control flag"), but cannot be reverted
    directly due to subsequent bug fixes that have gone in on top of it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Dave Chinner
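
    A standalone sketch combining the two fixes above in a toy
    write_cache_pages()-style scan; the struct and helper names are
    illustrative, not the kernel code. The scan end is snapshotted before
    the loop so a concurrent appender cannot extend a data integrity sync,
    and the loop terminates on pages actually written rather than on the
    number of ->writepage calls, so nr_to_write cannot go far negative when
    the filesystem clusters several pages per call.

        #include <stdio.h>

        struct toy_wbc { long nr_to_write; };

        /* Stand-in for ->writepage(): pretend the filesystem clustered and
         * wrote 4 pages, and account all of them against nr_to_write. */
        static long toy_writepage(long index, struct toy_wbc *wbc)
        {
            long written = 4;
            printf("writepage at %ld wrote %ld pages\n", index, written);
            wbc->nr_to_write -= written;
            return written;
        }

        static void toy_write_cache_pages(long file_pages, struct toy_wbc *wbc)
        {
            /* Snapshot the end of the file at scan start. */
            long end = file_pages - 1;

            for (long index = 0; index <= end; ) {
                index += toy_writepage(index, wbc);
                /* Stop once enough pages were written, however many
                 * ->writepage calls that took. */
                if (wbc->nr_to_write <= 0)
                    break;
            }
        }

        int main(void)
        {
            struct toy_wbc wbc = { .nr_to_write = 10 };
            toy_write_cache_pages(1024, &wbc);
            return 0;
        }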
     

22 May, 2010

3 commits

  • The laptop mode timer had the nr_pages and sb_locked arguments
    mixed up.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When CONFIG_BLOCK isn't enabled:

    mm/page-writeback.c: In function 'laptop_mode_timer_fn':
    mm/page-writeback.c:708: error: dereferencing pointer to incomplete type
    mm/page-writeback.c:709: error: dereferencing pointer to incomplete type

    Fix this by essentially eliminating the laptop sync handlers when
    CONFIG_BLOCK isn't set, as most are only used from the block layer code.
    The exception is laptop_sync_completion(), which is used from sys_sync();
    make that an empty declaration in that case (see the stub sketch after
    the last entry for this date).

    Reported-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Commit 69b62d01 fixed up most of the places where we would enter
    busy schedule() spins when disabling the periodic background
    writeback. This fixes up the sb timer so that it doesn't get
    hammered on with the delay disabled, and ensures that it gets
    rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
    gets modified.

    bdi_forker_task() also needs to check for !dirty_writeback_centisecs
    and use schedule() appropriately; fix that up too.

    Signed-off-by: Jens Axboe

    Jens Axboe
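
    A compile-time sketch of the stubbing pattern described two entries
    above (not the exact kernel header): with CONFIG_BLOCK set, the laptop
    mode handlers are declared normally; without it, only the call used
    from sys_sync() remains, as an empty static inline.

        #include <stdio.h>

        /* #define CONFIG_BLOCK 1 */     /* normally selected by Kconfig */

        #ifdef CONFIG_BLOCK
        /* Real declarations; the definitions live in the writeback code. */
        void laptop_io_completion(void);
        void laptop_sync_completion(void);
        #else
        /* No block layer: nothing to flush, keep sys_sync()'s call a no-op. */
        static inline void laptop_sync_completion(void) { }
        #endif

        int main(void)
        {
            laptop_sync_completion();   /* with the stub this is a no-op */
            printf("sync done\n");
            return 0;
        }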
     

17 May, 2010

1 commit

  • When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
    writeback to kick off writeback of pending dirty inodes, then follow
    that up with a WB_SYNC_ALL to wait for it. Since umount already holds
    the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
    writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
    since WB_SYNC_ALL writeback is a data integrity operation and thus
    a bigger hammer than simple WB_SYNC_NONE. For barrier-aware filesystems
    it's a lot slower.

    Signed-off-by: Jens Axboe

    Jens Axboe
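
    A toy model of the two passes described above, with made-up names: a
    fast WB_SYNC_NONE kick that silently does nothing when the lock it
    needs is already held (as s_umount is during umount), followed by the
    expensive WB_SYNC_ALL integrity pass that then has to write everything.

        #include <stdbool.h>
        #include <stdio.h>

        enum sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

        static void toy_writeback(enum sync_mode mode, bool lock_free, int *dirty)
        {
            if (mode == WB_SYNC_NONE && !lock_free) {
                printf("WB_SYNC_NONE: lock already held, doing nothing\n");
                return;
            }
            printf("%s: wrote %d pages\n",
                   mode == WB_SYNC_ALL ? "WB_SYNC_ALL" : "WB_SYNC_NONE", *dirty);
            *dirty = 0;
        }

        int main(void)
        {
            int dirty = 1000;
            bool s_umount_free = false;               /* umount already holds it */

            toy_writeback(WB_SYNC_NONE, s_umount_free, &dirty);  /* no-op */
            toy_writeback(WB_SYNC_ALL, true, &dirty);  /* slow integrity pass */
            return 0;
        }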
     

06 Apr, 2010

1 commit

  • One of the features of laptop-mode is that it forces a writeout of dirty
    pages if something else triggers a physical read or write from a device.
    The current implementation flushes pages on all devices, rather than only
    the one that triggered the flush. This patch alters the behaviour so that
    only the recently accessed block device is flushed, preventing other
    disks from being spun up for no terribly good reason.

    Signed-off-by: Matthew Garrett
    Signed-off-by: Jens Axboe

    Matthew Garrett
     

03 Dec, 2009

1 commit

  • - no one is calling wb_writeback and write_cache_pages with
    wbc.nonblocking=1 any more
    - lumpy pageout will want to do nonblocking writeback without the
    congestion wait

    So remove the congestion checks as suggested by Chris.

    Signed-off-by: Wu Fengguang
    Cc: Chris Mason
    Cc: Jens Axboe
    Cc: Trond Myklebust
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Evgeniy Polyakov
    Cc: Alex Elder
    Signed-off-by: Jens Axboe

    Wu Fengguang
     

09 Oct, 2009

1 commit

  • It makes sense to do IOWAIT when someone is blocked
    due to IO throttle, as suggested by Kame and Peter.

    There is an old comment saying not to do IOWAIT on throttling; however,
    it has not matched the code for a long time.

    If we stop accounting IOWAIT for 2.6.32, it could be an
    undesirable behavior change. So restore the io_schedule.

    CC: KAMEZAWA Hiroyuki
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     

26 Sep, 2009

5 commits

  • Sometimes we only want to write pages from a specific super_block,
    so allow that to be passed in.

    This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
    causing writeback on all super_blocks on a bdi, where we only really
    want to sync a specific sb from writeback_inodes_sb().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid of incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Treat bdi_start_writeback(0) as a special request to do background write,
    and stop such work when we are below the background dirty threshold.

    Also simplify the (nr_pages <= 0) checks.

    CC: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
     
  • Some filesystems may choose to write much more than ratelimit_pages
    before calling balance_dirty_pages_ratelimited_nr(), so it is safer to
    determine the number to write based on the real number of dirtied pages.

    Otherwise it is possible that
    loop {
    btrfs_file_write(): dirty 1024 pages
    balance_dirty_pages(): write up to 48 pages (= ratelimit_pages * 1.5)
    }
    in which the writeback rate cannot keep up with the dirty rate, and the
    dirty pages go all the way beyond dirty_thresh.

    The increased write_chunk may make the dirtier more bumpy, so
    filesystems should take care not to dirty too much at a time
    (e.g. > 4MB) without checking the ratelimit.

    Signed-off-by: Wu Fengguang
    Acked-by: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Wu Fengguang
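
    A standalone sketch of the accounting point above, with illustrative
    numbers: if the per-call write target is a fixed write_chunk of about
    1.5 * ratelimit_pages, a caller that dirties far more pages per call
    outruns writeback forever; basing the target on the number of pages the
    caller actually dirtied closes the gap.

        #include <stdio.h>

        #define RATELIMIT_PAGES 32   /* illustrative value */

        /* Old: always aim for ~1.5 * ratelimit_pages per call.
         * New: write at least as many pages as the caller just dirtied. */
        static long pages_to_write(long nr_dirtied, int base_on_dirtied)
        {
            long write_chunk = RATELIMIT_PAGES + RATELIMIT_PAGES / 2;

            if (base_on_dirtied && nr_dirtied > write_chunk)
                write_chunk = nr_dirtied;
            return write_chunk;
        }

        int main(void)
        {
            long backlog_old = 0, backlog_new = 0;

            /* A btrfs-like writer dirtying 1024 pages per call, as in the
             * loop example above. */
            for (int i = 0; i < 10; i++) {
                backlog_old += 1024 - pages_to_write(1024, 0);
                backlog_new += 1024 - pages_to_write(1024, 1);
            }
            printf("dirty backlog after 10 calls: old=%ld new=%ld\n",
                   backlog_old, backlog_new);
            return 0;
        }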
     

24 Sep, 2009

2 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

1 commit

  • global_lru_pages() / zone_lru_pages() can be used in two ways:
    - to estimate max reclaimable pages in determine_dirtyable_memory()
    - to calculate the slab scan ratio

    When swap is full or not present, the anon lru lists are not reclaimable
    and also won't be scanned, so the anon pages should not be counted in
    either usage scenario. Also rename to _reclaimable_pages: now they count
    the possibly reclaimable lru pages.

    It can greatly (and correctly) increase the slab scan rate under high
    memory pressure (when most file pages have been reclaimed and swap is
    full/absent), thus reducing false OOM kills.

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Wu Fengguang
    Acked-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Reviewed-by: Jesse Barnes
    Cc: David Howells
    Cc: "Li, Ming Chun"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
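
    A small sketch of the counting rule above; the function and field names
    are stand-ins, not the kernel's. File LRU pages are always counted as
    potentially reclaimable, anon LRU pages only while swap space is still
    available.

        #include <stdio.h>

        struct lru_counts {
            unsigned long active_file, inactive_file;
            unsigned long active_anon, inactive_anon;
        };

        /* Count only pages that reclaim could actually free: anon pages are
         * excluded when there is no swap left to move them to. */
        static unsigned long reclaimable_pages(const struct lru_counts *lru,
                                               unsigned long free_swap_pages)
        {
            unsigned long nr = lru->active_file + lru->inactive_file;

            if (free_swap_pages > 0)
                nr += lru->active_anon + lru->inactive_anon;
            return nr;
        }

        int main(void)
        {
            struct lru_counts lru = { 100, 200, 5000, 7000 };

            printf("with swap:    %lu\n", reclaimable_pages(&lru, 1 << 20));
            printf("without swap: %lu\n", reclaimable_pages(&lru, 0));
            return 0;
        }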
     

16 Sep, 2009

5 commits

  • bdi_start_writeback() is currently split into two paths, one for
    WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
    for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
    only WB_SYNC_NONE.

    Push down the writeback_control allocation and only accept the
    parameters that make sense for each function. This cleans up
    the API considerably.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Now that bdi_writeback_all() no longer handles integrity writeback,
    it doesn't have to block anymore. This means that we can switch
    bdi_list reader side protection to RCU.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's only set, it's never checked. Kill it.

    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The dirtying of the page and set_page_dirty() can be moved inside the page lock.

    - In shmem_write_end(), the page was dirtied while the page lock was held,
    but it's being marked dirty just after dropping the page lock.
    - In shmem_symlink(), both dirtying and marking can be moved into page lock.

    It's valuable for the hwpoison code to know whether one bad page can be dropped
    without losing data. It mainly judges by testing the PG_dirty bit after taking
    the page lock. So it becomes important that the dirtying of the page and
    the marking of dirtiness are both done inside the page lock, which is
    common practice but sadly not a rule.

    The noticeable exceptions are
    - mapped pages
    - pages with buffer_heads
    The above pages could go dirty at any time. Fortunately the hwpoison will
    unmap the page and release the buffer_heads beforehand anyway.

    Many other types of pages (eg. metadata pages) can also be dirtied at will
    by their owners; the hwpoison code cannot do meaningful things to them
    anyway. Only the dirtiness of pagecache pages owned by regular files is of
    interest here (see the locking sketch after the last entry for this date).

    v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)

    Acked-by: Hugh Dickins
    Reviewed-by: WANG Cong
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
    powerpc64: convert to dynamic percpu allocator
    sparc64: use embedding percpu first chunk allocator
    percpu: kill lpage first chunk allocator
    x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
    percpu: update embedding first chunk allocator to handle sparse units
    percpu: use group information to allocate vmap areas sparsely
    vmalloc: implement pcpu_get_vm_areas()
    vmalloc: separate out insert_vmalloc_vm()
    percpu: add chunk->base_addr
    percpu: add pcpu_unit_offsets[]
    percpu: introduce pcpu_alloc_info and pcpu_group_info
    percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
    percpu: add @align to pcpu_fc_alloc_fn_t
    percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
    percpu: drop @static_size from first chunk allocators
    percpu: generalize first chunk allocator selection
    percpu: build first chunk allocators selectively
    percpu: rename 4k first chunk allocator to page
    percpu: improve boot messages
    percpu: fix pcpu_reclaim() locking
    ...

    Fix trivial conflict as done by Tejun Heo in kernel/sched.c

    Linus Torvalds
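
    A small userspace illustration of the ordering rule in the shmem entry
    above (build with -pthread), using a mutex and a toy page structure as
    stand-ins for the page lock and struct page: both the data write and
    the PG_dirty-style marking happen under the same lock, so a checker
    that tests the dirty bit under that lock (as hwpoison does) gets a
    consistent answer.

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        struct toy_page {
            pthread_mutex_t lock;   /* stands in for the page lock */
            char data[64];
            bool dirty;             /* stands in for PG_dirty */
        };

        /* Correct order: modify the data and mark it dirty inside the lock. */
        static void write_end(struct toy_page *page, const char *buf)
        {
            pthread_mutex_lock(&page->lock);
            strncpy(page->data, buf, sizeof(page->data) - 1);
            page->dirty = true;     /* set_page_dirty() equivalent, still locked */
            pthread_mutex_unlock(&page->lock);
        }

        int main(void)
        {
            struct toy_page page = { .lock = PTHREAD_MUTEX_INITIALIZER };

            write_end(&page, "hello");

            pthread_mutex_lock(&page.lock);
            printf("dirty=%d data=%s\n", page.dirty, page.data);
            pthread_mutex_unlock(&page.lock);
            return 0;
        }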
     

11 Sep, 2009

2 commits

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT smoother in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as do random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This is a first step at introducing per-bdi flusher threads. We should
    have no change in behaviour, although sb_has_dirty_inodes() is now
    ridiculously expensive, as there's no easy way to answer that question.
    Not a huge problem, since it'll be deleted in subsequent patches.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Aug, 2009

1 commit

  • Conflicts:
    arch/sparc/kernel/smp_64.c
    arch/x86/kernel/cpu/perf_counter.c
    arch/x86/kernel/setup_percpu.c
    drivers/cpufreq/cpufreq_ondemand.c
    mm/percpu.c

    Conflicts in core and arch percpu codes are mostly from commit
    ed78e1e078dd44249f88b1dd8c76dafb39567161, which replaced many uses of
    num_possible_cpus() with nr_cpu_ids. As the for-next branch has moved all
    the first chunk allocators into mm/percpu.c, the changes are moved
    from arch code to mm/percpu.c.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

04 Jul, 2009

1 commit

  • Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
    changes. As alpha in the percpu tree uses the 'weak' attribute instead
    of inline assembly, there's no need for the __used attribute.

    Conflicts:
    arch/alpha/include/asm/percpu.h
    arch/mn10300/kernel/vmlinux.lds.S
    include/linux/percpu-defs.h

    Tejun Heo
     

01 Jul, 2009

1 commit

  • balance_dirty_pages can overreact and move all of the dirty pages to
    writeback unnecessarily.

    balance_dirty_pages makes its decision to throttle based on the number of
    dirty plus writeback pages that are over the calculated limit, so it will
    continue to move pages even when there are already plenty of pages in
    writeback and fewer than the threshold are still dirty.

    This allows it to overshoot its limits and move all the dirty pages to
    writeback while waiting for the drives to catch up and empty the writeback
    list.

    A simple fio test easily demonstrates this problem.

    fio --name=f1 --directory=/disk1 --size=2G --rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10

    This is the simplest fix I could find, but I'm not entirely sure that it
    alone will be enough for all cases. But it certainly is an improvement on
    my desktop machine writing to 2 disks.

    Do we need something more for machines with large arrays where
    bdi_threshold * number_of_drives is greater than the dirty_ratio?

    Signed-off-by: Richard Kennedy
    Acked-by: Peter Zijlstra
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
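
    A toy throttle loop sketching the change described above, with made-up
    numbers: the loop keeps throttling while dirty + writeback is over the
    limit, but stops queueing more pages for writeback once the dirty count
    alone has dropped below the threshold, instead of pushing every
    remaining dirty page into writeback.

        #include <stdio.h>

        int main(void)
        {
            long dirty = 800, writeback = 0;
            const long thresh = 600;

            while (dirty + writeback > thresh) {
                if (dirty > thresh) {
                    /* Still over the limit on dirty pages alone:
                     * queue up to 64 more pages for writeback. */
                    long move = dirty - thresh < 64 ? dirty - thresh : 64;
                    dirty -= move;
                    writeback += move;
                }
                /* Model the device completing up to 32 queued pages,
                 * then loop (i.e. keep the caller throttled). */
                long done = writeback < 32 ? writeback : 32;
                writeback -= done;
                printf("dirty=%ld writeback=%ld\n", dirty, writeback);
            }
            return 0;
        }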
     

24 Jun, 2009

1 commit

  • Percpu variable definition is about to be updated such that all percpu
    symbols including the static ones must be unique. Update percpu
    variable definitions accordingly.

    * as,cfq: rename ioc_count uniquely

    * cpufreq: rename cpu_dbs_info uniquely

    * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it

    * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
    rename it

    * ipv4,6: rename cookie_scratch uniquely

    * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
    pmc_irq_entry and nmi_entry to pmc_nmi_entry

    * perf_counter: rename disable_count to perf_disable_count

    * ftrace: rename test_event_disable to ftrace_test_event_disable

    * kmemleak: rename test_pointer to kmemleak_test_pointer

    * mce: rename next_interval to mce_next_interval

    [ Impact: percpu usage cleanups, no duplicate static percpu var names ]

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Cc: Ivan Kokshaysky
    Cc: Jens Axboe
    Cc: Dave Jones
    Cc: Jeremy Fitzhardinge
    Cc: linux-mm
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Li Zefan
    Cc: Catalin Marinas
    Cc: Andi Kleen

    Tejun Heo
     

17 Jun, 2009

1 commit

  • get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
    with variable pbdi_dirty as one of the arguments. This variable is an
    unsigned long * but both functions expect it to be a long *. This causes
    the following sparse warnings:

    warning: incorrect type in argument 3 (different signedness)
    expected long *pbdi_dirty
    got unsigned long *pbdi_dirty
    warning: incorrect type in argument 2 (different signedness)
    expected long *pdirty
    got unsigned long *pbdi_dirty

    Fix the warnings by changing the long * to unsigned long * in both
    functions.

    Signed-off-by: H Hartley Sweeten
    Cc: Johannes Weiner
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     

18 May, 2009

1 commit

  • The wb_kupdate() function has a bug in linux-2.6.30-rc5. This bug causes
    generic_sync_sb_inodes() to start writing inodes back much earlier than
    expected because it miscalculates oldest_jif in wb_kupdate().

    This bug was introduced in 704503d836042d4a4c7685b7036e7de0418fbc0f
    ('mm: fix proc_dointvec_userhz_jiffies "breakage"').

    Signed-off-by: Toshiyuki Okajima
    Cc: Alexey Dobriyan
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshiyuki Okajima
     

01 Apr, 2009

2 commits

  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838

    On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange
    way from the user's point of view:

    # echo 500 >/proc/sys/vm/dirty_writeback_centisecs
    # cat /proc/sys/vm/dirty_writeback_centisecs
    499

    So, we have 5000 jiffies converted to only 499 clock ticks and reported
    back.

    TICK_NSEC = 999848
    ACTHZ = 256039

    Keeping the in-kernel variable in the units passed from userspace would
    of course fix the issue, but this probably won't be right for every
    sysctl (a worked version of the conversion follows after the next entry).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Add a helper function account_page_dirtied(). Use that from two
    callsites. reiser4 adds a function which adds a third callsite.

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
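
    A worked version of the round-trip in the first entry above, assuming
    USER_HZ=100 (the usual clock-tick unit exported to userspace) and the
    TICK_NSEC value quoted there: with HZ=1000, 5000 jiffies span slightly
    less than 5 seconds of real tick time, so converting back truncates to
    499 centiseconds.

        #include <stdint.h>
        #include <stdio.h>

        #define TICK_NSEC     999848ULL      /* value quoted above */
        #define NSEC_PER_SEC  1000000000ULL
        #define USER_HZ       100ULL         /* assumed userspace tick rate */

        /* Simplified jiffies -> USER_HZ clock ticks conversion. */
        static uint64_t toy_jiffies_to_clock_t(uint64_t jiffies)
        {
            return jiffies * TICK_NSEC / (NSEC_PER_SEC / USER_HZ);
        }

        int main(void)
        {
            /* Writing 500 centisecs with HZ=1000 stores 5000 jiffies... */
            uint64_t jiffies = 5000;

            /* ...but 5000 * 999848 ns = 4999240000 ns, and dividing by the
             * 10^7 ns length of a user tick truncates to 499. */
            printf("%llu\n", (unsigned long long)toy_jiffies_to_clock_t(jiffies));
            return 0;
        }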
     

27 Mar, 2009

1 commit

  • Enlarge the default dirty ratios from 5/10 to 10/20. This fixes [Bug
    #12809], an iozone regression with 2.6.29-rc6.

    The iozone benchmarks are performed on a 1200M file, with 8GB ram.

    iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
    iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls

    The performance regression is triggered by commit 1cf6e7d83bf3 ("mm: task
    dirty accounting fix"), which makes dirty accounting more correct/thorough.

    The default 5/10 dirty ratios were picked (a) with the old dirty logic
    and (b) largely at random and (c) designed to be aggressive. In
    particular, that (a) means that having fixed some of the dirty
    accounting, maybe the real bug is now that it was always too aggressive,
    just hidden by an accounting issue.

    The enlarged 10/20 dirty ratios are just about enough to fix the regression.

    [ We will have to look at how this affects the old fsync() latency issue,
    but that probably will need independent work. - Linus ]

    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Reported-by: "Lin, Ming M"
    Tested-by: "Lin, Ming M"
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang