11 Dec, 2014

1 commit

  • Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback
    page accounting"), mem_cgroup_end_page_stat consumes the locked and
    flags variables directly rather than via pointers, which might trigger
    C undefined behavior, as those variables are initialized only in the
    slow path of mem_cgroup_begin_page_stat.

    Although mem_cgroup_end_page_stat handles the parameters correctly and
    touches them only when they hold a sensible value, it is the caller
    that loads a potentially uninitialized value, which might then allow
    the compiler to do crazy things.

    I haven't seen any warning from gcc, and it seems that the current
    version (4.9) doesn't exploit this type of undefined behavior, but
    Sasha has reported the following:

    UBSan: Undefined behaviour in mm/rmap.c:1084:2
    load of value 255 is not a valid value for type '_Bool'
    CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    ubsan_epilogue (lib/ubsan.c:159)
    __ubsan_handle_load_invalid_value (lib/ubsan.c:482)
    page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
    unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
    unmap_single_vma (mm/memory.c:1348)
    unmap_vmas (mm/memory.c:1377 (discriminator 3))
    exit_mmap (mm/mmap.c:2837)
    mmput (kernel/fork.c:659)
    do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
    do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
    SyS_exit_group (kernel/exit.c:901)
    tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

    Fix this by using pointer parameters for both locked and flags; this
    is more robust against future compiler changes even though the current
    code is implemented correctly.
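
    A minimal sketch of the calling convention after the fix (only the
    begin/end names come from the commit; the surrounding caller is
    illustrative):

        bool locked;          /* written only in begin's slow path */
        unsigned long flags;  /* ditto */
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_begin_page_stat(page, &locked, &flags);
        /* ... update per-memcg page statistics ... */
        mem_cgroup_end_page_stat(memcg, &locked, &flags);

    With pointer parameters, the caller never loads the possibly
    uninitialized values itself; only mem_cgroup_end_page_stat()
    dereferences them, and it does so only when the slow path set them.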

    Signed-off-by: Michal Hocko
    Reported-by: Sasha Levin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Oct, 2014

2 commits

  • Commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") changed
    page migration to uncharge the old page right away. The page is locked,
    unmapped, truncated, and off the LRU, but it could race with writeback
    ending, which then doesn't unaccount the page properly:

    test_clear_page_writeback()           migration
                                            wait_on_page_writeback()
      TestClearPageWriteback()
                                            mem_cgroup_migrate()
                                              clear PCG_USED
      mem_cgroup_update_page_stat()
        if (PageCgroupUsed(pc))
          decrease memcg pages under writeback

      release pc->mem_cgroup->move_lock

    The per-page statistics interface is heavily optimized to avoid a
    function call and a lookup_page_cgroup() in the file unmap fast path,
    which means it doesn't verify whether a page is still charged before
    clearing PageWriteback() and instead has to do that in the stat update
    later.

    Rework it so that it looks up the page's memcg once at the beginning of
    the transaction and then uses it throughout. The charge will be
    verified before clearing PageWriteback() and migration can't uncharge
    the page as long as that is still set. The RCU lock will protect the
    memcg past uncharge.
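
    A minimal sketch of the reworked transaction as applied to
    test_clear_page_writeback() (identifiers follow this and the Sep 2013
    entry below; details may differ from the actual patch):

        bool locked, ret;
        unsigned long flags;
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_begin_page_stat(page, &locked, &flags);
        ret = TestClearPageWriteback(page);
        if (ret)
                mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
        mem_cgroup_end_page_stat(memcg, locked, flags);

    The memcg looked up once at the beginning stays valid for the whole
    transaction: migration cannot uncharge while PageWriteback() is still
    set, and RCU protects the memcg past uncharge.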

    As far as losing the optimization goes, the following test results are
    from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
    three times in a nested fashion, so that there are two negative passes
    that don't account but still go through the new transaction overhead.
    There is no actual difference:

    old: 33.195102545 seconds time elapsed ( +- 0.01% )
    new: 33.199231369 seconds time elapsed ( +- 0.03% )

    The time spent in page_remove_rmap()'s callees still adds up to the
    same, but the time spent in the function itself seems reduced:

    # Children Self Command Shared Object Symbol
    old: 0.12% 0.11% filemapstress [kernel.kallsyms] [k] page_remove_rmap
    new: 0.12% 0.08% filemapstress [kernel.kallsyms] [k] page_remove_rmap

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [3.17.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A follow-up patch would have changed the call signature. To save the
    trouble, just fold it instead.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [3.17.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Oct, 2014

2 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator and a work item tries to keep the reserve around
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...
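
    A minimal sketch of the reworked init interface quoted above (the ref
    and release names are illustrative):

        static struct percpu_ref my_ref;

        static void my_release(struct percpu_ref *ref)
        {
                /* free the object embedding @ref */
        }

        /* new signature: behavior flags plus an allocation mask */
        int err = percpu_ref_init(&my_ref, my_release,
                                  PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);

    PERCPU_REF_INIT_ATOMIC starts the ref in single-counter (atomic)
    mode, avoiding percpu shutdown latency for short-lived refs, per the
    decoupling described above.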

    Linus Torvalds
     
  • Nested calls to min/max functions result in shadow warnings in W=2 builds.
    Avoid the warning by using the min3 and max3 macros to get the min/max of
    3 values instead of nested calls.
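
    A tiny before/after sketch (variable names are illustrative; min3()
    and max3() are the existing helpers in linux/kernel.h):

        /* nested expansion duplicates min()'s internal temporaries,
         * which shadow each other and trip -Wshadow under W=2 */
        lo = min(a, min(b, c));

        /* single expansion, no shadowing */
        lo = min3(a, b, c);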

    Signed-off-by: Mark Rustad
    Signed-off-by: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rustad
     

08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    [flex_]proportions init functions so that !GFP_KERNEL allocation masks
    can be used with them too.

    This patch doesn't make any functional difference.
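
    A minimal sketch of the resulting signatures (assuming the usual
    proportions API; error handling elided):

        struct fprop_global fg;
        struct fprop_local_percpu fl;

        /* the mask is threaded down to the underlying percpu counters */
        err = fprop_global_init(&fg, GFP_KERNEL);
        err = fprop_local_init_percpu(&fl, GFP_KERNEL);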

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Peter Zijlstra

    Tejun Heo
     

07 Aug, 2014

1 commit

  • Setting vm_dirty_bytes and dirty_background_bytes is not protected by
    any serialization.

    Therefore, it's possible for either variable to change value after the
    test in global_dirty_limits() to determine whether available_memory
    needs to be initialized or not.

    Always ensure that available_memory is properly initialized.
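
    A sketch of the hardening described above (function body elided; only
    the initialization matters):

        void global_dirty_limits(unsigned long *pbackground,
                                 unsigned long *pdirty)
        {
                /* initialized unconditionally, so a racing write to
                 * vm_dirty_bytes or dirty_background_bytes cannot
                 * leave it unset */
                unsigned long available_memory = global_dirtyable_memory();

                /* ... derive *pbackground and *pdirty from it ... */
        }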

    Signed-off-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

31 Jul, 2014

1 commit

  • Under memory pressure, it is possible for dirty_thresh, calculated by
    global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if
    strictlimit is true, bdi_dirty_limits() tries to resolve the proportion:

    bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh

    by dividing by zero.
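
    A minimal sketch of the guarded proportion (assuming div_u64() as
    used elsewhere in page-writeback.c):

        /* dirty_thresh may be zero under memory pressure */
        *bdi_bg_thresh = dirty_thresh ?
                div_u64((u64)*bdi_thresh * background_thresh,
                        dirty_thresh) : 0;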

    Signed-off-by: Maxim Patlasov
    Acked-by: Rik van Riel
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
     

09 Jun, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Clean ups and miscellaneous bug fixes, in particular for the new
    collapse_range and zero_range fallocate functions. In addition,
    improve the scalability of adding and removing inodes from the orphan
    list"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
    ext4: handle symlink properly with inline_data
    ext4: fix wrong assert in ext4_mb_normalize_request()
    ext4: fix zeroing of page during writeback
    ext4: remove unused local variable "stored" from ext4_readdir(...)
    ext4: fix ZERO_RANGE test failure in data journalling
    ext4: reduce contention on s_orphan_lock
    ext4: use sbi in ext4_orphan_{add|del}()
    ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
    ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
    ext4: remove unnecessary double parentheses
    ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
    ext4: make local functions static
    ext4: fix block bitmap validation when bigalloc, ^flex_bg
    ext4: fix block bitmap initialization under sparse_super2
    ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
    ext4: avoid unneeded lookup when xattr name is invalid
    ext4: fix data integrity sync in ordered mode
    ext4: remove obsoleted check
    ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
    ext4: fix locking for O_APPEND writes
    ...

    Linus Torvalds
     

05 Jun, 2014

2 commits

  • There is an orphaned prehistoric comment, which used to be against
    get_dirty_limits(), the forerunner of global_dirtyable_memory().

    Back then, the implementation of get_dirty_limits() was complicated
    and full of magic numbers, so this comment was necessary. But we now
    use the clear and neat global_dirtyable_memory(), which renders this
    comment ambiguous and useless. Remove it.

    Signed-off-by: Jianyu Zhan
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
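
    A representative conversion (vm_event_states is one such percpu
    variable; any DEFINE_PER_CPU variable follows the same pattern):

        /* before: macro used purely to compute this CPU's address */
        struct vm_event_state *s = &__get_cpu_var(vm_event_states);

        /* after: the dedicated pointer accessor, same semantics */
        struct vm_event_state *s = this_cpu_ptr(&vm_event_states);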

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 May, 2014

1 commit

  • When we perform a data integrity sync, we tag all the dirty pages
    with PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages. Later
    we check for this tag in write_cache_pages_da and create a struct
    mpage_da_data containing contiguously indexed pages tagged with this
    tag, and we sync these pages with a call to mpage_da_map_and_submit.
    This process is repeated in a while loop until all the
    PAGECACHE_TAG_TOWRITE pages are synced. We also do a journal start
    and stop in each iteration. journal_stop could initiate a journal
    commit, which would call ext4_writepage, which in turn calls
    ext4_bio_write_page even for delayed OR unwritten buffers. When
    ext4_bio_write_page is called for such buffers, it does not sync
    them, but it does clear the PAGECACHE_TAG_TOWRITE tag of the
    corresponding page, and hence these pages are not synced by the
    currently running data integrity sync either. We end up with dirty
    pages although the sync has completed.

    This could cause potential data loss when the sync call is followed
    by a truncate_pagecache call, which is exactly the case in
    collapse_range. (It causes a generic/127 failure in xfstests.)

    To avoid this issue, we can use set_page_writeback_keepwrite instead
    of set_page_writeback, which doesn't clear the TOWRITE tag.
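
    A minimal sketch of the fix as described (the keep_towrite flag is
    illustrative of how a caller chooses the variant):

        if (keep_towrite)
                set_page_writeback_keepwrite(page); /* leaves TOWRITE set */
        else
                set_page_writeback(page);           /* clears TOWRITE */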

    Cc: stable@vger.kernel.org
    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Namjae Jeon
     

07 May, 2014

1 commit

  • It is possible for "limit - setpoint + 1" to equal zero after being
    truncated to a 32-bit variable, resulting in a divide-by-zero error.

    Using the fully 64 bit divide functions avoids this problem. It also
    will cause pos_ratio_polynom() to return the correct value when
    (setpoint - limit) exceeds 2^32.

    Also uninline pos_ratio_polynom, at Andrew's request.
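
    The substitution, sketched (expression per the commit description;
    div_s64() truncates its divisor to 32 bits, div64_s64() does not):

        /* before: 32-bit divisor can truncate to zero */
        x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
                    limit - setpoint + 1);

        /* after: full 64-bit divide */
        x = div64_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
                      limit - setpoint + 1);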

    Signed-off-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Nishanth Aravamudan
    Cc: Luiz Capitulino
    Cc: Masayoshi Mizuma
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

07 Feb, 2014

1 commit

  • During an aio stress test, we observed the following lockdep warning.
    This means AIO+numa_balancing is currently deadlockable.

    The problem is that aio_migratepage disables interrupts, but
    __set_page_dirty_nobuffers unintentionally enables them again.

    Generally, all helper functions should use spin_lock_irqsave() instead
    of spin_lock_irq() because they can't know their caller's context.

    other info that might help us debug this:
    Possible unsafe locking scenario:

          CPU0
          ----
      lock(&(&ctx->completion_lock)->rlock);
      <Interrupt>
        lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    dump_stack+0x19/0x1b
    print_usage_bug+0x1f7/0x208
    mark_lock+0x21d/0x2a0
    mark_held_locks+0xb9/0x140
    trace_hardirqs_on_caller+0x105/0x1d0
    trace_hardirqs_on+0xd/0x10
    _raw_spin_unlock_irq+0x2c/0x50
    __set_page_dirty_nobuffers+0x8c/0xf0
    migrate_page_copy+0x434/0x540
    aio_migratepage+0xb1/0x140
    move_to_new_page+0x7d/0x230
    migrate_pages+0x5e5/0x700
    migrate_misplaced_page+0xbc/0xf0
    do_numa_page+0x102/0x190
    handle_pte_fault+0x241/0x970
    handle_mm_fault+0x265/0x370
    __do_page_fault+0x172/0x5a0
    do_page_fault+0x1a/0x70
    page_fault+0x28/0x30
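
    The fix pattern, sketched: a helper saves and restores its caller's
    interrupt state instead of blindly re-enabling interrupts:

        unsigned long flags;

        spin_lock_irqsave(&mapping->tree_lock, flags);
        /* ... set the radix-tree dirty tag, etc. ... */
        spin_unlock_irqrestore(&mapping->tree_lock, flags);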

    Signed-off-by: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

30 Jan, 2014

2 commits

  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant portion (40+%, depending
    on memory layout, storage speeds, etc.) of memory with anon/tmpfs
    pages and uses the remainder for a streaming writer demonstrates this
    problem. In that case, the actual cache pages are a small fraction of
    what is considered dirtyable overall, which results in a relatively
    large portion of the cache pages being dirtied. As kswapd starts
    rotating these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.
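
    A sketch of the resulting accounting (counter names assumed from the
    vmstat interface of that era):

        /* free pages, minus the allocator's reserve ... */
        x = global_page_state(NR_FREE_PAGES);
        x -= min(x, dirty_balance_reserve);

        /* ... plus file pages only; anon is no longer counted */
        x += global_page_state(NR_INACTIVE_FILE);
        x += global_page_state(NR_ACTIVE_FILE);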

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Tejun reported stuttering and latency spikes on a system where random
    tasks would enter direct reclaim and get stuck on dirty pages. Around
    50% of memory was occupied by tmpfs backed by an SSD, and another disk
    (rotating) was reading and writing at max speed to shrink a partition.

    : The problem was pretty ridiculous. It's a 8gig machine w/ one ssd and 10k
    : rpm harddrive and I could reliably reproduce constant stuttering every
    : several seconds for as long as buffered IO was going on on the hard drive
    : either with tmpfs occupying somewhere above 4gig or a test program which
    : allocates about the same amount of anon memory. Although swap usage was
    : zero, turning off swap also made the problem go away too.
    :
    : The trigger conditions seem quite plausible - high anon memory usage w/
    : heavy buffered IO and swap configured - and it's highly likely that this
    : is happening in the wild too. (this can happen with copying large files
    : to usb sticks too, right?)

    This patch (of 2):

    The dirty_balance_reserve is an approximation of the fraction of free
    pages that the page allocator does not make available for page cache
    allocations. As a result, it has to be taken into account when
    calculating the amount of "dirtyable memory", the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    However, currently the reserve is subtracted from the sum of free and
    reclaimable pages, which is non-sensical and leads to erroneous results
    when the system is dominated by unreclaimable pages and the
    dirty_balance_reserve is bigger than free+reclaimable. In that case, at
    least the already allocated cache should be considered dirtyable.

    Fix the calculation by subtracting the reserve from the amount of free
    pages, then adding the reclaimable pages on top.
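
    A before/after sketch of the calculation (helper names assumed; the
    point is the order of operations):

        /* before: reserve clamped against free + reclaimable */
        x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
        x -= min(x, dirty_balance_reserve);

        /* after: subtract from free pages only, then add reclaimable,
         * so already-allocated cache always stays dirtyable */
        x = global_page_state(NR_FREE_PAGES);
        x -= min(x, dirty_balance_reserve);
        x += global_reclaimable_pages();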

    [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build]
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

17 Oct, 2013

1 commit

  • Toralf runs trinity on UML/i386. After some time it hangs and the last
    message line is

    BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]

    It's found that pages_dirtied becomes very large. More than 1000000000
    pages in this case:

    period = HZ * pages_dirtied / task_ratelimit;
    BUG_ON(pages_dirtied > 2000000000);
    BUG_ON(pages_dirtied > 1000000000);

    +	if (pause < 0) {
    +		extern int printf(char *, ...);
    +		printf("ick : pause : %li\n", pause);
    +		printf("ick: pages_dirtied : %lu\n", pages_dirtied);
    +		printf("ick: task_ratelimit: %lu\n", task_ratelimit);
    +		BUG_ON(1);
    +	}
    	trace_balance_dirty_pages(bdi,

    Since pause is bounded by [min_pause, max_pause], where min_pause is
    also bounded by max_pause, it's suspected and demonstrated that the
    max_pause calculation goes wrong:

    ick: pause : -717
    ick: min_pause : -177
    ick: max_pause : -717
    ick: pages_dirtied : 14
    ick: task_ratelimit: 0

    The problem lies in the two "long = unsigned long" assignments in
    bdi_max_pause(), which might go negative if the highest bit is 1, and
    the min_t(long, ...) check fails to protect against them falling below
    0. Fix all of them by using "unsigned long" throughout the function.
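
    A sketch of the repaired function (structure per the description;
    details may differ from the actual patch):

        static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
                                           unsigned long bdi_dirty)
        {
                unsigned long bw = bdi->avg_write_bandwidth; /* was: long */
                unsigned long t;                             /* was: long */

                /* unsigned arithmetic end to end; cannot go negative */
                t = bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));
                t++;

                return min_t(unsigned long, t, MAX_PAUSE);
        }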

    Signed-off-by: Fengguang Wu
    Reported-by: Toralf Förster
    Tested-by: Toralf Förster
    Reviewed-by: Jan Kara
    Cc: Richard Weinberger
    Cc: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

13 Sep, 2013

1 commit

  • Add memcg routines to count writeback pages, later dirty pages will also
    be accounted.

    After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat
    accounting"), we can use a 'struct page' flag to test the page state
    instead of a per-page_cgroup flag. But memcg has a feature to move a
    page from one cgroup to another and may race between "move" and "page
    stat accounting". So in order to avoid the race we have designed a new
    lock:

    mem_cgroup_begin_update_page_stat()
    modify page information -->(a)
    mem_cgroup_update_page_stat() -->(b)
    mem_cgroup_end_update_page_stat()

    It requires both (a) and (b) (the writeback page accounting) to be
    protected by mem_cgroup_{begin/end}_update_page_stat(). It's a full
    no-op for !CONFIG_MEMCG, almost a no-op if memcg is disabled (but
    compiled in), an RCU read lock in most cases (no task is moving), and
    a spin_lock_irqsave on top in the slow path.

    There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
    And the lock order is:
    --> memcg->move_lock
    --> mapping->tree_lock
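
    A minimal sketch of the recipe applied to test_clear_page_writeback()
    (the stat index name follows this patch series):

        bool locked;
        unsigned long flags;

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        if (TestClearPageWriteback(page))               /* (a) */
                mem_cgroup_dec_page_stat(page,
                        MEM_CGROUP_STAT_WRITEBACK);     /* (b) */
        mem_cgroup_end_update_page_stat(page, &locked, &flags);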

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     

12 Sep, 2013

3 commits

  • The feature prevents mistrusted filesystems (i.e. FUSE mounts created
    by unprivileged users) from growing a large number of dirty pages
    before throttling. For such filesystems balance_dirty_pages always
    checks bdi counters against bdi limits. I.e. even if the global
    "nr_dirty" is under "freerun", it's not allowed to skip the bdi
    checks. The only use case for now is fuse: it sets bdi max_ratio to 1%
    by default and system administrators are supposed to expect that this
    limit won't be exceeded.

    The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
    filesystem may set the flag when it initializes its BDI.
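
    A minimal sketch of opting in (as fuse would at BDI setup time; the
    bdi pointer is illustrative):

        bdi->capabilities |= BDI_CAP_STRICTLIMIT;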

    The problematic scenario comes from the fact that nobody pays attention to
    the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
    writeback). The implementation of fuse writeback releases the original
    page (by calling end_page_writeback) almost immediately. A fuse
    request queued for real processing bears a copy of the original page.
    Hence, if the userspace fuse daemon doesn't finalize write requests in
    a timely manner, an aggressive mmap writer can pollute virtually all
    memory with those temporary fuse page copies. They are carefully
    accounted in NR_WRITEBACK_TEMP, but nobody cares.

    To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
    problem" as a shortcut for "a possibility of uncontrolled grow of amount
    of RAM consumed by temporary pages allocated by kernel fuse to process
    writeback".

    The problem was very easy to reproduce. There is a trivial example
    filesystem implementation in the fuse userspace distribution:
    fusexmp_fh.c. I added "sleep(1);" to the write methods, then
    recompiled and mounted it. Then I created a huge file on the mount
    point and ran a simple program which mmap-ed the file to a memory
    region and wrote data to the region. An hour later I observed almost
    all RAM consumed by fuse writeback. Since then some unrelated changes
    in kernel fuse have made it more difficult to reproduce, but it is
    still possible now.

    Putting this theoretical happens-in-the-lab thing aside, there is
    another thing that really hurts real-world (FUSE) users. This is the
    write-through page cache policy FUSE currently uses. I.e., handling
    write(2), kernel fuse populates the page cache and flushes user data
    to the server synchronously. This is excessively suboptimal. Pavel
    Emelyanov's patches ("writeback cache policy") solve the problem, but
    they also make resolving the NR_WRITEBACK_TEMP problem absolutely
    necessary. Otherwise, simply copying a huge file to a fuse mount would
    result in memory starvation. Miklos, the maintainer of FUSE, believes
    the strictlimit feature is the way to go.

    And eventually putting FUSE topics aside, there is one more use case
    for the strictlimit feature. Using a slow USB stick (mass storage) in
    a machine with a huge amount of RAM installed is a well-known pain.
    Let's do some simple computations. Assuming 64GB of RAM installed, the
    existing implementation of balance_dirty_pages will start throttling
    only after 9.6GB of RAM becomes dirty (freerun == 15% of total RAM).
    So the command "cp 9GB_file /media/my-usb-storage/" may return in a
    few seconds, but the subsequent "umount /media/my-usb-storage/" will
    take more than two hours if the effective throughput of the storage
    is, say, 1MB/sec.

    After inclusion of the strictlimit feature, it will be trivial to add
    a knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
    demand, manually or via a udev rule. Maybe I'm wrong, but it seems
    quite a natural desire to limit the amount of dirty memory for devices
    we do not fully trust (in the sense of sustainable throughput).

    [akpm@linux-foundation.org: fix warning in page-writeback.c]
    Signed-off-by: Maxim Patlasov
    Cc: Jan Kara
    Cc: Miklos Szeredi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
     
  • This patch is based on KOSAKI's work and I add a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    Currently, I found the system can enter a state where there are lots
    of free pages in a zone but only order-0 and order-1 pages, which
    means the zone is heavily fragmented. Then a high-order allocation
    could make the direct reclaim path stall for a long time (e.g., 60
    seconds), especially in a no-swap and no-compaction environment. This
    problem happened on v3.4, but it seems the issue still lives in the
    current tree; the reason is that do_try_to_free_pages enters a
    livelock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there's little point in trying
    all over again and wants to avoid an infinite loop. Instead it changes
    the order from high-order to 0-order because kswapd thinks order-0 is
    the most important. Look at 73ce02e9 in detail. If watermarks are ok,
    kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
    It assumes high-order users can still perform direct reclaim if they
    wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER without the oom-killer until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages
    forever while kswapd sleeps forever, until someone like a watchdog
    detects it and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so this way is
    racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone reclaimable state every time.

    Note: we can't take the idea that direct reclaim sees
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because it is racy. Commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
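
    The recalculated state, sketched (matching the heuristic this patch
    introduces):

        bool zone_reclaimable(struct zone *zone)
        {
                /* the zone is deemed unreclaimable once 6x its
                 * reclaimable pages have been scanned fruitlessly */
                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
        }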

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • This reverts commit 75f7ad8e043d. It was the result of a problem
    observed with a 3.2 kernel and merged in 3.9, while the issue had been
    resolved upstream in 3.3 (commit ab8fabd46f81: "mm: exclude reserved
    pages from dirtyable memory").

    The "reserved pages" are a superset of min_free_kbytes, thus this change
    is redundant and confusing. Revert it.

    Signed-off-by: Johannes Weiner
    Cc: Paul Szabo
    Cc: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

30 Apr, 2013

1 commit

  • Walking a bio's page mappings has proved problematic, so create a new
    bio flag to indicate that a bio's data needs to be snapshotted in order
    to guarantee stable pages during writeback. Next, for the one user
    (ext3/jbd) of snapshotting, hook all the places where writes can be
    initiated without PG_writeback set, and set BIO_SNAP_STABLE there.

    We must also flag journal "metadata" bios for stable writeout, since
    file data can be written through the journal. Finally, the
    MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
    rid of it.
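
    A sketch of the hook (the _submit_bh() flags plumbing is per the akpm
    note below; the call sites shown are illustrative):

        /* jbd: journal writes may carry file data, so snapshot them */
        _submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);

        /* bounce code: honor the flag even if the queue wouldn't bounce */
        if (bio_flagged(bio, BIO_SNAP_STABLE))
                must_bounce = 1;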

    [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
    [akpm@linux-foundation.org: teeny cleanup]
    Signed-off-by: Darrick J. Wong
    Cc: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

01 Mar, 2013

2 commits

  • Pull writeback fixes from Wu Fengguang:
    "Two writeback fixes

    - fix negative (setpoint - dirty) in 32bit archs

    - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    Negative (setpoint-dirty) in bdi_position_ratio()
    vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them

    Linus Torvalds
     
  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • When calculating amount of dirtyable memory, min_free_kbytes should be
    subtracted because it is not intended for dirty pages.

    Addresses http://bugs.debian.org/695182

    [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
    [akpm@linux-foundation.org: fix min() warning]
    Signed-off-by: Paul Szabo
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Szabo
     

22 Feb, 2013

2 commits

  • This provides a band-aid to provide stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize their page snapshotting thing to one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Create a helper function to check if a backing device requires stable
    page writes and, if so, performs the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.
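
    A sketch of the helper described above (name per the commit; field
    names assumed from the 3.8-era bdi API):

        void wait_for_stable_page(struct page *page)
        {
                struct backing_dev_info *bdi =
                        page_mapping(page)->backing_dev_info;

                /* block only when the device demands stable pages */
                if (bdi_cap_stable_pages_required(bdi))
                        wait_on_page_writeback(page);
        }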

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        109347     0.028    59.817
    ReadX         347180     0.004     3.391
    Flush          15514    29.828   287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        105556     0.029     4.273
    ReadX         335004     0.005     4.112
    Flush          14982    30.540   298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

24 Jan, 2013

1 commit

  • In bdi_position_ratio(), get difference (setpoint-dirty) right even when
    negative. Both setpoint and dirty are unsigned long, the difference was
    zero-padded thus wrongly sign-extended to s64. This issue affects all
    32-bit architectures, does not affect 64-bit architectures where long
    and s64 are equivalent.

    In this function, dirty is between freerun and limit, the pseudo-float x
    is between [-1,1], expected to be negative about half the time. With
    zero-padding, instead of a small negative x we obtained a large positive
    one so bdi_position_ratio() returned garbage.

    Casting the difference to s64 also prevents overflow with left-shift;
    though normally these numbers are small and I never observed a 32-bit
    overflow there.

    (This patch does not solve the PAE OOM issue.)
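
    The cast, sketched (expression per the description; RATELIMIT_CALC_SHIFT
    is the existing pseudo-float scale):

        /* before: unsigned difference, zero-extended to s64 on 32-bit */
        x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
                    limit - setpoint + 1);

        /* after: a genuinely signed difference, and no shift overflow */
        x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
                    limit - setpoint + 1);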

    Paul Szabo psz@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
    School of Mathematics and Statistics University of Sydney Australia

    Reviewed-by: Jan Kara
    Reported-by: Paul Szabo
    Reference: http://bugs.debian.org/695182
    Signed-off-by: Paul Szabo
    Signed-off-by: Fengguang Wu

    paul.szabo@sydney.edu.au
     

14 Jan, 2013

1 commit

  • Add tracepoints for page dirtying, writeback_single_inode start, inode
    dirtying and writeback. For the latter two inode events, a pair of
    events are defined to denote start and end of the operations (the
    starting one has _start suffix and the one w/o suffix happens after
    the operation is complete). These inode ops are FS specific and can
    be non-trivial and having enclosing tracepoints is useful for external
    tracers.
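
    A sketch of how the paired inode events bracket the operation (the
    placement is assumed, e.g. around ->dirty_inode() in
    __mark_inode_dirty()):

        trace_writeback_dirty_inode_start(inode, flags);
        sb->s_op->dirty_inode(inode, flags);
        trace_writeback_dirty_inode(inode, flags);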

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for the io tracer and userland.

    v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL
    before dereferencing.

    v3: buffer dirtying moved to a block TP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Dec, 2012

1 commit

  • The system uses global_dirtyable_memory() to calculate the number of
    dirtyable pages, i.e. pages that can be allocated to the page cache.
    A bug causes an underflow, making the page count look like a big
    unsigned number. This in turn confuses the dirty writeback throttling to
    aggressively write back pages as they become dirty (usually 1 page at a
    time). This generally only affects systems with highmem because the
    underflowed count gets subtracted from the global count of dirtyable
    memory.

    The problem was introduced with v3.2-4896-gab8fabd

    Fix is to ensure we don't get an underflowed total of either highmem or
    global dirtyable memory.
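
    A sketch of the clamping (the min() against the running total is the
    crux; names assumed from the surrounding code):

        /* never subtract more than is there ... */
        x -= min(x, dirty_balance_reserve);

        /* ... and never report more highmem than the total
         * it was derived from */
        return min(x, total);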

    Signed-off-by: Sonny Rao
    Signed-off-by: Puneet Kumar
    Acked-by: Johannes Weiner
    Tested-by: Damien Wyart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonny Rao
     

04 Aug, 2012

1 commit

  • Finally we can kill the 'sync_supers' kernel thread along with the
    '->write_super()' superblock operation because all the users are gone.
    Now every file-system is supposed to self-manage its own superblock
    and its dirty state.

    The nice thing about killing this thread is that it improves power
    management. Indeed, 'sync_supers' is a source of monotonic system
    wake-ups - it woke up every 5 seconds no matter what - even if there
    were no dirty superblocks and even if there were no file-systems using
    this service (e.g., btrfs and journalled ext4 do not need it). So it
    was wasting power most of the time. And because the thread was in the
    core of the kernel, all systems had to have it. So I am quite happy to
    make it go away.

    Interestingly, this thread is a left-over from the pdflush kernel
    thread, which was a self-forking kernel thread responsible for all the
    write-back in old Linux kernels. It was turned into per-block-device
    BDI threads, and 'sync_supers' was a left-over. Thus, R.I.P., pdflush
    as well.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     

06 May, 2012

1 commit

  • This prevents global_dirty_limit from remaining 0 (the initial value)
    for long time, since it's only updated in update_dirty_limit() when
    above the dirty freerun area.

    It will avoid unexpected consequences when some random code uses it as
    a convenient approximation of the global dirty threshold.

    Signed-off-by: Fengguang Wu

    Fengguang Wu