05 Jun, 2014

40 commits

  • [akpm@linux-foundation.org: don't overwrite kstrtoull()'s errno]
    Signed-off-by: Fabian Frederick
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Signed-off-by: Fabian Frederick
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Signed-off-by: Fabian Frederick
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • This patch also fixes one function declaration over 80 characters.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Fix checkpatch warnings about EXPORT_SYMBOL and return()

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • - EXPORT_SYMBOL

    - typo: unexpectidly->unexpectedly

    - function prototype over 80 characters

    Signed-off-by: Fabian Frederick
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
    printk() calls with no level are converted to pr_warn() (error path)
    and pr_info() (disabling non-boot cpus); other printk() calls are
    converted to their respective levels (sketched below).
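
    A representative conversion, as a sketch (these are illustrative
    lines, not the exact hunks of this patch):

        /* before: bare printk, no level, in the error path */
        printk("Error taking CPU%d down: %d\n", cpu, error);
        /* after: explicit warning level */
        pr_warn("Error taking CPU%d down: %d\n", cpu, error);

        /* before: bare printk when disabling non-boot CPUs */
        printk("Disabling non-boot CPUs ...\n");
        /* after: informational level */
        pr_info("Disabling non-boot CPUs ...\n");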

    Signed-off-by: Fabian Frederick
    Cc: "Rafael J. Wysocki"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Usually, BUG_ON and friends aren't even evaluated in sparse, but recently
    compiletime_assert_atomic_type() was added, and that now results in a
    sparse warning every time it is used.

    The reason turns out to be the temporary variable: after it, sparse
    no longer considers the value to be a constant, and emits a warning
    and an error. The error is the more annoying part of this, as it
    suppresses any further warnings in the same file, hiding other
    problems.

    Unfortunately the condition cannot simply be expanded out to avoid
    the temporary variable, since that breaks compiletime_assert on old
    versions of GCC such as GCC 4.2.4, which the latest metag compiler
    is based on.

    Therefore #ifndef __CHECKER__ out the __compiletime_error_fallback,
    which uses the potentially negative-size array to trigger a
    conditional compile error, so that sparse doesn't see it.

    Signed-off-by: James Hogan
    Cc: Johannes Berg
    Cc: Daniel Santos
    Cc: Luciano Coelho
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Johannes Berg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Hogan
     
Fix two typos in function comments.

    Signed-off-by: Fabian Frederick
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • ...like other filesystems.

    Signed-off-by: Fabian Frederick
    Cc: Matthew Garrett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • sys_sgetmask and sys_ssetmask are obsolete system calls no longer
    supported in libc.

    This patch replaces the architecture-specific __ARCH_WANT_SYS_SGETMASK
    with an expert-mode configuration option. That option is enabled by
    default on those architectures.

    Signed-off-by: Fabian Frederick
    Cc: Steven Miao
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Koichi Yasutake
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Greg Ungerer
    Cc: Heiko Carstens
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • zswap_dstmem is a percpu block of memory, which should be allocated using
    kmalloc_node(), to get better NUMA locality.

    Without it, all the blocks are allocated from a single node.
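
    A sketch of the change inside the per-CPU setup path (names follow
    the description above; the size is zswap's two-page destination
    buffer):

        /* allocate each CPU's compression buffer on that CPU's node */
        u8 *dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
        if (!dst)
            return NOTIFY_BAD;
        per_cpu(zswap_dstmem, cpu) = dst;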

    Signed-off-by: Eric Dumazet
    Acked-by: Seth Jennings
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
    Now we can build zsmalloc as a module because unmap_kernel_range()
    was exported.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    zsmalloc needs unmap_kernel_range() exported in order to be built as
    a module. See https://lkml.org/lkml/2013/1/18/487

    I didn't send a patch to make unmap_kernel_range() exportable at
    that time because zram was staging stuff, and I thought exporting VM
    functions for staging stuff made no sense.

    Now zsmalloc has been promoted. If we can't build zsmalloc as a
    module, we can't build zram as a module either. Additionally, its
    buddy map_vm_area() is already exported, so let's export
    unmap_kernel_range() to help its buddy.
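
    The change itself is a one-line export in mm/vmalloc.c (assuming a
    GPL-only export, as is typical for VM internals):

        EXPORT_SYMBOL_GPL(unmap_kernel_range);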

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    According to the calculation, the ZS_SIZE_CLASSES value is 255 on
    systems with a 4K page size, not 254. The old value apparently
    forgot to count the class at ZS_MIN_ALLOC_SIZE.

    This patch fixes this trivial issue in the comments.
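
    Rough arithmetic, assuming the usual zsmalloc constants on a 4K-page
    system (ZS_MIN_ALLOC_SIZE = 32, ZS_SIZE_CLASS_DELTA = 16, maximum
    allocation = PAGE_SIZE):

        (4096 - 32) / 16 + 1 = 254 + 1 = 255

    The "+ 1" is the size class at ZS_MIN_ALLOC_SIZE itself, which the
    old comment apparently dropped.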

    Signed-off-by: Weijie Yang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
    zbud_alloc() is only called by zswap_frontswap_store() with an
    unsigned int len. Change the function parameter accordingly and
    update the >= 0 check.

    Signed-off-by: Fabian Frederick
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
    We want to skip a physical block (PAGE_SIZE) which is only partially
    covered by the discard bio, so we check the remaining size and
    subtract it if there is a need to move to the next physical block.

    The current offset usage in zram_bio_discard() is incorrect and will
    cause the filesystem on top of it to break down. Consider the
    following scenario:

    On some architecture or config, PAGE_SIZE is 64K, for example. A
    filesystem is set up on the zram disk without PAGE_SIZE alignment,
    and a discard bio arrives with offset = 4K and size = 72K. Normally
    it should not discard any physical block, as it only partially
    covers two physical blocks. With the current offset usage, however,
    it will discard the second physical block and free its memory, which
    will break the filesystem.

    This patch corrects the offset usage in zram_bio_discard.
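
    A sketch of the corrected logic (variable names follow the
    description above; this is not the verbatim hunk):

        if (offset) {
            /* the bio only partially covers the first physical block;
             * skip it rather than discarding it */
            if (n <= (PAGE_SIZE - offset))
                return;
            n -= (PAGE_SIZE - offset);
            index++;
        }
        while (n >= PAGE_SIZE) {
            /* every remaining full block is now safe to free */
            zram_free_page(zram, index);
            index++;
            n -= PAGE_SIZE;
        }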

    Signed-off-by: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Acked-by: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Cc: Bob Liu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
    mem_cgroup_force_empty_list() can iterate a large number of pages on
    an lru, and mem_cgroup_move_parent() doesn't return an errno unless
    certain criteria are met, none of which indicate that the iteration
    may be taking too long.

    We have encountered the following stack trace many times indicating
    "need_resched set for > 51000020 ns (51 ticks) without schedule", for
    example:

    scheduler_tick()
    mem_cgroup_move_account+0x4d/0x1d5
    mem_cgroup_move_parent+0x8d/0x109
    mem_cgroup_reparent_charges+0x149/0x2ba
    mem_cgroup_css_offline+0xeb/0x11b
    cgroup_offline_fn+0x68/0x16b
    process_one_work+0x129/0x350

    If this iteration is taking too long, we still need to do cond_resched()
    even when an individual page is not busy.
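
    A structural sketch of the fix, moving the reschedule out of the
    busy-page branch (the loop body is abbreviated to the relevant
    part):

        if (ret == -EBUSY || ret == -EINVAL) {
            /* found lock contention or "pc" is obsolete */
            busy = page;
        } else
            busy = NULL;
        cond_resched();    /* previously only reached on -EBUSY */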

    [rientjes@google.com: changelog]
    Signed-off-by: Hugh Dickins
    Signed-off-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    The existing description is worded in a way that almost encourages
    setting vfs_cache_pressure above 100, possibly way above it.

    Users are left in the dark about what this numeric value is: an int?
    A percentage? What is the scale?

    As a result, we are getting reports about noticeable performance
    degradation from users who have set vfs_cache_pressure to ridiculously
    high values - because they thought there is no downside to it.

    Via code inspection it's obvious that this value is treated as a
    percentage. This patch changes text to reflect this fact, and adds a
    cautionary paragraph advising against setting vfs_cache_pressure sky high.

    Signed-off-by: Denys Vlasenko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
    Currently the memory error handler handles action-optional errors in
    a deferred manner by default, and a recovery-aware application that
    wants to handle them immediately can do so by setting the
    PF_MCE_EARLY flag. However, such a signal can be sent only to the
    main thread, which is problematic if the application wants a
    dedicated thread to handle such signals.

    So this patch adds dedicated-thread support to the memory error
    handler. We have a PF_MCE_EARLY flag for each thread separately, so
    with this patch the AO signal is sent to the thread with
    PF_MCE_EARLY set, not to the main thread. To implement a dedicated
    thread, call prctl() to set PF_MCE_EARLY on that thread (see the
    sketch after Tony's note below).

    The memory error handler collects processes to be killed, so this
    patch makes it check the PF_MCE_EARLY flag on each thread in the
    collecting routines.

    No behavioral change for all non-early kill cases.

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.
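
    For illustration, a minimal sketch of how a dedicated handler thread
    might opt in (assuming the thread installs its own SIGBUS handler
    separately):

        #include <sys/prctl.h>

        /* run in the dedicated memory-error handler thread */
        prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);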

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When Linux sees an "action optional" machine check (where h/w has reported
    an error that is not in the current execution path) we generally do not
    want to signal a process, since most processes do not have a SIGBUS
    handler - we'd just prematurely terminate the process for a problem that
    they might never actually see.

    task_early_kill() decides whether to consider a process - and it checks
    whether this specific process has been marked for early signals with
    "prctl", or if the system administrator has requested early signals for
    all processes using /proc/sys/vm/memory_failure_early_kill.

    But for the MF_ACTION_REQUIRED case we must not defer. The error is
    in the execution path of the current thread, so we must send the
    SIGBUS immediately.

    Fix by passing a flag argument through collect_procs*() to
    task_early_kill() so it knows whether we can defer or must take action.
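
    A sketch of the resulting helper (close to, but not necessarily
    verbatim, the kernel code):

        static int task_early_kill(struct task_struct *tsk, int force_early)
        {
            if (!tsk->mm)
                return 0;
            if (force_early)
                return 1;       /* MF_ACTION_REQUIRED: never defer */
            if (tsk->flags & PF_MCE_PROCESS)
                return !!(tsk->flags & PF_MCE_EARLY);
            return sysctl_memory_failure_early_kill;
        }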

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • When a thread in a multi-threaded application hits a machine check because
    of an uncorrectable error in memory - we want to send the SIGBUS with
    si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
    if the active thread is not the primary thread in the process.
    collect_procs() just finds primary threads and this test:

    if ((flags & MF_ACTION_REQUIRED) && t == current) {

    will see that the thread we found isn't the current thread, and so
    it sends si.si_code = BUS_MCEERR_AO to the primary thread (and
    nothing to the active thread at this time).

    We can fix this by checking whether "current" shares the same mm with the
    process that collect_procs() said owned the page. If so, we send the
    SIGBUS to current (with code BUS_MCEERR_AR).
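
    A sketch of the fixed test inside the signalling path (hedged; t is
    the thread that collect_procs() found):

        if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
            /* the active thread shares the mm that owns the page */
            si.si_code = BUS_MCEERR_AR;
            ret = force_sig_info(SIGBUS, &si, current);
        } else {
            si.si_code = BUS_MCEERR_AO;
            ret = send_sig_info(SIGBUS, &si, t);
        }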

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Reported-by: Otto Bruggeman
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
    There is an orphaned prehistoric comment which used to sit against
    get_dirty_limits(), the forerunner of global_dirtyable_memory().

    Back then the implementation of get_dirty_limits() was complicated
    and full of magic numbers, so the comment was necessary. But we now
    use the clear and neat global_dirtyable_memory(), which renders the
    comment ambiguous and useless. Remove it.

    Signed-off-by: Jianyu Zhan
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
    Via commit ebc2a1a69111 ("swap: make cluster allocation per-cpu"),
    all SWP_SOLIDSTATE "seek is cheap" (SSD) allocations already go
    through the si->cluster_info scan_swap_map_try_ssd_cluster() path,
    so the "last_in_cluster < scan_base" loop in the body of
    scan_swap_map() has become a dead code snippet and should be
    deleted.

    This patch deletes the redundant loop, as Hugh and Shaohua
    suggested.

    [hughd@google.com: fix comment, simplify code]
    Signed-off-by: Chen Yucong
    Cc: Shaohua Li
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
    We already have a function named hugepages_supported(), and the
    similarly named hugepage_migration_support() sits a bit awkwardly
    next to it, so let's rename it to hugepage_migration_supported().

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Some clarification on how faultaround works.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    There is evidence that the faultaround feature is less relevant on
    architectures with a page size bigger than 4k, which makes sense
    since the page fault overhead per byte of mapped area should be
    lower there.

    Let's rework the feature to specify faultaround area in bytes instead of
    page order. It's 64 kilobytes for now.

    The patch effectively disables faultaround on architectures with page size
    >= 64k (like ppc64).

    It's possible that some other faultaround area size is relevant for
    a platform. We can expose the `fault_around_bytes' variable to
    arch-specific code once such platforms are found.
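
    A sketch of the reworked knob (matching the 64-kilobyte default
    described above):

        static unsigned long fault_around_bytes = 65536;

        static inline unsigned long fault_around_pages(void)
        {
            /* on 64K-page systems this rounds down to a single page,
             * i.e. faultaround is effectively disabled */
            return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
        }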

    Signed-off-by: Kirill A. Shutemov
    Cc: Rusty Russell
    Cc: Hugh Dickins
    Cc: Madhavan Srinivasan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    add_active_range() has been replaced by memblock_set_node(). Clean
    up the comments to reflect that change.

    Signed-off-by: Zhang Zhen
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     
    Transform the action part of ttu_flags into individual bits. These
    flags aren't part of any user-space visible API or even trace
    events; a sketch of the resulting layout follows.
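
    As a sketch, the action modes become individual low bits while the
    modifier flags stay in the high bits (values hedged to the resulting
    header):

        enum ttu_flags {
            TTU_UNMAP = 1,                   /* unmap mode */
            TTU_MIGRATION = 2,               /* migration mode */
            TTU_MUNLOCK = 4,                 /* munlock mode */

            TTU_IGNORE_MLOCK = (1 << 8),     /* ignore mlock */
            TTU_IGNORE_ACCESS = (1 << 9),    /* don't age */
            TTU_IGNORE_HWPOISON = (1 << 10), /* corrupted page is recoverable */
        };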

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    In its munlock mode, try_to_unmap_one() searches other mlocked vmas;
    it never unmaps pages. There is no reason for invalidation because
    the ptes are left unchanged.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    CONFIG_CROSS_MEMORY_ATTACH adds a couple of syscalls,
    process_vm_readv and process_vm_writev: a kind of IPC for copying
    data between processes. Currently this option is placed inside
    "Processor type and features".

    This patch moves it into "General setup" (where all the other
    arch-independent syscalls and IPC features are placed) and changes
    the prompt string to something less cryptic.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Christopher Yeoh
    Cc: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim. The intent was to minimse
    direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers. Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero
    that the entirety of the other list was shrunk leading to excessive
    reclaim in memcgs. This patch scans the file/anon lists proportionally
    for direct reclaim to similarly age page whether reclaimed by kswapd or
    direct reclaim but takes care to abort reclaim if one LRU drops to zero
    after reclaiming the requested number of pages.

    Based on ext4 and using the Intel VM scalability test

    3.15.0-rc5 (shrinker)   3.15.0-rc5 (proportion)
    Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
    Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
    Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
    Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
    Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
    Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

    The test cases run multiple dd instances reading sparse files. The
    results are within the noise for this small test machine. The
    impact of the patch is more noticeable in the vmstats:

    3.15.0-rc5 (shrinker)   3.15.0-rc5 (proportion)
    Minor Faults 35154 36784
    Major Faults 611 1305
    Swap Ins 394 1651
    Swap Outs 4394 5891
    Allocation stalls 118616 44781
    Direct pages scanned 4935171 4602313
    Kswapd pages scanned 15921292 16258483
    Kswapd pages reclaimed 15913301 16248305
    Direct pages reclaimed 4933368 4601133
    Kswapd efficiency 99% 99%
    Kswapd velocity 670088.047 682555.961
    Direct efficiency 99% 99%
    Direct velocity 207709.217 193212.133
    Percentage direct scans 23% 22%
    Page writes by reclaim 4858.000 6232.000
    Page writes file 464 341
    Page writes anon 4394 5891

    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is very approximately the same.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tim Chen
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    We remove the call to grab_super_passive() in super_cache_count().
    This had become a scalability bottleneck when multiple threads were
    doing memory reclamation, e.g. during large amounts of file reads
    while the page cache is under pressure: the cached objects quickly
    get reclaimed down to 0 and we abort the cache_scan() reclaim, but
    the counting creates a logjam acquiring the sb_lock.

    We are holding the shrinker_rwsem, which ensures the safety of the
    calls to list_lru_count_node() and s_op->nr_cached_objects. The
    shrinker is now unregistered before ->kill_sb(), so the operation is
    safe during unmount (see the sketch below).
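
    A sketch of the resulting counting path (trimmed; assumes the
    shrinker is embedded in the superblock as s_shrink):

        static unsigned long super_cache_count(struct shrinker *shrink,
                                               struct shrink_control *sc)
        {
            struct super_block *sb =
                container_of(shrink, struct super_block, s_shrink);
            long total_objects = 0;

            /* previously: if (!grab_super_passive(sb)) return 0;
             * shrinker_rwsem, held by our caller, now keeps sb alive */
            if (sb->s_op && sb->s_op->nr_cached_objects)
                total_objects = sb->s_op->nr_cached_objects(sb, sc->nid);

            total_objects += list_lru_count_node(&sb->s_dentry_lru, sc->nid);
            total_objects += list_lru_count_node(&sb->s_inode_lru, sc->nid);

            return vfs_pressure_ratio(total_objects);
        }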

    The impact will depend heavily on the machine and the workload but for a
    small machine using postmark tuned to use 4xRAM size the results were

    3.15.0-rc5 (vanilla)   3.15.0-rc5 (shrinker-v1r1)
    Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%)

    ffsb running in a configuration that is meant to simulate a mail server showed

    3.15.0-rc5 (vanilla)   3.15.0-rc5 (shrinker-v1r1)
    Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%)
    Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%)
    Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%)
    Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%)
    Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%)

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
    This series is aimed at regressions noticed during reclaim activity.
    The first two patches are shrinker patches that were posted ages ago
    but never merged, for reasons that are unclear to me. I'm posting
    them again to see if there was a reason they were dropped or if they
    just got lost. Dave? Tim? The last patch adjusts proportional
    reclaim. Yuanhan Liu, can you retest the vm scalability test cases
    on a larger machine? Hugh, does this work for you on the memcg test
    cases?

    Based on ext4, I get the following results but unfortunately my larger
    test machines are all unavailable so this is based on a relatively small
    machine.

    postmark
    3.15.0-rc5 (vanilla)   3.15.0-rc5 (proportion-v1r4)
    Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%)

    ffsb (mail server simulator)
    3.15.0-rc5 (vanilla)   3.15.0-rc5 (proportion-v1r4)
    Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%)
    Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%)
    Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%)
    Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%)
    Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%)

    dd of a large file
    3.15.0-rc5 (vanilla)   3.15.0-rc5 (proportion-v1r4)
    WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%)
    WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%)
    WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%)

    stutter (times mmap latency during large amounts of IO)

    3.15.0-rc5 (vanilla)   3.15.0-rc5 (proportion-v1r4)
    Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%)
    Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%)
    Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%)
    Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%)
    Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%)
    Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%)
    Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%)
    Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%)
    Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%)
    Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%)
    Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%)
    Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%)
    Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%)
    Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%)

    This patch (of 3):

    We would like to unregister the sb shrinker before ->kill_sb(). This
    allows cached objects to be counted without a call to
    grab_super_passive() to update the ref count on the sb. We want to
    avoid locking during memory reclamation, especially when we are
    skipping the memory reclaim because we are out of cached objects.

    This is safe because grab_super_passive() now does a try-lock on
    sb->s_umount, so if we are in the unmount process it won't ever
    block. That means the deadlock and the races we used to avoid with
    grab_super_passive() now look like this:

    shrinker                               umount

    down_read(shrinker_rwsem)
                                           down_write(sb->s_umount)
                                           shrinker_unregister
                                             down_write(shrinker_rwsem)
    grab_super_passive(sb)
      down_read_trylock(sb->s_umount)

    ....
    up_read(shrinker_rwsem)

                                             up_write(shrinker_rwsem)
                                           ->kill_sb()
                                           ....

    So it is safe to deregister the shrinker before ->kill_sb().
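
    The reordering itself is small; a sketch of deactivate_locked_super()
    after the change:

        /* unregister the shrinker first, so ->kill_sb() can never race
         * with a concurrent count/scan on this superblock */
        unregister_shrinker(&s->s_shrink);
        fs->kill_sb(s);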

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • msync() currently syncs more than POSIX requires or BSD or Solaris
    implement. It is supposed to be equivalent to fdatasync(), not fsync(),
    and it is only supposed to sync the portion of the file that overlaps the
    range passed to msync.

    If the VMA is non-linear, fall back to syncing the entire file, but we
    still optimise to only fdatasync() the entire file, not the full fsync().

    akpm: there are obvious concerns with back-compatibility: is anyone
    relying on the undocumented side-effect for their data integrity?
    And how would they ever know if this change broke their data
    integrity?

    We think the risk is reasonably low, and this patch brings the kernel into
    line with other OS's and with what the manpage has always said...
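
    For illustration, a minimal (hypothetical) user of the POSIX
    semantics; only pages overlapping the flushed range are written
    back, with fdatasync-like behaviour. Error handling is omitted and
    data.bin is assumed to be at least four pages long:

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>

        int main(void)
        {
            long pg = sysconf(_SC_PAGESIZE);
            int fd = open("data.bin", O_RDWR);
            char *map = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

            memcpy(map + pg, "hello", 5);
            /* addr must be page-aligned; the length is rounded up to
             * page granularity by the kernel */
            msync(map + pg, 5, MS_SYNC);

            munmap(map, 4 * pg);
            close(fd);
            return 0;
        }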

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    Remove the unused global variable mce_entry and the related
    operations in do_machine_check().

    Signed-off-by: Chen Yucong
    Cc: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Compaction uses compact_checklock_irqsave() function to periodically check
    for lock contention and need_resched() to either abort async compaction,
    or to free the lock, schedule and retake the lock. When aborting,
    cc->contended is set to signal the contended state to the caller. Two
    problems have been identified in this mechanism.

    First, compaction also calls cond_resched() directly in both
    scanners when no lock is yet taken. This call neither aborts async
    compaction nor sets cc->contended appropriately. This patch
    introduces a new compact_should_abort() function to achieve both.
    In isolate_freepages(), the check frequency is reduced to once per
    SWAP_CLUSTER_MAX pageblocks to match what the migration scanner does
    in the preliminary page checks. In case a pageblock is found
    suitable for calling isolate_freepages_block(), the checks within
    there are done at a higher frequency.

    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock. This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced. This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches. This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.

    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK
    since we do not want to waste the work that has been done, and page
    migration has its own checks for contention. However, we do not
    want another isolation attempt by either of the scanners, so a
    cc->contended flag check is added also to compaction_alloc() and
    compact_finished() to make sure compaction is aborted right after
    the migration.

    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.
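
    A sketch of the new helper (field and mode names hedged to the
    description above, assuming the compaction mode is tracked in
    cc->mode):

        static inline bool compact_should_abort(struct compact_control *cc)
        {
            /* async compaction aborts instead of rescheduling */
            if (need_resched()) {
                if (cc->mode == MIGRATE_ASYNC) {
                    cc->contended = true;
                    return true;
                }
                cond_resched();
            }
            return false;
        }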

    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: Michal Nazarewicz
    Tested-by: Shawn Guo
    Tested-by: Kevin Hilman
    Tested-by: Stephen Warren
    Tested-by: Fabio Estevam
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Fix checkpatch warning:
    WARNING: kfree(NULL) is safe this check is probably not required

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick