05 Jun, 2014

40 commits

  • Transform the action part of ttu_flags into individual bits. These flags
    aren't part of any user-space visible API or even trace events.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • In its munlock mode, try_to_unmap_one() only searches other mlocked vmas;
    it never unmaps pages. There is no reason for invalidation because the
    ptes are left unchanged.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • CONFIG_CROSS_MEMORY_ATTACH adds a couple of syscalls, process_vm_readv and
    process_vm_writev; it's a kind of IPC for copying data between processes.
    Currently this option is placed inside "Processor type and features".

    This patch moves it into "General setup" (where all other arch-independent
    syscalls and IPC features are placed) and changes the prompt string to
    something less cryptic.
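
    As a rough illustration of what the option enables (not from the patch
    itself), a caller can copy memory straight out of another process's
    address space; the remote address below is hypothetical:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/types.h>
        #include <sys/uio.h>    /* process_vm_readv(), struct iovec */

        /* Read 256 bytes from a (hypothetical) address in another process. */
        int read_remote(pid_t pid, void *remote_addr)
        {
                char buf[256];
                struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
                struct iovec remote = { .iov_base = remote_addr,
                                        .iov_len  = sizeof(buf) };
                ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);

                if (n < 0) {
                        perror("process_vm_readv");
                        return -1;
                }
                printf("copied %zd bytes from pid %d\n", n, (int)pid);
                return 0;
        }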

    Signed-off-by: Konstantin Khlebnikov
    Cc: Christopher Yeoh
    Cc: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim. The intent was to minimse
    direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers. Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero
    that the entirety of the other list was shrunk leading to excessive
    reclaim in memcgs. This patch scans the file/anon lists proportionally
    for direct reclaim to similarly age page whether reclaimed by kswapd or
    direct reclaim but takes care to abort reclaim if one LRU drops to zero
    after reclaiming the requested number of pages.

    Based on ext4 and using the Intel VM scalability test

    3.15.0-rc5 3.15.0-rc5
    shrinker proportion
    Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
    Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
    Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
    Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
    Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
    Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)

    The test cases are running multiple dd instances reading sparse files.
    The results are within the noise for the small test machine. The impact
    of the patch is more noticeable from the vmstats:

    3.15.0-rc5 3.15.0-rc5
    shrinker proportion
    Minor Faults 35154 36784
    Major Faults 611 1305
    Swap Ins 394 1651
    Swap Outs 4394 5891
    Allocation stalls 118616 44781
    Direct pages scanned 4935171 4602313
    Kswapd pages scanned 15921292 16258483
    Kswapd pages reclaimed 15913301 16248305
    Direct pages reclaimed 4933368 4601133
    Kswapd efficiency 99% 99%
    Kswapd velocity 670088.047 682555.961
    Direct efficiency 99% 99%
    Direct velocity 207709.217 193212.133
    Percentage direct scans 23% 22%
    Page writes by reclaim 4858.000 6232.000
    Page writes file 464 341
    Page writes anon 4394 5891

    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is very approximately the same.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tim Chen
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the call to grab_super_passive() from super_cache_count(). It
    becomes a scalability bottleneck when multiple threads are trying to do
    memory reclamation, e.g. when we are doing a large amount of file reads
    and the page cache is under pressure. The cached objects quickly get
    reclaimed down to 0 and we abort the cache_scan() reclaim, but the
    counting creates a log jam acquiring the sb_lock.

    We are holding the shrinker_rwsem, which ensures the safety of the calls
    to list_lru_count_node() and s_op->nr_cached_objects. The shrinker is now
    unregistered before ->kill_sb(), so the operation is safe during unmount.

    The impact will depend heavily on the machine and the workload but for a
    small machine using postmark tuned to use 4xRAM size the results were

    3.15.0-rc5 3.15.0-rc5
    vanilla shrinker-v1r1
    Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%)

    ffsb running in a configuration that is meant to simulate a mail server showed

    3.15.0-rc5 3.15.0-rc5
    vanilla shrinker-v1r1
    Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%)
    Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%)
    Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%)
    Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%)
    Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%)

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • This series is aimed at regressions noticed during reclaim activity. The
    first two patches are shrinker patches that were posted ages ago but never
    merged for reasons that are unclear to me. I'm posting them again to see
    if there was a reason they were dropped or if they just got lost. Dave?
    Tim? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
    retest the vm scalability test cases on a larger machine? Hugh, does this
    work for you on the memcg test cases?

    Based on ext4, I get the following results but unfortunately my larger
    test machines are all unavailable so this is based on a relatively small
    machine.

    postmark
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%)
    Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%)
    Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
    Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
    Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
    Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%)
    Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%)

    ffsb (mail server simulator)
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%)
    Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%)
    Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%)
    Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%)
    Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%)
    Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%)

    dd of a large file
    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%)
    WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%)
    WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%)

    stutter (times mmap latency during large amounts of IO)

    3.15.0-rc5 3.15.0-rc5
    vanilla proportion-v1r4
    Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%)
    Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%)
    Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%)
    Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%)
    Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%)
    Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%)
    Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%)
    Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%)
    Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%)
    Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%)
    Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%)
    Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%)
    Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%)
    Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%)

    This patch (of 3):

    We would like to unregister the sb shrinker before ->kill_sb(). This will
    allow cached objects to be counted without calling grab_super_passive() to
    update the ref count on the sb. We want to avoid locking during memory
    reclamation, especially when we are skipping the reclaim because we are
    out of cached objects.

    This is safe because grab_super_passive() does a try-lock on
    sb->s_umount now, and so if we are in the unmount process it won't ever
    block. That means the deadlock and races we used to avoid with
    grab_super_passive() now look like this:

    shrinker                                umount

    down_read(shrinker_rwsem)
                                            down_write(sb->s_umount)
                                            shrinker_unregister
                                              down_write(shrinker_rwsem)
                                                <blocked>
    grab_super_passive(sb)
      down_read_trylock(sb->s_umount)
        <fails>
    <shrinker aborts>
    ....
    <shrinkers finish running>
    up_read(shrinker_rwsem)
                                                <unblocked>
                                                <removes shrinker>
                                              up_write(shrinker_rwsem)
                                            ->kill_sb()
                                            ....

    So it is safe to deregister the shrinker before ->kill_sb().

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • msync() currently syncs more than POSIX requires or BSD or Solaris
    implement. It is supposed to be equivalent to fdatasync(), not fsync(),
    and it is only supposed to sync the portion of the file that overlaps the
    range passed to msync.

    If the VMA is non-linear, fall back to syncing the entire file, but we
    still optimise to only fdatasync() the entire file, not the full fsync().
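
    For illustration only (not from the patch), this is the POSIX-described
    usage the change aligns with: dirty part of a mapping and sync just that
    range, roughly a ranged fdatasync():

        #include <string.h>
        #include <sys/mman.h>

        /* Dirty the first page of a file-backed mapping and write back only
         * that range of the file. */
        int flush_first_page(void *map, long pagesize)
        {
                memcpy(map, "hello", 5);
                return msync(map, pagesize, MS_SYNC);
        }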

    akpm: there are obvious concerns with back-compatibility: is anyone relying
    on the undocumented side-effect for their data integrity? And how would
    they ever know if this change broke their data integrity?

    We think the risk is reasonably low, and this patch brings the kernel into
    line with other OS's and with what the manpage has always said...

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Remove the unused global variable mce_entry and the related operations in
    do_machine_check().

    Signed-off-by: Chen Yucong
    Cc: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Compaction uses compact_checklock_irqsave() function to periodically check
    for lock contention and need_resched() to either abort async compaction,
    or to free the lock, schedule and retake the lock. When aborting,
    cc->contended is set to signal the contended state to the caller. Two
    problems have been identified in this mechanism.

    First, compaction also calls cond_resched() directly in both scanners when
    no lock has yet been taken. This call neither aborts async compaction nor
    sets cc->contended appropriately. This patch introduces a new
    compact_should_abort() function to achieve both. In isolate_freepages(),
    the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks to
    match what the migration scanner does in the preliminary page checks. In
    case a pageblock is found suitable for calling isolate_freepages_block(),
    the checks within it are done at a higher frequency.
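
    A minimal sketch of such a helper, assuming the compaction control
    structure carries a migration mode and a contended flag (names are
    illustrative, not necessarily the patch's exact code):

        /* Illustrative sketch: abort async compaction when a reschedule is
         * needed, otherwise just reschedule; record contention for callers. */
        static bool compact_should_abort(struct compact_control *cc)
        {
                if (need_resched()) {
                        if (cc->mode == MIGRATE_ASYNC) {
                                cc->contended = true;
                                return true;
                        }
                        cond_resched();
                }
                return false;
        }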

    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock. This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced. This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches. This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.

    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK since
    we do not want to waste the work that has been done, and page migration
    has its own checks for contention. However, we do not want another
    isolation attempt by either of the scanners, so a cc->contended flag check
    is also added to compaction_alloc() and compact_finished() to make sure
    compaction is aborted right after the migration.

    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.

    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: Michal Nazarewicz
    Tested-by: Shawn Guo
    Tested-by: Kevin Hilman
    Tested-by: Stephen Warren
    Tested-by: Fabio Estevam
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Fix checkpatch warning:
    WARNING: kfree(NULL) is safe this check is probably not required

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • ...like other filesystems.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • hugetlbfs_i_mmap_mutex_key is only used in inode.c

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Currently we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use the DIV_ROUND_UP macro for neater code and a clearer meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and does some
    rephrasing.
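
    For reference, DIV_ROUND_UP() expands to the same round-up idiom, so the
    change is purely cosmetic (the variable name below is illustrative):

        #define DIV_ROUND_UP(n, d)   (((n) + (d) - 1) / (d))

        /* before */
        balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                       KSWAPD_ZONE_BALANCE_GAP_RATIO;

        /* after */
        balance_gap = DIV_ROUND_UP(zone->managed_pages,
                                   KSWAPD_ZONE_BALANCE_GAP_RATIO);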

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • mmdebug.h is included twice.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • alloc_huge_page() currently mixes the normal code path with error handling
    logic. This patch moves the error handling logic out, to make the normal
    code path cleaner and reduce code duplication.

    Signed-off-by: Jianyu Zhan
    Acked-by: Davidlohr Bueso
    Reviewed-by: Michal Hocko
    Reviewed-by: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • The comment about pages under writeback is far from the relevant code, so
    let's move it to the right place.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If a page is marked for immediate reclaim then it is moved to the tail of
    the LRU list. This occurs when the system is under enough memory pressure
    for pages under writeback to reach the end of the LRU, but we test for
    this using atomic operations on every writeback. This patch uses an
    optimistic non-atomic test first. It'll miss some pages in rare cases but
    the consequences are not severe enough to warrant such a penalty.
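
    A sketch of the pattern (not the exact kernel diff): test the flag with a
    plain read first and only pay for the atomic clear when it is actually
    set, accepting that a rare racing setter may be missed:

        /* Illustrative: the unconditional atomic test-and-clear becomes a
         * cheap non-atomic check followed by an atomic clear when needed. */
        if (PageReclaim(page)) {                /* plain, non-atomic read  */
                ClearPageReclaim(page);         /* atomic op only if set   */
                rotate_reclaimable_page(page);  /* move to tail of the LRU */
        }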

    While the function does not dominate profiles during a simple dd test, its
    cost is reduced:

    73048 0.7428 vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
    23740 0.2409 vmlinux-3.15.0-rc5-lessatomic end_page_writeback

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There is no need to calculate zone_idx(preferred_zone) multiple times
    or use the pgdat to figure it out.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed() called almost immediately after. Once the page is
    visible, the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin() or find_or_create_page() for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper
    pagecache_get_page() which takes a flags parameter that affects its
    behaviour, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.
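
    A sketch of that wrapper pattern; the flag names and exact signature here
    are assumptions for illustration, not necessarily the patch's final
    interface:

        /* Illustrative only: each legacy lookup becomes a thin wrapper around
         * one core helper, with behaviour selected by flags. */
        static inline struct page *find_or_create_page(struct address_space *mapping,
                                                       pgoff_t index, gfp_t gfp_mask)
        {
                return pagecache_get_page(mapping, index,
                                          FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                          gfp_mask);
        }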

    Each of the filesystems are then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced, whereas previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" not to hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async, as the granularity reported by dd and the
    variability make it unsuitable for comparison. As the async results were
    variable due to writeback timings, I'm only reporting the maximum figures.
    The sync results were stable enough to make the mean and stddev
    uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
    3.15.0-rc3 3.15.0-rc3
    vanilla accessed-v2
    ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
    tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
    btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
    ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
    xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Discarding buffers uses a bunch of atomic operations because ...... I
    can't think of a reason. Use a cmpxchg loop to clear all the necessary
    flags. In most (all?) cases this will be a single atomic operation.
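
    A generic illustration of the technique (plain C11, not the buffer-head
    code): clear a whole set of flag bits with one compare-and-swap loop
    instead of one atomic bit operation per flag:

        #include <stdatomic.h>

        #define FLAG_A (1UL << 0)
        #define FLAG_B (1UL << 1)
        #define FLAG_C (1UL << 2)
        #define FLAGS_DISCARD (FLAG_A | FLAG_B | FLAG_C)

        static void clear_discard_flags(_Atomic unsigned long *state)
        {
                unsigned long old = atomic_load(state);

                /* Retry only if another thread changed the word meanwhile;
                 * on failure, 'old' is reloaded with the current value. */
                while (!atomic_compare_exchange_weak(state, &old,
                                                     old & ~FLAGS_DISCARD))
                        ;
        }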

    [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When adding pages to the LRU we clear the active bit unconditionally. As
    the page could be reachable from other paths we cannot use unlocked
    operations without risk of corruption, such as from a parallel
    mark_page_accessed(). This patch tests whether it is necessary to clear
    the active flag before using an atomic operation. This potentially opens
    a tiny race when PageActive is checked, as mark_page_accessed() could be
    called after PageActive was checked. The race already exists, but this
    patch changes it slightly. The consequence is that a page may be promoted
    to the active list that might have been left on the inactive list before
    the patch. It's too tiny a race and too marginal a consequence to always
    use atomic operations for.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There should be no references to it any more and a parallel mark should
    not be reordered against us. Use the non-locked variant to clear the page
    active flag.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shmem_getpage_gfp() uses an atomic operation to set the SwapBacked flag
    before the page is even added to the LRU or visible. This is unnecessary
    as what could it possibly race against? Use an unlocked variant.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is a bool, make it one. Make the likely case the "if" part of the
    block instead of the else as according to the optimisation manual this is
    preferred.
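
    Roughly the shape of the change (illustrative, not the exact hunk): the
    common hot-page case becomes the annotated "if" branch:

        /* Illustrative: cold pages go to the tail of the per-cpu free list,
         * hot pages (the common case) to the head. */
        if (likely(!cold))
                list_add(&page->lru, list);
        else
                list_add_tail(&page->lru, list);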

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page order.
    This converts a number of sites in mm/page_alloc.c to use unsigned int for
    order where possible.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • get_pageblock_migratetype() is called during free with IRQs disabled.
    This is unnecessary and disables IRQs for longer than necessary.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the free path we calculate page_to_pfn multiple times. Reduce that.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The test_bit operations in the get/set pageblock flags functions are
    expensive. This patch reads the bitmap on a word basis and uses shifts
    and masks to isolate the bits of interest. Similarly, masks are used to
    set a local copy of the bitmap, which is then written back with cmpxchg
    if no other changes have been made in parallel.
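
    A sketch of the read side (simplified, names illustrative): fetch the word
    containing the pageblock's bits once, then shift and mask, instead of
    issuing one test_bit() per bit:

        /* Illustrative sketch: one word load plus shift/mask replaces
         * several test_bit() calls on the pageblock bitmap. */
        static unsigned long get_pageblock_flags_word(const unsigned long *bitmap,
                                                      unsigned long word_idx,
                                                      unsigned long bit_offset,
                                                      unsigned long mask)
        {
                unsigned long word = bitmap[word_idx];

                return (word >> bit_offset) & mask;
        }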

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported case a) via code inspection. Vlastimil Babka's
    testing with a debug patch showed that either a) or b) occurs roughly once
    per mmtests' stress-highalloc benchmark (although not necessarily in the
    same pageblock). Furthermore, during development of unrelated compaction
    patches it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand times
    and has resulted in NULL pointer dereferences in move_freepages() and
    free_one_page() in places where free_list[migratetype] is manipulated by
    e.g. list_move(). Further debugging confirmed that the migratetype had the
    invalid value of 6, causing out-of-bounds access to the free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and is currently only fatal where page isolation is performed due to
    memory hot remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to unexpected
    creation or destruction of MIGRATE_RESERVE pageblocks. Furthermore,
    things could get suddenly worse when memory isolation is used more, or
    when new migratetypes are added.

    After this patch, the race is no longer observed in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • ALLOC_NO_WATERMARK is set in a few cases: always by kswapd, always for
    __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
    cases is relatively rare, yet the ALLOC_NO_WATERMARK check is an unlikely
    branch in the fast path. This patch moves the check out of the fast path
    so it is made only after it has been determined that the watermarks have
    not been met. This helps the common fast path at the cost of making the
    slow path slower and hitting kswapd with a performance cost. It's a
    reasonable tradeoff.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently it's calculated once per zone in the zonelist.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A node/zone index is used to check if pages are compatible for merging
    but this happens unconditionally even if the buddy page is not free. Defer
    the calculation as long as possible. Ideally we would check the zone boundary
    but nodes can overlap.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
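
    A sketch of the jump-label pattern, assuming the 3.15-era static_key API;
    the key name and the fast-path helper below are illustrative:

        /* Illustrative: while no cpusets are in use the branch is patched
         * out at runtime, so the common case costs a single no-op. */
        struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;

        static inline bool cpusets_enabled(void)
        {
                return static_key_false(&cpusets_enabled_key);
        }

        /* in the allocator fast path: */
        if (cpusets_enabled() && !zone_allowed_by_cpuset(zone, gfp_mask))
                continue;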

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch exposes the jump_label reference count in preparation for the
    next patch. The cpusets code cares about both the jump_label being
    enabled and how many users of cpusets there currently are.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a zone cannot be used for a dirty page then it gets marked "full" which
    is cached in the zlc and later potentially skipped by allocation requests
    that have nothing to do with dirty zones.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The zlc is used on NUMA machines to quickly skip over zones that are full.
    However, it is always updated, even for the first zone scanned when the
    zlc might not even be active. As it's a write to a bitmap that
    potentially bounces a cache line, it's deceptively expensive and most
    machines will not care. Only update the zlc if it was active.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, on kmem_cache_destroy() we delete the cache from the slab_list
    before __kmem_cache_shutdown(), inserting it back into the list on
    failure. Initially this was done because we could release the slab_mutex
    in __kmem_cache_shutdown() to delete the sysfs slub entry, but since
    commit 41a212859a4d ("slub: use sysfs'es release mechanism for
    kmem_cache") we remove the sysfs entry later in kmem_cache_destroy() after
    dropping the slab_mutex, so no implementation of __kmem_cache_shutdown()
    ever releases the lock. Therefore we can simplify the code a bit by
    moving the list_del after __kmem_cache_shutdown().

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Current names are rather inconsistent. Let's try to improve them.

    Brief change log:

    ** old name ** ** new name **

    kmem_cache_create_memcg memcg_create_kmem_cache
    memcg_kmem_create_cache memcg_register_cache
    memcg_kmem_destroy_cache memcg_unregister_cache

    kmem_cache_destroy_memcg_children memcg_cleanup_cache_params
    mem_cgroup_destroy_all_caches memcg_unregister_all_caches

    create_work memcg_register_cache_work
    memcg_create_cache_work_func memcg_register_cache_func
    memcg_create_cache_enqueue memcg_schedule_register_cache

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of calling an additional routine in dmam_pool_destroy() rely on
    what dmam_pool_release() is doing.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Some systems require a larger maximum PAGE_SIZE order for CMA allocations.
    To accommodate such systems, increase the upper-bound of the
    CMA_ALIGNMENT range to 12 (which ends up being 16MB on systems with 4K
    pages).

    Signed-off-by: Marc Carino
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc Carino