05 Jun, 2014

40 commits

  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • msync() currently syncs more than POSIX requires or BSD or Solaris
    implement. It is supposed to be equivalent to fdatasync(), not fsync(),
    and it is only supposed to sync the portion of the file that overlaps the
    range passed to msync.

    If the VMA is non-linear, fall back to syncing the entire file, but we
    still optimise to only fdatasync() the entire file, not the full fsync().

    akpm: there are obvious concerns with back-compatibility: is anyone relying
    on the undocumented side-effect for their data integrity? And how would
    they ever know if this change broke their data integrity?

    We think the risk is reasonably low, and this patch brings the kernel into
    line with other OS's and with what the manpage has always said...
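
    A minimal userspace sketch of the semantics described above (the file name
    and sizes are illustrative, and the file is assumed to already be large
    enough for the mapping):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("data.bin", O_RDWR);
            if (fd < 0) { perror("open"); return 1; }

            size_t len = 2 * 4096;
            char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED) { perror("mmap"); return 1; }

            memset(map, 'x', 4096);

            /* With the change, only the portion of the file overlapping this
             * range needs to be written back, and metadata is handled as for
             * fdatasync() rather than a full fsync(). */
            if (msync(map, 4096, MS_SYNC) != 0)
                    perror("msync");

            munmap(map, len);
            close(fd);
            return 0;
    }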

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Acked-by: Jeff Moyer
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Remove an unused global variable mce_entry and relative operations in
    do_machine_check().

    Signed-off-by: Chen Yucong
    Cc: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Compaction uses compact_checklock_irqsave() function to periodically check
    for lock contention and need_resched() to either abort async compaction,
    or to free the lock, schedule and retake the lock. When aborting,
    cc->contended is set to signal the contended state to the caller. Two
    problems have been identified in this mechanism.

    First, compaction also directly calls cond_resched() in both scanners when
    no lock is yet taken. This call neither aborts async compaction nor sets
    cc->contended appropriately. This patch introduces a new
    compact_should_abort() function to achieve both. In isolate_freepages(),
    the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks to
    match what the migration scanner does in the preliminary page checks. When
    a pageblock is found suitable for calling isolate_freepages_block(), the
    checks within it are done at a higher frequency.

    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock. This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced. This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches. This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.

    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK since
    we do not want to waste the work that has been done, and page migration
    has its own checks for contention. However, we do not want another
    isolation attempt by either of the scanners, so a cc->contended flag check
    is also added to compaction_alloc() and compact_finished() to make sure
    compaction is aborted right after the migration.

    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.
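
    A hedged sketch of the new helper as described above; the exact field
    names (cc->mode, cc->contended) are assumptions for illustration, not a
    quote of the patch:

    /*
     * Abort async compaction when need_resched() fires; sync compaction
     * simply reschedules and carries on.
     */
    static inline bool compact_should_abort(struct compact_control *cc)
    {
            if (need_resched()) {
                    if (cc->mode == MIGRATE_ASYNC) {
                            cc->contended = true;   /* tell the caller why we stopped */
                            return true;
                    }
                    cond_resched();
            }
            return false;
    }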

    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: Michal Nazarewicz
    Tested-by: Shawn Guo
    Tested-by: Kevin Hilman
    Tested-by: Stephen Warren
    Tested-by: Fabio Estevam
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Fix checkpatch warning:
    WARNING: kfree(NULL) is safe this check is probably not required

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • ...like other filesystems.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • hugetlbfs_i_mmap_mutex_key is only used in inode.c

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
    Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use the DIV_ROUND_UP macro for neater code and clearer meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and does some
    rephrasing.
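
    For reference, DIV_ROUND_UP is the standard kernel.h helper; the
    before/after below is an illustrative sketch of the cleanup, not a quote
    of the patch:

    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

    /* before */
    balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                  KSWAPD_ZONE_BALANCE_GAP_RATIO;

    /* after */
    balance_gap = DIV_ROUND_UP(zone->managed_pages,
                               KSWAPD_ZONE_BALANCE_GAP_RATIO);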

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • mmdebug.h is included twice.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
    alloc_huge_page() currently mixes the normal code path with error handling
    logic. This patch moves the error handling logic out, to make the normal
    code path cleaner and reduce code duplication.

    Signed-off-by: Jianyu Zhan
    Acked-by: Davidlohr Bueso
    Reviewed-by: Michal Hocko
    Reviewed-by: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • The comment about pages under writeback is far from the relevant code, so
    let's move it to the right place.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If a page is marked for immediate reclaim then it is moved to the tail of
    the LRU list. This occurs when the system is under enough memory pressure
    for pages under writeback to reach the end of the LRU but we test for this
    using atomic operations on every writeback. This patch uses an optimistic
    non-atomic test first. It'll miss some pages in rare cases but the
    consequences are not severe enough to warrant such a penalty.

    While the function does not dominate profiles during a simple dd test the
    cost of it is reduced.

    73048 0.7428 vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
    23740 0.2409 vmlinux-3.15.0-rc5-lessatomic end_page_writeback
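
    The pattern being described is roughly the following sketch (not the
    exact end_page_writeback() code): do a cheap, plain read of the flag
    first and only pay for the atomic operation when the bit is actually set.

    if (PageReclaim(page)) {                /* non-atomic read of the flag  */
            ClearPageReclaim(page);         /* atomic work only when needed */
            rotate_reclaimable_page(page);  /* move to the tail of the LRU  */
    }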

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There is no need to calculate zone_idx(preferred_zone) multiple times
    or use the pgdat to figure it out.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper, pagecache_get_page(), which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or not.
    The old API is preserved but is basically a thin wrapper around this core
    function.
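
    As an illustration of the "thin wrapper" structure; the FGP_* flag names
    and the exact signature are assumptions for this sketch, not necessarily
    what the patch uses:

    static inline struct page *find_or_create_page(struct address_space *mapping,
                                                   pgoff_t index, gfp_t gfp)
    {
            /* lock the page, mark it accessed, create it if missing */
            return pagecache_get_page(mapping, index,
                                      FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
    }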

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed, so in rare cases it's possible a
    page gets to the end of the LRU as PageReferenced whereas previously it
    might have been repromoted. This is expected to be rare but it's worth
    the filesystem people thinking about it in case they see a problem with
    the timing change. It is also the case that some filesystems may be
    marking pages accessed that previously did not, but it makes sense that
    filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async, as the granularity reported by dd and the
    variability make it unsuitable for comparison. As async results were
    variable due to writeback timings, I'm only reporting the maximum figures.
    The sync results were stable enough to make the mean and stddev
    uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                           3.15.0-rc3           3.15.0-rc3
                              vanilla          accessed-v2
    ext3   Max elapsed   13.9900 (  0.00%)   11.5900 ( 17.16%)
    tmpfs  Max elapsed    0.5100 (  0.00%)    0.4900 (  3.92%)
    btrfs  Max elapsed   12.8100 (  0.00%)   12.7800 (  0.23%)
    ext4   Max elapsed   18.6000 (  0.00%)   13.3400 ( 28.28%)
    xfs    Max elapsed   12.5600 (  0.00%)    2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Discarding buffers uses a bunch of atomic operations because ... I can't
    think of a reason why. Use a cmpxchg loop to clear all the necessary
    flags. In most (all?) cases this will be a single atomic operation.
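
    A sketch of the cmpxchg loop pattern described above (BUFFER_FLAGS_DISCARD
    stands for the mask of buffer_head flag bits being dropped; treat the
    names as illustrative):

    static void clear_discard_flags(struct buffer_head *bh)
    {
            unsigned long old_state, new_state;

            do {
                    old_state = bh->b_state;
                    new_state = old_state & ~BUFFER_FLAGS_DISCARD;
            } while (cmpxchg(&bh->b_state, old_state, new_state) != old_state);
    }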

    [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    When adding pages to the LRU we clear the active bit unconditionally.
    As the page could be reachable from other paths we cannot use unlocked
    operations without risk of corruption, such as a parallel
    mark_page_accessed. This patch tests whether it is necessary to clear the
    active flag before using an atomic operation. This potentially opens a
    tiny race when PageActive is checked, as mark_page_accessed could be
    called after PageActive was checked. The race already exists, but this
    patch changes it slightly. The consequence is that a page may be
    promoted to the active list that would have been left on the inactive
    list before the patch. It's too tiny a race and too marginal a
    consequence to always use atomic operations for.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    There should be no references to the page any more and a parallel mark
    should not be reordered against us. Use the non-locked variant to clear
    the page's active flag.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
    before the page is even added to the LRU or visible. This is unnecessary,
    as what could it possibly race against? Use an unlocked variant.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is a bool, make it one. Make the likely case the "if" part of the
    block instead of the else as according to the optimisation manual this is
    preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page order.
    This converts a number of sites in mm/page_alloc.c to use unsigned int for
    order where possible.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • get_pageblock_migratetype() is called during free with IRQs disabled.
    This is unnecessary and disables IRQs for longer than necessary.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the free path we calculate page_to_pfn multiple times. Reduce that.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The test_bit operations in get/set pageblock flags are expensive. This
    patch reads the bitmap on a word basis and uses shifts and masks to
    isolate the bits of interest. Similarly, masks are used to set a local
    copy of the bitmap, and cmpxchg is then used to update the bitmap if
    there have been no other changes made in parallel.
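
    Schematically, the update side becomes a read-modify-cmpxchg on a single
    word; the names below are illustrative, not the exact pageblock code:

    unsigned long *word = bitmap_word_for(page);   /* assumed helper        */
    unsigned long mask = flags_mask << bit_shift;  /* the bits of interest  */
    unsigned long old_word, new_word;

    do {
            old_word = *word;
            new_word = (old_word & ~mask) | (new_flags << bit_shift);
    } while (cmpxchg(word, old_word, new_word) != old_word);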

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore, during development of
    unrelated compaction patches, it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand
    times and has resulted in NULL pointer dereferences in move_freepages()
    and free_one_page() in places where free_list[migratetype] is
    manipulated by e.g. list_move(). Further debugging confirmed that
    migratetype had the invalid value of 6, causing out-of-bounds access to
    the free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and is currently only fatal where page isolation is performed due to
    memory hot remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race has not been observed in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for
    __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
    cases is relatively rare, but the ALLOC_NO_WATERMARK check is an
    unlikely branch in the fast path. This patch moves the check out of the
    fast path and after it has been determined that the watermarks have not
    been met. This helps the common fast path at the cost of making the slow
    path slower and hitting kswapd with a performance cost. It's a reasonable
    tradeoff.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently it's calculated once per zone in the zonelist.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A node/zone index is used to check if pages are compatible for merging
    but this happens unconditionally even if the buddy page is not free. Defer
    the calculation as long as possible. Ideally we would check the zone boundary
    but nodes can overlap.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
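
    The general jump-label pattern looks like the sketch below (illustrative,
    not the cpuset-specific helpers the patch adds): the branch compiles to a
    patched no-op until the key is enabled.

    #include <linux/jump_label.h>

    static struct static_key example_key = STATIC_KEY_INIT_FALSE;

    static inline bool feature_enabled(void)
    {
            return static_key_false(&example_key);  /* no-op branch when disabled */
    }

    /* flipped only on the slow path, when users appear or disappear */
    void feature_user_add(void)    { static_key_slow_inc(&example_key); }
    void feature_user_remove(void) { static_key_slow_dec(&example_key); }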

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch exposes the jump_label reference count in preparation for the
    next patch. cpusets cares about both the jump_label being enabled and how
    many users of the cpusets there currently are.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a zone cannot be used for a dirty page then it gets marked "full" which
    is cached in the zlc and later potentially skipped by allocation requests
    that have nothing to do with dirty zones.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The zlc is used on NUMA machines to quickly skip over zones that are full.
    However it is always updated, even for the first zone scanned when the
    zlc might not even be active. As it's a write to a bitmap that
    potentially bounces a cache line it's deceptively expensive and most
    machines will not care. Only update the zlc if it was active.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Currently, on kmem_cache_destroy we delete the cache from the slab_list
    before __kmem_cache_shutdown, inserting it back to the list on failure.
    Initially, this was done because we could release the slab_mutex in
    __kmem_cache_shutdown to delete the sysfs slub entry, but since commit
    41a212859a4d ("slub: use sysfs'es release mechanism for kmem_cache") we
    remove the sysfs entry later in kmem_cache_destroy after dropping the
    slab_mutex, so that no implementation of __kmem_cache_shutdown ever
    releases the lock. Therefore we can simplify the code a bit by moving
    list_del after __kmem_cache_shutdown.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Current names are rather inconsistent. Let's try to improve them.

    Brief change log:

    ** old name **                        ** new name **

    kmem_cache_create_memcg               memcg_create_kmem_cache
    memcg_kmem_create_cache               memcg_register_cache
    memcg_kmem_destroy_cache              memcg_unregister_cache

    kmem_cache_destroy_memcg_children     memcg_cleanup_cache_params
    mem_cgroup_destroy_all_caches         memcg_unregister_all_caches

    create_work                           memcg_register_cache_work
    memcg_create_cache_work_func          memcg_register_cache_func
    memcg_create_cache_enqueue            memcg_schedule_register_cache

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of calling an additional routine in dmam_pool_destroy() rely on
    what dmam_pool_release() is doing.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Some systems require a larger maximum PAGE_SIZE order for CMA allocations.
    To accommodate such systems, increase the upper-bound of the
    CMA_ALIGNMENT range to 12 (which ends up being 16MB on systems with 4K
    pages).

    Signed-off-by: Marc Carino
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc Carino
     
    Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    both of which were intended to point to the highest priority active swap
    target that was not full. The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

    Mel Gorman suggested using plists since they internally handle ordering
    the list entries based on priority, which is exactly what swap was doing
    manually. All the ordering code is now removed, and swap_info_struct
    entries are simply added to their corresponding plist and automatically
    ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs. Using a new spinlock for the swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the
    plist when it becomes full or not-full; previously it could not
    do so because the swap_info_struct->lock is held when it changes
    from full to not-full, and the swap_lock protecting the main
    swap_active_head must be ordered before any swap_info_struct->lock.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add plist_requeue(), which moves the specified plist_node after all other
    same-priority plist_nodes in the list. This is essentially an optimized
    plist_del() followed by plist_add().

    This is needed by swap, which (with the next patch in this set) uses a
    plist of available swap devices. When a swap device (either a swap
    partition or swap file) is added to the system with swapon(), the device
    is added to a plist, ordered by the swap device's priority. When swap
    needs to allocate a page from one of the swap devices, it takes the page
    from the first swap device on the plist, which is the highest priority
    swap device. The swap device is left in the plist until all its pages are
    used, and then removed from the plist when it becomes full.

    However, as described in man 2 swapon, swap must allocate pages from swap
    devices with the same priority in round-robin order; to do this, on each
    swap page allocation, swap uses a page from the first swap device in the
    plist, and then calls plist_requeue() to move that swap device entry to
    after any other same-priority swap devices. The next swap page allocation
    will again use a page from the first swap device in the plist and requeue
    it, and so on, resulting in round-robin usage of equal-priority swap
    devices.

    Also add a plist_test_requeue() test function, for use by plist_test() to
    test the plist_requeue() function.
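
    Conceptually, the per-allocation rotation looks like this sketch (the
    swap_avail_head/avail_list names follow the next patch in the set and are
    used here for illustration only):

    spin_lock(&swap_avail_lock);
    si = plist_first_entry(&swap_avail_head, struct swap_info_struct, avail_list);

    /* rotate this device behind any other devices of the same priority so
     * the next allocation round-robins across equal-priority swap devices */
    plist_requeue(&si->avail_list, &swap_avail_head);
    spin_unlock(&swap_avail_lock);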

    Signed-off-by: Dan Streetman
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
    define and initialize a struct plist_head.

    Add plist_for_each_continue() and plist_for_each_entry_continue(),
    equivalent to list_for_each_continue() and list_for_each_entry_continue(),
    to iterate over a plist continuing after the current position.

    Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
    and ->next, implemented by list_prev_entry() and list_next_entry(), to
    access the prev/next struct plist_node entry. These are needed because
    unlike struct list_head, direct access of the prev/next struct plist_node
    isn't possible; the list must be navigated via the contained struct
    list_head. e.g. instead of accessing the prev by list_prev_entry(node,
    node_list) it can be accessed by plist_prev(node).
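
    A small usage sketch of the helpers described above (the struct and
    values are illustrative):

    static PLIST_HEAD(example_head);    /* define + initialize a plist_head */

    struct item {
            struct plist_node node;
            int value;
    };

    static void add_item(struct item *it, int prio)
    {
            plist_node_init(&it->node, prio);
            plist_add(&it->node, &example_head);
    }

    static void walk_after(struct item *pos)
    {
            /* continue iterating from a position found earlier */
            plist_for_each_entry_continue(pos, &example_head, node)
                    pr_info("value %d\n", pos->value);
    }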

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    This complexity leads to bugs such as https://lkml.org/lkml/2014/2/13/181,
    where different priority swap targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested, using plists allows removing all the ordering code from
    swap - plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
    In a previous commit ("mm: use the light version __mod_zone_page_state in
    mlocked_vma_newpage()") an irq-unsafe __mod_zone_page_state is used. As
    suggested by Andrew, to reduce the risk of new call sites incorrectly
    using mlocked_vma_newpage() without knowing they are adding a race, this
    patch folds mlocked_vma_newpage() into its only call site,
    page_add_new_anon_rmap, to make it open-coded so people know what is
    going on.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jianyu Zhan
    Suggested-by: Andrew Morton
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
    mlocked_vma_newpage() is called with the pte lock held (a spinlock), which
    implies preemption is disabled, and the vm stat counter is not modified
    from interrupt context, so we need not use an irq-safe
    mod_zone_page_state() here; using the light-weight version
    __mod_zone_page_state() is OK.

    This patch also documents __mod_zone_page_state() and some of its
    callsites. The comment above __mod_zone_page_state() is from Hugh
    Dickins, and acked by Christoph.

    Most credits to Hugh and Christoph for the clarification on the usage of
    the __mod_zone_page_state().
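
    The distinction, roughly (illustrative call site, not a quote of the
    patch):

    /* pte lock held: no preemption, and NR_MLOCK is never modified from
     * interrupt context, so the non-irq-safe variant is sufficient */
    __mod_zone_page_state(page_zone(page), NR_MLOCK, 1);

    /* elsewhere, without such guarantees, the irq-safe variant is needed */
    mod_zone_page_state(page_zone(page), NR_MLOCK, 1);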

    [akpm@linux-foundation.org: coding-style fixes]
    Suggested-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Jianyu Zhan
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan