05 Jun, 2014

40 commits

  • Usually, BUG_ON and friends aren't even evaluated in sparse, but recently
    compiletime_assert_atomic_type() was added, and that now results in a
    sparse warning every time it is used.

    The reason turns out to be the temporary variable: after it was
    introduced, sparse no longer considers the value to be a constant, and
    emits a warning and an error. The error is the more annoying part of
    this, as it suppresses any further warnings in the same file, hiding
    other problems.

    Unfortunately the condition cannot simply be expanded out to avoid the
    temporary variable, since that breaks compiletime_assert on old versions
    of GCC such as GCC 4.2.4, on which the latest metag compiler is based.

    Therefore #ifndef __CHECKER__ out the __compiletime_error_fallback which
    uses the potentially negative size array to trigger a conditional compiler
    error, so that sparse doesn't see it.
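
    As a rough, self-contained sketch of the resulting pattern (illustrative
    only; the kernel's real macros live in include/linux/compiler.h and
    differ in detail): sparse defines __CHECKER__, so guarding the
    negative-size-array fallback with #ifndef __CHECKER__ keeps it out of
    sparse's view while a real compiler still evaluates it.

    ======
    #ifndef __CHECKER__
    /* negative array size => build error when the condition is true */
    #define __compiletime_error_fallback(condition) \
            do { ((void)sizeof(char[1 - 2 * !!(condition)])); } while (0)
    #else
    #define __compiletime_error_fallback(condition) do { } while (0)
    #endif

    #define my_compiletime_assert(condition, msg) \
            __compiletime_error_fallback(!(condition))

    int main(void)
    {
            my_compiletime_assert(sizeof(long) >= 4, "long too small");
            /* my_compiletime_assert(0, "boom"); -- would fail to build */
            return 0;
    }
    ======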

    Signed-off-by: James Hogan
    Cc: Johannes Berg
    Cc: Daniel Santos
    Cc: Luciano Coelho
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Johannes Berg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Hogan
     
  • zbud_alloc is only called by zswap_frontswap_store with an unsigned int
    len. Change the function parameter accordingly and update the >= 0 check.

    Signed-off-by: Fabian Frederick
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • We already have a function named hugepages_supported(), and the similar
    name hugepage_migration_support() is a bit uncomfortable, so let's rename
    it to hugepage_migration_supported().

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Transform the action part of ttu_flags into individual bits. These flags
    aren't part of any user-space visible API or even trace events.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use the DIV_ROUND_UP macro for neater code and clearer meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and do some
    rephrasing.
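
    For reference, DIV_ROUND_UP(n, d) expands to ((n) + (d) - 1) / (d), so
    the change is purely cosmetic. A small userspace illustration (the ratio
    value below is only an example):

    ======
    #include <stdio.h>

    #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100   /* example value */
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    int main(void)
    {
            unsigned long managed_pages = 123456;

            /* old open-coded rounding */
            unsigned long old_gap =
                    (managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                    KSWAPD_ZONE_BALANCE_GAP_RATIO;
            /* new, equivalent form */
            unsigned long new_gap =
                    DIV_ROUND_UP(managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO);

            printf("%lu %lu\n", old_gap, new_gap);   /* prints "1235 1235" */
            return 0;
    }
    ======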

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • mmdebug.h is included twice.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible, the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper
    pagecache_get_page() which takes a flags parameter that affects its
    behaviour, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.
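
    The resulting structure, sketched here as stub userspace C (the FGP_*
    flag names and values are invented for the sketch; only the "core helper
    plus thin wrappers" shape follows the description):

    ======
    #include <stddef.h>

    struct page;
    struct address_space;

    /* flag bits controlling lookup behaviour (illustrative) */
    #define FGP_ACCESSED 0x01
    #define FGP_LOCK     0x02
    #define FGP_CREAT    0x04

    /* the one core helper: look up the page, optionally lock/create it, and
     * mark it accessed (non-atomically, before it becomes visible, if it was
     * just created) */
    static struct page *pagecache_get_page(struct address_space *mapping,
                                           unsigned long index,
                                           int fgp_flags, unsigned gfp_mask)
    {
            (void)mapping; (void)index; (void)fgp_flags; (void)gfp_mask;
            return NULL;    /* stub */
    }

    /* the old entry points become thin wrappers picking flag combinations */
    static struct page *find_get_page(struct address_space *m, unsigned long i)
    {
            return pagecache_get_page(m, i, 0, 0);
    }

    static struct page *find_or_create_page(struct address_space *m,
                                            unsigned long i, unsigned gfp)
    {
            return pagecache_get_page(m, i,
                                      FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
    }

    int main(void)
    {
            return find_get_page(NULL, 0) || find_or_create_page(NULL, 0, 0);
    }
    ======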

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of
    mark_page_accessed() has now changed, so in rare cases it's possible a
    page gets to the end of the LRU as PageReferenced whereas previously it
    might have been repromoted. This is expected to be rare, but it's worth
    the filesystem people thinking about it in case they see a problem with
    the timing change. It is also the case that some filesystems may now be
    marking pages accessed that they previously did not, but it makes sense
    that filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async as the granularity reported by dd and the
    variability are unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
    3.15.0-rc3 3.15.0-rc3
    vanilla accessed-v2
    ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
    tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
    btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
    ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
    xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
    before the page is even added to the LRU or visible. This is unnecessary:
    what could it possibly race against? Use an unlocked variant.
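
    A userspace sketch of the distinction (modelled on the kernel's
    SetPageFoo/__SetPageFoo naming convention; this is a stand-in, not the
    shmem code):

    ======
    #include <stdio.h>

    #define PG_swapbacked (1UL << 1)

    struct page { unsigned long flags; };

    /* atomic set: required once other CPUs may be touching page->flags */
    static void SetPageSwapBacked(struct page *p)
    {
            __atomic_fetch_or(&p->flags, PG_swapbacked, __ATOMIC_RELAXED);
    }

    /* non-atomic set: fine while the page is still invisible to others */
    static void __SetPageSwapBacked(struct page *p)
    {
            p->flags |= PG_swapbacked;
    }

    int main(void)
    {
            struct page page = { 0 };

            __SetPageSwapBacked(&page);     /* page not yet on the LRU */
            printf("flags=%#lx\n", page.flags);
            (void)SetPageSwapBacked;        /* needed only once visible */
            return 0;
    }
    ======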

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is a bool, so make it one. Make the likely case the "if" part of the
    block instead of the else, as according to the optimisation manual this is
    preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • X86 prefers the use of unsigned types for iterators, and there is a
    tendency to mix whether a signed or unsigned type is used for page order.
    This converts a number of sites in mm/page_alloc.c to use unsigned int for
    order where possible.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the free path we calculate page_to_pfn multiple times. Reduce that.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The test_bit operations in get/set pageblock flags are expensive. This
    patch reads the bitmap on a word basis and uses shifts and masks to
    isolate the bits of interest. Similarly, masks are used to set a local
    copy of the bitmap, and cmpxchg is then used to update the bitmap if
    there have been no other changes made in parallel.
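
    A simplified userspace sketch of the scheme (4 bits per pageblock as in
    the kernel's pageblock flags, everything else illustrative):

    ======
    #include <stdio.h>

    #define BITS_PER_BLOCK 4UL

    static unsigned long bitmap_word;   /* one word of the bitmap */

    static unsigned long get_block_flags(unsigned int block)
    {
            unsigned long shift = block * BITS_PER_BLOCK;
            unsigned long mask = ((1UL << BITS_PER_BLOCK) - 1) << shift;
            unsigned long word = __atomic_load_n(&bitmap_word, __ATOMIC_RELAXED);

            /* one word read, then shift/mask to isolate the bits */
            return (word & mask) >> shift;
    }

    static void set_block_flags(unsigned int block, unsigned long flags)
    {
            unsigned long shift = block * BITS_PER_BLOCK;
            unsigned long mask = ((1UL << BITS_PER_BLOCK) - 1) << shift;
            unsigned long old = __atomic_load_n(&bitmap_word, __ATOMIC_RELAXED);
            unsigned long new;

            do {
                    /* build the updated word locally ... */
                    new = (old & ~mask) | (flags << shift);
                    /* ... and publish it only if nothing changed meanwhile */
            } while (!__atomic_compare_exchange_n(&bitmap_word, &old, new, 0,
                                                  __ATOMIC_RELAXED,
                                                  __ATOMIC_RELAXED));
    }

    int main(void)
    {
            set_block_flags(3, 0x5);
            printf("block 3 flags: %#lx\n", get_block_flags(3));   /* 0x5 */
            return 0;
    }
    ======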

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore, during development of
    unrelated compaction patches it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand
    times and has resulted in NULL pointer dereferences in move_freepages()
    and free_one_page() in places where free_list[migratetype] is
    manipulated by e.g. list_move(). Further debugging confirmed that
    migratetype had an invalid value of 6, causing out-of-bounds access to
    the free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and is currently only fatal where page isolation is performed due to
    memory hot-remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race has not been observed again in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
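
    The shape of the resulting fast-path check, as a kernel-style sketch
    (names and details are illustrative, not necessarily the exact patch):

    ======
    #include <linux/jump_label.h>
    #include <linux/types.h>

    static struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;

    static inline bool cpusets_enabled(void)
    {
            /* compiles down to a patched no-op while the key is disabled */
            return static_key_false(&cpusets_enabled_key);
    }

    /* flipped as cpusets beyond the root are created and destroyed, so the
     * page allocator only pays for the cpuset checks when cpusets are used */
    static inline void cpuset_inc(void)
    {
            static_key_slow_inc(&cpusets_enabled_key);
    }

    static inline void cpuset_dec(void)
    {
            static_key_slow_dec(&cpusets_enabled_key);
    }
    ======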

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch exposes the jump_label reference count in preparation for the
    next patch. cpusets cares about both the jump_label being enabled and how
    many users of the cpusets there currently are.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Current names are rather inconsistent. Let's try to improve them.

    Brief change log:

    ** old name ** ** new name **

    kmem_cache_create_memcg memcg_create_kmem_cache
    memcg_kmem_create_cache memcg_register_cache
    memcg_kmem_destroy_cache memcg_unregister_cache

    kmem_cache_destroy_memcg_children memcg_cleanup_cache_params
    mem_cgroup_destroy_all_caches memcg_unregister_all_caches

    create_work memcg_register_cache_work
    memcg_create_cache_work_func memcg_register_cache_func
    memcg_create_cache_enqueue memcg_schedule_register_cache

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    which both were intended to point to the highest priority active swap
    target that was not full. The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

    Mel Gorman suggested using plists since they internally handle ordering
    the list entries based on priority, which is exactly what swap was doing
    manually. All the ordering code is now removed, and swap_info_struct
    entries are simply added to their corresponding plist and automatically
    ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs. Using a new spinlock for the swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the
    plist when it becomes full or not-full; previously it could not
    do so because the swap_info_struct->lock is held when it changes
    from full to not-full, and the swap_lock protecting the main
    swap_active_head must be ordered before any swap_info_struct->lock.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add plist_requeue(), which moves the specified plist_node after all other
    same-priority plist_nodes in the list. This is essentially an optimized
    plist_del() followed by plist_add().

    This is needed by swap, which (with the next patch in this set) uses a
    plist of available swap devices. When a swap device (either a swap
    partition or swap file) is added to the system with swapon(), the device
    is added to a plist, ordered by the swap device's priority. When swap
    needs to allocate a page from one of the swap devices, it takes the page
    from the first swap device on the plist, which is the highest priority
    swap device. The swap device is left in the plist until all its pages are
    used, and then removed from the plist when it becomes full.

    However, as described in man 2 swapon, swap must allocate pages from swap
    devices with the same priority in round-robin order; to do this, on each
    swap page allocation, swap uses a page from the first swap device in the
    plist, and then calls plist_requeue() to move that swap device entry to
    after any other same-priority swap devices. The next swap page allocation
    will again use a page from the first swap device in the plist and requeue
    it, and so on, resulting in round-robin usage of equal-priority swap
    devices.
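
    A kernel-style sketch of that round-robin selection (structure and field
    names are illustrative, not the actual swap code):

    ======
    #include <linux/plist.h>

    struct swap_info_struct {
            struct plist_node avail_list;   /* entry on swap_avail_head */
            /* ... */
    };

    static PLIST_HEAD(swap_avail_head);

    static struct swap_info_struct *pick_swap_device(void)
    {
            struct swap_info_struct *si;

            if (plist_head_empty(&swap_avail_head))
                    return NULL;

            /* front of the plist = highest-priority, not-full device */
            si = plist_first_entry(&swap_avail_head,
                                   struct swap_info_struct, avail_list);

            /* move it behind any other entries of the same priority, so the
             * next allocation picks a different equal-priority device */
            plist_requeue(&si->avail_list, &swap_avail_head);
            return si;
    }
    ======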

    Also add a plist_test_requeue() test function, for use by plist_test() to
    exercise plist_requeue().

    Signed-off-by: Dan Streetman
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
    define and initialize a struct plist_head.

    Add plist_for_each_continue() and plist_for_each_entry_continue(),
    equivalent to list_for_each_continue() and list_for_each_entry_continue(),
    to iterate over a plist continuing after the current position.

    Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
    and ->next, implemented by list_prev_entry() and list_next_entry(), to
    access the prev/next struct plist_node entry. These are needed because
    unlike struct list_head, direct access of the prev/next struct plist_node
    isn't possible; the list must be navigated via the contained struct
    list_head. e.g. instead of accessing the prev by list_prev_entry(node,
    node_list) it can be accessed by plist_prev(node).
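
    Given that description, the two navigation helpers presumably boil down
    to something like the following sketch:

    ======
    /* step to the neighbouring plist_node via its embedded list_head */
    #define plist_prev(pos) list_prev_entry(pos, node_list)
    #define plist_next(pos) list_next_entry(pos, node_list)
    ======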

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    This complexity leads to bugs such as https://lkml.org/lkml/2014/2/13/181,
    where different priority swap targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested, using plists allows removing all the ordering code from
    swap - plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • During compaction, update_nr_listpages() has been used to count remaining
    non-migrated and free pages after a call to migrate_pages(). The
    freepages counting has become unnecessary, and it turns out that
    migratepages counting is also unnecessary in most cases.

    The only situation when it's needed to count cc->migratepages is when
    migrate_pages() returns with a negative error code. Otherwise, the
    non-negative return value is the number of pages that were not migrated,
    which is exactly the count of remaining pages in the cc->migratepages
    list.

    Furthermore, any non-zero count is only interesting for the tracepoint of
    mm_compaction_migratepages events, because after that all remaining
    unmigrated pages are put back and their count is set to 0.

    This patch therefore removes update_nr_listpages() completely, and changes
    the tracepoint definition so that the manual counting is done only when
    the tracepoint is enabled, and only when migrate_pages() returns a
    negative error code.

    Furthermore, migrate_pages() and the tracepoints won't be called when
    there's nothing to migrate. This potentially avoids some wasted cycles
    and reduces the volume of uninteresting mm_compaction_migratepages events
    where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this
    was about 75% of the events. The mm_compaction_isolate_migratepages event
    is better for determining that nothing was isolated for migration, and
    this one was just duplicating the info.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault path to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.
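
    For reference, the three migration modes involved and the shape of the
    compact_control change (the struct fragment is a sketch, not the full
    patch):

    ======
    enum migrate_mode {
            MIGRATE_ASYNC,          /* never block */
            MIGRATE_SYNC_LIGHT,     /* may block, but skip the costliest waits */
            MIGRATE_SYNC,           /* may block and wait as needed */
    };

    struct compact_control {
            /* bool sync; */        /* old: only two behaviours expressible */
            enum migrate_mode mode; /* new: passed straight to migrate_pages() */
            /* ... */
    };
    ======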

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.

    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction. This
    creates a dependency on calling sync compaction to prevent subsequent
    calls to async compaction from scanning an enormous amount of non-MOVABLE
    pageblocks each time it is called. On large machines, this could be
    potentially very expensive.

    This patch adds a per-zone cached migration scanner pfn only for async
    compaction. It is updated every time a pageblock has been scanned in its
    entirety and when no pages from it were successfully isolated. The cached
    migration scanner pfn for sync compaction is updated only when called for
    sync compaction.

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Greg Thelen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
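
    The interface change, sketched from the description above (the typedef
    and parameter names follow the existing new_page_t convention but are
    written from memory, so treat them as illustrative):

    ======
    #include <linux/list.h>
    #include <linux/migrate_mode.h>

    struct page;

    /* allocate a destination page for @page; this callback already existed */
    typedef struct page *new_page_t(struct page *page, unsigned long private,
                                    int **result);
    /* new: hand a no-longer-needed destination page back to the caller */
    typedef void free_page_t(struct page *page, unsigned long private);

    /* passing NULL for put_new_page keeps the old behaviour of freeing a
     * failed migration target back to the system */
    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      free_page_t put_new_page, unsigned long private,
                      enum migrate_mode mode, int reason);
    ======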

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
    order to just create the name of a per memcg cache, let's allocate it in
    place. We only need to pass the memcg name to kmem_cache_create_memcg for
    that - everything else can be done in slab_common.c.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • With ELF extended numbering, the 16-bit bound is not a hard limit any more.

    [akpm@linux-foundation.org: fix typo]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • KSM was converted to use rmap_walk() and now nobody uses these functions
    outside mm/rmap.c.

    Let's convert them back to static.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Now that we are doing NUMA-aware shrinking, and can have shrinkers
    running in parallel, or working on individual nodes, it seems like we
    should also be sticking the node in the output.

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was looking at a trace of the slab shrinkers (attachment in this comment):

    https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67

    and noticed that "total_scan" can go negative in some cases. We
    used to dump out the "total_scan" variable directly, but some of
    the shrinker modifications along the way changed that.

    This patch just dumps it out directly, again. It doesn't make
    any sense to derive it from new_nr and nr any more since there
    are now other shrinkers that can be running in parallel and
    mucking with those values.

    Here's an example of the negative numbers in the output:

    > kswapd0-840 [000] 160.869398: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
    > kswapd0-840 [000] 160.869618: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
    > kswapd0-840 [000] 160.870031: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
    > kswapd0-840 [000] 160.870464: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
    > kswapd0-840 [000] 163.384144: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
    > kswapd0-840 [000] 163.384297: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
    > kswapd0-840 [000] 163.384414: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
    > kswapd0-840 [000] 163.384657: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
    > kswapd0-840 [000] 163.384880: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
    > kswapd0-840 [000] 163.385256: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
    > kswapd0-840 [000] 163.385598: mm_shrink_slab_end: i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512

    Signed-off-by: Dave Hansen
    Acked-by: Dave Chinner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Use BOOTMEM_DEFAULT instead of 0 in the comment.

    Signed-off-by: Wang Sheng-Hui
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • Currently, in put_compound_page(), we have

    ======
    if (likely(!PageTail(page))) { ... return; }
    /* @page is a Tail page */
    head = compound_head(page);
    ======

    where compound_head() is implemented as:

    ======
    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page))) {         <------  (3)
                    struct page *head = page->first_page;
                    smp_rmb();
                    if (likely(PageTail(page)))
                            return head;
            }
            return page;
    }
    ======

    Here, the unlikely at (3) is a negative hint, because in this path the
    page is *likely* a tail page. So the check at (3) is not good for this
    case, and I introduce a helper for it.

    So this patch introduces compound_head_by_tail(), which deals with a
    possible tail page (though it could be split by a racing thread), and
    makes compound_head() a wrapper around it.

    This patch has no functional change, and it reduces the object
    size slightly:
    text data bss dec hex filename
    11003 1328 16 12347 303b mm/swap.o.orig
    10971 1328 16 12315 301b mm/swap.o.patched

    I ran "perf top -e branch-miss" to observe branch misses in this case.
    As Michael points out, it's a slow path, so this case happens only very
    rarely. But I grep'ed the code base and found there are still some other
    call sites that could benefit from this helper. And given that it bloats
    the source by only 5 lines while reducing the object size, I still
    believe this helper deserves to exist.

    Signed-off-by: Jianyu Zhan
    Cc: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Jiang Liu
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Sasha Levin
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • The nmask argument to set_mempolicy() is const according to the user-space
    header numaif.h, and since the kernel does indeed not modify it, it might
    as well be declared const in the kernel.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The nmask argument to mbind() is const according to the userspace header
    numaif.h, and since the kernel does indeed not modify it, it might as well
    be declared const in the kernel.

    Signed-off-by: Rasmus Villemoes
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • A block device driver may choose to provide a rw_page operation. These
    will be called when the filesystem is attempting to do page sized I/O to
    page cache pages (ie not for direct I/O). This does preclude I/Os that
    are larger than page size, so this may only be a performance gain for
    some devices.
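
    A sketch of how a driver opts in (the prototype here is from memory and
    should be treated as illustrative, not authoritative):

    ======
    #include <linux/blkdev.h>
    #include <linux/module.h>

    static int mydrv_rw_page(struct block_device *bdev, sector_t sector,
                             struct page *page, int rw)
    {
            /* do the page-sized read or write, then report completion to
             * the page cache (e.g. via page_endio()) */
            return 0;
    }

    static const struct block_device_operations mydrv_fops = {
            .owner   = THIS_MODULE,
            .rw_page = mydrv_rw_page,
            /* ... */
    };
    ======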

    Signed-off-by: Matthew Wilcox
    Tested-by: Dheeraj Reddy
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • page_endio() takes care of updating all the appropriate page flags once
    I/O has finished to a page. Switch to using mapping_set_error() instead
    of setting AS_EIO directly; this will handle thin-provisioned devices
    correctly.

    Signed-off-by: Matthew Wilcox
    Cc: Dave Chinner
    Cc: Dheeraj Reddy
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The last in-tree caller of block_write_full_page_endio() was removed in
    January 2013. It's time to remove the EXPORT_SYMBOL, which leaves
    block_write_full_page() as the only caller of
    block_write_full_page_endio(), so inline block_write_full_page_endio()
    into block_write_full_page().

    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Cc: Dheeraj Reddy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • At present, we have the following mutexes protecting data related to per
    memcg kmem caches:

    - slab_mutex. This one is held during the whole kmem cache creation
    and destruction paths. We also take it when updating per root cache
    memcg_caches arrays (see memcg_update_all_caches). As a result, taking
    it guarantees there will be no changes to any kmem cache (including per
    memcg). Why do we need something else then? The point is that it is
    private to the slab implementation and has some internal dependencies on
    other mutexes (get_online_cpus). So we just don't want to rely upon it
    and prefer to introduce additional mutexes instead.

    - activate_kmem_mutex. Initially it was added to synchronize
    initializing kmem limit (memcg_activate_kmem). However, since we can
    grow per root cache memcg_caches arrays only on kmem limit
    initialization (see memcg_update_all_caches), we also employ it to
    protect against memcg_caches arrays relocation (e.g. see
    __kmem_cache_destroy_memcg_children).

    - We have a convention not to take slab_mutex in memcontrol.c, but we
    want to walk over per memcg memcg_slab_caches lists there (e.g. for
    destroying all memcg caches on offline). So we have per memcg
    slab_caches_mutex's protecting those lists.

    The mutexes are taken in the following order:

    activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex

    Such a synchronization scheme has a number of flaws, for instance:

    - We can't call kmem_cache_{destroy,shrink} while walking over a
    memcg::memcg_slab_caches list due to locking order. As a result, in
    mem_cgroup_destroy_all_caches we schedule the memcg_cache_params::destroy
    work, which shrinks and destroys the cache.

    - We don't have a mutex to synchronize per memcg caches destruction
    between memcg offline (mem_cgroup_destroy_all_caches) and root cache
    destruction (__kmem_cache_destroy_memcg_children). Currently we just
    don't bother about it.

    This patch simplifies it by substituting per memcg slab_caches_mutex's
    with the global memcg_slab_mutex. It will be held whenever a new per
    memcg cache is created or destroyed, so it protects per root cache
    memcg_caches arrays and per memcg memcg_slab_caches lists. The locking
    order is the following:

    activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex

    This allows us to call kmem_cache_{create,shrink,destroy} under the
    memcg_slab_mutex. As a result, we don't need memcg_cache_params::destroy
    work any more - we can simply destroy caches while iterating over a per
    memcg slab caches list.

    Also using the global mutex simplifies synchronization between concurrent
    per memcg caches creation/destruction, e.g. mem_cgroup_destroy_all_caches
    vs __kmem_cache_destroy_memcg_children.

    The downside of this is that we substitute per-memcg slab_caches_mutex's
    with a hammer-like global mutex, but since we already take either the
    slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
    shouldn't hurt concurrency a lot.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently we have two pairs of kmemcg-related functions that are called on
    slab alloc/free. The first is memcg_{bind,release}_pages that count the
    total number of pages allocated on a kmem cache. The second is
    memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
    counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset is a part of preparations for kmemcg re-parenting. It
    targets at simplifying kmemcg work-flows and synchronization.

    First, it removes async per memcg cache destruction (see patches 1, 2).
    Now caches are only destroyed on memcg offline. That means the caches
    that are not empty on memcg offline will be leaked. However, they are
    already leaked, because memcg_cache_params::nr_pages normally never drops
    to 0, so the destruction work is never scheduled unless kmem_cache_shrink
    is called explicitly. In the future I'm planning to reap such dead caches
    on vmpressure or periodically.

    Second, it substitutes per memcg slab_caches_mutex's with the global
    memcg_slab_mutex, which should be taken during the whole per memcg cache
    creation/destruction path before the slab_mutex (see patch 3). This
    greatly simplifies synchronization among various per memcg cache
    creation/destruction paths.

    I'm still not quite sure about the end picture, in particular I don't know
    whether we should reap dead memcgs' kmem caches periodically or try to
    merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
    more details), but whichever way we choose, this set looks like a
    reasonable change to me, because it greatly simplifies kmemcg work-flows
    and eases further development.

    This patch (of 3):

    After a memcg is offlined, we mark its kmem caches that cannot be deleted
    right now due to pending objects as dead by setting the
    memcg_cache_params::dead flag, so that memcg_release_pages will schedule
    cache destruction (memcg_cache_params::destroy) as soon as the last slab
    of the cache is freed (memcg_cache_params::nr_pages drops to zero).

    I guess the idea was to destroy the caches as soon as possible, i.e.
    immediately after freeing the last object. However, it just doesn't work
    that way, because kmem caches always preserve some pages for the sake of
    performance, so that nr_pages never gets to zero unless the cache is
    shrunk explicitly using kmem_cache_shrink. Of course, we could account
    the total number of objects on the cache or check if all the slabs
    allocated for the cache are empty on kmem_cache_free and schedule
    destruction if so, but that would be too costly.

    Thus we have a piece of code that works only when we explicitly call
    kmem_cache_shrink, but complicates the whole picture a lot. Moreover,
    it's racy in fact. For instance, kmem_cache_shrink may free the last slab
    and thus schedule cache destruction before it finishes checking that the
    cache is empty, which can lead to use-after-free.

    So I propose to remove this async cache destruction from
    memcg_release_pages, and check if the cache is empty explicitly after
    calling kmem_cache_shrink instead. This will simplify things a lot w/o
    introducing any functional changes.

    And regarding dead memcg caches (i.e. those that are left hanging around
    after memcg offline for they have objects), I suppose we should reap them
    either periodically or on vmpressure as Glauber suggested initially. I'm
    going to implement this later.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • CONFIG_MM_OWNER makes no sense. It is not user-selectable, it is only
    selected by CONFIG_MEMCG automatically. So we can kill this option in
    init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.

    Signed-off-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • In mm/swap.c, __lru_cache_add() is exported, but actually there are no
    users outside this file.

    This patch unexports __lru_cache_add() and makes it static. It also
    exports lru_cache_add_file(), as it is used by cifs and fuse, which can
    be loaded as modules.

    Signed-off-by: Jianyu Zhan
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Bob Liu
    Cc: Seth Jennings
    Cc: Joonsoo Kim
    Cc: Rafael Aquini
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Christoph Hellwig
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan