09 Sep, 2015

40 commits

  • s/succees/success/

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    We cache isolate_start_pfn before entering isolate_migratepages(). If a
    pageblock is skipped in isolate_migratepages() for whatever reason,
    cc->migrate_pfn can end up far beyond isolate_start_pfn, and we then
    flush freed pages needlessly. For example, the following scenario is
    possible:

    - assume order-9 compaction, pageblock order is 9
    - isolate_start_pfn is 0x200
    - isolate_migratepages()
      - skip a number of pageblocks
      - start to isolate from pfn 0x600
      - cc->migrate_pfn = 0x620
      - return
    - last_migrated_pfn is set to 0x200
    - check flushing condition
      - current_block_start is set to 0x600
      - last_migrated_pfn < current_block_start then do useless flush

    This spurious flush helps neither performance nor the success rate, so
    this patch fixes it. The simplest way to know the exact position where we
    start isolating migratable pages is to cache it in isolate_migratepages()
    before entering the actual isolation. This patch implements that and
    fixes the problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • alloc_pages_node() might fail when called with NUMA_NO_NODE and
    __GFP_THISNODE on a CPU belonging to a memoryless node. To make the
    local-node fallback more robust and prevent such situations, use
    numa_mem_id(), which was introduced for similar scenarios in the slab
    context.
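
    A minimal hedged sketch of the idea (not the verbatim kernel code;
    __alloc_pages_node() is the debug-checking helper described in the next
    entry):

        /*
         * Resolve NUMA_NO_NODE to the nearest node that actually has memory,
         * so a later __GFP_THISNODE allocation cannot target a memoryless node.
         */
        static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                    unsigned int order)
        {
                if (nid == NUMA_NO_NODE)
                        nid = numa_mem_id();    /* local memory node, never memoryless */

                return __alloc_pages_node(nid, gfp_mask, order);
        }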

    Suggested-by: Christoph Lameter
    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Perform the same debug checks in alloc_pages_node() as are done in
    __alloc_pages_node(), by making the former function a wrapper of the
    latter one.

    In addition to better diagnostics in DEBUG_VM builds for situations
    which have been already fatal (e.g. out-of-bounds node id), there are
    two visible changes for potential existing buggy callers of
    alloc_pages_node():

    - calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
      overflow) was treated as passing NUMA_NO_NODE, and the fallback to the
      local node was applied. This will now be fatal.
    - calling alloc_pages_node() with an offline node will now be checked in
      DEBUG_VM builds. Since it's not fatal if the node was previously online,
      and this patch may expose some existing buggy callers, change the
      VM_BUG_ON in __alloc_pages_node() to VM_WARN_ON.
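
    A hedged sketch of the checks described above (simplified; the exact
    allocator entry point may differ):

        static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
        {
                /* An out-of-bounds nid stays fatal in DEBUG_VM builds... */
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
                /* ...while an offline nid only warns, as it may have been online before. */
                VM_WARN_ON(!node_online(nid));

                return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }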

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: Christoph Lameter
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    Providing a real convenience function for allocations restricted to a
    node was also considered, but the prevailing opinion is that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.
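
    As a hedged illustration (hypothetical caller, not from the patch), an
    allocation genuinely restricted to a node is simply:

        /* Fail rather than fall back to another node. */
        page = __alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);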

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values that would previously have been
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This is merely a politeness: I've not found that shrink_page_list()
    leads to deadlock with the page it holds locked across
    wait_on_page_writeback(); but nevertheless, why hold others off by
    keeping the page locked there?

    And while we're at it: remove the mistaken "not " from the commentary on
    this Case 3 (and a distracting blank line from Case 2, if I may).

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If the list_head is empty then we'll have called list_lru_from_kmem for
    nothing. Move that call inside of the list_empty if block.
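
    A hedged sketch of the reordering, shown for list_lru_del() (fragment
    only; field and helper names as in the memcg-aware list_lru code):

        spin_lock(&nlru->lock);
        if (!list_empty(item)) {
                /* Only look up the per-memcg list when there is work to do. */
                l = list_lru_from_kmem(nlru, item);
                list_del_init(item);
                l->nr_items--;
                spin_unlock(&nlru->lock);
                return true;
        }
        spin_unlock(&nlru->lock);
        return false;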

    Signed-off-by: Jeff Layton
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
    In the log_early() function, crt_early_log should also be incremented
    when 'crt_early_log >= ARRAY_SIZE(early_log)'. Otherwise the count
    reported by kmemleak_init() is one less than the actual number.

    Then, in kmemleak_init(), if the early_log buffer size equals the actual
    number, kmemleak will initialize successfully, so change the warning
    condition to 'crt_early_log > ARRAY_SIZE(early_log)'.
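
    A hedged sketch of the two changes (simplified; the warning text is
    illustrative):

        /* In log_early(): keep counting even when the buffer is full. */
        if (crt_early_log >= ARRAY_SIZE(early_log)) {
                crt_early_log++;
                kmemleak_disable();
                return;
        }

        /* In kmemleak_init(): only warn when the buffer was actually exceeded. */
        if (crt_early_log > ARRAY_SIZE(early_log))
                pr_warning("Early log buffer exceeded (%d), please increase "
                           "DEBUG_KMEMLEAK_EARLY_LOG_SIZE\n", crt_early_log);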

    Signed-off-by: Wang Kai
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Kai
     
    __split_vma() needs neither the out_err label nor the initialization of err.

    copy_vma() can return NULL directly when kmem_cache_alloc() fails.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
    Shmem uses shmem_recalc_inode() to update i_blocks when it allocates a
    page, undoes a range, or swaps. But the mm can drop a clean page without
    notifying shmem, which makes fstat sometimes return an out-of-date block
    count.

    The problem can be partially solved by adding an
    inode_operations->getattr which calls shmem_recalc_inode() to update
    i_blocks for fstat.

    shmem_recalc_inode() also updates the counters used by statfs and
    vm_committed_as. For those, the situation is unchanged: they still
    suffer from the discrepancy after a clean page is dropped and before the
    function is called by the aforementioned triggers.
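
    A hedged sketch of such a ->getattr (assuming the pre-4.11 getattr
    signature; simplified):

        static int shmem_getattr(struct vfsmount *mnt, struct dentry *dentry,
                                 struct kstat *stat)
        {
                struct inode *inode = dentry->d_inode;
                struct shmem_inode_info *info = SHMEM_I(inode);

                /* Refresh i_blocks before it is copied into the kstat. */
                spin_lock(&info->lock);
                shmem_recalc_inode(inode);
                spin_unlock(&info->lock);

                generic_fillattr(inode, stat);
                return 0;
        }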

    Signed-off-by: Yu Zhao
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
    Since commit e3239ff92a17 ("memblock: Rename memblock_region to
    memblock_type and memblock_property to memblock_region"), most local
    variables of type memblock_type have been named 'type'. This commit
    renames the remaining local variables of that type accordingly, for
    consistency.

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    memory_failure() can be called on any page at any time, which means that
    we can't eliminate the possibility of containment failure. In such cases
    the best option is to leak the page intentionally (and never touch it
    later).

    We have an unpoison function for testing, and it cannot handle such
    containment-failed pages, which results in a kernel panic (visible with
    various calltraces). So this patch limits the unpoisonable pages to
    properly contained pages and ignores any others.

    Testers should keep in mind that there are pages which cannot be
    unpoisoned when writing test programs.

    Signed-off-by: Naoya Horiguchi
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Wanpeng Li reported a race between soft_offline_page() and
    unpoison_memory(), which causes the following kernel panic:

    BUG: Bad page state in process bash pfn:97000
    page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
    flags: 0x1fffff80080048(uptodate|active|swapbacked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x40(active)
    Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
    CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
    Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
    Call Trace:
    dump_stack+0x48/0x5c
    bad_page+0xe6/0x140
    free_pages_prepare+0x2f9/0x320
    ? uncharge_list+0xdd/0x100
    free_hot_cold_page+0x40/0x170
    __put_single_page+0x20/0x30
    put_page+0x25/0x40
    unmap_and_move+0x1a6/0x1f0
    migrate_pages+0x100/0x1d0
    ? kill_procs+0x100/0x100
    ? unlock_page+0x6f/0x90
    __soft_offline_page+0x127/0x2a0
    soft_offline_page+0xa6/0x200

    This race is explained like below:

    CPU0                                    CPU1

    soft_offline_page
      __soft_offline_page
        TestSetPageHWPoison
                                            unpoison_memory
                                              PageHWPoison check (true)
                                              TestClearPageHWPoison
                                              put_page -> release refcount held by
                                                          get_hwpoison_page in unpoison_memory
        put_page -> release refcount held by
                    isolate_lru_page in __soft_offline_page
        migrate_pages

    The second put_page() releases the refcount held by isolate_lru_page(),
    which leads to unmap_and_move() dropping the last refcount of the page
    while its mapcount is still 1, since try_to_unmap() is not called when
    only one user maps the page. In any case, the page refcount and mapcount
    would still end up inconsistent if the page were mapped by multiple
    users.

    This race was introduced by commit 4491f71260 ("mm/memory-failure: set
    PageHWPoison before migrate_pages()"), which focuses on preventing the
    reuse of a successfully migrated page. Before this commit we prevented
    the reuse by changing the migratetype to MIGRATE_ISOLATE during soft
    offlining, which has the following problems, so simply reverting the
    commit is not the best option:

    1) it doesn't eliminate the reuse completely, because
    set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
    target page if the pageblock of the page contains one or more
    unmovable pages (i.e. has_unmovable_pages() returns true).

    2) the original code changes migratetype to MIGRATE_ISOLATE
    forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
    regardless of the original migratetype state, which could impact
    other subsystems like memory hotplug or compaction.

    This patch moves SetPageHWPoison just after put_page() in
    unmap_and_move(), which closes the reported race window and minimizes
    another race window between SetPageHWPoison and reallocation (which
    causes the reuse of a soft-offlined page). The latter race window still
    exists, but that is acceptable: it is rare and, even if it happens, it is
    effectively the same as the ordinary "containment failure" case, so
    keeping the window open is tolerable.
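
    A hedged sketch of the reordering (not the verbatim diff;
    num_poisoned_pages_inc() is the wrapper from the related patch in this
    series):

        /* In unmap_and_move(), for the memory-failure case only: */
        if (reason == MR_MEMORY_FAILURE) {
                /* Drop migration's reference first, then poison the page. */
                put_page(page);
                if (!TestSetPageHWPoison(page))
                        num_poisoned_pages_inc();
        }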

    Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
    Signed-off-by: Wanpeng Li
    Signed-off-by: Naoya Horiguchi
    Reported-by: Wanpeng Li
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • num_poisoned_pages counter will be changed outside mm/memory-failure.c
    by a subsequent patch, so this patch prepares wrappers to manipulate it.
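
    A hedged sketch of the wrappers (assuming num_poisoned_pages remains an
    atomic_long_t):

        static inline void num_poisoned_pages_inc(void)
        {
                atomic_long_inc(&num_poisoned_pages);
        }

        static inline void num_poisoned_pages_dec(void)
        {
                atomic_long_dec(&num_poisoned_pages);
        }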

    Signed-off-by: Naoya Horiguchi
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Replace most instances of put_page() in memory error handling with
    put_hwpoison_page().

    Signed-off-by: Wanpeng Li
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Hwpoison injection takes a refcount of the target page and another
    refcount of the head page of a THP if the target page is a tail page of
    that THP. However, the current code doesn't release the refcount of the
    head page if injection into the THP is rejected by the hwpoison filter.

    Fix it by dropping the refcount of the head page if the target page is
    the tail page of a THP and injection into it is not allowed.

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Introduce put_hwpoison_page to put refcount for memory error handling.
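
    A hedged sketch of the helper (simplified):

        void put_hwpoison_page(struct page *page)
        {
                struct page *head = compound_head(page);

                /*
                 * Compound pages (hugetlb and THP) are pinned on the head page,
                 * so the reference taken by get_hwpoison_page() must be dropped
                 * there, not on the tail page that was passed in.
                 */
                if (PageHuge(head) || PageTransHuge(head)) {
                        put_page(head);
                        return;
                }

                put_page(page);
        }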

    Signed-off-by: Wanpeng Li
    Suggested-by: Naoya Horiguchi
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • There is a race between madvise_hwpoison path and memory_failure:

    CPU0                                    CPU1

    madvise_hwpoison
      get_user_pages_fast
      PageHWPoison check (false)
                                            memory_failure
                                              TestSetPageHWPoison
      soft_offline_page
        PageHWPoison check (true)
        return -EBUSY (without put_page)
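
    A hedged sketch of the obvious remedy in the soft_offline_page() path
    (hypothetical placement; the actual fix may instead use the
    put_hwpoison_page() helper from this series):

        if (PageHWPoison(page)) {
                pr_info("soft offline: %#lx page already poisoned\n", pfn);
                if (flags & MF_COUNT_INCREASED)
                        put_page(page); /* drop the reference taken by the caller */
                return -EBUSY;
        }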

    Signed-off-by: Wanpeng Li
    Suggested-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    THP pages get a refcount in madvise_hwpoison() via the MF_COUNT_INCREASED
    flag; however, that refcount is still held when splitting the THP fails.

    Fix it by dropping the refcount of the THP page when the split fails.

    Signed-off-by: Wanpeng Li
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The early_ioremap library now has a generic copy_from_early_mem()
    function. Use the generic copy function for x86 relocate_initrd().

    [akpm@linux-foundation.org: remove MAX_MAP_CHUNK define, per Yinghai Lu]
    Signed-off-by: Mark Salter
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Arnd Bergmann
    Cc: Ard Biesheuvel
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
  • The use of mem= could leave part or all of the initrd outside of the
    kernel linear map. This will lead to an error when unpacking the initrd
    and a probable failure to boot. This patch catches that situation and
    relocates the initrd to be fully within the linear map.

    Signed-off-by: Mark Salter
    Acked-by: Will Deacon
    Cc: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: Ard Biesheuvel
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
    When booting an arm64 kernel with an initrd using UEFI/grub, use of mem=
    will likely cut off part or all of the initrd. This leaves it outside
    the kernel linear map, which leads to failure when unpacking. The x86
    code has a similar need to relocate an initrd outside of mapped memory
    in some cases.

    The current x86 code uses early_memremap() to copy the original initrd
    from unmapped to mapped RAM. This patchset creates a generic
    copy_from_early_mem() utility based on that x86 code and has arm64 and
    x86 share it in their respective initrd relocation code.

    This patch (of 3):

    In some early boot circumstances, it may be necessary to copy from RAM
    outside the kernel linear mapping to mapped RAM. The need to relocate
    an initrd is one example in the x86 code. This patch creates a helper
    function based on current x86 code.
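
    A hedged sketch of the helper, closely following the x86 code it is
    based on (MAX_MAP_CHUNK is assumed to be bounded by the fixmap BTMAP
    area):

        #define MAX_MAP_CHUNK   (NR_FIX_BTMAPS << PAGE_SHIFT)

        void __init copy_from_early_mem(void *dest, phys_addr_t src,
                                        unsigned long size)
        {
                unsigned long slop, clen;
                char *p;

                while (size) {
                        /* Map at most one chunk at a time, page-aligned. */
                        slop = offset_in_page(src);
                        clen = size;
                        if (clen > MAX_MAP_CHUNK - slop)
                                clen = MAX_MAP_CHUNK - slop;
                        p = early_memremap(src & PAGE_MASK, clen + slop);
                        memcpy(dest, p + slop, clen);
                        early_memunmap(p, clen + slop);
                        dest += clen;
                        src += clen;
                        size -= clen;
                }
        }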

    Signed-off-by: Mark Salter
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Arnd Bergmann
    Cc: Ard Biesheuvel
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
    The URL for libhugetlbfs has changed. Also, put a stronger emphasis on
    using libhugetlbfs for hugetlb regression testing.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    The hugetlb selftests provide minimal coverage. Have the run script point
    people at libhugetlbfs for better regression testing.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This manually reverts 7e50533d4b842 ("selftests: add hugetlbfstest").

    The hugetlbfstest test depends on hugetlb pages being counted in a
    task's rss. This functionality is not in the kernel, so the test will
    always fail. Remove test to avoid confusion.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • The compaction free scanner is looking for PageBuddy() pages and
    skipping all others. For large compound pages such as THP or hugetlbfs,
    we can save a lot of iterations if we skip them at once using their
    compound_order(). This is generally unsafe and we can read a bogus
    value of order due to a race, but if we are careful, the only danger is
    skipping too much.
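
    A hedged sketch of the skip in the free scanner (blockpfn and cursor
    stand for the scanner's local position variables; fragment only):

        if (PageCompound(page)) {
                unsigned int comp_order = compound_order(page);

                /* The racy read is tolerated; insane values are filtered out. */
                if (likely(comp_order < MAX_ORDER)) {
                        blockpfn += (1UL << comp_order) - 1;
                        cursor += (1UL << comp_order) - 1;
                }

                goto isolate_fail;
        }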

    When tested with stress-highalloc from mmtests on 4GB system with 1GB
    hugetlbfs pages, the vmstat compact_free_scanned count decreased by at
    least 15%.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compaction migrate scanner tries to skip THP pages by their order,
    to reduce number of iterations for pages it cannot isolate. The check
    is only done if PageLRU() is true, which means it applies to THP pages,
    but not e.g. hugetlbfs pages or any other non-LRU compound pages, which
    we have to iterate by base pages.

    This limitation comes from the assumption that it's only safe to read
    compound_order() when we have the zone's lru_lock and THP cannot be
    split under us. But the only danger (after filtering out order values
    that are not below MAX_ORDER, to prevent overflows) is that we skip too
    much or too little after reading a bogus compound_order() due to a rare
    race. This is the same reasoning as patch 99c0fd5e51c4 ("mm,
    compaction: skip buddy pages by their order in the migrate scanner")
    introduced for unsafely reading PageBuddy() order.

    After this patch, all pages are tested for PageCompound() and we skip
    them by compound_order(). The test is done after the test for
    balloon_page_movable(), as we don't want to assume whether balloon pages
    (or other pages with their own isolation and migration implementation,
    should a generic API get implemented) are compound or not.
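
    A hedged sketch of the migrate-scanner counterpart (low_pfn stands for
    the scanner's position; fragment only):

        if (PageCompound(page)) {
                unsigned int comp_order = compound_order(page);

                /* Filter out bogus values read without any protection. */
                if (likely(comp_order < MAX_ORDER))
                        low_pfn += (1UL << comp_order) - 1;

                goto isolate_fail;
        }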

    When tested with stress-highalloc from mmtests on 4GB system with 1GB
    hugetlbfs pages, the vmstat compact_migrate_scanned count decreased by
    15%.

    [kirill.shutemov@linux.intel.com: change PageTransHuge checks to PageCompound for different series was squashed here]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Resetting the cached compaction scanner positions is now open-coded in
    __reset_isolation_suitable() and compact_finished(). Encapsulate this
    functionality in a new function, reset_cached_positions().
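
    A hedged sketch of the new helper:

        static void reset_cached_positions(struct zone *zone)
        {
                /* Point both scanners back at the zone edges. */
                zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
                zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
                zone->compact_cached_free_pfn = zone_end_pfn(zone);
        }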

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Handling the position where compaction free scanner should restart
    (stored in cc->free_pfn) got more complex with commit e14c720efdd7 ("mm,
    compaction: remember position within pageblock in free pages scanner").
    Currently the position is updated in each loop iteration of
    isolate_freepages(), although it should be enough to update it only when
    breaking from the loop. There's also an extra check outside the loop that
    updates the position in case we have met the migration scanner.

    This can be simplified if we move the test for having isolated enough
    from the for-loop header next to the test for contention, and
    determining the restart position only in these cases. We can reuse the
    isolate_start_pfn variable for this instead of setting cc->free_pfn
    directly. Outside the loop, we can simply set cc->free_pfn to current
    value of isolate_start_pfn without any extra check.

    Also add a VM_BUG_ON to catch possible mistakes in the future, in case we
    later add a new condition that terminates isolate_freepages_block()
    prematurely without also considering the condition in
    isolate_freepages().

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Assorted compaction cleanups and optimizations. The interesting patches
    are 4 and 5. In patch 4, skipping of compound pages in a single iteration
    is improved for the migration scanner, so it also works for !PageLRU
    compound pages such as hugetlbfs, slab etc. Patch 5 introduces this kind
    of skipping in the free scanner. The trick is that we can read
    compound_order() without any protection, if we are careful to filter out
    values larger than MAX_ORDER. The only danger is that we skip too much.
    The same trick was already used for reading the freepage order in the
    migrate scanner.

    To demonstrate improvements of Patches 4 and 5 I've run stress-highalloc
    from mmtests, set to simulate THP allocations (including __GFP_COMP) on
    a 4GB system where 1GB was occupied by hugetlbfs pages. I'll include
    just the relevant stats:

                                       Patch 3    Patch 4    Patch 5

    Compaction stalls                     7523       7529       7515
    Compaction success                     323        304        322
    Compaction failures                   7200       7224       7192
    Page migrate success                247778     264395     240737
    Page migrate failure                 15358      33184      21621
    Compaction pages isolated           906928     980192     909983
    Compaction migrate scanned         2005277    1692805    1498800
    Compaction free scanned           13255284   11539986    9011276
    Compaction cost                        288        305        277

    With 5 iterations per patch, the results are still noisy, but we can see
    that Patch 4 does reduce migrate_scanned by 15% thanks to skipping the
    hugetlbfs pages at once. Interestingly, free_scanned is also reduced
    and I have no idea why. Patch 5 further reduces free_scanned as
    expected, by 15%. Other stats are unaffected modulo noise.

    [1] https://lkml.org/lkml/2015/1/19/158

    This patch (of 5):

    Compaction should finish when the migration and free scanners meet, i.e.
    they reach the same pageblock. Currently however, the test in
    compact_finished() simply compares the exact pfns, which may yield a
    false negative when the free scanner position is in the middle of a
    pageblock and the migration scanner reaches the beginning of the same
    pageblock.

    This hasn't been a problem until commit e14c720efdd7 ("mm, compaction:
    remember position within pageblock in free pages scanner") allowed the
    free scanner position to be in the middle of a pageblock between
    invocations. The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent
    infinite loop in compact_zone") prevented the issue by adding a special
    check in the migration scanner to satisfy the current detection of
    scanners meeting.

    However, the proper fix is to make the detection more robust. This
    patch introduces the compact_scanners_met() function that returns true
    when the free scanner position is in the same or lower pageblock than
    the migration scanner. The special case in isolate_migratepages()
    introduced by 1d5bfe1ffb5b is removed.
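
    A hedged sketch of the new test:

        /*
         * Scanners have met when the free scanner is in the same or a lower
         * pageblock than the migration scanner.
         */
        static inline bool compact_scanners_met(struct compact_control *cc)
        {
                return (cc->free_pfn >> pageblock_order)
                        <= (cc->migrate_pfn >> pageblock_order);
        }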

    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Add a [pci|dma]_pool_zalloc coccinelle check. It replaces instances of
    [pci|dma]_pool_alloc() followed by memset(0) with
    [pci|dma]_pool_zalloc().
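
    The transformation the semantic patch performs, sketched in C (pool,
    flags, dma and size are placeholder names):

        /* before */
        p = dma_pool_alloc(pool, flags, &dma);
        if (p)
                memset(p, 0, size);

        /* after */
        p = dma_pool_zalloc(pool, flags, &dma);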

    Signed-off-by: Sean O. Stalley
    Acked-by: Julia Lawall
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
  • Add a wrapper function for pci_pool_alloc() to get zeroed memory.
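
    A hedged sketch of the wrapper (pci_pool is a thin alias layer over
    dma_pool):

        #define pci_pool_zalloc(pool, flags, handle) \
                dma_pool_zalloc(pool, flags, handle)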

    Signed-off-by: Sean O. Stalley
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
  • Add a wrapper function for dma_pool_alloc() to get zeroed memory.
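
    A hedged sketch of the wrapper:

        static inline void *dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags,
                                            dma_addr_t *handle)
        {
                return dma_pool_alloc(pool, mem_flags | __GFP_ZERO, handle);
        }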

    Signed-off-by: Sean O. Stalley
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
    Currently a call to dma_pool_alloc() with the __GFP_ZERO flag returns a
    non-zeroed memory region.

    This patchset adds support for the __GFP_ZERO flag to dma_pool_alloc(),
    adds 2 wrapper functions for allocating zeroed memory from a pool, and
    provides a coccinelle script for finding & replacing instances of
    dma_pool_alloc() followed by memset(0) with a single dma_pool_zalloc()
    call.

    There was some concern that this always calls memset() to zero, instead
    of passing __GFP_ZERO into the page allocator.
    [https://lkml.org/lkml/2015/7/15/881]

    I ran a test on my system to get an idea of how often dma_pool_alloc()
    calls into pool_alloc_page().

    After Boot: [ 30.119863] alloc_calls:541, page_allocs:7
    After an hour: [ 3600.951031] alloc_calls:9566, page_allocs:12
    After copying 1GB file onto a USB drive:
    [ 4260.657148] alloc_calls:17225, page_allocs:12

    It doesn't look like dma_pool_alloc() calls down to the page allocator
    very often (at least on my system).

    This patch (of 4):

    Currently the __GFP_ZERO flag is ignored by dma_pool_alloc().
    Make dma_pool_alloc() zero the memory if this flag is set.
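
    A hedged sketch of the change inside dma_pool_alloc() (simplified;
    retval is the block about to be returned):

        if (mem_flags & __GFP_ZERO)
                memset(retval, 0, pool->size);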

    Signed-off-by: Sean O. Stalley
    Acked-by: David Rientjes
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
    reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
    the number of pages removed from the candidate list. But
    shrink_page_list() puts back mlocked pages without passing them to the
    caller and without counting them as nr_reclaimed. This leaves
    nr_isolated elevated.

    To fix this, this patch changes shrink_page_list() to pass unevictable
    pages back to the caller, which will take care of those pages.

    Minchan said:

    It fixes two issues.

    1. With unevictable page, cma_alloc will be successful.

    Exactly speaking, cma_alloc of current kernel will fail due to
    unevictable pages.

    2. fix leaking of NR_ISOLATED counter of vmstat

    With it, too_many_isolated works. Otherwise, it could hang until the
    process gets SIGKILL.
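
    A hedged sketch of the change at the cull_mlocked path of
    shrink_page_list() (ret_pages is the list handed back to the caller;
    not the verbatim diff):

        cull_mlocked:
                if (PageSwapCache(page))
                        try_to_free_swap(page);
                unlock_page(page);
                /* Hand the page back instead of putting it back on the LRU here. */
                list_add(&page->lru, &ret_pages);
                continue;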

    Signed-off-by: Jaewon Kim
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     
  • If transparent huge pages are enabled, we can isolate many more pages
    than we actually need to scan, because we count both single and huge
    pages equally in isolate_lru_pages().

    Since commit 5bc7b8aca942d ("mm: thp: add split tail pages to shrink
    page list in page reclaim"), we scan all the tail pages immediately
    after a huge page split (see shrink_page_list()). As a result, we can
    reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run!

    This is easy to catch on memcg reclaim with zswap enabled. The latter
    makes swapout instant so that if we happen to scan an unreferenced huge
    page we will evict both its head and tail pages immediately, which is
    likely to result in excessive reclaim.
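
    One plausible way to cap the batch, per the description above (hedged
    sketch, not necessarily the upstream diff):

        /* Inside the isolation loop: a THP counts as the number of base
         * pages it covers, so stop once enough base pages were taken. */
        nr_taken += hpage_nr_pages(page);
        if (nr_taken >= nr_to_scan)
                break;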

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • __nocast does no good for vm_flags_t. It only produces useless sparse
    warnings.

    Let's drop it.
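
    The change, sketched (assuming the typedef lives in mm_types.h):

        /* before */
        typedef unsigned long __nocast vm_flags_t;

        /* after */
        typedef unsigned long vm_flags_t;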

    Signed-off-by: Kirill A. Shutemov
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Bootmem isn't popular any more, but some architectures still use it, and
    freeing to bootmem after calling free_all_bootmem_core() can end up
    scribbling over random memory. Instead, make sure the kernel generates a
    warning in this case by ensuring the node_bootmem_map field is non-NULL
    when we are freeing or marking bootmem.

    An instance of this bug was just fixed in the tile architecture ("tile:
    use free_bootmem_late() for initrd") and catching this case more widely
    seems like a good thing.

    Signed-off-by: Chris Metcalf
    Acked-by: Mel Gorman
    Cc: Yasuaki Ishimatsu
    Cc: Pekka Enberg
    Cc: Paul McQuade
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
    Nowadays, set/unset_migratetype_isolate() are defined and used only in
    mm/page_isolation.c, so let's limit their scope to that file.

    Signed-off-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    This check was introduced as part of commit 6f4576e3687 ("mempolicy:
    apply page table walker on queue_pages_range()") and was then duplicated
    by commit 48684a65b4e ("mm: pagewalk: fix misbehavior of walk_page_range
    for vma(VM_PFNMAP)"), which reintroduced it earlier, in
    queue_pages_test_walk().

    Signed-off-by: Aristeu Rozanski
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aristeu Rozanski