24 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have also been changed
    to vm_fault_t.

    The call sites of handle_mm_fault() will be converted to vm_fault_t in
    a separate patch.

    vmf_error() is an inline function newly introduced in 4.17-rc6; a usage
    sketch follows this entry.

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
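
    A minimal sketch of what a converted handler looks like; my_fault() and
    my_do_read() are made-up names for illustration, only vm_fault_t and
    vmf_error() come from this work:

    #include <linux/mm.h>
    #include <linux/mm_types.h>

    /* Illustrative only: a ->fault handler after the conversion returns
     * vm_fault_t (a VM_FAULT_* value) and uses vmf_error() to translate
     * an errno from a helper into a VM_FAULT_* code. */
    static vm_fault_t my_fault(struct vm_fault *vmf)
    {
            struct page *page;
            int err = my_do_read(vmf, &page);       /* hypothetical helper */

            if (err)
                    return vmf_error(err);          /* e.g. -ENOMEM -> VM_FAULT_OOM */

            vmf->page = page;
            return 0;
    }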
     

23 Aug, 2018

1 commit

    __paginginit is the same thing as __meminit, except on platforms without
    sparsemem, where it is defined as __init.

    Remove __paginginit and use __meminit instead. Use __ref in the single
    function that mixes __meminit and __init sections: setup_usemap().

    Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
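
    A sketch of the change described above, assuming the old helper looked
    roughly like this (prototype of setup_usemap() abbreviated):

    #include <linux/init.h>
    #include <linux/mmzone.h>

    /* Before (sketch): the extra alias depended on sparsemem */
    #ifdef CONFIG_SPARSEMEM
    #define __paginginit __meminit
    #else
    #define __paginginit __init
    #endif

    /* After: callers are annotated __meminit directly, and the one
     * function that mixes __meminit and __init sections gets __ref to
     * suppress the section-mismatch warning. */
    static void __ref setup_usemap(struct pglist_data *pgdat, struct zone *zone,
                                   unsigned long zone_start_pfn,
                                   unsigned long zonesize)
    {
            /* ... body unchanged ... */
    }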
     

06 Jun, 2018

1 commit

  • Pull xfs updates from Darrick Wong:
    "New features this cycle include the ability to relabel mounted
    filesystems, support for fallocated swapfiles, and using FUA for pure
    data O_DSYNC directio writes. With this cycle we begin to integrate
    online filesystem repair and refactor the growfs code in preparation
    for eventual subvolume support, though the road ahead for both
    features is quite long.

    There are also numerous refactorings of the iomap code to remove
    unnecessary log overhead, to disentangle some of the quota code, and
    to prepare for buffer head removal in a future upstream kernel.

    Metadata validation continues to improve, both in the hot path
    verifiers and the online filesystem check code. I anticipate sending a
    second pull request in a few days with more metadata validation
    improvements.

    This series has been run through a full xfstests run over the weekend
    and through a quick xfstests run against this morning's master, with
    no major failures reported.

    Summary:

    - Strengthen inode number and structure validation when allocating
    inodes.

    - Reduce pointless buffer allocations during cache miss

    - Use FUA for pure data O_DSYNC directio writes

    - Various iomap refactorings

    - Strengthen quota metadata verification to avoid unfixable broken
    quota

    - Make AGFL block freeing a deferred operation to avoid blowing out
    transaction reservations when running complex operations

    - Get rid of the log item descriptors to reduce log overhead

    - Fix various reflink bugs where inodes were double-joined to
    transactions

    - Don't issue discards when trimming unwritten extents

    - Refactor incore dquot initialization and retrieval interfaces

    - Fix some locking problems in the quota scrub code

    - Strengthen btree structure checks in scrub code

    - Rewrite swapfile activation to use iomap and support unwritten
    extents

    - Make scrub exit to userspace sooner when corruptions or
    cross-referencing problems are found

    - Make scrub invoke the data fork scrubber directly on metadata
    inodes

    - Don't do background reclamation of post-eof and cow blocks when the
    fs is suspended

    - Fix secondary superblock buffer lifespan hinting

    - Refactor growfs to use table-dispatched functions instead of long
    stringy functions

    - Move growfs code to libxfs

    - Implement online fs label getting and setting

    - Introduce online filesystem repair (in a very limited capacity)

    - Fix unit conversion problems in the realtime freemap iteration
    functions

    - Various refactorings and cleanups in preparation to remove buffer
    heads in a future release

    - Reimplement the old bmap call with iomap

    - Remove direct buffer head accesses from seek hole/data

    - Various bug fixes"

    * tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
    fs: use ->is_partially_uptodate in page_cache_seek_hole_data
    fs: remove the buffer_unwritten check in page_seek_hole_data
    fs: move page_cache_seek_hole_data to iomap.c
    xfs: use iomap_bmap
    iomap: add an iomap-based bmap implementation
    iomap: add a iomap_sector helper
    iomap: use __bio_add_page in iomap_dio_zero
    iomap: move IOMAP_F_BOUNDARY to gfs2
    iomap: fix the comment describing IOMAP_NOWAIT
    iomap: inline data should be an iomap type, not a flag
    mm: split ->readpages calls to avoid non-contiguous pages lists
    mm: return an unsigned int from __do_page_cache_readahead
    mm: give the 'ret' variable a better name __do_page_cache_readahead
    block: add a lower-level bio_add_page interface
    xfs: fix error handling in xfs_refcount_insert()
    xfs: fix xfs_rtalloc_rec units
    xfs: strengthen rtalloc query range checks
    xfs: xfs_rtbuf_get should check the bmapi_read results
    xfs: xfs_rtword_t should be unsigned, not signed
    dax: change bdev_dax_supported() to support boolean returns
    ...

    Linus Torvalds
     

02 Jun, 2018

1 commit


25 May, 2018

1 commit

  • This reverts the following commits that change CMA design in MM.

    3d2054ad8c2d ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")

    1d47a3ec09b5 ("mm/cma: remove ALLOC_CMA")

    bad8c6c0b114 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")

    Ville reported the following error on i386.

    Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
    microcode: microcode updated early to revision 0x4, date = 2013-06-28
    Initializing CPU#0
    Initializing HighMem for node 0 (000377fe:00118000)
    Initializing Movable for node 0 (00000001:00118000)
    BUG: Bad page state in process swapper pfn:377fe
    page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
    flags: 0x80000000()
    raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
    page dumped because: nonzero mapcount
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
    Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
    Call Trace:
    dump_stack+0x60/0x96
    bad_page+0x9a/0x100
    free_pages_check_bad+0x3f/0x60
    free_pcppages_bulk+0x29d/0x5b0
    free_unref_page_commit+0x84/0xb0
    free_unref_page+0x3e/0x70
    __free_pages+0x1d/0x20
    free_highmem_page+0x19/0x40
    add_highpages_with_active_regions+0xab/0xeb
    set_highmem_pages_init+0x66/0x73
    mem_init+0x1b/0x1d7
    start_kernel+0x17a/0x363
    i386_start_kernel+0x95/0x99
    startup_32_smp+0x164/0x168

    The reason for this error is that the span of the MOVABLE zone is
    extended to the whole node span for future CMA initialization, and
    normal memory is wrongly freed here. I submitted a fix and it seems to
    work, but another problem happened.

    It is too late in the cycle to fix that second problem, so I decided to
    revert the series.

    Reported-by: Ville Syrjälä
    Acked-by: Laura Abbott
    Acked-by: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Apr, 2018

4 commits

    Now all pages reserved for the CMA region belong to ZONE_MOVABLE, and
    they only serve requests with GFP_HIGHMEM && GFP_MOVABLE.

    Therefore, we don't need to maintain ALLOC_CMA at all.

    Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "mm/cma: manage the memory of the CMA area by using the
    ZONE_MOVABLE", v2.

    0. History

    This patchset is the follow-up of the discussion about the "Introduce
    ZONE_CMA (v7)" [1]. Please reference it if more information is needed.

    1. What does this patch do?

    This patch changes how the memory of the CMA area is managed in the MM
    subsystem. Currently the memory of the CMA area is managed by the zone
    its pfn belongs to. However, this approach has some problems since the
    MM subsystem doesn't have enough logic to handle memory with different
    characteristics within a single zone. To solve this issue, this patch
    manages all the memory of the CMA area through the MOVABLE zone. From
    the MM subsystem's point of view, the characteristics of memory in the
    MOVABLE zone and memory in the CMA area are the same, so managing the
    memory of the CMA area through the MOVABLE zone does not cause any
    problem.

    2. Motivation

    There are several problems with the current approach; see below. These
    problems are not inherent and could be fixed without this conceptual
    change, but doing so would require adding many hooks in various code
    paths, which would be intrusive to core MM and really error-prone.
    Therefore, I try to solve them with this new approach. The problems
    with the current implementation are as follows.

    o CMA memory utilization

    First, this is the freepage calculation logic in MM (modelled in the
    sketch after this entry):

    - For movable allocation: freepage = total freepage
    - For unmovable allocation: freepage = total freepage - CMA freepage

    Freepages in the CMA area are only used after the normal freepages of
    the zone that the CMA memory belongs to are exhausted. At that moment
    the number of normal freepages is zero, so

    - For movable allocation: freepage = total freepage = CMA freepage
    - For unmovable allocation: freepage = 0

    If an unmovable allocation comes in at this moment, the request fails
    the watermark check and reclaim is started. After reclaim, normal
    freepages exist again, so the freepages in the CMA area are still not
    used.

    FYI, there is another attempt [2] at solving this problem on lkml, and,
    as far as I know, Qualcomm also has an out-of-tree solution for it.

    Useless reclaim:

    There is no logic to distinguish CMA pages in the reclaim path, so CMA
    pages are reclaimed even when the system only needs pages that are
    usable for kernel allocations.

    Atomic allocation failure:

    This is also related to the fallback allocation policy for the memory
    of the CMA area. Consider the situation where the number of normal
    freepages is *zero* because a bunch of movable allocation requests have
    come in. Kswapd would not be woken up due to the following freepage
    calculation logic:

    - For movable allocation: freepage = total freepage = CMA freepage

    If an atomic unmovable allocation request comes in at this moment, it
    fails due to the following logic:

    - For unmovable allocation: freepage = total freepage - CMA freepage = 0

    This was reported by Aneesh [3].

    Useless compaction:

    The usual high-order allocation request is an unmovable allocation
    request, which cannot be served from the memory of the CMA area. In
    compaction, the migration scanner tries to migrate pages into the CMA
    area and build high-order pages there. As mentioned above, these cannot
    be used for unmovable allocation requests, so the work is just wasted.

    3. Current approach and new approach

    The current approach is that the memory of the CMA area is managed by
    the zone its pfn belongs to. However, this memory needs to be
    distinguishable since it has a strong limitation, so it is marked as
    MIGRATE_CMA in the pageblock flags and handled specially. However, as
    mentioned in section 2, the MM subsystem doesn't have enough logic to
    deal with this special pageblock, so many problems arose.

    The new approach is that the memory of the CMA area is managed by the
    MOVABLE zone. MM already has enough logic to deal with special zones
    such as HIGHMEM and MOVABLE. So managing the memory of the CMA area
    through the MOVABLE zone just naturally works well, because the
    constraint on CMA memory - that it must always be migratable - is the
    same as the constraint on the MOVABLE zone.

    There is one side-effect for the usability of the memory of the CMA
    area. The MOVABLE zone is only used for requests with GFP_HIGHMEM &&
    GFP_MOVABLE, so the memory of the CMA area is now also restricted to
    that gfp combination. Before this patchset, any GFP_MOVABLE request
    could use it. IMO, this is not a big issue since most GFP_MOVABLE
    requests also have the GFP_HIGHMEM flag - for example, file cache pages
    and anonymous pages. However, file cache pages for blockdev files are
    an exception: requests for them carry no GFP_HIGHMEM flag. There are
    pros and cons to this exception. In my experience, blockdev file cache
    pages are one of the top reasons cma_alloc() fails temporarily, so we
    get a stronger guarantee of cma_alloc() success by excluding this case.

    Note that there is no change from the admin's point of view, since this
    patchset only changes the internal implementation of the MM subsystem.
    The one minor difference for admins is that the memory stats for the
    CMA area will be reported under the MOVABLE zone. That's all.

    4. Result

    Following are the experimental results related to the utilization
    problem (8 CPUs, 1024 MB, virtual machine, make -j16). The first block
    is without this series, the second block is with it:

    CMA area:       0 MB        512 MB
    Elapsed-time:   92.4        186.5
    pswpin:         82          18647
    pswpout:        160         69839

    CMA area:       0 MB        512 MB
    Elapsed-time:   93.1        93.4
    pswpin:         84          46
    pswpout:        183         92

    akpm: "kernel test robot" reported a 26% improvement in
    vm-scalability.throughput:
    http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop

    [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
    [2]: https://lkml.org/lkml/2014/10/15/623
    [3]: http://www.spinics.net/lists/linux-mm/msg100562.html

    Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
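
    A standalone model of the freepage accounting described in the
    utilization section above (this is the behavior being replaced, not
    the kernel's literal __zone_watermark_ok() code):

    #include <stdbool.h>

    /* Unmovable allocations may not use CMA pageblocks, so they must
     * subtract the CMA freepages before the watermark comparison. */
    static bool watermark_ok(long total_free, long cma_free,
                             long watermark, bool movable_request)
    {
            long usable = total_free;

            if (!movable_request)
                    usable -= cma_free;     /* unmovable: CMA pages don't count */

            /* Once only CMA freepages remain, unmovable requests see
             * usable == 0, fail this check and kick reclaim, even though
             * plenty of (CMA) memory is still free. */
            return usable > watermark;
    }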
     
    No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey the node_id, or the migration
    error, up to the move_pages code (do_move_page_to_node_array). The
    error status never made it into the final status field, and we now have
    a better way to communicate the node id to the status field. All other
    allocation callbacks simply ignored the argument, so we can finally
    drop it.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
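
    Roughly the kind of interface change described above, sketched with
    illustrative names (the _old/_new suffixes and the parameter name
    "result" are mine, not necessarily the patch's):

    struct page;

    /* before: every allocation callback carried an out-parameter that
     * only new_page_node() used, and even there the error never reached
     * the user-visible status array */
    typedef struct page *new_page_old_t(struct page *page, unsigned long private,
                                        int **result);

    /* after: the argument is simply dropped */
    typedef struct page *new_page_new_t(struct page *page, unsigned long private);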
     
  • Patch series "unclutter thp migration"

    Motivation:

    THP migration is hacked into the generic migration path with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once, and if that is not the
    case it allocates a simple page to migrate into. unmap_and_move then
    fixes that up by splitting the THP into small pages while moving the
    head page to the newly allocated order-0 page. The remaining pages are
    moved to the LRU list by split_huge_page. The same happens if the THP
    allocation fails. This is really ugly and error prone [2].

    I also believe that splitting a THP onto the LRU lists (via
    split_huge_page) is inherently wrong, because the tail pages are then
    not migrated at all. Some callers just work around that by retrying
    (e.g. memory hotplug). Other pfn walkers are simply broken, though:
    e.g. madvise_inject_error migrates the head and then advances the next
    pfn by the huge page size; do_move_page_to_node_array and
    queue_pages_range (migrate_pages, mbind) simply split the THP before
    migration if THP migration is not supported and fall back to single
    page migration, but they don't handle the tail pages if the THP
    migration path cannot allocate a fresh THP, so we end up with ENOMEM
    and fail the whole migration, which is questionable behavior. Page
    compaction doesn't try to migrate large pages, so it should be immune.

    The first patch reworks do_pages_move, which relies on a very ugly
    calling convention where the return status is pushed to the migration
    path via a private pointer, using preallocated fixed-size batching to
    achieve that. We simply cannot do the same if a THP is to be split
    during the migration path, which is done in patch 3. Patch 2 is a
    follow-up cleanup which removes the mentioned return-status calling
    convention ugliness.

    On a side note:

    There are some semantic issues I encountered along the way while
    working on patch 1, but I am not addressing them here. E.g. trying to
    move THP tail pages will result in either success or EBUSY (the latter
    more likely once we isolate the head from the LRU list). Hugetlb
    reports EACCESS on tail pages. Some errors are reported via the status
    parameter but migration failures are not, even though the original
    `reason' argument suggests there was an intention to do so. From a
    quick look into git history this never worked. I have tried to keep
    the semantics unchanged.

    Then there is a relatively minor thing that the page isolation might
    fail because of pages not being on the LRU - e.g. because they are
    sitting on the per-cpu LRU caches. Easily fixable.

    This patch (of 3):

    do_pages_move is supposed to move user-defined memory (an array of
    addresses) to user-defined numa nodes (an array of nodes, one for each
    address). The user-provided status array then contains the resulting
    numa node for each address, or an error. The semantics of this
    function are a little bit confusing because only some errors are
    reported back; notably, a migrate_pages error is only reported via the
    return value. This patch doesn't try to address these semantic nuances
    but rather changes the underlying implementation.

    Currently we are processing user input (which can be really large) in
    batches which are stored to a temporarily allocated page. Each address
    is resolved to its struct page and stored to page_to_node structure
    along with the requested target numa node. The array of these
    structures is then conveyed down the page migration path via private
    argument. new_page_node then finds the corresponding structure and
    allocates the proper target page.

    What is the problem with the current implementation, and why change
    it? Apart from being quite ugly, it also doesn't cope with unexpected
    pages showing up on the migration list inside the migrate_pages path.
    That doesn't happen currently, but a follow-up patch would like to make
    the THP migration code clearer, and that would need to split a THP onto
    the list in some cases.

    How does the new implementation work? Well, instead of batching into a
    fixed size array we simply batch all pages that should be migrated to
    the same node and isolate all of them into a linked list which doesn't
    require any additional storage. This should work reasonably well
    because page migration usually migrates larger ranges of memory to a
    specific node, so the common case should work as well as the current
    implementation. Even if somebody constructs an input where the
    target numa nodes would be interleaved we shouldn't see a large
    performance impact because page migration alone doesn't really benefit
    from batching. mmap_sem batching for the lookup is quite questionable
    and isolate_lru_page which would benefit from batching is not using it
    even in the current implementation.

    Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Anshuman Khandual
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Reale
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
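
    A condensed sketch of the batching scheme described above; the helper
    names (move_batch_to_node, isolate_page_for_migration) follow the
    spirit of the patch but are illustrative, and the user-copy and error
    reporting details are omitted:

    #include <linux/list.h>
    #include <linux/migrate.h>
    #include <linux/mm.h>

    /* Batch all pages headed for the same node onto one list, then hand
     * the whole list to migrate_pages() whenever the target changes. */
    static int do_pages_move_sketch(struct mm_struct *mm, unsigned long nr,
                                    const void **pages, const int *nodes,
                                    int *status)
    {
            LIST_HEAD(pagelist);            /* isolated pages, no extra storage */
            int current_node = NUMA_NO_NODE;
            unsigned long i, start = 0;
            int err = 0;

            for (i = 0; i < nr && !err; i++) {
                    if (current_node == NUMA_NO_NODE) {
                            current_node = nodes[i];
                            start = i;
                    } else if (nodes[i] != current_node) {
                            /* target node changed: flush the current batch */
                            err = move_batch_to_node(mm, current_node, &pagelist,
                                                     status, start, i);
                            current_node = nodes[i];
                            start = i;
                    }
                    if (!err)       /* isolate the page onto the batch list */
                            err = isolate_page_for_migration(mm, pages[i],
                                                             current_node,
                                                             &pagelist);
            }
            if (!err)               /* flush whatever is left */
                    err = move_batch_to_node(mm, current_node, &pagelist,
                                             status, start, i);
            return err;
    }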
     

30 Nov, 2017

1 commit

  • This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.

    It was a nice cleanup in theory, but as Nicolai Stange points out, we do
    need to make the page dirty for the copy-on-write case even when we
    didn't end up making it writable, since the dirty bit is what we use to
    check that we've gone through a COW cycle.

    Reported-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit

  • Currently we make page table entries dirty all the time regardless of
    access type and don't even consider if the mapping is write-protected.
    The reasoning is that we don't really need dirty tracking on THP and
    making the entry dirty upfront may save some time on first write to the
    page.

    Unfortunately, such an approach may result in a false-positive
    can_follow_write_pmd() for the huge zero page or a read-only shmem
    file.

    Let's make the page dirty only if we are about to write to the page
    anyway (as we do for small pages).

    I've restructured the code to set the dirty bit inside
    maybe_p[mu]d_mkwrite(), which also takes into account whether the vma
    is write-protected.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
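
    A sketch of the restructuring described above (not the exact diff; the
    extra "dirty" parameter reflects the description and the real helper
    may differ in detail):

    #include <linux/mm.h>

    /* Mark the entry writable only when the VMA is really writable, and
     * dirty only when the access is actually a write. */
    static pmd_t maybe_pmd_mkwrite_sketch(pmd_t pmd, struct vm_area_struct *vma,
                                          bool dirty)
    {
            if (likely(vma->vm_flags & VM_WRITE)) {
                    pmd = pmd_mkwrite(pmd);
                    if (dirty)
                            pmd = pmd_mkdirty(pmd);
            }
            return pmd;
    }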
     

18 Nov, 2017

1 commit

  • Pageblock skip hints were added as a heuristic for compaction, which
    shares core code with CMA. Since CMA reliability would suffer from the
    heuristics, compact_control flag ignore_skip_hint was added for the CMA
    use case. Since 6815bf3f233e ("mm/compaction: respect ignore_skip_hint
    in update_pageblock_skip") the flag also means that CMA won't *update*
    the skip hints in addition to ignoring them.

    Today, direct compaction can also ignore the skip hints in its
    last-resort attempt, but there's no reason not to set them when
    isolation fails in that case. Thus, this patch splits off a new
    no_set_skip_hint flag to avoid the updating, and only CMA sets it.
    This should improve the heuristics a bit, and allows us to simplify the
    persistent skip bit handling as the next step.

    Link: http://lkml.kernel.org/r/20171102121706.21504-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
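
    A sketch of the flag split, under the assumptions stated in the message
    (simplified struct and update function; only the two field names come
    from the description):

    #include <linux/mm_types.h>
    #include <linux/pageblock-flags.h>

    struct compact_control_sketch {
            bool ignore_skip_hint;      /* don't consult pageblock skip bits */
            bool no_set_skip_hint;      /* don't update them either; only CMA sets this */
            /* ... */
    };

    static void update_pageblock_skip_sketch(struct compact_control_sketch *cc,
                                             struct page *page)
    {
            if (cc->no_set_skip_hint)   /* was: cc->ignore_skip_hint */
                    return;
            set_pageblock_skip(page);   /* record the "isolation failed here" hint */
    }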
     

07 Sep, 2017

2 commits

    For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full access
    to memory reserves. There are a few shortcomings to this
    implementation, though.

    First of all, and most serious, is that full access to memory reserves
    is quite dangerous because we leave no safety room for the system to
    operate and potentially take last emergency steps to move on.

    Secondly, this flag is per task_struct while the OOM killer operates at
    mm_struct granularity, so all processes sharing the given mm are
    killed. Giving full access to all these task_structs could lead to a
    quick depletion of memory reserves. We have tried to reduce this risk
    by giving TIF_MEMDIE only to the main thread and the currently
    allocating task, but that doesn't really solve the problem while it
    surely opens up room for corner cases - e.g. GFP_NO{FS,IO} requests
    might loop inside the allocator without access to memory reserves
    because a particular thread was not the group leader.

    Now that we have the oom reaper, and all oom victims are reapable after
    1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
    kthreads"), we can be more conservative and grant only partial access
    to memory reserves, because there is a reasonable chance of parallel
    memory freeing. We still want some access to reserves because we do
    not want other consumers to eat up the victim's freed memory. OOM
    victims will still contend with __GFP_HIGH users, but those shouldn't
    be aggressive enough to starve oom victims completely.

    Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
    to half of the reserves. This makes access to the reserves independent
    of which task has passed through mark_oom_victim. Also drop any usage
    of TIF_MEMDIE from the page allocator proper and replace it with
    tsk_is_oom_victim as well, finally making page_alloc.c completely
    TIF_MEMDIE free.

    CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
    ALLOC_NO_WATERMARKS approach.

    There is a demand to make the oom killer memcg aware which will imply
    many tasks killed at once. This change will allow such a usecase
    without worrying about complete memory reserves depletion.

    Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
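
    A standalone model of the "half of the reserves" idea described above
    (not the kernel's actual __zone_watermark_ok()/gfp_to_alloc_flags()
    code; the flag values here are arbitrary):

    #define ALLOC_HIGH  0x1     /* __GFP_HIGH callers */
    #define ALLOC_OOM   0x2     /* tsk_is_oom_victim() callers; replaces the
                                   old unlimited ALLOC_NO_WATERMARKS access */

    static long effective_watermark(long min_wmark, unsigned int alloc_flags)
    {
            long min = min_wmark;

            if (alloc_flags & ALLOC_HIGH)
                    min -= min / 2;     /* __GFP_HIGH may dip half-way into reserves */
            if (alloc_flags & ALLOC_OOM)
                    min -= min / 2;     /* oom victims get a further half, rather
                                           than ignoring watermarks entirely */
            return min;
    }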
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call
    it from its only user directly. This also removes a pointless
    zonelists rebuild, which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Aug, 2017

1 commit

    Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being
    held.

    He described the race as follows:

    CPU0                                CPU1
    ----                                ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible, so handle this issue by flushing the mm if
    reclaim is potentially batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup, racing with
    reclaim, look just at PTEs; the huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
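
    A simplified sketch of the core idea: operations such as mprotect,
    munmap and madvise check, under the PTL, whether reclaim has deferred
    (batched) TLB flushes for this mm and flush the whole mm if so. The
    tlb_flush_batched field and the flush_tlb_batched_pending() helper are
    shown here in reduced form, not as the literal patch:

    #include <linux/mm_types.h>
    #include <asm/tlbflush.h>

    static void flush_tlb_batched_pending_sketch(struct mm_struct *mm)
    {
            /* set by the batched-unmap reclaim path when it clears PTEs
             * but defers the TLB flush */
            if (mm->tlb_flush_batched) {
                    flush_tlb_mm(mm);               /* close the stale-TLB window */
                    mm->tlb_flush_batched = false;
            }
    }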
     

13 Jul, 2017

1 commit

    __GFP_REPEAT was designed to add retry-but-eventually-fail semantics to
    the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantics for those requests, and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misguided
    usage of the __GFP_REPEAT flag has been removed for !costly requests,
    we can give the original flag a better name and, more importantly, a
    more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which
    tells the user that the allocator will try really hard but there is no
    promise of success. This works independently of the order and
    overrides the default allocator behavior. Page allocator users have
    several levels of guarantee vs. cost options (take GFP_KERNEL as an
    example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which
    doesn't even kick background reclaim. Should be used carefully because
    it might deplete memory and the next user might hit more aggressive
    reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already relied on this semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL
    if there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keep retrying in the page allocator.

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
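
    An illustrative caller of the renamed flag (the function and its
    fallback are made up; only GFP_KERNEL | __GFP_RETRY_MAYFAIL comes from
    the commit): try hard for a large contiguous buffer, but accept failure
    and fall back instead of looping or triggering the OOM killer.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    static void *my_alloc_big_buffer(size_t size)
    {
            void *buf;

            /* retry hard, but allow the request to fail */
            buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_RETRY_MAYFAIL,
                                           get_order(size));
            if (buf)
                    return buf;

            /* user-defined fallback, e.g. a non-contiguous allocation */
            return vmalloc(size);
    }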
     

09 May, 2017

3 commits

  • The main goal of direct compaction is to form a high-order page for
    allocation, but it should also help against long-term fragmentation when
    possible.

    Most lower-than-pageblock-order compactions are for non-movable
    allocations, which means that if we compact in a movable pageblock and
    terminate as soon as we create the high-order page, it's unlikely that
    the fallback heuristics will claim the whole block. Instead there might
    be a single unmovable page in a pageblock full of movable pages, and the
    next unmovable allocation might pick another pageblock and increase
    long-term fragmentation.

    To help against such scenarios, this patch changes the termination
    criteria for compaction so that the current pageblock is finished even
    though the high-order page already exists. Note that it might be
    possible that the high-order page formed elsewhere in the zone due to
    parallel activity, but this patch doesn't try to detect that.

    This is only done with sync compaction, because async compaction is
    limited to pageblock of the same migratetype, where it cannot result in
    a migratetype fallback. (Async compaction also eagerly skips
    order-aligned blocks where isolation fails, which is against the goal of
    migrating away as much of the pageblock as possible.)

    As a result of this patch, long-term memory fragmentation should be
    reduced.

    In testing based on a 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 20%.

    Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Preparation patch. We are going to need migratetype at lower layers
    than compact_zone() and compact_finished().

    Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "try to reduce fragmenting fallbacks", v3.

    Last year, Johannes Weiner has reported a regression in page mobility
    grouping [1] and while the exact cause was not found, I've come up with
    some ways to improve it by reducing the number of allocations falling
    back to different migratetype and causing permanent fragmentation.

    The series was tested with mmtests stress-highalloc modified to do
    GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
    balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
    wasn't woken up) on UMA machine with 4GB memory. There were 5 repeats
    of each run, as the extfrag stats are quite volatile (note the stats
    below are sums, not averages, as it was less perl hacking for me).

    Success rate are the same, already high due to the low allocation order
    used, so I'm not including them.

    Compaction stats:
    (the patches are stacked, and I haven't measured the non-functional-changes
    patches separately)

                                 patch 1    patch 2    patch 3    patch 4    patch 7    patch 8
    Compaction stalls              22449      24680      24846      19765      22059      17480
    Compaction success             12971      14836      14608      10475      11632       8757
    Compaction failures             9477       9843      10238       9290      10426       8722
    Page migrate success         3109022    3370438    3312164    1695105    1608435    2111379
    Page migrate failure          911588    1149065    1028264    1112675    1077251    1026367
    Compaction pages isolated    7242983    8015530    7782467    4629063    4402787    5377665
    Compaction migrate scanned 980838938  987367943  957690188  917647238  947155598 1018922197
    Compaction free scanned    557926893  598946443  602236894  594024490  541169699  763651731
    Compaction cost                10243      10578      10304       8286       8398       9440

    Compaction stats are mostly within noise until patch 4, which decreases
    the number of compactions, and migrations. Part of that could be due to
    more pageblocks marked as unmovable, and async compaction skipping
    those. This changes a bit with patch 7, but not so much. Patch 8
    increases free scanner stats and migrations, which comes from the
    changed termination criteria. Interestingly number of compactions
    decreases - probably the fully compacted pageblock satisfies multiple
    subsequent allocations, so it amortizes.

    Next comes the extfrag tracepoint, where "fragmenting" means that an
    allocation had to fallback to a pageblock of another migratetype which
    wasn't fully free (which is almost all of the fallbacks). I have
    locally added another tracepoint for "Page steal" into
    steal_suitable_fallback() which triggers in situations where we are
    allowed to do move_freepages_block(). If we decide to also do
    set_pageblock_migratetype(), it's "Pages steal with pageblock" with
    break down for which allocation migratetype we are stealing and from
    which fallback migratetype. The last part "due to counting" comes from
    patch 4 and counts the events where the counting of movable pages
    allowed us to change pageblock's migratetype, while the number of free
    pages alone wouldn't be enough to cross the threshold.

                                                             patch 1   patch 2   patch 3   patch 4   patch 7   patch 8
    Page alloc extfrag event                                10155066   8522968  10164959  15622080  13727068  13140319
    Extfrag fragmenting                                     10149231   8517025  10159040  15616925  13721391  13134792
    Extfrag fragmenting for unmovable                         159504    168500    184177     97835     70625     56948
    Extfrag fragmenting unmovable placed with movable         153613    163549    172693     91740     64099     50917
    Extfrag fragmenting unmovable placed with reclaim.          5891      4951     11484      6095      6526      6031
    Extfrag fragmenting for reclaimable                         4738      4829      6345      4822      5640      5378
    Extfrag fragmenting reclaimable placed with movable         1836      1902      1851      1579      1739      1760
    Extfrag fragmenting reclaimable placed with unmov.          2902      2927      4494      3243      3901      3618
    Extfrag fragmenting for movable                           9984989   8343696   9968518  15514268  13645126  13072466
    Pages steal                                               179954    192291    210880    123254     94545     81486
    Pages steal with pageblock                                 22153     18943     20154     33562     29969     33444
    Pages steal with pageblock for unmovable                   14350     12858     13256     20660     19003     20852
    Pages steal with pageblock for unmovable from mov.         12812     11402     11683     19072     17467     19298
    Pages steal with pageblock for unmovable from recl.         1538      1456      1573      1588      1536      1554
    Pages steal with pageblock for movable                      7114      5489      5965     11787     10012     11493
    Pages steal with pageblock for movable from unmov.          6885      5291      5541     11179      9525     10885
    Pages steal with pageblock for movable from recl.            229       198       424       608       487       608
    Pages steal with pageblock for reclaimable                   689       596       933      1115       954      1099
    Pages steal with pageblock for reclaimable from unmov.       273       219       537       658       547       667
    Pages steal with pageblock for reclaimable from mov.         416       377       396       457       407       432
    Pages steal with pageblock due to counting                                               11834     10075      7530
    ... for unmovable                                                                         8993      7381      4616
    ... for movable                                                                           2792      2653      2851
    ... for reclaimable                                                                         49        41        63

    What we can see is that "Extfrag fragmenting for unmovable" and "...
    placed with movable" drops with almost each patch, which is good as we
    are polluting less movable pageblocks with unmovable pages.

    The most significant change is patch 4 with movable page counting. On
    the other hand it increases "Extfrag fragmenting for movable" by 50%.
    "Pages steal" drops though, so these movable allocation fallbacks find
    only small free pages and are not allowed to steal whole pageblocks
    back. "Pages steal with pageblock" raises, because the patch increases
    the chances of pageblock migratetype changes to happen. This affects
    all migratetypes.

    The summary is that patch 4 is not a clear win wrt these stats, but I
    believe that the tradeoff it makes is a good one. There's less
    pollution of movable pageblocks by unmovable allocations. There's less
    stealing between pageblocks, and the steals that remain have a higher
    chance of also changing the migratetype of the pageblock itself, so it
    should more faithfully reflect the migratetype of the pages within the
    pageblock.
    The increase of movable allocations falling back to unmovable pageblock
    might look dramatic, but those allocations can be migrated by compaction
    when needed, and other patches in the series (7-9) improve that aspect.

    Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
    also reduce the impact on movable fallbacks from patch 4.

    [1] https://www.spinics.net/lists/linux-mm/msg114237.html

    This patch (of 8):

    While there are currently (mostly by accident) no holes in struct
    compact_control (on x86_64), we are going to add more bool flags, so
    place them all together at the end of the structure. While at it, just
    order all fields from largest to smallest.

    Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

04 May, 2017

3 commits

  • Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().

    Simplify the code, no functional changes.

    [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
    Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
    Signed-off-by: Xishi Qiu
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
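
    Per the akpm note above, these are static inlines rather than macros;
    they most likely look along these lines (sketch, assuming an
    mm-internal header):

    #include <linux/mmzone.h>
    #include <linux/mm_types.h>

    static inline bool is_migrate_highatomic(enum migratetype migratetype)
    {
            return migratetype == MIGRATE_HIGHATOMIC;
    }

    static inline bool is_migrate_highatomic_page(struct page *page)
    {
            return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
    }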
     
  • NR_PAGES_SCANNED counts number of pages scanned since the last page free
    event in the allocator. This was used primarily to measure the
    reclaimability of zones and nodes, and determine when reclaim should
    give up on them. In that role, it has been replaced in the preceding
    patches by a different mechanism.

    Being implemented as an efficient vmstat counter, it was automatically
    exported to userspace as well. It's however unlikely that anyone
    outside the kernel is using this counter in any meaningful way.

    Remove the counter and the unused pgdat_reclaimable().

    Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".

    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage. We have seen similar cases at Facebook.

    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages. In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met. Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.

    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer. If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.

    Patch #1 fixes the immediate problem Jia reported; the remainder are
    smaller fixlets, cleanups, and an overall phasing out of the old method.

    Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.

    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.

    This patch (of 9):

    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:

    $ echo 4000 >/proc/sys/vm/nr_hugepages

    top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
    Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
    KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.

    Kswapd needs to back away from nodes that fail to balance. Up until
    commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism. It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them. This guard was erroneously
    removed as the patch changed the definition of a balanced node.

    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met. Kswapd would stay awake anyway.

    Introduce a new and much simpler way of backing off. If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node. This is the same number of shots
    direct reclaim takes before declaring OOM. Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.

    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
    Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
    Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Reported-by: Jia He
    Tested-by: Jia He
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
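
    A simplified sketch of the back-off mechanism described above (reduced
    from the actual kswapd changes; the failure counter sits in the node's
    pg_data_t):

    #include <linux/mmzone.h>

    #define MAX_RECLAIM_RETRIES 16      /* same limit direct reclaim uses before OOM */

    static void kswapd_account_run(pg_data_t *pgdat, unsigned long nr_reclaimed)
    {
            if (nr_reclaimed)
                    pgdat->kswapd_failures = 0;     /* progress: node is reclaimable */
            else
                    pgdat->kswapd_failures++;       /* another run with nothing freed */
    }

    static bool kswapd_should_back_off(pg_data_t *pgdat)
    {
            /* after 16 fruitless runs, sleep until a direct reclaimer
             * proves the node reclaimable again (it resets the counter) */
            return pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES;
    }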
     

08 Apr, 2017

1 commit

  • We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
    vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
    per cpu lru caches. This seems more than necessary because both can run
    on a single WQ. Both do not block on locks requiring a memory
    allocation nor perform any allocations themselves. We will save one
    rescuer thread this way.

    On the other hand, drain_all_pages() queues work on the system wq,
    which doesn't have a rescuer, and so it depends on memory allocation
    (when all workers are stuck allocating and new ones cannot be
    created).

    Initially we thought this would be more of a theoretical problem but
    Hugh Dickins has reported:

    : 4.11-rc has been giving me hangs after hours of swapping load. At
    : first they looked like memory leaks ("fork: Cannot allocate memory");
    : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
    : before looking at /proc/meminfo one time, and the stat_refresh stuck
    : in D state, waiting for completion of flush_work like many kworkers.
    : kthreadd waiting for completion of flush_work in drain_all_pages().

    This worker should be using WQ_RECLAIM as well in order to guarantee a
    forward progress. We can reuse the same one as for lru draining and
    vmstat.

    Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Tested-by: Yang Li
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
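
    A sketch of the consolidation (the workqueue name and call sites are
    illustrative; note the flag is spelled WQ_MEM_RECLAIM in the workqueue
    API):

    #include <linux/workqueue.h>
    #include <linux/errno.h>
    #include <linux/init.h>

    /* one rescuer-backed workqueue shared by vmstat updates, lru draining
     * and drain_all_pages(), instead of two dedicated queues plus the
     * system wq */
    static struct workqueue_struct *mm_percpu_wq;

    static int __init mm_percpu_wq_init(void)
    {
            mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
            return mm_percpu_wq ? 0 : -ENOMEM;
    }

    /* callers then queue their per-cpu work items here, e.g.:
     *   queue_work_on(cpu, mm_percpu_wq, work);
     */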
     

25 Feb, 2017

1 commit

    The current rmap code can miss a VMA that maps a PTE-mapped THP if the
    first subpage of the THP was unmapped from the VMA.

    We need to walk rmap for the whole range of offsets that the THP
    covers, not only the first one.

    vma_address() also needs to be corrected to check the range instead of
    the first subpage.

    Link: http://lkml.kernel.org/r/20170129173858.45174-6-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Feb, 2017

3 commits

    The logic for whether we can reap pages from a VMA should match what we
    have in madvise_dontneed(). In particular, we should skip VM_PFNMAP
    VMAs, but we don't now.

    Let's just extract the condition under which we can shoot down pages
    from a VMA with MADV_DONTNEED into a separate function and use it in
    both places.

    Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
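
    The extracted condition is most likely something like the following
    sketch (flag set as used by madvise_dontneed(); the helper name in the
    final patch may differ):

    #include <linux/mm.h>

    static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
    {
            /* skip locked, hugetlb and VM_PFNMAP mappings - the same VMAs
             * MADV_DONTNEED refuses to touch */
            return !(vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP));
    }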
     
  • A "compact_daemon_wake" vmstat exists that represents the number of
    times kcompactd has woken up. This doesn't represent how much work it
    actually did, though.

    It's useful to understand how much compaction work is being done by
    kcompactd versus other methods such as direct compaction and explicitly
    triggered per-node (or system) compaction.

    This adds two new vmstats: "compact_daemon_migrate_scanned" and
    "compact_daemon_free_scanned" to represent the number of pages kcompactd
    has scanned as part of its migration scanner and freeing scanner,
    respectively.

    These values are still accounted for in the general
    "compact_migrate_scanned" and "compact_free_scanned" for compatibility.

    It could be argued that explicitly triggered compaction could also be
    tracked separately, and that could be added if others find it useful.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • In __free_one_page() we do the buddy merging arithmetics on "page/buddy
    index", which is just the lower MAX_ORDER bits of pfn. The operations
    we do that affect the higher bits are bitwise AND and subtraction (in
    that order), where the final result will be the same with the higher
    bits left unmasked, as long as these bits are equal for both buddies -
    which must be true by the definition of a buddy.

    We can therefore use pfns directly instead of "index" and skip the
    zeroing of the >MAX_ORDER bits. This can help a bit by itself, although
    the compiler might be smart enough already. It also helps the next
    patch avoid page_to_pfn() for memory hole checks.

    Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
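
    A standalone illustration of the arithmetic: the buddy is found by
    flipping one pfn bit, and the AND step of the merge yields the start of
    the combined block while leaving the high bits alone, so no masking
    down to a MAX_ORDER-sized index is needed.

    #include <stdio.h>

    static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
    {
            return pfn ^ (1UL << order);    /* buddies differ in exactly one bit */
    }

    int main(void)
    {
            unsigned long pfn = 0x12345678; /* arbitrary example pfn */
            unsigned int order = 3;
            unsigned long buddy = find_buddy_pfn(pfn, order);
            unsigned long combined = buddy & pfn;   /* start of the order+1 block */

            printf("pfn %#lx, buddy %#lx, merged block starts at %#lx\n",
                   pfn, buddy, combined);
            return 0;
    }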
     

26 Dec, 2016

1 commit

  • Add a new page flag, PageWaiters, to indicate the page waitqueue has
    tasks waiting. This can be tested rather than testing waitqueue_active
    which requires another cacheline load.

    This bit is always set when the page has tasks on page_waitqueue(page),
    and is set and cleared under the waitqueue lock. It may be set when
    there are no tasks on the waitqueue, which will cause a harmless extra
    wakeup check that will clear the bit.

    The generic bit-waitqueue infrastructure is no longer used for pages.
    Instead, waitqueues are used directly with a custom key type. The
    generic code was not flexible enough to have PageWaiters manipulation
    under the waitqueue lock (which simplifies concurrency).

    This improves the performance of page lock intensive microbenchmarks by
    2-3%.

    Putting two bits in the same word opens the opportunity to remove the
    memory barrier between clearing the lock bit and testing the waiters
    bit, after some work on the arch primitives (e.g., ensuring memory
    operand widths match and cover both bits).

    Signed-off-by: Nicholas Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
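
    A rough illustration of the idea (hypothetical helper, not the kernel
    implementation): with the lock bit and the waiters bit in the same flags
    word, the unlock path only has to touch the waitqueue when the waiters
    bit was set.

        #include <stdatomic.h>

        #define PG_locked_bit   0UL     /* assumed bit positions for the sketch */
        #define PG_waiters_bit  1UL

        /* Hypothetical unlock fast path: atomically clear the lock bit and
         * read back the old flags; only fall into the (slower) waitqueue
         * wakeup when somebody actually registered as a waiter. */
        static void unlock_page_sketch(atomic_ulong *flags)
        {
                unsigned long old = atomic_fetch_and(flags, ~(1UL << PG_locked_bit));

                if (old & (1UL << PG_waiters_bit)) {
                        /* wake_up_page(): take the waitqueue lock, wake the
                         * waiters, clear the bit if the queue is empty. */
                }
        }

        int main(void)
        {
                atomic_ulong flags = (1UL << PG_locked_bit) | (1UL << PG_waiters_bit);

                unlock_page_sketch(&flags);
                return 0;
        }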
     

15 Dec, 2016

2 commits

  • Add an orig_pte field to the vm_fault structure to allow ->page_mkwrite
    handlers to fully handle the fault.

    This also saves some passing of extra arguments around.

    Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently we have two different structures for passing fault information
    around - struct vm_fault and struct fault_env. DAX will need more
    information in struct vm_fault to handle its faults, so the content of
    that structure would become even closer to fault_env. Furthermore, it
    would need to generate a struct fault_env to be able to call some of the
    generic functions. So at this point I don't think there's much use in
    keeping these two structures separate. Just embed into struct vm_fault
    all that is needed to use it for both purposes.

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
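
    An abridged sketch of the unified structure after this series (field list
    condensed from the two descriptions above; the authoritative definition
    lives in include/linux/mm.h):

        struct vm_fault {
                struct vm_area_struct *vma;     /* target VMA                         */
                unsigned int flags;             /* FAULT_FLAG_* flags                 */
                gfp_t gfp_mask;                 /* gfp mask for allocations           */
                pgoff_t pgoff;                  /* logical page offset in the mapping */
                unsigned long address;          /* faulting virtual address           */
                pmd_t *pmd;                     /* pmd entry covering 'address'       */
                pte_t *pte;                     /* mapped pte, if any                 */
                pte_t orig_pte;                 /* pte value at fault time, so
                                                   ->page_mkwrite can fully handle
                                                   the fault (previous entry)         */
                struct page *cow_page;          /* page used for COW faults           */
                struct page *page;              /* page set by the fault handler      */
                /* ... plus locking / preallocation state (ptl, prealloc_pte, ...)    */
        };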
     

08 Oct, 2016

2 commits

  • Several people have reported premature OOMs for order-2 allocations
    (stack) due to OOM rework in 4.7. In the scenario (parallel kernel
    build and dd writing to two drives) many pageblocks get marked as
    Unmovable and compaction free scanner struggles to isolate free pages.
    Joonsoo Kim pointed out that the free scanner skips pageblocks that are
    not movable to prevent filling them and forcing non-movable allocations
    to fall back to other pageblocks. Such a heuristic makes sense to help
    prevent long-term fragmentation, but premature OOMs are the more urgent
    problem. As a compromise, this patch disables the heuristic only for the
    ultimate compaction priority.

    Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
    Reported-by: Ralf-Peter Rohbeck
    Reported-by: Arkadiusz Miskiewicz
    Reported-by: Olaf Hering
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
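
    A hypothetical sketch of the compromise (names invented for illustration;
    only the behaviour follows the description above, and MIGRATE_MOVABLE is
    the kernel's movable pageblock type): the "movable pageblocks only"
    restriction on the free scanner is kept for every priority except the
    ultimate, last-resort one.

        /* Would the free scanner be allowed to take pages from this pageblock? */
        static bool free_scanner_block_ok(int migratetype, bool ultimate_priority)
        {
                if (ultimate_priority)
                        return true;    /* avoiding a premature OOM matters more */

                /* normal case: keep unmovable pageblocks out of the free scanner
                 * so non-movable allocations are not forced to fall back later  */
                return migratetype == MIGRATE_MOVABLE;
        }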
     
  • Patch series "make direct compaction more deterministic".

    This is mostly a followup to Michal's oom detection rework, which
    highlighted the need for direct compaction to provide better feedback in
    the reclaim/compaction loop, so that it can reliably recognize when
    compaction cannot make further progress, and the allocation should invoke
    the OOM killer or fail. We've discussed this at LSF/MM [1] where I
    proposed expanding the async/sync migration mode used in compaction to
    more general "priorities". This patchset adds one new priority that just
    overrides all the heuristics and makes compaction fully scan all zones.
    I don't currently think that we need more fine-grained priorities, but
    we'll see. Other than that there are some smaller fixes and cleanups,
    mainly related to the THP-specific hacks.

    I've tested this with stress-highalloc in GFP_KERNEL order-4 and
    THP-like order-9 scenarios. There's some improvement for compaction
    stats for the order-4, which is likely due to the better watermarks
    handling. In the previous version I reported mostly noise wrt compaction
    stats and decreased direct reclaim - now reclaim shows no difference. I
    believe this is due to the less aggressive compaction priority increase
    in patch 6.

    "before" is a mmotm tree prior to 4.7 release plus the first part of the
    series that was sent and merged separately

                                    before      after
    order-4:
    Compaction stalls                27216      30759
    Compaction success               19598      25475
    Compaction failures               7617       5283
    Page migrate success            370510     464919
    Page migrate failure             25712      27987
    Compaction pages isolated       849601    1041581
    Compaction migrate scanned   143146541  101084990
    Compaction free scanned      208355124  144863510
    Compaction cost                   1403       1210

    order-9:
    Compaction stalls                 7311       7401
    Compaction success                1634       1683
    Compaction failures               5677       5718
    Page migrate success            194657     183988
    Page migrate failure              4753       4170
    Compaction pages isolated       498790     456130
    Compaction migrate scanned      565371     524174
    Compaction free scanned        4230296    4250744
    Compaction cost                    215        203

    [1] https://lwn.net/Articles/684611/

    This patch (of 11):

    A recent patch has added a whole_zone flag that compaction sets when
    scanning starts from the zone boundary, in order to report that the zone
    has been fully scanned in one attempt. For allocations that want to try
    really hard or cannot fail, we will want to introduce a mode where
    scanning the whole zone is guaranteed regardless of the cached positions.

    This patch reuses the whole_zone flag so that, if it is already passed as
    true to compaction, the cached scanner positions are ignored. Employing
    this flag during the reclaim/compaction loop will be done in the next
    patch. This patch, however, converts compaction invoked from userspace
    via procfs to use this flag. Before this patch, the cached positions were
    first reset to the zone boundaries and then read back from struct zone,
    so there was a window where a parallel compaction could replace the reset
    values, making the manual compaction less effective. Using the flag
    instead of performing the reset is more robust.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
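
    A rough sketch of how the flag changes scanner initialisation (a fragment
    with field names simplified; pageblock alignment and other details are
    omitted):

        /* compact_zone() start-up, illustrative only */
        if (cc->whole_zone) {
                /* guaranteed full scan: start both scanners at the zone edges
                 * and ignore whatever is cached in struct zone               */
                cc->migrate_pfn = zone->zone_start_pfn;
                cc->free_pfn = zone_end_pfn(zone) - 1;
        } else {
                /* normal case: resume from the cached scanner positions      */
                cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
                cc->free_pfn = zone->compact_cached_free_pfn;
        }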
     

29 Jul, 2016

4 commits

  • Async compaction detects contention either via a failed trylock on
    zone->lock or lru_lock, or via need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish lock
    and sched contention, and simplify the current convoluted code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - With the new defaults, THP page faults no longer do reclaim/compaction
      at all, unless the system admin has overridden the default or the
      application has indicated via madvise that it can benefit from THPs.
      In both cases, the potential extra latency is expected and worth the
      benefits.
    - Even if reclaim/compaction proceeds after this patch where it previously
      wouldn't, the second compaction attempt is still async and will detect
      the contention and back off if the contention persists.
    - There are still heuristics such as deferred compaction and pageblock
      skip bits in place that prevent excessive THP page fault latencies.

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
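
    A minimal sketch of the contention test that remains after the
    simplification (the helper name is hypothetical; the two sources of
    contention are the ones named above):

        /* Async compaction backs off on either kind of contention; callers no
         * longer need to know whether it was the lock or the scheduler.      */
        static bool async_compaction_contended(spinlock_t *lock)
        {
                return need_resched() || spin_is_contended(lock);
        }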
     
  • The fair zone allocation policy interleaves allocation requests between
    zones to avoid an age inversion problem whereby new pages are reclaimed
    to balance a zone. Reclaim is now node-based so this should no longer
    be an issue and the fair zone allocation policy is not free. This patch
    removes it.

    Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective, and behaviour
    will be similar unless the node-local allocation requests are also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to the reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and the
    node level. Most reclaim logic is based on the node counters, but the
    retry logic uses the zone counters, which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis, but that is a heavier calculation across multiple cache
    lines and happens much more frequently than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed in
    later patches, but keeping this patch mechanical makes it easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives and splices the highmem pages back onto the LRU. It could
    potentially be implemented similarly to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case being that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linearly scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
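
    An abridged illustration of what moved (struct layout condensed; see
    include/linux/mmzone.h for the real definitions): the LRU lists and the
    lock protecting them now hang off the node instead of each zone.

        struct lruvec {
                struct list_head lists[NR_LRU_LISTS];   /* anon/file x active/
                                                           inactive, unevictable */
                /* ... */
        };

        typedef struct pglist_data {
                /* ... */
                spinlock_t lru_lock;            /* was zone->lru_lock */
                struct lruvec lruvec;           /* was zone->lruvec   */
                /* ... */
        } pg_data_t;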
     

27 Jul, 2016

3 commits

  • The idea is borrowed from Peter's patch in the patchset on speculative
    page faults [1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical, with the exception of the faultaround
    code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
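
    The shape of the change, as an illustrative before/after (function names
    and the exact argument list are hypothetical, condensed from typical
    fault-path helpers):

        /* before: context threaded through ever-growing argument lists        */
        int handle_fault_before(struct mm_struct *mm, struct vm_area_struct *vma,
                                unsigned long address, pmd_t *pmd,
                                unsigned int flags, pte_t orig_pte);

        /* after: one context structure; adding state no longer means touching
         * every signature on the call chain                                   */
        int handle_fault_after(struct fault_env *fe);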
     
  • This patch makes khugepaged do swapin readahead to improve the THP
    collapse rate. When khugepaged scans pages, some of them can be in the
    swap area.

    With the patch, khugepaged can collapse 4kB pages into a THP when there
    are up to max_ptes_swap swap PTEs in a 2MB range.

    The patch was tested with a test program that allocates 400B of memory,
    writes to it, and then sleeps. I force the system to swap it all out.
    Afterwards, the test program touches the area by writing to it, skipping
    one page in every 20 pages of the area.

    Without the patch, the system did not do swapin readahead. The THP rate
    was 65% of the program's memory and did not change over time.

    With this patch, after 10 minutes of waiting, khugepaged had collapsed
    99% of the program's memory.

    [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
    [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
    [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
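
    A rough reconstruction of the test described above (sizes and timings are
    placeholders; the original program is not part of the patch):

        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        #define AREA  (1UL << 30)        /* placeholder size                    */
        #define PAGE  4096UL

        int main(void)
        {
                char *buf = malloc(AREA);
                unsigned long i;

                if (!buf)
                        return 1;
                memset(buf, 1, AREA);    /* dirty the whole area                */
                sleep(600);              /* swap-out is forced externally here  */

                /* touch the area again, skipping one page in every 20, and let
                 * khugepaged (subject to max_ptes_swap per 2MB range) try to
                 * collapse it back into THPs                                   */
                for (i = 0; i < AREA / PAGE; i++)
                        if (i % 20)
                                buf[i * PAGE] = 2;

                pause();                 /* keep the mapping alive              */
                return 0;
        }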
     
  • This patch is motivated from Hugh and Vlastimil's concern [1].

    There are two ways to get a free page from the allocator. One is using the
    normal memory allocation API and the other is __isolate_free_page(), which
    is used internally for compaction and pageblock isolation. The latter
    usage is rather tricky since it does not do the whole post-allocation
    processing done by the normal API.

    One problematic thing I already know is that a poisoned page would not be
    checked if it is allocated by __isolate_free_page(). Perhaps there are
    more.

    We could add more debug logic for allocated pages in the future, and this
    separation would cause more problems. I'd like to fix this situation now.
    The solution is simple: this patch commonizes the logic for newly
    allocated pages and uses it at all sites, which solves the problem.

    [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E

    [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
    Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
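
    The shared post-allocation path reads roughly like the sketch below
    (condensed from the description; the helper name is illustrative and the
    hook list is approximate):

        /* Common processing for a freshly allocated page, applied both to pages
         * from the normal allocator and to pages from __isolate_free_page().   */
        static void post_alloc_processing(struct page *page, unsigned int order,
                                          gfp_t gfp_flags)
        {
                set_page_private(page, 0);
                set_page_refcounted(page);

                arch_alloc_page(page, order);
                kernel_map_pages(page, 1 << order, 1);    /* DEBUG_PAGEALLOC     */
                kasan_alloc_pages(page, order);           /* KASAN bookkeeping   */
                kernel_poison_pages(page, 1 << order, 1); /* poisoning checks    */
                set_page_owner(page, order, gfp_flags);   /* page_owner tracking */
        }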
     

25 Jun, 2016

1 commit

  • Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable
    to sleep, unwilling to sleep and avoiding waking kswapd") modified
    __GFP_WAIT to explicitly identify the difference between atomic callers
    and those that were unwilling to sleep. Later the definition was
    removed entirely.

    The GFP_RECLAIM_MASK is the set of flags that affect watermark checking
    and reclaim behaviour, but __GFP_ATOMIC was never added to it. Without it,
    atomic users of the slab allocator strip the __GFP_ATOMIC flag and
    cannot access the page allocator atomic reserves. This patch addresses
    the problem.

    The user-visible impact depends on the workload, but without this patch
    atomic allocations may unnecessarily fail.

    Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Marcin Wojtas
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
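
    The shape of the fix in mm/internal.h, abridged (the flag list is
    reproduced from memory of that era's tree, so treat it as approximate; the
    point is simply that __GFP_ATOMIC joins the mask):

        #define GFP_RECLAIM_MASK (__GFP_RECLAIM | __GFP_HIGH | __GFP_IO | __GFP_FS | \
                                  __GFP_NOWARN | __GFP_REPEAT | __GFP_NOFAIL |       \
                                  __GFP_NORETRY | __GFP_MEMALLOC |                   \
                                  __GFP_NOMEMALLOC | __GFP_ATOMIC)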