30 Jul, 2016

1 commit

  • Pull fuse updates from Miklos Szeredi:
    "This fixes error propagation from writeback to fsync/close for
    writeback cache mode as well as adding a missing capability flag to
    the INIT message. The rest are cleanups.

    (The commits are recent but all the code actually sat in -next for a
    while now. The recommits are due to conflict avoidance and the
    addition of Cc: stable@...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: use filemap_check_errors()
    mm: export filemap_check_errors() to modules
    fuse: fix wrong assignment of ->flags in fuse_send_init()
    fuse: fuse_flush must check mapping->flags for errors
    fuse: fsync() did not return IO errors
    fuse: don't mess with blocking signals
    new helper: wait_event_killable_exclusive()
    fuse: improve aio directIO write performance for size extending writes

    Linus Torvalds
     

29 Jul, 2016

39 commits

  • Can be used by fuse, btrfs and f2fs to replace open-coded variants.
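
    For reference, this is roughly what the now-exported helper does
    (reproduced from memory, so details may differ slightly): it reports and
    clears the write-error bits on mapping->flags that callers used to test
    by hand.

        int filemap_check_errors(struct address_space *mapping)
        {
                int ret = 0;

                /* report and clear any writeback error recorded on the mapping */
                if (test_bit(AS_ENOSPC, &mapping->flags) &&
                    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                        ret = -ENOSPC;
                if (test_bit(AS_EIO, &mapping->flags) &&
                    test_and_clear_bit(AS_EIO, &mapping->flags))
                        ret = -EIO;
                return ret;
        }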

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Merge more updates from Andrew Morton:
    "The rest of MM"

    * emailed patches from Andrew Morton: (101 commits)
    mm, compaction: simplify contended compaction handling
    mm, compaction: introduce direct compaction priority
    mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
    mm, page_alloc: make THP-specific decisions more generic
    mm, page_alloc: restructure direct compaction handling in slowpath
    mm, page_alloc: don't retry initial attempt in slowpath
    mm, page_alloc: set alloc_flags only once in slowpath
    lib/stackdepot.c: use __GFP_NOWARN for stack allocations
    mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
    mm, kasan: account for object redzone in SLUB's nearest_obj()
    mm: fix use-after-free if memory allocation failed in vma_adjust()
    zsmalloc: Delete an unnecessary check before the function call "iput"
    mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
    mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
    mm: optimize copy_page_to/from_iter_iovec
    mm: add cond_resched() to generic_swapfile_activate()
    Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
    mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
    mm: hwpoison: remove incorrect comments
    make __section_nr() more efficient
    ...

    Linus Torvalds
     
  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish lock
    and sched contention, and simplify the current convoluted code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction at
      all, unless the system admin has overridden the default, or the application
      has indicated via madvise that it can benefit from THPs. In both cases, it
      means that the potential extra latency is expected and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it previously
      wouldn't, the second compaction attempt is still async and will detect the
      contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip bits
      in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In the context of direct compaction, for some types of allocations we
    would like the compaction to either succeed or definitely fail while
    trying as hard as possible. Current async/sync_light migration mode is
    insufficient, as there are heuristics such as caching scanner positions,
    marking pageblocks as unsuitable or deferring compaction for a zone. At
    least the final compaction attempt should be able to override these
    heuristics.

    To communicate how hard compaction should try, we replace migration mode
    with a new enum compact_priority and change the relevant function
    signatures. In compact_zone_order() where struct compact_control is
    constructed, the priority is mapped to suitable control flags. This
    patch itself has no functional change, as the current priority levels
    are mapped back to the same migration modes as before. Expanding them
    will be done next.

    Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
    removed, as the only caller exists under CONFIG_COMPACTION.
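
    As a rough sketch of the new priority enum and how it maps back to the
    existing migrate modes (the enumerator names here are an approximation,
    not a quote from the patch):

        enum compact_priority {
                COMPACT_PRIO_SYNC_LIGHT,
                MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
                DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
                COMPACT_PRIO_ASYNC,
                INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
        };

        /* in compact_zone_order(), the priority maps back to a migrate mode */
        .mode = prio == COMPACT_PRIO_ASYNC ?
                        MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,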

    Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
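
    A sketch of what the two flag definitions end up looking like (recalled
    from gfp.h; treat the exact flag composition as approximate):

        #define GFP_TRANSHUGE_LIGHT     ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                          __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
        #define GFP_TRANSHUGE           (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)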

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since THP allocations during page faults can be costly, extra decisions
    are employed for them to avoid excessive reclaim and compaction, if the
    initial compaction doesn't look promising. The detection has never been
    perfect as there is no gfp flag specific to THP allocations. At this
    moment it checks the whole combination of flags that makes up
    GFP_TRANSHUGE, and hopes that no other users of such combination exist,
    or would mind being treated the same way. Extra care is also taken to
    separate allocations from khugepaged, where latency doesn't matter that
    much.

    It is however possible to distinguish these allocations in a simpler and
    more reliable way. The key observation is that after the initial
    compaction followed by the first iteration of "standard"
    reclaim/compaction, both __GFP_NORETRY allocations and costly
    allocations without __GFP_REPEAT are declared as failures:

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_REPEAT
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                goto nopage;

    This means we can further distinguish allocations that are costly order
    *and* additionally include the __GFP_NORETRY flag. As it happens,
    GFP_TRANSHUGE allocations do already fall into this category. This will
    also allow other costly allocations with similar high-order benefit vs
    latency considerations to use this semantic. Furthermore, we can
    distinguish THP allocations that should try a bit harder (such as from
    khugepaged) by removing __GFP_NORETRY, as will be done in the next
    patch.

    Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The retry loop in __alloc_pages_slowpath is supposed to keep trying
    reclaim and compaction (and OOM), until either the allocation succeeds,
    or returns with failure. Success here is more probable when reclaim
    precedes compaction, as certain watermarks have to be met for compaction
    to even try, and more free pages increase the probability of compaction
    success. On the other hand, starting with light async compaction (if
    the watermarks allow it), can be more efficient, especially for smaller
    orders, if there's enough free memory which is just fragmented.

    Thus, the current code starts with compaction before reclaim, and to
    make sure that the last reclaim is always followed by a final
    compaction, there's another direct compaction call at the end of the
    loop. This makes the code hard to follow and adds some duplicated
    handling of migration_mode decisions. It's also somewhat inefficient
    that even if reclaim or compaction decides not to retry, the final
    compaction is still attempted. Some gfp flag combinations also shortcut
    these retry decisions by "goto noretry;", making it even harder to
    follow.

    This patch attempts to restructure the code with only minimal functional
    changes. The call to the first compaction and THP-specific checks are
    now placed above the retry loop, and the "noretry" direct compaction is
    removed.

    The initial compaction is additionally restricted only to costly orders,
    as we can expect smaller orders to be held back by watermarks, and only
    larger orders to suffer primarily from fragmentation. This better
    matches the checks in reclaim's shrink_zones().

    There are two other smaller functional changes. One is that the upgrade
    from async migration to light sync migration will always occur after the
    initial compaction. This is how it has been until recent patch "mm,
    oom: protect !costly allocations some more", which introduced upgrading
    the mode based on COMPACT_COMPLETE result, but kept the final compaction
    always upgraded, which made it even more special. It's better to return
    to the simpler handling for now, as migration modes will be further
    modified later in the series.

    The second change is that once both reclaim and compaction declare it's
    not worth to retry the reclaim/compact loop, there is no final
    compaction attempt. As argued above, this is intentional. If that
    final compaction were to succeed, it would be due to a wrong retry
    decision, or simply a race with somebody else freeing memory for us.

    The main outcome of this patch should be simpler code. Logically, the
    initial compaction without reclaim is the exceptional case to the
    reclaim/compaction scheme, but prior to the patch, it was the last loop
    iteration that was exceptional. Now the code matches the logic better.
    The change also enables the following patches.

    Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
    kswapd, it first tries get_page_from_freelist() with the new
    alloc_flags, as it may succeed e.g. due to using min watermark instead
    of low watermark. It makes sense to do this attempt before adjusting the
    zonelist based on alloc_flags/gfp_mask, as it's still a relatively fast
    path if we just wake up kswapd and successfully allocate.

    This patch therefore moves the initial attempt above the retry label and
    slightly reorganizes the part below the retry label. We still have to
    attempt get_page_from_freelist() on each retry, as some allocations
    cannot do that as part of direct reclaim or compaction, and yet are not
    allowed to fail (even though they do a WARN_ON_ONCE() and thus should
    not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
    and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
    it. As a side-effect, the attempts from direct reclaim/compaction will
    also no longer obey watermarks once this is set, but there's little harm
    in that.

    Kswapd wakeups are also done on each retry to be safe from potential
    races resulting in kswapd going to sleep while a process (that may not
    be able to reclaim by itself) is still looping.

    Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
    initialized, so move the initialization above the retry: label. Also
    make the comment above the initialization more descriptive.

    The only exception to alloc_flags being constant is
    ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
    allocating thread. We can fix this, and make the code simpler and a bit
    more effective at the same time, by moving the part that determines
    ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().

    This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
    places in __alloc_pages_slowpath() anymore. The only two tests for the
    flag can instead call gfp_pfmemalloc_allowed().
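
    An approximate sketch of the moved ALLOC_NO_WATERMARKS logic, now
    answering a simple yes/no question (the precise conditions are from
    memory and may differ from the actual patch):

        bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
        {
                if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
                        return false;
                if (gfp_mask & __GFP_MEMALLOC)
                        return true;
                if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                        return true;
                /* TIF_MEMDIE is why the answer can change during an allocation */
                if (!in_interrupt() &&
                    ((current->flags & PF_MEMALLOC) ||
                     unlikely(test_thread_flag(TIF_MEMDIE))))
                        return true;

                return false;
        }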

    Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • For KASAN builds:
    - switch SLUB allocator to using stackdepot instead of storing the
    allocation/deallocation stacks in the objects;
    - change the freelist hook so that parts of the freelist can be put
    into the quarantine.

    [aryabinin@virtuozzo.com: fixes]
    Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • When looking up the nearest SLUB object for a given address, correctly
    calculate its offset if SLAB_RED_ZONE is enabled for that cache.

    Previously, when KASAN had detected an error on an object from a cache
    with SLAB_RED_ZONE set, the actual start address of the object was
    miscalculated, which led to random stacks having been reported.

    Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support")
    Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • There's one case when vma_adjust() expands the vma, overlapping with
    *two* next vmas. See case 6 of mprotect, described in the comment to
    vma_merge().

    To handle this (and only this) situation we iterate twice over the main
    part of the function. See "goto again".

    Vegard reported[1] that he sees an out-of-bounds access complaint from
    KASAN if anon_vma_clone() on the *second* iteration fails.

    This happens because we free 'next' vma by the end of first iteration
    and don't have a way to undo this if anon_vma_clone() fails on the
    second iteration.

    The solution is to do all required allocations upfront, before we touch
    vmas.

    The allocation on the second iteration is only required if the first two
    vmas don't have an anon_vma, but the third does. So we need, in total, one
    anon_vma_clone() call.

    It's easy to adjust 'exporter' to the third vma for that case.

    [1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com

    Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • iput() tests whether its argument is NULL and then returns immediately.
    Thus the test around the call is not needed.
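
    In other words, the Coccinelle-detected pattern boils down to the
    following (the zsmalloc field name is illustrative, not quoted from the
    patch):

        /* before: the NULL check is redundant */
        if (pool->inode)
                iput(pool->inode);

        /* after: iput() already returns immediately for a NULL inode */
        iput(pool->inode);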

    This issue was detected by using the Coccinelle software.

    Link: http://lkml.kernel.org/r/559cf499-4a01-25f9-c87f-24d906626a57@users.sourceforge.net
    Signed-off-by: Markus Elfring
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • Fix a region index adjustment error when the type_b parameter of
    __next_mem_range_rev() is NULL.

    Signed-off-by: zijun_hu
    Cc: Alexander Kuleshov
    Cc: Ard Biesheuvel
    Cc: Tang Chen
    Cc: Wei Yang
    Cc: Tang Chen
    Cc: Richard Leitner
    Cc: David Gibson
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • If we offline a node, allocate the new page from a nearest neighbor node
    instead of the current node or other remote nodes, because re-migration is
    a waste of time and the distance to the remote nodes is often very
    large.

    Also use GFP_HIGHUSER_MOVABLE to allocate the new page if the zone is a
    movable or highmem zone.

    Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • generic_swapfile_activate() can take quite a long time, as it iterates
    over all blocks of a file, so add cond_resched() to it. I observed stalls
    of about 1 second when activating a swapfile that was almost unfragmented;
    this patch fixes it.
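
    A minimal sketch of the pattern; the loop condition is from memory and
    the body is elided:

        /* in generic_swapfile_activate(), iterating over the file's blocks */
        while (probe_block + blocks_per_page <= last_block) {
                cond_resched();         /* yield periodically on huge swapfiles */
                /* ... map the next run of blocks as before ... */
        }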

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Alexander Viro
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • This reverts commit f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
    if there are free elements").

    There has been a report about OOM killer invoked when swapping out to a
    dm-crypt device. The primary reason seems to be that the swap-out IO
    managed to completely deplete memory reserves. Ondrej was able to
    bisect and explained the issue by pointing to f9054c70d28b ("mm,
    mempool: only set __GFP_NOMEMALLOC if there are free elements").

    The reason is that the swapout path is not throttled properly because
    the md-raid layer needs to allocate from the generic_make_request path
    which means it allocates from the PF_MEMALLOC context. dm layer uses
    mempool_alloc in order to guarantee forward progress, which used to
    inhibit access to memory reserves when using the page allocator. This has
    changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
    there are free elements") which has dropped the __GFP_NOMEMALLOC
    protection when the memory pool is depleted.

    If we are running out of memory and the only way forward to free memory
    is to perform swapout we just keep consuming memory reserves rather than
    throttling the mempool allocations and allowing the pending IO to
    complete up to a moment when the memory is depleted completely and there
    is no way forward but invoking the OOM killer. This is less than
    optimal.

    The original intention of f9054c70d28b was to help with the OOM
    situations where the oom victim depends on mempool allocation to make
    forward progress. David has mentioned the following backtrace:

    schedule
    schedule_timeout
    io_schedule_timeout
    mempool_alloc
    __split_and_process_bio
    dm_request
    generic_make_request
    submit_bio
    mpage_readpages
    ext4_readpages
    __do_page_cache_readahead
    ra_submit
    filemap_fault
    handle_mm_fault
    __do_page_fault
    do_page_fault
    page_fault

    We do not know more about why the mempool is depleted without being
    replenished in time, though. In any case the dm layer shouldn't depend
    on any allocations outside of the dedicated pools so a forward progress
    should be guaranteed. If this is not the case then the dm should be
    fixed rather than papering over the problem and postponing it to later
    by accessing more memory reserves.

    mempools are a mechanism to maintain dedicated memory reserves to
    guarantee forward progress. Allowing them unbounded access to the
    page allocator memory reserves is going against the whole purpose of
    this mechanism.

    Bisected by Ondrej Kozina.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Ondrej Kozina
    Reviewed-by: Johannes Weiner
    Acked-by: NeilBrown
    Cc: David Rientjes
    Cc: Mikulas Patocka
    Cc: Ondrej Kozina
    Cc: Tetsuo Handa
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
    isolate a PageWriteback page, which __unmap_and_move() then rejects with
    -EBUSY: of course the writeback might complete in between, but that's
    not what we usually expect, so probably better not to isolate it.
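
    A sketch of the kind of change, from memory (the exact expression may
    differ): compaction's isolate mode now asks __isolate_lru_page() to skip
    PageWriteback pages for MIGRATE_SYNC_LIGHT as well, not only for
    MIGRATE_ASYNC:

        /* in isolate_migratepages(); was: cc->mode == MIGRATE_ASYNC */
        const isolate_mode_t isolate_mode =
                (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
                (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);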

    When tested by stress-highalloc from mmtests, this has reduced the
    number of page migrate failures by 60-70%.

    Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
    Signed-off-by: Hugh Dickins
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • dequeue_hwpoisoned_huge_page() can be called without the page lock held,
    so let's remove the incorrect comment.

    The reason why the page lock is not really needed is that
    dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
    hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that
    is just allocated but not yet linked to the active list, even without
    taking the page lock.

    Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Zhan Chen
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
    section number with a subtraction directly.
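
    A sketch of the non-EXTREME variant (mem_section is a flat, statically
    sized array in that configuration; the exact form here is an
    approximation):

        #ifndef CONFIG_SPARSEMEM_EXTREME
        int __section_nr(struct mem_section *ms)
        {
                /* contiguous array: the section number is plain pointer arithmetic */
                return (int)(ms - mem_section[0]);
        }
        #endif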

    Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
    Signed-off-by: Zhou Chengming
    Cc: Dave Hansen
    Cc: Tejun Heo
    Cc: Hanjun Guo
    Cc: Li Bin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhou Chengming
     
  • If the user tries to disable automatic scanning early in the boot
    process using e.g.:

    echo scan=off > /sys/kernel/debug/kmemleak

    then this command will hang until SECS_FIRST_SCAN (= 60) seconds have
    elapsed, even though the system is fully initialised.

    We can fix this using interruptible sleep and checking if we're supposed
    to stop whenever we wake up (like the rest of the code does).
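
    Roughly, the fixed first-scan delay becomes an interruptible, stop-aware
    wait instead of an unconditional sleep (sketched from memory):

        /* in kmemleak_scan_thread(), replacing ssleep(SECS_FIRST_SCAN) */
        signed long timeout = msecs_to_jiffies(SECS_FIRST_SCAN * 1000);

        while (timeout && !kthread_should_stop())
                timeout = schedule_timeout_interruptible(timeout);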

    Link: http://lkml.kernel.org/r/1468835005-2873-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • In some cases, memblock is queried by kernel to determine whether a
    specified address is RAM or not. For example, the ACPI core needs this
    information to determine which attributes to use when mapping ACPI
    regions (acpi_os_ioremap). Use of incorrect memory types can result in
    faults, data corruption, or other issues.

    Removing memory with memblock_enforce_memory_limit() throws away this
    information, and so a kernel booted with 'mem=' may suffer from the
    issues described above. To avoid this, we need to keep those NOMAP
    regions instead of removing all above the limit, which preserves the
    information we need while preventing other use of those regions.

    This patch adds new infrastructure to retain all NOMAP memblock regions
    while removing others, to cater for this.

    Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com
    Signed-off-by: Dennis Chen
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Ard Biesheuvel
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Tony Luck
    Cc: Ingo Molnar
    Cc: Rafael J. Wysocki
    Cc: Will Deacon
    Cc: Mark Rutland
    Cc: Matt Fleming
    Cc: Kaly Xin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dennis Chen
     
  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move it into account_kernel_stack().

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
    This only makes sense if each kernel stack exists entirely in one zone,
    and allowing vmapped stacks could break this assumption.

    Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
    allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
    architectures. Keep it simple and use KiB.
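
    A minimal sketch of the accounting under the new unit; the parameter
    names and the NR_KERNEL_STACK_KB stat name are my reading of the
    description rather than quotes from the patch:

        /* in account_kernel_stack(): charge KiB instead of counting stacks */
        mod_zone_page_state(page_zone(virt_to_page(stack)), NR_KERNEL_STACK_KB,
                            THREAD_SIZE / 1024 * account);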

    Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
    CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
    However, now that the ZONE_DMA conflict has been eliminated it no longer
    makes sense to require CONFIG_EXPERT.

    Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Eric Sandeen
    Reported-by: Jeff Moyer
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • asm-generic headers are generic implementations for architecture
    specific code and should not be included by common code. Thus use the
    asm/ version of sections.h to get at the linker sections.
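
    Concretely, includes of this form are switched over; common code still
    ends up with the generic definitions, just via the arch wrapper:

        /* before */
        #include <asm-generic/sections.h>

        /* after */
        #include <asm/sections.h>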

    Link: http://lkml.kernel.org/r/1468285103-7470-1-git-send-email-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The return value of madvise_free_huge_pmd() was not clearly defined
    before. Following Minchan Kim's suggestion, change the return type to bool
    and return true if MADV_FREE is applied successfully to the entire pmd
    page, otherwise return false. Comments are added too.
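
    The resulting prototype, as I read the description (treat the exact
    parameter list as an approximation):

        /*
         * Returns true if MADV_FREE was applied to the entire PMD-mapped huge
         * page, false if the caller must fall back to per-pte handling.
         */
        bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
                                   pmd_t *pmd, unsigned long addr, unsigned long next);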

    Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Use the ClearPagePrivate/ClearPagePrivate2 helpers to clear
    PG_private/PG_private_2 in page->flags.
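
    In other words (the open-coded "before" form shown here is illustrative):

        /* before: clearing the bits by hand */
        clear_bit(PG_private, &page->flags);
        clear_bit(PG_private_2, &page->flags);

        /* after: use the page-flag helpers */
        ClearPagePrivate(page);
        ClearPagePrivate2(page);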

    Link: http://lkml.kernel.org/r/1467882338-4300-7-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Add __init/__exit attributes to functions that are only called at module
    init/exit time, to save memory.

    Link: http://lkml.kernel.org/r/1467882338-4300-6-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Cc: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Some minor comment changes:

    1) update the zs_malloc() and zs_create_pool() function headers
    2) update "Usage of struct page fields"

    Link: http://lkml.kernel.org/r/1467882338-4300-5-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Currently, if a class cannot be merged, the max objects of a zspage in
    that class may be calculated twice.

    This patch calculates the max objects of a zspage at the beginning and
    passes the value to can_merge() to decide whether the class can be merged.

    This patch also removes the function get_maxobj_per_zspage(), as there is
    no other place that calls it.

    Link: http://lkml.kernel.org/r/1467882338-4300-4-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • The max number of objects in a zspage is now stored in each size_class,
    so there is no need to re-calculate it.

    Link: http://lkml.kernel.org/r/1467882338-4300-3-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • The obj index value should be updated after returning from
    find_alloced_obj() to avoid burning CPU on unnecessary object
    scanning.

    Link: http://lkml.kernel.org/r/1467882338-4300-2-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • This is a cleanup patch. Change "index" to "obj_index" to keep it
    consistent with other names in zsmalloc.

    Link: http://lkml.kernel.org/r/1467882338-4300-1-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • With node-lru, if there are enough reclaimable pages in highmem but
    nothing in lowmem, the VM can try to shrink the inactive list even though
    the requested zone is lowmem.

    The problem is that if the inactive list is full of highmem pages then a
    direct reclaimer searching for a lowmem page wastes CPU scanning
    uselessly. It just burns CPU. Worse, many direct reclaimers are stalled
    by too_many_isolated if lots of parallel reclaimers are running, even
    though there is no reclaimable memory in the inactive list.

    I ran the experiment 4 times on a 32-bit, 2G, 8-CPU KVM machine and
    measured the elapsed time.

    hackbench 500 process 2

    = Old =

    1st: 289s 2nd: 310s 3rd: 112s 4th: 272s

    = Now =

    1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.

    [akpm@linux-foundation.org: fixes per Mel]
    Link: http://lkml.kernel.org/r/1469433119-1543-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Page reclaim determines whether a pgdat is unreclaimable by examining
    how many pages have been scanned since a page was freed and comparing
    that to the LRU sizes. Skipped pages are not reclaim candidates but
    contribute to scanned. This can prematurely mark a pgdat as
    unreclaimable and trigger an OOM kill.

    This patch accounts for skipped pages as a partial scan so that an
    unreclaimable pgdat will still be marked as such but by scaling the cost
    of a skip, it'll avoid the pgdat being marked prematurely.

    Link: http://lkml.kernel.org/r/1469110261-7365-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Minchan Kim reported that with per-zone lru state it was possible to
    identify that a normal zone with 8^M anonymous pages could trigger OOM
    with non-atomic order-0 allocations as all pages in the zone were in the
    active list.

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    Call Trace:
    __alloc_pages_nodemask+0xe52/0xe60
    ? new_slab+0x39c/0x3b0
    new_slab+0x39c/0x3b0
    ___slab_alloc.constprop.87+0x6da/0x840
    ? __alloc_skb+0x3c/0x260
    ? enqueue_task_fair+0x73/0xbf0
    ? poll_select_copy_remaining+0x140/0x140
    __slab_alloc.isra.81.constprop.86+0x40/0x6d
    ? __alloc_skb+0x3c/0x260
    kmem_cache_alloc+0x22c/0x260
    ? __alloc_skb+0x3c/0x260
    __alloc_skb+0x3c/0x260
    alloc_skb_with_frags+0x4e/0x1a0
    sock_alloc_send_pskb+0x16a/0x1b0
    ? wait_for_unix_gc+0x31/0x90
    unix_stream_sendmsg+0x28d/0x340
    sock_sendmsg+0x2d/0x40
    sock_write_iter+0x6c/0xc0
    __vfs_write+0xc0/0x120
    vfs_write+0x9b/0x1a0
    ? __might_fault+0x49/0xa0
    SyS_write+0x44/0x90
    do_fast_syscall_32+0xa6/0x1e0

    Mem-Info:
    active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
    Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
    DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
    Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
    HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    54409 total pagecache pages
    53215 pages in swap cache
    Swap cache stats: add 300982, delete 247765, find 157978/226539
    Free swap = 3803244kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9642 pages reserved
    0 pages cma reserved

    The problem is due to the active deactivation logic in
    inactive_list_is_low:

    Node 0 active_anon:404412kB inactive_anon:409040kB

    IOW, (inactive_anon of node * inactive_ratio > active_anon of node) holds
    due to the highmem anonymous stat, so the VM never deactivates the normal
    zone's anonymous pages.

    This patch is a modified version of Minchan's original solution, built
    upon the same idea. The problem with Minchan's patch is that any low zone
    with an imbalanced list could force a rotation.

    In this patch, a zone-constrained global reclaim will rotate the list if
    the inactive/active ratio of all eligible zones needs to be corrected.
    It is possible that higher zone pages will be initially rotated
    prematurely but this is the safer choice to maintain overall LRU age.

    Link: http://lkml.kernel.org/r/20160722090929.GJ10438@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was tried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With the reintroduction of per-zone LRU stats, highmem_file_pages is
    redundant so remove it.

    [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory]
    Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.net
    Link: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman