21 May, 2016

40 commits

  • Add a test that makes sure ksize() unpoisons the whole chunk.

    Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Instead of calling kasan_krealloc(), which replaces the memory
    allocation stack ID (if stack depot is used), just unpoison the whole
    memory chunk.

    Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after-free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and the
    old allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger an ASan warning.

    Without the quarantine, it's not guaranteed that the objects aren't
    reused immediately; that's why the probability of catching a
    use-after-free is lower than with the quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).
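
    A minimal sketch of the drain step described above, assuming
    illustrative names (quarantine_size, quarantine_max and the
    qlist_*/qlink_* helpers only approximate the real kernel identifiers):

        /* Pop the oldest quarantined objects and hand them back to the
         * allocator until the quarantine shrinks below 3/4 of its
         * maximum size. */
        static void quarantine_reduce_sketch(void)
        {
            while (quarantine_size > 3 * quarantine_max / 4) {
                struct qlist_node *qlink = qlist_pop_oldest(&global_quarantine);

                if (!qlink)
                    break;
                quarantine_size -= qlink_size(qlink);
                qlink_free(qlink);  /* returns the object to its slab cache */
            }
        }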

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.

    Right now quarantine support is only enabled for the SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
    When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap
    is initialized at boot; the rest is initialized in parallel by starting
    a one-off "pgdatinitX" kernel thread for each node X.

    If page_ext_init() is called before that completes, some pages will not
    have a valid page extension, which may lead to the kernel oops below
    during boot:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] free_pcppages_bulk+0x2d2/0x8d0
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
    Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
    task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
    RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0
    RSP: 0000:ffff88017c087c48 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
    RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
    RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
    R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
    R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
    FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
    Call Trace:
    free_hot_cold_page+0x192/0x1d0
    __free_pages+0x5c/0x90
    __free_pages_boot_core+0x11a/0x14e
    deferred_free_range+0x50/0x62
    deferred_init_memmap+0x220/0x3c3
    kthread+0xf8/0x110
    ret_from_fork+0x22/0x40
    Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
    RIP [] free_pcppages_bulk+0x2d2/0x8d0
    RSP
    CR2: 0000000000000000

    Move page_ext_init() after page_alloc_init_late() to make sure the page
    extension is set up for all pages.
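
    A hedged sketch of the intended ordering in the late init path (only the
    two relevant calls are shown):

        /* page_ext_init() must not run until page_alloc_init_late() has
         * waited for the deferred "pgdatinitX" threads, so that every
         * struct page already has a valid page_ext. */
        page_alloc_init_late();   /* finishes deferred struct page init */
        page_ext_init();          /* now safe: all pages are initialized */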

    Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • If page migration fails due to -ENOMEM, nr_failed should still be
    incremented for proper statistics.

    This was encountered recently when all page migration vmstats showed 0,
    leading to the inference that migrate_pages() was never called, although
    in reality the first page migration failed because compaction_alloc()
    failed to find a migration target.

    This patch increments nr_failed so the vmstat is properly accounted on
    ENOMEM.
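
    A hedged sketch of the accounting fix in migrate_pages()'s retry loop
    (only the -ENOMEM arm is shown; the surrounding cases are as upstream):

        switch (rc) {
        case -ENOMEM:
            /* Count the failure before bailing out so the migration
             * vmstats reflect reality. */
            nr_failed++;
            goto out;
        case -EAGAIN:
            retry++;
            break;
        }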

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    While testing kcompactd on a platform with 3G of memory and only a DMA
    zone, I found that kcompactd never wakes up. It seems the zone index has
    already had 1 subtracted earlier, so the traversal here should use <=
    rather than <.
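
    A hedged sketch of the corrected traversal (the suitability check itself
    is elided):

        /* classzone_idx is already the highest eligible zone index, so
         * the loop must include it; with "<" a DMA-only node whose
         * classzone_idx is 0 is never examined and kcompactd never
         * wakes up. */
        for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
            zone = &pgdat->node_zones[zoneid];
            if (!populated_zone(zone))
                continue;
            /* ... check whether compaction is suitable for this zone ... */
        }
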
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Zhuangluan Su
    Cc: Yiping Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Feng
     
  • When enabling the below kernel configs:

    CONFIG_DEFERRED_STRUCT_PAGE_INIT
    CONFIG_DEBUG_PAGEALLOC
    CONFIG_PAGE_EXTENSION
    CONFIG_DEBUG_VM

    kernel bootup may fail due to the following oops:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] free_pcppages_bulk+0x2d2/0x8d0
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
    Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
    task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
    RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0
    RSP: 0000:ffff88017c087c48 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
    RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
    RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
    R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
    R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
    FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
    Call Trace:
    free_hot_cold_page+0x192/0x1d0
    __free_pages+0x5c/0x90
    __free_pages_boot_core+0x11a/0x14e
    deferred_free_range+0x50/0x62
    deferred_init_memmap+0x220/0x3c3
    kthread+0xf8/0x110
    ret_from_fork+0x22/0x40
    Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
    RIP [] free_pcppages_bulk+0x2d2/0x8d0
    RSP
    CR2: 0000000000000000

    The problem is that lookup_page_ext() returns NULL and page_is_guard()
    then tries to dereference it during page freeing.

    page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page
    extension flags, but page freeing might reach this point before the
    page_ext arrays are allocated, i.e. when feeding a range of pages to the
    allocator for the first time during bootup or memory hotplug.

    When it returns NULL, page_is_guard() should just return false instead
    of checking PAGE_EXT_DEBUG_GUARD unconditionally.
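
    A hedged sketch of the fixed check, close to but not necessarily
    identical to the actual patch:

        static inline bool page_is_guard(struct page *page)
        {
            struct page_ext *page_ext = lookup_page_ext(page);

            /* page_ext may not be allocated yet during early boot or
             * memory hotplug; treat that as "not a guard page". */
            if (unlikely(!page_ext))
                return false;

            return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
        }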

    Link: http://lkml.kernel.org/r/1463610225-29060-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • If a large value is written to scan_sleep_millisecs, for example, that
    period must lapse before khugepaged will wake up for periodic
    collapsing.

    If this value is tuned to 1 day, for example, and then re-tuned to its
    default 10s, khugepaged will still wait for a day before scanning again.

    This patch causes khugepaged to wake up immediately when the value is
    changed, and then sleep until that value is rewritten or the new period
    lapses.
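
    A hedged sketch of the sysfs store path; khugepaged_sleep_expire is an
    assumed bookkeeping variable, the other symbols are existing khugepaged
    names:

        /* When scan_sleep_millisecs is rewritten, forget the old deadline
         * and kick khugepaged so the new value takes effect immediately. */
        khugepaged_scan_sleep_millisecs = msecs;
        khugepaged_sleep_expire = 0;              /* assumed variable */
        wake_up_interruptible(&khugepaged_wait);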

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When nfsd is exporting a filesystem over NFS which is then NFS-mounted
    on the local machine there is a risk of deadlock. This happens when
    there are lots of dirty pages in the NFS filesystem and they cause NFSD
    to be throttled, either in throttle_vm_writeout() or in
    balance_dirty_pages().

    To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
    and it provides a 25% increase to the limits that affect NFSD. Any
    process writing to an NFS filesystem will be throttled well before the
    number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
    will not deadlock on pages that it needs to write out. At least it
    shouldn't.

    All processes are allowed a small excess margin to avoid performing too
    many calculations: ratelimit_pages.

    ratelimit_pages is set so that if a thread on every CPU uses the entire
    margin, the total will only go 3% over the limit, and this is much less
    than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
    shouldn't be a problem. But it is.

    The "total memory" that these 3% and 25% are calculated against are not
    really total memory but are "global_dirtyable_memory()" which doesn't
    include anonymous memory, just free memory and page-cache memory.

    The "ratelimit_pages" number is based on whatever the
    global_dirtyable_memory was on the last CPU hot-plug, which might not be
    what you expect, but is probably close to the total freeable memory.

    The throttle threshold uses the global_dirtyable_memory at the moment
    when the throttling happens, which could be much less than at the last
    CPU hotplug. So if lots of anonymous memory has been allocated, thus
    pushing out lots of page-cache pages, then NFSD might end up being
    throttled due to dirty NFS pages because the "25%" bonus it gets is
    calculated against a rather small amount of dirtyable memory, while the
    "3%" margin that other processes are allowed to dirty without penalty is
    calculated against a much larger number.

    To remove this possibility of deadlock we need to make sure that the
    margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
    Simply adding ratelimit_pages isn't enough as that should be multiplied
    by the number of cpus.

    So add "global_wb_domain.dirty_limit / 32" as that more accurately
    reflects the current total over-shoot margin. This ensures that the
    number of dirty NFS pages never gets so high that nfsd will be throttled
    waiting for them to be written.
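
    A hedged sketch of the adjusted bonus in the dirty-limit calculation
    (shape of the change only, not a verbatim hunk):

        if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
            /* 25% bonus plus a slice of the global dirty limit, so the
             * margin always exceeds the per-CPU ratelimit slack. */
            bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
            thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
        }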

    Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
    Currently we check page->flags twice for the "HWPoisoned" case of
    check_new_page_bad(), which can cause a race with unpoisoning.

    This race unnecessarily taints kernel with "BUG: Bad page state".
    check_new_page_bad() is the only caller of bad_page() which is
    interested in __PG_HWPOISON, so let's move the hwpoison related code in
    bad_page() to it.

    Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
    Signed-off-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    When CONFIG_PAGE_POISONING and CONFIG_KASAN are enabled,
    free_pages_prepare()'s code flow is as follows:

    1) kmemcheck_free_shadow()
    2) kasan_free_pages()
       - marks the page's shadow bytes as freed
    3) kernel_poison_pages()
       3.1) KASAN checks whether the write to the page is valid
            ---> an error occurs; KASAN considers it an invalid access
       3.2) poisons the page
    4) kernel_map_pages()

    So kasan_free_pages() should be called after poisoning the page.
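
    A hedged sketch of the reordered calls in free_pages_prepare():

        kmemcheck_free_shadow(page, order);
        /* Poison the page contents first... */
        kernel_poison_pages(page, 1 << order, 0);
        /* ...and only then mark the shadow as freed, so the poisoning
         * writes are not reported by KASAN as invalid accesses. */
        kasan_free_pages(page, order);
        kernel_map_pages(page, 1 << order, 0);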

    Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Cc: Andrey Ryabinin
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     
    fault_around aims to reduce minor faults of file-backed pages via
    speculative ahead-of-time pte mapping, relying on the readahead logic.
    However, on architectures without a hardware access bit the benefit is
    very limited, because they have to emulate the young bit with minor
    faults for reclaim's page aging algorithm. IOW, we cannot reduce minor
    faults on those architectures.

    I did a quick test on my ARM machine.

    512M file, mmapped and sequentially read word by word from an eSATA
    drive, 4 times; stddev is stable.

    = fault_around 4096 =
    elapsed time(usec): 6747645

    = fault_around 65536 =
    elapsed time(usec): 6709263

    0.5% gain.

    Even when I tested with eMMC there was no gain, presumably because with
    slow storage the major fault is the dominant factor.

    Also, fault_around has the side effect of shrinking slab more
    aggressively and causing higher vmpressure, so if the speculation fails
    it can evict more slab objects, which can result in page I/O (e.g., for
    the inode cache). In the end, that would void any benefit of
    fault_around.

    So let's make the default "disabled" on those architectures.

    Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
    Signed-off-by: Minchan Kim
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Currently, the faultaround code produces young ptes. This can screw up
    vmscan behaviour [1], as it makes vmscan think that these pages are hot
    and not push them out on the first round.

    During sparse file access faultaround gets more pages mapped and all of
    them are young. Under memory pressure, this makes vmscan swap out anon
    pages instead, or to drop other page cache pages which otherwise stay
    resident.

    Modify faultaround to produce old ptes, so they can easily be reclaimed
    under memory pressure.

    This can to some extent defeat the purpose of faultaround on machines
    without a hardware accessed bit, as it will not help us with reducing
    the number of minor page faults.

    We may want to disable faultaround on such machines altogether, but
    that's a subject for a separate patchset.

    Minchan:
    "I tested 512M mmap sequential word read test on non-HW access bit
    system (i.e., ARM) and confirmed it doesn't increase minor fault any
    more.

    old: 4096 fault_around
    minor fault: 131291
    elapsed time: 6747645 usec

    new: 65536 fault_around
    minor fault: 131291
    elapsed time: 6709263 usec

    0.56% benefit"

    [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org

    Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Tested-by: Minchan Kim
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Since commit 92923ca3aace ("mm: meminit: only set page reserved in the
    memblock region") the reserved bit is set on reserved memblock regions.
    However, the start and end addresses are passed as unsigned long, which
    is only 32 bits on i386, so it can end up marking the wrong pages
    reserved for ranges at 4GB and above.

    This was observed on a 32-bit Xen dom0 which was booted with initial
    memory set to a value below 4G but allowed to balloon in memory
    (dom0_mem=1024M for example). This would define a reserved bootmem
    region for the additional memory (for example, on an 8GB system there
    was a reserved region covering the 4GB-8GB range). But since the
    addresses were passed on as unsigned long, this actually marked all
    pages from 0 to 4GB as reserved.
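
    A hedged, diff-style sketch of the type fix (the body of the function is
    unchanged):

        -void __init reserve_bootmem_region(unsigned long start, unsigned long end)
        +void __init reserve_bootmem_region(phys_addr_t start, phys_addr_t end)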

    Fixes: 92923ca3aacef63 ("mm: meminit: only set page reserved in the memblock region")
    Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
    Signed-off-by: Stefan Bader
    Cc: [4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Bader
     
  • userfaultfd_file_create() increments mm->mm_users; this means that the
    memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
    after that can populate the orphaned mm more.

    Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
    mm->mm_count to pin mm_struct. This means that
    atomic_inc_not_zero(mm->mm_users) is needed when we are going to
    actually play with this memory. The exception is the handle_userfault()
    path, which doesn't need this because the caller must already hold a
    reference.

    The patch adds a new trivial helper, mmget_not_zero(); it can gain more
    users later.
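
    A sketch of the new helper as described, sitting next to the existing
    mm_users accessors:

        /* Pin mm->mm_users only if it is not already zero, i.e. the
         * address space has not been torn down yet. */
        static inline bool mmget_not_zero(struct mm_struct *mm)
        {
            return atomic_inc_not_zero(&mm->mm_users);
        }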

    Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    Comparing a u64 variable against >= 0 always returns true, so the check
    can be removed. This issue was detected using the -Wtype-limits gcc flag.

    This patch fixes following type-limits warning:

    mm/memblock.c: In function `__next_reserved_mem_region':
    mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    if (*idx >= 0 && *idx < type->cnt) {
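
    A hedged, diff-style sketch of the fix:

        -    if (*idx >= 0 && *idx < type->cnt) {
        +    if (*idx < type->cnt) {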

    Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net
    Signed-off-by: Richard Leitner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Leitner
     
    This patch introduces z3fold, a special-purpose allocator for storing
    compressed pages. It is designed to store up to three compressed pages
    per physical page. It is a zbud derivative which allows for a higher
    compression ratio while keeping the simplicity and determinism of its
    predecessor.

    This patch comes as a follow-up to the discussions at the Embedded Linux
    Conference in San Diego related to the talk [1]. The outcome of these
    discussions was that it would be good to have a compressed page
    allocator as stable and deterministic as zbud, but with a higher
    compression ratio.

    To keep the determinism and simplicity, z3fold, just like zbud, always
    stores an integral number of compressed pages per page, but it can store
    up to 3 pages unlike zbud which can store at most 2. Therefore the
    compression ratio goes up to around 2.6x while zbud's is around 1.7x.

    The patch is based on the latest linux.git tree.

    This version has been updated after testing on various simulators (e.g.
    ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on
    comments from Dan Streetman [3].

    [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
    [2] https://lkml.org/lkml/2016/4/21/799
    [3] https://lkml.org/lkml/2016/5/4/852

    Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
    The comment is partly wrong; this improves it by including the case of
    split_huge_pmd_address() being called by try_to_unmap_one() when
    TTU_SPLIT_HUGE_PMD is set.

    Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Kirill A. Shutemov
    Cc: Alex Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    compound_mapcount() is only called after PageCompound() has already been
    checked by the caller, so there's no point in checking it again. Gcc may
    optimize the check away anyway because the function is inline, but this
    removes the runtime check for sure and adds an assert instead.
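
    A hedged sketch of the change, with the runtime check replaced by a
    debug assert:

        static inline int compound_mapcount(struct page *page)
        {
            VM_BUG_ON_PAGE(!PageCompound(page), page);
            page = compound_head(page);
            return atomic_read(compound_mapcount_ptr(page)) + 1;
        }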

    Link: http://lkml.kernel.org/r/1462547040-1737-3-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Kirill A. Shutemov
    Cc: Alex Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    The cpu_stat_off variable is unnecessary since we can check whether a
    workqueue request is pending instead. Removing cpu_stat_off makes it
    pretty easy for the vmstat shepherd to ensure that the proper things
    happen.

    Removing the state also removes all races related to it. Should a
    workqueue not be scheduled as needed for vmstat_update, the shepherd
    will notice and schedule it as needed. Should a workqueue be
    unnecessarily scheduled, the vmstat updater will disable it.

    [akpm@linux-foundation.org: fix indentation, per Michal]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Commit f61c42a7d911 ("memcg: remove tasks/children test from
    mem_cgroup_force_empty()") removed memory reparenting from the function.

    Fix the function's comment.

    Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
    Signed-off-by: Greg Thelen
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
    struct page->flags is unsigned long, so when shifting bits we should use
    the UL suffix to match it.

    I found this problem after I added 64-bit CPU specific page flags and
    failed to compile the kernel:

    mm/page_alloc.c: In function '__free_one_page':
    mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]
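
    A hedged illustration of the difference; the FLAGS_MASK_* names are made
    up, and NR_PAGEFLAGS stands in for whatever bit count triggers the
    overflow:

        /* With enough page flags the shift must be done in unsigned long
         * to match the width of page->flags: */
        #define FLAGS_MASK_BAD   ((1   << NR_PAGEFLAGS) - 1)  /* int arithmetic, overflows */
        #define FLAGS_MASK_GOOD  ((1UL << NR_PAGEFLAGS) - 1)  /* long arithmetic, correct */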

    Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Jerome Marchand
    Cc: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
    It's more convenient to use the existing helper to convert the strings
    "on"/"off" to a boolean.

    Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com
    Signed-off-by: Minfei Huang
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
    When a writeback operation cannot make forward progress because memory
    allocation requests needed for doing I/O cannot be satisfied (e.g.
    under an OOM-livelock situation), we can observe a flood of order-0 page
    allocation failure messages caused by complete depletion of memory
    reserves.

    This is caused by unconditionally allocating "struct wb_writeback_work"
    objects using GFP_ATOMIC from PF_MEMALLOC context.

    __alloc_pages_nodemask() {
      __alloc_pages_slowpath() {
        __alloc_pages_direct_reclaim() {
          __perform_reclaim() {
            current->flags |= PF_MEMALLOC;
            try_to_free_pages() {
              do_try_to_free_pages() {
                wakeup_flusher_threads() {
                  wb_start_writeback() {
                    kzalloc(sizeof(*work), GFP_ATOMIC) {
                      /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                    }
                  }
                }
              }
            }
            current->flags &= ~PF_MEMALLOC;
          }
        }
      }
    }

    Since I/O is stalling, allocating writeback requests forever will
    eventually deplete memory reserves. Fortunately, since
    wb_start_writeback() can fall back to wb_wakeup() when allocating
    "struct wb_writeback_work" fails, we don't need to allow
    wb_start_writeback() to use memory reserves.
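
    A hedged sketch of the allocation change in wb_start_writeback(); the
    fallback path already exists as described:

        /* Do not dip into memory reserves for the work item; if the
         * allocation fails, just wake the flusher directly. */
        work = kzalloc(sizeof(*work),
                       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
        if (!work) {
            trace_writeback_nowork(wb);
            wb_wakeup(wb);
            return;
        }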

    Mem-Info:
    active_anon:289393 inactive_anon:2093 isolated_anon:29
    active_file:10838 inactive_file:113013 isolated_file:859
    unevictable:0 dirty:108531 writeback:5308 unstable:0
    slab_reclaimable:5526 slab_unreclaimable:7077
    mapped:9970 shmem:2159 pagetables:2387 bounce:0
    free:3042 free_pcp:0 free_cma:0
    Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
    lowmem_reserve[]: 0 1732 1732 1732
    Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
    Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    126847 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    76348 pages reserved
    0 pages hwpoisoned
    Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
    Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
    kthreadd: page allocation failure: order:0, mode:0x2200020
    file_io.00: page allocation failure: order:0, mode:0x2200020
    CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    Call Trace:
    warn_alloc_failed+0xf7/0x150
    __alloc_pages_nodemask+0x23f/0xa60
    alloc_pages_current+0x87/0x110
    new_slab+0x3a1/0x440
    ___slab_alloc+0x3cf/0x590
    __slab_alloc.isra.64+0x18/0x1d
    kmem_cache_alloc+0x11c/0x150
    wb_start_writeback+0x39/0x90
    wakeup_flusher_threads+0x7f/0xf0
    do_try_to_free_pages+0x1f9/0x410
    try_to_free_pages+0x94/0xc0
    __alloc_pages_nodemask+0x566/0xa60
    alloc_pages_current+0x87/0x110
    __page_cache_alloc+0xaf/0xc0
    pagecache_get_page+0x88/0x260
    grab_cache_page_write_begin+0x21/0x40
    xfs_vm_write_begin+0x2f/0xf0
    generic_perform_write+0xca/0x1c0
    xfs_file_buffered_aio_write+0xcc/0x1f0
    xfs_file_write_iter+0x84/0x140
    __vfs_write+0xc7/0x100
    vfs_write+0x9d/0x190
    SyS_write+0x50/0xc0
    entry_SYSCALL_64_fastpath+0x12/0x6a
    Mem-Info:
    active_anon:293335 inactive_anon:2093 isolated_anon:0
    active_file:10829 inactive_file:110045 isolated_file:32
    unevictable:0 dirty:109275 writeback:822 unstable:0
    slab_reclaimable:5489 slab_unreclaimable:10070
    mapped:9999 shmem:2159 pagetables:2420 bounce:0
    free:3 free_pcp:0 free_cma:0
    Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
    lowmem_reserve[]: 0 1732 1732 1732
    Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    123086 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    76348 pages reserved
    0 pages hwpoisoned
    SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
    cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
    node 0: slabs: 3218, objs: 205952, free: 0
    file_io.00: page allocation failure: order:0, mode:0x2200020
    CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45

    Assuming that somebody will find a better solution, let's apply this
    patch for now to stop the bleeding, as this problem frequently prevents
    me from testing the OOM-livelock condition.

    Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Signed-off-by: Eric Engestrom
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Engestrom
     
  • If SPARSEMEM, use page_ext in mem_section
    if !SPARSEMEM, use page_ext in pgdata

    Signed-off-by: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
    It is used purely as a bool function throughout the kernel source.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
    The HUGETLBFS_SB macro is clear enough, so a single statement is clearer
    than a three-line statement.

    Also remove redundant return statements from void functions, which at
    least saves a few lines.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Put the activate_page_pvecs definition next to those of the other
    pagevecs, for clarity.

    Signed-off-by: Ming Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Li
     
  • copy_page_to_iter_iovec() is currently the only user of
    fault_in_pages_writeable(), and it definitely can use fragments from
    high order pages.

    Make sure fault_in_pages_writeable() is only touching two adjacent pages
    at most, as claimed.

    Signed-off-by: Eric Dumazet
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • The page_counter rounds limits down to page size values. This makes
    sense, except in the case of hugetlb_cgroup where it's not possible to
    charge partial hugepages. If the hugetlb_cgroup margin is less than the
    hugepage size being charged, it will fail as expected.

    Round the hugetlb_cgroup limit down to hugepage size, since it is the
    effective limit of the cgroup.

    For consistency, round down PAGE_COUNTER_MAX as well when a
    hugetlb_cgroup is created: this prevents error reports when a user
    cannot restore the value to the kernel default.
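
    A hedged sketch of the rounding, reusing the existing round_down() and
    huge_page_order() helpers (the variable names are illustrative):

        /* Round the requested limit down to a whole number of hugepages,
         * since partial hugepages can never be charged. */
        nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));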

    Signed-off-by: David Rientjes
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The nommu do_mmap expects f_op->get_unmapped_area to either succeed or
    return -ENOSYS for VM_MAYSHARE (e.g. private read-only) mappings.
    Returning addr in the non-MAP_SHARED case was completely wrong, and only
    happened to work because addr was 0. However, it prevented VM_MAYSHARE
    mappings from sharing backing with the fs cache, and forced such
    mappings (including shareable program text) to be copied whenever the
    number of mappings transitioned from 0 to 1, impacting performance and
    memory usage. Subsequent mappings beyond the first still correctly
    shared memory with the first.

    Instead, treat VM_MAYSHARE identically to VM_SHARED at the file ops level;
    do_mmap already handles the semantic differences between them.

    Signed-off-by: Rich Felker
    Cc: Michal Hocko
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rich Felker
     
    Since commit 84638335900f ("mm: rework virtual memory accounting"),
    RLIMIT_DATA limits both brk() and private mmap(), but this is disabled
    by default because of incompatibility with older versions of valgrind.

    Valgrind always sets the limit to zero and fails if RLIMIT_DATA is
    enabled. Fortunately it changes only rlim_cur and keeps rlim_max, so the
    limit can be reverted when needed.

    This patch checks current usage also against rlim_max when rlim_cur is
    zero. This is safe because the task can raise rlim_cur up to rlim_max
    anyway. The size of brk is still checked against rlim_cur, so this part
    stays fully compatible - a zero rlim_cur forbids brk() but allows
    private mmap().
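
    A hedged sketch of the workaround check (shape only; rlimit() and
    rlimit_max() are the existing helpers):

        /* Valgrind sets rlim_cur to 0 but leaves rlim_max alone, so when
         * rlim_cur is zero fall back to checking against rlim_max. */
        if (rlimit(RLIMIT_DATA) == 0 &&
            mm->data_vm + npages <= rlimit_max(RLIMIT_DATA) >> PAGE_SHIFT)
            return true;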

    Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Linus Torvalds
    Cc: Cyrill Gorcunov
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    We use generic hooks in remap_pfn_range() to help archs track pfnmap
    regions. The code looks roughly like this:

    int remap_pfn_range()
    {
        ...
        track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
        ...
        pfn -= addr >> PAGE_SHIFT;
        ...
        untrack_pfn(vma, pfn, PAGE_ALIGN(size));
        ...
    }

    Here we can easily see that pfn is modified but not restored before
    untrack_pfn() is called, which is incorrect.

    There are no known runtime effects - this is from inspection.

    Signed-off-by: Yongji Xie
    Cc: Kirill A. Shutemov
    Cc: Jerome Marchand
    Cc: Ingo Molnar
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Andy Lutomirski
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yongji Xie
     
    When mixing lots of vmallocs and set_memory_*() calls (which call
    vm_unmap_aliases()), I encountered situations where performance degraded
    severely due to walking the entire vmap_area list on each invocation.

    One simple improvement is to add the lazily freed vmap_area to a
    separate lockless free list, such that we then avoid having to walk the
    full list on each purge.
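
    A hedged sketch of the lockless free-list idea; the llist_* API is real,
    while the purge_list field and list name are assumptions:

        static LLIST_HEAD(vmap_purge_list);   /* assumed name */

        static void free_vmap_area_noflush(struct vmap_area *va)
        {
            /* Queue the area for the next purge instead of forcing the
             * purger to walk the whole vmap_area list. */
            llist_add(&va->purge_list, &vmap_purge_list);
        }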

    Signed-off-by: Chris Wilson
    Reviewed-by: Roman Pen
    Cc: Joonas Lahtinen
    Cc: Tvrtko Ursulin
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Roman Pen
    Cc: Mel Gorman
    Cc: Toshi Kani
    Cc: Shawn Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     
  • memblock_add_region() and memblock_reserve_region() do nothing specific
    before the call of memblock_add_range(), only print debug output.

    We can do the same in memblock_add() and memblock_reserve() since both
    memblock_add_region() and memblock_reserve_region() are not used by
    anybody outside of memblock.c and memblock_{add,reserve}() have the same
    set of flags and nids.

    Since memblock_add_region() and memblock_reserve_region() will be
    inlined, there are no functional changes, but code readability improves
    a little.

    Signed-off-by: Alexander Kuleshov
    Acked-by: Ard Biesheuvel
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Tony Luck
    Cc: Tang Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    HWPoison was originally specific to some particular x86 platforms, and
    it is often seen as a high-level machine check handler; therefore 'MCE'
    is used as the printk() format prefix. However, 'PowerNV' has also
    started using HWPoison for handling memory errors [1], so 'MCE' is no
    longer a suitable prefix for memory-failure.c.

    Additionally, 'MCE' and 'Memory failure' run in different contexts: the
    former belongs to exception context and the latter to process context.
    Furthermore, HWPoison can also be used for off-lining sub-health pages
    that do not trigger any machine check exception.

    This patch aims to replace 'MCE' with a more appropriate prefix.

    [1] commit 75eb3d9b60c2 ("powerpc/powernv: Get FSP memory errors
    and plumb into memory poison infrastructure.")
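
    A hedged, diff-style sketch of the prefix change for one message (the
    message text here is a placeholder):

        -    printk(KERN_ERR "MCE %#lx: some error report\n", pfn);
        +    pr_err("Memory failure: %#lx: some error report\n", pfn);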

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
    The implementation of mk_huge_pmd() looks verbose; it can be simplified
    to a single line of code.
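
    A hedged sketch of the simplified helper, composing the two existing
    primitives:

        static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
        {
            return pmd_mkhuge(mk_pmd(page, prot));
        }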

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Since commit 3a5dda7a17cf ("oom: prevent unnecessary oom kills or kernel
    panics"), select_bad_process() is using for_each_process_thread().

    Since oom_unkillable_task() scans all threads in the caller's thread
    group and oom_task_origin() scans signal_struct of the caller's thread
    group, we don't need to call oom_unkillable_task() and oom_task_origin()
    on each thread. Also, since !mm test will be done later at
    oom_badness(), we don't need to do !mm test on each thread. Therefore,
    we only need to do TIF_MEMDIE test on each thread.

    Although the original code was correct, it was quite inefficient because
    each thread group was scanned num_threads times, which can be a lot,
    especially for processes with many threads. Even though the OOM path is
    extremely cold, it is always good to be as efficient as possible while
    inside rcu_read_lock() - aka unpreemptible context.

    If we track number of TIF_MEMDIE threads inside signal_struct, we don't
    need to do TIF_MEMDIE test on each thread. This will allow
    select_bad_process() to use for_each_process().

    This patch adds a counter to signal_struct for tracking how many
    TIF_MEMDIE threads are in a given thread group, and check it at
    oom_scan_process_thread() so that select_bad_process() can use
    for_each_process() rather than for_each_process_thread().

    [mhocko@suse.com: do not blow the signal_struct size]
    Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • task_will_free_mem is a misnomer for a more complex PF_EXITING test for
    early break out from the oom killer because it is believed that such a
    task would release its memory shortly and so we do not have to select an
    oom victim and perform a disruptive action.

    Currently we make sure that the given task is not participating in the
    core dumping because it might get blocked for a long time - see commit
    d003f371b270 ("oom: don't assume that a coredumping thread will exit
    soon").

    The check can still do better though. We shouldn't consider the task
    unless the whole thread group is going down. This is rather unlikely
    but not impossible. A single exiting thread would surely leave all the
    address space behind. If we are really unlucky it might get stuck on
    the exit path and keep its TIF_MEMDIE and so block the oom killer.

    Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko