21 May, 2016
40 commits
-
Add a test that makes sure ksize() unpoisons the whole chunk.
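A minimal sketch of such a test, in the style of the KASAN test module (the function name and sizes here are illustrative, not the exact patch):

	static noinline void __init ksize_unpoisons_memory(void)
	{
		char *ptr;
		size_t size = 123, real_size;

		ptr = kmalloc(size, GFP_KERNEL);
		if (!ptr)
			return;
		real_size = ksize(ptr);
		/* Must not be reported: ksize() unpoisoned the whole chunk. */
		ptr[size] = 'x';
		/* Must be reported: past the real allocation size. */
		ptr[real_size] = 'y';
		kfree(ptr);
	}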
Signed-off-by: Alexander Potapenko
Acked-by: Andrey Ryabinin
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Christoph Lameter
Cc: Konstantin Serebryany
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Instead of calling kasan_krealloc(), which replaces the memory
allocation stack ID (if stack depot is used), just unpoison the whole
memory chunk.

Signed-off-by: Alexander Potapenko
Acked-by: Andrey Ryabinin
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Christoph Lameter
Cc: Konstantin Serebryany
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.

When the object is freed, its state changes from KASAN_STATE_ALLOC to
KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
instead of being returned to the allocator, therefore every subsequent
access to that object triggers a KASAN error, and the error handler is
able to say where the object has been allocated and deallocated.

When it's time for the object to leave quarantine, its state becomes
KASAN_STATE_FREE and it's returned to the allocator. From now on the
allocator may reuse it for another allocation. Before that happens,
it's still possible to detect a use-after free on that object (it
retains the allocation/deallocation stacks).

When the allocator reuses this object, the shadow is unpoisoned and old
allocation/deallocation stacks are wiped. Therefore a use of this
object, even an incorrect one, won't trigger ASan warning.

Without the quarantine, it's not guaranteed that the objects aren't
reused immediately, that's why the probability of catching a
use-after-free is lower than with quarantine in place.

Freed objects are first added to per-cpu quarantine queues. When a
cache is destroyed or memory shrinking is requested, the objects are
moved into the global quarantine queue. Whenever a kmalloc call allows
memory reclaiming, the oldest objects are popped out of the global queue
until the total size of objects in quarantine is less than 3/4 of the
maximum quarantine size (which is a fraction of installed physical
memory).

As long as an object remains in the quarantine, KASAN is able to report
accesses to it, so the chance of reporting a use-after-free is
increased. Once the object leaves quarantine, the allocator may reuse
it, in which case the object is unpoisoned and KASAN can't detect
incorrect accesses to it.

Right now quarantine support is only enabled in the SLAB allocator.
Unification of KASAN features in SLAB and SLUB will be done later.

This patch is based on the "mm: kasan: quarantine" patch originally
prepared by Dmitry Chernenkov. A number of improvements have been
suggested by Andrey Ryabinin.

[glider@google.com: v9]
Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: Steven Rostedt
Cc: Konstantin Serebryany
Cc: Dmitry Chernenkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap is initialized at boot; the rest is initialized in parallel by starting a one-off "pgdatinitX" kernel thread for each node X.

If page_ext_init() is called before that, some pages will not have a valid extension, which may lead to the kernel oops below when booting up:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0
RSP: 0000:ffff88017c087c48 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
Call Trace:
free_hot_cold_page+0x192/0x1d0
__free_pages+0x5c/0x90
__free_pages_boot_core+0x11a/0x14e
deferred_free_range+0x50/0x62
deferred_init_memmap+0x220/0x3c3
kthread+0xf8/0x110
ret_from_fork+0x22/0x40
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
RIP [] free_pcppages_bulk+0x2d2/0x8d0
RSP
CR2: 0000000000000000

Move page_ext_init() after page_alloc_init_late() to make sure page extension is set up for all pages.

Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org
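In init/main.c terms, the fix is an ordering change of this shape (a sketch, not the exact diff):

	/* kernel_init_freeable(), sketched ordering */
	page_alloc_init_late();	/* waits for the pgdatinitX threads */
	page_ext_init();	/* now every page has a valid page_ext */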
Signed-off-by: Yang Shi
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If page migration fails due to -ENOMEM, nr_failed should still be
incremented for proper statistics.

This was encountered recently when all page migration vmstats showed 0, and it was inferred that migrate_pages() was never called, although in reality the first page migration failed because compaction_alloc() failed to find a migration target.

This patch increments nr_failed so the vmstat is properly accounted on ENOMEM.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Cc: Vlastimil Babka
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
While testing kcompactd on my platform (3G of memory with only a DMA zone), I found that kcompactd never wakes up. The zone index has already had 1 subtracted before this point, so the traversal here should use <= rather than <.
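A sketch of the corrected traversal (kcompactd-style loop; classzone_idx is already the highest zone index to consider):

	/* classzone_idx already has 1 subtracted, hence <= rather than < */
	for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
		zone = &pgdat->node_zones[zoneid];
		if (!populated_zone(zone))
			continue;
		...
	}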
Acked-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Michal Hocko
Cc: Kirill A. Shutemov
Cc: Johannes Weiner
Cc: Tejun Heo
Cc: Zhuangluan Su
Cc: Yiping Xu
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When enabling the below kernel configs:
CONFIG_DEFERRED_STRUCT_PAGE_INIT
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_EXTENSION
CONFIG_DEBUG_VM

kernel bootup may fail due to the following oops:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] free_pcppages_bulk+0x2d2/0x8d0
PGD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0
RSP: 0000:ffff88017c087c48 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
Call Trace:
free_hot_cold_page+0x192/0x1d0
__free_pages+0x5c/0x90
__free_pages_boot_core+0x11a/0x14e
deferred_free_range+0x50/0x62
deferred_init_memmap+0x220/0x3c3
kthread+0xf8/0x110
ret_from_fork+0x22/0x40
Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
RIP [] free_pcppages_bulk+0x2d2/0x8d0
RSP
CR2: 0000000000000000

The problem is that lookup_page_ext() returns NULL, and page_is_guard() then tries to access it during page freeing.

page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page extension flags, but a page being freed might reach here before the page_ext arrays are
allocated when feeding a range of pages to the allocator for the first
time during bootup or memory hotplug.

When it returns NULL, page_is_guard() should just return false instead of checking PAGE_EXT_DEBUG_GUARD unconditionally.

Link: http://lkml.kernel.org/r/1463610225-29060-1-git-send-email-yang.shi@linaro.org
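A sketch of the fixed helper along those lines:

	static inline bool page_is_guard(struct page *page)
	{
		struct page_ext *page_ext = lookup_page_ext(page);

		/* page_ext arrays may not be allocated yet during early boot */
		if (unlikely(!page_ext))
			return false;

		return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
	}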
Signed-off-by: Yang Shi
Cc: Joonsoo Kim
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If a large value is written to scan_sleep_millisecs, for example, that
period must lapse before khugepaged will wake up for periodic
collapsing.If this value is tuned to 1 day, for example, and then re-tuned to its
default 10s, khugepaged will still wait for a day before scanning again.This patch causes khugepaged to wakeup immediately when the value is
changed and then sleep until that value is rewritten or the new value
lapses.Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When nfsd is exporting a filesystem over NFS which is then NFS-mounted
on the local machine there is a risk of deadlock. This happens when
there are lots of dirty pages in the NFS filesystem and they cause NFSD
to be throttled, either in throttle_vm_writeout() or in
balance_dirty_pages().

To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
and it provides a 25% increase to the limits that affect NFSD. Any
process writing to an NFS filesystem will be throttled well before the
number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
will not deadlock on pages that it needs to write out. At least it
shouldn't.

All processes are allowed a small excess margin to avoid performing too many calculations: ratelimit_pages.

ratelimit_pages is set so that if a thread on every CPU uses the entire
margin, the total will only go 3% over the limit, and this is much less
than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
shouldn't be a problem. But it is.

The "total memory" that these 3% and 25% are calculated against is not really total memory but "global_dirtyable_memory()", which doesn't include anonymous memory, just free memory and page-cache memory.

The "ratelimit_pages" number is based on whatever the
global_dirtyable_memory was on the last CPU hot-plug, which might not be
what you expect, but is probably close to the total freeable memory.

The throttle threshold uses the global_dirtyable_memory at the moment
when the throttling happens, which could be much less than at the last
CPU hotplug. So if lots of anonymous memory has been allocated, thus
pushing out lots of page-cache pages, then NFSD might end up being
throttled due to dirty NFS pages because the "25%" bonus it gets is
calculated against a rather small amount of dirtyable memory, while the
"3%" margin that other processes are allowed to dirty without penalty is
calculated against a much larger number.

To remove this possibility of deadlock we need to make sure that the
margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
Simply adding ratelimit_pages isn't enough as that should be multiplied
by the number of cpus.

So add "global_wb_domain.dirty_limit / 32" as that more accurately
reflects the current total over-shoot margin. This ensures that the
number of dirty NFS pages never gets so high that nfsd will be throttled
waiting for them to be written.

Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently we check page->flags twice for "HWPoisoned" case of
check_new_page_bad(), which can cause a race with unpoisoning.

This race unnecessarily taints the kernel with "BUG: Bad page state".
check_new_page_bad() is the only caller of bad_page() which is
interested in __PG_HWPOISON, so let's move the hwpoison related code in
bad_page() to it.

Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
Signed-off-by: Naoya Horiguchi
Acked-by: Mel Gorman
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When CONFIG_PAGE_POISONING and CONFIG_KASAN are enabled, free_pages_prepare()'s codeflow is as below.

1) kmemcheck_free_shadow()
2) kasan_free_pages()
   - sets the page's shadow bytes to "freed"
3) kernel_poison_pages()
   3.1) checks whether access to the page is valid, using KASAN
        ---> an error occurs: KASAN thinks it is an invalid access
   3.2) poisons the page
4) kernel_map_pages()

So kasan_free_pages() should be called after poisoning the page.
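A sketch of the reordered flow in free_pages_prepare(), assuming the existing helpers:

	kmemcheck_free_shadow(page, order);
	/* poison first, while KASAN still considers the page accessible */
	kernel_poison_pages(page, 1 << order, 0);
	/* only now mark the shadow as freed */
	kasan_free_pages(page, order);
	kernel_map_pages(page, 1 << order, 0);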
Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
Signed-off-by: seokhoon.yoon
Cc: Andrey Ryabinin
Cc: Laura Abbott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
fault_around aims to reduce minor faults of file-backed pages via
speculative ahead pte mapping and relying on readahead logic. However,
on non-HW access bit architectures the benefit is highly limited because they must emulate the young bit with minor faults for reclaim's page
aging algorithm. IOW, we cannot reduce minor faults on those
architectures.

I did a quick test on my ARM machine:
512M file mmap, sequential every-word read, on an eSATA drive, 4 times.
stddev is stable.

= fault_around 4096 =
elapsed time(usec): 6747645

= fault_around 65536 =
elapsed time(usec): 6709263

0.5% gain.

Even when I tested it with eMMC there is no gain because I guess with slow storage the major fault is the dominant factor.

Also, fault_around has the side effect of shrinking slab more
aggressively and causes higher vmpressure, so if such speculation fails,
it can evict slab more which can result in page I/O (e.g., inode cache).
In the end, it would void any benefit of fault_around.

So let's make the default "disabled" on those architectures.
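A sketch of what such a default could look like; CONFIG_ARCH_HAS_HW_PTE_YOUNG here is a hypothetical stand-in for "this architecture has a hardware accessed bit":

	/* emulated-accessed-bit archs gain nothing from fault around */
	static unsigned long fault_around_bytes __read_mostly =
		IS_ENABLED(CONFIG_ARCH_HAS_HW_PTE_YOUNG) ?
			rounddown_pow_of_two(65536) : PAGE_SIZE;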
Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
Signed-off-by: Minchan Kim
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, faultaround code produces young pte. This can screw up
vmscan behaviour[1], as it makes vmscan think that these pages are hot
and not push them out on first round.

During sparse file access faultaround gets more pages mapped and all of
them are young. Under memory pressure, this makes vmscan swap out anon
pages instead, or to drop other page cache pages which otherwise stay
resident.

Modify faultaround to produce old ptes, so they can easily be reclaimed under memory pressure.

This can to some extent defeat the purpose of faultaround on machines
without hardware accessed bit as it will not help us with reducing the
number of minor page faults.

We may want to disable faultaround on such machines altogether, but that's a subject for a separate patchset.

Minchan:
"I tested 512M mmap sequential word read test on non-HW access bit
system (i.e., ARM) and confirmed it doesn't increase minor fault any
more.

old: 4096 fault_around
minor fault: 131291
elapsed time: 6747645 usec

new: 65536 fault_around
minor fault: 131291
elapsed time: 6709263 usec

0.56% benefit"
[1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
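The heart of the change is in the pte construction for the fault-around path; a sketch in do_set_pte() style, with "old" as an illustrative flag:

	entry = mk_pte(page, vma->vm_page_prot);
	if (old)	/* fault-around: map the page as not recently accessed */
		entry = pte_mkold(entry);
	if (write)
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	set_pte_at(vma->vm_mm, address, pte, entry);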
Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Acked-by: Minchan Kim
Tested-by: Minchan Kim
Acked-by: Rik van Riel
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Vinayak Menon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since commit 92923ca3aace ("mm: meminit: only set page reserved in the
memblock region") the reserved bit is set on reserved memblock regions.
However, the start and end addresses are passed as unsigned long. This is only
32bit on i386, so it can end up marking the wrong pages reserved for
ranges at 4GB and above.

This was observed on a 32bit Xen dom0 which was booted with initial
memory set to a value below 4G while allowing memory to be ballooned in (dom0_mem=1024M for example). This would define a reserved bootmem region for the additional memory (for example, on an 8GB system there was a reserved region covering the 4GB-8GB range). But since the addresses
were passed on as unsigned long, this was actually marking all pages
from 0 to 4GB as reserved.

Fixes: 92923ca3aacef63 ("mm: meminit: only set page reserved in the memblock region")
Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
Signed-off-by: Stefan Bader
Cc: [4.2+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
userfaultfd_file_create() increments mm->mm_users; this means that the
memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
after that can populate the orphaned mm more.

Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
mm->mm_count to pin mm_struct. This means that
atomic_inc_not_zero(mm->mm_users) is needed when we are going to
actually play with this memory, except on the handle_userfault() path, which doesn't need this because the caller must already have a reference.

The patch adds a new trivial helper, mmget_not_zero(); it can gain more users.

Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
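The helper is presumably along these lines, pinning mm_users only if it hasn't already dropped to zero:

	static inline bool mmget_not_zero(struct mm_struct *mm)
	{
		return atomic_inc_not_zero(&mm->mm_users);
	}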
Signed-off-by: Oleg Nesterov
Cc: Andrea Arcangeli
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Comparing a u64 variable to >= 0 always returns true, so the comparison can therefore be removed. This issue was detected using the -Wtype-limits gcc flag.

This patch fixes the following type-limits warning:
mm/memblock.c: In function `__next_reserved_mem_region':
mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
if (*idx >= 0 && *idx < type->cnt) {

Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net
Signed-off-by: Richard Leitner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch introduces z3fold, a special purpose allocator for storing
compressed pages. It is designed to store up to three compressed pages
per physical page. It is a ZBUD derivative which allows for a higher compression ratio while keeping the simplicity and determinism of its predecessor.

This patch comes as a follow-up to the discussions at the Embedded Linux Conference in San Diego related to the talk [1]. The outcome of these
discussions was that it would be good to have a compressed page
allocator as stable and deterministic as zbud, with a higher compression ratio.

To keep the determinism and simplicity, z3fold, just like zbud, always
stores an integral number of compressed pages per page, but it can store
up to 3 pages unlike zbud which can store at most 2. Therefore the
compression ratio goes up to around 2.6x while zbud's is around 1.7x.

The patch is based on the latest linux.git tree.
This version has been updated after testing on various simulators (e.g.
ARM Versatile Express, MIPS Malta, x86_64/Haswell) and based on
comments from Dan Streetman [3].

[1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
[2] https://lkml.org/lkml/2016/4/21/799
[3] https://lkml.org/lkml/2016/5/4/852

Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
Signed-off-by: Vitaly Wool
Cc: Seth Jennings
Cc: Dan Streetman
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The comment is partly wrong; this improves it by including the case of
split_huge_pmd_address() called by try_to_unmap_one if TTU_SPLIT_HUGE_PMD
is set.

Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Acked-by: Kirill A. Shutemov
Cc: Alex Williamson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
compound_mapcount() is only called after PageCompound() has already been
checked by the caller, so there's no point in checking it again. Gcc may optimize the check away too because it's inline, but this will remove the runtime check for sure, and it'll add an assert instead.

Link: http://lkml.kernel.org/r/1462547040-1737-3-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Acked-by: Kirill A. Shutemov
Cc: Alex Williamson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The cpu_stat_off variable is unnecessary since we can check if a
workqueue request is pending otherwise. Removal of cpu_stat_off makes
it pretty easy for the vmstat shepherd to ensure that the proper things
happen.

Removing the state also removes all races related to it. Should a
workqueue not be scheduled as needed for vmstat_update then the shepherd
will notice and schedule it as needed. Should a workqueue be unnecessarily scheduled then the vmstat updater will disable it.

[akpm@linux-foundation.org: fix indentation, per Michal]
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org
Signed-off-by: Christoph Lameter
Cc: Tejun Heo
Acked-by: Michal Hocko
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit f61c42a7d911 ("memcg: remove tasks/children test from
mem_cgroup_force_empty()") removed memory reparenting from the function.Fix the function's comment.
Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
Signed-off-by: Greg Thelen
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
struct page->flags is unsigned long, so when shifting bits we should use the UL suffix to match it.

Found this problem after I added 64-bit CPU specific page flags and failed to compile the kernel:

mm/page_alloc.c: In function '__free_one_page':
mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]

Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com
Signed-off-by: Yu Zhao
Cc: "Kirill A . Shutemov"
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Jerome Marchand
Cc: Denys Vlasenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It's more convenient to use existing function helper to convert string
"on/off" to boolean.Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com
Signed-off-by: Minfei Huang
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When writeback operation cannot make forward progress because memory
allocation requests needed for doing I/O cannot be satisfied (e.g.
under OOM-livelock situation), we can observe flood of order-0 page
allocation failure messages caused by complete depletion of memory
reserves.

This is caused by unconditionally allocating "struct wb_writeback_work" objects using GFP_ATOMIC from PF_MEMALLOC context.

__alloc_pages_nodemask() {
__alloc_pages_slowpath() {
__alloc_pages_direct_reclaim() {
__perform_reclaim() {
current->flags |= PF_MEMALLOC;
try_to_free_pages() {
do_try_to_free_pages() {
wakeup_flusher_threads() {
wb_start_writeback() {
kzalloc(sizeof(*work), GFP_ATOMIC) {
/* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
}
}
}
}
}
current->flags &= ~PF_MEMALLOC;
}
}
}
}

Since I/O is stalling, allocating writeback requests forever shall
deplete memory reserves. Fortunately, since wb_start_writeback() can
fall back to wb_wakeup() when allocating "struct wb_writeback_work"
failed, we don't need to allow wb_start_writeback() to use memory
reserves.

Mem-Info:
active_anon:289393 inactive_anon:2093 isolated_anon:29
active_file:10838 inactive_file:113013 isolated_file:859
unevictable:0 dirty:108531 writeback:5308 unstable:0
slab_reclaimable:5526 slab_unreclaimable:7077
mapped:9970 shmem:2159 pagetables:2387 bounce:0
free:3042 free_pcp:0 free_cma:0
Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
lowmem_reserve[]: 0 1732 1732 1732
Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
126847 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
524157 pages RAM
0 pages HighMem/MovableOnly
76348 pages reserved
0 pages hwpoisoned
Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
kthreadd: page allocation failure: order:0, mode:0x2200020
file_io.00: page allocation failure: order:0, mode:0x2200020
CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
Call Trace:
warn_alloc_failed+0xf7/0x150
__alloc_pages_nodemask+0x23f/0xa60
alloc_pages_current+0x87/0x110
new_slab+0x3a1/0x440
___slab_alloc+0x3cf/0x590
__slab_alloc.isra.64+0x18/0x1d
kmem_cache_alloc+0x11c/0x150
wb_start_writeback+0x39/0x90
wakeup_flusher_threads+0x7f/0xf0
do_try_to_free_pages+0x1f9/0x410
try_to_free_pages+0x94/0xc0
__alloc_pages_nodemask+0x566/0xa60
alloc_pages_current+0x87/0x110
__page_cache_alloc+0xaf/0xc0
pagecache_get_page+0x88/0x260
grab_cache_page_write_begin+0x21/0x40
xfs_vm_write_begin+0x2f/0xf0
generic_perform_write+0xca/0x1c0
xfs_file_buffered_aio_write+0xcc/0x1f0
xfs_file_write_iter+0x84/0x140
__vfs_write+0xc7/0x100
vfs_write+0x9d/0x190
SyS_write+0x50/0xc0
entry_SYSCALL_64_fastpath+0x12/0x6a
Mem-Info:
active_anon:293335 inactive_anon:2093 isolated_anon:0
active_file:10829 inactive_file:110045 isolated_file:32
unevictable:0 dirty:109275 writeback:822 unstable:0
slab_reclaimable:5489 slab_unreclaimable:10070
mapped:9999 shmem:2159 pagetables:2420 bounce:0
free:3 free_pcp:0 free_cma:0
Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
lowmem_reserve[]: 0 1732 1732 1732
Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
123086 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
524157 pages RAM
0 pages HighMem/MovableOnly
76348 pages reserved
0 pages hwpoisoned
SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
node 0: slabs: 3218, objs: 205952, free: 0
file_io.00: page allocation failure: order:0, mode:0x2200020
CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45

Assuming that somebody will find a better solution, let's apply this
patch for now to stop the bleeding, as this problem frequently prevents me from testing the OOM livelock condition.

Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
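One way to avoid dipping into the reserves while still not sleeping is a sketch like this in wb_start_writeback() (the exact flag combination is an assumption):

	/* opportunistic allocation: never use emergency reserves,
	 * wb_wakeup() below is the fallback anyway */
	work = kzalloc(sizeof(*work),
		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!work) {
		wb_wakeup(wb);
		return;
	}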
Signed-off-by: Tetsuo Handa
Acked-by: Michal Hocko
Cc: Jan Kara
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Eric Engestrom
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If SPARSEMEM, use page_ext in mem_section
if !SPARSEMEM, use page_ext in pgdata

Signed-off-by: Weijie Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It is used as a pure bool function throughout the kernel source.
Signed-off-by: Chen Gang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The HUGETLBFS_SB macro is clear enough, so one statement is clearer than a 3-line statement.

Remove redundant return statements from functions that return nothing, which can save lines, at least.

Signed-off-by: Chen Gang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Put the activate_page_pvecs definition next to those of the other
pagevecs, for clarity.

Signed-off-by: Ming Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
copy_page_to_iter_iovec() is currently the only user of
fault_in_pages_writeable(), and it definitely can use fragments from
high order pages.

Make sure fault_in_pages_writeable() is only touching two adjacent pages at most, as claimed.

Signed-off-by: Eric Dumazet
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The page_counter rounds limits down to page size values. This makes
sense, except in the case of hugetlb_cgroup where it's not possible to
charge partial hugepages. If the hugetlb_cgroup margin is less than the
hugepage size being charged, it will fail as expected.

Round the hugetlb_cgroup limit down to hugepage size, since it is the effective limit of the cgroup (for example, with a 2MB hugepage size, a requested 5MB limit is effectively a 4MB limit).

For consistency, round down PAGE_COUNTER_MAX as well when a
hugetlb_cgroup is created: this prevents error reports when a user
cannot restore the value to the kernel default.

Signed-off-by: David Rientjes
Cc: Michal Hocko
Cc: Nikolay Borisov
Cc: Johannes Weiner
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The nommu do_mmap expects f_op->get_unmapped_area to either succeed or
return -ENOSYS for VM_MAYSHARE (e.g. private read-only) mappings.
Returning addr in the non-MAP_SHARED case was completely wrong, and only
happened to work because addr was 0. However, it prevented VM_MAYSHARE
mappings from sharing backing with the fs cache, and forced such
mappings (including shareable program text) to be copied whenever the
number of mappings transitioned from 0 to 1, impacting performance and
memory usage. Subsequent mappings beyond the first still correctly
shared memory with the first.

Instead, treat VM_MAYSHARE identically to VM_SHARED at the file ops level; do_mmap already handles the semantic differences between them.

Signed-off-by: Rich Felker
Cc: Michal Hocko
Cc: Greg Ungerer
Cc: Geert Uytterhoeven
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since commit 84638335900f ("mm: rework virtual memory accounting")
RLIMIT_DATA limits both brk() and private mmap(), but this is disabled by default because of incompatibility with older versions of valgrind.

Valgrind always sets the limit to zero and fails if RLIMIT_DATA is enabled.
Fortunately it changes only rlim_cur and keeps rlim_max for reverting
limit back when needed.

This patch checks current usage also against rlim_max if rlim_cur is zero. This is safe because the task can anyway increase rlim_cur up to
rlim_max. Size of brk is still checked against rlim_cur, so this part
is completely compatible - zero rlim_cur forbids brk() but allows
private mmap().

Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
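A sketch of the intended check, in the style of may_expand_vm() (illustrative rather than the exact patch):

	/* valgrind workaround: rlim_cur == 0 forbids brk() but still
	 * allows private mmap() as long as rlim_max is not exceeded */
	if (rlimit(RLIMIT_DATA) == 0 &&
	    mm->data_vm + npages <= rlimit_max(RLIMIT_DATA) >> PAGE_SHIFT)
		return true;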
Signed-off-by: Konstantin Khlebnikov
Acked-by: Linus Torvalds
Cc: Cyrill Gorcunov
Cc: Christian Borntraeger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We use generic hooks in remap_pfn_range() to help archs to track pfnmap
regions. The code is something like:

int remap_pfn_range()
{
...
track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
...
pfn -= addr >> PAGE_SHIFT;
...
untrack_pfn(vma, pfn, PAGE_ALIGN(size));
...
}

Here we can easily see that the pfn is changed but not restored before untrack_pfn() is called. That's incorrect.

There are no known runtime effects - this is from inspection.
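The fix, sketched in the same style: remember the original pfn before it is adjusted (orig_pfn is an illustrative name):

int remap_pfn_range()
{
	unsigned long orig_pfn = pfn;
	...
	track_pfn_remap(vma, &prot, orig_pfn, addr, PAGE_ALIGN(size));
	...
	pfn -= addr >> PAGE_SHIFT;
	...
	untrack_pfn(vma, orig_pfn, PAGE_ALIGN(size));
	...
}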
Signed-off-by: Yongji Xie
Cc: Kirill A. Shutemov
Cc: Jerome Marchand
Cc: Ingo Molnar
Cc: Vlastimil Babka
Cc: Dave Hansen
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: Andrea Arcangeli
Cc: Michal Hocko
Cc: Andy Lutomirski
Cc: David Hildenbrand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When mixing lots of vmallocs and set_memory_*() (which calls
vm_unmap_aliases()) I encountered situations where the performance
degraded severely due to the walking of the entire vmap_area list each
invocation.

One simple improvement is to add the lazily freed vmap_area to a
separate lockless free list, such that we then avoid having to walk the
full list on each purge.

Signed-off-by: Chris Wilson
Reviewed-by: Roman Pen
Cc: Joonas Lahtinen
Cc: Tvrtko Ursulin
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Roman Pen
Cc: Mel Gorman
Cc: Toshi Kani
Cc: Shawn Lin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
memblock_add_region() and memblock_reserve_region() do nothing specific
before the call of memblock_add_range(); they only print debug output.

We can do the same in memblock_add() and memblock_reserve() since both
memblock_add_region() and memblock_reserve_region() are not used by
anybody outside of memblock.c and memblock_{add,reserve}() have the same
set of flags and nids.

Since memblock_add_region() and memblock_reserve_region() will be inlined, there will be no functional changes, but code readability will improve a little.

Signed-off-by: Alexander Kuleshov
Acked-by: Ard Biesheuvel
Cc: Mel Gorman
Cc: Pekka Enberg
Cc: Tony Luck
Cc: Tang Chen
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
HWPoison was specific to some particular x86 platforms, and it is often seen as a high-level machine check handler. Therefore, 'MCE' is used
for the format prefix of printk(). However, 'PowerNV' has also used
HWPoison for handling memory errors[1], so 'MCE' is no longer suitable
to memory_failure.c.

Additionally, 'MCE' and 'Memory failure' have different contexts. The
former belongs to exception context and the latter belongs to process
context. Furthermore, HWPoison can also be used for off-lining those
sub-health pages that do not trigger any machine check exception.

This patch aims to replace 'MCE' with a more appropriate prefix.
[1] commit 75eb3d9b60c2 ("powerpc/powernv: Get FSP memory errors
and plumb into memory poison infrastructure.")

Signed-off-by: Chen Yucong
Acked-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The implementation of mk_huge_pmd looks verbose; it can be simplified to one line of code.

Signed-off-by: Yang Shi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since commit 3a5dda7a17cf ("oom: prevent unnecessary oom kills or kernel
panics"), select_bad_process() is using for_each_process_thread().Since oom_unkillable_task() scans all threads in the caller's thread
group and oom_task_origin() scans signal_struct of the caller's thread
group, we don't need to call oom_unkillable_task() and oom_task_origin()
on each thread. Also, since !mm test will be done later at
oom_badness(), we don't need to do !mm test on each thread. Therefore,
we only need to do the TIF_MEMDIE test on each thread.

Although the original code was correct, it was quite inefficient because
each thread group was scanned num_threads times which can be a lot
especially with processes with many threads. Even though the OOM is
extremely cold path, it is always good to be as efficient as possible when we are inside rcu_read_lock() - i.e., unpreemptible context.

If we track the number of TIF_MEMDIE threads inside signal_struct, we don't need to do the TIF_MEMDIE test on each thread. This will allow
select_bad_process() to use for_each_process().

This patch adds a counter to signal_struct for tracking how many
TIF_MEMDIE threads are in a given thread group, and check it at
oom_scan_process_thread() so that select_bad_process() can use
for_each_process() rather than for_each_process_thread().

[mhocko@suse.com: do not blow the signal_struct size]
Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa
Acked-by: Michal Hocko
Cc: David Rientjes
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
task_will_free_mem is a misnomer for a more complex PF_EXITING test for
early break out from the oom killer because it is believed that such a
task would release its memory shortly and so we do not have to select an
oom victim and perform a disruptive action.

Currently we make sure that the given task is not participating in the
core dumping because it might get blocked for a long time - see commit
d003f371b270 ("oom: don't assume that a coredumping thread will exit
soon").The check can still do better though. We shouldn't consider the task
unless the whole thread group is going down. This is rather unlikely
but not impossible. A single exiting thread would surely leave all the
address space behind. If we are really unlucky it might get stuck on
the exit path and keep its TIF_MEMDIE and so block the oom killer.

Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Tetsuo Handa
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds