16 Jul, 2017
1 commit
-
Pull ->s_options removal from Al Viro:
"Preparations for fsmount/fsopen stuff (coming next cycle). Everything
gets moved to explicit ->show_options(), killing ->s_options off +
some cosmetic bits around fs/namespace.c and friends. Basically, the
stuff needed to work with fsmount series with minimum of conflicts
with other work.It's not strictly required for this merge window, but it would reduce
the PITA during the coming cycle, so it would be nice to have those
bits and pieces out of the way"* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
isofs: Fix isofs_show_options()
VFS: Kill off s_options and helpers
orangefs: Implement show_options
9p: Implement show_options
isofs: Implement show_options
afs: Implement show_options
affs: Implement show_options
befs: Implement show_options
spufs: Implement show_options
bpf: Implement show_options
ramfs: Implement show_options
pstore: Implement show_options
omfs: Implement show_options
hugetlbfs: Implement show_options
VFS: Don't use save/replace_mount_options if not using generic_show_options
VFS: Provide empty name qstr
VFS: Make get_filesystem() return the affected filesystem
VFS: Clean up whitespace in fs/namespace.c and fs/super.c
Provide a function to create a NUL-terminated string from unterminated data
15 Jul, 2017
1 commit
-
Jörn Engel noticed that the expand_upwards() function might not return
-ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and
if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE.Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
which all define TASK_SIZE as 0xffffffff, but since none of those have
an upwards-growing stack we currently have no actual issue.Nevertheless let's fix this just in case any of the architectures with
an upward-growing stack (currently parisc, metag and partly ia64) define
TASK_SIZE similar.Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box
Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit")
Signed-off-by: Helge Deller
Reported-by: Jörn Engel
Cc: Hugh Dickins
Cc: Oleg Nesterov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Jul, 2017
5 commits
-
Currently the writeback statistics code uses a percpu counters to hold
various statistics. Furthermore we have 2 families of functions - those
which disable local irq and those which doesn't and whose names begin
with double underscore. However, they both end up calling
__add_wb_stats which in turn calls percpu_counter_add_batch which is
already irq-safe.Exploiting this fact allows to eliminated the __wb_* functions since
they don't add any further protection than we already have.
Furthermore, refactor the wb_* function to call __add_wb_stat directly
without the irq-disabling dance. This will likely result in better
runtime of code which deals with modifying the stat counters.While at it also document why percpu_counter_add_batch is in fact
preempt and irq-safe since at least 3 people got confused.Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov
Acked-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Page migration (for memory hotplug, soft_offline_page or mbind) needs to
allocate a new memory. This can trigger an oom killer if the target
memory is depleated. Although quite unlikely, still possible,
especially for the memory hotplug (offlining of memoery).Up to now we didn't really have reasonable means to back off.
__GFP_NORETRY can fail just too easily and __GFP_THISNODE sticks to a
single node and that is not suitable for all callers.But now that we have __GFP_RETRY_MAYFAIL we should use it. It is
preferable to fail the migration than disrupt the system by killing some
processes.Link: http://lkml.kernel.org/r/20170623085345.11304-7-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Alex Belits
Cc: Chris Wilson
Cc: Christoph Hellwig
Cc: Darrick J. Wong
Cc: David Daney
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: NeilBrown
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
request size we can drop the hackish implementation for !costly orders.
__GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
progress and backs of when we are out of memory for the requested size.
Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
to silent the oom killer anymore.Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Alex Belits
Cc: Chris Wilson
Cc: Christoph Hellwig
Cc: Darrick J. Wong
Cc: David Daney
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: NeilBrown
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always
ignored for smaller sizes. This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
attempt to free memory at all. The most light weight mode which even
doesn't kick the background reclaim. Should be used carefully because
it might deplete the memory and the next user might hit the more
aggressive reclaim- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
allocation without any attempt to free memory from the current
context but can wake kswapd to reclaim memory if the zone is below
the low watermark. Can be used from either atomic contexts or when
the request is a performance optimization and there is another
fallback for a slow path.- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
non sleeping allocation with an expensive fallback so it can access
some portion of memory reserves. Usually used from interrupt/bh
context with an expensive slow path fallback.- GFP_KERNEL - both background and direct reclaim are allowed and the
_default_ page allocator behavior is used. That means that !costly
allocation requests are basically nofail but there is no guarantee of
that behavior so failures have to be checked properly by callers
(e.g. OOM killer victim is allowed to fail currently).- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
and all allocation requests fail early rather than cause disruptive
reclaim (one round of reclaim in this implementation). The OOM killer
is not invoked.- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
behavior and all allocation requests try really hard. The request
will fail if the reclaim cannot make any progress. The OOM killer
won't be triggered.- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
and all allocation requests will loop endlessly until they succeed.
This might be really dangerous especially for larger orders.Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Alex Belits
Cc: Chris Wilson
Cc: Christoph Hellwig
Cc: Darrick J. Wong
Cc: David Daney
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: NeilBrown
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
With gcc 4.1.2:
mm/memory.o: In function `create_huge_pmd':
memory.c:(.text+0x93e): undefined reference to `do_huge_pmd_anonymous_page'Interestingly, create_huge_pmd() is emitted in the assembler output, but
never called.Converting transparent_hugepage_enabled() from a macro to a static
inline function reduced the ability of the compiler to remove unused
code.Fix this by marking create_huge_pmd() inline.
Fixes: 16981d763501c0e0 ("mm: improve readability of transparent_hugepage_enabled()")
Link: http://lkml.kernel.org/r/1499842660-10665-1-git-send-email-geert@linux-m68k.org
Signed-off-by: Geert Uytterhoeven
Acked-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Jul, 2017
33 commits
-
The helper function get_wild_bug_type() does not need to be in global
scope, so make it static.Cleans up sparse warning:
"symbol 'get_wild_bug_type' was not declared. Should it be static?"
Link: http://lkml.kernel.org/r/20170622090049.10658-1-colin.king@canonical.com
Signed-off-by: Colin Ian King
Acked-by: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: Alexander Potapenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
They return positive value, that is, true, if non-zero value is found.
Rename them to reduce confusion.Link: http://lkml.kernel.org/r/20170516012350.GA16015@js1304-desktop
Signed-off-by: Joonsoo Kim
Cc: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
KASAN doesn't happen work with memory hotplug because hotplugged memory
doesn't have any shadow memory. So any access to hotplugged memory
would cause a crash on shadow check.Use memory hotplug notifier to allocate and map shadow memory when the
hotplugged memory is going online and free shadow after the memory
offlined.Link: http://lkml.kernel.org/r/20170601162338.23540-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Cc: "H. Peter Anvin"
Cc: Alexander Potapenko
Cc: Catalin Marinas
Cc: Dmitry Vyukov
Cc: Ingo Molnar
Cc: Ingo Molnar
Cc: Mark Rutland
Cc: Thomas Gleixner
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
For some unaligned memory accesses we have to check additional byte of
the shadow memory. Currently we load that byte speculatively to have
only single load + branch on the optimistic fast path.However, this approach has some downsides:
- It's unaligned access, so this prevents porting KASAN on
architectures which doesn't support unaligned accesses.- We have to map additional shadow page to prevent crash if speculative
load happens near the end of the mapped memory. This would
significantly complicate upcoming memory hotplug support.I wasn't able to notice any performance degradation with this patch. So
these speculative loads is just a pain with no gain, let's remove them.Link: http://lkml.kernel.org/r/20170601162338.23540-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Acked-by: Dmitry Vyukov
Cc: Alexander Potapenko
Cc: Mark Rutland
Cc: Catalin Marinas
Cc: Will Deacon
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There is missing optimization in zero_p4d_populate() that can save some
memory when mapping zero shadow. Implement it like as others.Link: http://lkml.kernel.org/r/1494829255-23946-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim
Acked-by: Andrey Ryabinin
Cc: "Kirill A . Shutemov"
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
zsmalloc to allocate an object of the maximal size (ZS_MAX_ALLOC_SIZE).
I think however the fix is unneededly complicated.This patch replaces the dynamic calculation of zs_size_classes at init
time by a compile time calculation that uses the DIV_ROUND_UP() macro
already used in get_size_class_index().[akpm@linux-foundation.org: use min_t]
Link: http://lkml.kernel.org/r/20170630114859.1979-1-jmarchan@redhat.com
Signed-off-by: Jerome Marchand
Acked-by: Minchan Kim
Cc: Sergey Senozhatsky
Cc: Mahendran Ganesh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Andrey reported a potential deadlock with the memory hotplug lock and
the cpu hotplug lock.The reason is that memory hotplug takes the memory hotplug lock and then
calls stop_machine() which calls get_online_cpus(). That's the reverse
lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.cThe problem has been there forever. The reason why this was never
reported is that the cpu hotplug locking had this homebrewn recursive
reader writer semaphore construct which due to the recursion evaded the
full lock dep coverage. The memory hotplug code copied that construct
verbatim and therefor has similar issues.Three steps to fix this:
1) Convert the memory hotplug locking to a per cpu rwsem so the
potential issues get reported proper by lockdep.2) Lock the online cpus in mem_hotplug_begin() before taking the memory
hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
code to avoid recursive locking.3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu
hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
by invoking lru_add_drain_all_cpuslocked() instead.Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
Reported-by: Andrey Ryabinin
Signed-off-by: Thomas Gleixner
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Vladimir Davydov
Cc: Peter Zijlstra
Cc: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The rework of the cpu hotplug locking unearthed potential deadlocks with
the memory hotplug locking code.The solution for these is to rework the memory hotplug locking code as
well and take the cpu hotplug lock before the memory hotplug lock in
mem_hotplug_begin(), but this will cause a recursive locking of the cpu
hotplug lock when the memory hotplug code calls lru_add_drain_all().Split out the inner workings of lru_add_drain_all() into
lru_add_drain_all_cpuslocked() so this function can be invoked from the
memory hotplug code with the cpu hotplug lock held.Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de
Signed-off-by: Thomas Gleixner
Reported-by: Andrey Ryabinin
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Vladimir Davydov
Cc: Peter Zijlstra
Cc: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use rlimit() helper instead of manually writing whole chain from current
task to rlim_cur.Link: http://lkml.kernel.org/r/20170705172811.8027-1-k.opasiak@samsung.com
Signed-off-by: Krzysztof Opasiak
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
list_lru_count_node() iterates over all memcgs to get the total number of
entries on the node but it can race with memcg_drain_all_list_lrus(),
which migrates the entries from a dead cgroup to another. This can return
incorrect number of entries from list_lru_count_node().Fix this by keeping track of entries per node and simply return it in
list_lru_count_node().Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org
Signed-off-by: Sahitya Tummala
Acked-by: Vladimir Davydov
Cc: Jan Kara
Cc: Alexander Polakov
Cc: Al Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
expand_stack(vma) fails if address < stack_guard_gap even if there is no
vma->vm_prev. I don't think this makes sense, and we didn't do this
before the recent commit 1be7107fbe18 ("mm: larger stack guard gap,
between vmas").We do not need a gap in this case, any address is fine as long as
security_mmap_addr() doesn't object.This also simplifies the code, we know that address >= prev->vm_end and
thus underflow is not possible.Link: http://lkml.kernel.org/r/20170628175258.GA24881@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Michal Hocko
Cc: Hugh Dickins
Cc: Larry Woodman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
introduced a regression in some rust and Java environments which are
trying to implement their own stack guard page. They are punching a new
MAP_FIXED mapping inside the existing stack Vma.This will confuse expand_{downwards,upwards} into thinking that the
stack expansion would in fact get us too close to an existing non-stack
vma which is a correct behavior wrt safety. It is a real regression on
the other hand.Let's work around the problem by considering PROT_NONE mapping as a part
of the stack. This is a gros hack but overflowing to such a mapping
would trap anyway an we only can hope that usespace knows what it is
doing and handle it propely.Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Debugged-by: Vlastimil Babka
Cc: Ben Hutchings
Cc: Willy Tarreau
Cc: Oleg Nesterov
Cc: Rik van Riel
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
presently pages in the balloon device have random value, and these pages
will be scanned by ksmd on the host. They usually cannot be merged.
Enqueue zero pages will resolve this problem.Link: http://lkml.kernel.org/r/1498698637-26389-1-git-send-email-zhenwei.pi@youruncloud.com
Signed-off-by: zhenwei.pi
Cc: Gioh Kim
Cc: Vlastimil Babka
Cc: Minchan Kim
Cc: Konstantin Khlebnikov
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The align_offset parameter is used by bitmap_find_next_zero_area_off()
to represent the offset of map's base from the previous alignment
boundary; the function ensures that the returned index, plus the
align_offset, honors the specified align_mask.The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical
address, not CMA region position") has the cma driver calculate the
offset to the *next* alignment boundary. In most cases, the base
alignment is greater than that specified when making allocations,
resulting in a zero offset whether we align up or down. In the example
given with the commit, the base alignment (8MB) was half the requested
alignment (16MB) so the math also happened to work since the offset is
8MB in both directions. However, when requesting allocations with an
alignment greater than twice that of the base, the returned index would
not be correctly aligned.Also, the align_order arguments of cma_bitmap_aligned_mask() and
cma_bitmap_aligned_offset() should not be negative so the argument type
was made unsigned.Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position")
Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com
Signed-off-by: Angus Clark
Signed-off-by: Doug Berger
Acked-by: Gregory Fong
Cc: Doug Berger
Cc: Angus Clark
Cc: Laura Abbott
Cc: Vlastimil Babka
Cc: Greg Kroah-Hartman
Cc: Lucas Stach
Cc: Catalin Marinas
Cc: Shiraz Hashim
Cc: Jaewon Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__remove_zone() sets up up zone_type, but never uses it for anything.
This does not cause a warning, due to the (necessary) use of
-Wno-unused-but-set-variable. However, it's noise, so just delete it.Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
get_cpu_var() disables preemption and returns the per-CPU version of the
variable. Disabling preemption is useful to ensure atomic access to the
variable within the critical section.In this case however, after the per-CPU version of the variable is
obtained the ->free_lock is acquired. For that reason it seems the raw
accessor could be used. It only seems that ->slots_ret should be
retested (because with disabled preemption this variable can not be set
to NULL otherwise).This popped up during PREEMPT-RT testing because it tries to take
spinlocks in a preempt disabled section. In RT, spinlocks can sleep.Link: http://lkml.kernel.org/r/20170623114755.2ebxdysacvgxzott@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior
Acked-by: Michal Hocko
Cc: Tim Chen
Cc: Thomas Gleixner
Cc: Ying Huang
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since current_order starts as MAX_ORDER-1 and is then only decremented,
the second half of the loop condition seems superfluous. However, if
order is 0, we may decrement current_order past 0, making it UINT_MAX.
This is obviously too subtle ([1], [2]).Since we need to add some comment anyway, change the two variables to
signed, making the counting-down for loop look more familiar, and
apparently also making gcc generate slightly smaller code.[1] https://lkml.org/lkml/2016/6/20/493
[2] https://lkml.org/lkml/2017/6/19/345[akpm@linux-foundation.org: fix up reject fixupping]
Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes
Reported-by: Hao Lee
Acked-by: Wei Yang
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
pagetypeinfo_showmixedcount_print is found to take a lot of time to
complete and it does this holding the zone lock and disabling
interrupts. In some cases it is found to take more than a second (On a
2.4GHz,8Gb RAM,arm64 cpu).Avoid taking the zone lock similar to what is done by read_page_owner,
which means possibility of inaccurate results.Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon
Acked-by: Vlastimil Babka
Cc: Joonsoo Kim
Cc: zhongjiang
Cc: Sergey Senozhatsky
Cc: Sudip Mukherjee
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Sebastian Andrzej Siewior
Cc: David Rientjes
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
new_page is yet another duplication of the migration callback which has
to handle hugetlb migration specially. We can safely use the generic
new_page_nodemask for the same purpose.Please note that gigantic hugetlb pages do not need any special handling
because alloc_huge_page_nodemask will make sure to check pages in all
per node pools. The reason this was done previously was that
alloc_huge_page_node treated NO_NUMA_NODE and a specific node
differently and so alloc_huge_page_node(nid) would check on this
specific node.Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Reported-by: Vlastimil Babka
Reviewed-by: Mike Kravetz
Tested-by: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
alloc_huge_page_nodemask tries to allocate from any numa node in the
allowed node mask starting from lower numa nodes. This might lead to
filling up those low NUMA nodes while others are not used. We can
reduce this risk by introducing a concept of the preferred node similar
to what we have in the regular page allocator. We will start allocating
from the preferred nid and then iterate over all allowed nodes in the
zonelist order until we try them all.This is mimicing the page allocator logic except it operates on per-node
mempools. dequeue_huge_page_vma already does this so distill the
zonelist logic into a more generic dequeue_huge_page_nodemask and use it
in alloc_huge_page_nodemask.This will allow us to use proper per numa distance fallback also for
alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
can get rid of alloc_huge_page_node helper which doesn't have any user
anymore.Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Reviewed-by: Mike Kravetz
Tested-by: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "mm, hugetlb: allow proper node fallback dequeue".
While working on a hugetlb migration issue addressed in a separate
patchset[1] I have noticed that the hugetlb allocations from the
preallocated pool are quite subotimal.[1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
There is no fallback mechanism implemented and no notion of preferred
node. I have tried to work around it but Vlastimil was right to push
back for a more robust solution. It seems that such a solution is to
reuse zonelist approach we use for the page alloctor.This series has 3 patches. The first one tries to make hugetlb
allocation layers more clear. The second one implements the zonelist
hugetlb pool allocation and introduces a preferred node semantic which
is used by the migration callbacks. The last patch is a clean up.This patch (of 3):
Hugetlb allocation path for fresh huge pages is unnecessarily complex
and it mixes different interfaces between layers.__alloc_buddy_huge_page is the central place to perform a new
allocation. It checks for the hugetlb overcommit and then relies on
__hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is
all good except that __alloc_buddy_huge_page pushes vma and address down
the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
two different allocation modes - one for memory policy and other node
specific (or to make it more obscure node non-specific) requests.This just screams for a reorganization.
This patch pulls out all the vma specific handling up to
__alloc_buddy_huge_page_with_mpol where it belongs.
__alloc_buddy_huge_page will get nodemask argument and
__hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
page allocator.In short:
__alloc_buddy_huge_page_with_mpol - memory policy handling
__alloc_buddy_huge_page - overcommit handling and accounting
__hugetlb_alloc_buddy_huge_page - page allocator layerAlso note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
is not really needed because the page allocator already handles the
cpusets update.Finally __hugetlb_alloc_buddy_huge_page had a special case for node
specific allocations (when no policy is applied and there is a node
given). This has relied on __GFP_THISNODE to not fallback to a different
node. alloc_huge_page_node is the only caller which relies on this
behavior so move the __GFP_THISNODE there.Not only does this remove quite some code it also should make those
layers easier to follow and clear wrt responsibilities.Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Reviewed-by: Mike Kravetz
Tested-by: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
During the debugging of the problem described in
https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
output is not really useful to understand issues related to the oom
reaper.So, I assume, that adding some tracepoints might help with debugging of
similar issues.Trace the following events:
1) a process is marked as an oom victim,
2) a process is added to the oom reaper list,
3) the oom reaper starts reaping process's mm,
4) the oom reaper finished reaping,
5) the oom reaper skips reaping.How it works in practice? Below is an example which show how the problem
mentioned above can be found: one process is added twice to the
oom_reaper list:$ cd /sys/kernel/debug/tracing
$ echo "oom:mark_victim" > set_event
$ echo "oom:wake_reaper" >> set_event
$ echo "oom:skip_task_reaping" >> set_event
$ echo "oom:start_task_reaping" >> set_event
$ echo "oom:finish_task_reaping" >> set_event
$ cat trace_pipe
allocate-502 [001] .... 91.836405: mark_victim: pid=502
allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502
allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502
oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502
oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502
oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
Signed-off-by: Roman Gushchin
Acked-by: Michal Hocko
Cc: Tetsuo Handa
Cc: Johannes Weiner
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd
monitor. The monitor has to stop handling #PF events in the range being
freed. We are reusing userfaultfd_remove callback along with the logic
required to re-get and re-validate the VMA which may change or disappear
because userfaultfd_remove releases mmap_sem.Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Cc: Andrea Arcangeli
Cc: Hillf Danton
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The condition checking for THP straddling end of invalidated range is
wrong - it checks 'index' against 'end' but 'index' has been already
advanced to point to the end of THP and thus the condition can never be
true. As a result THP straddling 'end' has been fully invalidated.
Given the nature of invalidate_mapping_pages(), this could be only
performance issue. In fact, we are lucky the condition is wrong because
if it was ever true, we'd leave locked page behind.Fix the condition checking for THP straddling 'end' and also properly
unlock the page. Also update the comment before the condition to
explain why we decide not to invalidate the page as it was not clear to
me and I had to ask Kirill.Link: http://lkml.kernel.org/r/20170619124723.21656-1-jack@suse.cz
Signed-off-by: Jan Kara
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The hugetlb code has its own function to report human-readable sizes.
Convert it to use the shared string_get_size() function. This will lead
to a minor difference in user visible output (MiB/GiB instead of MB/GB),
but some would argue that's desirable anyway.Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org
Signed-off-by: Matthew Wilcox
Cc: Liam R. Howlett
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Aneesh Kumar K.V
Cc: Gerald Schaefer
Cc: zhong jiang
Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Alice has reported the following UBSAN splat:
UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
signed integer overflow:
-2147483644 - 2147483525 cannot be represented in type 'long int'
CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4
Hardware name: XXXXXX, BIOS YYYYYY
Call Trace:
dump_stack+0x59/0x87
ubsan_epilogue+0xe/0x40
handle_overflow+0xbb/0xf0
__ubsan_handle_sub_overflow+0x12/0x20
memcg_check_events.isra.36+0x223/0x360
mem_cgroup_commit_charge+0x55/0x140
wp_page_copy+0x34e/0xb80
do_wp_page+0x1e6/0x1300
handle_mm_fault+0x88b/0x1990
__do_page_fault+0x2de/0x8a0
do_page_fault+0x1a/0x20
error_code+0x67/0x6cThe reason is that we subtract two signed types. Let's fix this by
truly mimicing time_after and cast the result of the subtraction.Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Reported-by: Alice Ferrazzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A few hugetlb allocators loop while calling the page allocator and can
potentially prevent rescheduling if the page allocator slowpath is not
utilized.Conditionally schedule when large numbers of hugepages can be allocated.
Anshuman:
"Fixes a task which was getting hung while writing like 10000 hugepages
(16MB on POWER8) into /proc/sys/vm/nr_hugepages."Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Reviewed-by: Mike Kravetz
Tested-by: Anshuman Khandual
Cc: Naoya Horiguchi
Cc: "Aneesh Kumar K.V"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") has duplicated a large part of
alloc_migrate_target with some hotplug specific special casing.To be more precise it tried to enfore the allocation from a different
node than the original page. As a result the two function diverged in
their shared logic, e.g. the hugetlb allocation strategy.Let's unify the two and express different NUMA requirements by the given
nodemask. new_node_page will simply exclude the node it doesn't care
about and alloc_migrate_target will use all the available nodes.
alloc_migrate_target will then learn to migrate hugetlb pages more
sanely and use preallocated pool when possible.Please note that alloc_migrate_target used to call alloc_page resp.
alloc_pages_current so the memory policy of the current context which is
quite strange when we consider that it is used in the context of
alloc_contig_range which just tries to migrate pages which stand in the
way.Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Naoya Horiguchi
Cc: Xishi Qiu
Cc: zhong jiang
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
new_node_page will try to use the origin's next NUMA node as the
migration destination for hugetlb pages. If such a node doesn't have
any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
to allocate a surplus page instead. This is quite subotpimal for any
configuration when hugetlb pages are no distributed to all NUMA nodes
evenly. Say we have a hotplugable node 4 and spare hugetlb pages are
node 0/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
/sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0Now we consume the whole pool on node 4 and try to offline this node.
All the allocated pages should be moved to node0 which has enough
preallocated pages to hold them. With the current implementation
offlining very likely fails because hugetlb allocations during runtime
are much less reliable.Fix this by reusing the nodemask which excludes migration source and try
to find a first node which has a page in the preallocated pool first and
fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
consumed.[akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Naoya Horiguchi
Cc: Xishi Qiu
Cc: zhong jiang
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
new_node_page tries to allocate the target page on a different NUMA node
than the source page. This makes sense in most cases during the hotplug
because we are likely to offline the whole numa node. But there are
cases where there are no other nodes to fallback (e.g. when offlining
parts of the only existing node) and we have to fallback to allocating
from the source node. The current code does that but it can be
simplified by checking the nmask and updating it before we even try to
allocate rather than special casing it.This patch shouldn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Naoya Horiguchi
Cc: Xishi Qiu
Cc: zhong jiang
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
movable_node kernel parameter allows making hotpluggable NUMA nodes to
put all the hotplugable memory into movable zone which allows more or
less reliable memory hotremove. At least this is the case for the NUMA
nodes present during the boot (see find_zone_movable_pfns_for_nodes).This is not the case for the memory hotplug, though.
echo online > /sys/devices/system/memory/memoryXYZ/state
will default to a kernel zone (usually ZONE_NORMAL) unless the
particular memblock is already in the movable zone range which is not
the case normally when onlining the memory from the udev rule context
for a freshly hotadded NUMA node. The only option currently is to have
a special udev rule to echo online_movable to all memblocks belonging to
such a node which is rather clumsy. Not to mention this is inconsistent
as well because what ended up in the movable zone during the boot will
end up in a kernel zone after hotremove & hotadd without special care.It would be nice to reuse memblock_is_hotpluggable but the runtime
hotplug doesn't have that information available because the boot and
hotplug paths are not shared and it would be really non trivial to make
them use the same code path because the runtime hotplug doesn't play
with the memblock allocator at all.Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
movable_node is enabled and the range doesn't overlap with the existing
normal zone. This should provide a reasonable default onlining
strategy.Strictly speaking the semantic is not identical with the boot time
initialization because find_zone_movable_pfns_for_nodes covers only the
hotplugable range as described by the BIOS/FW. From my experience this
is usually a full node though (except for Node0 which is special and
never goes away completely). If this turns out to be a problem in the
real life we can tweak the code to store hotplug flag into memblocks but
let's keep this simple now.Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Acked-by: Reza Arbab
Cc: Mel Gorman
Cc: Andrea Arcangeli
Cc: Yasuaki Ishimatsu
Cc:
Cc: Kani Toshimitsu
Cc:
Cc: Joonsoo Kim
Cc: Andi Kleen
Cc: David Rientjes
Cc: Daniel Kiper
Cc: Igor Mammedov
Cc: Vitaly Kuznetsov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When migrating a transparent hugepage, migrate_misplaced_transhuge_page
guards itself against a concurrent fastgup of the page by checking that
the page count is equal to 2 before and after installing the new pmd.If the page count changes, then the pmd is reverted back to the original
entry, however there is a small window where the new (possibly writable)
pmd is installed and the underlying page could be written by userspace.
Restoring the old pmd could therefore result in loss of data.This patch fixes the problem by freezing the page count whilst updating
the page tables, which protects against a concurrent fastgup without the
need to restore the old pmd in the failure case (since the page count
can no longer change under our feet).Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com
Signed-off-by: Will Deacon
Acked-by: Kirill A. Shutemov
Cc: Mel Gorman
Cc: Mark Rutland
Cc: Steve Capper
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When the user specifies too many hugepages or an invalid
default_hugepagesz the communication to the user is implicit in the
allocation message. This patch adds a warning when the desired page
count is not allocated and prints an error when the default_hugepagesz
is invalid on boot.During boot hugepages will allocate until there is a fraction of the
hugepage size left. That is, we allocate until either the request is
satisfied or memory for the pages is exhausted. When memory for the
pages is exhausted, it will most likely lead to the system failing with
the OOM manager not finding enough (or anything) to kill (unless you're
using really big hugepages in the order of 100s of MB or in the GBs).
The user will most likely see the OOM messages much later in the boot
sequence than the implicitly stated message. Worse yet, you may even
get an OOM for each processor which causes many pages of OOMs on modern
systems. Although these messages will be printed earlier than the OOM
messages, at least giving the user errors and warnings will highlight
the configuration as an issue. I'm trying to point the user in the
right direction by providing a more robust statement of what is failing.During the sysctl or echo command, the user can check the results much
easier than if the system hangs during boot and the scenario of having
nothing to OOM for kernel memory is highly unlikely.Mike said:
"Before sending out this patch, I asked Liam off list why he was doing
it. Was it something he just thought would be useful? Or, was there
some type of user situation/need. He said that he had been called in
to assist on several occasions when a system OOMed during boot. In
almost all of these situations, the user had grossly misconfigured
huge pages.DB users want to pre-allocate just the right amount of huge pages, but
sometimes they can be really off. In such situations, the huge page
init code just allocates as many huge pages as it can and reports the
number allocated. There is no indication that it quit allocating
because it ran out of memory. Of course, a user could compare the
number in the message to what they requested on the command line to
determine if they got all the huge pages they requested. The thought
was that it would be useful to at least flag this situation. That way,
the user might be able to better relate the huge page allocation
failure to the OOM.I'm not sure if the e-mail discussion made it obvious that this is
something he has seen on several occasions.I see Michal's point that this will only flag the situation where
someone configures huge pages very badly. And, a more extensive look
at the situation of misconfiguring huge pages might be in order. But,
this has happened on several occasions which led to the creation of
this patch"[akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Aneesh Kumar K.V
Cc: Gerald Schaefer
Cc: zhongjiang
Cc: Andrea Arcangeli
Cc: "Kirill A . Shutemov"
Cc: Mike Kravetz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds