03 Sep, 2020

1 commit

  • [ Upstream commit e47110e90584a22e9980510b00d0dfad3a83354e ]

    Like zap_pte_range(), add cond_resched() so that we can avoid softlockups
    as reported below. On a non-preemptible kernel with a large I/O map region
    (like the one we get when using persistent memory in sector mode), an
    unmap of the namespace can produce the softlockup below.

    [22724.027334] watchdog: BUG: soft lockup - CPU#49 stuck for 23s! [ndctl:50777]
    NIP [c0000000000dc224] plpar_hcall+0x38/0x58
    LR [c0000000000d8898] pSeries_lpar_hpte_invalidate+0x68/0xb0
    Call Trace:
    flush_hash_page+0x114/0x200
    hpte_need_flush+0x2dc/0x540
    vunmap_page_range+0x538/0x6f0
    free_unmap_vmap_area+0x30/0x70
    remove_vm_area+0xfc/0x140
    __vunmap+0x68/0x270
    __iounmap.part.0+0x34/0x60
    memunmap+0x54/0x70
    release_nodes+0x28c/0x300
    device_release_driver_internal+0x16c/0x280
    unbind_store+0x124/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xd8/0x260
    ksys_write+0xdc/0x130
    system_call+0x5c/0x70
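
    A minimal sketch of the idea (simplified, not the exact upstream diff):
    drop a cond_resched() into the PMD-level unmap loop, the same way
    zap_pte_range() does, so a huge unmap can yield the CPU:

    static void vunmap_pmd_range(pud_t *pud, unsigned long addr,
                                 unsigned long end)
    {
            pmd_t *pmd = pmd_offset(pud, addr);
            unsigned long next;

            do {
                    next = pmd_addr_end(addr, end);

                    cond_resched();         /* avoid soft lockups */

                    if (pmd_clear_huge(pmd))
                            continue;
                    if (pmd_none_or_clear_bad(pmd))
                            continue;
                    vunmap_pte_range(pmd, addr, next);
            } while (pmd++, addr = next, addr != end);
    }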

    Reported-by: Harish Sriram
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200807075933.310240-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Sasha Levin
     

29 Apr, 2020

1 commit

  • commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream.

    remap_vmalloc_range() has had various issues with the bounds checks it
    promises to perform ("This function checks that addr is a valid
    vmalloc'ed area, and that it is big enough to cover the vma") over time,
    e.g.:

    - not detecting pgoff<<PAGE_SHIFT overflow

    - not detecting (pgoff<<PAGE_SHIFT)+usize overflow

    - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the
      same vmalloc allocation

    - comparing a potentially wildly out-of-bounds pointer with the end
      of the vmalloc region

    To allow remap_vmalloc_range_partial() to verify that addr and
    addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, the offset
    is passed to remap_vmalloc_range_partial() instead of being added to
    the pointer in remap_vmalloc_range().
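
    A minimal sketch of the kind of overflow check involved
    (check_shl_overflow() is the real helper from <linux/overflow.h>;
    the surrounding code is illustrative): reject a pgoff whose byte
    offset overflows instead of adding it to the pointer unchecked:

    unsigned long off;

    if (check_shl_overflow(vma->vm_pgoff, PAGE_SHIFT, &off))
            return -EINVAL;
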
    Signed-off-by: Andrew Morton
    Cc: stable@vger.kernel.org
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Martin KaFai Lau
    Cc: Song Liu
    Cc: Yonghong Song
    Cc: Andrii Nakryiko
    Cc: John Fastabend
    Cc: KP Singh
    Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

25 Mar, 2020

1 commit

  • commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.

    Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
    __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
    the vunmap() code-path. While this change was necessary to maintain
    correctness on x86-32-pae kernels, it also adds additional cycles for
    architectures that don't need it.

    Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
    severe performance regressions in micro-benchmarks because it now also
    calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
    the vmalloc_sync_all() implementation on x86-64 is only needed for newly
    created mappings.

    To avoid the unnecessary work on x86-64 and to gain the performance
    back, split up vmalloc_sync_all() into two functions:

    * vmalloc_sync_mappings(), and
    * vmalloc_sync_unmappings()

    Most call-sites to vmalloc_sync_all() only care about new mappings being
    synchronized. The only exception is the new call-site added in the
    above mentioned commit.

    Shile Zhang directed us to a report of an 80% regression in reaim
    throughput.
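
    A minimal sketch of the resulting split (the names are from the
    commit; the per-arch bodies live elsewhere):

    /* Most call-sites only need newly created mappings synchronized: */
    void vmalloc_sync_mappings(void);

    /* Only the vunmap path needs unmappings synchronized (x86-32 PAE): */
    void vmalloc_sync_unmappings(void);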

    Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
    Reported-by: kernel test robot
    Reported-by: Shile Zhang
    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Tested-by: Borislav Petkov
    Acked-by: Rafael J. Wysocki [GHES]
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
    Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
    Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel
     

23 Jan, 2020

1 commit

  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but same is not
    true for e.g. ppc64 and s390 where the kernel would not boot with
    debug_pagealloc=on as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
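
    A minimal sketch of the two variants described above (simplified from
    the commit):

    extern bool _debug_pagealloc_enabled_early;
    DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

    /* init-time / slow-path check: a plain bool, safe before
     * jump_label_init() has run */
    static inline bool debug_pagealloc_enabled(void)
    {
            return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
                   _debug_pagealloc_enabled_early;
    }

    /* fast-path check: a static key, enabled in mm_init() once
     * jump_label_init() is guaranteed to have been called */
    static inline bool debug_pagealloc_enabled_static(void)
    {
            if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                    return false;

            return static_branch_unlikely(&_debug_pagealloc_enabled);
    }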

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

26 Sep, 2019

1 commit

  • Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
    for the case where the augmented value is a scalar whose definition
    follows a max(f(node)) pattern. This actually covers all present uses of
    RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
    various RBCOMPUTE function definitions.
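
    A minimal usage sketch, mirroring what mm/vmalloc.c does after this
    change (the macro generates the augmented-rbtree callbacks from just
    the scalar field and the per-node compute function):

    static __always_inline unsigned long
    va_size(struct vmap_area *va)
    {
            return va->va_end - va->va_start;
    }

    RB_DECLARE_CALLBACKS_MAX(static, free_vmap_area_rb_augment_cb,
                             struct vmap_area, rb_node,
                             unsigned long, subtree_max_size, va_size)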

    [walken@google.com: fix mm/vmalloc.c]
    Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
    [walken@google.com: re-add check to check_augmented()]
    Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
    Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
    Signed-off-by: Michel Lespinasse
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

25 Sep, 2019

3 commits

  • If the !area->pages test is true, i.e. the memory allocation failed,
    area is freed.

    In this case 'area->pages = pages' should not be executed, so move
    'area->pages = pages' after the if statement.
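
    A minimal sketch of the reordering (condensed from
    __vmalloc_area_node(); the allocation call is abridged):

    pages = __vmalloc_node(array_size, 1, nested_gfp | highmem_mask,
                           PAGE_KERNEL, node, area->caller);
    if (!pages) {
            remove_vm_area(area->addr);
            kfree(area);
            return NULL;
    }

    area->pages = pages;            /* only reached on success now */
    area->nr_pages = nr_pages;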

    [akpm@linux-foundation.org: give area->pages the same treatment]
    Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15
    Signed-off-by: Austin Kim
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: Roman Penyaev
    Cc: Rick Edgecombe
    Cc: Mike Rapoport
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Austin Kim
     
  • Objective
    ---------

    The current implementation of struct vmap_area wastes space.

    After applying this commit, sizeof(struct vmap_area) is reduced
    from 11 words to 8 words.

    Description
    -----------

    1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem
    because

    A) "subtree_max_size" is only used when vmap_area is in "free" tree

    B) "vm" is only used when vmap_area is in "busy" tree

    C) "purge_list" is only used when vmap_area is in vmap_purge_list

    2) Eliminate "flags".

    Since only one flag, VM_VM_AREA, is being used, and the same thing can
    be determined by checking whether "vm" is NULL, the "flags" field can
    be eliminated.
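
    A minimal sketch of the packed layout described above (simplified):

    struct vmap_area {
            unsigned long va_start;
            unsigned long va_end;

            struct rb_node rb_node;         /* address-sorted rbtree */
            struct list_head list;          /* address-sorted list */

            /* The three fields below are mutually exclusive, so they
             * can share the same storage: */
            union {
                    unsigned long subtree_max_size; /* in "free" tree */
                    struct vm_struct *vm;           /* in "busy" tree */
                    struct llist_node purge_list;   /* in purge list */
            };
    };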

    Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com
    Signed-off-by: Pengfei Li
    Suggested-by: Uladzislau Rezki (Sony)
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Roman Gushchin
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pengfei Li
     
  • The busy tree can be quite big; even though an area has been freed or
    unmapped, it still stays there until the "purge" logic removes it.

    1) Optimize and reduce the size of the "busy" tree by removing a node
    from it right away as soon as a user triggers the free path. It is
    possible to do so, because the allocation is done using another
    augmented tree.

    The vmalloc test driver shows the difference, for example the
    "fix_size_alloc_test" is ~11% better comparing with default configuration:

    sudo ./test_vmalloc.sh performance

    <default>
    Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
    Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
    Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec

    <patched>
    Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
    Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
    Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec

    2) Since the busy tree now contains allocated areas only and does not
    interfere with lazily freed nodes, introduce the new function
    show_purge_info() that dumps "unpurged" areas; its output is exposed
    through "/proc/vmallocinfo".

    3) Eliminate VM_LAZY_FREE flag.

    Link: http://lkml.kernel.org/r/20190716152656.12255-2-lpf.vector@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Signed-off-by: Pengfei Li
    Cc: Roman Gushchin
    Cc: Uladzislau Rezki
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

04 Sep, 2019

1 commit

  • The arm architecture had a VM_ARM_DMA_CONSISTENT flag to mark DMA
    coherent remapping for a while. Lift this flag to common code so
    that we can use it generically. We also check it in the only place
    VM_USERMAP is directly checked, so that we can entirely replace that
    flag as well (although I'm not even sure why we'd want to allow
    remapping DMA mappings, but I'd rather not change behavior).

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

14 Aug, 2019

1 commit

  • Recent changes to the vmalloc code by commit 68ad4a330433
    ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
    cause spurious percpu allocation failures. These, in turn, can result
    in panic()s in the slub code. One such possible panic was reported by
    Dave Hansen at the following link: https://lkml.org/lkml/2019/6/19/939.
    Another related panic observed is:

    RIP: 0033:0x7f46f7441b9b
    Call Trace:
    dump_stack+0x61/0x80
    pcpu_alloc.cold.30+0x22/0x4f
    mem_cgroup_css_alloc+0x110/0x650
    cgroup_apply_control_enable+0x133/0x330
    cgroup_mkdir+0x41b/0x500
    kernfs_iop_mkdir+0x5a/0x90
    vfs_mkdir+0x102/0x1b0
    do_mkdirat+0x7d/0xf0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
    to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
    uses two lists (vmap_area_list & free_vmap_area_list) to track the used
    and free VM areas in VMALLOC space. The pcpu_get_vm_areas(offsets[],
    sizes[], nr_vms, align) function is used for allocating congruent VM
    areas for the percpu memory allocator. In order not to conflict with
    VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
    VMALLOC space. So the search for a free vm_area for the given
    requirement starts near VMALLOC_END and moves upwards towards
    VMALLOC_START.

    Prior to commit 68ad4a330433, the search for a free vm_area in
    pcpu_get_vm_areas() involved the following two main steps.

    Step 1:
        Find an aligned "base" address near VMALLOC_END.
        va = free vm area near VMALLOC_END
    Step 2:
        Loop through the number of requested vm_areas and check,
        Step 2.1:
            if (base < VMALLOC_START)
                1. fail with error
        Step 2.2:
            // end is offsets[area] + sizes[area]
            if (base + end > va->vm_end)
                1. Move the base downwards and repeat Step 2
        Step 2.3:
            if (base + start < va->vm_start)
                1. Move to the previous free vm_area node, find an
                   aligned base address and repeat Step 2

    But Commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

    Step 2.3:
        if (base + start < va->vm_start || base + end > va->vm_end)
            1. Move to the previous free vm_area node, find an
               aligned base address and repeat Step 2

    The above change is the root cause of spurious percpu memory allocation
    failures. For example, consider a case where a relatively large vm_area
    (~30 TB) was ignored in the free vm_area search because it did not pass
    the base + end < vm->vm_end boundary check. Ignoring such large free
    vm_areas would lead to not finding a free vm_area within the boundary
    of VMALLOC_START to VMALLOC_END, which in turn leads to allocation
    failures.

    So modify the search algorithm to include Step 2.2.
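
    A minimal sketch of the restored check inside the pcpu_get_vm_areas()
    search loop (simplified from the fix):

    /* Step 2.2: base + end must not overshoot this free area */
    if (base + end > va->va_end) {
            base = pvm_determine_end_from_reverse(&va, align) - end;
            term_area = area;
            continue;
    }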

    Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
    Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Kuppuswamy Sathyanarayanan
    Reported-by: Dave Hansen
    Acked-by: Dennis Zhou
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: sathyanarayanan kuppuswamy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuppuswamy Sathyanarayanan
     

22 Jul, 2019

1 commit

  • On x86-32 with PTI enabled, parts of the kernel page-tables are not shared
    between processes. This can cause mappings in the vmalloc/ioremap area to
    persist in some page-tables after the region is unmapped and released.

    When the region is re-used the processes with the old mappings do not fault
    in the new mappings but still access the old ones.

    This causes undefined behavior, in reality often data corruption, kernel
    oopses and panics and even spontaneous reboots.

    Fix this problem by actively syncing unmaps in the vmalloc/ioremap area to
    all page-tables in the system before the regions can be re-used.

    References: https://bugzilla.suse.com/show_bug.cgi?id=1118689
    Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F')
    Signed-off-by: Joerg Roedel
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Dave Hansen
    Link: https://lkml.kernel.org/r/20190719184652.11391-4-joro@8bytes.org

    Joerg Roedel
     

13 Jul, 2019

7 commits

  • Vmalloc() is getting more and more used these days (kernel stacks, bpf and
    percpu allocator are new top users), and the total % of memory consumed by
    vmalloc() can be pretty significant and changes dynamically.

    /proc/meminfo is the best place to display this information: its top goal
    is to show top consumers of the memory.

    Since the VmallocUsed field in /proc/meminfo has not been in use for
    quite a long time (it has been defined to 0 by commit a5ad88ce8c7f
    ("mm: get rid of 'vmalloc_info' from /proc/meminfo")), let's reuse it
    for showing the actual physical memory consumption of vmalloc().
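
    A minimal sketch of the accounting behind the revived field
    (simplified from the commit): a global counter of pages backing
    vmalloc() allocations, updated on alloc/free and reported via
    /proc/meminfo:

    static atomic_long_t nr_vmalloc_pages;

    unsigned long vmalloc_nr_pages(void)
    {
            return atomic_long_read(&nr_vmalloc_pages);
    }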

    Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Link: http://lkml.kernel.org/r/20190607113509.15032-1-geert+renesas@glider.be
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Andrew Morton
    Acked-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Trigger a warning if an object that is about to be freed is already
    detached. We used to have a BUG_ON() there, but even though this is
    considered faulty behaviour, it is not a good reason to break the
    system.
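
    A minimal sketch of the check described above (simplified): warn and
    bail out instead of BUG() when the object is already detached:

    static void unlink_va(struct vmap_area *va, struct rb_root *root)
    {
            if (WARN_ON(RB_EMPTY_NODE(&va->rb_node)))
                    return;

            /* ... the actual rb_erase()/list_del() follow here ... */
    }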

    Link: http://lkml.kernel.org/r/20190606120411.8298-5-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • It does not make sense to try to "unlink" a node that is definitely not
    linked to any list or tree. On the first merge step the VA just points
    to the previously disconnected busy area.

    On the second step, check if the node has been merged and do "unlink" if
    so, because now it points to an object that must be linked.

    Link: http://lkml.kernel.org/r/20190606120411.8298-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Acked-by: Hillf Danton
    Reviewed-by: Roman Gushchin
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Refactor the NE_FIT_TYPE split case when it comes to the allocation of
    one extra object. We need it in order to build the remaining space.
    The preload is done per CPU in non-atomic context with GFP_KERNEL
    flags.

    More permissive parameters can be beneficial for systems which suffer
    from high memory pressure or low-memory conditions. For example, on my
    KVM system (4 CPUs, no swap, 256MB RAM) I can simulate the failure of
    page allocation with GFP_NOWAIT flags. Using the "stress-ng" tool and
    starting N workers spinning on fork() and exit(), I can trigger the
    trace below:

    [ 179.815161] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
    [ 179.815168] CPU: 0 PID: 12612 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #1003
    [ 179.815170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [ 179.815171] Call Trace:
    [ 179.815178] dump_stack+0x5c/0x7b
    [ 179.815182] warn_alloc+0x108/0x190
    [ 179.815187] __alloc_pages_slowpath+0xdc7/0xdf0
    [ 179.815191] __alloc_pages_nodemask+0x2de/0x330
    [ 179.815194] cache_grow_begin+0x77/0x420
    [ 179.815197] fallback_alloc+0x161/0x200
    [ 179.815200] kmem_cache_alloc+0x1c9/0x570
    [ 179.815202] alloc_vmap_area+0x32c/0x990
    [ 179.815206] __get_vm_area_node+0xb0/0x170
    [ 179.815208] __vmalloc_node_range+0x6d/0x230
    [ 179.815211] ? _do_fork+0xce/0x3d0
    [ 179.815213] copy_process.part.46+0x850/0x1b90
    [ 179.815215] ? _do_fork+0xce/0x3d0
    [ 179.815219] _do_fork+0xce/0x3d0
    [ 179.815226] ? __do_page_fault+0x2bf/0x4e0
    [ 179.815229] do_syscall_64+0x55/0x130
    [ 179.815231] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 179.815234] RIP: 0033:0x7fedec4c738b
    ...
    [ 179.815237] RSP: 002b:00007ffda469d730 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
    [ 179.815239] RAX: ffffffffffffffda RBX: 00007ffda469d730 RCX: 00007fedec4c738b
    [ 179.815240] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
    [ 179.815241] RBP: 00007ffda469d780 R08: 00007fededd6e300 R09: 00007ffda47f50a0
    [ 179.815242] R10: 00007fededd6e5d0 R11: 0000000000000246 R12: 0000000000000000
    [ 179.815243] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
    [ 179.815245] Mem-Info:
    [ 179.815249] active_anon:12686 inactive_anon:14760 isolated_anon:0
    active_file:502 inactive_file:61 isolated_file:70
    unevictable:2 dirty:0 writeback:0 unstable:0
    slab_reclaimable:2380 slab_unreclaimable:7520
    mapped:15069 shmem:14813 pagetables:10833 bounce:0
    free:1922 free_pcp:229 free_cma:0
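
    A minimal sketch of the per-CPU preload described above (simplified;
    the helper name here is illustrative, the per-CPU variable name is
    from the commit; the real code keeps preemption disabled until the
    subsequent spinlock is taken):

    static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);

    static void preload_ne_fit_object(int node)
    {
            preempt_disable();
            if (!__this_cpu_read(ne_fit_preload_node)) {
                    struct vmap_area *pva;

                    /* allocate in non-atomic context with GFP_KERNEL */
                    preempt_enable();
                    pva = kmem_cache_alloc_node(vmap_area_cachep,
                                                GFP_KERNEL, node);
                    preempt_disable();

                    /* someone else may have preloaded meanwhile */
                    if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
                            kmem_cache_free(vmap_area_cachep, pva);
            }
            preempt_enable();
    }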

    Link: http://lkml.kernel.org/r/20190606120411.8298-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Roman Gushchin
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Patch series "Some cleanups for the KVA/vmalloc", v5.

    This patch (of 4):

    Remove unused argument from the __alloc_vmap_area() function.

    Link: http://lkml.kernel.org/r/20190606120411.8298-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Drop the pgtable_t variable from all implementations of pte_fn_t, as
    none of them use it. apply_to_pte_range() should stop computing it as
    well. This should help us save some cycles.
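
    The simplified callback type after this change (the old variant also
    carried a pgtable_t argument that no implementation used):

    typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);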

    Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: Matthew Wilcox
    Cc: Ard Biesheuvel
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: "Kirill A. Shutemov"
    Cc: Dan Williams
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

09 Jul, 2019

1 commit

  • Pull arm64 updates from Catalin Marinas:

    - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}

    - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
    manage the permissions of executable vmalloc regions more strictly

    - Slight performance improvement by keeping softirqs enabled while
    touching the FPSIMD/SVE state (kernel_neon_begin/end)

    - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
    XAFLAG and AXFLAG instructions for floating point comparison flags
    manipulation) and FRINT (rounding floating point numbers to integers)

    - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
    BROKEN due to some bugs (now fixed)

    - Improve parking of stopped CPUs and implement an arm64-specific
    panic_smp_self_stop() to avoid warning on not being able to stop
    secondary CPUs during panic

    - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
    platforms

    - perf: DDR performance monitor support for iMX8QXP

    - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
    cope with a system cache info not exposed via the CPUID registers

    - Avoid warning on hardware cache line size greater than
    ARCH_DMA_MINALIGN if the system is fully coherent

    - arm64 do_page_fault() and hugetlb cleanups

    - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)

    - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
    'arm_boot_flags' introduced in 5.1)

    - CONFIG_RANDOMIZE_BASE now enabled in defconfig

    - Allow the selection of ARM64_MODULE_PLTS, currently only done via
    RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
    over into the vmalloc area

    - Make ZONE_DMA32 configurable

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
    perf: arm_spe: Enable ACPI/Platform automatic module loading
    arm_pmu: acpi: spe: Add initial MADT/SPE probing
    ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
    ACPI/PPTT: Modify node flag detection to find last IDENTICAL
    x86/entry: Simplify _TIF_SYSCALL_EMU handling
    arm64: rename dump_instr as dump_kernel_instr
    arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
    arm64: Implement panic_smp_self_stop()
    arm64: Improve parking of stopped CPUs
    arm64: Expose FRINT capabilities to userspace
    arm64: Expose ARMv8.5 CondM capability to userspace
    arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
    arm64: ARM64_MODULES_PLTS must depend on MODULES
    arm64: bpf: do not allocate executable memory
    arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
    arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
    arm64: module: create module allocations without exec permissions
    arm64: Allow user selection of ARM64_MODULE_PLTS
    acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
    arm64: Allow selecting Pseudo-NMI again
    ...

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • gcc gets confused in pcpu_get_vm_areas() because there are too many
    branches that affect whether 'lva' was initialized before it gets used:

    mm/vmalloc.c: In function 'pcpu_get_vm_areas':
    mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    insert_vmap_area_augment(lva, &va->rb_node,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    &free_vmap_area_root, &free_vmap_area_list);
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/vmalloc.c:916:20: note: 'lva' was declared here
    struct vmap_area *lva;
    ^~~

    Add an initialization to NULL, and check whether it has changed before
    the first use.
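
    A minimal sketch of the fix (simplified):

    struct vmap_area *lva = NULL;

    /* ... the NE_FIT_TYPE branch allocates and fills lva ... */

    if (lva)        /* instead of re-testing type == NE_FIT_TYPE */
            insert_vmap_area_augment(lva, &va->rb_node,
                            &free_vmap_area_root, &free_vmap_area_list);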

    [akpm@linux-foundation.org: tweak comments]
    Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
    Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

25 Jun, 2019

1 commit


03 Jun, 2019

2 commits

  • In a rare case, flush_tlb_kernel_range() could be called with a start
    higher than the end.

    In vm_remove_mappings(), in case page_address() returns 0 for all pages
    (for example they were all in highmem), _vm_unmap_aliases() will be
    called with start = ULONG_MAX, end = 0 and flush = 1.

    If at the same time, the vmalloc purge operation is triggered by something
    else while the current operation is between remove_vm_area() and
    _vm_unmap_aliases(), then the vm mapping just removed will already have
    been purged. In this case the call to vm_unmap_aliases() may not find
    any other mappings to flush and so ends up flushing start = ULONG_MAX,
    end = 0. So
    only set flush = true if we find something in the direct mapping that we
    need to flush, and this way this can't happen.
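
    A minimal sketch of the fix inside vm_remove_mappings() (simplified
    from the commit): remember whether any direct-map page was actually
    found, and pass that as the flush decision:

    int flush_dmap = 0;
    unsigned long start = ULONG_MAX, end = 0;

    for (i = 0; i < area->nr_pages; i++) {
            unsigned long addr =
                    (unsigned long)page_address(area->pages[i]);

            if (addr) {     /* page has a direct mapping */
                    start = min(addr, start);
                    end = max(addr + PAGE_SIZE, end);
                    flush_dmap = 1;
            }
    }

    _vm_unmap_aliases(start, end, flush_dmap);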

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Meelis Roos
    Cc: Nadav Amit
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
    Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     
  • The calculation of the direct map address range to flush was wrong.
    This could cause the RO direct map alias to not get flushed. Today
    this shouldn't be a problem because this flush is only needed on x86
    right now and the spurious fault handler will fix cached RO->RW
    translations. In the future though, it could cause the permissions
    to remain RO in the TLB for the direct map alias, and then the page
    would return from the page allocator to some other component as RO
    and cause a crash.

    So fix the address range calculation so that the flush will include the
    direct map range.

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Meelis Roos
    Cc: Nadav Amit
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
    Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

02 Jun, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

19 May, 2019

3 commits

  • This macro adds some debug code to check that vmap allocations happen
    in ascending order.

    By default this option is set to 0 and is not active. Activating it
    requires recompilation of the kernel: set the macro to 1, then compile
    the kernel.

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • This macro adds some debug code to check that the augmented tree is
    maintained correctly, meaning that every node contains a valid
    subtree_max_size value.

    By default this option is set to 0 and is not active. Activating it
    requires recompilation of the kernel: set the macro to 1, then compile
    the kernel.

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Patch series "improve vmap allocation", v3.

    Objective
    ---------

    Please have a look for the description at:

    https://lkml.org/lkml/2018/10/19/786

    but let me also summarize it a bit here as well.

    The current implementation has O(N) complexity. Requests with different
    permissive parameters can lead to long allocation times. When I say
    "long" I mean milliseconds.

    Description
    -----------

    This approach organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range, i.e. an allocation is done over free-area lookups
    instead of finding a hole between two busy blocks. It allows having a
    lower number of objects representing the free space, and therefore a
    less fragmented memory allocator, because free blocks are always as
    large as possible.

    It uses the augmented tree where all free areas are sorted in ascending
    order of their va->va_start address, paired with a linked list that
    provides O(1) access to prev/next elements.

    Since the tree is augmented, we also maintain the "subtree_max_size" of
    each VA that reflects the maximum available free block in its left or
    right sub-tree. Knowing that, we can easily traverse toward the lowest
    (left-most path) free area.

    Allocation: ~O(log(N)) complexity. It is a sequential allocation method
    and therefore tends to maximize locality. The search is done until the
    first suitable block large enough to encompass the requested parameters
    is found. Bigger areas are split.

    I copy-paste here the description of how an area is split, since I
    described it in https://lkml.org/lkml/2018/10/19/786

    A free block can be split by three different ways. Their names are
    FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they
    correspond to how requested size and alignment fit to a free block.

    FL_FIT_TYPE - in this case a free block is just removed from the free
    list/tree because it fully fits. Compared with the current design there
    is extra work updating the rb-tree.

    LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case we just
    cut a free block. It is as fast as the current design. Most vmalloc
    allocations end up with this case, because the edge is always aligned
    to 1.

    NE_FIT_TYPE - a much less common case. Basically it happens when the
    requested size and alignment fit neither the left nor the right edge,
    i.e. the allocation lands between them. In this case during splitting
    we have to build a remaining left free area and place it back on the
    free list/tree.

    Compared with the current design there are two extra steps. First, we
    have to allocate a new vmap_area structure. Second, we have to insert
    that remaining free block into the address-sorted list/tree.

    In order to optimize the first step there is a cache with free_vmap
    objects. Instead of allocating from slab we just take an object from
    the cache and reuse it.

    The second step is pretty optimized. Since we know a start point in the
    tree, we do not search from the top. Instead, the traversal begins from
    the rb-tree node we split.

    De-allocation: ~O(log(N)) complexity. An area is not inserted straight
    into the tree/list; instead we identify the spot first, checking if it
    can be merged with its neighbors. The list provides O(1) access to
    prev/next, so this check is pretty fast. Summarizing: if merged, large
    coalesced areas are created; if not, the area is just linked, making
    more fragments.

    There is one more thing I should mention here. After modification of a
    VA node, its subtree_max_size is updated if it was/is the biggest area
    in its left or right sub-tree. Apart from that, the change can also be
    propagated back to upper levels to fix the tree. For more details
    please have a look at the __augment_tree_propagate_from() function and
    its description.
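
    A sketch of the "lowest match" descent described above (simplified;
    the upstream function also handles a backtracking corner case):

    static struct vmap_area *
    find_vmap_lowest_match(unsigned long size, unsigned long align,
                           unsigned long vstart)
    {
            struct rb_node *node = free_vmap_area_root.rb_node;

            /* Adjust the search size for alignment overhead. */
            unsigned long length = size + align - 1;

            while (node) {
                    struct vmap_area *va =
                            rb_entry(node, struct vmap_area, rb_node);

                    if (get_subtree_max_size(node->rb_left) >= length &&
                                    vstart < va->va_start) {
                            node = node->rb_left;   /* left subtree fits */
                    } else {
                            if (va_size(va) >= length &&
                                            va->va_end - length >= vstart)
                                    return va;      /* this node fits */

                            node = node->rb_right;  /* higher addresses */
                    }
            }

            return NULL;
    }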

    Tests and stressing
    -------------------

    I use the "test_vmalloc.sh" test driver, available under
    "tools/testing/selftests/vm/" since the 5.1-rc1 kernel. Just trigger
    "sudo ./test_vmalloc.sh" to find out how to deal with it.

    Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
    Regarding the last one, I do not have physical access to a NUMA system,
    therefore I emulated it. The time of stressing is days.

    If you run the test driver in "stress mode", you also need the patch that
    is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it:

    http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

    After massive testing, I have not identified any problems like memory
    leaks, crashes or kernel panics. I find it stable, but more testing
    would be good.

    Performance analysis
    --------------------

    I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz and
    the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel,
    whereas the HiKey960 uses 4.15. Both systems could run 5.1-rc1 as well,
    but those results were not ready by the time I was writing this.

    Currently the driver consists of 8 tests. Three of them correspond to
    the different types of splitting described above (to compare with the
    default). The other 5 do allocations under different conditions.

    a) sudo ./test_vmalloc.sh performance

    When the test driver is run in "performance" mode, it runs all
    available tests pinned to the first online CPU with sequential
    execution order. We do it in order to get stable and repeatable
    results. Take a look at the time difference in
    "long_busy_list_alloc_test". It is not surprising, because the worst
    case is O(N).

    # i5-3320M
    How many cycles all tests took:
    CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

    # Hikey960 8x CPUs
    How many cycles all tests took:
    CPU0=3478683207 cycles vs CPU0=463767978 cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

    b) time sudo ./test_vmalloc.sh test_repeat_count=1

    With this configuration, all tests are run on all available online
    CPUs. Before running, each CPU shuffles its test execution order, which
    gives random allocation behaviour. So it is a rough comparison, but it
    gives the overall picture for sure.

    # i5-3320M
              default          patched
    real      101m22.813s      0m56.805s
    user      0m0.011s         0m0.015s
    sys       0m5.076s         0m0.023s

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

    # Hikey960 8x CPUs
              default          patched
    real      unknown          4m25.214s
    user      unknown          0m0.011s
    sys       unknown          0m0.670s

    I did not manage to complete this test on the default Hikey960 kernel
    version. After 24 hours it was still running, therefore I had to cancel
    it. That is why real/user/sys are "unknown".

    This patch (of 3):

    Currently an allocation of a new vmap area is done over busy-list
    iteration (complexity O(N)) until a suitable hole is found between two
    busy areas. Therefore each new allocation causes the list to grow. Due
    to an over-fragmented list and different permissive parameters an
    allocation can take a long time. For example on embedded devices it is
    milliseconds.

    This patch organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range. It uses an augmented red-black tree that keeps
    blocks sorted by their offsets, paired with a linked list keeping the
    free space in order of increasing addresses.

    Nodes are augmented with the size of the maximum available free block
    in their left or right sub-tree. That allows deciding on and traversing
    toward the block that will fit and will have the lowest start address,
    i.e. sequential allocation.

    Allocation: to allocate a new block, a search is done over the tree
    until a suitable lowest (left-most) block is found that is large enough
    to encompass the requested size, alignment and vstart point. If the
    block is bigger than the requested size, it is split.

    De-allocation: when a busy vmap area is freed, it can either be merged
    or inserted into the tree. The red-black tree allows efficiently
    finding the spot, whereas the linked list provides constant-time access
    to the previous and next blocks, to check whether merging can be done.
    When a de-allocated memory chunk is merged, a large coalesced area is
    created.

    Complexity: ~O(log(N))

    [urezki@gmail.com: v3]
    Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

15 May, 2019

2 commits

  • The vmap_lazy_nr variable has the atomic_t type, which is a 4-byte
    integer on both 32- and 64-bit systems. lazy_max_pages() deals with an
    "unsigned long", which is 8 bytes on a 64-bit system, thus vmap_lazy_nr
    should be 8 bytes on 64-bit as well.
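
    A minimal sketch of the type change:

    /* before: static atomic_t vmap_lazy_nr = ATOMIC_INIT(0); */
    static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);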

    Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()")
    introduced some preempt points, one of which prioritizes an allocation
    over the lazy freeing of vmap areas.

    Prioritizing an allocation over freeing does not work well all the
    time, i.e. it should rather be a compromise.

    1) The number of lazy pages directly influences the busy list length
    and thus operations like allocation, lookup, unmap, remove, etc.

    2) Under heavy stress of the vmalloc subsystem I ran into a situation
    where memory usage kept increasing until hitting the out_of_memory ->
    panic state, due to complete blocking of the logic that frees vmap
    areas in the __purge_vmap_area_lazy() function.

    Establish a threshold past which the freeing is prioritized back over
    allocation, creating a balance between the two.

    Using the vmalloc test driver in "stress mode", i.e. when all available
    test cases are run simultaneously on all online CPUs, applying pressure
    on the vmalloc subsystem, my HiKey 960 board runs out of memory because
    the __purge_vmap_area_lazy() logic simply is not able to free pages in
    time.

    How I run it:

    1) You should build your kernel with CONFIG_TEST_VMALLOC=m
    2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

    During this test the number of "vmap_lazy_nr" pages goes far beyond the
    acceptable lazy_max_pages() threshold, which leads to an enormous busy
    list size and other problems, including allocation time and so on.
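
    A minimal sketch of the threshold in the purge loop (simplified from
    the commit): keep rescheduling only while the lazy backlog stays below
    twice lazy_max_pages(); past that, freeing takes priority:

    resched_threshold = lazy_max_pages() << 1;

    llist_for_each_entry_safe(va, n_va, valist, purge_list) {
            unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;

            __free_vmap_area(va);
            atomic_long_sub(nr, &vmap_lazy_nr);

            if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
                    cond_resched_lock(&vmap_purge_lock);
    }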

    Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

30 Apr, 2019

1 commit

  • Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to
    immediately clear executable TLB entries before freeing pages, and handle
    resetting permissions on the directmap. This flag is useful for any kind
    of memory with elevated permissions, or where there can be related
    permissions changes on the directmap. Today this is RO+X and RO memory.

    Although this enables directly vfreeing non-writable memory now, such
    memory cannot be freed in an interrupt context, because the allocation
    itself is used as a node on the deferred free list. So when RO memory
    needs to be freed in an interrupt, the code doing the vfree needs to
    have its own work queue, as was the case before the deferred vfree list
    was added to vmalloc.

    For architectures with set_direct_map_ implementations this whole operation
    can be done with one TLB flush when centralized like this. For others with
    directmap permissions, currently only arm64, a backup method using
    set_memory functions is used to reset the directmap. When arm64 adds
    set_direct_map_ functions, this backup can be removed.

    When the TLB is flushed to both remove TLB entries for the vmalloc range
    mapping and the direct map permissions, the lazy purge operation could be
    done to try to save a TLB flush later. However today vm_unmap_aliases
    could flush a TLB range that does not include the directmap. So a helper
    is added with extra parameters that can allow both the vmalloc address and
    the direct mapping to be flushed during this operation. The behavior of the
    normal vm_unmap_aliases function is unchanged.
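
    A hypothetical usage sketch (the flag and the
    set_vm_flush_reset_perms() helper are from the commit; the surrounding
    calls are illustrative):

    void *p = module_alloc(size);

    set_vm_flush_reset_perms(p);    /* mark VM_FLUSH_RESET_PERMS */
    set_memory_ro((unsigned long)p, npages);
    set_memory_x((unsigned long)p, npages);

    /* ... run code from p ... */

    vfree(p);   /* TLB entries cleared and direct map permissions
                   reset before the pages return to the allocator */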

    Suggested-by: Dave Hansen
    Suggested-by: Andy Lutomirski
    Suggested-by: Will Deacon
    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190426001143.4983-17-namit@vmware.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

06 Mar, 2019

9 commits

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted at all which makes kernel-doc script
    unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Some kernel-doc comments in mm/vmalloc.c have a leading tab in their
    indentation. This leads to excessive indentation in the generated HTML
    and to inconsistency of its layout ([1] vs [2]).

    Besides, multi-line Note: sections are not handled properly with extra
    indentation.

    [1] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vm_map_ram
    [2] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vfree

    Link: http://lkml.kernel.org/r/1549549644-4903-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • One of the vmalloc stress test cases triggers the kernel BUG():

    [60.562151] ------------[ cut here ]------------
    [60.562154] kernel BUG at mm/vmalloc.c:512!
    [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
    [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390

    It can happen due to a big align request resulting in overflow of the
    calculated address, i.e. it becomes 0 after ALIGN()'s fixup.

    Fix it by checking whether the calculated address is within the
    vstart/vend range.
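
    A minimal sketch of the added check (simplified): after the ALIGN()
    fixup, bail out if the address wrapped or left the requested range:

    if (addr + size > vend || addr < vstart)
            goto overflow;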

    Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Export the __vmalloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE
    is enabled. Some test cases in the vmalloc test suite module require
    and make use of that function. Please note that it is not supposed to
    be used for other purposes.

    We need it only for performance analysis, stressing and stability check
    of vmalloc allocator.

    Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Shuah Khan
    Cc: Oleksiy Avramchenko
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • vmalloc_user*() calls differ from plain vmalloc() only in that they set
    the VM_USERMAP flag for the area. After the whole history of vmalloc.c
    changes, it is now possible simply to pass the VM_USERMAP flag directly
    to the __vmalloc_node_range() call, instead of finding the area (which
    obviously takes time) after the allocation.
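
    A minimal sketch of the simplified vmalloc_user() after this change
    (condensed from the commit):

    void *vmalloc_user(unsigned long size)
    {
            return __vmalloc_node_range(size, SHMLBA,
                            VMALLOC_START, VMALLOC_END,
                            GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
                            VM_USERMAP, NUMA_NO_NODE,
                            __builtin_return_address(0));
    }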

    Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Acked-by: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Joe Perches
    Cc: "Luis R. Rodriguez"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • __vmalloc_area_node() calls vfree() on the error path, which in turn
    calls kmemleak_free(), but the area is not yet accounted for by
    kmemleak_vmalloc().

    Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Joe Perches
    Cc: "Luis R. Rodriguez"
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • When VM_NO_GUARD is not set, area->size includes the adjacent guard
    page, thus for correct size checking get_vm_area_size() should be used,
    not area->size.

    This fixes a possible kernel oops when userspace tries to mmap an area
    1 page bigger than was allocated by the vmalloc_user() call: the size
    check inside remap_vmalloc_range_partial() accounts for the
    non-existent guard page as well, so the check passes, but
    vmalloc_to_page() returns NULL (the guard page does not physically
    exist).
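
    A minimal sketch of the fix in remap_vmalloc_range_partial()
    (simplified): check against the usable size, which excludes the guard
    page, instead of the raw area->size:

    if (kaddr + size > area->addr + get_vm_area_size(area))
            return -EINVAL;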

    The following code pattern example should trigger an oops:

    static int oops_mmap(struct file *file, struct vm_area_struct *vma)
    {
            void *mem;

            mem = vmalloc_user(4096);
            BUG_ON(!mem);
            /* Do not care about mem leak */

            return remap_vmalloc_range(vma, mem, 0);
    }

    And userspace simply mmaps size + PAGE_SIZE:

    mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);

    Possible candidates for oops which do not have any explicit size
    checks:

    *** drivers/media/usb/stkwebcam/stk-webcam.c:
    v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0);

    Or the following one:

    *** drivers/video/fbdev/core/fbmem.c
    static int
    fb_mmap(struct file *file, struct vm_area_struct * vma)
    ...
    res = fb->fb_mmap(info, vma);

    Where fb_mmap callback calls remap_vmalloc_range() directly without any
    explicit checks:

    *** drivers/video/fbdev/vfb.c
    static int vfb_mmap(struct fb_info *info,
                        struct vm_area_struct *vma)
    {
            return remap_vmalloc_range(vma, (void *)info->fix.smem_start,
                                       vma->vm_pgoff);
    }

    Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Acked-by: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Joe Perches
    Cc: "Luis R. Rodriguez"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • This patch repeats the original one from David S. Miller:

    2dca6999eed5 ("mm, perf_event: Make vmalloc_user() align base kernel
    virtual address to SHMLBA")

    but for the missed vmalloc_32_user() case, which also requires correct
    alignment of the virtual address on the kernel side to avoid D-cache
    aliases. A bit of copy-paste from the original patch to refresh the
    memory of what it is all about:

    When a vmalloc'd area is mmap'd into userspace, some kind of
    co-ordination is necessary for this to work on platforms with cpu
    D-caches which can have aliases.

    Otherwise kernel side writes won't be seen properly in userspace and
    vice versa.

    If the kernel side mapping and the user side one have the same
    alignment, modulo SHMLBA, this can work as long as VM_SHARED is set on
    the VMA, and for all current users this is true. VM_SHARED will force
    SHMLBA alignment of the user side mmap on platforms where D-cache
    aliasing matters.

    David S. Miller

    > What are the user-visible runtime effects of this change?

    In simple words: proper alignment avoids a possible difference in the
    data seen through different virtual mappings: userspace and kernel in
    our case. I.e. userspace reads cache line A, the kernel writes to cache
    line B. Both cache lines correspond to the same physical memory (thus
    aliases).

    So this should fix data corruption for archs with VIVT and VIPT caches,
    e.g. armv6. Personally I've never worked with these archs; I just
    spotted the strange difference in the code: for one case we do the
    alignment, for the other we don't. I have a strong feeling that David
    simply missed the vmalloc_32_user() case.

    >
    > Is a -stable backport needed?

    No, I do not think so. The only user of vmalloc_32_user() is the
    virtual frame buffer device drivers/video/fbdev/vfb.c, which says in
    its description: "The main use of this frame buffer device is testing
    and debugging the frame buffer subsystem. Do NOT enable it for normal
    systems!"

    And it seems to me that this vfb.c does not need 32-bit addressable
    pages (the vmalloc_32_user() case), because it is a virtual device and
    should not care about things like dma32 zones, etc. It is probably
    better to clean up the code and switch vfb.c from vmalloc_32_user() to
    vmalloc_user(), and wipe out vmalloc_32_user() from vmalloc.c
    completely. But I'm not very sure that this is worth doing; it's so
    minor, so we can leave it as is.

    Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Reviewed-by: Andrew Morton
    Cc: Stephen Rothwell
    Cc: Michal Hocko
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • find_vmap_area() can return a NULL pointer and we're going to
    dereference it without checking it first. Use the existing
    find_vm_area() function which does exactly what we want and checks for
    the NULL pointer.
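
    A minimal sketch of the fix (simplified): reuse the checked
    find_vm_area() result instead of dereferencing find_vmap_area()
    unchecked:

    area = find_vm_area(addr);
    if (unlikely(!area)) {
            WARN(1, "Trying to vfree() nonexistent vm area (%p)\n", addr);
            return;
    }

    debug_check_no_locks_freed(area->addr, get_vm_area_size(area));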

    Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
    Fixes: f3c01d2f3ade ("mm: vmalloc: avoid racy handling of debugobjects in vunmap")
    Signed-off-by: Liviu Dudau
    Reviewed-by: Andrew Morton
    Cc: Chintan Pandya
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liviu Dudau