04 Apr, 2018

1 commit

  • Pull sparc updates from David Miller:

    1) Add support for ADI (Application Data Integrity) found in more
    recent sparc64 cpus. Essentially this is key-based access to
    virtual memory, and if the key encoded in the virtual address is
    wrong you get a trap.

    The mm changes were reviewed by Andrew Morton and others.

    Work by Khalid Aziz.

    2) Validate DAX completion index range properly, from Rob Gardner.

    3) Add proper Kconfig deps for DAX driver. From Guenter Roeck.
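
    The keyed-access idea above can be modeled in plain C. This is only an
    illustrative sketch: real ADI hardware stores a version tag per cache
    line and traps on mismatch, and the bit positions below are hypothetical,
    chosen just to show how a tag rides in the unused upper bits of a
    64-bit virtual address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag field in the upper 4 bits of a 64-bit address. */
#define TAG_SHIFT 60
#define TAG_MASK  0xFULL

/* Encode a version tag into the upper bits of a virtual address. */
static uint64_t set_tag(uint64_t va, uint64_t tag)
{
    return (va & ~(TAG_MASK << TAG_SHIFT)) | ((tag & TAG_MASK) << TAG_SHIFT);
}

/* Recover the tag from a tagged address. */
static uint64_t get_tag(uint64_t va)
{
    return (va >> TAG_SHIFT) & TAG_MASK;
}

/* Hardware would trap on mismatch; here we just model the check. */
static int access_ok(uint64_t va, uint64_t memory_tag)
{
    return get_tag(va) == memory_tag;
}
```

    A mismatched tag (here, `access_ok` returning 0) is what the new
    "Memory Corruption Detected" trap handler reports on real hardware.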

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
    sparc64: Make atomic_xchg() an inline function rather than a macro.
    sparc64: Properly range check DAX completion index
    sparc: Make auxiliary vectors for ADI available on 32-bit as well
    sparc64: Oracle DAX driver depends on SPARC64
    sparc64: Update signal delivery to use new helper functions
    sparc64: Add support for ADI (Application Data Integrity)
    mm: Allow arch code to override copy_highpage()
    mm: Clear arch specific VM flags on protection change
    mm: Add address parameter to arch_validate_prot()
    sparc64: Add auxiliary vectors to report platform ADI properties
    sparc64: Add handler for "Memory Corruption Detected" trap
    sparc64: Add HV fault type handlers for ADI related faults
    sparc64: Add support for ADI register fields, ASIs and traps
    mm, swap: Add infrastructure for saving page metadata on swap
    signals, sparc: Add signal codes for ADI violations

    Linus Torvalds
     

03 Apr, 2018

10 commits

  • Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
    "System calls are interaction points between userspace and the kernel.
    Therefore, system call functions such as sys_xyzzy() or
    compat_sys_xyzzy() should only be called from userspace via the
    syscall table, but not from elsewhere in the kernel.

    At least on 64-bit x86, it will likely be a hard requirement from
    v4.17 onwards to not call system call functions in the kernel: It is
    better to use a different calling convention for system calls
    there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
    which then hands processing over to the actual syscall function. This
    means that only those parameters which are actually needed for a
    specific syscall are passed on during syscall entry, instead of
    filling in six CPU registers with random user space content all the
    time (which may cause serious trouble down the call chain). Those
    x86-specific patches will be pushed through the x86 tree in the near
    future.

    Moreover, rules on how data may be accessed may differ between kernel
    data and user data. This is another reason why calling sys_xyzzy() is
    generally a bad idea, and -- at most -- acceptable in arch-specific
    code.

    This patchset removes all in-kernel calls to syscall functions in the
    kernel with the exception of arch/. On top of this, it cleans up the
    three places where many syscalls are referenced or prototyped, namely
    kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
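
    The wrapper convention described in the quote can be sketched in
    user-space C. The struct layout and names below are illustrative, not
    the kernel's actual definitions: the point is that the wrapper decodes
    only the registers a given syscall needs from pt_regs, instead of
    handing all six registers to every syscall function.

```c
#include <stdint.h>

/* Simplified stand-in for the x86-64 register save area. */
struct pt_regs { uint64_t di, si, dx, r10, r8, r9; };

/* The actual syscall logic, taking only the parameters it needs. */
static long do_readahead(int fd, int64_t offset, uint64_t count)
{
    (void)fd; (void)offset;
    /* real work elided; echo count so the flow is visible */
    return (long)count;
}

/* Wrapper: decodes pt_regs on the fly, as the x86 entry code will. */
static long wrapped_sys_readahead(const struct pt_regs *regs)
{
    return do_readahead((int)regs->di, (int64_t)regs->si, regs->dx);
}
```

    Because only `di`, `si`, and `dx` are read, the remaining registers
    carrying "random user space content" never reach the syscall body.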

    * 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
    bpf: whitelist all syscalls for error injection
    kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
    kernel/sys_ni: sort cond_syscall() entries
    syscalls/x86: auto-create compat_sys_*() prototypes
    syscalls: sort syscall prototypes in include/linux/compat.h
    net: remove compat_sys_*() prototypes from net/compat.h
    syscalls: sort syscall prototypes in include/linux/syscalls.h
    kexec: move sys_kexec_load() prototype to syscalls.h
    x86/sigreturn: use SYSCALL_DEFINE0
    x86: fix sys_sigreturn() return type to be long, not unsigned long
    x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
    mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
    mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
    mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
    fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
    fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
    fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
    fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
    kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
    kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
    ...

    Linus Torvalds
     
  • Pull removal of obsolete architecture ports from Arnd Bergmann:
    "This removes the entire architecture code for blackfin, cris, frv,
    m32r, metag, mn10300, score, and tile, including the associated device
    drivers.

    I have been working with the (former) maintainers for each one to
    ensure that my interpretation was right and the code is definitely
    unused in mainline kernels. Many had fond memories of working on the
    respective ports to start with and getting them included in upstream,
    but also saw no point in keeping the port alive without any users.

    In the end, it seems that while the eight architectures are extremely
    different, they all suffered the same fate: There was one company in
    charge of an SoC line, a CPU microarchitecture and a software
    ecosystem, which was more costly than licensing newer off-the-shelf
    CPU cores from a third party (typically ARM, MIPS, or RISC-V). It
    seems that all the SoC product lines are still around, but have not
    used the custom CPU architectures for several years at this point. In
    contrast, CPU instruction sets that remain popular and have actively
    maintained kernel ports tend to all be used across multiple licensees.

    [ See the new nds32 port merged in the previous commit for the next
    generation of "one company in charge of an SoC line, a CPU
    microarchitecture and a software ecosystem" - Linus ]

    The removal came out of a discussion that is now documented at
    https://lwn.net/Articles/748074/. Unlike the original plans, I'm not
    marking any ports as deprecated but removing them all at once after I
    made sure that they are all unused. Some architectures (notably tile,
    mn10300, and blackfin) are still being shipped in products with old
    kernels, but those products will never be updated to newer kernel
    releases.

    After this series, we still have a few architectures without mainline
    gcc support:

    - unicore32 and hexagon both have very outdated gcc releases, but the
    maintainers promised to work on providing something newer. At least
    in case of hexagon, this will only be llvm, not gcc.

    - openrisc, risc-v and nds32 are still in the process of finishing
    their support or getting it added to mainline gcc in the first
    place. They all have patched gcc-7.3 ports that work to some
    degree, but complete upstream support won't happen before gcc-8.1.
    Csky posted their first kernel patch set last week; their situation
    will be similar.

    [ Palmer Dabbelt points out that RISC-V support is in mainline gcc
    since gcc-7, although gcc-7.3.0 is the recommended minimum - Linus ]"

    This really says it all:

    2498 files changed, 95 insertions(+), 467668 deletions(-)

    * tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (74 commits)
    MAINTAINERS: UNICORE32: Change email account
    staging: iio: remove iio-trig-bfin-timer driver
    tty: hvc: remove tile driver
    tty: remove bfin_jtag_comm and hvc_bfin_jtag drivers
    serial: remove tile uart driver
    serial: remove m32r_sio driver
    serial: remove blackfin drivers
    serial: remove cris/etrax uart drivers
    usb: Remove Blackfin references in USB support
    usb: isp1362: remove blackfin arch glue
    usb: musb: remove blackfin port
    usb: host: remove tilegx platform glue
    pwm: remove pwm-bfin driver
    i2c: remove bfin-twi driver
    spi: remove blackfin related host drivers
    watchdog: remove bfin_wdt driver
    can: remove bfin_can driver
    mmc: remove bfin_sdh driver
    input: misc: remove blackfin rotary driver
    input: keyboard: remove bf54x driver
    ...

    Linus Torvalds
     
  • Pull x86 mm updates from Ingo Molnar:

    - Extend the memmap= boot parameter syntax to allow the redeclaration
    and dropping of existing ranges, and to support all e820 range types
    (Jan H. Schönherr)

    - Improve the W+X boot time security checks to remove false positive
    warnings on Xen (Jan Beulich)

    - Support booting as Xen PVH guest (Juergen Gross)

    - Improved 5-level paging (LA57) support; in particular, it is now
    possible to have a single kernel image for both 4-level and 5-level
    hardware (Kirill A. Shutemov)

    - AMD hardware RAM encryption support (SME/SEV) fixes (Tom Lendacky)

    - Preparatory commits for hardware-encrypted RAM support on Intel CPUs.
    (Kirill A. Shutemov)

    - Improved Intel-MID support (Andy Shevchenko)

    - Show EFI page tables in page_tables debug files (Andy Lutomirski)

    - ... plus misc fixes and smaller cleanups

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    x86/cpu/tme: Fix spelling: "configuation" -> "configuration"
    x86/boot: Fix SEV boot failure from change to __PHYSICAL_MASK_SHIFT
    x86/mm: Update comment in detect_tme() regarding x86_phys_bits
    x86/mm/32: Remove unused node_memmap_size_bytes() & CONFIG_NEED_NODE_MEMMAP_SIZE logic
    x86/mm: Remove pointless checks in vmalloc_fault
    x86/platform/intel-mid: Add special handling for ACPI HW reduced platforms
    ACPI, x86/boot: Introduce the ->reduced_hw_early_init() ACPI callback
    ACPI, x86/boot: Split out acpi_generic_reduce_hw_init() and export
    x86/pconfig: Provide defines and helper to run MKTME_KEY_PROG leaf
    x86/pconfig: Detect PCONFIG targets
    x86/tme: Detect if TME and MKTME is activated by BIOS
    x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
    x86/boot/compressed/64: Use page table in trampoline memory
    x86/boot/compressed/64: Use stack from trampoline memory
    x86/boot/compressed/64: Make sure we have a 32-bit code segment
    x86/mm: Do not use paravirtualized calls in native_set_p4d()
    kdump, vmcoreinfo: Export pgtable_l5_enabled value
    x86/boot/compressed/64: Prepare new top-level page table for trampoline
    x86/boot/compressed/64: Set up trampoline memory
    x86/boot/compressed/64: Save and restore trampoline memory
    ...

    Linus Torvalds
     
  • Using this helper allows us to avoid the in-kernel calls to the
    sys_readahead() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_readahead().
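
    The pattern is simple enough to sketch in a few lines. This is an
    illustrative stand-in (the `_demo` names are hypothetical, not kernel
    symbols): the syscall body moves into a kernel-internal helper with the
    same calling convention, the syscall entry point becomes a thin
    pass-through, and in-kernel callers switch to the helper.

```c
/* Kernel-internal helper: drop-in replacement for the syscall. */
static long ksys_readahead_demo(int fd, long long offset,
                                unsigned long count)
{
    (void)fd; (void)offset; (void)count;
    /* actual readahead logic would live here */
    return 0;
}

/* Userspace-facing entry point: nothing but a pass-through. */
long sys_readahead_demo(int fd, long long offset, unsigned long count)
{
    return ksys_readahead_demo(fd, offset, count);
}
```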

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using this helper allows us to avoid the in-kernel calls to the
    sys_mmap_pgoff() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_mmap_pgoff().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_fadvise64_64() helper allows us to avoid the in-kernel
    calls to the sys_fadvise64_64() syscall. The ksys_ prefix denotes that
    this function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_fadvise64_64().

    Some compat stubs called sys_fadvise64(), which then just passed through
    the arguments to sys_fadvise64_64(). Get rid of this indirection, and call
    ksys_fadvise64_64() directly.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the mm-internal kernel_[sg]et_mempolicy() helper allows us to get
    rid of the mm-internal calls to the sys_[sg]et_mempolicy() syscalls.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the mm-internal kernel_mbind() helper allows us to get rid of the
    mm-internal call to the sys_mbind() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Move compat_sys_move_pages() to mm/migrate.c and make it call a newly
    introduced helper -- kernel_move_pages() -- instead of the syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Move compat_sys_migrate_pages() to mm/mempolicy.c and make it call a newly
    introduced helper -- kernel_migrate_pages() -- instead of the syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

29 Mar, 2018

5 commits

  • A crash is observed when kmemleak_scan accesses the object->pointer,
    likely due to the following race.

    TASK A                     TASK B                     TASK C
    kmemleak_write
      (with "scan" and
       NOT "scan=on")
    kmemleak_scan()
                               create_object
                               kmem_cache_alloc fails
                               kmemleak_disable
                               kmemleak_do_cleanup
                               kmemleak_free_enabled = 0
                                                          kfree
                                                          kmemleak_free bails out
                                                          (kmemleak_free_enabled is 0)
                                                          slub frees object->pointer
    update_checksum
      crash - object->pointer
      freed (DEBUG_PAGEALLOC)

    kmemleak_do_cleanup waits for the scan thread to complete, but not for
    direct call to kmemleak_scan via kmemleak_write. So add a wait for
    kmemleak_scan completion before disabling kmemleak_free, and while at it
    fix the comment on stop_scan_thread.

    [vinmenon@codeaurora.org: fix stop_scan_thread comment]
    Link: http://lkml.kernel.org/r/1522219972-22809-1-git-send-email-vinmenon@codeaurora.org
    Link: http://lkml.kernel.org/r/1522063429-18992-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     
  • There are a couple of places where parameter description and function
    name do not match the actual code. Fix it.

    Link: http://lkml.kernel.org/r/1520843448-17347-1-git-send-email-honglei.wang@oracle.com
    Signed-off-by: Honglei Wang
    Acked-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Honglei Wang
     
  • Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
    cause vmstat_update() to report a BUG due to preemption not being
    disabled around smp_processor_id().

    Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

    BUG: using smp_processor_id() in preemptible [00000000] code:
    kworker/1:1/269
    caller is vmstat_update+0x50/0xa0
    CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
    4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
    Workqueue: mm_percpu_wq vmstat_update
    Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

    Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
    Signed-off-by: Steven J. Hill
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Hill
     
  • This patch fixes commit 5f48f0bd4e36 ("mm, page_owner: skip unnecessary
    stack_trace entries").

    If we skip the first two entries, the logic that checks for a count
    value of 2 to detect recursion is broken, and the code will recurse one
    level deep.

    So we need to check for only one call of _RET_IP (__set_page_owner)
    when checking for recursion.

    Current Backtrace while checking for recursion:-

    (save_stack) from (__set_page_owner) // (But recursion returns true here)
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack) // recursion should return true here
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack)
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)

    Correct Backtrace with fix:

    (save_stack) from (__set_page_owner) // recursion returned true here
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack)
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)

    Link: http://lkml.kernel.org/r/1521607043-34670-1-git-send-email-maninder1.s@samsung.com
    Fixes: 5f48f0bd4e36 ("mm, page_owner: skip unnecessary stack_trace entries")
    Signed-off-by: Maninder Singh
    Signed-off-by: Vaneet Narang
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Greg Kroah-Hartman
    Cc: Ayush Mittal
    Cc: Prakash Gupta
    Cc: Vinayak Menon
    Cc: Vasyl Gomonovych
    Cc: Amit Sahrawat
    Cc:
    Cc: Vaneet Narang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maninder Singh
     
  • All the root caches are linked into slab_root_caches, which was
    introduced by commit 510ded33e075 ("slab: implement slab_root_caches
    list"), but that commit failed to add SLAB's kmem_cache to the list.

    While experimenting with opt-in/opt-out kmem accounting, I noticed
    system crashes due to a NULL dereference inside cache_from_memcg_idx()
    while dereferencing kmem_cache.memcg_params.memcg_caches. A clean
    upstream kernel will not see these crashes, but SLAB should be
    consistent with SLUB, which does link its boot caches (kmem_cache_node
    and kmem_cache) into slab_root_caches.

    Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com
    Fixes: 510ded33e075c ("slab: implement slab_root_caches list")
    Signed-off-by: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

27 Mar, 2018

2 commits


26 Mar, 2018

1 commit


23 Mar, 2018

9 commits

  • Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
    madvised allocations") changed the page allocator to no longer detect
    thp allocations based on __GFP_NORETRY.

    It did not, however, modify the mem cgroup try_charge() path to avoid
    oom kill for either khugepaged collapsing or thp faulting. It is never
    expected to oom kill a process to allocate a hugepage for thp; reclaim
    is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
    (and charging) should fallback instead of oom killing processes.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
    flusher wakeup from direct reclaim path") removed the only source of
    flusher's wake up in legacy mem cgroup reclaim path.

    This leads to premature OOM if there are too many dirty pages in the cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • This reverts commit b92df1de5d28 ("mm: page_alloc: skip over regions of
    invalid pfns where possible"). The commit is meant to be a boot init
    speed up skipping the loop in memmap_init_zone() for invalid pfns.

    But given some specific memory mapping on x86_64 (or more generally
    theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
    implementation also skips valid pfns which is plain wrong and causes
    'kernel BUG at mm/page_alloc.c:1389!'

    crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
    kernel BUG at mm/page_alloc.c:1389!
    invalid opcode: 0000 [#1] SMP
    --
    RIP: 0010: move_freepages+0x15e/0x160
    --
    Call Trace:
    move_freepages_block+0x73/0x80
    __rmqueue+0x263/0x460
    get_page_from_freelist+0x7e1/0x9e0
    __alloc_pages_nodemask+0x176/0x420
    --

    crash> page_init_bug -v | grep RAM
    1000 - 9bfff System RAM (620.00 KiB)
    100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
    4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
    4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    100000000 - 67fffffff System RAM ( 22.00 GiB)

    crash> page_init_bug | head -6
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    1fffff00000000 0 1 DMA32 4096 1048575
    505736 505344 505855
    0 0 0 DMA 1 4095
    1fffff00000400 0 1 DMA32 4096 1048575
    BUG, zones differ!

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    ffffea0001e00000 78000000 0 0 0 0
    ffffea0001ed7fc0 7b5ff000 0 0 0 0
    ffffea0001ed8000 7b600000 0 0 0 0 <<<<
    ffffea0001ede1c0 7b787000 0 0 0 0
    ffffea0001ede200 7b788000 0 0 1 1fffff00000000

    Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Signed-off-by: Daniel Vacek
    Acked-by: Ard Biesheuvel
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Paul Burton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vacek
     
  • shmem_unused_huge_shrink() gets called from reclaim path. Waiting for
    page lock may lead to deadlock there.

    There was a bug report that may be attributed to this:

    http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    We can test for PageTransHuge() outside the page lock as we only
    need protection against splitting the page under us. Holding a pin
    on the page is enough for this.
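
    The fix pattern is easy to model: in a reclaim-path shrinker, a
    blocking lock_page() becomes trylock_page(), and pages that cannot be
    locked are skipped rather than waited on. The sketch below uses a
    plain flag instead of the kernel's real page lock, so the names with
    `_demo` are illustrative.

```c
struct page_demo { int locked; int processed; };

/* Non-blocking acquire: returns 1 on success, 0 if already locked. */
static int trylock_page_demo(struct page_demo *p)
{
    if (p->locked)
        return 0;
    p->locked = 1;
    return 1;
}

/* Scan pass: never blocks; skipped pages are retried on the next scan. */
static void scan_demo(struct page_demo *pages, int n)
{
    for (int i = 0; i < n; i++) {
        if (!trylock_page_demo(&pages[i]))
            continue;            /* someone else holds the lock; skip */
        pages[i].processed = 1;  /* split/shrink work would go here */
        pages[i].locked = 0;     /* unlock_page() */
    }
}
```

    The deadlock disappears because the scan can no longer sleep on a
    lock held by the very task it is trying to reclaim on behalf of.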

    Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Eric Wheeler
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: Hugh Dickins
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • deferred_split_scan() gets called from reclaim path. Waiting for page
    lock may lead to deadlock there.

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
    Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • khugepaged is not yet able to convert PTE-mapped huge pages back to
    PMD-mapped ones. We do not collapse such pages; see the check in
    khugepaged_scan_pmd().

    But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
    somebody managed to instantiate THP in the range and then split the PMD
    back to PTEs we would have a problem --
    VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.

    It's possible since we drop mmap_sem during collapse to re-take for
    write.

    Replace the VM_BUG_ON() with graceful collapse fail.

    Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
    Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
    Signed-off-by: Kirill A. Shutemov
    Cc: Laura Abbott
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when task exits/file closed,

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.
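
    The arithmetic behind the bug can be shown directly. The helper below
    is an illustrative sketch (the name and check are hypothetical, not
    the kernel's actual fix): converting a page offset to a byte offset
    must be validated before shifting, because a pgoff as large as the one
    passed via remap_file_pages() above overflows a signed 64-bit loff_t.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages */

/* Convert a page offset to a byte offset, rejecting values whose
 * shifted result would not fit in a signed 64-bit loff_t. */
static int pgoff_to_bytes_checked(uint64_t pgoff, int64_t *bytes)
{
    if (pgoff > (uint64_t)INT64_MAX >> PAGE_SHIFT)
        return -1;                       /* would overflow loff_t */
    *bytes = (int64_t)(pgoff << PAGE_SHIFT);
    return 0;
}
```

    With the pgoff from the repro (0x20000000000000, i.e. 2^53), the shift
    by 12 needs 65 bits, so an unchecked conversion wraps to an invalid
    (negative) range, which is exactly the end < start state that trips
    the BUG.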

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Dave Jones reported fs_reclaim lockdep warnings.

    ============================================
    WARNING: possible recursive locking detected
    4.15.0-rc9-backup-debug+ #1 Not tainted
    --------------------------------------------
    sshd/24800 is trying to acquire lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    but task is already holding lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(fs_reclaim);
    lock(fs_reclaim);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by sshd/24800:
    #0: (sk_lock-AF_INET6){+.+.}, at: [] tcp_sendmsg+0x19/0x40
    #1: (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    stack backtrace:
    CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
    Call Trace:
    dump_stack+0xbc/0x13f
    __lock_acquire+0xa09/0x2040
    lock_acquire+0x12e/0x350
    fs_reclaim_acquire.part.102+0x29/0x30
    kmem_cache_alloc+0x3d/0x2c0
    alloc_extent_state+0xa7/0x410
    __clear_extent_bit+0x3ea/0x570
    try_release_extent_mapping+0x21a/0x260
    __btrfs_releasepage+0xb0/0x1c0
    btrfs_releasepage+0x161/0x170
    try_to_release_page+0x162/0x1c0
    shrink_page_list+0x1d5a/0x2fb0
    shrink_inactive_list+0x451/0x940
    shrink_node_memcg.constprop.88+0x4c9/0x5e0
    shrink_node+0x12d/0x260
    try_to_free_pages+0x418/0xaf0
    __alloc_pages_slowpath+0x976/0x1790
    __alloc_pages_nodemask+0x52c/0x5c0
    new_slab+0x374/0x3f0
    ___slab_alloc.constprop.81+0x47e/0x5a0
    __slab_alloc.constprop.80+0x32/0x60
    __kmalloc_track_caller+0x267/0x310
    __kmalloc_reserve.isra.40+0x29/0x80
    __alloc_skb+0xee/0x390
    sk_stream_alloc_skb+0xb8/0x340
    tcp_sendmsg_locked+0x8e6/0x1d30
    tcp_sendmsg+0x27/0x40
    inet_sendmsg+0xd0/0x310
    sock_write_iter+0x17a/0x240
    __vfs_write+0x2ab/0x380
    vfs_write+0xfb/0x260
    SyS_write+0xb6/0x140
    do_syscall_64+0x1e5/0xc05
    entry_SYSCALL64_slow_path+0x25/0x25

    This warning is caused by commit d92a8cfcb37e ("locking/lockdep:
    Rework FS_RECLAIM annotation") which replaced the use of
    lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
    and lockdep_trace_alloc() in slab_pre_alloc_hook() with
    fs_reclaim_acquire()/ fs_reclaim_release().

    Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
    __GFP_NOWARN to gfp_mask, and the reclaim path simply propagates
    __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() tries
    to grab the 'fake' lock again while __perform_reclaim() already
    holds the 'fake' lock.

    The

    /* this guy won't enter reclaim */
    if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
    return false;

    test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
    was added by commit cf40bd16fdad ("lockdep: annotate reclaim context
    (__GFP_NOFS)"). But that test is outdated, because a PF_MEMALLOC thread
    won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
    341ce06f69ab ("page allocator: calculate the alloc_flags for allocation
    only once") added the PF_MEMALLOC safeguard (

    /* Avoid recursion of direct reclaim */
    if (p->flags & PF_MEMALLOC)
    goto nopage;

    in __alloc_pages_slowpath()).

    Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC check
    and allow __need_fs_reclaim() to return false.

    Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
    Fixes: d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation")
    Signed-off-by: Tetsuo Handa
    Reported-by: Dave Jones
    Tested-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Nikolay Borisov
    Cc: Michal Hocko
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Alexander reported a use of uninitialized memory in __mpol_equal(),
    caused by incorrect use of preferred_node.

    When a mempolicy is in mode MPOL_PREFERRED with the MPOL_F_LOCAL flag,
    numa_node_id() is used instead of preferred_node; __mpol_equal(),
    however, reads preferred_node without first checking for MPOL_F_LOCAL.

    [akpm@linux-foundation.org: slight comment tweak]
    Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
    Fixes: fc36b8d3d819 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
    Signed-off-by: Yisheng Xie
    Reported-by: Alexander Potapenko
    Tested-by: Alexander Potapenko
    Reviewed-by: Andrew Morton
    Cc: Dmitriy Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

20 Mar, 2018

3 commits

  • Pull percpu fixes from Tejun Heo:
    "Late percpu pull request for v4.16-rc6.

    - percpu allocator pool replenishing no longer triggers OOM or
    warning messages.

    Also, the alloc interface now understands __GFP_NORETRY and
    __GFP_NOWARN. This is to allow avoiding OOMs from userland
    triggered actions like bpf map creation.

    Also added cond_resched() in alloc loop.

    - percpu allocation now can be interrupted by kill sigs to avoid
    deadlocking OOM killer.

    - Added Dennis Zhou as a co-maintainer.

    He has rewritten the area map allocator, understands most of the
    code base and has been responsive for all bug reports"

    * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
    mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
    percpu: include linux/sched.h for cond_resched()
    percpu: add a schedule point in pcpu_balance_workfn()
    percpu: allow select gfp to be passed to underlying allocators
    percpu: add __GFP_NORETRY semantics to the percpu balancing path
    percpu: match chunk allocator declarations with definitions
    percpu: add Dennis Zhou as a percpu co-maintainer

    Linus Torvalds
     
  • In case of memory deficit and low percpu memory pages,
    pcpu_balance_workfn() holds pcpu_alloc_mutex for a long
    time (as it makes memory allocations itself and waits
    for memory reclaim). If tasks doing pcpu_alloc() are
    chosen by the OOM killer, they can't exit, because they
    are waiting for the mutex.

    This patch makes pcpu_alloc() care about fatal signals
    and use mutex_lock_killable() when the GFP flags allow
    it. This guarantees a task does not miss SIGKILL
    from the OOM killer.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Tejun Heo

    Kirill Tkhai
     
  • The microblaze build broke due to the missing declaration for the
    recently added cond_resched() invocation. Let's include linux/sched.h
    explicitly.

    Signed-off-by: Tejun Heo
    Reported-by: kbuild test robot

    Tejun Heo
     

18 Mar, 2018

4 commits

  • ADI is a new feature supported on SPARC M7 and newer processors to allow
    hardware to catch rogue accesses to memory. ADI is supported for data
    fetches only and not instruction fetches. An app can enable ADI on its
    data pages, set version tags on them and use versioned addresses to
    access the data pages. Upper bits of the address contain the version
    tag. On M7 processors, upper four bits (bits 63-60) contain the version
    tag. If a rogue app attempts to access ADI enabled data pages, its
    access is blocked and processor generates an exception. Please see
    Documentation/sparc/adi.txt for further details.

    This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
    MCD (Memory Corruption Detection) on selected memory ranges, enable
    TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
    version tags on page swap out/in or migration. ADI is not enabled by
    default for any task. A task must explicitly enable ADI on a memory
    range and set version tag for ADI to be effective for the task.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • When protection bits are changed on a VMA, some of the architecture-
    specific flags should be cleared as well. An example of this is the
    PKEY flags on x86. This patch generalizes the current code that clears
    PKEY flags for x86, so that other architectures can provide similar
    functionality.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • A protection flag may not be valid across the entire address space,
    and hence arch_validate_prot() might need the address a protection bit
    is being set on to ensure it is a valid protection flag. For example, sparc
    processors support memory corruption detection (as part of ADI feature)
    flag on memory addresses mapped on to physical RAM but not on PFN mapped
    pages or addresses mapped on to devices. This patch adds address to the
    parameters being passed to arch_validate_prot() so protection bits can
    be validated in the relevant context.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • If a processor supports special metadata for a page, for example ADI
    version tags on SPARC M7, this metadata must be saved when the page is
    swapped out. The same metadata must be restored when the page is swapped
    back in. This patch adds two new architecture specific functions -
    arch_do_swap_page() to be called when a page is swapped in, and
    arch_unmap_one() to be called when a page is being unmapped for swap
    out. These architecture hooks allow page metadata to be saved if the
    architecture supports it.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Acked-by: Jerome Marchand
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

16 Mar, 2018

2 commits

  • Tile was the only remaining architecture to implement alloc_remap(),
    and since that is being removed, there is no point in keeping this
    function.

    Removing all callers simplifies the mem_map handling.

    Reviewed-by: Pavel Tatashin
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • The CONFIG_MPU option was only defined on blackfin, and that architecture
    is now being removed, so the respective code can be simplified.

    A lot of other microcontrollers have an MPU, but I suspect that if we
    want to bring that support back, we'd do it differently anyway.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

15 Mar, 2018

2 commits

  • This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.

    Commit 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock
    alignment") modified the logic in memmap_init_zone() to initialize
    struct pages associated with invalid PFNs, to appease a VM_BUG_ON()
    in move_freepages(), which is redundant by its own admission, and
    dereferences struct page fields to obtain the zone without checking
    whether the struct pages in question are valid to begin with.

    Commit 864b75f9d6b0 only makes it worse, since the rounding it does
    may cause pfn to assume the same value it had in a prior iteration of
    the loop, resulting in an infinite loop and a hang very early in the
    boot. Also, since it doesn't perform the same rounding on start_pfn
    itself but only on intermediate values following an invalid PFN, we
    may still hit the same VM_BUG_ON() as before.

    So instead, let's fix this at the core, and ensure that the BUG
    check doesn't dereference struct page fields of invalid pages.

    Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
    Tested-by: Jan Glauber
    Tested-by: Shanker Donthineni
    Cc: Daniel Vacek
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Paul Burton
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc: Andrew Morton
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • Thomas Gleixner
     

10 Mar, 2018

1 commit

  • Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
    where possible") introduced a bug where move_freepages() triggers a
    VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
    To fix this, simply align the skipped pfns in memmap_init_zone() the
    same way as in move_freepages_block().

    Seen in one of the RHEL reports:

    crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
    kernel BUG at mm/page_alloc.c:1389!
    invalid opcode: 0000 [#1] SMP
    --
    RIP: 0010:[] [] move_freepages+0x15e/0x160
    RSP: 0018:ffff88054d727688 EFLAGS: 00010087
    --
    Call Trace:
    [] move_freepages_block+0x73/0x80
    [] __rmqueue+0x263/0x460
    [] get_page_from_freelist+0x7e1/0x9e0
    [] __alloc_pages_nodemask+0x176/0x420
    --
    RIP [] move_freepages+0x15e/0x160
    RSP

    crash> page_init_bug -v | grep RAM
    1000 - 9bfff System RAM (620.00 KiB)
    100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
    4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
    4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    100000000 - 67fffffff System RAM ( 22.00 GiB)

    crash> page_init_bug | head -6
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    1fffff00000000 0 1 DMA32 4096 1048575
    505736 505344 505855
    0 0 0 DMA 1 4095
    1fffff00000400 0 1 DMA32 4096 1048575
    BUG, zones differ!

    Note that this range follows two unpopulated sections
    68000000-77ffffff in this zone. 7b788000-7b7fffff is the first one
    after a gap. This makes memmap_init_zone() skip all the pfns up to the
    beginning of this range. But this range is not pageblock (2M) aligned.
    In fact no range has to be.

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    ffffea0001e00000 78000000 0 0 0 0
    ffffea0001ed7fc0 7b5ff000 0 0 0 0
    ffffea0001ed8000 7b600000 0 0 0 0 <<<<
    ffffea0001ede1c0 7b787000 0 0 0 0
    ffffea0001ede200 7b788000 0 0 1 1fffff00000000

    Top part of page flags should contain nodeid and zonenr, which is not
    the case for page ffffea0001ed8000 here (<<<).

    crash> log | grep -o fffea0001ed[^\ ]* | sort -u
    fffea0001ed8000
    fffea0001eded20
    fffea0001edffc0

    crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
    fffea0001ed8000
    fffea0001eded00
    fffea0001eded20
    fffea0001edffc0

    Initialization of the whole beginning of the section is skipped up to
    the start of the range due to commit b92df1de5d28. Now any code
    calling move_freepages_block() (like reusing the page from a freelist as
    in this example) with a page from the beginning of the range will get
    the page rounded down to start_page ffffea0001ed8000 and passed to
    move_freepages() which crashes on assertion getting wrong zonenr.

    > VM_BUG_ON(page_zone(start_page) != page_zone(end_page));

    Note, page_zone() derives the zone from page flags here.

    From similar machine before commit b92df1de5d28:

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    fffff73941e00000 78000000 0 0 1 1fffff00000000
    fffff73941ed7fc0 7b5ff000 0 0 1 1fffff00000000
    fffff73941ed8000 7b600000 0 0 1 1fffff00000000
    fffff73941edff80 7b7fe000 0 0 1 1fffff00000000
    fffff73941edffc0 7b7ff000 ffff8e67e04d3ae0 ad84 1 1fffff00020068 uptodate,lru,active,mappedtodisk

    All the pages since the beginning of the section are initialized, so
    move_freepages() is not going to blow up.

    The same machine with this fix applied:

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    ffffea0001e00000 78000000 0 0 0 0
    ffffea0001e00000 7b5ff000 0 0 0 0
    ffffea0001ed8000 7b600000 0 0 1 1fffff00000000
    ffffea0001edff80 7b7fe000 0 0 1 1fffff00000000
    ffffea0001edffc0 7b7ff000 ffff88017fb13720 8 2 1fffff00020068 uptodate,lru,active,mappedtodisk

    At least the bare minimum of pages is initialized preventing the crash
    as well.

    Customers started to report this as soon as 7.4 (where b92df1de5d28 was
    merged in RHEL) was released. I remember reports from
    September/October-ish times. It's not easily reproduced and happens on
    a handful of machines only. I guess that's why. But that does not make
    it less serious, I think.

    Though there actually is a report here:
    https://bugzilla.kernel.org/show_bug.cgi?id=196443

    And there are reports for Fedora from July:
    https://bugzilla.redhat.com/show_bug.cgi?id=1473242
    and CentOS:
    https://bugs.centos.org/view.php?id=13964
    and we internally track several dozen reports for RHEL bug
    https://bugzilla.redhat.com/show_bug.cgi?id=1525121

    Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Signed-off-by: Daniel Vacek
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Paul Burton
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vacek