28 Feb, 2018

3 commits

  • commit b7f0554a56f21fb3e636a627450a9add030889be upstream.

    Until there is a solution to the dma-to-dax vs truncate problem it is
    not safe to allow V4L2, Exynos, and other frame vector users to create
    long-standing / irrevocable memory registrations against filesystem-dax
    vmas.

    [dan.j.williams@intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan]
    Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Inki Dae
    Cc: Seung-Woo Kim
    Cc: Joonyoung Shim
    Cc: Kyungmin Park
    Cc: Mauro Carvalho Chehab
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Doug Ledford
    Cc: Hal Rosenstock
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: Ross Zwisler
    Cc: Sean Hefty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream.

    Patch series "introduce get_user_pages_longterm()", v2.

    Here is a new get_user_pages api for cases where a driver intends to
    keep an elevated page count indefinitely. This is distinct from usages
    like iov_iter_get_pages where the elevated page counts are transient.
    The iov_iter_get_pages cases immediately turn around and submit the
    pages to a device driver which will put_page when the i/o operation
    completes (under kernel control).

    In the longterm case userspace is responsible for dropping the page
    reference at some undefined point in the future. This is untenable for
    the filesystem-dax case, where the filesystem is in control of the lifetime
    of the block / page and needs reasonable limits on how long it can wait
    for pages in a mapping to become idle.

    Fixing filesystems to actually wait for dax pages to be idle before
    blocks from a truncate/hole-punch operation are repurposed is saved for
    a later patch series.

    Also, allowing longterm registration of dax mappings is a future patch
    series that introduces a "map with lease" semantic where the kernel can
    revoke a lease and force userspace to drop its page references.

    I have also tagged these for -stable to purposely break cases that might
    assume that longterm memory registrations for filesystem-dax mappings
    were supported by the kernel. The behavior regression this policy
    change implies is one of the reasons we maintain the "dax enabled.
    Warning: EXPERIMENTAL, use at your own risk" notification when mounting
    a filesystem in dax mode.

    It is worth noting the device-dax interface does not suffer the same
    constraints since it does not support file space management operations
    like hole-punch.

    This patch (of 4):

    Until there is a solution to the dma-to-dax vs truncate problem it is
    not safe to allow long-standing memory registrations against
    filesystem-dax vmas. Device-dax vmas do not have this problem and are
    explicitly allowed.

    This is temporary until a "memory registration with layout-lease"
    mechanism can be implemented for the affected sub-systems (RDMA and
    V4L2).

    [akpm@linux-foundation.org: use kcalloc()]
    Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Dan Williams
    Suggested-by: Christoph Hellwig
    Cc: Doug Ledford
    Cc: Hal Rosenstock
    Cc: Inki Dae
    Cc: Jan Kara
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: Joonyoung Shim
    Cc: Kyungmin Park
    Cc: Mauro Carvalho Chehab
    Cc: Mel Gorman
    Cc: Ross Zwisler
    Cc: Sean Hefty
    Cc: Seung-Woo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit d0f0931de936a0a468d7e59284d39581c16d3a73 upstream.

    When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax:
    dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
    pages, they were all added to the end of if() statements after existing
    pmd_trans_huge() checks. So, things like:

    - if (pmd_trans_huge(*pmd))
    + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

    When further checks were added after pmd_trans_unstable() checks by
    commit 7267ec008b5c ("mm: postpone page table allocation until we have
    page to map") they were also added at the end of the conditional:

    + if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

    This ordering is fine for pmd_trans_huge(), but doesn't work for
    pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd()
    check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
    pmd_trans_unstable()), which prints out a warning and returns 1. So, we
    do end up doing the right thing, but only after spamming dmesg with
    suspicious-looking messages:

    mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)

    Reorder these checks in a helper so that pmd_devmap() is checked first,
    avoiding the error messages, and add a comment explaining why the
    ordering is important.

    Fixes: commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
    Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Pawel Lebioda
    Cc: "Darrick J. Wong"
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: "Kirill A . Shutemov"
    Cc: Dave Jiang
    Cc: Xiong Zhou
    Cc: Eryu Guan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ross Zwisler
     

25 Feb, 2018

5 commits

  • commit f1f5929cd9715c1cdfe07a890f12ac7d2c5304ec upstream.

    Compiling shmem.c with SHMEM and TRANSPARENT_HUGE_PAGECACHE enabled
    raises warnings on two unused functions when CONFIG_TMPFS and
    CONFIG_SYSFS are both disabled:

    mm/shmem.c:390:20: warning: `shmem_format_huge' defined but not used [-Wunused-function]
    static const char *shmem_format_huge(int huge)
    ^~~~~~~~~~~~~~~~~
    mm/shmem.c:373:12: warning: `shmem_parse_huge' defined but not used [-Wunused-function]
    static int shmem_parse_huge(const char *str)
    ^~~~~~~~~~~~~~~~

    A conditional compilation on tmpfs or sysfs removes the warnings.
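
    As a standalone illustration of the technique (a sketch only; the
    FEATURE_* macros below are made-up stand-ins for CONFIG_TMPFS and
    CONFIG_SYSFS, not the actual mm/shmem.c hunk), helpers that are only
    used by optional features can be compiled out so -Wunused-function
    stays quiet when both features are off:

    #include <stdio.h>

    /* stand-ins for CONFIG_SYSFS / CONFIG_TMPFS; flip to 1 to compile the helper in */
    #define FEATURE_SYSFS 0
    #define FEATURE_TMPFS 0

    #if FEATURE_SYSFS || FEATURE_TMPFS
    /* only referenced by the optional sysfs/tmpfs code paths */
    static const char *format_huge(int huge)
    {
            return huge ? "always" : "never";
    }
    #endif

    int main(void)
    {
    #if FEATURE_SYSFS || FEATURE_TMPFS
            printf("huge=%s\n", format_huge(1));
    #endif
            return 0;
    }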

    Link: http://lkml.kernel.org/r/20161118055749.11313-1-jeremy.lefaure@lse.epita.fr
    Signed-off-by: Jérémy Lefaure
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jérémy Lefaure
     
  • commit 23f919d4ad0eb325595f10f55be4301b2965d6d6 upstream.

    After enabling -Wmaybe-uninitialized warnings, we get a false-positive
    warning for shmem:

    mm/shmem.c: In function `shmem_getpage_gfp':
    include/linux/spinlock.h:332:21: error: `info' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    This can be easily avoided, since the correct 'info' pointer is known at
    the time we first enter the function, so we can simply move the
    initialization up. Moving it before the first label avoids the warning
    and lets us remove two later initializations.

    Note that the function is so hard to read that it not only confuses the
    compiler, but also most readers; without this patch it could easily
    break if one of the 'goto's changed.
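
    A compilable sketch of the general pattern (the names are hypothetical,
    not the shmem code): when a variable is only assigned on some paths
    before a backward goto, -Wmaybe-uninitialized can fire at the use;
    hoisting the assignment above the first label silences the warning and
    makes later re-assignments redundant:

    #include <stdio.h>

    struct info { int value; };

    static struct info *lookup_info(int key)
    {
            static struct info cached = { 42 };
            (void)key;
            return &cached;
    }

    static int getpage_like(int key, int retries)
    {
            struct info *info = lookup_info(key);   /* moved above the label */
            int found;

    repeat:
            found = (info->value == 42);
            if (!found && retries-- > 0)
                    goto repeat;                    /* backward jump that confused the analysis */
            return found;
    }

    int main(void)
    {
            printf("%d\n", getpage_like(1, 3));
            return 0;
    }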

    Link: https://www.spinics.net/lists/kernel/msg2368133.html
    Link: http://lkml.kernel.org/r/20161024205725.786455-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Andreas Gruenbacher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • [ Upstream commit 7f6f60a1ba52538c16f26930bfbcfe193d9d746a ]

    earlyprintk=efi,keep does not work any more: it triggers the warning
    in mm/early_ioremap.c, WARN_ON(system_state != SYSTEM_BOOTING), and
    boot just hangs because the warning is printed via earlyprintk from
    within the earlyprintk implementation code itself.

    This is caused by a newly introduced middle state in:

    69a78ff226fe ("init: Introduce SYSTEM_SCHEDULING state")

    early_ioremap() is fine in both the SYSTEM_BOOTING and SYSTEM_SCHEDULING
    states, so the original condition should be updated accordingly.

    Signed-off-by: Dave Young
    Acked-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: bp@suse.de
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20171209041610.GA3249@dhcp-128-65.nay.redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dave Young
     
  • commit f35157417215ec138c920320c746fdb3e04ef1d5 upstream.

    Provide a function, kmemdup_nul(), that will create a NUL-terminated string
    from an unterminated character array where the length is known in advance.

    This is better than kstrndup() in situations where we already know the
    string length as the strnlen() in kstrndup() is superfluous.
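
    A userspace sketch of the same idea (kmemdup_nul() itself is a kernel
    helper that uses kmalloc() and gfp flags; the function below is only an
    illustration): duplicate a known-length, possibly unterminated buffer
    and NUL-terminate it, skipping the extra length scan a kstrndup()-style
    helper would perform:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *memdup_nul(const char *s, size_t len)
    {
            char *buf = malloc(len + 1);

            if (!buf)
                    return NULL;
            memcpy(buf, s, len);      /* no strnlen() pass: the length is already known */
            buf[len] = '\0';
            return buf;
    }

    int main(void)
    {
            const char raw[4] = { 'd', 'a', 't', 'a' };   /* not NUL-terminated */
            char *s = memdup_nul(raw, sizeof(raw));

            if (s) {
                    printf("%s\n", s);
                    free(s);
            }
            return 0;
    }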

    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit bb422a738f6566f7439cd347d54e321e4fe92a9f upstream.

    Syzbot caught an oops at unregister_shrinker() because the combination of
    commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
    injection made register_shrinker() fail, and the caller of
    register_shrinker() did not check for failure.

    ----------
    [ 554.881422] FAULT_INJECTION: forcing a failure.
    [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
    [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.881445] Call Trace:
    [ 554.881459] dump_stack+0x194/0x257
    [ 554.881474] ? arch_local_irq_restore+0x53/0x53
    [ 554.881486] ? find_held_lock+0x35/0x1d0
    [ 554.881507] should_fail+0x8c0/0xa40
    [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
    [ 554.881537] ? check_noncircular+0x20/0x20
    [ 554.881546] ? find_next_zero_bit+0x2c/0x40
    [ 554.881560] ? ida_get_new_above+0x421/0x9d0
    [ 554.881577] ? find_held_lock+0x35/0x1d0
    [ 554.881594] ? __lock_is_held+0xb6/0x140
    [ 554.881628] ? check_same_owner+0x320/0x320
    [ 554.881634] ? lock_downgrade+0x990/0x990
    [ 554.881649] ? find_held_lock+0x35/0x1d0
    [ 554.881672] should_failslab+0xec/0x120
    [ 554.881684] __kmalloc+0x63/0x760
    [ 554.881692] ? lock_downgrade+0x990/0x990
    [ 554.881712] ? register_shrinker+0x10e/0x2d0
    [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
    [ 554.881737] register_shrinker+0x10e/0x2d0
    [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
    [ 554.881755] ? _down_write_nest_lock+0x120/0x120
    [ 554.881765] ? memcpy+0x45/0x50
    [ 554.881785] sget_userns+0xbcd/0xe20
    (...snipped...)
    [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
    [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 554.898732] general protection fault: 0000 [#1] SMP KASAN
    [ 554.898737] Dumping ftrace buffer:
    [ 554.898741] (ftrace buffer empty)
    [ 554.898743] Modules linked in:
    [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
    [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
    [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
    [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
    [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
    [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
    [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
    [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
    [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    [ 554.898818] Call Trace:
    [ 554.898828] unregister_shrinker+0x79/0x300
    [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
    [ 554.898844] ? down_write+0x87/0x120
    [ 554.898851] ? deactivate_super+0x139/0x1b0
    [ 554.898857] ? down_read+0x150/0x150
    [ 554.898864] ? check_same_owner+0x320/0x320
    [ 554.898875] deactivate_locked_super+0x64/0xd0
    [ 554.898883] deactivate_super+0x141/0x1b0
    ----------

    Since allowing register_shrinker() callers to call unregister_shrinker()
    when register_shrinker() failed can simplify the error recovery path, this
    patch makes unregister_shrinker() a no-op when register_shrinker() failed.
    Also, reset shrinker->nr_deferred in case unregister_shrinker() is
    erroneously called twice.
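
    A minimal userspace sketch of that pattern (the field and function names
    are stand-ins, not the vmscan code): a failed setup leaves the per-object
    state NULL, and teardown both bails out on NULL and resets the pointer so
    a second call stays harmless:

    #include <stdio.h>
    #include <stdlib.h>

    struct shrinker_like {
            long *nr_deferred;     /* NULL means "never successfully registered" */
    };

    static int register_it(struct shrinker_like *s, int simulate_failure)
    {
            if (simulate_failure)
                    return -1;                       /* nr_deferred stays NULL */
            s->nr_deferred = calloc(4, sizeof(long));
            return s->nr_deferred ? 0 : -1;
    }

    static void unregister_it(struct shrinker_like *s)
    {
            if (!s->nr_deferred)                     /* no-op after a failed register */
                    return;
            free(s->nr_deferred);
            s->nr_deferred = NULL;                   /* keeps a double unregister harmless */
    }

    int main(void)
    {
            struct shrinker_like s = { NULL };

            if (register_it(&s, 1))
                    fprintf(stderr, "register failed\n");
            unregister_it(&s);    /* safe even though register failed */
            unregister_it(&s);    /* safe even though already called  */
            return 0;
    }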

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Aliaksei Karaliou
    Reported-by: syzbot
    Cc: Glauber Costa
    Cc: Al Viro
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

22 Feb, 2018

1 commit

  • commit af27d9403f5b80685b79c88425086edccecaf711 upstream.

    We get a warning about some slow configurations in randconfig kernels:

    mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]

    The warning is reasonable by itself, but gets in the way of randconfig
    build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.
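
    A toy version of that approach (COMPILE_TEST below is a stand-in macro
    for CONFIG_COMPILE_TEST and the message is abbreviated): the diagnostic
    still fires for real configurations but stays quiet for compile-test
    builds:

    #include <stdio.h>

    #define COMPILE_TEST 1     /* stand-in for CONFIG_COMPILE_TEST */

    #if !COMPILE_TEST
    #warning "Unfortunate NUMA config, growing page-frame for last_cpupid."
    #endif

    int main(void)
    {
            puts("builds cleanly when the compile-test knob is set");
            return 0;
    }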

    The warning was added in 2013 in commit 75980e97dacc ("mm: fold
    page->_last_nid into page->flags where possible").

    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     

04 Feb, 2018

1 commit

  • [ Upstream commit bde5f6bc68db51128f875a756e9082a6c6ff7b4c ]

    kmemleak_scan() will scan struct page for each node, which can be really
    large, resulting in a soft lockup. We have seen a soft lockup when
    doing a scan while compiling a kernel:

    watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [bash:10287]
    [...]
    Call Trace:
    kmemleak_scan+0x21a/0x4c0
    kmemleak_write+0x312/0x350
    full_proxy_write+0x5a/0xa0
    __vfs_write+0x33/0x150
    vfs_write+0xad/0x1a0
    SyS_write+0x52/0xc0
    do_syscall_64+0x61/0x1a0
    entry_SYSCALL64_slow_path+0x25/0x25

    Fix this by adding a cond_resched() call every MAX_SCAN_SIZE.

    Link: http://lkml.kernel.org/r/1511439788-20099-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Catalin Marinas
    Acked-by: Catalin Marinas
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     

31 Jan, 2018

5 commits

  • commit c73322d098e4b6f5f0f0fa1330bf57e218775539 upstream.

    Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".

    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage. We have seen similar cases at Facebook.

    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages. In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met. Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.

    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer. If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.

    Patch #1 fixes the immediate problem Jia reported, the remainder are
    smaller fixlets, cleanups, and overall phasing out of the old method.

    Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.

    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.

    This patch (of 9):

    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:

    $ echo 4000 >/proc/sys/vm/nr_hugepages

    top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
    Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
    KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.

    Kswapd needs to back away from nodes that fail to balance. Up until
    commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism. It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them. This guard was erroneously
    removed as the patch changed the definition of a balanced node.

    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met. Kswapd would stay awake anyway.

    Introduce a new and much simpler way of backing off. If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node. This is the same number of shots
    direct reclaim takes before declaring OOM. Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.
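
    A userspace sketch of that back-off rule (the structure and helper names
    are stand-ins; in the kernel the counter lives in per-node state and is
    also reset when direct reclaim makes progress):

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_RECLAIM_RETRIES 16

    struct node_state {
            int kswapd_failures;   /* consecutive runs that reclaimed nothing */
    };

    static bool node_reclaimable(const struct node_state *n)
    {
            return n->kswapd_failures < MAX_RECLAIM_RETRIES;
    }

    static void record_run(struct node_state *n, unsigned long nr_reclaimed)
    {
            if (nr_reclaimed)
                    n->kswapd_failures = 0;          /* any progress resets the counter */
            else
                    n->kswapd_failures++;
    }

    int main(void)
    {
            struct node_state n = { 0 };

            while (node_reclaimable(&n))
                    record_run(&n, 0);               /* simulate a node with nothing to reclaim */

            printf("kswapd backs off after %d empty runs\n", n.kswapd_failures);
            return 0;
    }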

    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
    Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
    Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Reported-by: Jia He
    Tested-by: Jia He
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Dmitry Shmidt
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit b050e3769c6b4013bb937e879fc43bf1847ee819 upstream.

    Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
    order-0 allocations"), __zone_watermark_ok() check for high-order
    allocations will shortcut per-migratetype free list checks for
    ALLOC_HARDER allocations, and return true as long as there's free page
    of any migratetype. The intention is that ALLOC_HARDER can allocate
    from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.

    However, as a side effect, the watermark check will then also return
    true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
    CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
    allocation cannot actually obtain isolated pages, and might not be able
    to obtain CMA pages, this can result in a false positive.

    The condition should be rare and perhaps the outcome is not a fatal one.
    Still, it's better if the watermark check is correct. There also
    shouldn't be a performance tradeoff here.

    Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
    Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations")
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit e048cb32f69038aa1c8f11e5c1b331be4181659d upstream.

    The align_offset parameter is used by bitmap_find_next_zero_area_off()
    to represent the offset of map's base from the previous alignment
    boundary; the function ensures that the returned index, plus the
    align_offset, honors the specified align_mask.

    The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical
    address, not CMA region position") has the cma driver calculate the
    offset to the *next* alignment boundary. In most cases, the base
    alignment is greater than that specified when making allocations,
    resulting in a zero offset whether we align up or down. In the example
    given with the commit, the base alignment (8MB) was half the requested
    alignment (16MB) so the math also happened to work since the offset is
    8MB in both directions. However, when requesting allocations with an
    alignment greater than twice that of the base, the returned index would
    not be correctly aligned.
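
    A small worked example of the offset arithmetic (a sketch, not the
    cma_bitmap_aligned_offset() code): the bitmap search needs the base's
    offset past the previous alignment boundary, and rounding toward the
    next boundary only happens to agree while the requested alignment is no
    more than twice the base alignment:

    #include <stdio.h>

    static unsigned long off_from_prev(unsigned long base, unsigned long align)
    {
            return base & (align - 1);                           /* distance past previous boundary */
    }

    static unsigned long off_to_next(unsigned long base, unsigned long align)
    {
            return (align - (base & (align - 1))) & (align - 1); /* distance up to next boundary */
    }

    int main(void)
    {
            unsigned long base = 8UL << 20;    /* region base aligned to 8MB */

            /* 16MB request: both give 8MB, so the wrong formula stays hidden */
            printf("align 16MB: prev=%luMB next=%luMB\n",
                   off_from_prev(base, 16UL << 20) >> 20,
                   off_to_next(base, 16UL << 20) >> 20);

            /* 32MB request: 8MB vs 24MB, so the returned index is misaligned */
            printf("align 32MB: prev=%luMB next=%luMB\n",
                   off_from_prev(base, 32UL << 20) >> 20,
                   off_to_next(base, 32UL << 20) >> 20);
            return 0;
    }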

    Also, the align_order arguments of cma_bitmap_aligned_mask() and
    cma_bitmap_aligned_offset() should not be negative so the argument type
    was made unsigned.

    Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position")
    Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com
    Signed-off-by: Angus Clark
    Signed-off-by: Doug Berger
    Acked-by: Gregory Fong
    Cc: Doug Berger
    Cc: Angus Clark
    Cc: Laura Abbott
    Cc: Vlastimil Babka
    Cc: Greg Kroah-Hartman
    Cc: Lucas Stach
    Cc: Catalin Marinas
    Cc: Shiraz Hashim
    Cc: Jaewon Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Doug Berger
     
  • commit 18365225f0440d09708ad9daade2ec11275c3df9 upstream.

    Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he has hit a bad_page("page still charged to
    cgroup") when onlining a hwpoison page. While this looks like something
    that shouldn't happen in the first place, because onlining hwpages and
    returning them to the page allocator makes only little sense, it shows a
    real problem.

    hwpoison pages do not get freed usually so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 561b5e0709e4a248c67d024d4d94b6e31e3edf2f upstream.

    Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
    introduced a regression in some Rust and Java environments which are
    trying to implement their own stack guard page. They are punching a new
    MAP_FIXED mapping inside the existing stack vma.

    This will confuse expand_{downwards,upwards} into thinking that the
    stack expansion would in fact get us too close to an existing non-stack
    vma, which is correct behavior wrt safety. It is a real regression on
    the other hand.

    Let's work around the problem by considering a PROT_NONE mapping as a part
    of the stack. This is a gross hack, but overflowing into such a mapping
    would trap anyway, and we can only hope that userspace knows what it is
    doing and handles it properly.
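
    For reference, a minimal userspace example of what such runtimes do (a
    plain anonymous area stands in for a real thread stack): a PROT_NONE
    mapping is punched into the existing area with MAP_FIXED, and with this
    workaround the kernel counts that vma as part of the stack for the
    guard-gap check:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t sz = 1UL << 20;             /* 1MB area standing in for a thread stack */
            char *area = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (area == MAP_FAILED)
                    return 1;

            /* overlay a PROT_NONE guard page at the low end of the area */
            if (mmap(area, 4096, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
                    return 1;

            printf("guard page installed at %p\n", (void *)area);
            return 0;
    }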

    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

17 Jan, 2018

1 commit

  • commit fd5bb66cd934987e49557455b6497fc006521940 upstream.

    Change the zpool/compressor param callback function to release the
    zswap_pools_lock spinlock before calling param_set_charp, since that
    function may sleep when it calls kmalloc with GFP_KERNEL.

    While this problem has existed for a while, I wasn't able to trigger it
    using a tight loop changing either/both the zpool and compressor params; I
    think it's very unlikely to be an issue on the stable kernels, especially
    since most zswap users will change the compressor and/or zpool from sysfs
    only one time each boot - or zero times, if they add the params to the
    kernel boot.

    Fixes: c99b42c3529e ("zswap: use charp for zswap param strings")
    Link: http://lkml.kernel.org/r/20170126155821.4545-1-ddstreet@ieee.org
    Signed-off-by: Dan Streetman
    Reported-by: Sergey Senozhatsky
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Dan Streetman
     

05 Jan, 2018

1 commit

  • The kaiser update made an interesting choice, never to free any shadow
    page tables. Contention on global spinlock was worrying, particularly
    with it held across page table scans when freeing. Something had to be
    done: I was going to add refcounting; but simply never to free them is
    an appealing choice, minimizing contention without complicating the code
    (the more a page table is found already, the less the spinlock is used).

    But leaking pages in this way is also a worry: can we get away with it?
    At the very least, we need a count to show how bad it actually gets:
    in principle, one might end up wasting about 1/256 of memory that way
    (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512
    for when they are user-mapped from the vmalloc area on another occasion
    (but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed)).

    Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the
    shared pgd entries, and 1 for each intermediate page table added
    thereafter for user-mapping - but leave out the 1 per mm, for its
    shadow pgd, because that distracts from the monotonic increase.
    Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled).

    In practice, it doesn't look so bad so far: more like 1/12000 after
    nine hours of gtests below; and movable pageblock segregation should
    tend to cluster the kaiser tables into a subset of the address space
    (if not, they will be bad for compaction too). But production may
    tell a different story: keep an eye on this number, and bring back
    lighter freeing if it gets out of control (maybe a shrinker).

    ["nr_overhead" should of course say "nr_kaisertable", if it needs
    to stay; but for the moment we are being coy, preferring that when
    Joe Blow notices a new line in his /proc/vmstat, he does not get
    too curious about what this "kaiser" stuff might be.]

    Signed-off-by: Hugh Dickins
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

14 Dec, 2017

3 commits

  • [ Upstream commit 1aedcafbf32b3f232c159b14cd0d423fcfe2b861 ]

    Use BUG_ON(in_interrupt()) in zs_map_object(). This is not a new
    BUG_ON(), it's always been there, but was recently changed to
    VM_BUG_ON(). There are several problems there. First, we use
    per-CPU mappings both in zsmalloc and in zram, and an interrupt may easily
    corrupt those buffers. Second, and more importantly, we believe it's
    possible to start leaking sensitive information. Consider the following
    case:

    -> process P
       swap out
        zram
         per-cpu mapping CPU1
          compress page A
           -> IRQ

             swap out
              zram
               per-cpu mapping CPU1
                compress page B
                write page from per-cpu mapping CPU1 to zsmalloc pool
             iret

    -> process P
       write page from per-cpu mapping CPU1 to zsmalloc pool [*]
       return

    * so we store overwritten data that actually belongs to another
    page (task) and potentially contains sensitive data. And when
    process P later page faults, it's going to read (swap in) that
    other task's data.

    Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit ced108037c2aa542b3ed8b7afd1576064ad1362a upstream.

    In the prot_numa case, we are under down_read(mmap_sem). It's critical
    not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
    which is also under down_read(mmap_sem):

    CPU0:                                CPU1:
                                         change_huge_pmd(prot_numa=1)
                                          pmdp_huge_get_and_clear_notify()
    madvise_dontneed()
     zap_pmd_range()
      pmd_trans_huge(*pmd) == 0 (without ptl)
      // skip the pmd
                                          set_pmd_at();
                                          // pmd is re-established

    The race makes MADV_DONTNEED miss the huge pmd and not clear it,
    which may break userspace.

    Found by code analysis; never seen triggered.

    Link: http://lkml.kernel.org/r/20170302151034.27829-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [jwang: adjust context for 4.9 ]
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 0a85e51d37645e9ce57e5e1a30859e07810ed07c upstream.

    Patch series "thp: fix few MADV_DONTNEED races"

    For MADV_DONTNEED to work properly with huge pages, it's critical to not
    clear pmd intermittently unless you hold down_write(mmap_sem).

    Otherwise MADV_DONTNEED can miss the THP which can lead to userspace
    breakage.

    See example of such race in commit message of patch 2/4.

    All these races were found by code inspection. I haven't seen them
    triggered. I don't think it's worth applying them to stable@.

    This patch (of 4):

    Restructure code in preparation for a fix.

    Link: http://lkml.kernel.org/r/20170302151034.27829-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [jwang: adjust context for 4.9]
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

10 Dec, 2017

2 commits

  • [ Upstream commit 2df26639e708a88dcc22171949da638a9998f3bc ]

    Jia He has noticed that commit b9f00e147f27 ("mm, page_alloc: reduce
    branches in zone_statistics") has an unintentional side effect: remote
    node allocation requests are accounted as NUMA_MISS rather than
    NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.

    There are potentially many of these because the flag is used very rarely
    while we have many users of __alloc_pages_node.

    Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
    follow up patch) and treat all allocations that were satisfied from the
    preferred zone's node as NUMA_HITS because this is the same node we
    requested the allocation from in most cases. If this is not the local
    node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.

    One downside would be that an allocation request for a node which is
    outside of the mempolicy nodemask would be reported as a hit, which is a
    bit weird, but that was the case before b9f00e147f27 already.

    Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
    Link: http://lkml.kernel.org/r/20170102153057.9451-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Jia He
    Reviewed-by: Vlastimil Babka # with cbmc[1] superpowers
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Taku Izumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 687cb0884a714ff484d038e9190edc874edcf146 upstream.

    tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
    space. In this case, tlb->fullmm is true. Some archs like arm64
    don't flush the TLB when tlb->fullmm is true:

    commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"),

    which causes leaking of TLB entries.

    Will clarifies his patch:
    "Basically, we tag each address space with an ASID (PCID on x86) which
    is resident in the TLB. This means we can elide TLB invalidation when
    pulling down a full mm because we won't ever assign that ASID to
    another mm without doing TLB invalidation elsewhere (which actually
    just nukes the whole TLB).

    I think that means that we could potentially not fault on a kernel
    uaccess, because we could hit in the TLB"

    There could be a window between complete_signal() sending an IPI to other
    cores and all threads sharing this mm really being kicked off their cores.
    In this window, the oom reaper may call tlb_flush_mmu_tlbonly() to
    flush the TLB and then free pages. However, due to the above problem, the
    TLB entries are not really flushed on arm64. Other threads can still
    access these pages through stale TLB entries. Moreover, a copy_to_user()
    can also write to these pages without generating a page fault, causing
    use-after-free bugs.

    This patch gathers each vma instead of gathering the full vm space. In
    this case tlb->fullmm is not true. The behavior of the oom reaper becomes
    similar to munmapping before do_exit, which should be safe for all archs.

    Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Wang Nan
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Will Deacon
    Cc: Bob Liu
    Cc: Ingo Molnar
    Cc: Roman Gushchin
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [backported to 4.9 stable tree]
    Signed-off-by: Michal Hocko
    Signed-off-by: Greg Kroah-Hartman

    Wang Nan
     

05 Dec, 2017

4 commits

  • commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91 upstream.

    MADV_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
    Unfortunately madvise_willneed() doesn't communicate this information
    properly to the generic madvise syscall implementation. The calling
    convention is quite subtle there. madvise_vma() is supposed to either
    return an error or update &prev; otherwise the main loop will never
    advance to the next vma and it will keep looping forever without a way
    to get out of the kernel.

    It seems this has been broken since its introduction. Nobody has noticed
    because nobody seems to be using MADV_WILLNEED on these DAX mappings.

    [mhocko@suse.com: rewrite changelog]
    Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
    Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place")
    Signed-off-by: chenjie
    Signed-off-by: guoxuenan
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: zhangyi (F)
    Cc: Miao Xie
    Cc: Mike Rapoport
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Rik van Riel
    Cc: Carsten Otte
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    chenjie
     
  • commit 31383c6865a578834dd953d9dbc88e6b19fe3997 upstream.

    Patch series "device-dax: fix unaligned munmap handling"

    When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and fail attempts to split vmas into unaligned ranges. It
    would be messy to teach the munmap path about device-dax alignment
    constraints in the same (hstate) way that hugetlbfs communicates this
    constraint. Instead, these patches introduce a new ->split() vm
    operation.

    This patch (of 2):

    The device-dax interface has similar constraints as hugetlbfs in that it
    requires the munmap path to unmap in huge page aligned units. Rather
    than add more custom vma handling code in __split_vma() introduce a new
    vm operation to perform this vma specific check.

    Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 63cd448908b5eb51d84c52f02b31b9b4ccd1cb5a upstream.

    If the call to __alloc_contig_migrate_range() in alloc_contig_range()
    returns -EBUSY, processing continues so that test_pages_isolated() is called
    where there is a tracepoint to identify the busy pages. However, it is
    possible for busy pages to become available between the calls to these
    two routines. In this case, the range of pages may be allocated.
    Unfortunately, the original return code (ret == -EBUSY) is still set and
    returned to the caller. Therefore, the caller believes the pages were
    not allocated and they are leaked.

    Update the comment to indicate that allocation is still possible even if
    __alloc_contig_migrate_range returns -EBUSY. Also, clear return code in
    this case so that it is not accidentally used or returned to caller.

    Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com
    Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation failure")
    Signed-off-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 upstream.

    Currently, we unconditionally make the page table entry dirty in touch_pmd().
    It may result in a false-positive can_follow_write_pmd().

    We can avoid the situation by only making the page table entry
    dirty if the caller asks for write access -- FOLL_WRITE.

    The patch also changes touch_pud() in the same way.

    Signed-off-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds
    [Salvatore Bonaccorso: backport for 4.9:
    - Adjust context
    - Drop specific part for PUD-sized transparent hugepages. Support
    for PUD-sized transparent hugepages was added in v4.11-rc1
    ]
    Signed-off-by: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

24 Nov, 2017

2 commits

  • commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.

    This matters at least for the mincore syscall, which will otherwise copy
    uninitialized memory from the page allocator to userspace. It is
    probably also a correctness error for /proc/$pid/pagemap, but I haven't
    tested that.

    Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
    no effect because the caller already checks for that.

    This only reports holes in hugetlb ranges to callers who have specified
    a hugetlb_entry callback.

    This issue was found using an AFL-based fuzzer.
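
    For context, a minimal mincore() call of the kind such a fuzzer exercises
    (an ordinary anonymous mapping here, not a hugetlb one): the vector the
    kernel fills in is exactly what the missing hole handling could leave
    partially uninitialized before being copied to userspace:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            size_t len = 16 * page;
            unsigned char vec[16];
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            memset(p, 0, page);                     /* fault in the first page */
            if (mincore(p, len, vec) == 0)
                    printf("first page resident: %d\n", vec[0] & 1);

            munmap(p, len);
            return 0;
    }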

    v2:
    - don't crash on ->pte_hole==NULL (Andrew Morton)
    - add Cc stable (Andrew Morton)

    Changed for 4.4/4.9 stable backport:
    - fix up conflict in the huge_pte_offset() call

    Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.

    In reset_deferred_meminit() we determine the number of pages that must not
    be deferred. We initialize pages for at least 2G of memory, but also
    pages for reserved memory in this node.

    The reserved memory is determined in this function:
    memblock_reserved_memory_within(), which operates over physical
    addresses and returns a size in bytes. However, reset_deferred_meminit()
    assumes that this function operates with pfns and returns a page
    count.

    The result is that in the best case the machine boots slower than expected
    due to initializing more pages than needed in a single thread, and in the
    worst case it panics because fewer pages than needed are initialized early.
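
    A tiny arithmetic sketch of the unit mismatch (PAGE_SHIFT hard-coded to
    the common 4K case): a byte count misread as a page count overstates the
    amount of memory to initialize early by a factor of PAGE_SIZE:

    #include <stdio.h>

    #define PAGE_SHIFT 12UL    /* 4K pages */

    int main(void)
    {
            unsigned long reserved_bytes = 512UL << 20;           /* 512MB reserved on the node */
            unsigned long misread_pages  = reserved_bytes;        /* bytes treated as a page count */
            unsigned long actual_pages   = reserved_bytes >> PAGE_SHIFT;

            printf("misread: %lu pages (covers %lu GB)\n",
                   misread_pages, misread_pages >> (30 - PAGE_SHIFT));
            printf("correct: %lu pages\n", actual_pages);
            return 0;
    }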

    Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
    Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     

21 Oct, 2017

2 commits

  • [ Upstream commit c6e28895a4372992961888ffaadc9efc643b5bfe ]

    In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from
    the command line but never checks if there are features from the
    SLAB_NEVER_MERGE set.

    As a result, caches selected by slub_debug are always mergeable if they
    have been created without a custom constructor set or without one of the
    SLAB_* debug features on.

    This moves the SLAB_NEVER_MERGE check below the flags update from the
    command line to make sure it won't merge the slab cache if one of the
    debug features is on.

    Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d
    Signed-off-by: Grygorii Maistrenko
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Grygorii Maistrenko
     
  • [ Upstream commit ddffe98d166f4a93d996d5aa628fd745311fc1e7 ]

    To identify that page-table pages are allocated from the bootmem
    allocator, a magic number is set in page->lru.next.

    But the page->lru list is initialized in reserve_bootmem_region(). So when
    calling free_pagetable(), the function cannot find the magic number in the
    pages, and free_pagetable() frees the pages by free_reserved_page(), not
    put_page_bootmem().

    But if the pages are allocated from the bootmem allocator and used as page
    tables, the pages have the private flag set. So before freeing the pages, we
    should clear the private flag with put_page_bootmem().

    Before applying the commit 7bfec6f47bb0 ("mm, page_alloc: check multiple
    page fields with a single branch"), we could find the following visible
    issue:

    BUG: Bad page state in process kworker/u1024:1
    page:ffffea103cfd8040 count:0 mapcount:0 mappi
    flags: 0x6fffff80000800(private)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x800(private)

    Call Trace:
    [...] dump_stack+0x63/0x87
    [...] bad_page+0x114/0x130
    [...] free_pages_prepare+0x299/0x2d0
    [...] free_hot_cold_page+0x31/0x150
    [...] __free_pages+0x25/0x30
    [...] free_pagetable+0x6f/0xb4
    [...] remove_pagetable+0x379/0x7ff
    [...] vmemmap_free+0x10/0x20
    [...] sparse_remove_one_section+0x149/0x180
    [...] __remove_pages+0x2e9/0x4f0
    [...] arch_remove_memory+0x63/0xc0
    [...] remove_memory+0x8c/0xc0
    [...] acpi_memory_device_remove+0x79/0xa5
    [...] acpi_bus_trim+0x5a/0x8d
    [...] acpi_bus_trim+0x38/0x8d
    [...] acpi_device_hotplug+0x1b7/0x418
    [...] acpi_hotplug_work_fn+0x1e/0x29
    [...] process_one_work+0x152/0x400
    [...] worker_thread+0x125/0x4b0
    [...] kthread+0xd8/0xf0
    [...] ret_from_fork+0x22/0x40

    And the issue still silently occurs.

    Until the page-table pages allocated from the bootmem allocator are freed,
    page->freelist is never used. So the patch sets the magic number in
    page->freelist instead of page->lru.next.

    [isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
    Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
    Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
     

12 Oct, 2017

1 commit

  • commit 4d4bbd8526a8fbeb2c090ea360211fceff952383 upstream.

    Andrea has noticed that the oom_reaper doesn't invalidate the range via
    mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
    corrupt the memory of the kvm guest for example.

    tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
    sufficient as per Andrea:

    "mmu_notifier_invalidate_range cannot be used in replacement of
    mmu_notifier_invalidate_range_start/end. For KVM
    mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
    notifier implementation has to implement either ->invalidate_range
    method or the invalidate_range_start/end methods, not both. And if you
    implement invalidate_range_start/end like KVM is forced to do, calling
    mmu_notifier_invalidate_range in common code is a noop for KVM.

    For those MMU notifiers that can get away only implementing
    ->invalidate_range, the ->invalidate_range is implicitly called by
    mmu_notifier_invalidate_range_end(). And only those secondary MMUs
    that share the same pagetable with the primary MMU (like AMD iommuv2)
    can get away only implementing ->invalidate_range"

    As the callback is allowed to sleep and the implementation is out of the
    hands of the MM, it is safer to simply bail out if there is an mmu
    notifier registered. In order not to fail too early, make the
    mm_has_notifiers check under the oom_lock and have a little nap before
    failing, to give the current oom victim some more time to exit.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Michal Hocko
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Oct, 2017

1 commit

  • [ Upstream commit bfc7228b9a9647e1c353e50b40297a2929801759 ]

    The system may panic when initialisation is done while almost all the
    memory is assigned to huge pages using the kernel command line
    parameter hugepages=xxxx. The panic may occur like this:

    Unable to handle kernel paging request for data at address 0x00000000
    Faulting instruction address: 0xc000000000302b88
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 [ 0.082424] NUMA
    pSeries
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
    task: c00000021ed01600 task.stack: c00000010d108000
    NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
    REGS: c00000010d10b2c0 TRAP: 0300 Not tainted (4.9.0-15-generic)
    MSR: 8000000002009033 [ 0.082770] CR: 28424422 XER: 00000000
    CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
    GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
    GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
    GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
    GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
    GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
    GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
    GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
    GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
    NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
    LR do_try_to_free_pages+0x1b4/0x450
    Call Trace:
    do_try_to_free_pages+0x1b4/0x450
    try_to_free_pages+0xf8/0x270
    __alloc_pages_nodemask+0x7a8/0xff0
    new_slab+0x104/0x8e0
    ___slab_alloc+0x620/0x700
    __slab_alloc+0x34/0x60
    kmem_cache_alloc_node_trace+0xdc/0x310
    mem_cgroup_init+0x158/0x1c8
    do_one_initcall+0x68/0x1d0
    kernel_init_freeable+0x278/0x360
    kernel_init+0x24/0x170
    ret_from_kernel_thread+0x5c/0x74
    Instruction dump:
    eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
    3929acd8 794a1f24 7d295214 eac90100 2fa90000 419eff74 3b200000
    ---[ end trace 342f5208b00d01b6 ]---

    This is a chicken-and-egg issue where the kernel tries to get free memory
    when allocating per-node data in mem_cgroup_init(), but in that path
    mem_cgroup_soft_limit_reclaim() is called, which assumes that these data
    are allocated.

    As mem_cgroup_soft_limit_reclaim() is best effort, it should return when
    these data are not yet allocated.

    This patch also fixes potential null pointer access in
    mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().

    Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

27 Sep, 2017

1 commit

  • commit 4855e4a7f29d6d10b0b9c84e189c770c9a94e91e upstream.

    There is a race between page freeing and unreserve_highatomic_pageblock().

    CPU 0                                 CPU 1

    free_hot_cold_page
      mt = get_pfnblock_migratetype
      set_pcppage_migratetype(page, mt)
                                          unreserve_highatomic_pageblock
                                            spin_lock_irqsave(&zone->lock)
                                            move_freepages_block
                                            set_pageblock_migratetype(page)
                                            spin_unlock_irqrestore(&zone->lock)
      free_pcppages_bulk
        __free_one_page(mt)
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Sangseok Lee
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Miles Chen
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

14 Sep, 2017

1 commit

  • commit de0c799bba2610a8e1e9a50d76a28614520a4cd4 upstream.

    Seen while reading the code: in handle_mm_fault(), in the case where
    arch_vma_access_permitted() is failing, the call to
    mem_cgroup_oom_disable() is not made.

    To fix that, move the call to mem_cgroup_oom_enable() after calling
    arch_vma_access_permitted(), as it should not have entered the memcg OOM.

    Link: http://lkml.kernel.org/r/1504625439-31313-1-git-send-email-ldufour@linux.vnet.ibm.com
    Fixes: bae473a423f6 ("mm: introduce fault_env")
    Signed-off-by: Laurent Dufour
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

07 Sep, 2017

1 commit

  • commit c461ad6a63b37ba74632e90c063d14823c884247 upstream.

    Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
    and bisected it to the commit 479f854a207c ("mm, page_alloc: defer
    debugging checks of pages allocated from the PCP").

    The problem is that a page that was poisoned with madvise() is reused.
    The commit removed a check that would trigger if DEBUG_VM was enabled
    but re-enabling the check only fixes the problem as a side-effect by
    printing a bad_page warning and recovering.

    The root of the problem is that an madvise() can leave a poisoned page
    on the per-cpu list. This patch drains all per-cpu lists after pages
    are poisoned so that they will not be reused. Wendy reports that the
    test case in question passes with this patch applied. While this could
    be done in a targeted fashion, it is over-complicated for such a rare
    operation.

    Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Signed-off-by: Mel Gorman
    Reported-by: Wang, Wendy
    Tested-by: Wang, Wendy
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Hansen, Dave"
    Cc: "Luck, Tony"
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

30 Aug, 2017

3 commits

  • commit 91b540f98872a206ea1c49e4aa6ea8eed0886644 upstream.

    In the recently introduced memblock_discard() there is a reversed-logic bug:
    memory of the static array is freed instead of the dynamically allocated one.

    Link: http://lkml.kernel.org/r/1503511441-95478-2-git-send-email-pasha.tatashin@oracle.com
    Fixes: 3010f876500f ("mm: discard memblock data later")
    Signed-off-by: Pavel Tatashin
    Reported-by: Woody Suwalski
    Tested-by: Woody Suwalski
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit 263630e8d176d87308481ebdcd78ef9426739c6b upstream.

    If madvise(..., MADV_FREE) split a transparent hugepage, it called
    put_page() before unlock_page().

    This was wrong because put_page() can free the page, e.g. if a
    concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
    mapping. put_page() then rightfully complained about freeing a locked
    page.

    Fix this by moving the unlock_page() before put_page().

    This bug was found by syzkaller, which encountered the following splat:

    BUG: Bad page state in process syzkaller412798 pfn:1bd800
    page:ffffea0006f60000 count:0 mapcount:0 mapping: (null) index:0x20a00
    flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
    raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
    raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in:
    CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    bad_page+0x230/0x2b0 mm/page_alloc.c:565
    free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
    free_pages_check mm/page_alloc.c:952 [inline]
    free_pages_prepare mm/page_alloc.c:1043 [inline]
    free_pcp_prepare mm/page_alloc.c:1068 [inline]
    free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
    __put_single_page mm/swap.c:79 [inline]
    __put_page+0xfb/0x160 mm/swap.c:113
    put_page include/linux/mm.h:814 [inline]
    madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:108 [inline]
    walk_p4d_range mm/pagewalk.c:134 [inline]
    walk_pgd_range mm/pagewalk.c:160 [inline]
    __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
    walk_page_range+0x200/0x470 mm/pagewalk.c:326
    madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
    madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
    madvise_dontneed_free mm/madvise.c:555 [inline]
    madvise_vma mm/madvise.c:664 [inline]
    SYSC_madvise mm/madvise.c:832 [inline]
    SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Here is a C reproducer:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MADV_FREE 8
    #define PAGE_SIZE 4096

    static void *mapping;
    static const size_t mapping_size = 0x1000000;

    static void *madvise_thrproc(void *arg)
    {
        /* each thread issues one madvise() call with the advice passed in arg */
        madvise(mapping, mapping_size, (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];

        for (;;) {
            mapping = mmap(NULL, mapping_size, PROT_WRITE,
                           MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

            /* unmap one page in the middle of the mapping */
            munmap((char *)mapping + mapping_size / 2, PAGE_SIZE);

            /* race MADV_DONTNEED against MADV_FREE on the same range */
            pthread_create(&t[0], 0, madvise_thrproc, (void *)MADV_DONTNEED);
            pthread_create(&t[1], 0, madvise_thrproc, (void *)MADV_FREE);
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            munmap(mapping, mapping_size);
        }
    }

    Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
    CONFIG_DEBUG_VM=y are needed.
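
    Usage note (the exact command line is an assumption, not part of the
    original report): the reproducer builds with a plain C compiler plus
    the pthread library, e.g. "gcc -pthread repro.c -o repro", and it
    loops forever, so watch dmesg for the bad_page splat and stop it
    manually.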

    Google Bug Id: 64696096

    Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
    Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
    Signed-off-by: Eric Biggers
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 435c0b87d661da83771c30ed775f7c37eed193fb upstream.

    /sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether we
    want to allocate huge pages when allocating pages for the private
    in-kernel shmem mount.

    Unfortunately, as Dan noticed, I've screwed it up and the only way to
    make the kernel allocate huge pages for that mount is to use "force"
    there. All other values are effectively ignored.
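
    For reference, a minimal sketch of reading that knob from C follows.
    It is just a plain sysfs file read; everything except the knob path
    quoted above is illustrative.

    #include <stdio.h>

    int main(void)
    {
        char buf[128];
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("shmem_enabled: %s", buf);    /* selected value shown in [brackets] */
        fclose(f);
        return 0;
    }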

    Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
    Fixes: 5a6e75f8110c ("shmem: prepare huge= mount option and sysfs knob")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

25 Aug, 2017

2 commits

  • commit 197e7e521384a23b9e585178f3f11c9fa08274b9 upstream.

    The 'move_pages()' system call was introduced long long ago with the
    same permission checks as for sending a signal (except using
    CAP_SYS_NICE instead of CAP_KILL for the overriding capability).

    That turns out to not be a great choice - while the system call really
    only moves physical page allocations around (and you need other
    capabilities to do a lot of it), you can check the return value to map
    out some of the virtual address choices and defeat ASLR of a binary
    that still shares your uid.

    So change the access checks to the more common 'ptrace_may_access()'
    model instead.

    This tightens the access checks for the uid, and also effectively
    changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely
    that anybody really _uses_ this legacy system call any more (we have
    better NUMA placement models these days), so I expect nobody to notice.

    Famous last words.
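
    For illustration only, here is a hedged sketch of the probing vector
    described above: calling move_pages() with nodes == NULL merely
    queries per-page status, and before this change the resulting
    per-address error pattern could be read for another process sharing
    the caller's uid. The target pid and the probed address below are
    placeholders, and the program needs libnuma (link with -lnuma).

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        void *probe = (void *)0x400000;    /* illustrative address to test */
        int status = 0;
        long ret;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        /* With nodes == NULL, move_pages() does not move anything: it only
         * reports status, i.e. the NUMA node if the page is mapped and
         * present, or a negative errno such as -ENOENT or -EFAULT. */
        ret = move_pages(atoi(argv[1]), 1, &probe, NULL, &status, 0);
        printf("move_pages() = %ld, status[0] = %d\n", ret, status);
        return 0;
    }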

    Reported-by: Otto Ebeling
    Acked-by: Eric W. Biederman
    Cc: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 73223e4e2e3867ebf033a5a8eb2e5df0158ccc99 upstream.

    I hit a use-after-free issue when executing trinity and reproduced it
    with KASAN enabled. The related call trace is as follows.

    BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
    Read of size 2 by task syz-executor1/798

    INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
    __slab_alloc+0x768/0x970
    kmem_cache_alloc+0x2e7/0x450
    mpol_new.part.2+0x74/0x160
    mpol_new+0x66/0x80
    SyS_mbind+0x267/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
    __slab_free+0x495/0x8e0
    kmem_cache_free+0x2f3/0x4c0
    __mpol_put+0x2b/0x40
    SyS_mbind+0x383/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
    INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600

    Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
    Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk.
    Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb ........
    Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    Memory state around the buggy address:
    ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc

    A !shared memory policy is not protected against parallel removal by
    another thread; that is normally prevented by holding the mmap_sem.
    do_get_mempolicy(), however, drops the lock midway while the policy
    can still be accessed later.

    The early, premature up_read is a historical artifact from the times
    when put_user was called in this path (see
    https://lwn.net/Articles/124754/), but that has been gone since
    8bccd85ffbaf ("[PATCH] Implement sys_* do_* layering in the memory
    policy layer."). With the current mempolicy reference-count model,
    however, the premature release is what introduces this use-after-free.

    Fix the issue by removing the premature release.
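
    For context, here is a minimal, hedged sketch of the two syscalls
    involved, using the <numaif.h> wrappers from libnuma (link with
    -lnuma). It only exercises the mbind()/get_mempolicy() API on the
    calling process; it is not the trinity-generated race that produced
    the KASAN report above.

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        unsigned long nodemask = 1;    /* allow node 0 only */
        int mode = -1;
        void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Set a VMA policy for the mapping, then read it back by address;
         * the MPOL_F_ADDR lookup is the kind of path touched by the fix. */
        if (mbind(p, page, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
            perror("mbind");
        if (get_mempolicy(&mode, NULL, 0, p, MPOL_F_ADDR))
            perror("get_mempolicy");
        printf("mode = %d (MPOL_BIND = %d)\n", mode, MPOL_BIND);

        munmap(p, page);
        return 0;
    }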

    Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    zhong jiang