07 Jan, 2019

1 commit

  • The semantics of what "in core" means for the mincore() system call are
    somewhat unclear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache" rather than "page is mapped in the mapping".

    The problem with that traditional semantic is that it exposes a lot of
    system cache state that it really probably shouldn't, and that users
    shouldn't really even care about.

    So let's try to avoid that information leak by simply changing the
    semantics to be that mincore() counts actual mapped pages, not pages
    that might be cheaply mapped if they were faulted (note the "might be"
    part of the old semantics: being in the cache doesn't actually guarantee
    that you can access them without IO anyway, since things like network
    filesystems may have to revalidate the cache before use).

    In many ways the old semantics were somewhat insane even aside from the
    information leak issue. From the very beginning (and that beginning is
    a long time ago: 2.3.52 was released in March 2000, I think), the code
    had a comment saying

    Later we can get more picky about what "in core" means precisely.

    and this is that "later". Admittedly it is much later than is really
    comfortable.

    NOTE! This is a real semantic change, and it is for example known to
    change the output of "fincore", since that program literally does an
    mmap without populating it, and then does "mincore()" on that mapping
    that doesn't actually have any pages in it.

    I'm hoping that nobody actually has any workflow that cares, and the
    info leak is real.

    We may have to do something different if it turns out that people have
    valid reasons to want the old semantics, and if we can limit the
    information leak sanely.

    Cc: Kevin Easton
    Cc: Jiri Kosina
    Cc: Masatake YAMATO
    Cc: Andrew Morton
    Cc: Greg KH
    Cc: Peter Zijlstra
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds

    Linus Torvalds
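
    As an editorial aside, a minimal userspace sketch (not part of the
    commit) of the change described above: map a page of a file without
    touching it and ask mincore() about it. Under the old semantics the
    page reports as resident whenever it sits in the page cache; under the
    new semantics it reports as resident only once it is actually mapped
    here. The file path is an arbitrary placeholder.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
        if (fd < 0) { perror("open"); return 1; }

        /* map one page but do not fault it in, as fincore-style tools do */
        unsigned char *map = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned char vec;
        if (mincore(map, psize, &vec) == 0)
            printf("page resident: %d\n", vec & 1);

        munmap(map, psize);
        close(fd);
        return 0;
    }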
     

06 Jan, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - procfs updates

    - various misc bits

    - lib/ updates

    - epoll updates

    - autofs

    - fatfs

    - a few more MM bits

    * emailed patches from Andrew Morton : (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...

    Linus Torvalds
     

05 Jan, 2019

5 commits

  • swap_readpage() wants to do polling to bring in pages if asked to, but
    it doesn't mark the bio as being polled. Additionally, the looping
    around the blk_poll() check isn't correct - if we get a zero return, we
    should call io_schedule(), we can't just assume that the bio has
    completed. The regular bio->bi_private check should be used for that.

    Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Andrew Morton
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • Multiple filesystems open code lru_to_page(). Rectify this by moving
    the macro from mm_inline (which is specific to lru stuff) to the more
    generic mm.h header and start using the macro where appropriate.

    No functional changes.

    Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
    Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Mike Rapoport
    Acked-by: Pankaj gupta
    Acked-by: "Yan, Zheng" [ceph]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
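
    For reference (editorial note, not part of the patch), the macro being
    moved is a one-liner over an LRU-style list head, and the open-coded
    pattern it replaces in the filesystems looks roughly like this:

    /* the macro: the page whose ->lru entry sits at the tail of the list */
    #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))

    /* typical caller pattern (page_list is a struct list_head of pages
     * strung together on page->lru, as in the readahead code) */
    while (!list_empty(page_list)) {
        struct page *page = lru_to_page(page_list);

        list_del(&page->lru);
        /* ... process the page ... */
    }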
     
  • This is already done for us internally by the signal machinery.

    Link: http://lkml.kernel.org/r/20181116002713.8474-5-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
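
    A representative before/after of the pattern this series removes
    (editorial illustration; the exact call sites vary). The caller-side
    hint is redundant because signal_pending() already wraps its test in
    unlikely() internally:

    /* before: the caller adds its own branch hint */
    if (unlikely(signal_pending(current)))
        return -EINTR;

    /* after: rely on the hint inside signal_pending() itself */
    if (signal_pending(current))
        return -EINTR;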
     
  • Android needs to mremap large regions of memory during memory management
    related operations. The mremap system call can be really slow if THP is
    not enabled. The bottleneck is move_page_tables, which is copying each
    pte at a time, and can be really slow across a large map. Turning on
    THP may not be a viable option, and is not for us. This patch speeds
    up the performance for non-THP systems by copying at the PMD level
    when possible.

    The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap,
    the mremap completion time drops from 3.4-3.6 milliseconds to 144-160
    microseconds.

    Before:
    Total mremap time for 1GB data: 3521942 nanoseconds.
    Total mremap time for 1GB data: 3449229 nanoseconds.
    Total mremap time for 1GB data: 3488230 nanoseconds.

    After:
    Total mremap time for 1GB data: 150279 nanoseconds.
    Total mremap time for 1GB data: 144665 nanoseconds.
    Total mremap time for 1GB data: 158708 nanoseconds.

    If THP is enabled the optimization is mostly skipped except in certain
    situations.

    [joel@joelfernandes.org: fix 'move_normal_pmd' unused function warning]
    Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com
    Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Acked-by: Kirill A. Shutemov
    Reviewed-by: William Kucharski
    Cc: Julia Lawall
    Cc: Michal Hocko
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
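
    A rough userspace sketch (editorial, not the benchmark that produced
    the numbers above) of how such a measurement can be taken: populate a
    1GB anonymous mapping, then force mremap() to move it and time the
    call.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;                 /* 1GB */
        char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *dst = mmap(NULL, len, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (src == MAP_FAILED || dst == MAP_FAILED) { perror("mmap"); return 1; }
        memset(src, 1, len);                    /* populate the page tables */

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        /* MREMAP_FIXED forces the page tables to be moved to dst */
        void *p = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        clock_gettime(CLOCK_MONOTONIC, &b);
        if (p == MAP_FAILED) { perror("mremap"); return 1; }

        printf("Total mremap time for 1GB data: %ld nanoseconds.\n",
               (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec));
        return 0;
    }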
     
  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the
    extra 'address' argument that mremap passes to pte_alloc may do
    something subtly architecture-related in the future that may make the
    scheme not work. Also, we find that there is no point in passing the
    'address' to pte_alloc since it is unused. This patch therefore
    removes this argument tree-wide, resulting in a nice negative diff as
    well. Along the way we also ensured that the enabled architectures do
    not do anything funky with the 'address' argument that goes unnoticed
    by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script
    (thanks to Julia for answering all Coccinelle questions!). The
    following fix-ups were done manually:
    * Removal of the address argument from pte_fragment_alloc
    * Removal of the pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
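
    Concretely, the transformation the script performs looks like this
    (editorial illustration with representative prototypes and a typical
    fault-path call site; the exact signatures vary per architecture):

    /* before: an address is threaded through that the allocators never use */
    int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);

        if (pte_alloc(mm, pmd, address))
            return VM_FAULT_OOM;

    /* after: the unused 'address' argument is dropped tree-wide */
    int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);

        if (pte_alloc(mm, pmd))
            return VM_FAULT_OOM;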
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
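
    The conversion itself is mechanical; a typical call site changes like
    this (representative example, not a specific file):

    /* before */
    if (!access_ok(VERIFY_WRITE, buf, count))
        return -EFAULT;

    /* after: the type argument is gone, only the range is checked */
    if (!access_ok(buf, count))
        return -EFAULT;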
     

03 Jan, 2019

1 commit

  • This mostly reverts commit 849a370016a5 ("block: avoid ordered task
    state change for polled IO"). It was wrongly claiming that the ordering
    wasn't necessary. The memory barrier _is_ necessary.

    If something is truly polling and not going to sleep, it's the whole
    state setting that is unnecessary, not the memory barrier. Whenever you
    set your state to a sleeping state, you absolutely need the memory
    barrier.

    Note that sometimes the memory barrier can be elsewhere. For example,
    the ordering might be provided by an external lock, or by setting the
    process state to sleeping before adding yourself to the wait queue list
    that is used for waking up (where the wait queue lock itself will
    guarantee that any wakeup will correctly see the sleeping state).

    But none of those cases were true here.

    NOTE! Some of the polling paths may indeed be able to drop the state
    setting entirely, at which point the memory barrier also goes away.

    (Also note that this doesn't revert the TASK_RUNNING cases: there is no
    race between a wakeup and setting the process state to TASK_RUNNING,
    since the end result doesn't depend on ordering).

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
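
    The pattern at issue, as an editorial sketch (not the exact
    swap_readpage() code): whenever a task publishes a sleeping state and
    then tests a condition that a waker may already have changed, the
    state store and the condition load must be ordered, which
    set_current_state() guarantees.

    static bool done;                   /* stands in for the completion flag */
    static struct task_struct *waiter;  /* the task doing the polling/sleeping */

    /* sleeper side */
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE); /* state store + full barrier */
        if (READ_ONCE(done))                     /* condition load ordered after it */
            break;
        io_schedule();
    }
    __set_current_state(TASK_RUNNING);

    /* waker side */
    WRITE_ONCE(done, true);
    wake_up_process(waiter);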
     

30 Dec, 2018

2 commits

  • Pull documentation update from Jonathan Corbet:
    "A fairly normal cycle for documentation stuff. We have a new document
    on perf security, more Italian translations, more improvements to the
    memory-management docs, improvements to the pathname lookup
    documentation, and the usual array of smaller fixes.

    As is often the case, there are a few reaches outside of
    Documentation/ to adjust kerneldoc comments"

    * tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
    docs: improve pathname-lookup document structure
    configfs: fix wrong name of struct in documentation
    docs/mm-api: link slab_common.c to "The Slab Cache" section
    slab: make kmem_cache_create{_usercopy} description proper kernel-doc
    doc:process: add links where missing
    docs/core-api: make mm-api.rst more structured
    x86, boot: documentation whitespace fixup
    Documentation: devres: note checking needs when converting
    doc:it: add some process/* translations
    doc:it: fixes in process/1.Intro
    Documentation: convert path-lookup from markdown to resturctured text
    Documentation/admin-guide: update admin-guide index.rst
    Documentation/admin-guide: introduce perf-security.rst file
    scripts/kernel-doc: Fix struct and struct field attribute processing
    Documentation: dev-tools: Fix typos in index.rst
    Correct gen_init_cpio tool's documentation
    Document /proc/pid PID reuse behavior
    Documentation: update path-lookup.md for parallel lookups
    Documentation: Use "while" instead of "whilst"
    dmaengine: Add mailing list address to the documentation
    ...

    Linus Torvalds
     
  • Pull percpu update from Dennis Zhou:
    "Michael Cree noted generic UP Alpha has been broken since v3.18. This
    is a small fix for locking in UP percpu code that fixes the issue"

    * 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu: convert spin_lock_irq to spin_lock_irqsave.

    Linus Torvalds
     

29 Dec, 2018

29 commits

  • Merge misc updates from Andrew Morton:

    - large KASAN update to use arm's "software tag-based mode"

    - a few misc things

    - sh updates

    - ocfs2 updates

    - just about all of MM

    * emailed patches from Andrew Morton : (167 commits)
    kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
    memcg, oom: notify on oom killer invocation from the charge path
    mm, swap: fix swapoff with KSM pages
    include/linux/gfp.h: fix typo
    mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
    hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
    hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
    memory_hotplug: add missing newlines to debugging output
    mm: remove __hugepage_set_anon_rmap()
    include/linux/vmstat.h: remove unused page state adjustment macro
    mm/page_alloc.c: allow error injection
    mm: migrate: drop unused argument of migrate_page_move_mapping()
    blkdev: avoid migration stalls for blkdev pages
    mm: migrate: provide buffer_migrate_page_norefs()
    mm: migrate: move migrate_page_lock_buffers()
    mm: migrate: lock buffers before migrate_page_move_mapping()
    mm: migration: factor out code to compute expected number of page references
    mm, page_alloc: enable pcpu_drain with zone capability
    kmemleak: add config to select auto scan
    mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "This is the main pull request for block/storage for 4.21.

    Larger than usual, it was a busy round with lots of goodies queued up.
    Most notable is the removal of the old IO stack, which has been a long
    time coming. No new features for a while, everything coming in this
    week has all been fixes for things that were previously merged.

    This contains:

    - Use atomic counters instead of semaphores for mtip32xx (Arnd)

    - Cleanup of the mtip32xx request setup (Christoph)

    - Fix for circular locking dependency in loop (Jan, Tetsuo)

    - bcache (Coly, Guoju, Shenghui)
    * Optimizations for writeback caching
    * Various fixes and improvements

    - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
    * host and target support for NVMe over TCP
    * Error log page support
    * Support for separate read/write/poll queues
    * Much improved polling
    * discard OOM fallback
    * Tracepoint improvements

    - lightnvm (Hans, Hua, Igor, Matias, Javier)
    * Igor added packed metadata to pblk. Now drives without metadata
    per LBA can be used as well.
    * Fix from Geert on uninitialized value on chunk metadata reads.
    * Fixes from Hans and Javier to pblk recovery and write path.
    * Fix from Hua Su to fix a race condition in the pblk recovery
    code.
    * Scan optimization added to pblk recovery from Zhoujie.
    * Small geometry cleanup from me.

    - Conversion of the last few drivers that used the legacy path to
    blk-mq (me)

    - Removal of legacy IO path in SCSI (me, Christoph)

    - Removal of legacy IO stack and schedulers (me)

    - Support for much better polling, now without interrupts at all.
    blk-mq adds support for multiple queue maps, which enables us to
    have a map per type. This in turn enables nvme to have separate
    completion queues for polling, which can then be interrupt-less.
    Also means we're ready for async polled IO, which is hopefully
    coming in the next release.

    - Killing of (now) unused block exports (Christoph)

    - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

    - Support for zoned testing with null_blk (Masato)

    - sx8 conversion to per-host tag sets (Christoph)

    - IO priority improvements (Damien)

    - mq-deadline zoned fix (Damien)

    - Ref count blkcg series (Dennis)

    - Lots of blk-mq improvements and speedups (me)

    - sbitmap scalability improvements (me)

    - Make core inflight IO accounting per-cpu (Mikulas)

    - Export timeout setting in sysfs (Weiping)

    - Cleanup the direct issue path (Jianchao)

    - Export blk-wbt internals in block debugfs for easier debugging
    (Ming)

    - Lots of other fixes and improvements"

    * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
    kyber: use sbitmap add_wait_queue/list_del wait helpers
    sbitmap: add helpers for add/del wait queue handling
    block: save irq state in blkg_lookup_create()
    dm: don't reuse bio for flushes
    nvme-pci: trace SQ status on completions
    nvme-rdma: implement polling queue map
    nvme-fabrics: allow user to pass in nr_poll_queues
    nvme-fabrics: allow nvmf_connect_io_queue to poll
    nvme-core: optionally poll sync commands
    block: make request_to_qc_t public
    nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
    nvme-tcp: fix endianess annotations
    nvmet-tcp: fix endianess annotations
    nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
    nvme-pci: only set nr_maps to 2 if poll queues are supported
    nvmet: use a macro for default error location
    nvmet: fix comparison of a u16 with -1
    blk-mq: enable IO poll if .nr_queues of type poll > 0
    blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
    blk-mq: skip zero-queue maps in blk_mq_map_swqueue
    ...

    Linus Torvalds
     
  • Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
    eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
    out_of_memory back to the charge path") has moved the oom handling back to
    the charge path. While doing so the notification was left behind in
    mem_cgroup_oom_synchronize.

    Fix the issue by replicating the oom hierarchy locking and the
    notification.

    Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
    Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
    Signed-off-by: Michal Hocko
    Reported-by: Burt Holzman
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    KSM pages may be mapped to multiple VMAs that cannot be reached from
    one anon_vma. So during swapin, a new copy of the page needs to be
    generated if a different anon_vma is needed; please refer to the
    comments of ksm_might_need_to_copy() for details.

    During swapoff, unuse_vma() uses the anon_vma (if available) to locate
    the VMA and virtual address mapped to the page, so not all mappings to
    a swapped-out KSM page can be found. So in try_to_unuse(), even if the
    swap count of a swap entry isn't zero, the page needs to be deleted
    from the swap cache, so that in the next round a new page can be
    allocated and swapped in for the other mappings of the swapped-out KSM
    page.

    But this conflicts with THP swap support, where the THP can be deleted
    from the swap cache only after the swap count of every swap entry in
    the huge swap cluster backing the THP has reached 0. So try_to_unuse()
    was changed in commit e07098294adf ("mm, THP, swap: support to reclaim
    swap space for THP swapped out") to check that before deleting a page
    from the swap cache, but this has broken KSM swapoff too.

    Fortunately, KSM is for normal pages only, so the original behavior
    for KSM pages can be restored easily by checking PageTransCompound().
    That is how this patch works.

    The bug was introduced by e07098294adf ("mm, THP, swap: support to
    reclaim swap space for THP swapped out"), which was merged in
    v4.14-rc1, so I think we should backport the fix to 4.14 and later.
    But Hugh thinks it may be rare for KSM pages to be in the swap device
    at swapoff time, which is why nobody has reported the bug so far.

    Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
    Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
    Signed-off-by: "Huang, Ying"
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Daniel Jordan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The kbuild robot reported the following on a development branch that used
    memremap.h in a new path:

    In file included from arch/m68k/include/asm/pgtable_mm.h:148:0,
    from arch/m68k/include/asm/pgtable.h:5,
    from include/linux/memremap.h:7,
    from drivers//dax/bus.c:3:
    arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset':
    >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct'
    return mm->pgd + pgd_index(address);
    ^~

    The ->page_fault() callback is specific to HMM. Move it to 'struct
    hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in
    include/linux/hmm.h. Longer term refactoring this dependency out of HMM
    is recommended, but in the meantime memremap.h remains generic.

    Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...")
    Signed-off-by: Dan Williams
    Reviewed-by: "Jérôme Glisse"
    Cc: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • hugetlbfs page faults can race with truncate and hole punch operations.
    Current code in the page fault path attempts to handle this by 'backing
    out' operations if we encounter the race. One obvious omission in the
    current code is removing a page newly added to the page cache. This is
    pretty straightforward to address, but there is a more subtle and
    difficult issue of backing out hugetlb reservations. To handle this
    correctly, the 'reservation state' before page allocation needs to be
    noted so that it can be properly backed out. There are four distinct
    possibilities for reservation state: shared/reserved, shared/no-resv,
    private/reserved and private/no-resv. Backing out a reservation may
    require memory allocation which could fail so that needs to be taken into
    account as well.

    Instead of writing the required complicated code for this rare occurrence,
    just eliminate the race. i_mmap_rwsem is now held in read mode for the
    duration of page fault processing. Hold i_mmap_rwsem longer in the
    truncation and hole punch code to cover the call to remove_inode_hugepages.

    With this modification, code in remove_inode_hugepages checking for races
    becomes 'dead' as it can no longer happen. Remove the dead code and
    expand comments to explain reasoning. Similarly, checks for races with
    truncation in the page fault path can be simplified and removed.

    [mike.kravetz@oracle.com: incorporate suggestions from Kirill]
    Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
    Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
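
    A condensed sketch of the resulting locking scheme (editorial; the
    helpers are the existing hugetlb/i_mmap ones, but the call sites are
    heavily simplified):

    /* fault path: hold i_mmap_rwsem for read across allocation and use of ptep */
    i_mmap_lock_read(mapping);
    ptep = huge_pte_alloc(mm, address, huge_page_size(hstate_vma(vma)));
    /* ... handle the fault using ptep, which may sit in a shared pmd ... */
    i_mmap_unlock_read(mapping);

    /* truncation / hole punch path: exclusive while pmds may be unshared */
    i_mmap_lock_write(mapping);
    /* ... unmap the range; huge_pmd_unshare() may clear the pud here ... */
    i_mmap_unlock_write(mapping);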
     
    This function has been identical to __page_set_anon_rmap() since the
    time it was introduced (8 years ago). The patch removes the function
    and makes its users call __page_set_anon_rmap() instead.

    Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Model call chain after should_failslab(). Likewise, we can now use a
    kprobe to override the return value of should_fail_alloc_page() and inject
    allocation failures into alloc_page*().

    This will allow injecting allocation failures using the BCC tools even
    without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a
    fail_page_alloc= parameter, which incurs some overhead even when failures
    are not being injected. On the other hand, this patch adds an
    unconditional call to should_fail_alloc_page() from page allocation
    hotpath. That overhead should be rather negligible with
    CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.

    [vbabka@suse.cz: changelog addition]
    Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
    Signed-off-by: Benjamin Poirier
    Acked-by: Vlastimil Babka
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Poirier
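
    The "model after should_failslab()" part amounts to exposing a small
    out-of-line hook that the error-injection framework is allowed to
    override; roughly (editorial sketch, not the literal diff):

    /* out-of-line so a kprobe/BPF error-injection point can override the
     * return value; TRUE marks the injectable return type as boolean */
    noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    {
        return __should_fail_alloc_page(gfp_mask, order);
    }
    ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);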
     
  • All callers of migrate_page_move_mapping() now pass NULL for 'head'
    argument. Drop it.

    Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Provide a variant of buffer_migrate_page() that also checks whether there
    are no unexpected references to buffer heads. This function will then be
    safe to use for block device pages.

    [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)]
    Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    buffer_migrate_page() is the only caller of migrate_page_lock_buffers(),
    so move it close to its caller and also drop the now unused stub for
    !CONFIG_BLOCK.

    Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Lock buffers before calling into migrate_page_move_mapping() so that that
    function doesn't have to know about buffers (which is somewhat unexpected
    anyway) and all the buffer head logic is in buffer_migrate_page().

    Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "mm: migrate: Fix page migration stalls for blkdev pages".

    This patchset deals with page migration stalls that were reported by our
    customer due to a block device page that had a bufferhead that was in the
    bh LRU cache.

    The patchset modifies the page migration code so that bufferheads are
    completely handled inside buffer_migrate_page() and then provides a new
    migration helper for pages with buffer heads that is safe to use even for
    block device pages and that also deals with bh lrus.

    This patch (of 6):

    Factor out a function to compute the number of expected page references
    in migrate_page_move_mapping(). Note that we move the hpage_nr_pages()
    and page_has_private() checks from under xas_lock_irq(); however, this
    is safe since we hold the page lock.

    [jack@suse.cz: fix expected_page_refs()]
    Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz
    Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • drain_all_pages is documented to drain per-cpu pages for a given zone (if
    non-NULL). The current implementation doesn't match the description
    though. It will drain all pcp pages for all zones that happen to have
    cached pages on the same cpu as the given zone. This will lead to
    premature pcp cache draining for zones that are not of any interest to the
    caller - e.g. compaction, hwpoison or memory offline.

    This forces the page allocator to take locks and can cause lock
    contention as a result.

    There is no real reason for this sub-optimal implementation. Replace
    per-cpu work item with a dedicated structure which contains a pointer to
    the zone and pass it over to the worker. This will get the zone
    information all the way down to the worker function and do the right job.

    [akpm@linux-foundation.org: avoid 80-col tricks]
    [mhocko@suse.com: refactor the whole changelog]
    Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
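
    The shape of the change, as an editorial sketch of the idea rather
    than the literal diff: give each CPU's drain work an explicit zone
    pointer instead of relying on a bare work_struct.

    /* per-cpu drain request carrying the target zone */
    struct pcpu_drain {
        struct zone *zone;
        struct work_struct work;
    };
    static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);

    static void drain_local_pages_wq(struct work_struct *work)
    {
        struct pcpu_drain *drain = container_of(work, struct pcpu_drain, work);

        /* drain only the pcp lists of the zone the caller asked for */
        drain_local_pages(drain->zone);
    }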
     
  • Kmemleak scan can be cpu intensive and can stall user tasks at times. To
    prevent this, add config DEBUG_KMEMLEAK_AUTO_SCAN to enable/disable auto
    scan on boot up. Also protect first_run with DEBUG_KMEMLEAK_AUTO_SCAN
    as this is meant only for the first automatic scan.

    Link: http://lkml.kernel.org/r/1540231723-7087-1-git-send-email-prpatel@nvidia.com
    Signed-off-by: Sri Krishna chowdary
    Signed-off-by: Sachin Nikam
    Signed-off-by: Prateek
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sri Krishna chowdary
     
    When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
    page initialization can take a long time. Below were the reported init
    times on an 8-socket 96-core 4TB IvyBridge system.

    1) Non-debug kernel without CONFIG_KASAN
    [ 8.764222] node 1 initialised, 132086516 pages in 7027ms

    2) Debug kernel with CONFIG_KASAN
    [ 146.288115] node 1 initialised, 132075466 pages in 143052ms

    So the page init time in a debug kernel was 20x that of the non-debug
    kernel. The long init time can be problematic as the page
    initialization is done with interrupts disabled. In this particular
    case, it caused the appearance of the following warning messages as
    well as NMI backtraces of all the cores that were doing the
    initialization.

    [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
    [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
    [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
    [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
    [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
    [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
    [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
    [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)

    The long init time was mainly caused by the call to kasan_free_pages() to
    poison the newly initialized pages. On a 4TB system, we are talking about
    almost 500GB of memory probably on the same node.

    In reality, we may not need to poison the newly initialized pages before
    they are ever allocated. So KASAN poisoning of freed pages before the
    completion of deferred memory initialization is now disabled. Those pages
    will be properly poisoned when they are allocated or freed after deferred
    pages initialization is done.

    With this change, the new page initialization time became:

    [ 21.948010] node 1 initialised, 132075466 pages in 18702ms

    This was still about double the non-debug kernel time, but was much
    better than before.

    Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Pasha Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
    Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated in
    code. If someone adds an extra migrate type, they may forget to
    enlarge NR_PAGEBLOCK_BITS, so some safeguard is needed.

    NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but these macros live in
    two different header files with a reverse dependency, which makes it a
    little hard to refer to MIGRATE_TYPES in pageblock-flags.h. This patch
    encodes that relation so that it is checked at compile time.

    Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
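
    One way to express such a compile-time reminder (editorial sketch; the
    patch may encode it differently, and PB_MIGRATETYPE_BITS is a
    hypothetical name used only for illustration):

    /* the pageblock flags reserve 3 bits for the migratetype */
    #define PB_MIGRATETYPE_BITS 3

    /* placed in any always-compiled function, this fails the build if
     * MIGRATE_TYPES outgrows the bits reserved for it */
    BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_MIGRATETYPE_BITS));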
     
  • ksm thread unconditionally sleeps in ksm_scan_thread() after each
    iteration:

    schedule_timeout_interruptible(
    msecs_to_jiffies(ksm_thread_sleep_millisecs))

    The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.

    If the user writes a big value by mistake, and the thread enters
    schedule_timeout_interruptible(), it's not possible to cancel the
    sleep by writing a new smaller value; the thread just sleeps until the
    timeout expires.

    The patch fixes the problem by waking the thread each time after the value
    is updated.

    This may also be useful for debugging purposes, and for userspace
    daemons that change the sleep_millisecs value depending on system load.

    Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Cyrill Gorcunov
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
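
    The fix follows a common kernel pattern (editorial sketch, heavily
    simplified): sleep on a waitqueue with a timeout keyed to the value
    that was current when the iteration started, and have the sysfs store
    handler wake the scanner so that a smaller value takes effect
    immediately.

    static DECLARE_WAIT_QUEUE_HEAD(ksm_iter_wait);

    /* in ksm_scan_thread(), instead of a bare schedule_timeout_interruptible() */
    unsigned int sleep_ms = READ_ONCE(ksm_thread_sleep_millisecs);

    wait_event_interruptible_timeout(ksm_iter_wait,
            sleep_ms != READ_ONCE(ksm_thread_sleep_millisecs),
            msecs_to_jiffies(sleep_ms));

    /* in the sleep_millisecs store handler (new_value parsed from userspace) */
    WRITE_ONCE(ksm_thread_sleep_millisecs, new_value);
    wake_up_interruptible(&ksm_iter_wait);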
     
  • filemap_map_pages takes a speculative reference to each page in the range
    before it tries to lock that page. While this is correct it also can
    influence page migration which will bail out when seeing an elevated
    reference count. The faultaround code would bail on seeing a locked
    page, so we can proactively check the PageLocked bit before
    page_cache_get_speculative() and avoid pointless reference count churn.

    Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: David Hildenbrand
    Acked-by: Hugh Dickins
    Reviewed-by: William Kucharski
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Memory migration might fail during offlining and we keep retrying in that
    case. This is currently obfuscated by a goto retry loop. The code is
    hard to follow and as a result it is even suboptimal because each
    retry round scans the full range from start_pfn even though we have
    successfully scanned/migrated the [start_pfn, pfn] range already. This
    is all only because a check_pages_isolated failure has to rescan the
    full range again.

    De-obfuscate the migration retry loop by promoting it to a real for loop.
    In fact remove the goto altogether by making it a proper double loop
    (yeah, gotos are nasty in this specific case). In the end we will get
    slightly more optimal code which is more readable.

    [akpm@linux-foundation.org: reflow comments to 80 cols]
    Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Pavel Tatashin
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "few memory offlining enhancements".

    I have been chasing memory offlining not making progress recently. On
    the way I have noticed a few weird decisions in the code. The
    migration itself is restricted without a reasonable justification and
    the retry loop around the migration is quite messy. This is addressed
    by patch 1 and patch 2.

    Patch 3 targets the faultaround code which has been a hot candidate
    for the initial issue reported upstream [2] and which I am debugging
    internally. It turned out not to be the main contributor in the end
    but I believe we should address it regardless. See the patch
    description for more details.

    [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv

    This patch (of 3):

    do_migrate_range has been limiting the number of pages to migrate to 256
    for some reason which is not documented. Even if the limit made some
    sense back then when it was introduced it doesn't really serve a good
    purpose these days. If the range contains huge pages then we break out of
    the loop too early and go through LRU and pcp caches draining and
    scan_movable_pages is quite suboptimal.

    The only reason to limit the number of pages I can think of is to reduce
    the potential time to react on the fatal signal. But even then the number
    of pages is a questionable metric because even a single page migration
    might block in a non-killable state (e.g. __unmap_and_move).

    Remove the limit and offline the full requested range (this is one
    memblock worth of pages with the current code). Should we ever get a
    report that offlining takes too long to react to a fatal signal then
    we should rather fix the core migration to use killable waits and bail
    out on a signal.

    Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
    Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Userspace falls short when trying to find out whether a specific
    memory range is eligible for THP. There are usecases that would like
    to know that; from
    http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com :

    : This is used to identify heap mappings that should be able to fault thp
    : but do not, and they normally point to a low-on-memory or fragmentation
    : issue.

    The only way to deduce this now is to query for the hg resp. nh flags
    and compare the state with the global setting. Except that there is
    also PR_SET_THP_DISABLE that might change the picture. So the final
    logic is not trivial. Moreover the eligibility of the vma depends on
    the type of VMA as well. In the past we have supported only anonymous
    memory VMAs but things have changed and shmem based vmas are supported
    as well these days and the query logic gets even more complicated
    because the eligibility depends on the mount option and another global
    configuration knob.

    Simplify the current state and report the THP eligibility in
    /proc/<pid>/smaps for each existing vma. Reuse
    transparent_hugepage_enabled for this purpose. The original
    implementation of this function assumes that the caller knows that the
    vma itself is supported for THP so make the core checks into
    __transparent_hugepage_enabled and use it for existing callers.
    __show_smap just uses the new transparent_hugepage_enabled which also
    checks the vma support status (please note that this one has to be out
    of line due to include dependency issues).

    [mhocko@kernel.org: fix oops with NULL ->f_mapping]
    Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Jan Kara
    Cc: Mike Rapoport
    Cc: Paul Oppenheimer
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
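
    After this change userspace can read the per-VMA eligibility directly;
    a minimal check (editorial example, assuming the "THPeligible:" field
    name added by this patch):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/self/smaps", "r");
        if (!f) { perror("fopen"); return 1; }

        while (fgets(line, sizeof(line), f)) {
            /* "THPeligible:   0|1" is emitted once per VMA */
            if (strncmp(line, "THPeligible:", 12) == 0)
                fputs(line, stdout);
        }

        fclose(f);
        return 0;
    }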
     
    To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the
    mmu_notifier invalidate_range_start/end calls. No functional changes
    with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Patch series "mmu notifier contextual informations", v2.

    This patchset adds contextual information, why an invalidation is
    happening, to the mmu notifier callbacks. This is necessary for users
    of mmu notifiers that wish to maintain their own data structures
    without having to add new fields to struct vm_area_struct (vma).

    For instance, a device can have its own page table that mirrors the
    process address space. When a vma is unmapped (munmap() syscall) the
    device driver can free the device page table for the range.

    Today we do not have any information on why an mmu notifier callback
    is happening and thus the device driver has to assume that it is
    always an munmap(). This is inefficient as it means that it needs to
    re-allocate the device page table on the next page fault and rebuild
    the whole device driver data structure for the range.

    Other use cases besides munmap() also exist; for instance it is
    pointless for the device driver to invalidate the device page table
    when the invalidation is for soft dirtiness tracking. Or the device
    driver can optimize away an mprotect() that only changes the page
    table permission access for the range.

    This patchset enables all these optimizations for device drivers. I do
    not include any of those in this series but another patchset I am
    posting will leverage this.

    The patchset is pretty simple from a code point of view. The first two
    patches consolidate all mmu notifier arguments into a struct so that it is
    easier to add/change arguments. The last patch adds the contextual
    information (munmap, protection, soft dirty, clear, ...).

    This patch (of 3):

    To avoid having to change many callback definitions every time we want
    to add a parameter, use a structure to group all parameters for the
    mmu_notifier invalidate_range_start/end callback. No functional changes
    with this patch.

    [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
    Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Jan Kara
    Acked-by: Jason Gunthorpe [infiniband]
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
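
    The gist of the consolidation (editorial sketch; the field list is
    abridged and not the exact kernel definition): callbacks take one
    range descriptor instead of a growing list of scalar arguments.

    /* grouped arguments for the invalidation callbacks */
    struct mmu_notifier_range {
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;
        /* the last patch in the series adds a field describing why the
         * invalidation happens (munmap, protection change, soft dirty, ...) */
    };

    /* callbacks and callers then pass a single descriptor, e.g. */
    int (*invalidate_range_start)(struct mmu_notifier *mn,
                                  const struct mmu_notifier_range *range);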
     
    We have received a bug report that an injected MCE about faulty memory
    prevents memory offline from succeeding on a 4.4-based kernel. The
    underlying reason was that the HWPoison page has an elevated reference
    count and the migration keeps failing. There are two problems with
    that. First of all it is dubious to migrate the poisoned page because
    we know that accessing that memory may fail. Secondly it doesn't make
    any sense to migrate potentially broken content and preserve the
    memory corruption over to a new location.

    Oscar has found out that 4.4 and the current upstream kernels behave
    slightly differently with his simple testcase

    ===

    /* includes and a userspace PAGE_ALIGN added so the testcase builds as-is */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    #define PAGE_ALIGN(addr) (((addr) + 4095UL) & ~4095UL)

    int main(void)
    {
        int ret;
        int i;
        int fd;
        char *array = malloc(4096);
        char *array_locked = malloc(4096);

        fd = open("/tmp/data", O_RDONLY);
        read(fd, array, 4095);

        for (i = 0; i < 4096; i++)
            array_locked[i] = 'd';

        /* note: sizeof(array_locked) is the size of a pointer, as in the
         * original report */
        ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
        if (ret)
            perror("mlock");

        sleep(20);

        ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
        if (ret)
            perror("madvise");

        for (i = 0; i < 4096; i++)
            array_locked[i] = 'd';

        return 0;
    }
    ===

    + offline this memory.

    In 4.4 kernels he saw the hwpoisoned page being returned back to the
    LRU list:
    kernel: [] dump_trace+0x59/0x340
    kernel: [] show_stack_log_lvl+0xea/0x170
    kernel: [] show_stack+0x21/0x40
    kernel: [] dump_stack+0x5c/0x7c
    kernel: [] warn_slowpath_common+0x81/0xb0
    kernel: [] __pagevec_lru_add_fn+0x14c/0x160
    kernel: [] pagevec_lru_move_fn+0xad/0x100
    kernel: [] __lru_cache_add+0x6c/0xb0
    kernel: [] add_to_page_cache_lru+0x46/0x70
    kernel: [] extent_readpages+0xc3/0x1a0 [btrfs]
    kernel: [] __do_page_cache_readahead+0x177/0x200
    kernel: [] ondemand_readahead+0x168/0x2a0
    kernel: [] generic_file_read_iter+0x41f/0x660
    kernel: [] __vfs_read+0xcd/0x140
    kernel: [] vfs_read+0x7a/0x120
    kernel: [] kernel_read+0x3b/0x50
    kernel: [] do_execveat_common.isra.29+0x490/0x6f0
    kernel: [] do_execve+0x28/0x30
    kernel: [] call_usermodehelper_exec_async+0xfb/0x130
    kernel: [] ret_from_fork+0x55/0x80

    And the latter confuses the hotremove path because an LRU page is
    attempted to be migrated and that fails due to an elevated reference
    count. It is quite possible that the reuse of the HWPoisoned page is some
    kind of fixed race condition but I am not really sure about that.

    With the upstream kernel the failure is slightly different. The page
    doesn't seem to have LRU bit set but isolate_movable_page simply fails and
    do_migrate_range simply puts all the isolated pages back to LRU and
    therefore no progress is made and scan_movable_pages finds same set of
    pages over and over again.

    Fix both cases by explicitly checking HWPoisoned pages before we even
    try to get a reference on the page, and try to unmap it if it is still
    mapped. As
    explained by Naoya:

    : Hwpoison code never unmapped those for no big reason because
    : Ksm pages never dominate memory, so we simply didn't have strong
    : motivation to save the pages.

    Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
    HWPoison pages which shouldn't happen but I couldn't convince myself about
    that. Naoya has noted the following:

    : Theoretically no such gurantee, because try_to_unmap() doesn't have a
    : guarantee of success and then memory_failure() returns immediately
    : when hwpoison_user_mappings fails.
    : Or the following code (comes after hwpoison_user_mappings block) also implies
    : that the target page can still have PageLRU flag.
    :
    : /*
    : * Torn down by someone else?
    : */
    : if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
    : action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
    : res = -EBUSY;
    : goto out;
    : }
    :
    : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
    : current version of your patch.

    Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Debugged-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • kmemleak_scan() goes through all online nodes and tries to scan all used
    pages.

    We can do better and use pfn_to_online_page(), so in case we have
    CONFIG_MEMORY_HOTPLUG, offlined pages will be skipped automatically. For
    boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page()
    will fallback to pfn_valid().

    Another little optimization is to check if the page belongs to the node we
    are currently checking, so in case we have nodes interleaved we will not
    check the same pfn multiple times.

    I ran some tests:

    Add some memory to node1 and node2 making it interleaved:

    (qemu) object_add memory-backend-ram,id=ram0,size=1G
    (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
    (qemu) object_add memory-backend-ram,id=ram1,size=1G
    (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2
    (qemu) object_add memory-backend-ram,id=ram2,size=1G
    (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1

    Then, we offline that memory:
    # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
    # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
    # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done

    And we run kmemleak_scan:

    # echo "scan" > /sys/kernel/debug/kmemleak

    before the patch:

    kmemleak: time spend: 41596 us

    after the patch:

    kmemleak: time spend: 34899 us

    [akpm@linux-foundation.org: remove stray newline, per Oscar]
    Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Suggested-by: Michal Hocko
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
    page is never NULL here, so we may remove this useless check.

    Link: http://lkml.kernel.org/r/154419752044.18559.2452963074922917720.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Cyrill Gorcunov
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Since commit 03e85f9d5f1 ("mm/page_alloc: Introduce
    free_area_init_core_hotplug"), some functions changed to only be called
    during system initialization. Concretely, free_area_init_node() and the
    functions that hang from it.

    Also, some variables are no longer used after the system has gone
    through initialization. So this could be considered as a late clean-up
    for that patch.

    This patch changes the functions from __meminit to __init, and the
    variables from __meminitdata to __initdata.

    In return, we get some KBs back:

    Before:
    Freeing unused kernel image memory: 2472K

    After:
    Freeing unused kernel image memory: 2480K

    Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc: Alexander Duyck
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador