08 Jun, 2018

40 commits

  • Fix missing MODULE_LICENSE() warning in lib/ucs2_string.c:

    WARNING: modpost: missing MODULE_LICENSE() in lib/ucs2_string.o
    see include/linux/module.h for more information
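
    The fix is a one-line annotation at the bottom of the file; a minimal
    sketch, assuming the "GPL v2" license string (the exact string is
    whatever the commit chose):

    /* lib/ucs2_string.c: <linux/module.h> provides MODULE_LICENSE() */
    #include <linux/module.h>

    MODULE_LICENSE("GPL v2");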

    Link: http://lkml.kernel.org/r/b2505bb4-dcf5-fc46-443d-e47db1cb2f59@infradead.org
    Signed-off-by: Randy Dunlap
    Cc: Greg Kroah-Hartman
    Cc: Matthew Garrett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • MPI headers contain definitions for a huge number of non-existent
    functions.

    Most of these functions were removed in 2012 by Dmitry Kasatkin
    - 7cf4206a99d1 ("Remove unused code from MPI library")
    - 9e235dcaf4f6 ("Revert "crypto: GnuPG based MPI lib - additional ...")
    - bc95eeadf5c6 ("lib/mpi: removed unused functions")
    however the headers were not updated properly.

    I also deleted some unused macros.

    Link: http://lkml.kernel.org/r/fb2fc1ef-1185-f0a3-d8d0-173d2f97bbaf@virtuozzo.com
    Signed-off-by: Vasily Averin
    Reviewed-by: Andrew Morton
    Cc: Dmitry Kasatkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • percpu_ida() decouples disabling interrupts from the locking operations.
    This breaks some assumptions if the locking operations are replaced like
    they are under -RT.

    The same locking can be achieved by avoiding local_irq_save() and using
    spin_lock_irqsave() instead. percpu_ida_alloc() gains one more preemption
    point because after unlocking the fastpath and before the pool lock is
    acquired, the interrupts are briefly enabled.
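
    A minimal sketch of the conversion pattern (names abridged from
    lib/percpu_ida.c, not the full patch):

    /* before: interrupt disabling decoupled from the lock */
    local_irq_save(flags);
    spin_lock(&pool->lock);
    ...
    spin_unlock(&pool->lock);
    local_irq_restore(flags);

    /* after: one combined operation, which -RT can substitute safely */
    spin_lock_irqsave(&pool->lock, flags);
    ...
    spin_unlock_irqrestore(&pool->lock, flags);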

    Link: http://lkml.kernel.org/r/20180504153218.7301-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Nicholas Bellinger
    Cc: Shaohua Li
    Cc: Kent Overstreet
    Cc: Matthew Wilcox
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Improve the scalability of the IDA by using the per-IDA xa_lock rather
    than the global simple_ida_lock. IDAs are not typically used in
    performance-sensitive locations, but since we have this lock anyway, we
    can use it. It is also a step towards converting the IDA from the radix
    tree to the XArray.
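
    The locking change is mechanical; a hedged sketch, assuming the
    xa_lock that the XArray preparation patches embedded in the IDA's
    radix_tree_root:

    /* before: one global lock serializes every IDA in the system */
    spin_lock_irqsave(&simple_ida_lock, flags);
    ...
    spin_unlock_irqrestore(&simple_ida_lock, flags);

    /* after: the per-IDA lock that already exists in the root */
    xa_lock_irqsave(&ida->ida_rt, flags);
    ...
    xa_unlock_irqrestore(&ida->ida_rt, flags);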

    [akpm@linux-foundation.org: idr.c needs xarray.h]
    Link: http://lkml.kernel.org/r/20180331125332.GF13332@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Rasmus Villemoes
    Cc: Daniel Vetter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use the BITS_TO_LONGS() macro to avoid calculating the remainder
    (bits % BITS_PER_LONG). On ARM64 it saves 5 instructions per
    function: 16 before and 11 after.
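
    BITS_TO_LONGS() is a rounded-up division, so no separate remainder
    test is needed. An illustrative comparison (not the patched function
    itself):

    /* open-coded: division plus remainder check */
    len = bits / BITS_PER_LONG;
    if (bits % BITS_PER_LONG)
            len++;

    /* with the macro: effectively DIV_ROUND_UP(bits, BITS_PER_LONG) */
    len = BITS_TO_LONGS(bits);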

    Link: http://lkml.kernel.org/r/20180411145914.6011-1-ynorov@caviumnetworks.com
    Signed-off-by: Yury Norov
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yury Norov
     
  • There are mode-change and rename-only patches that are not recognized
    by the get_maintainer.pl script.

    Recognize them.

    Link: http://lkml.kernel.org/r/bf63101a908d0ff51948164aa60e672368066186.1526949367.git.joe@perches.com
    Signed-off-by: Joe Perches
    Reported-by: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When we get a hung task it can often be valuable to see _all_ the hung
    tasks on the system before calling panic().

    Quoting from https://syzkaller.appspot.com/text?tag=CrashReport&id=5316056503549952
    ----------------------------------------
    INFO: task syz-executor0:6540 blocked for more than 120 seconds.
    Not tainted 4.16.0+ #13
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    syz-executor0 D23560 6540 4521 0x80000004
    Call Trace:
    context_switch kernel/sched/core.c:2848 [inline]
    __schedule+0x8fb/0x1ef0 kernel/sched/core.c:3490
    schedule+0xf5/0x430 kernel/sched/core.c:3549
    schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607
    __mutex_lock_common kernel/locking/mutex.c:833 [inline]
    __mutex_lock+0xb7f/0x1810 kernel/locking/mutex.c:893
    mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
    lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355
    __blkdev_driver_ioctl block/ioctl.c:303 [inline]
    blkdev_ioctl+0x1759/0x1e00 block/ioctl.c:601
    ioctl_by_bdev+0xa5/0x110 fs/block_dev.c:2060
    isofs_get_last_session fs/isofs/inode.c:567 [inline]
    isofs_fill_super+0x2ba9/0x3bc0 fs/isofs/inode.c:660
    mount_bdev+0x2b7/0x370 fs/super.c:1119
    isofs_mount+0x34/0x40 fs/isofs/inode.c:1560
    mount_fs+0x66/0x2d0 fs/super.c:1222
    vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037
    vfs_kern_mount fs/namespace.c:2514 [inline]
    do_new_mount fs/namespace.c:2517 [inline]
    do_mount+0xea4/0x2b90 fs/namespace.c:2847
    ksys_mount+0xab/0x120 fs/namespace.c:3063
    SYSC_mount fs/namespace.c:3077 [inline]
    SyS_mount+0x39/0x50 fs/namespace.c:3074
    do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    (...snipped...)
    Showing all locks held in the system:
    (...snipped...)
    2 locks held by syz-executor0/6540:
    #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: alloc_super fs/super.c:211 [inline]
    #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: sget_userns+0x3b2/0xe60 fs/super.c:502 /* down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); */
    #1: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
    (...snipped...)
    3 locks held by syz-executor7/6541:
    #0: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */
    #1: 000000007bf3d3f9 (&bdev->bd_mutex){+.+.}, at: blkdev_reread_part+0x1e/0x40 block/ioctl.c:192
    #2: 00000000566d4c39 (&type->s_umount_key#50){.+.+}, at: __get_super.part.10+0x1d3/0x280 fs/super.c:663 /* down_read(&sb->s_umount); */
    ----------------------------------------

    When reporting an AB-BA deadlock like the one shown above, it would be
    nice if the trace of PID=6541 were printed as well as the trace of
    PID=6540 before calling panic().

    Showing hung tasks up to /proc/sys/kernel/hung_task_warnings could delay
    calling panic() but normally there should not be so many hung tasks.

    Link: http://lkml.kernel.org/r/201804050705.BHE57833.HVFOFtSOMQJFOL@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Paul E. McKenney
    Acked-by: Dmitry Vyukov
    Cc: Vegard Nossum
    Cc: Mandeep Singh Baines
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • This header file is not exported. It is safe to reference types without
    double-underscore prefix.

    Link: http://lkml.kernel.org/r/1526350925-14922-3-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Cc: Geert Uytterhoeven
    Cc: Alexey Dobriyan
    Cc: Lihao Liang
    Cc: Philippe Ombredanne
    Cc: Pekka Enberg
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • The corresponding UAPI header has the same typedefs, except that it
    prefixes them with a double underscore for user space. Use them for
    the kernel-space typedefs.
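
    The pattern, sketched with the aligned types as an example (the
    exact typedefs touched are in the patch):

    /* the uapi header already defines the __-prefixed versions */
    typedef __aligned_u64 aligned_u64;
    typedef __aligned_be64 aligned_be64;
    typedef __aligned_le64 aligned_le64;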

    Link: http://lkml.kernel.org/r/1526350925-14922-2-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Reviewed-by: Andrew Morton
    Cc: Geert Uytterhoeven
    Cc: Alexey Dobriyan
    Cc: Lihao Liang
    Cc: Philippe Ombredanne
    Cc: Pekka Enberg
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • The corresponding UAPI header has the same typedefs, except that it
    prefixes them with a double underscore for user space. Use them for
    the kernel-space typedefs.

    Link: http://lkml.kernel.org/r/1526350925-14922-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Reviewed-by: Andrew Morton
    Cc: Geert Uytterhoeven
    Cc: Alexey Dobriyan
    Cc: Lihao Liang
    Cc: Philippe Ombredanne
    Cc: Pekka Enberg
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • * Test lookup in /proc/self/fd.
    "map_files" lookup story showed that lookup is not that simple.

    * Test that all those symlinks open the same file.
    Check with (st_dev, st_ino).

    * Test that kernel threads do not have anything in their /proc/*/fd/
    directory.

    Now this is where things get interesting.

    First, kernel threads aren't pinned by /proc/self or equivalent,
    thus some "atomicity" is required.

    Second, ->comm can contain whitespace and ')'.
    No, they are not escaped.

    Third, the only reliable way to check whether a process is a kernel
    thread appears to be field #9 in /proc/*/stat.

    This field is struct task_struct::flags in decimal!
    The check is done by testing the PF_KTHREAD flag, like we do in the
    kernel.

    The PF_KTHREAD value is a part of the userspace ABI !!!

    Other methods for determining kernel threadness are not reliable:
    * RSS can be 0 if everything is swapped out, even while reading
    from /proc/self.

    * ->total_vm CAN BE ZERO if the process is finishing

    munmap(NULL, whole address space);

    * /proc/*/maps and similar files can be empty because unmapping
    everything works. A read returning 0 can't distinguish between a
    kernel thread and such a suicidal process.
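
    A stand-alone sketch of the field #9 check (illustrative, not the
    selftest itself; PF_KTHREAD copied from include/linux/sched.h):

    #include <stdio.h>
    #include <string.h>

    #define PF_KTHREAD 0x00200000   /* userspace ABI, per the above */

    /* Returns 1 for a kernel thread, 0 for a normal process, -1 on
     * error.  Parsing starts after the *last* ')' because ->comm may
     * itself contain spaces and ')'. */
    static int is_kernel_thread(int pid)
    {
            char path[64], buf[4096], *p;
            unsigned long flags;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/stat", pid);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (!fgets(buf, sizeof(buf), f)) {
                    fclose(f);
                    return -1;
            }
            fclose(f);

            p = strrchr(buf, ')');
            if (!p)
                    return -1;
            /* after ')': state ppid pgrp session tty_nr tpgid flags */
            if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %lu", &flags) != 1)
                    return -1;
            return !!(flags & PF_KTHREAD);
    }

    int main(void)
    {
            printf("pid 2: %d\n", is_kernel_thread(2));   /* kthreadd */
            return 0;
    }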

    Link: http://lkml.kernel.org/r/20180505000414.GA15090@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct stack_trace::nr_entries is defined as "unsigned int" (YAY!) so
    the iterator should be unsigned as well.

    It saves 1 byte of code or something like that.

    Link: http://lkml.kernel.org/r/20180423215248.GG9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • It's defined as atomic_t and really long signal queues are unheard of.

    Link: http://lkml.kernel.org/r/20180423215119.GF9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • All those lengths are unsigned as they should be.

    Link: http://lkml.kernel.org/r/20180423213751.GC9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kstat is thread local.

    Link: http://lkml.kernel.org/r/20180423213626.GB9043@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Code can be consolidated if a dummy region of length 0 is used in the
    normal case of a \0-separated command line:

    1) [arg_start, arg_end) + [dummy len=0]
    2) [arg_start, arg_end) + [env_start, env_end)
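
    A sketch of the idea (variable names hypothetical, not the patch
    itself): one loop walks two ranges, and case 1 simply makes the
    second range empty:

    struct { unsigned long start, end; } range[2] = {
            { arg_start, arg_end },
            { arg_end,   arg_end },         /* dummy, len = 0 */
    };
    /* case 2 uses { env_start, env_end } as the second range */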

    Link: http://lkml.kernel.org/r/20180221193335.GB28678@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The "rv" variable is used both as a counter of bytes transferred and
    as an error-value holder, but it can be reduced to holding only error
    values if the original start of the userspace buffer is stashed and
    used at the very end.

    [akpm@linux-foundation.org: simplify cleanup code]
    Link: http://lkml.kernel.org/r/20180221193009.GA28678@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The "final" variable is OK, but we can get away with fewer lines.

    Link: http://lkml.kernel.org/r/20180221192751.GC28548@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • access_remote_vm() doesn't return negative errors; it returns the
    number of bytes read/written (0 if an error occurs). This allows us
    to delete some comparisons which never trigger.

    Reuse the "nr_read" variable while at it.
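
    The resulting calling pattern (sketch):

    nr_read = access_remote_vm(mm, addr, buf, count, 0);
    if (nr_read == 0)       /* nothing copied: error or EOF, never < 0 */
            break;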

    Link: http://lkml.kernel.org/r/20180221192605.GB28548@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • When commit bd33ef368135 ("mm: enable page poisoning early at boot")
    got rid of PAGE_EXT_DEBUG_POISON, page_is_poisoned() was left behind
    in the header. This patch cleans up that leftover.

    Link: http://lkml.kernel.org/r/1528101069-21637-1-git-send-email-kpark3469@gmail.com
    Signed-off-by: Sahara
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sahara
     
  • The LKP robot found a 27% will-it-scale/page_fault3 performance
    regression caused by commit e27be240df53 ("mm: memcg: make sure
    memory.events is uptodate when waking pollers").

    What the test does is:
    1 mkstemp() a 128M file on a tmpfs;
    2 start $nr_cpu processes, each to loop the following:
    2.1 mmap() this file in shared write mode;
    2.2 write 0 to this file in PAGE_SIZE steps till the end of the file;
    2.3 munmap() this file and repeat this process.
    3 After 5 minutes, check how many loops they managed to complete, the
    higher the better.

    The commit itself looks innocent enough as it merely changed some
    event counting mechanism and this test didn't trigger those events at
    all. Perf shows increased cycles spent on accessing
    root_mem_cgroup->stat_cpu in count_memcg_event_mm() (called by
    handle_mm_fault()) and in __mod_memcg_state() (called by
    page_add_file_rmap()). So it's likely due to the changed layout of
    'struct mem_cgroup': either stat_cpu now falls into a constantly
    modified cacheline, or some hot fields stopped sharing a cacheline.

    I verified this by moving memory_events[] back to where it was:

    : --- a/include/linux/memcontrol.h
    : +++ b/include/linux/memcontrol.h
    : @@ -205,7 +205,6 @@ struct mem_cgroup {
    : int oom_kill_disable;
    :
    : /* memory.events */
    : - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
    : struct cgroup_file events_file;
    :
    : /* protect arrays of thresholds */
    : @@ -238,6 +237,7 @@ struct mem_cgroup {
    : struct mem_cgroup_stat_cpu __percpu *stat_cpu;
    : atomic_long_t stat[MEMCG_NR_STAT];
    : atomic_long_t events[NR_VM_EVENT_ITEMS];
    : + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
    :
    : unsigned long socket_pressure;

    And performance restored.

    Later investigation found that as long as the three fields
    moving_account, move_lock_task and stat_cpu are in the same cacheline,
    performance is good. To avoid future performance surprises from other
    commits changing the layout of 'struct mem_cgroup', this patch makes
    sure the three fields stay in the same cacheline.

    One concern with this approach: moving_account and move_lock_task can
    be modified when a process changes memory cgroup, while stat_cpu is an
    always-read field, so placing them in the same cacheline might hurt.
    I assume it is rare for a process to change memory cgroup, so this
    should be OK.
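
    As a user-space toy (hypothetical stand-in struct, not the kernel
    one), the pahole-style arithmetic for "do these fields share a
    64-byte cacheline" looks like:

    #include <stdio.h>
    #include <stddef.h>

    struct demo {
            char pad[64];           /* stands in for the preceding layout */
            _Bool moving_account;
            void *move_lock_task;
            void *stat_cpu;
    };

    int main(void)
    {
            size_t first = offsetof(struct demo, moving_account);
            size_t last = offsetof(struct demo, stat_cpu)
                            + sizeof(void *) - 1;

            printf("same cacheline: %s\n",
                   first / 64 == last / 64 ? "yes" : "no");
            return 0;
    }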

    Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
    Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com
    Signed-off-by: Aaron Lu
    Reported-by: kernel test robot
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • kvmalloc warned about an incompatible gfp_mask to catch abusers
    (mostly GFP_NOFS) with the intention that this would motivate authors
    of the code to fix those. Linus argues that this just motivates
    people to do even more hacks like

    if (gfp == GFP_KERNEL)
            kvmalloc
    else
            kmalloc

    I haven't seen this happening much (Linus pointed to bucket_lock,
    which special-cases an atomic allocation, but my git foo hasn't found
    much more), but it is true that such cases can grow in the future.
    Therefore Linus suggested simply not falling back to vmalloc for
    incompatible gfp flags and rather sticking with the kmalloc path.
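
    A minimal sketch of the resulting check in kvmalloc_node() (the
    exact placement is in the patch):

    /* vmalloc is only usable with GFP_KERNEL-compatible flags; for
     * anything else behave exactly like kmalloc and let it fail */
    if ((flags & GFP_KERNEL) != GFP_KERNEL)
            return kmalloc_node(size, flags, node);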

    Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Linus Torvalds
    Cc: Tom Herbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When bit is equal to 0x4, it means OPT_ZONE_DMA32 should be obtained
    from GFP_ZONE_TABLE. OPT_ZONE_DMA32 is equal to either ZONE_DMA32 or
    ZONE_NORMAL, depending on CONFIG_ZONE_DMA32.

    Similarly, when bit is equal to 0xc, OPT_ZONE_DMA32 should be
    obtained for an allocation with the GFP_MOVABLE policy, so ZONE_DMA32
    or ZONE_NORMAL are again the possible result values.
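
    For context, the lookup being described is gfp_zone() in
    include/linux/gfp.h (quoted as of this kernel; GFP_ZONES_SHIFT bits
    per table entry, indexed by the low zone-selector bits):

    static inline enum zone_type gfp_zone(gfp_t flags)
    {
            enum zone_type z;
            int bit = (__force int) (flags & GFP_ZONEMASK);

            z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
                                     ((1 << GFP_ZONES_SHIFT) - 1);
            VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
            return z;
    }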

    Link: http://lkml.kernel.org/r/20180601163403.1032-1-yehs2007@zoho.com
    Signed-off-by: Huaisheng Ye
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Kate Stewart
    Cc: "Levin, Alexander (Sasha Levin)"
    Cc: Greg Kroah-Hartman
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huaisheng Ye
     
  • shmem/tmpfs uses pseudo vma to allocate page with correct NUMA policy.

    The pseudo vma doesn't have vm_page_prot set. We are going to encode
    the encryption KeyID in vm_page_prot. Having garbage there causes
    problems.

    Zero out all unused fields in the pseudo vma.
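
    A minimal sketch of the shape of the fix in shmem's pseudo-vma
    setup (details may differ from the patch):

    /* Create a pseudo vma that just contains the policy; zero
     * everything else instead of leaving stack garbage behind */
    memset(vma, 0, sizeof(*vma));
    /* Bias interleave by inode number to distribute better across nodes */
    vma->vm_pgoff = index + info->vfs_inode.i_ino;
    vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);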

    Link: http://lkml.kernel.org/r/20180531135602.20321-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
    allocations that can ignore memory policies. The zonelist is obtained
    from current CPU's node. This is a problem for __GFP_THISNODE
    allocations that want to allocate on a different node, e.g. because the
    allocating thread has been migrated to a different CPU.

    This has been observed to break SLAB in our 4.4-based kernel, because
    there it relies on __GFP_THISNODE working as intended. If a slab page
    is put on the wrong node's list, then further list manipulations may
    corrupt the list because page_to_nid() is used to determine which
    node's list_lock should be locked, and thus we may take the wrong
    lock and race.

    Current SLAB implementation seems to be immune by luck thanks to commit
    511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
    arbitrary node") but there may be others assuming that __GFP_THISNODE
    works as promised.

    We can fix it by simply removing the zonelist reset completely. There
    is actually no reason to reset it, because memory policies and cpusets
    don't affect the zonelist choice in the first place. This was different
    when commit 183f6371aac2 ("mm: ignore mempolicies when using
    ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
    own restricted zonelists.
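
    The shape of the fix in __alloc_pages_slowpath(), sketched with
    abbreviated context:

    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
            /* the zonelist reset that broke __GFP_THISNODE is gone:
             *   ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
             */
            ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                            ac->high_zoneidx, ac->nodemask);
    }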

    We might consider this for 4.17 although I don't know if there's
    anything currently broken.

    SLAB is currently not affected, but in kernels older than 4.7 that don't
    yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
    allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
    ones I'll have to check.

    So stable backports should be more important, but will have to be
    reviewed carefully, as the code went through many changes. BTW I think
    that also the ac->preferred_zoneref reset is currently useless if we
    don't also reset ac->nodemask from a mempolicy to NULL first (which we
    probably should for the OOM victims etc?), but I would leave that for a
    separate patch.

    Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • If a process monitored with userfaultfd changes its memory mappings
    or fork()s at the same time as the uffd monitor fills the process
    memory with UFFDIO_COPY, the actual creation of page table entries
    and copying of the data in mcopy_atomic may happen either before or
    after the memory mapping modifications, and there is no way for the
    uffd monitor to maintain a consistent view of the process memory
    layout.

    For instance, let's consider fork() running in parallel with
    userfaultfd_copy():

    process                          | uffd monitor
    ---------------------------------+------------------------------
    fork()                           | userfaultfd_copy()
    ...                              | ...
    dup_mmap()                       | down_read(mmap_sem)
    down_write(mmap_sem)             | /* create PTEs, copy data */
    dup_uffd()                       | up_read(mmap_sem)
    copy_page_range()                |
    up_write(mmap_sem)               |
    dup_uffd_complete()              |
    /* notify monitor */             |

    If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
    be present by the time copy_page_range() is called and they will appear
    in the child's memory mappings. However, if the fork() is the first to
    take the mmap_sem, the new pages won't be mapped in the child's address
    space.

    If the pages are not present and the child tries to access them, the
    monitor will get a page fault notification and everything is fine.
    However, if the pages *are present*, the child can access them
    without uffd noticing. And if we copy them into the child, it'll see
    the wrong data. Since we are talking about a background copy, we'd
    need to decide whether the pages should be copied or not regardless
    of #PF notifications.

    Since the userfaultfd monitor has no way to determine what the order
    was, let's disallow userfaultfd_copy in parallel with the
    non-cooperative events. In such a case we return -EAGAIN and the uffd
    monitor can understand that userfaultfd_copy() clashed with a
    non-cooperative event and take an appropriate action.
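
    A hedged sketch of the mechanism (flag name per the patch
    discussion; details may differ): non-cooperative event handlers mark
    the context, and the ioctl side checks the mark:

    /* in the fork/mremap/madvise event paths */
    WRITE_ONCE(ctx->mmap_changing, true);

    /* in userfaultfd_copy() and friends */
    ret = -EAGAIN;
    if (READ_ONCE(ctx->mmap_changing))
            goto out;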

    Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Currently an attempt to set swap.max to a value lower than the actual
    swap usage fails, which causes configuration problems as there's no way
    of lowering the configuration below the current usage short of turning
    off swap entirely. This makes swap.max difficult to use and allows
    delegatees to lock the delegator out of reducing swap allocation.

    This patch updates swap_max_write() so that the limit can be lowered
    below the current usage. It doesn't implement active reclaiming of swap
    entries for the following reasons.

    * mem_cgroup_swap_full() already tells the swap machinery to
    aggressively reclaim swap entries if the usage is above 50% of the
    limit, so simply lowering the limit automatically triggers gradual
    reclaim.

    * Forcing back swapped-out pages is likely to heavily impact the
    workload and mess up the working set. Given that swap usually is a
    lot less valuable and less scarce, letting the existing usage
    dissipate over time through the above gradual reclaim and as pages
    are faulted back in is likely the better behavior.
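
    A minimal sketch of the resulting swap_max_write() logic (field
    naming follows the concurrent page_counter rename; older trees call
    it .limit):

    /* accept any new max, even below the current usage; the excess
     * dissipates via the gradual reclaim described above */
    xchg(&memcg->swap.max, max);
    return nbytes;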

    Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Acked-by: Roman Gushchin
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    See commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    vmf_error() is the newly introduced inline function in 4.17-rc6.
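
    The conversion pattern, sketched on a hypothetical fault handler
    (do_something() is a made-up helper):

    static vm_fault_t example_fault(struct vm_fault *vmf)   /* was: int */
    {
            int err = do_something(vmf);

            if (err)
                    return vmf_error(err);  /* maps errno to VM_FAULT_* */
            return VM_FAULT_NOPAGE;
    }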

    Link: http://lkml.kernel.org/r/20180521202410.GA17912@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • Christoph doubts anyone was using the 'reserved' file in sysfs, so remove
    it.

    Link: http://lkml.kernel.org/r/20180518194519.3820-17-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The reserved field was only used for embedding an rcu_head in the data
    structure. With the previous commit, we no longer need it. That lets us
    remove the 'reserved' argument to a lot of functions.

    Link: http://lkml.kernel.org/r/20180518194519.3820-16-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • rcu_head may now grow larger than list_head without affecting slab or
    slub.

    Link: http://lkml.kernel.org/r/20180518194519.3820-15-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Make hmm_data an explicit member of the struct page union.

    Link: http://lkml.kernel.org/r/20180518194519.3820-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • For pgd page table pages, x86 overloads the page->index field to store a
    pointer to the mm_struct. Rename this to pt_mm so it's visible to other
    users.
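
    The x86 accessors then read naturally; a sketch against the helpers
    in arch/x86/mm/pgtable.c:

    static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
    {
            virt_to_page(pgd)->pt_mm = mm;  /* was: ->index = (pgoff_t)mm */
    }

    static struct mm_struct *pgd_page_get_mm(struct page *page)
    {
            return page->pt_mm;     /* was: (struct mm_struct *)page->index */
    }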

    Link: http://lkml.kernel.org/r/20180518194519.3820-13-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Rewrite the documentation to describe what you can use in struct page
    rather than what you can't.

    Link: http://lkml.kernel.org/r/20180518194519.3820-12-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Randy Dunlap
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This gives us five words of space in a single union in struct page. The
    compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
    systems, but that does not seem likely to cause any trouble.

    Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Since the LRU is two words, this does not affect the double-word alignment
    of SLUB's freelist.

    Link: http://lkml.kernel.org/r/20180518194519.3820-10-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Now that we can represent the location of 'deferred_list' in C instead of
    comments, make use of that ability.

    Link: http://lkml.kernel.org/r/20180518194519.3820-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • By combining these three one-word unions into one three-word union, we
    make it easier for users to add their own multi-word fields to struct
    page, as well as making it obvious that SLUB needs to keep its double-word
    alignment for its freelist & counters.

    No field moves position; verified with pahole.

    Link: http://lkml.kernel.org/r/20180518194519.3820-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Keeping the refcount in the union only encourages people to put something
    else in the union which will overlap with _refcount and eventually explode
    messily. pahole reports no fields change location.

    Link: http://lkml.kernel.org/r/20180518194519.3820-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • By moving page->private to the fourth word of struct page, we can put the
    SLUB counters in the same word as SLAB's s_mem and still do the
    cmpxchg_double trick. Now the SLUB counters no longer overlap with the
    mapcount or refcount so we can drop the call to page_mapcount_reset() and
    simplify set_page_slub_counters() to a single line.
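
    With the overlap gone, the SLUB helper collapses to a plain store; a
    sketch of the simplification in mm/slub.c:

    /* before: frozen/inuse/objects were copied field by field through
     * a temporary page so the overlapping _refcount stayed intact */

    /* after: counters no longer shares a word with _refcount */
    static inline void set_page_slub_counters(struct page *page,
                                              unsigned long counters_new)
    {
            page->counters = counters_new;
    }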

    Link: http://lkml.kernel.org/r/20180518194519.3820-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox