26 Sep, 2019

6 commits

  • Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
    for the case where the augmented value is a scalar whose definition
    follows a max(f(node)) pattern. This actually covers all present uses of
    RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
    various RBCOMPUTE function definitions.
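
    The max(f(node)) pattern the macro captures can be sketched outside the
    kernel: each node caches the maximum of f() over its subtree, and the
    RBCOMPUTE callback the macro generates recomputes one node's cache from
    its own value and its children's caches. A minimal user-space model
    (struct anode and anode_compute_max are illustrative names, not kernel
    API):

```c
#include <assert.h>
#include <stddef.h>

/* Each node caches the maximum of f() over its subtree; here f(node) is
 * simply node->value. */
struct anode {
	struct anode *left, *right;
	long value;        /* f(node) */
	long subtree_max;  /* cached max of f() over this subtree */
};

/* What the generated RBCOMPUTE callback does for one node: recompute the
 * cache from the node's own value and the children's (already correct)
 * caches. Rebalancing invokes this bottom-up along the affected path. */
static void anode_compute_max(struct anode *n)
{
	long m = n->value;

	if (n->left && n->left->subtree_max > m)
		m = n->left->subtree_max;
	if (n->right && n->right->subtree_max > m)
		m = n->right->subtree_max;
	n->subtree_max = m;
}
```

    In the kernel, RB_DECLARE_CALLBACKS_MAX stamps out the
    propagate/copy/rotate callbacks around exactly this per-node
    computation.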

    [walken@google.com: fix mm/vmalloc.c]
    Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
    [walken@google.com: re-add check to check_augmented()]
    Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
    Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
    Signed-off-by: Michel Lespinasse
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "make RB_DECLARE_CALLBACKS more generic", v3.

    These changes are intended to make the RB_DECLARE_CALLBACKS macro more
    generic (allowing the augmented subtree information to be a struct instead
    of a scalar).

    I have verified the compiled lib/interval_tree.o and mm/mmap.o files to
    check that they didn't change. This held as expected for interval_tree.o;
    mmap.o did have some changes which could be reverted by marking
    __vma_link_rb as noinline. I did not add such a change to the patchset; I
    felt it was reasonable enough to leave the inlining decision up to the
    compiler.

    This patch (of 3):

    Add a short comment summarizing the arguments to RB_DECLARE_CALLBACKS.
    The arguments are also now capitalized. This copies the style of the
    INTERVAL_TREE_DEFINE macro.

    No functional changes in this commit, only comments and capitalization.

    Link: http://lkml.kernel.org/r/20190703040156.56953-2-walken@google.com
    Signed-off-by: Michel Lespinasse
    Acked-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Howells
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • As was already noted in rbtree.h, the logic to cache rb_first (or
    rb_last) can easily be implemented externally to the core rbtree api.
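
    The idea of external caching can be sketched with a plain BST: keep a
    leftmost pointer next to the root and maintain it on insert, so first()
    becomes O(1). This is a hedged user-space model of what rb_root_cached
    does, not the kernel code; names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

struct tnode {
	struct tnode *left, *right;
	int key;
};

struct cached_root {
	struct tnode *root;
	struct tnode *leftmost;	/* cached rb_first() equivalent */
};

static void cached_insert(struct cached_root *t, struct tnode *n)
{
	struct tnode **p = &t->root;
	int is_leftmost = 1;

	n->left = n->right = NULL;
	while (*p) {
		if (n->key < (*p)->key) {
			p = &(*p)->left;
		} else {
			p = &(*p)->right;
			is_leftmost = 0;	/* went right at least once */
		}
	}
	*p = n;
	/* Only a node that descended purely leftward can be the new
	 * minimum, so the cache update costs nothing extra. */
	if (is_leftmost)
		t->leftmost = n;
}
```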

    This commit takes the changes applied to the include/linux/ and lib/
    rbtree files in 9f973cb38088 ("lib/rbtree: avoid generating code twice
    for the cached versions"), and applies these to the
    tools/include/linux/ and tools/lib/ files as well to keep them
    synchronized.

    Link: http://lkml.kernel.org/r/20190703034812.53002-1-walken@google.com
    Signed-off-by: Michel Lespinasse
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When building with W=1, gcc properly complains that there are no prototypes:

    CC kernel/elfcore.o
    kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
    7 | Elf_Half __weak elf_core_extra_phdrs(void)
    | ^~~~~~~~~~~~~~~~~~~~
    kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
    12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~
    kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
    17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
    | ^~~~~~~~~~~~~~~~~~~~~~~~~
    kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
    22 | size_t __weak elf_core_extra_data_size(void)
    | ^~~~~~~~~~~~~~~~~~~~~~~~

    Provide the include file so that gcc is happy and we avoid potential code drift.
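
    The warning and the fix follow the usual C pattern: -Wmissing-prototypes
    fires when an external-linkage function is defined with no prior
    prototype, and a shared header silences it while keeping callers and
    definition in sync. Condensed to one file for illustration (in-tree the
    prototype lives in a header, and the real function is __weak):

```c
#include <assert.h>
#include <stddef.h>

/* What would live in the shared header: the prototype that every caller
 * and the definition below both see. */
size_t elf_core_extra_data_size(void);

/* The definition now has a previous prototype, so gcc -Wmissing-prototypes
 * stays quiet, and any future drift between the header's declaration and
 * this signature becomes a hard compile error instead of silent skew. */
size_t elf_core_extra_data_size(void)
{
	return 0;
}
```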

    Link: http://lkml.kernel.org/r/29875.1565224705@turing-police
    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valdis Kletnieks
     
  • Add a header include guard just in case.

    My motivation is to allow Kbuild to detect missing include guard:

    https://patchwork.kernel.org/patch/11063011/

    Before I enable this checker I want to fix as many headers as possible.
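
    The guard being added is the standard pattern: the first preprocessor
    directive is an #ifndef of a macro defined immediately after, so
    including the header twice is harmless. A minimal example (the guard
    and function names are illustrative):

```c
/* example.h - the include guard makes repeated #include a no-op */
#ifndef _LINUX_EXAMPLE_H
#define _LINUX_EXAMPLE_H

static inline int example_answer(void)
{
	return 42;
}

#endif /* _LINUX_EXAMPLE_H */
```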

    Link: http://lkml.kernel.org/r/20190728154728.11126-1-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Thomas has noticed the following NULL ptr dereference when using cgroup
    v1 kmem limit:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    PGD 0
    P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
    Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
    RIP: 0010:create_empty_buffers+0x24/0x100
    Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
    RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
    RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
    RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
    R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
    R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
    FS: 00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
    Call Trace:
    create_page_buffers+0x4d/0x60
    __block_write_begin_int+0x8e/0x5a0
    ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
    ? jbd2__journal_start+0xd7/0x1f0
    ext4_da_write_begin+0x112/0x3d0
    generic_perform_write+0xf1/0x1b0
    ? file_update_time+0x70/0x140
    __generic_file_write_iter+0x141/0x1a0
    ext4_file_write_iter+0xef/0x3b0
    __vfs_write+0x17e/0x1e0
    vfs_write+0xa5/0x1a0
    ksys_write+0x57/0xd0
    do_syscall_64+0x55/0x160
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Tetsuo then noticed that this is because __memcg_kmem_charge_memcg
    fails a __GFP_NOFAIL charge when the kmem limit is reached. This is
    wrong behavior because nofail allocations are not allowed to fail. The
    normal charge path simply forces the charge even if that means crossing
    the limit. Kmem accounting should do the same.
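
    The rule the fix enforces can be modeled in a few lines: a charge that
    would exceed the limit normally fails, but a nofail request must be
    forced over the limit instead. This is a simplified user-space sketch
    of that rule, not the memcg code; all names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define EX_NOFAIL 0x1u	/* stand-in for __GFP_NOFAIL */

struct counter {
	long usage, limit;
};

static bool try_charge(struct counter *c, long n)
{
	if (c->usage + n > c->limit)
		return false;
	c->usage += n;
	return true;
}

/* A nofail request must never see failure: when the limit is hit, force
 * the charge over the limit, as the regular charge path already does. */
static int kmem_charge(struct counter *c, long n, unsigned int gfp)
{
	if (try_charge(c, n))
		return 0;
	if (gfp & EX_NOFAIL) {
		c->usage += n;	/* cross the limit rather than fail */
		return 0;
	}
	return -1;
}
```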

    Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Thomas Lindroth
    Debugged-by: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Thomas Lindroth
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Sep, 2019

34 commits

  • Pull i2c updates from Wolfram Sang:

    - new driver for ICY, an Amiga Zorro card :)

    - axxia driver gained slave mode support, NXP driver gained ACPI

    - the slave EEPROM backend gained 16 bit address support

    - and lots of regular driver updates and reworks

    * 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits)
    i2c: tegra: Move suspend handling to NOIRQ phase
    i2c: imx: ACPI support for NXP i2c controller
    i2c: uniphier(-f): remove all dev_dbg()
    i2c: uniphier(-f): use devm_platform_ioremap_resource()
    i2c: slave-eeprom: Add comment about address handling
    i2c: exynos5: Remove IRQF_ONESHOT
    i2c: stm32f7: Make structure stm32f7_i2c_algo constant
    i2c: cht-wc: drop check because i2c_unregister_device() is NULL safe
    i2c-eeprom_slave: Add support for more eeprom models
    i2c: fsi: Add of_put_node() before break
    i2c: synquacer: Make synquacer_i2c_ops constant
    i2c: hix5hd2: Remove IRQF_ONESHOT
    i2c: i801: Use iTCO version 6 in Cannon Lake PCH and beyond
    watchdog: iTCO: Add support for Cannon Lake PCH iTCO
    i2c: iproc: Make bcm_iproc_i2c_quirks constant
    i2c: iproc: Add full name of devicetree node to adapter name
    i2c: piix4: Add ACPI support
    i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h
    i2c: ocores: use request_any_context_irq() to register IRQ handler
    i2c: designware: Fix optional reset error handling
    ...

    Linus Torvalds
     
  • Pull sound fixes from Takashi Iwai:
    "A few small remaining wrap-up for this merge window.

    Most of patches are device-specific (HD-audio and USB-audio quirks,
    FireWire, pcm316a, fsl, rsnd, Atmel, and TI fixes), while there is a
    simple fix (actually two commits) for ASoC core"

    * tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ALSA: usb-audio: Add DSD support for EVGA NU Audio
    ALSA: hda - Add laptop imic fixup for ASUS M9V laptop
    ASoC: ti: fix SND_SOC_DM365_VOICE_CODEC dependencies
    ASoC: pcm3168a: The codec does not support S32_LE
    ASoC: core: use list_del_init and move it back to soc_cleanup_component
    ALSA: hda/realtek - PCI quirk for Medion E4254
    ALSA: hda - Apply AMD controller workaround for Raven platform
    ASoC: rsnd: do error check after rsnd_channel_normalization()
    ASoC: atmel_ssc_dai: Remove wrong spinlock usage
    ASoC: core: delete component->card_list in soc_remove_component only
    ASoC: fsl_sai: Fix noise when using EDMA
    ALSA: usb-audio: Add Hiby device family to quirks for native DSD support
    ALSA: hda/realtek - Fix alienware headset mic
    ALSA: dice: fix wrong packet parameter for Alesis iO26

    Linus Torvalds
     
  • Pull more io_uring updates from Jens Axboe:
    "A collection of later fixes and additions, that weren't quite ready
    for pushing out with the initial pull request.

    This contains:

    - Fix potential use-after-free of shadow requests (Jackie)

    - Fix potential OOM crash in request allocation (Jackie)

    - kmalloc+memcpy -> kmemdup cleanup (Jackie)

    - Fix poll crash regression (me)

    - Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)

    - Add support for timeouts, making it easier to do epoll_wait()
    conversions, for instance (me)

    - Ensure io_uring works without f_ops->read_iter() and
    f_ops->write_iter() (me)"

    * tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
    io_uring: correctly handle non ->{read,write}_iter() file_operations
    io_uring: IORING_OP_TIMEOUT support
    io_uring: use cond_resched() in sqthread
    io_uring: fix potential crash issue due to io_get_req failure
    io_uring: ensure poll commands clear ->sqe
    io_uring: fix use-after-free of shadow_req
    io_uring: use kmemdup instead of kmalloc and memcpy

    Linus Torvalds
     
  • Pull more block updates from Jens Axboe:
    "Some later additions that weren't quite done for the first pull
    request, and also a few fixes that have arrived since.

    This contains:

    - Kill silly pktcdvd warning on attempting to register a non-scsi
    passthrough device (me)

    - Use symbolic constants for the block t10 protection types, and
    switch to handling it in core rather than in the drivers (Max)

    - libahci platform missing node put fix (Nishka)

    - Small series of fixes for BFQ (Paolo)

    - Fix possible nbd crash (Xiubo)"

    * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
    block: drop device references in bsg_queue_rq()
    block: t10-pi: fix -Wswitch warning
    pktcdvd: remove warning on attempting to register non-passthrough dev
    ata: libahci_platform: Add of_node_put() before loop exit
    nbd: fix possible page fault for nbd disk
    nbd: rename the runtime flags as NBD_RT_ prefixed
    block, bfq: push up injection only after setting service time
    block, bfq: increase update frequency of inject limit
    block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
    block, bfq: update inject limit only after injection occurred
    block: centralize PI remapping logic to the block layer
    block: use symbolic constants for t10_pi type

    Linus Torvalds
     
  • Merge updates from Andrew Morton:

    - a few hot fixes

    - ocfs2 updates

    - almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
    cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
    sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
    oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
    zsmalloc)

    * emailed patches from Andrew Morton : (132 commits)
    mm/zsmalloc.c: fix a -Wunused-function warning
    zswap: do not map same object twice
    zswap: use movable memory if zpool support allocate movable memory
    zpool: add malloc_support_movable to zpool_driver
    shmem: fix obsolete comment in shmem_getpage_gfp()
    mm/madvise: reduce code duplication in error handling paths
    mm: mmap: increase sockets maximum memory size pgoff for 32bits
    mm/mmap.c: refine find_vma_prev() with rb_last()
    riscv: make mmap allocation top-down by default
    mips: use generic mmap top-down layout and brk randomization
    mips: replace arch specific way to determine 32bit task with generic version
    mips: adjust brk randomization offset to fit generic version
    mips: use STACK_TOP when computing mmap base address
    mips: properly account for stack randomization and stack guard gap
    arm: use generic mmap top-down layout and brk randomization
    arm: use STACK_TOP when computing mmap base address
    arm: properly account for stack randomization and stack guard gap
    arm64, mm: make randomization selected by generic topdown mmap layout
    arm64, mm: move generic mmap layout functions to mm
    arm64: consider stack randomization for mmap base only when necessary
    ...

    Linus Torvalds
     
  • set_zspage_inuse() was introduced in the commit 4f42047bbde0 ("zsmalloc:
    use accessor") but all the users of it were removed later by the commits,

    bdb0af7ca8f0 ("zsmalloc: factor page chain functionality out")
    3783689a1aa8 ("zsmalloc: introduce zspage structure")

    so the function can be safely removed now.

    Link: http://lkml.kernel.org/r/1568658408-19374-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Minchan Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • zswap_writeback_entry() maps a handle to read swpentry first, and
    then in the most common case it would map the same handle again.
    This is ok when zbud is the backend since its mapping callback is
    plain and simple, but it slows things down for z3fold.

    Since there's hardly a point in unmapping a handle _that_ fast as
    zswap_writeback_entry() does when it reads swpentry, the
    suggestion is to keep the handle mapped till the end.

    Link: http://lkml.kernel.org/r/20190916004640.b453167d3556c4093af4cf7d@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Cc: Shakeel Butt
    Cc: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • This is the third version, updated according to the comments from
    Sergey Senozhatsky (https://lkml.org/lkml/2019/5/29/73) and Shakeel Butt
    (https://lkml.org/lkml/2019/6/4/973).

    zswap compresses swap pages into a dynamically allocated RAM-based memory
    pool. The memory pool can be zbud, z3fold or zsmalloc. All of them
    allocate unmovable pages, which increases the number of unmovable page
    blocks and is bad for anti-fragmentation.

    zsmalloc supports page migration if movable pages are requested:
    handle = zs_malloc(zram->mem_pool, comp_len,
                       GFP_NOIO | __GFP_HIGHMEM |
                       __GFP_MOVABLE);

    And commit "zpool: Add malloc_support_movable to zpool_driver" added
    zpool_malloc_support_movable(), which checks malloc_support_movable to
    determine whether a zpool supports allocating movable memory.

    This commit lets zswap allocate blocks with gfp
    __GFP_HIGHMEM | __GFP_MOVABLE if the zpool supports movable memory.

    The following is a test log from a PC with 8G memory and 2G swap.

    Without this commit:
    ~# echo lz4 > /sys/module/zswap/parameters/compressor
    ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
    ~# echo 1 > /sys/module/zswap/parameters/enabled
    ~# swapon /swapfile
    ~# cd /home/teawater/kernel/vm-scalability/
    /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
    /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
    2717908992 bytes / 4826062 usecs = 549973 KB/s
    2717908992 bytes / 4864201 usecs = 545661 KB/s
    2717908992 bytes / 4867015 usecs = 545346 KB/s
    2717908992 bytes / 4915485 usecs = 539968 KB/s
    397853 usecs to free memory
    357820 usecs to free memory
    421333 usecs to free memory
    420454 usecs to free memory
    /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
    Page block order: 9
    Pages per block: 512

    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 1 1 0 2 1 1 0 1 0 0
    Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Unmovable 6 5 8 6 6 5 4 1 1 1 0
    Node 0, zone DMA32, type Movable 25 20 20 19 22 15 14 11 11 5 767
    Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 4753 5588 5159 4613 3712 2520 1448 594 188 11 0
    Node 0, zone Normal, type Movable 16 3 457 2648 2143 1435 860 459 223 224 296
    Node 0, zone Normal, type Reclaimable 0 0 44 38 11 2 0 0 0 0 0
    Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate
    Node 0, zone DMA 1 7 0 0 0 0
    Node 0, zone DMA32 4 1652 0 0 0 0
    Node 0, zone Normal 931 1485 15 0 0 0

    With this commit:
    ~# echo lz4 > /sys/module/zswap/parameters/compressor
    ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
    ~# echo 1 > /sys/module/zswap/parameters/enabled
    ~# swapon /swapfile
    ~# cd /home/teawater/kernel/vm-scalability/
    /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
    /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
    2717908992 bytes / 4689240 usecs = 566020 KB/s
    2717908992 bytes / 4760605 usecs = 557535 KB/s
    2717908992 bytes / 4803621 usecs = 552543 KB/s
    2717908992 bytes / 5069828 usecs = 523530 KB/s
    431546 usecs to free memory
    383397 usecs to free memory
    456454 usecs to free memory
    224487 usecs to free memory
    /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
    Page block order: 9
    Pages per block: 512

    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 1 1 0 2 1 1 0 1 0 0
    Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Unmovable 10 8 10 9 10 4 3 2 3 0 0
    Node 0, zone DMA32, type Movable 18 12 14 16 16 11 9 5 5 6 775
    Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 2669 1236 452 118 37 14 4 1 2 3 0
    Node 0, zone Normal, type Movable 3850 6086 5274 4327 3510 2494 1520 934 438 220 470
    Node 0, zone Normal, type Reclaimable 56 93 155 124 47 31 17 7 3 0 0
    Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate
    Node 0, zone DMA 1 7 0 0 0 0
    Node 0, zone DMA32 4 1650 2 0 0 0
    Node 0, zone Normal 79 2326 26 0 0 0

    You can see that the number of unmovable page blocks decreases
    with this commit.

    Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.com
    Signed-off-by: Hui Zhu
    Reviewed-by: Shakeel Butt
    Cc: Dan Streetman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • As a zpool_driver, zsmalloc can allocate movable memory because it
    supports migrating pages, but zbud and z3fold cannot.

    Add malloc_support_movable to zpool_driver and set it to true for any
    zpool_driver that supports allocating movable memory. Also add
    zpool_malloc_support_movable(), which checks malloc_support_movable to
    determine whether a zpool supports allocating movable memory.
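
    The shape of the added interface is simple: one boolean per driver plus
    a query helper. A hedged sketch showing only the relevant fields (the
    real struct also carries create/destroy/malloc/free ops):

```c
#include <assert.h>
#include <stdbool.h>

struct zpool_driver {
	const char *type;
	bool malloc_support_movable;
	/* create/destroy/malloc/free ops elided */
};

struct zpool {
	struct zpool_driver *driver;
};

/* Callers ask this before adding __GFP_MOVABLE to their gfp mask. */
static bool zpool_malloc_support_movable(struct zpool *pool)
{
	return pool->driver->malloc_support_movable;
}
```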

    Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.com
    Signed-off-by: Hui Zhu
    Reviewed-by: Shakeel Butt
    Cc: Dan Streetman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • Replace "fault_mm" with "vmf" in a code comment because commit cfda05267f7b
    ("userfaultfd: shmem: add userfaultfd hook for shared memory faults")
    changed the prototype of shmem_getpage_gfp() - it now takes vmf instead
    of fault_mm.

    Before:
    static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
                                 struct page **pagep, enum sgp_type sgp,
                                 gfp_t gfp, struct mm_struct *fault_mm,
                                 int *fault_type);

    After:
    static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
                                 struct page **pagep, enum sgp_type sgp,
                                 gfp_t gfp, struct vm_area_struct *vma,
                                 struct vm_fault *vmf, vm_fault_t *fault_type);

    Link: http://lkml.kernel.org/r/20190816100204.9781-1-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
  • madvise_behavior() converts -ENOMEM to -EAGAIN in several places using
    identical code.

    Move that code to a common error handling path.

    No functional changes.
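
    The shape of such a cleanup: each failure site jumps to one label where
    the -ENOMEM to -EAGAIN conversion is written once. A simplified
    skeleton, not the actual madvise_behavior() body; the helper names are
    illustrative:

```c
#include <assert.h>
#include <errno.h>

static int do_step(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Every failure site falls through to one common label; the conversion
 * that used to be duplicated at each site now appears exactly once. */
static int behavior_example(int fail_first, int fail_second)
{
	int error;

	error = do_step(fail_first);
	if (error)
		goto out;
	error = do_step(fail_second);
out:
	/* internal paths report -ENOMEM, but madvise(2) callers are
	 * told to retry */
	if (error == -ENOMEM)
		error = -EAGAIN;
	return error;
}
```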

    Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pankaj Gupta
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The AF_XDP sockets umem mapping interface uses XDP_UMEM_PGOFF_FILL_RING
    and XDP_UMEM_PGOFF_COMPLETION_RING offsets. These offsets are
    established already and are part of the configuration interface.

    But for 32-bit systems using the AF_XDP socket configuration, these
    values are too large to pass the maximum allowed file size verification.
    The offsets could be made smaller, but instead of changing the existing
    interface, let's extend the maximum allowed file size for sockets.

    No one has been using this on 32-bit systems before this patch, since
    without this fix af_xdp sockets can't be used at all; it therefore
    unblocks af_xdp socket usage for 32-bit systems.

    The full list of socket mmap callbacks was checked for side effects;
    all of them are the dummy sock_no_mmap() at this moment, except the
    following:

    xsk_mmap() - what this fix is needed for.
    tcp_mmap() - doesn't have obvious issues with pgoff; no references to it.
    packet_mmap() - returns -EINVAL if pgoff is set at all.

    Link: http://lkml.kernel.org/r/20190812124326.32146-1-ivan.khoronzhuk@linaro.org
    Signed-off-by: Ivan Khoronzhuk
    Reviewed-by: Andrew Morton
    Cc: Björn Töpel
    Cc: Alexei Starovoitov
    Cc: Magnus Karlsson
    Cc: Daniel Borkmann
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ivan Khoronzhuk
     
  • When addr is out of range of the whole rb_tree, pprev will point to the
    right-most node. The rb_tree facility already provides a helper function,
    rb_last(), to do this task. We can leverage this instead of
    reimplementing it.

    This patch refines find_vma_prev() with rb_last() to make it a little
    nicer to read.
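
    rb_last() just follows right children down to the maximum node. Modeled
    here on a plain BST (tree_last is a user-space stand-in, not the kernel
    helper itself):

```c
#include <assert.h>
#include <stddef.h>

struct tnode {
	struct tnode *left, *right;
	int key;
};

/* Equivalent of the kernel's rb_last(): descend rightward until there is
 * no right child; that node holds the maximum key, i.e. the right-most
 * node find_vma_prev() wants when addr is beyond every vma. */
static struct tnode *tree_last(struct tnode *root)
{
	if (!root)
		return NULL;
	while (root->right)
		root = root->right;
	return root;
}
```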

    [akpm@linux-foundation.org: little cleanup, per Vlastimil]
    Link: http://lkml.kernel.org/r/20190809001928.4950-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In order to avoid wasting user address space by using bottom-up mmap
    allocation scheme, prefer top-down scheme when possible.

    Before:
    root@qemuriscv64:~# cat /proc/self/maps
    00010000-00016000 r-xp 00000000 fe:00 6389 /bin/cat.coreutils
    00016000-00017000 r--p 00005000 fe:00 6389 /bin/cat.coreutils
    00017000-00018000 rw-p 00006000 fe:00 6389 /bin/cat.coreutils
    00018000-00039000 rw-p 00000000 00:00 0 [heap]
    1555556000-155556d000 r-xp 00000000 fe:00 7193 /lib/ld-2.28.so
    155556d000-155556e000 r--p 00016000 fe:00 7193 /lib/ld-2.28.so
    155556e000-155556f000 rw-p 00017000 fe:00 7193 /lib/ld-2.28.so
    155556f000-1555570000 rw-p 00000000 00:00 0
    1555570000-1555572000 r-xp 00000000 00:00 0 [vdso]
    1555574000-1555576000 rw-p 00000000 00:00 0
    1555576000-1555674000 r-xp 00000000 fe:00 7187 /lib/libc-2.28.so
    1555674000-1555678000 r--p 000fd000 fe:00 7187 /lib/libc-2.28.so
    1555678000-155567a000 rw-p 00101000 fe:00 7187 /lib/libc-2.28.so
    155567a000-15556a0000 rw-p 00000000 00:00 0
    3fffb90000-3fffbb1000 rw-p 00000000 00:00 0 [stack]

    After:
    root@qemuriscv64:~# cat /proc/self/maps
    00010000-00016000 r-xp 00000000 fe:00 6389 /bin/cat.coreutils
    00016000-00017000 r--p 00005000 fe:00 6389 /bin/cat.coreutils
    00017000-00018000 rw-p 00006000 fe:00 6389 /bin/cat.coreutils
    2de81000-2dea2000 rw-p 00000000 00:00 0 [heap]
    3ff7eb6000-3ff7ed8000 rw-p 00000000 00:00 0
    3ff7ed8000-3ff7fd6000 r-xp 00000000 fe:00 7187 /lib/libc-2.28.so
    3ff7fd6000-3ff7fda000 r--p 000fd000 fe:00 7187 /lib/libc-2.28.so
    3ff7fda000-3ff7fdc000 rw-p 00101000 fe:00 7187 /lib/libc-2.28.so
    3ff7fdc000-3ff7fe2000 rw-p 00000000 00:00 0
    3ff7fe4000-3ff7fe6000 r-xp 00000000 00:00 0 [vdso]
    3ff7fe6000-3ff7ffd000 r-xp 00000000 fe:00 7193 /lib/ld-2.28.so
    3ff7ffd000-3ff7ffe000 r--p 00016000 fe:00 7193 /lib/ld-2.28.so
    3ff7ffe000-3ff7fff000 rw-p 00017000 fe:00 7193 /lib/ld-2.28.so
    3ff7fff000-3ff8000000 rw-p 00000000 00:00 0
    3fff888000-3fff8a9000 rw-p 00000000 00:00 0 [stack]

    [alex@ghiti.fr: v6]
    Link: http://lkml.kernel.org/r/20190808061756.19712-15-alex@ghiti.fr
    Link: http://lkml.kernel.org/r/20190730055113.23635-15-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Acked-by: Paul Walmsley [arch/riscv]
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • mips uses a top-down layout by default that exactly fits the generic
    functions, so get rid of arch specific code and use the generic version by
    selecting ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT.

    As ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT selects ARCH_HAS_ELF_RANDOMIZE,
    use the generic version of arch_randomize_brk since it also fits. Note
    that this commit also removes the possibility for mips to have elf
    randomization and no MMU: without MMU, the security added by randomization
    is worth nothing.

    Link: http://lkml.kernel.org/r/20190730055113.23635-14-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Paul Burton
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Mips uses TASK_IS_32BIT_ADDR to determine if a task is 32-bit, but this
    define is mips-specific and other arches do not have it: use the
    !IS_ENABLED(CONFIG_64BIT) || is_compat_task() condition instead.
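
    The replacement condition reads: a task has a 32-bit address space if
    the kernel itself is 32-bit, or if it is a compat task on a 64-bit
    kernel. Modeled with plain booleans standing in for
    IS_ENABLED(CONFIG_64BIT) and is_compat_task():

```c
#include <assert.h>
#include <stdbool.h>

/* config_64bit stands in for IS_ENABLED(CONFIG_64BIT); compat_task
 * stands in for is_compat_task() on a 64-bit kernel. */
static bool task_is_32bit(bool config_64bit, bool compat_task)
{
	return !config_64bit || compat_task;
}
```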

    Link: http://lkml.kernel.org/r/20190730055113.23635-13-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Paul Burton
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit simply bumps the random offset of brk up to 32MB for 32-bit
    and 1GB for 64-bit, compared to 8MB and 256MB respectively.

    Link: http://lkml.kernel.org/r/20190730055113.23635-12-alex@ghiti.fr
    Suggested-by: Kees Cook
    Signed-off-by: Alexandre Ghiti
    Acked-by: Paul Burton
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • The mmap base address must be computed with respect to the stack top
    address; using TASK_SIZE is wrong since STACK_TOP and TASK_SIZE are not
    equivalent.

    Link: http://lkml.kernel.org/r/20190730055113.23635-11-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Acked-by: Paul Burton
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit takes care of stack randomization and stack guard gap when
    computing mmap base address and checks if the task asked for
    randomization. This fixes the problem uncovered and not fixed for arm
    here: https://lkml.kernel.org/r/20170622200033.25714-1-riel@redhat.com

    Link: http://lkml.kernel.org/r/20190730055113.23635-10-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Acked-by: Paul Burton
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • arm uses a top-down mmap layout by default that exactly fits the
    generic functions, so get rid of the arch-specific code and use the
    generic version by selecting ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT.

    As ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT selects ARCH_HAS_ELF_RANDOMIZE,
    use the generic version of arch_randomize_brk, since it also fits.
    Note that this commit also removes the possibility for arm to have ELF
    randomization without an MMU: without an MMU, the security added by
    randomization is worthless.

    Note that it is safe to remove STACK_RND_MASK since it matches the default
    value.

    Link: http://lkml.kernel.org/r/20190730055113.23635-9-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • The mmap base address must be computed with respect to the stack top
    address; using TASK_SIZE is wrong, since STACK_TOP and TASK_SIZE are
    not equivalent.

    Link: http://lkml.kernel.org/r/20190730055113.23635-8-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit takes the stack randomization and the stack guard gap into
    account when computing the mmap base address, and checks whether the
    task asked for randomization. This fixes the problem uncovered, but
    not fixed, for arm here:
    https://lkml.kernel.org/r/20170622200033.25714-1-riel@redhat.com

    Link: http://lkml.kernel.org/r/20190730055113.23635-7-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit selects ARCH_HAS_ELF_RANDOMIZE when an arch uses the
    generic top-down mmap layout functions, so that this security feature
    is on by default.

    Note that this commit also removes the possibility for arm64 to have
    ELF randomization without an MMU: without an MMU, the security added
    by randomization is worthless.

    Link: http://lkml.kernel.org/r/20190730055113.23635-6-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • arm64 handles the top-down mmap layout in a way that can easily be
    reused by other architectures, so make it available in mm. This patch
    then introduces a new config option,
    ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT, that other architectures can
    select to benefit from those functions. Note that this new config
    option depends on MMU being enabled; if it is selected without MMU
    support, a build warning is issued.

    Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Suggested-by: Christoph Hellwig
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Do not offset the mmap base address for stack randomization if the
    current task does not want randomization. Note that x86 already
    implements this behaviour.

    Link: http://lkml.kernel.org/r/20190730055113.23635-4-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Each architecture has its own way to determine whether a task is a
    compat task. Using is_compat_task() in arch_mmap_rnd() makes the
    function more generic and prepares for moving it to mm/.
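
    A generic arch_mmap_rnd() along these lines can be modeled as below.
    The bit counts are illustrative stand-ins for CONFIG_ARCH_MMAP_RND_BITS
    and its compat variant, and the compat flag parameter stands in for
    the kernel's is_compat_task() check; this is a sketch of the shape of
    the function, not the kernel code.

```c
#include <stdbool.h>

#define PAGE_SHIFT           12
#define MMAP_RND_BITS        28  /* stand-in for CONFIG_ARCH_MMAP_RND_BITS */
#define MMAP_RND_COMPAT_BITS 8   /* stand-in for the compat variant */

/* Mask the raw random value to the per-ABI number of bits, then shift
 * the result into page units. */
static unsigned long arch_mmap_rnd_model(unsigned long random, bool compat)
{
    unsigned long bits = compat ? MMAP_RND_COMPAT_BITS : MMAP_RND_BITS;

    return (random & ((1UL << bits) - 1)) << PAGE_SHIFT;
}
```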

    Link: http://lkml.kernel.org/r/20190730055113.23635-3-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Patch series "Provide generic top-down mmap layout functions", v6.

    This series introduces generic functions to make the top-down mmap
    layout easily accessible to architectures, in particular riscv, which
    was the initial goal of this series. The generic implementation was
    taken from arm64 and adopted successively by arm, mips and finally
    riscv.

    Note that in addition the series fixes two issues:

    - stack randomization was taken into account even when unnecessary;

    - [1] fixed an issue where the mmap base did not take randomization
      into account, but the fix was never propagated to arm and mips; by
      moving the arm64 code into a generic library, this problem is now
      fixed for both architectures.

    This work is an effort to factorize architecture functions to avoid code
    duplication and oversights as in [1].

    [1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html

    This patch (of 14):

    This preparatory commit moves this function so that the further
    introduction of the generic top-down mmap layout is contained entirely
    in mm/util.c.

    Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • After all uprobes are removed from the huge page (with PTE pgtable), it is
    possible to collapse the pmd and benefit from THP again. This patch does
    the collapse by calling collapse_pte_mapped_thp().

    Link: http://lkml.kernel.org/r/20190815164525.1848545-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reported-by: kbuild test robot
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • khugepaged needs the exclusive mmap_sem to access the page table.
    When it fails to lock mmap_sem, the page will fault in as a pte-mapped
    THP. As the page is already a THP, khugepaged will not handle this pmd
    again.

    This patch enables khugepaged to retry collapsing the page table.

    struct mm_slot (in khugepaged.c) is extended with an array, containing
    addresses of pte-mapped THPs. We use array here for simplicity. We can
    easily replace it with more advanced data structures when needed.

    In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to
    collapse the page table.

    Since the collapse may happen at a later time, some pages may already
    have faulted in. collapse_pte_mapped_thp() is added to properly handle
    these pages. collapse_pte_mapped_thp() also double-checks whether all
    PTEs in this pmd map to the same THP. This is necessary because some
    subpage of the THP may be replaced, for example by a uprobe. In such
    cases, it is not possible to collapse the pmd.
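
    The bookkeeping described above (an address array on struct mm_slot)
    can be modeled in userspace C. The struct layout and the
    MAX_PTE_MAPPED_THP value below are illustrative assumptions, not the
    exact khugepaged definitions.

```c
#include <stdbool.h>

#define MAX_PTE_MAPPED_THP 8

/* Userspace model of the mm_slot extension: a small array holding the
 * addresses of pte-mapped THPs waiting to be collapsed. */
struct mm_slot_model {
    int nr_pte_mapped_thp;
    unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
};

/* Queue an address for a later collapse attempt; duplicates are ignored
 * and the request is dropped when the array is full. */
static bool record_pte_mapped_thp(struct mm_slot_model *s, unsigned long addr)
{
    for (int i = 0; i < s->nr_pte_mapped_thp; i++)
        if (s->pte_mapped_thp[i] == addr)
            return true;  /* already queued */
    if (s->nr_pte_mapped_thp >= MAX_PTE_MAPPED_THP)
        return false;     /* array full, drop the request */
    s->pte_mapped_thp[s->nr_pte_mapped_thp++] = addr;
    return true;
}
```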

    [kirill.shutemov@linux.intel.com: add comments for retract_page_tables()]
    Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box
    Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com
    Signed-off-by: Song Liu
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Suggested-by: Johannes Weiner
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Use the newly added FOLL_SPLIT_PMD in uprobe. This preserves the huge
    page when the uprobe is enabled. When the uprobe is disabled, newer
    instances of the same application can still benefit from the huge
    page.

    As a next step, we will enable khugepaged to regroup the pmd, so that
    existing instances of the application can also benefit from the huge
    page after the uprobe is disabled.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-5-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Srikar Dronamraju
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Introduce a new foll_flag: FOLL_SPLIT_PMD. As the name says,
    FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct, while the
    underlying huge page stays as-is.

    FOLL_SPLIT_PMD is useful for cases where we need to use regular pages,
    but would like to switch back to the huge page and huge pmd later.
    One such example is uprobe. The following patches use FOLL_SPLIT_PMD
    in uprobe.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-4-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reviewed-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Currently, uprobe swaps the target page with an anonymous page in
    both install_breakpoint() and remove_breakpoint(). When all uprobes on
    a page are removed, the given mm still uses an anonymous page (not the
    original page).

    This patch allows uprobe to use original page when possible (all uprobes
    on the page are already removed, and the original page is in page cache
    and uptodate).

    As suggested by Oleg, we unmap the old_page and let the original page
    fault in.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-3-songliubraving@fb.com
    Signed-off-by: Song Liu
    Suggested-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Patch series "THP aware uprobe", v13.

    This patchset makes uprobe aware of THPs.

    Currently, when a uprobe is attached to text on a THP, the page is
    split by FOLL_SPLIT. As a result, uprobe eliminates the performance
    benefit of THP.

    This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduce
    FOLL_SPLIT_PMD, which only splits the PMD for uprobe.

    After all uprobes within the THP are removed, the PTE-mapped pages are
    regrouped into a huge PMD.

    This set (plus a few THP patches) is also available at

    https://github.com/liu-song-6/linux/tree/uprobe-thp

    This patch (of 6):

    Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
    can use them in other files.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Currently the THP deferred split shrinker is not memcg aware; this
    may cause premature OOM with some configurations. For example, the
    below test easily runs into premature OOM:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

    transhuge-stress comes from the kernel selftests.

    It is easy to hit OOM, but there are still a lot of THPs on the
    deferred split queue; memcg direct reclaim can't touch them since the
    deferred split shrinker is not memcg aware.

    Make the deferred split shrinker memcg aware by introducing a
    per-memcg deferred split queue. A THP goes on either the per-node or
    the per-memcg deferred split queue, depending on whether it belongs to
    a memcg. When a page is migrated to another memcg, it is moved to the
    target memcg's deferred split queue too.

    Reuse the second tail page's deferred_list for the per-memcg list,
    since the same THP can't be on multiple deferred split queues.
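
    The queue selection rule can be sketched as a userspace model; the
    struct names below are illustrative stand-ins for the kernel's
    deferred split queues hanging off mem_cgroup and pglist_data, not the
    actual kernel types.

```c
#include <stddef.h>

/* Userspace model: each node and each memcg owns a deferred split queue. */
struct split_queue_model { int len; };

struct memcg_model { struct split_queue_model deferred_split_queue; };
struct node_model  { struct split_queue_model deferred_split_queue; };

/* A THP charged to a memcg goes on that memcg's queue; an uncharged THP
 * falls back to its node's queue. */
static struct split_queue_model *
get_deferred_split_queue_model(struct memcg_model *memcg,
                               struct node_model *node)
{
    if (memcg)
        return &memcg->deferred_split_queue;
    return &node->deferred_split_queue;
}
```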

    [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
    Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi