26 Sep, 2019

6 commits

  • Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
    for the case where the augmented value is a scalar whose definition
    follows a max(f(node)) pattern. This actually covers all present uses of
    RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
    various RBCOMPUTE function definitions.
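
    The max(f(node)) pattern the macro captures can be sketched outside the
    kernel: each node caches the maximum of f() over its subtree, and the
    RBCOMPUTE callback the macro generates recomputes one node's cache from
    its own value and its children's caches. A minimal user-space model
    (struct anode and anode_compute_max are illustrative names, not kernel
    API):

```c
#include <assert.h>
#include <stddef.h>

/* Each node caches the maximum of f() over its subtree; here f(node) is
 * simply node->value. */
struct anode {
	struct anode *left, *right;
	long value;        /* f(node) */
	long subtree_max;  /* cached max of f() over this subtree */
};

/* What the generated RBCOMPUTE callback does for one node: recompute the
 * cache from the node's own value and the children's (already correct)
 * caches. Rebalancing invokes this bottom-up along the affected path. */
static void anode_compute_max(struct anode *n)
{
	long m = n->value;

	if (n->left && n->left->subtree_max > m)
		m = n->left->subtree_max;
	if (n->right && n->right->subtree_max > m)
		m = n->right->subtree_max;
	n->subtree_max = m;
}
```

    In the kernel, RB_DECLARE_CALLBACKS_MAX stamps out the
    propagate/copy/rotate callbacks around exactly this per-node
    computation.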

    [walken@google.com: fix mm/vmalloc.c]
    Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
    [walken@google.com: re-add check to check_augmented()]
    Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
    Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
    Signed-off-by: Michel Lespinasse
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "make RB_DECLARE_CALLBACKS more generic", v3.

    These changes are intended to make the RB_DECLARE_CALLBACKS macro more
    generic (allowing the augmented subtree information to be a struct instead
    of a scalar).

    I have verified the compiled lib/interval_tree.o and mm/mmap.o files to
    check that they didn't change. This held as expected for interval_tree.o;
    mmap.o did have some changes which could be reverted by marking
    __vma_link_rb as noinline. I did not add such a change to the patchset; I
    felt it was reasonable enough to leave the inlining decision up to the
    compiler.

    This patch (of 3):

    Add a short comment summarizing the arguments to RB_DECLARE_CALLBACKS.
    The arguments are also now capitalized. This copies the style of the
    INTERVAL_TREE_DEFINE macro.

    No functional changes in this commit, only comments and capitalization.

    Link: http://lkml.kernel.org/r/20190703040156.56953-2-walken@google.com
    Signed-off-by: Michel Lespinasse
    Acked-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Howells
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • As was already noted in rbtree.h, the logic to cache rb_first (or
    rb_last) can easily be implemented externally to the core rbtree api.
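
    The idea of external caching can be sketched with a plain BST: keep a
    leftmost pointer next to the root and maintain it on insert, so first()
    becomes O(1). This is a hedged user-space model of what rb_root_cached
    does, not the kernel code; names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

struct tnode {
	struct tnode *left, *right;
	int key;
};

struct cached_root {
	struct tnode *root;
	struct tnode *leftmost;	/* cached rb_first() equivalent */
};

static void cached_insert(struct cached_root *t, struct tnode *n)
{
	struct tnode **p = &t->root;
	int is_leftmost = 1;

	n->left = n->right = NULL;
	while (*p) {
		if (n->key < (*p)->key) {
			p = &(*p)->left;
		} else {
			p = &(*p)->right;
			is_leftmost = 0;	/* went right at least once */
		}
	}
	*p = n;
	/* Only a node that descended purely leftward can be the new
	 * minimum, so the cache update costs nothing extra. */
	if (is_leftmost)
		t->leftmost = n;
}
```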

    This commit takes the changes applied to the include/linux/ and lib/
    rbtree files in 9f973cb38088 ("lib/rbtree: avoid generating code twice
    for the cached versions"), and applies these to the
    tools/include/linux/ and tools/lib/ files as well to keep them
    synchronized.

    Link: http://lkml.kernel.org/r/20190703034812.53002-1-walken@google.com
    Signed-off-by: Michel Lespinasse
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When building with W=1, gcc properly complains that there are no prototypes:

    CC kernel/elfcore.o
    kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
    7 | Elf_Half __weak elf_core_extra_phdrs(void)
    | ^~~~~~~~~~~~~~~~~~~~
    kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
    12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~
    kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
    17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
    | ^~~~~~~~~~~~~~~~~~~~~~~~~
    kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
    22 | size_t __weak elf_core_extra_data_size(void)
    | ^~~~~~~~~~~~~~~~~~~~~~~~

    Provide the include file so that gcc is happy and we avoid potential code drift.
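
    The warning and the fix follow the usual C pattern: -Wmissing-prototypes
    fires when an external-linkage function is defined with no prior
    prototype, and a shared header silences it while keeping callers and
    definition in sync. Condensed to one file for illustration (in-tree the
    prototype lives in a header, and the real function is __weak):

```c
#include <assert.h>
#include <stddef.h>

/* What would live in the shared header: the prototype that every caller
 * and the definition below both see. */
size_t elf_core_extra_data_size(void);

/* The definition now has a previous prototype, so gcc -Wmissing-prototypes
 * stays quiet, and any future drift between the header's declaration and
 * this signature becomes a hard compile error instead of silent skew. */
size_t elf_core_extra_data_size(void)
{
	return 0;
}
```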

    Link: http://lkml.kernel.org/r/29875.1565224705@turing-police
    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valdis Kletnieks
     
  • Add a header include guard just in case.

    My motivation is to allow Kbuild to detect missing include guard:

    https://patchwork.kernel.org/patch/11063011/

    Before I enable this checker I want to fix as many headers as possible.
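
    The guard being added is the standard pattern: the first preprocessor
    directive is an #ifndef of a macro defined immediately after, so
    including the header twice is harmless. A minimal example (the guard
    and function names are illustrative):

```c
/* example.h - the include guard makes repeated #include a no-op */
#ifndef _LINUX_EXAMPLE_H
#define _LINUX_EXAMPLE_H

static inline int example_answer(void)
{
	return 42;
}

#endif /* _LINUX_EXAMPLE_H */
```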

    Link: http://lkml.kernel.org/r/20190728154728.11126-1-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Thomas has noticed the following NULL ptr dereference when using cgroup
    v1 kmem limit:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    PGD 0
    P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
    Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
    RIP: 0010:create_empty_buffers+0x24/0x100
    Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
    RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
    RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
    RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
    R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
    R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
    FS: 00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
    Call Trace:
    create_page_buffers+0x4d/0x60
    __block_write_begin_int+0x8e/0x5a0
    ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
    ? jbd2__journal_start+0xd7/0x1f0
    ext4_da_write_begin+0x112/0x3d0
    generic_perform_write+0xf1/0x1b0
    ? file_update_time+0x70/0x140
    __generic_file_write_iter+0x141/0x1a0
    ext4_file_write_iter+0xef/0x3b0
    __vfs_write+0x17e/0x1e0
    vfs_write+0xa5/0x1a0
    ksys_write+0x57/0xd0
    do_syscall_64+0x55/0x160
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Tetsuo then noticed that this is because __memcg_kmem_charge_memcg
    fails a __GFP_NOFAIL charge when the kmem limit is reached. This is
    wrong behavior because nofail allocations are not allowed to fail. The
    normal charge path simply forces the charge even if that means crossing
    the limit. Kmem accounting should do the same.
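
    The rule the fix enforces can be modeled in a few lines: a charge that
    would exceed the limit normally fails, but a nofail request must be
    forced over the limit instead. This is a simplified user-space sketch
    of that rule, not the memcg code; all names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define EX_NOFAIL 0x1u	/* stand-in for __GFP_NOFAIL */

struct counter {
	long usage, limit;
};

static bool try_charge(struct counter *c, long n)
{
	if (c->usage + n > c->limit)
		return false;
	c->usage += n;
	return true;
}

/* A nofail request must never see failure: when the limit is hit, force
 * the charge over the limit, as the regular charge path already does. */
static int kmem_charge(struct counter *c, long n, unsigned int gfp)
{
	if (try_charge(c, n))
		return 0;
	if (gfp & EX_NOFAIL) {
		c->usage += n;	/* cross the limit rather than fail */
		return 0;
	}
	return -1;
}
```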

    Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Thomas Lindroth
    Debugged-by: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Thomas Lindroth
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Sep, 2019

34 commits

  • Pull i2c updates from Wolfram Sang:

    - new driver for ICY, an Amiga Zorro card :)

    - axxia driver gained slave mode support, NXP driver gained ACPI

    - the slave EEPROM backend gained 16 bit address support

    - and lots of regular driver updates and reworks

    * 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits)
    i2c: tegra: Move suspend handling to NOIRQ phase
    i2c: imx: ACPI support for NXP i2c controller
    i2c: uniphier(-f): remove all dev_dbg()
    i2c: uniphier(-f): use devm_platform_ioremap_resource()
    i2c: slave-eeprom: Add comment about address handling
    i2c: exynos5: Remove IRQF_ONESHOT
    i2c: stm32f7: Make structure stm32f7_i2c_algo constant
    i2c: cht-wc: drop check because i2c_unregister_device() is NULL safe
    i2c-eeprom_slave: Add support for more eeprom models
    i2c: fsi: Add of_put_node() before break
    i2c: synquacer: Make synquacer_i2c_ops constant
    i2c: hix5hd2: Remove IRQF_ONESHOT
    i2c: i801: Use iTCO version 6 in Cannon Lake PCH and beyond
    watchdog: iTCO: Add support for Cannon Lake PCH iTCO
    i2c: iproc: Make bcm_iproc_i2c_quirks constant
    i2c: iproc: Add full name of devicetree node to adapter name
    i2c: piix4: Add ACPI support
    i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h
    i2c: ocores: use request_any_context_irq() to register IRQ handler
    i2c: designware: Fix optional reset error handling
    ...

    Linus Torvalds
     
  • Pull sound fixes from Takashi Iwai:
    "A few small remaining wrap-up for this merge window.

    Most of patches are device-specific (HD-audio and USB-audio quirks,
    FireWire, pcm316a, fsl, rsnd, Atmel, and TI fixes), while there is a
    simple fix (actually two commits) for ASoC core"

    * tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ALSA: usb-audio: Add DSD support for EVGA NU Audio
    ALSA: hda - Add laptop imic fixup for ASUS M9V laptop
    ASoC: ti: fix SND_SOC_DM365_VOICE_CODEC dependencies
    ASoC: pcm3168a: The codec does not support S32_LE
    ASoC: core: use list_del_init and move it back to soc_cleanup_component
    ALSA: hda/realtek - PCI quirk for Medion E4254
    ALSA: hda - Apply AMD controller workaround for Raven platform
    ASoC: rsnd: do error check after rsnd_channel_normalization()
    ASoC: atmel_ssc_dai: Remove wrong spinlock usage
    ASoC: core: delete component->card_list in soc_remove_component only
    ASoC: fsl_sai: Fix noise when using EDMA
    ALSA: usb-audio: Add Hiby device family to quirks for native DSD support
    ALSA: hda/realtek - Fix alienware headset mic
    ALSA: dice: fix wrong packet parameter for Alesis iO26

    Linus Torvalds
     
  • Pull more io_uring updates from Jens Axboe:
    "A collection of later fixes and additions, that weren't quite ready
    for pushing out with the initial pull request.

    This contains:

    - Fix potential use-after-free of shadow requests (Jackie)

    - Fix potential OOM crash in request allocation (Jackie)

    - kmalloc+memcpy -> kmemdup cleanup (Jackie)

    - Fix poll crash regression (me)

    - Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)

    - Add support for timeouts, making it easier to do epoll_wait()
    conversions, for instance (me)

    - Ensure io_uring works without f_ops->read_iter() and
    f_ops->write_iter() (me)"

    * tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
    io_uring: correctly handle non ->{read,write}_iter() file_operations
    io_uring: IORING_OP_TIMEOUT support
    io_uring: use cond_resched() in sqthread
    io_uring: fix potential crash issue due to io_get_req failure
    io_uring: ensure poll commands clear ->sqe
    io_uring: fix use-after-free of shadow_req
    io_uring: use kmemdup instead of kmalloc and memcpy

    Linus Torvalds
     
  • Pull more block updates from Jens Axboe:
    "Some later additions that weren't quite done for the first pull
    request, and also a few fixes that have arrived since.

    This contains:

    - Kill silly pktcdvd warning on attempting to register a non-scsi
    passthrough device (me)

    - Use symbolic constants for the block t10 protection types, and
    switch to handling it in core rather than in the drivers (Max)

    - libahci platform missing node put fix (Nishka)

    - Small series of fixes for BFQ (Paolo)

    - Fix possible nbd crash (Xiubo)"

    * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
    block: drop device references in bsg_queue_rq()
    block: t10-pi: fix -Wswitch warning
    pktcdvd: remove warning on attempting to register non-passthrough dev
    ata: libahci_platform: Add of_node_put() before loop exit
    nbd: fix possible page fault for nbd disk
    nbd: rename the runtime flags as NBD_RT_ prefixed
    block, bfq: push up injection only after setting service time
    block, bfq: increase update frequency of inject limit
    block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
    block, bfq: update inject limit only after injection occurred
    block: centralize PI remapping logic to the block layer
    block: use symbolic constants for t10_pi type

    Linus Torvalds
     
  • Merge updates from Andrew Morton:

    - a few hot fixes

    - ocfs2 updates

    - almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
    cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
    sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
    oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
    zsmalloc)

    * emailed patches from Andrew Morton : (132 commits)
    mm/zsmalloc.c: fix a -Wunused-function warning
    zswap: do not map same object twice
    zswap: use movable memory if zpool support allocate movable memory
    zpool: add malloc_support_movable to zpool_driver
    shmem: fix obsolete comment in shmem_getpage_gfp()
    mm/madvise: reduce code duplication in error handling paths
    mm: mmap: increase sockets maximum memory size pgoff for 32bits
    mm/mmap.c: refine find_vma_prev() with rb_last()
    riscv: make mmap allocation top-down by default
    mips: use generic mmap top-down layout and brk randomization
    mips: replace arch specific way to determine 32bit task with generic version
    mips: adjust brk randomization offset to fit generic version
    mips: use STACK_TOP when computing mmap base address
    mips: properly account for stack randomization and stack guard gap
    arm: use generic mmap top-down layout and brk randomization
    arm: use STACK_TOP when computing mmap base address
    arm: properly account for stack randomization and stack guard gap
    arm64, mm: make randomization selected by generic topdown mmap layout
    arm64, mm: move generic mmap layout functions to mm
    arm64: consider stack randomization for mmap base only when necessary
    ...

    Linus Torvalds
     
  • set_zspage_inuse() was introduced in the commit 4f42047bbde0 ("zsmalloc:
    use accessor") but all the users of it were removed later by the commits,

    bdb0af7ca8f0 ("zsmalloc: factor page chain functionality out")
    3783689a1aa8 ("zsmalloc: introduce zspage structure")

    so the function can be safely removed now.

    Link: http://lkml.kernel.org/r/1568658408-19374-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Minchan Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • zswap_writeback_entry() maps a handle to read swpentry first, and
    then in the most common case it would map the same handle again.
    This is ok when zbud is the backend since its mapping callback is
    plain and simple, but it slows things down for z3fold.

    Since there's hardly a point in unmapping a handle _that_ fast as
    zswap_writeback_entry() does when it reads swpentry, the
    suggestion is to keep the handle mapped till the end.

    Link: http://lkml.kernel.org/r/20190916004640.b453167d3556c4093af4cf7d@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Cc: Shakeel Butt
    Cc: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • This is the third version, updated according to the comments from
    Sergey Senozhatsky (https://lkml.org/lkml/2019/5/29/73) and Shakeel Butt
    (https://lkml.org/lkml/2019/6/4/973).

    zswap compresses swap pages into a dynamically allocated RAM-based memory
    pool. The memory pool can be zbud, z3fold or zsmalloc. All of them
    allocate unmovable pages, which increases the number of unmovable page
    blocks and is bad for anti-fragmentation.

    zsmalloc supports page migration if movable pages are requested:
    handle = zs_malloc(zram->mem_pool, comp_len,
                       GFP_NOIO | __GFP_HIGHMEM |
                       __GFP_MOVABLE);

    And commit "zpool: Add malloc_support_movable to zpool_driver" added
    zpool_malloc_support_movable(), which checks malloc_support_movable to
    determine whether a zpool supports allocating movable memory.

    This commit lets zswap allocate blocks with gfp
    __GFP_HIGHMEM | __GFP_MOVABLE if the zpool supports movable memory.

    The following is a test log from a PC with 8G memory and 2G swap.

    Without this commit:
    ~# echo lz4 > /sys/module/zswap/parameters/compressor
    ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
    ~# echo 1 > /sys/module/zswap/parameters/enabled
    ~# swapon /swapfile
    ~# cd /home/teawater/kernel/vm-scalability/
    /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
    /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
    2717908992 bytes / 4826062 usecs = 549973 KB/s
    2717908992 bytes / 4864201 usecs = 545661 KB/s
    2717908992 bytes / 4867015 usecs = 545346 KB/s
    2717908992 bytes / 4915485 usecs = 539968 KB/s
    397853 usecs to free memory
    357820 usecs to free memory
    421333 usecs to free memory
    420454 usecs to free memory
    /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
    Page block order: 9
    Pages per block: 512

    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 1 1 0 2 1 1 0 1 0 0
    Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Unmovable 6 5 8 6 6 5 4 1 1 1 0
    Node 0, zone DMA32, type Movable 25 20 20 19 22 15 14 11 11 5 767
    Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 4753 5588 5159 4613 3712 2520 1448 594 188 11 0
    Node 0, zone Normal, type Movable 16 3 457 2648 2143 1435 860 459 223 224 296
    Node 0, zone Normal, type Reclaimable 0 0 44 38 11 2 0 0 0 0 0
    Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate
    Node 0, zone DMA 1 7 0 0 0 0
    Node 0, zone DMA32 4 1652 0 0 0 0
    Node 0, zone Normal 931 1485 15 0 0 0

    With this commit:
    ~# echo lz4 > /sys/module/zswap/parameters/compressor
    ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
    ~# echo 1 > /sys/module/zswap/parameters/enabled
    ~# swapon /swapfile
    ~# cd /home/teawater/kernel/vm-scalability/
    /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
    /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
    2717908992 bytes / 4689240 usecs = 566020 KB/s
    2717908992 bytes / 4760605 usecs = 557535 KB/s
    2717908992 bytes / 4803621 usecs = 552543 KB/s
    2717908992 bytes / 5069828 usecs = 523530 KB/s
    431546 usecs to free memory
    383397 usecs to free memory
    456454 usecs to free memory
    224487 usecs to free memory
    /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
    Page block order: 9
    Pages per block: 512

    Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
    Node 0, zone DMA, type Unmovable 1 1 1 0 2 1 1 0 1 0 0
    Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
    Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Unmovable 10 8 10 9 10 4 3 2 3 0 0
    Node 0, zone DMA32, type Movable 18 12 14 16 16 11 9 5 5 6 775
    Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 1
    Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Unmovable 2669 1236 452 118 37 14 4 1 2 3 0
    Node 0, zone Normal, type Movable 3850 6086 5274 4327 3510 2494 1520 934 438 220 470
    Node 0, zone Normal, type Reclaimable 56 93 155 124 47 31 17 7 3 0 0
    Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
    Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0

    Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate
    Node 0, zone DMA 1 7 0 0 0 0
    Node 0, zone DMA32 4 1650 2 0 0 0
    Node 0, zone Normal 79 2326 26 0 0 0

    You can see that the number of unmovable page blocks decreases
    with this commit.

    Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.com
    Signed-off-by: Hui Zhu
    Reviewed-by: Shakeel Butt
    Cc: Dan Streetman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • As a zpool_driver, zsmalloc can allocate movable memory because it
    supports migrating pages, but zbud and z3fold cannot.

    Add malloc_support_movable to zpool_driver and set it to true for any
    zpool_driver that supports allocating movable memory. Also add
    zpool_malloc_support_movable(), which checks malloc_support_movable to
    determine whether a zpool supports allocating movable memory.
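
    The shape of the added interface is simple: one boolean per driver plus
    a query helper. A hedged sketch showing only the relevant fields (the
    real struct also carries create/destroy/malloc/free ops):

```c
#include <assert.h>
#include <stdbool.h>

struct zpool_driver {
	const char *type;
	bool malloc_support_movable;
	/* create/destroy/malloc/free ops elided */
};

struct zpool {
	struct zpool_driver *driver;
};

/* Callers ask this before adding __GFP_MOVABLE to their gfp mask. */
static bool zpool_malloc_support_movable(struct zpool *pool)
{
	return pool->driver->malloc_support_movable;
}
```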

    Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.com
    Signed-off-by: Hui Zhu
    Reviewed-by: Shakeel Butt
    Cc: Dan Streetman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • Replace "fault_mm" with "vmf" in a code comment because commit cfda05267f7b
    ("userfaultfd: shmem: add userfaultfd hook for shared memory faults")
    changed the prototype of shmem_getpage_gfp() - it now takes vmf instead
    of fault_mm.

    Before:
    static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
                                 struct page **pagep, enum sgp_type sgp,
                                 gfp_t gfp, struct mm_struct *fault_mm,
                                 int *fault_type);

    After:
    static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
                                 struct page **pagep, enum sgp_type sgp,
                                 gfp_t gfp, struct vm_area_struct *vma,
                                 struct vm_fault *vmf, vm_fault_t *fault_type);

    Link: http://lkml.kernel.org/r/20190816100204.9781-1-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
  • madvise_behavior() converts -ENOMEM to -EAGAIN in several places using
    identical code.

    Move that code to a common error handling path.

    No functional changes.
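
    The shape of such a cleanup: each failure site jumps to one label where
    the -ENOMEM to -EAGAIN conversion is written once. A simplified
    skeleton, not the actual madvise_behavior() body; the helper names are
    illustrative:

```c
#include <assert.h>
#include <errno.h>

static int do_step(int fail)
{
	return fail ? -ENOMEM : 0;
}

/* Every failure site falls through to one common label; the conversion
 * that used to be duplicated at each site now appears exactly once. */
static int behavior_example(int fail_first, int fail_second)
{
	int error;

	error = do_step(fail_first);
	if (error)
		goto out;
	error = do_step(fail_second);
out:
	/* internal paths report -ENOMEM, but madvise(2) callers are
	 * told to retry */
	if (error == -ENOMEM)
		error = -EAGAIN;
	return error;
}
```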

    Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pankaj Gupta
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The AF_XDP sockets umem mapping interface uses XDP_UMEM_PGOFF_FILL_RING
    and XDP_UMEM_PGOFF_COMPLETION_RING offsets. These offsets are
    established already and are part of the configuration interface.

    But for 32-bit systems using the AF_XDP socket configuration, these
    values are too large to pass the maximum allowed file size verification.
    The offsets could be made smaller, but instead of changing the existing
    interface, let's extend the maximum allowed file size for sockets.

    No one has been using this on 32-bit systems before this patch, since
    without this fix af_xdp sockets can't be used at all; it therefore
    unblocks af_xdp socket usage for 32-bit systems.

    The full list of socket mmap callbacks was checked for side effects;
    all of them are the dummy sock_no_mmap() at this moment, except the
    following:

    xsk_mmap() - what this fix is needed for.
    tcp_mmap() - doesn't have obvious issues with pgoff; no references to it.
    packet_mmap() - returns -EINVAL if pgoff is set at all.

    Link: http://lkml.kernel.org/r/20190812124326.32146-1-ivan.khoronzhuk@linaro.org
    Signed-off-by: Ivan Khoronzhuk
    Reviewed-by: Andrew Morton
    Cc: Björn Töpel
    Cc: Alexei Starovoitov
    Cc: Magnus Karlsson
    Cc: Daniel Borkmann
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ivan Khoronzhuk
     
  • When addr is out of range of the whole rb_tree, pprev will point to the
    right-most node. The rb_tree facility already provides a helper function,
    rb_last(), to do this task. We can leverage this instead of
    reimplementing it.

    This patch refines find_vma_prev() with rb_last() to make it a little
    nicer to read.
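
    rb_last() just follows right children down to the maximum node. Modeled
    here on a plain BST (tree_last is a user-space stand-in, not the kernel
    helper itself):

```c
#include <assert.h>
#include <stddef.h>

struct tnode {
	struct tnode *left, *right;
	int key;
};

/* Equivalent of the kernel's rb_last(): descend rightward until there is
 * no right child; that node holds the maximum key, i.e. the right-most
 * node find_vma_prev() wants when addr is beyond every vma. */
static struct tnode *tree_last(struct tnode *root)
{
	if (!root)
		return NULL;
	while (root->right)
		root = root->right;
	return root;
}
```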

    [akpm@linux-foundation.org: little cleanup, per Vlastimil]
    Link: http://lkml.kernel.org/r/20190809001928.4950-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • In order to avoid wasting user address space by using bottom-up mmap
    allocation scheme, prefer top-down scheme when possible.

    Before:
    root@qemuriscv64:~# cat /proc/self/maps
    00010000-00016000 r-xp 00000000 fe:00 6389 /bin/cat.coreutils
    00016000-00017000 r--p 00005000 fe:00 6389 /bin/cat.coreutils
    00017000-00018000 rw-p 00006000 fe:00 6389 /bin/cat.coreutils
    00018000-00039000 rw-p 00000000 00:00 0 [heap]
    1555556000-155556d000 r-xp 00000000 fe:00 7193 /lib/ld-2.28.so
    155556d000-155556e000 r--p 00016000 fe:00 7193 /lib/ld-2.28.so
    155556e000-155556f000 rw-p 00017000 fe:00 7193 /lib/ld-2.28.so
    155556f000-1555570000 rw-p 00000000 00:00 0
    1555570000-1555572000 r-xp 00000000 00:00 0 [vdso]
    1555574000-1555576000 rw-p 00000000 00:00 0
    1555576000-1555674000 r-xp 00000000 fe:00 7187 /lib/libc-2.28.so
    1555674000-1555678000 r--p 000fd000 fe:00 7187 /lib/libc-2.28.so
    1555678000-155567a000 rw-p 00101000 fe:00 7187 /lib/libc-2.28.so
    155567a000-15556a0000 rw-p 00000000 00:00 0
    3fffb90000-3fffbb1000 rw-p 00000000 00:00 0 [stack]

    After:
    root@qemuriscv64:~# cat /proc/self/maps
    00010000-00016000 r-xp 00000000 fe:00 6389 /bin/cat.coreutils
    00016000-00017000 r--p 00005000 fe:00 6389 /bin/cat.coreutils
    00017000-00018000 rw-p 00006000 fe:00 6389 /bin/cat.coreutils
    2de81000-2dea2000 rw-p 00000000 00:00 0 [heap]
    3ff7eb6000-3ff7ed8000 rw-p 00000000 00:00 0
    3ff7ed8000-3ff7fd6000 r-xp 00000000 fe:00 7187 /lib/libc-2.28.so
    3ff7fd6000-3ff7fda000 r--p 000fd000 fe:00 7187 /lib/libc-2.28.so
    3ff7fda000-3ff7fdc000 rw-p 00101000 fe:00 7187 /lib/libc-2.28.so
    3ff7fdc000-3ff7fe2000 rw-p 00000000 00:00 0
    3ff7fe4000-3ff7fe6000 r-xp 00000000 00:00 0 [vdso]
    3ff7fe6000-3ff7ffd000 r-xp 00000000 fe:00 7193 /lib/ld-2.28.so
    3ff7ffd000-3ff7ffe000 r--p 00016000 fe:00 7193 /lib/ld-2.28.so
    3ff7ffe000-3ff7fff000 rw-p 00017000 fe:00 7193 /lib/ld-2.28.so
    3ff7fff000-3ff8000000 rw-p 00000000 00:00 0
    3fff888000-3fff8a9000 rw-p 00000000 00:00 0 [stack]

    [alex@ghiti.fr: v6]
    Link: http://lkml.kernel.org/r/20190808061756.19712-15-alex@ghiti.fr
    Link: http://lkml.kernel.org/r/20190730055113.23635-15-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Acked-by: Paul Walmsley [arch/riscv]
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • mips uses a top-down layout by default that exactly fits the generic
    functions, so get rid of arch specific code and use the generic version by
    selecting ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT.

    As ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT selects ARCH_HAS_ELF_RANDOMIZE,
    use the generic version of arch_randomize_brk since it also fits. Note
    that this commit also removes the possibility for mips to have elf
    randomization and no MMU: without MMU, the security added by randomization
    is worth nothing.

    Link: http://lkml.kernel.org/r/20190730055113.23635-14-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Paul Burton
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Mips uses TASK_IS_32BIT_ADDR to determine if a task is 32-bit, but this
    define is mips-specific and other arches do not have it: use the
    !IS_ENABLED(CONFIG_64BIT) || is_compat_task() condition instead.
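
    The replacement condition reads: a task has a 32-bit address space if
    the kernel itself is 32-bit, or if it is a compat task on a 64-bit
    kernel. Modeled with plain booleans standing in for
    IS_ENABLED(CONFIG_64BIT) and is_compat_task():

```c
#include <assert.h>
#include <stdbool.h>

/* config_64bit stands in for IS_ENABLED(CONFIG_64BIT); compat_task
 * stands in for is_compat_task() on a 64-bit kernel. */
static bool task_is_32bit(bool config_64bit, bool compat_task)
{
	return !config_64bit || compat_task;
}
```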

    Link: http://lkml.kernel.org/r/20190730055113.23635-13-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Paul Burton
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit simply bumps the random offset of brk up to 32MB for 32-bit
    and 1GB for 64-bit, compared to 8MB and 256MB respectively.

    Link: http://lkml.kernel.org/r/20190730055113.23635-12-alex@ghiti.fr
    Suggested-by: Kees Cook
    Signed-off-by: Alexandre Ghiti
    Acked-by: Paul Burton
    Reviewed-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • The mmap base address must be computed with respect to the stack top
    address; using TASK_SIZE is wrong since STACK_TOP and TASK_SIZE are not
    equivalent.

    Link: http://lkml.kernel.org/r/20190730055113.23635-11-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Acked-by: Paul Burton
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit takes care of stack randomization and stack guard gap when
    computing mmap base address and checks if the task asked for
    randomization. This fixes the problem uncovered and not fixed for arm
    here: https://lkml.kernel.org/r/20170622200033.25714-1-riel@redhat.com

    Link: http://lkml.kernel.org/r/20190730055113.23635-10-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Acked-by: Paul Burton
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • arm uses a top-down mmap layout by default that exactly fits the
    generic functions, so get rid of the arch-specific code and use the
    generic version by selecting ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT.

    As ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT selects ARCH_HAS_ELF_RANDOMIZE,
    use the generic version of arch_randomize_brk, since it also fits.
    Note that this commit also removes the possibility for arm to have ELF
    randomization without an MMU: without an MMU, the security added by
    randomization is worthless.

    Note that it is safe to remove STACK_RND_MASK since it matches the default
    value.

    Link: http://lkml.kernel.org/r/20190730055113.23635-9-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • The mmap base address must be computed with respect to the stack top
    address; using TASK_SIZE is wrong, since STACK_TOP and TASK_SIZE are
    not equivalent.

    Link: http://lkml.kernel.org/r/20190730055113.23635-8-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit takes the stack randomization and the stack guard gap into
    account when computing the mmap base address, and checks whether the
    task asked for randomization. This fixes the problem uncovered, but
    not fixed, for arm here:
    https://lkml.kernel.org/r/20170622200033.25714-1-riel@redhat.com

    Link: http://lkml.kernel.org/r/20190730055113.23635-7-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This commit selects ARCH_HAS_ELF_RANDOMIZE when an arch uses the
    generic top-down mmap layout functions, so that this security feature
    is on by default.

    Note that this commit also removes the possibility for arm64 to have
    ELF randomization without an MMU: without an MMU, the security added
    by randomization is worthless.

    Link: http://lkml.kernel.org/r/20190730055113.23635-6-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • arm64 handles the top-down mmap layout in a way that can easily be
    reused by other architectures, so make it available in mm. This patch
    then introduces a new config option,
    ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT, that other architectures can
    select to benefit from those functions. Note that this new config
    option depends on MMU being enabled; if it is selected without MMU
    support, a build warning is issued.

    Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Suggested-by: Christoph Hellwig
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Do not offset the mmap base address for stack randomization if the
    current task does not want randomization. Note that x86 already
    implements this behaviour.

    Link: http://lkml.kernel.org/r/20190730055113.23635-4-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Each architecture has its own way to determine whether a task is a
    compat task. Using is_compat_task() in arch_mmap_rnd() makes the
    function more generic and prepares for moving it to mm/.
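
    A generic arch_mmap_rnd() along these lines can be modeled as below.
    The bit counts are illustrative stand-ins for CONFIG_ARCH_MMAP_RND_BITS
    and its compat variant, and the compat flag parameter stands in for
    the kernel's is_compat_task() check; this is a sketch of the shape of
    the function, not the kernel code.

```c
#include <stdbool.h>

#define PAGE_SHIFT           12
#define MMAP_RND_BITS        28  /* stand-in for CONFIG_ARCH_MMAP_RND_BITS */
#define MMAP_RND_COMPAT_BITS 8   /* stand-in for the compat variant */

/* Mask the raw random value to the per-ABI number of bits, then shift
 * the result into page units. */
static unsigned long arch_mmap_rnd_model(unsigned long random, bool compat)
{
    unsigned long bits = compat ? MMAP_RND_COMPAT_BITS : MMAP_RND_BITS;

    return (random & ((1UL << bits) - 1)) << PAGE_SHIFT;
}
```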

    Link: http://lkml.kernel.org/r/20190730055113.23635-3-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Patch series "Provide generic top-down mmap layout functions", v6.

    This series introduces generic functions to make the top-down mmap
    layout easily accessible to architectures, in particular riscv, which
    was the initial goal of this series. The generic implementation was
    taken from arm64 and adopted successively by arm, mips and finally
    riscv.

    Note that in addition the series fixes two issues:

    - stack randomization was taken into account even when unnecessary;

    - [1] fixed an issue where the mmap base did not take randomization
      into account, but the fix was never propagated to arm and mips; by
      moving the arm64 code into a generic library, this problem is now
      fixed for both architectures.

    This work is an effort to factorize architecture functions to avoid code
    duplication and oversights as in [1].

    [1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html

    This patch (of 14):

    This preparatory commit moves this function so that the further
    introduction of the generic top-down mmap layout is contained entirely
    in mm/util.c.

    Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • After all uprobes are removed from the huge page (with PTE pgtable), it is
    possible to collapse the pmd and benefit from THP again. This patch does
    the collapse by calling collapse_pte_mapped_thp().

    Link: http://lkml.kernel.org/r/20190815164525.1848545-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reported-by: kbuild test robot
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • khugepaged needs the exclusive mmap_sem to access the page table.
    When it fails to lock mmap_sem, the page will fault in as a pte-mapped
    THP. As the page is already a THP, khugepaged will not handle this pmd
    again.

    This patch enables khugepaged to retry collapsing the page table.

    struct mm_slot (in khugepaged.c) is extended with an array, containing
    addresses of pte-mapped THPs. We use array here for simplicity. We can
    easily replace it with more advanced data structures when needed.

    In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to
    collapse the page table.

    Since the collapse may happen at a later time, some pages may already
    have faulted in. collapse_pte_mapped_thp() is added to properly handle
    these pages. collapse_pte_mapped_thp() also double-checks whether all
    PTEs in this pmd map to the same THP. This is necessary because some
    subpage of the THP may be replaced, for example by a uprobe. In such
    cases, it is not possible to collapse the pmd.
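
    The bookkeeping described above (an address array on struct mm_slot)
    can be modeled in userspace C. The struct layout and the
    MAX_PTE_MAPPED_THP value below are illustrative assumptions, not the
    exact khugepaged definitions.

```c
#include <stdbool.h>

#define MAX_PTE_MAPPED_THP 8

/* Userspace model of the mm_slot extension: a small array holding the
 * addresses of pte-mapped THPs waiting to be collapsed. */
struct mm_slot_model {
    int nr_pte_mapped_thp;
    unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
};

/* Queue an address for a later collapse attempt; duplicates are ignored
 * and the request is dropped when the array is full. */
static bool record_pte_mapped_thp(struct mm_slot_model *s, unsigned long addr)
{
    for (int i = 0; i < s->nr_pte_mapped_thp; i++)
        if (s->pte_mapped_thp[i] == addr)
            return true;  /* already queued */
    if (s->nr_pte_mapped_thp >= MAX_PTE_MAPPED_THP)
        return false;     /* array full, drop the request */
    s->pte_mapped_thp[s->nr_pte_mapped_thp++] = addr;
    return true;
}
```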

    [kirill.shutemov@linux.intel.com: add comments for retract_page_tables()]
    Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box
    Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com
    Signed-off-by: Song Liu
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Suggested-by: Johannes Weiner
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Use the newly added FOLL_SPLIT_PMD in uprobe. This preserves the huge
    page when the uprobe is enabled. When the uprobe is disabled, newer
    instances of the same application can still benefit from the huge
    page.

    As a next step, we will enable khugepaged to regroup the pmd, so that
    existing instances of the application can also benefit from the huge
    page after the uprobe is disabled.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-5-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Srikar Dronamraju
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Introduce a new foll_flag: FOLL_SPLIT_PMD. As the name says,
    FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct, while the
    underlying huge page stays as-is.

    FOLL_SPLIT_PMD is useful for cases where we need to use regular pages,
    but would like to switch back to the huge page and huge pmd later.
    One such example is uprobe. The following patches use FOLL_SPLIT_PMD
    in uprobe.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-4-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reviewed-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Currently, uprobe swaps the target page with an anonymous page in
    both install_breakpoint() and remove_breakpoint(). When all uprobes on
    a page are removed, the given mm still uses an anonymous page (not the
    original page).

    This patch allows uprobe to use original page when possible (all uprobes
    on the page are already removed, and the original page is in page cache
    and uptodate).

    As suggested by Oleg, we unmap the old_page and let the original page
    fault in.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-3-songliubraving@fb.com
    Signed-off-by: Song Liu
    Suggested-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Patch series "THP aware uprobe", v13.

    This patchset makes uprobe aware of THPs.

    Currently, when a uprobe is attached to text on a THP, the page is
    split by FOLL_SPLIT. As a result, uprobe eliminates the performance
    benefit of THP.

    This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduce
    FOLL_SPLIT_PMD, which only splits the PMD for uprobe.

    After all uprobes within the THP are removed, the PTE-mapped pages are
    regrouped into a huge PMD.

    This set (plus a few THP patches) is also available at

    https://github.com/liu-song-6/linux/tree/uprobe-thp

    This patch (of 6):

    Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
    can use them in other files.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Currently the THP deferred split shrinker is not memcg aware; this
    may cause premature OOM with some configurations. For example, the
    below test easily runs into premature OOM:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

    transhuge-stress comes from the kernel selftests.

    It is easy to hit OOM, but there are still a lot of THPs on the
    deferred split queue; memcg direct reclaim can't touch them since the
    deferred split shrinker is not memcg aware.

    Make the deferred split shrinker memcg aware by introducing a
    per-memcg deferred split queue. A THP goes on either the per-node or
    the per-memcg deferred split queue, depending on whether it belongs to
    a memcg. When a page is migrated to another memcg, it is moved to the
    target memcg's deferred split queue too.

    Reuse the second tail page's deferred_list for the per-memcg list,
    since the same THP can't be on multiple deferred split queues.
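
    The queue selection rule can be sketched as a userspace model; the
    struct names below are illustrative stand-ins for the kernel's
    deferred split queues hanging off mem_cgroup and pglist_data, not the
    actual kernel types.

```c
#include <stddef.h>

/* Userspace model: each node and each memcg owns a deferred split queue. */
struct split_queue_model { int len; };

struct memcg_model { struct split_queue_model deferred_split_queue; };
struct node_model  { struct split_queue_model deferred_split_queue; };

/* A THP charged to a memcg goes on that memcg's queue; an uncharged THP
 * falls back to its node's queue. */
static struct split_queue_model *
get_deferred_split_queue_model(struct memcg_model *memcg,
                               struct node_model *node)
{
    if (memcg)
        return &memcg->deferred_split_queue;
    return &node->deferred_split_queue;
}
```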

    [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
    Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi