11 Mar, 2022

1 commit

  • This is the 5.15.27 stable release

    * tag 'v5.15.27': (3069 commits)
    Linux 5.15.27
    hamradio: fix macro redefine warning
    KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx7ulp.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1028a-qds.dts
    arch/arm64/boot/dts/freescale/imx8mq.dtsi
    drivers/dma-buf/heaps/cma_heap.c
    drivers/gpu/drm/bridge/synopsys/dw-hdmi.c
    drivers/gpu/drm/mxsfb/mxsfb_kms.c
    drivers/mmc/host/sdhci-esdhc-imx.c
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/rpmsg/rpmsg_char.c
    drivers/soc/imx/gpcv2.c
    drivers/thermal/imx_thermal.c

    Jason Liu
     

09 Mar, 2022

1 commit

  • [ Upstream commit 60115fa54ad7b913b7cb5844e6b7ffeb842d55f2 ]

    Yongqiang reports a kmemleak panic during module insmod/rmmod with KASAN
    enabled (without KASAN_VMALLOC) on x86 [1].

    When the module area allocates memory, its kmemleak_object is created
    successfully, but the KASAN shadow memory for the module allocation is
    not ready yet, so when kmemleak scans the module's pointers, the KASAN
    check panics because there is no shadow memory.

    module_alloc
      __vmalloc_node_range
        kmemleak_vmalloc
                                kmemleak_scan
                                  update_checksum
      kasan_module_alloc
        kmemleak_ignore

    Note, there is no problem if KASAN_VMALLOC is enabled, since the entire
    shadow memory of the modules area is preallocated. Thus, the bug only
    exists on architectures that dynamically allocate the module area's
    shadow per module load; for now, only x86/arm64/s390 are involved.

    Add a VM_DEFER_KMEMLEAK flag and defer the kmemleak registration of the
    vmalloc'ed object in module_alloc() to fix this issue.
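
    As a rough sketch (not the exact upstream diff), an arch module_alloc()
    can then defer the registration and let kasan_module_alloc() complete it
    once the shadow is mapped:

    void *module_alloc(unsigned long size)
    {
    	gfp_t gfp_mask = GFP_KERNEL;
    	void *p;

    	/* Defer kmemleak registration: the KASAN shadow is not mapped yet. */
    	p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR, MODULES_END,
    				 gfp_mask, PAGE_KERNEL,
    				 IS_ENABLED(CONFIG_KASAN) &&
    				 !IS_ENABLED(CONFIG_KASAN_VMALLOC) ?
    					VM_DEFER_KMEMLEAK : 0,
    				 NUMA_NO_NODE, __builtin_return_address(0));

    	/* Maps the shadow, then performs the deferred kmemleak registration. */
    	if (p && kasan_module_alloc(p, size, gfp_mask)) {
    		vfree(p);
    		return NULL;
    	}
    	return p;
    }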

    [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/

    [wangkefeng.wang@huawei.com: fix build]
    Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
    [akpm@linux-foundation.org: simplify ifdefs, per Andrey]
    Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com

    Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
    Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules")
    Fixes: 39d114ddc682 ("arm64: add KASAN support")
    Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables")
    Signed-off-by: Kefeng Wang
    Reported-by: Yongqiang Liu
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: Alexander Gordeev
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Alexander Potapenko
    Cc: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kefeng Wang
     

30 Nov, 2021

1 commit

  • * jailhouse/next: (45 commits)
    LF-3330 net: ivshmem-net: include ethtool to avoid build break
    MLK-25346: net: add imx-shmem-net driver
    LF-2949 arm: kernel: hyp-stub: not export __hyp_stub_vectors
    LF-3097/LF-3172 virtio: ivshmem: check peer_state early
    LF-3016-3 tools/virtio: ivshmem-console: correct device_vector to 0
    ...

    Dong Aisheng
     

02 Nov, 2021

4 commits


29 Oct, 2021

1 commit

  • Eric Dumazet reported a strange numa spreading info in [1], and found
    commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
    this issue [2].

    Digging into the difference before and after this patch, the page
    allocation paths differ:

    before:
    alloc_large_system_hash
      __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
          __vmalloc_node_range
            __vmalloc_area_node
              alloc_page    /* because NUMA_NO_NODE, choose the alloc_page branch */
                alloc_pages_current
                  alloc_page_interleave    /* can be proved by printing the policy mode */

    after:
    alloc_large_system_hash
      __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
          __vmalloc_node_range
            __vmalloc_area_node
              alloc_pages_node    /* nid chosen by numa_mem_id() */
                __alloc_pages_node(nid, ....)

    So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
    memory is allocated on the current node instead of being interleaved
    across nodes.
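
    The idea of the fix, sketched (simplified from the page-allocation loop):
    when no node is specified, go through alloc_pages() so the task's
    mempolicy (e.g. interleave) is honored, instead of pinning everything to
    numa_mem_id():

    struct page *page;

    if (nid == NUMA_NO_NODE)
    	page = alloc_pages(gfp, order);           /* respects mempolicy */
    else
    	page = alloc_pages_node(nid, gfp, order); /* explicit node requested */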

    Link: https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/ [1]
    Link: https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/ [2]
    Link: https://lkml.kernel.org/r/20211021080744.874701-2-chenwandun@huawei.com
    Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
    Signed-off-by: Chen Wandun
    Reported-by: Eric Dumazet
    Cc: Shakeel Butt
    Cc: Nicholas Piggin
    Cc: Kefeng Wang
    Cc: Hanjun Guo
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Wandun
     

09 Sep, 2021

3 commits

  • Merge more updates from Andrew Morton:
    "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

    Subsystems affected by this patch series: mm (memory-hotplug, rmap,
    ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
    alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
    checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
    selftests, ipc, and scripts"

    * emailed patches from Andrew Morton: (94 commits)
    scripts: check_extable: fix typo in user error message
    mm/workingset: correct kernel-doc notations
    ipc: replace costly bailout check in sysvipc_find_ipc()
    selftests/memfd: remove unused variable
    Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
    configs: remove the obsolete CONFIG_INPUT_POLLDEV
    prctl: allow to setup brk for et_dyn executables
    pid: cleanup the stale comment mentioning pidmap_init().
    kernel/fork.c: unexport get_{mm,task}_exe_file
    coredump: fix memleak in dump_vma_snapshot()
    fs/coredump.c: log if a core dump is aborted due to changed file permissions
    nilfs2: use refcount_dec_and_lock() to fix potential UAF
    nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
    nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
    nilfs2: fix NULL pointer in nilfs_##name##_attr_release
    nilfs2: fix memory leak in nilfs_sysfs_create_device_group
    trap: cleanup trap_init()
    init: move usermodehelper_enable() to populate_rootfs()
    ...

    Linus Torvalds
     
    There is no need to execute from iomem (and on most platforms it is
    impossible anyway), so add a pgprot_nx() call similar to vmap.

    Link: https://lkml.kernel.org/r/20210824091259.1324527-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Nicholas Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "small ioremap cleanups".

    The first patch moves a little code around the vmalloc/ioremap boundary,
    following a bigger move by Nick earlier. The second enforces
    non-executable mappings on ioremap just like we do for vmap; no driver
    currently uses executable mappings anyway, nor should they.

    This patch (of 2):

    This keeps it together with the implementation and allows removing the
    vmap_range wrapper.

    Link: https://lkml.kernel.org/r/20210824091259.1324527-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20210824091259.1324527-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Nicholas Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

04 Sep, 2021

3 commits

    Commit f608788cd2d6 ("mm/vmalloc: use rb_tree instead of list for vread()
    lookups") switched to an rb_tree instead of a list to speed up lookups,
    but __find_vmap_area tries to find a vmap_area that contains the target
    address; if the target address is smaller than the leftmost node in
    vmap_area_root, it returns NULL and vread reads nothing. This behavior
    differs from the original semantics.

    The correct way is to find the first vmap_area that ends above the target
    address, which is what find_vmap_area_exceed_addr does.
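
    The lookup then descends the tree, remembering the lowest area that ends
    above addr (close to the function added by the fix):

    static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
    {
    	struct vmap_area *va = NULL;
    	struct rb_node *n = vmap_area_root.rb_node;

    	while (n) {
    		struct vmap_area *tmp;

    		tmp = rb_entry(n, struct vmap_area, rb_node);
    		if (tmp->va_end > addr) {
    			/* Candidate: ends above addr; take it if it also
    			 * contains addr, otherwise look for a lower one. */
    			va = tmp;
    			if (tmp->va_start <= addr)
    				break;
    			n = n->rb_left;
    		} else
    			n = n->rb_right;
    	}
    	return va;
    }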

    Link: https://lkml.kernel.org/r/20210714015959.3204871-1-chenwandun@huawei.com
    Fixes: f608788cd2d6 ("mm/vmalloc: use rb_tree instead of list for vread() lookups")
    Signed-off-by: Chen Wandun
    Reported-by: Hulk Robot
    Cc: Serapheim Dimitropoulos
    Cc: Uladzislau Rezki (Sony)
    Cc: Kefeng Wang
    Cc: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Wandun
     
    Get rid of the gfpflags_allow_blocking() check in the vmalloc() path, as
    it is supposed to be sleepable anyway. Thus remove it from
    alloc_vmap_area() as well as from vm_area_alloc_pages().

    Link: https://lkml.kernel.org/r/20210707182639.31282-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Nicholas Piggin
    Cc: Hillf Danton
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
    In case of simultaneous vmalloc allocations, for example 1GB allocations
    on 12 CPUs, my system is able to hit "BUG: soft lockup" on a
    !CONFIG_PREEMPT kernel.

    RIP: 0010:__alloc_pages_bulk+0xa9f/0xbb0
    Call Trace:
    __vmalloc_node_range+0x11c/0x2d0
    __vmalloc_node+0x4b/0x70
    fix_size_alloc_test+0x44/0x60 [test_vmalloc]
    test_func+0xe7/0x1f0 [test_vmalloc]
    kthread+0x11a/0x140
    ret_from_fork+0x22/0x30

    To address this issue, invoke the bulk allocator repeatedly until all
    pages are obtained, i.e. do batched page requests, adding a
    cond_resched() in between to allow rescheduling. The batch size is
    hard-coded at 100 pages per call.
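
    A sketch of the batched loop (simplified; names follow
    vm_area_alloc_pages()):

    /* Request at most 100 pages per bulk call so very large allocations
     * keep hitting a reschedule point on !CONFIG_PREEMPT kernels. */
    while (nr_allocated < nr_pages) {
    	unsigned int nr, nr_pages_request;

    	nr_pages_request = min(100U, nr_pages - nr_allocated);

    	nr = alloc_pages_bulk_array_node(gfp, nid, nr_pages_request,
    					 pages + nr_allocated);
    	nr_allocated += nr;
    	cond_resched();

    	/* The bulk allocator stopped short; let the caller fall back to
    	 * the single-page allocator for the rest. */
    	if (nr != nr_pages_request)
    		break;
    }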

    Link: https://lkml.kernel.org/r/20210707182639.31282-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Nicholas Piggin
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

02 Jul, 2021

1 commit

  • make W=1 generates the following warning for mm/vmalloc.c

    mm/vmalloc.c:1599:6: warning: no previous prototype for `set_iounmap_nonlazy' [-Wmissing-prototypes]
    void set_iounmap_nonlazy(void)
    ^~~~~~~~~~~~~~~~~~~

    This function lives in arch-generic code but is only used by x86; on
    other arches it's dead code. Include the header with the definition and
    make it x86-64 specific.

    Link: https://lkml.kernel.org/r/20210520084809.8576-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Jul, 2021

2 commits

  • On some architectures like powerpc, there are huge pages that are mapped
    at pte level.

    Enable their use in vmalloc.

    For that, architectures can provide arch_vmap_pte_supported_shift() that
    returns the shift for pages to map at pte level.
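
    As a hedged illustration (values mirror powerpc 8xx, where 16k and 512k
    pages exist at pte level), an architecture might provide:

    /* Return the page shift vmalloc may use for a mapping of this size. */
    static inline int arch_vmap_pte_supported_shift(unsigned long size)
    {
    	if (size >= SZ_512K)
    		return 19;	/* 512k pages */
    	else if (size >= SZ_16K)
    		return 14;	/* 16k pages */
    	return PAGE_SHIFT;	/* normal pages */
    }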

    Link: https://lkml.kernel.org/r/2c717e3b1fba1894d890feb7669f83025bfa314d.1620795204.git.christophe.leroy@csgroup.eu
    Signed-off-by: Christophe Leroy
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christophe Leroy
     
  • On some architectures like powerpc, there are huge pages that are mapped
    at pte level.

    Enable their use in vmap.

    For that, architectures can provide arch_vmap_pte_range_map_size() that
    returns the size of pages to map at pte level.
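
    The generic fallback is a stub that keeps the existing behavior; an
    architecture overrides it to return a larger size when a single pte-level
    entry can map a bigger contiguous block (a sketch of the hook):

    #ifndef arch_vmap_pte_range_map_size
    /* Default: map one normal page at a time at pte level. */
    static inline unsigned long
    arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end,
    			     u64 pfn, unsigned int max_page_shift)
    {
    	return PAGE_SIZE;
    }
    #endif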

    Link: https://lkml.kernel.org/r/fb3ccc73377832ac6708181ec419128a2f98ce36.1620795204.git.christophe.leroy@csgroup.eu
    Signed-off-by: Christophe Leroy
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christophe Leroy
     

30 Jun, 2021

5 commits

  • On non-preemptible kernel builds the watchdog can complain about soft
    lockups when vfree() is called against large vmalloc areas:

    [ 210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded
    [ 238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203]
    [ 238.662716] Modules linked in: kvmalloc_test(OE-) ...
    [ 238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S OE 5.13.0-rc7+ #1
    [ 238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018
    [ 238.792383] RIP: 0010:free_unref_page+0x52/0x60
    [ 238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77
    [ 238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206
    [ 238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40
    [ 238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010
    [ 238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001
    [ 238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202
    [ 238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000
    [ 238.864059] FS: 00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000
    [ 238.873089] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0
    [ 238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 238.903397] PKRU: 55555554
    [ 238.906417] Call Trace:
    [ 238.909149] __vunmap+0x17c/0x220
    [ 238.912851] __x64_sys_delete_module+0x13a/0x250
    [ 238.918008] ? syscall_trace_enter.isra.20+0x13c/0x1b0
    [ 238.923746] do_syscall_64+0x39/0x80
    [ 238.927740] entry_SYSCALL_64_after_hwframe+0x44/0xae

    Like in other range-zapping routines that iterate over a large list,
    let's just add cond_resched() within __vunmap()'s page-releasing loop in
    order to avoid the watchdog splats.
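
    The change itself is tiny (sketched against the 5.13-era __vunmap(),
    shown here for the order-0 case):

    /* Page-releasing loop, now with a reschedule point per iteration. */
    for (i = 0; i < area->nr_pages; i++) {
    	struct page *page = area->pages[i];

    	BUG_ON(!page);
    	__free_pages(page, 0);
    	cond_resched();	/* avoid soft-lockup splats on huge areas */
    }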

    Link: https://lkml.kernel.org/r/20210622225030.478384-1-aquini@redhat.com
    Signed-off-by: Rafael Aquini
    Acked-by: Nicholas Piggin
    Reviewed-by: Uladzislau Rezki (Sony)
    Reviewed-by: Aaron Tomlin
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
    Currently for order-0 pages we use the bulk-page allocator to get a set
    of pages. On the other hand, not getting all the requested pages is
    something that can occur. In that case we should fall back to the
    single-page allocator to get the missing pages, because it is more
    permissive (direct reclaim, etc).

    Introduce a vm_area_alloc_pages() function where the described logic is
    implemented.
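
    A sketch of the fallback portion of that logic (simplified):

    /* Bulk allocation stopped short: get the missing pages one by one,
     * since the single-page path is more permissive (direct reclaim etc). */
    while (nr_allocated < nr_pages) {
    	struct page *page = alloc_pages_node(nid, gfp, 0);

    	if (!page)
    		break;
    	pages[nr_allocated++] = page;
    }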

    Link: https://lkml.kernel.org/r/20210521130718.GA17882@pc638.lan
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Cc: Mel Gorman
    Cc: Nicholas Piggin
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki
     
    The checkpatch.pl script complains about splitting a quoted string across
    lines, because a user who wants to find the entire string will not
    succeed.

    WARNING: quoted string split across lines
    + "vmalloc size %lu allocation failure: "
    + "page order %u allocation failed",

    total: 0 errors, 1 warnings, 10 lines checked

    Link: https://lkml.kernel.org/r/20210521204359.19943-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Mel Gorman
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Nicholas Piggin
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
    When a memory allocation for the array of pages does not succeed, emit a
    warning message as a first step and only then perform the cleanup.

    The reason it should be done in this order is that the cleanup function,
    free_vm_area(), can potentially also follow its own error paths, which
    could cause confusion about what was broken first.

    Link: https://lkml.kernel.org/r/20210516202056.2120-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
    Recently a bulk page allocator was introduced for users that need to get
    a number of pages in one call.

    For order-0 pages, switch to alloc_pages_bulk_array_node() instead of
    alloc_pages_node(); the reason is that the latter is not capable of
    allocating a set of pages, thus one call is needed per page.
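
    Sketched, the order-0 branch changes along these lines (variable names
    are illustrative):

    /* One call fills as much of the page array as the allocator manages,
     * instead of one alloc_pages_node() call per page. */
    nr_allocated = alloc_pages_bulk_array_node(gfp_mask, nid,
    					       nr_small_pages, area->pages);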

    Second, according to my tests the bulk allocator uses fewer cycles even
    for scenarios where only one page is requested. Running perf on the same
    test case shows the difference below:

    - 45.18% __vmalloc_node
       - __vmalloc_node_range
          - 35.60% __alloc_pages
             - get_page_from_freelist
                  3.36% __list_del_entry_valid
                  3.00% check_preemption_disabled
                  1.42% prep_new_page

    - 31.00% __vmalloc_node
       - __vmalloc_node_range
          - 14.48% __alloc_pages_bulk
               3.22% __list_del_entry_valid
          - 0.83% __alloc_pages
               get_page_from_freelist

    The "test_vmalloc.sh" also shows performance improvements:

    before:
    fix_size_alloc_test_4MB loops: 1000000 avg: 89105095 usec
    fix_size_alloc_test loops: 1000000 avg: 513672 usec
    full_fit_alloc_test loops: 1000000 avg: 748900 usec
    long_busy_list_alloc_test loops: 1000000 avg: 8043038 usec
    random_size_alloc_test loops: 1000000 avg: 4028582 usec
    fix_align_alloc_test loops: 1000000 avg: 1457671 usec

    after:
    fix_size_alloc_test_4MB loops: 1000000 avg: 62083711 usec
    fix_size_alloc_test loops: 1000000 avg: 449207 usec
    full_fit_alloc_test loops: 1000000 avg: 735985 usec
    long_busy_list_alloc_test loops: 1000000 avg: 5176052 usec
    random_size_alloc_test loops: 1000000 avg: 2589252 usec
    fix_align_alloc_test loops: 1000000 avg: 1365009 usec

    For example, 4MB allocations show a ~30% gain; all the rest improve as
    well.

    Link: https://lkml.kernel.org/r/20210516202056.2120-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Acked-by: Mel Gorman
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

25 Jun, 2021

2 commits

  • In commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
    __vmalloc_node_range was changed such that __get_vm_area_node was no
    longer called with the requested/real size of the vmalloc allocation,
    but rather with a rounded-up size.

    This means that __get_vm_area_node called kasan_unpoison_vmalloc() with a
    rounded-up size rather than the real size. This led to it allowing access
    to too much memory, and so missing vmalloc OOBs and failing the KASAN
    KUnit tests.

    Pass the real size and the desired shift into __get_vm_area_node. This
    allows it to round up the size for the underlying allocators while still
    unpoisoning the correct quantity of shadow memory.

    Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.

    Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335
    Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
    Signed-off-by: Daniel Axtens
    Tested-by: David Gow
    Reviewed-by: Nicholas Piggin
    Reviewed-by: Uladzislau Rezki (Sony)
    Tested-by: Andrey Konovalov
    Acked-by: Andrey Konovalov
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Axtens
     
  • Patch series "mm: add vmalloc_no_huge and use it", v4.

    Add vmalloc_no_huge() and export it, so modules can allocate memory with
    small pages.

    Use the newly added vmalloc_no_huge() in KVM on s390 to get around a
    hardware limitation.

    This patch (of 2):

    Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added
    support for hugepage vmalloc mappings; it also added the flag
    VM_NO_HUGE_VMAP for __vmalloc_node_range to request that the allocation
    be performed with 0-order non-huge pages.

    This flag is not accessible when calling vmalloc(); the only option is to
    call __vmalloc_node_range() directly, which is not exported.

    This means that a module can't vmalloc memory with small pages.

    Case in point: KVM on s390x needs to vmalloc a large area, and it needs
    to be mapped with non-huge pages, because of a hardware limitation.

    This patch adds the function vmalloc_no_huge, which works like vmalloc
    but is guaranteed to always back the mapping with small pages. The new
    function is exported, therefore it is usable by modules.
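
    The resulting function is essentially a thin exported wrapper (a sketch
    consistent with the description above):

    /* Like vmalloc(), but guarantee order-0 backing pages. */
    void *vmalloc_no_huge(unsigned long size)
    {
    	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
    				    GFP_KERNEL, PAGE_KERNEL, VM_NO_HUGE_VMAP,
    				    NUMA_NO_NODE, __builtin_return_address(0));
    }
    EXPORT_SYMBOL(vmalloc_no_huge);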

    [akpm@linux-foundation.org: whitespace fixes, per Christoph]

    Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com
    Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com
    Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings")
    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Uladzislau Rezki (Sony)
    Acked-by: Nicholas Piggin
    Reviewed-by: David Hildenbrand
    Acked-by: David Rientjes
    Cc: Uladzislau Rezki (Sony)
    Cc: Catalin Marinas
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Cornelia Huck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
     

07 May, 2021

3 commits

  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • The last user (/dev/kmem) is gone. Let's drop it.

    Link: https://lkml.kernel.org/r/20210324102351.6932-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Minchan Kim
    Cc: huang ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "drivers/char: remove /dev/kmem for good".

    Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
    memory ballooning, I started questioning the existence of /dev/kmem.

    Comparing it with the /proc/kcore implementation, it does not seem to be
    able to deal with things like

    a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
    -> kern_addr_valid(). virt_addr_valid() is not sufficient.

    b) Special cases like gart aperture memory that is not to be touched
    -> mem_pfn_is_ram()

    Unless I am missing something, it's at least broken in some cases and might
    fault/crash the machine.

    Looks like its existence has been questioned before, in 2005 and 2010 [1];
    after ~11 additional years, it might make sense to revive the discussion.

    CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
    mistake?). All distributions disable it: in Ubuntu it has been disabled
    for more than 10 years, in Debian since 2.6.31, in Fedora at least
    starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
    15sp2, and OpenSUSE has it disabled as well.

    1) /dev/kmem was popular for rootkits [2] before it got disabled
    basically everywhere. Ubuntu documents [3] "There is no modern user of
    /dev/kmem any more beyond attackers using it to load kernel rootkits.".
    RHEL documents in a BZ [5] "it served no practical purpose other than to
    serve as a potential security problem or to enable binary module drivers
    to access structures/functions they shouldn't be touching"

    2) /proc/kcore is a decent interface providing a controlled way to read
    kernel memory for debugging purposes. (It will need some extensions to
    deal with memory offlining/unplug, memory ballooning, and poisoned
    pages, though.)

    3) It might be useful for corner case debugging [1]. KDB/KGDB might be a
    better fit, especially for writing random memory; it is harder to shoot
    yourself in the foot.

    4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
    to be incompatible with 64bit [1]. For educational purposes,
    /proc/kcore might be used to monitor value updates -- or older
    kernels can be used.

    5) It's broken on arm64, and therefore, completely disabled there.

    Looks like it's essentially unused and has been replaced by better
    suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
    just remove it.

    [1] https://lwn.net/Articles/147901/
    [2] https://www.linuxjournal.com/article/10505
    [3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
    [4] https://sourceforge.net/projects/kme/
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=154796

    Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Kees Cook
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: "Alexander A. Klimov"
    Cc: Alexander Viro
    Cc: Alexandre Belloni
    Cc: Andrew Lunn
    Cc: Andrey Zhizhikin
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Brian Cain
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Chris Zankel
    Cc: Corentin Labbe
    Cc: "David S. Miller"
    Cc: "Eric W. Biederman"
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Gregory Clement
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Hillf Danton
    Cc: huang ying
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: "James E.J. Bottomley"
    Cc: James Troup
    Cc: Jiaxun Yang
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Kairui Song
    Cc: Krzysztof Kozlowski
    Cc: Kuninori Morimoto
    Cc: Liviu Dudau
    Cc: Lorenzo Pieralisi
    Cc: Luc Van Oostenryck
    Cc: Luis Chamberlain
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Niklas Schnelle
    Cc: Oleksiy Avramchenko
    Cc: openrisc@lists.librecores.org
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: "Pavel Machek (CIP)"
    Cc: Pavel Machek
    Cc: "Peter Zijlstra (Intel)"
    Cc: Pierre Morel
    Cc: Randy Dunlap
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Robert Richter
    Cc: Rob Herring
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sebastian Andrzej Siewior
    Cc: Sebastian Hesselbarth
    Cc: sparclinux@vger.kernel.org
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Steven Rostedt
    Cc: Sudeep Holla
    Cc: Theodore Dubois
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Viresh Kumar
    Cc: William Cohen
    Cc: Xiaoming Ni
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

06 May, 2021

1 commit

  • Various coding style tweaks to various files under mm/

    [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

    Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
    Signed-off-by: Zhiyuan Dai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhiyuan Dai
     

01 May, 2021

12 commits

  • Link: https://lkml.kernel.org/r/20210402202237.20334-5-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Shuah Khan
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
    Instead of keeping the open-coded style, move the code related to
    preloading into a separate function. Therefore introduce the
    preload_this_cpu_lock() routine that preloads the current CPU with one
    extra vmap_area object.

    There is no functional change as a result of this patch.
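
    For reference, a sketch of the factored-out helper, close to the upstream
    shape:

    static void preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
    {
    	struct vmap_area *va = NULL;

    	/* Preload only if the per-cpu cache is empty. */
    	if (!this_cpu_read(ne_fit_preload_node))
    		va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);

    	spin_lock(lock);

    	/* Another CPU may have refilled the cache meanwhile; drop the spare. */
    	if (va && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, va))
    		kmem_cache_free(vmap_area_cachep, va);
    }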

    Link: https://lkml.kernel.org/r/20210402202237.20334-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Shuah Khan
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
    A potential use-after-free can occur in _vm_unmap_aliases, where an
    already freed vmap_area could be accessed. Consider the following
    scenario:

    Process 1                                 Process 2

    __vm_unmap_aliases                        __vm_unmap_aliases
      purge_fragmented_blocks_allcpus           rcu_read_lock()
        rcu_read_lock()
          list_del_rcu(&vb->free_list)
                                                list_for_each_entry_rcu(vb .. )
      __purge_vmap_area_lazy
        kmem_cache_free(va)
                                                va_start = vb->va->va_start

    Here Process 1 is in the purge path; it does list_del_rcu on the
    vmap_block and later frees the vmap_area. Since Process 2 was holding
    the RCU lock at this time, the vmap_block is still visible to it, so
    Process 2 accesses it and thereby the vmap_area of that vmap_block,
    which was already freed by Process 1; this results in a use-after-free.

    Fix this by adding a check for vb->dirty before accessing the vmap_area
    structure: since vb->dirty is set to VMAP_BBMAP_BITS in the purge path,
    checking for this prevents the use-after-free.
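
    A sketch of the resulting check in the RCU walk (simplified from the
    patch; variable names are illustrative):

    rcu_read_lock();
    list_for_each_entry_rcu(vb, &vbq->free, free_list) {
    	spin_lock(&vb->lock);
    	/* Skip blocks the purge path has claimed (dirty == VMAP_BBMAP_BITS):
    	 * their vmap_area may already be freed. */
    	if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
    		unsigned long va_start = vb->va->va_start;
    		unsigned long s, e;

    		s = va_start + (vb->dirty_min << PAGE_SHIFT);
    		e = va_start + (vb->dirty_max << PAGE_SHIFT);

    		start = min(s, start);
    		end   = max(e, end);
    		flush = 1;
    	}
    	spin_unlock(&vb->lock);
    }
    rcu_read_unlock();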

    Link: https://lkml.kernel.org/r/1616062105-23263-1-git-send-email-vjitta@codeaurora.org
    Signed-off-by: Vijayanand Jitta
    Reviewed-by: Uladzislau Rezki (Sony)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vijayanand Jitta
     
    There are several reasons why a vmalloc can fail: virtual space
    exhausted, page array allocation failure, page allocation failure, or
    kernel page table allocation failure.

    Add distinct warning messages for the main causes of failure, with some
    added information like page order or allocation size where applicable.
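
    For instance, the page-allocation failure case can report both the
    request size and the failing page order (message text paraphrased from
    the patch):

    warn_alloc(gfp_mask, NULL,
    	   "vmalloc error: size %lu, page order %u, failed to allocate pages",
    	   area->nr_pages * PAGE_SIZE, page_order);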

    [urezki@gmail.com: print correct vmalloc allocation size]
    Link: https://lkml.kernel.org/r/20210329193214.GA28602@pc638.lan

    Link: https://lkml.kernel.org/r/20210322021806.892164-6-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Christoph Hellwig
    Cc: Cédric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
    This is a shim around vunmap_range; get rid of it.

    Move the main API comment from the _noflush variant to the normal
    variant, and make _noflush internal to mm/.

    [npiggin@gmail.com: fix nommu builds and a comment bug per sfr]
    Link: https://lkml.kernel.org/r/1617292598.m6g0knx24s.astroid@bobo.none
    [akpm@linux-foundation.org: move vunmap_range_noflush() stub inside !CONFIG_MMU, not !CONFIG_NUMA]
    [npiggin@gmail.com: fix nommu builds]
    Link: https://lkml.kernel.org/r/1617292497.o1uhq5ipxp.astroid@bobo.none

    Link: https://lkml.kernel.org/r/20210322021806.892164-5-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Christoph Hellwig
    Cc: Cédric Le Goater
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Patch series "mm/vmalloc: cleanup after hugepage series", v2.

    Christoph pointed out some overdue cleanups required after the huge
    vmalloc series, and I had another failure error message improvement as
    well.

    This patch (of 5):

    This is a shim around vmap_pages_range; get rid of it.

    Move the main API comment from the _noflush variant to the normal variant,
    and make _noflush internal to mm/.

    Link: https://lkml.kernel.org/r/20210322021806.892164-1-npiggin@gmail.com
    Link: https://lkml.kernel.org/r/20210322021806.892164-2-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Christoph Hellwig
    Cc: Uladzislau Rezki
    Cc: Cédric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
    Support huge page vmalloc mappings. The config option
    HAVE_ARCH_HUGE_VMALLOC enables support on architectures that define
    HAVE_ARCH_HUGE_VMAP and support PMD-sized vmap mappings.

    vmalloc will attempt to allocate PMD-sized pages when allocating PMD size
    or larger, and falls back to small pages if that is unsuccessful.

    Architectures must ensure that any arch-specific vmalloc allocations that
    require PAGE_SIZE mappings (e.g., module allocations vs. strict module
    rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.

    This can result in more internal fragmentation and memory overhead for a
    given allocation; a boot option, nohugevmalloc, is added to disable it.
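
    Conceptually, the size/align decision in __vmalloc_node_range() becomes
    (a sketch, not the full patch):

    /* Use a PMD mapping when allowed, supported, and large enough;
     * otherwise keep the usual order-0 PAGE_SHIFT path. */
    if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) &&
        size >= PMD_SIZE && arch_vmap_pmd_supported(prot)) {
    	shift = PMD_SHIFT;
    	align = max(real_align, 1UL << shift);
    	size = ALIGN(real_size, 1UL << shift);
    }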

    [colin.king@canonical.com: fix read of uninitialized pointer area]
    Link: https://lkml.kernel.org/r/20210318155955.18220-1-colin.king@canonical.com

    Link: https://lkml.kernel.org/r/20210317062402.533919-14-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Ding Tianhong
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Miaohe Lin
    Cc: Michael Ellerman
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Uladzislau Rezki (Sony)
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
    As a side effect, the order of the flush_cache_vmap() and
    arch_sync_kernel_mappings() calls is switched, but that now matches the
    other callers in this file.

    Link: https://lkml.kernel.org/r/20210317062402.533919-13-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Christoph Hellwig
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Ding Tianhong
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Miaohe Lin
    Cc: Michael Ellerman
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Uladzislau Rezki (Sony)
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • This is a generic kernel virtual memory mapper, not specific to ioremap.

    Code is unchanged other than making vmap_range non-static.

    Link: https://lkml.kernel.org/r/20210317062402.533919-12-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Christoph Hellwig
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Ding Tianhong
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Miaohe Lin
    Cc: Michael Ellerman
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Uladzislau Rezki (Sony)
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
    The vmalloc mapper operates on a struct page * array rather than a linear
    physical address; rename it to make this distinction clear.

    Link: https://lkml.kernel.org/r/20210317062402.533919-5-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Miaohe Lin
    Reviewed-by: Christoph Hellwig
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Ding Tianhong
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Uladzislau Rezki (Sony)
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
    vmalloc_to_page returns NULL for addresses mapped by larger pages [*].
    Whether or not a vmap is huge depends on architecture details,
    alignments, boot options, etc., which the caller cannot be expected to
    know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

    This change teaches vmalloc_to_page about larger pages, and returns the
    struct page that corresponds to the offset within the large page. This
    makes the API agnostic to mapping implementation details.

    [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
    fail gracefully on unexpected huge vmap mappings")
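
    The core of the change is a leaf check at each page-table level; for the
    PMD level it is essentially (a sketch; upstream handles the P4D and PUD
    levels analogously):

    /* A huge-page leaf: return the sub-page at this offset. */
    if (pmd_leaf(*pmd))
    	return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);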

    [npiggin@gmail.com: sparc32: add stub pud_page define for walking huge vmalloc page tables]
    Link: https://lkml.kernel.org/r/20210324232825.1157363-1-npiggin@gmail.com

    Link: https://lkml.kernel.org/r/20210317062402.533919-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Miaohe Lin
    Reviewed-by: Christoph Hellwig
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Ding Tianhong
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Uladzislau Rezki (Sony)
    Cc: Will Deacon
    Cc: Stephen Rothwell
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
    vread() has been linearly searching vmap_area_list to look up the vmalloc
    areas it reads from. These same areas are also tracked by an rb_tree
    (vmap_area_root) which offers logarithmic lookup.

    This patch modifies vread() to use the rb_tree structure instead of the
    list and the speedup for heavy /proc/kcore readers can be pretty
    significant. Below are the wall clock measurements of a Python
    application that leverages the drgn debugging library to read and
    interpret data read from /proc/kcore.

    Before the patch:
    -----
    $ time sudo sdb -e 'dbuf | head 3000 | wc'
    (unsigned long)3000

    real 0m22.446s
    user 0m2.321s
    sys 0m20.690s
    -----

    With the patch:
    -----
    $ time sudo sdb -e 'dbuf | head 3000 | wc'
    (unsigned long)3000

    real 0m2.104s
    user 0m2.043s
    sys 0m0.921s
    -----

    Link: https://lkml.kernel.org/r/20210209190253.108763-1-serapheim@delphix.com
    Signed-off-by: Serapheim Dimitropoulos
    Reviewed-by: Uladzislau Rezki (Sony)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serapheim Dimitropoulos