03 Sep, 2020

1 commit

  • [ Upstream commit e47110e90584a22e9980510b00d0dfad3a83354e ]

    Like zap_pte_range(), add cond_resched() so that we can avoid softlockups
    as reported below. On a non-preemptible kernel with a large I/O map region
    (like the one we get when using persistent memory in sector mode), an
    unmap of the namespace can produce the softlockup below.

    [22724.027334] watchdog: BUG: soft lockup - CPU#49 stuck for 23s! [ndctl:50777]
    NIP [c0000000000dc224] plpar_hcall+0x38/0x58
    LR [c0000000000d8898] pSeries_lpar_hpte_invalidate+0x68/0xb0
    Call Trace:
    flush_hash_page+0x114/0x200
    hpte_need_flush+0x2dc/0x540
    vunmap_page_range+0x538/0x6f0
    free_unmap_vmap_area+0x30/0x70
    remove_vm_area+0xfc/0x140
    __vunmap+0x68/0x270
    __iounmap.part.0+0x34/0x60
    memunmap+0x54/0x70
    release_nodes+0x28c/0x300
    device_release_driver_internal+0x16c/0x280
    unbind_store+0x124/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xd8/0x260
    ksys_write+0xdc/0x130
    system_call+0x5c/0x70
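
    A minimal sketch of the idea (simplified, not the exact upstream diff):
    drop a cond_resched() into the PMD-level unmap loop, the same way
    zap_pte_range() does, so a huge unmap can yield the CPU:

    static void vunmap_pmd_range(pud_t *pud, unsigned long addr,
                                 unsigned long end)
    {
            pmd_t *pmd = pmd_offset(pud, addr);
            unsigned long next;

            do {
                    next = pmd_addr_end(addr, end);

                    cond_resched();         /* avoid soft lockups */

                    if (pmd_clear_huge(pmd))
                            continue;
                    if (pmd_none_or_clear_bad(pmd))
                            continue;
                    vunmap_pte_range(pmd, addr, next);
            } while (pmd++, addr = next, addr != end);
    }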

    Reported-by: Harish Sriram
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200807075933.310240-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Sasha Levin
     

29 Apr, 2020

1 commit

  • commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream.

    remap_vmalloc_range() has had various issues with the bounds checks it
    promises to perform ("This function checks that addr is a valid
    vmalloc'ed area, and that it is big enough to cover the vma") over time,
    e.g.:

    - not detecting pgoff<<PAGE_SHIFT overflow

    - not detecting (pgoff<<PAGE_SHIFT)+usize overflow

    - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the
      same vmalloc allocation

    - comparing a potentially wildly out-of-bounds pointer with the end
      of the vmalloc region

    To allow remap_vmalloc_range_partial() to verify that addr and
    addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, the offset
    is passed to remap_vmalloc_range_partial() instead of being added to
    the pointer in remap_vmalloc_range().
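
    A minimal sketch of the kind of overflow check involved
    (check_shl_overflow() is the real helper from <linux/overflow.h>;
    the surrounding code is illustrative): reject a pgoff whose byte
    offset overflows instead of adding it to the pointer unchecked:

    unsigned long off;

    if (check_shl_overflow(vma->vm_pgoff, PAGE_SHIFT, &off))
            return -EINVAL;
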
    Signed-off-by: Andrew Morton
    Cc: stable@vger.kernel.org
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Martin KaFai Lau
    Cc: Song Liu
    Cc: Yonghong Song
    Cc: Andrii Nakryiko
    Cc: John Fastabend
    Cc: KP Singh
    Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

25 Mar, 2020

1 commit

  • commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.

    Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
    __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
    the vunmap() code-path. While this change was necessary to maintain
    correctness on x86-32-pae kernels, it also adds additional cycles for
    architectures that don't need it.

    Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
    severe performance regressions in micro-benchmarks because it now also
    calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
    the vmalloc_sync_all() implementation on x86-64 is only needed for newly
    created mappings.

    To avoid the unnecessary work on x86-64 and to gain the performance
    back, split up vmalloc_sync_all() into two functions:

    * vmalloc_sync_mappings(), and
    * vmalloc_sync_unmappings()

    Most call-sites to vmalloc_sync_all() only care about new mappings being
    synchronized. The only exception is the new call-site added in the
    above mentioned commit.

    Shile Zhang directed us to a report of an 80% regression in reaim
    throughput.
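
    A minimal sketch of the resulting split (the names are from the
    commit; the per-arch bodies live elsewhere):

    /* Most call-sites only need newly created mappings synchronized: */
    void vmalloc_sync_mappings(void);

    /* Only the vunmap path needs unmappings synchronized (x86-32 PAE): */
    void vmalloc_sync_unmappings(void);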

    Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
    Reported-by: kernel test robot
    Reported-by: Shile Zhang
    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Tested-by: Borislav Petkov
    Acked-by: Rafael J. Wysocki [GHES]
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
    Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
    Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel
     

23 Jan, 2020

1 commit

  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but same is not
    true for e.g. ppc64 and s390 where the kernel would not boot with
    debug_pagealloc=on as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
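
    A minimal sketch of the two variants described above (simplified from
    the commit):

    extern bool _debug_pagealloc_enabled_early;
    DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

    /* init-time / slow-path check: a plain bool, safe before
     * jump_label_init() has run */
    static inline bool debug_pagealloc_enabled(void)
    {
            return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
                   _debug_pagealloc_enabled_early;
    }

    /* fast-path check: a static key, enabled in mm_init() once
     * jump_label_init() is guaranteed to have been called */
    static inline bool debug_pagealloc_enabled_static(void)
    {
            if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                    return false;

            return static_branch_unlikely(&_debug_pagealloc_enabled);
    }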

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

26 Sep, 2019

1 commit

  • Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
    for the case where the augmented value is a scalar whose definition
    follows a max(f(node)) pattern. This actually covers all present uses of
    RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
    various RBCOMPUTE function definitions.
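
    A minimal usage sketch, mirroring what mm/vmalloc.c does after this
    change (the macro generates the augmented-rbtree callbacks from just
    the scalar field and the per-node compute function):

    static __always_inline unsigned long
    va_size(struct vmap_area *va)
    {
            return va->va_end - va->va_start;
    }

    RB_DECLARE_CALLBACKS_MAX(static, free_vmap_area_rb_augment_cb,
                             struct vmap_area, rb_node,
                             unsigned long, subtree_max_size, va_size)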

    [walken@google.com: fix mm/vmalloc.c]
    Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
    [walken@google.com: re-add check to check_augmented()]
    Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
    Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
    Signed-off-by: Michel Lespinasse
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Howells
    Cc: Davidlohr Bueso
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

25 Sep, 2019

3 commits

  • If the !area->pages test is true, i.e. the memory allocation failed,
    area is freed.

    In this case 'area->pages = pages' should not be executed, so move
    'area->pages = pages' after the if statement.
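
    A minimal sketch of the reordering (condensed from
    __vmalloc_area_node(); the allocation call is abridged):

    pages = __vmalloc_node(array_size, 1, nested_gfp | highmem_mask,
                           PAGE_KERNEL, node, area->caller);
    if (!pages) {
            remove_vm_area(area->addr);
            kfree(area);
            return NULL;
    }

    area->pages = pages;            /* only reached on success now */
    area->nr_pages = nr_pages;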

    [akpm@linux-foundation.org: give area->pages the same treatment]
    Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15
    Signed-off-by: Austin Kim
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: Roman Penyaev
    Cc: Rick Edgecombe
    Cc: Mike Rapoport
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Austin Kim
     
  • Objective
    ---------

    The current implementation of struct vmap_area wastes space.

    After applying this commit, sizeof(struct vmap_area) is reduced
    from 11 words to 8 words.

    Description
    -----------

    1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem
    because

    A) "subtree_max_size" is only used when vmap_area is in "free" tree

    B) "vm" is only used when vmap_area is in "busy" tree

    C) "purge_list" is only used when vmap_area is in vmap_purge_list

    2) Eliminate "flags".

    Since only one flag, VM_VM_AREA, is being used, and the same thing can
    be determined by checking whether "vm" is NULL, the "flags" field can
    be eliminated.
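
    A minimal sketch of the packed layout described above (simplified):

    struct vmap_area {
            unsigned long va_start;
            unsigned long va_end;

            struct rb_node rb_node;         /* address-sorted rbtree */
            struct list_head list;          /* address-sorted list */

            /* The three fields below are mutually exclusive, so they
             * can share the same storage: */
            union {
                    unsigned long subtree_max_size; /* in "free" tree */
                    struct vm_struct *vm;           /* in "busy" tree */
                    struct llist_node purge_list;   /* in purge list */
            };
    };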

    Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com
    Signed-off-by: Pengfei Li
    Suggested-by: Uladzislau Rezki (Sony)
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Roman Gushchin
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pengfei Li
     
  • The busy tree can be quite big; even though an area has been freed or
    unmapped, it still stays there until the "purge" logic removes it.

    1) Optimize and reduce the size of the "busy" tree by removing a node
    from it right away as soon as a user triggers the free path. It is
    possible to do so, because the allocation is done using another
    augmented tree.

    The vmalloc test driver shows the difference, for example the
    "fix_size_alloc_test" is ~11% better comparing with default configuration:

    sudo ./test_vmalloc.sh performance

    <default>
    Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
    Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
    Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec

    <patched>
    Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
    Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
    Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec

    2) Since the busy tree now contains allocated areas only and does not
    interfere with lazily freed nodes, introduce the new function
    show_purge_info() that dumps "unpurged" areas; its output is exposed
    through "/proc/vmallocinfo".

    3) Eliminate VM_LAZY_FREE flag.

    Link: http://lkml.kernel.org/r/20190716152656.12255-2-lpf.vector@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Signed-off-by: Pengfei Li
    Cc: Roman Gushchin
    Cc: Uladzislau Rezki
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

04 Sep, 2019

1 commit

  • The arm architecture had a VM_ARM_DMA_CONSISTENT flag to mark DMA
    coherent remapping for a while. Lift this flag to common code so
    that we can use it generically. We also check it in the only place
    VM_USERMAP is directly checked, so that we can entirely replace that
    flag as well (although I'm not even sure why we'd want to allow
    remapping DMA mappings, but I'd rather not change behavior).

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

14 Aug, 2019

1 commit

  • Recent changes to the vmalloc code by commit 68ad4a330433
    ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
    cause spurious percpu allocation failures. These, in turn, can result
    in panic()s in the slub code. One such possible panic was reported by
    Dave Hansen at the following link: https://lkml.org/lkml/2019/6/19/939.
    Another related panic observed is:

    RIP: 0033:0x7f46f7441b9b
    Call Trace:
    dump_stack+0x61/0x80
    pcpu_alloc.cold.30+0x22/0x4f
    mem_cgroup_css_alloc+0x110/0x650
    cgroup_apply_control_enable+0x133/0x330
    cgroup_mkdir+0x41b/0x500
    kernfs_iop_mkdir+0x5a/0x90
    vfs_mkdir+0x102/0x1b0
    do_mkdirat+0x7d/0xf0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
    to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
    uses two lists (vmap_area_list & free_vmap_area_list) to track the used
    and free VM areas in VMALLOC space. The pcpu_get_vm_areas(offsets[],
    sizes[], nr_vms, align) function is used for allocating congruent VM
    areas for the percpu memory allocator. In order not to conflict with
    VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
    VMALLOC space. So the search for a free vm_area for the given
    requirement starts near VMALLOC_END and moves upwards towards
    VMALLOC_START.

    Prior to commit 68ad4a330433, the search for a free vm_area in
    pcpu_get_vm_areas() involved the following two main steps.

    Step 1:
        Find an aligned "base" address near VMALLOC_END.
        va = free vm area near VMALLOC_END
    Step 2:
        Loop through the number of requested vm_areas and check,
        Step 2.1:
            if (base < VMALLOC_START)
                1. fail with error
        Step 2.2:
            // end is offsets[area] + sizes[area]
            if (base + end > va->vm_end)
                1. Move the base downwards and repeat Step 2
        Step 2.3:
            if (base + start < va->vm_start)
                1. Move to the previous free vm_area node, find an
                   aligned base address and repeat Step 2

    But Commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

    Step 2.3:
        if (base + start < va->vm_start || base + end > va->vm_end)
            1. Move to the previous free vm_area node, find an
               aligned base address and repeat Step 2

    The above change is the root cause of spurious percpu memory allocation
    failures. For example, consider a case where a relatively large vm_area
    (~30 TB) was ignored in the free vm_area search because it did not pass
    the base + end < vm->vm_end boundary check. Ignoring such large free
    vm_areas would lead to not finding a free vm_area within the boundary
    of VMALLOC_START to VMALLOC_END, which in turn leads to allocation
    failures.

    So modify the search algorithm to include Step 2.2.
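
    A minimal sketch of the restored check inside the pcpu_get_vm_areas()
    search loop (simplified from the fix):

    /* Step 2.2: base + end must not overshoot this free area */
    if (base + end > va->va_end) {
            base = pvm_determine_end_from_reverse(&va, align) - end;
            term_area = area;
            continue;
    }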

    Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
    Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Kuppuswamy Sathyanarayanan
    Reported-by: Dave Hansen
    Acked-by: Dennis Zhou
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: sathyanarayanan kuppuswamy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuppuswamy Sathyanarayanan
     

22 Jul, 2019

1 commit

  • On x86-32 with PTI enabled, parts of the kernel page-tables are not shared
    between processes. This can cause mappings in the vmalloc/ioremap area to
    persist in some page-tables after the region is unmapped and released.

    When the region is re-used the processes with the old mappings do not fault
    in the new mappings but still access the old ones.

    This causes undefined behavior, in reality often data corruption, kernel
    oopses and panics and even spontaneous reboots.

    Fix this problem by actively syncing unmaps in the vmalloc/ioremap area to
    all page-tables in the system before the regions can be re-used.

    References: https://bugzilla.suse.com/show_bug.cgi?id=1118689
    Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F')
    Signed-off-by: Joerg Roedel
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Dave Hansen
    Link: https://lkml.kernel.org/r/20190719184652.11391-4-joro@8bytes.org

    Joerg Roedel
     

13 Jul, 2019

7 commits

  • Vmalloc() is getting more and more used these days (kernel stacks, bpf and
    percpu allocator are new top users), and the total % of memory consumed by
    vmalloc() can be pretty significant and changes dynamically.

    /proc/meminfo is the best place to display this information: its top goal
    is to show top consumers of the memory.

    Since the VmallocUsed field in /proc/meminfo has not been in use for
    quite a long time (it has been defined to 0 by commit a5ad88ce8c7f
    ("mm: get rid of 'vmalloc_info' from /proc/meminfo")), let's reuse it
    for showing the actual physical memory consumption of vmalloc().
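
    A minimal sketch of the accounting behind the revived field
    (simplified from the commit): a global counter of pages backing
    vmalloc() allocations, updated on alloc/free and reported via
    /proc/meminfo:

    static atomic_long_t nr_vmalloc_pages;

    unsigned long vmalloc_nr_pages(void)
    {
            return atomic_long_read(&nr_vmalloc_pages);
    }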

    Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Link: http://lkml.kernel.org/r/20190607113509.15032-1-geert+renesas@glider.be
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Andrew Morton
    Acked-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Trigger a warning if an object that is about to be freed is already
    detached. We used to have a BUG_ON() there, but even though this is
    considered faulty behaviour, it is not a good reason to break the
    system.
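
    A minimal sketch of the check described above (simplified): warn and
    bail out instead of BUG() when the object is already detached:

    static void unlink_va(struct vmap_area *va, struct rb_root *root)
    {
            if (WARN_ON(RB_EMPTY_NODE(&va->rb_node)))
                    return;

            /* ... the actual rb_erase()/list_del() follow here ... */
    }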

    Link: http://lkml.kernel.org/r/20190606120411.8298-5-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Roman Gushchin
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • It does not make sense to try to "unlink" a node that is definitely not
    linked to any list or tree. On the first merge step the VA just points
    to the previously disconnected busy area.

    On the second step, check if the node has been merged and do "unlink" if
    so, because now it points to an object that must be linked.

    Link: http://lkml.kernel.org/r/20190606120411.8298-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Acked-by: Hillf Danton
    Reviewed-by: Roman Gushchin
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Refactor the NE_FIT_TYPE split case when it comes to the allocation of
    one extra object. We need it in order to build the remaining space.
    The preload is done per CPU in non-atomic context with GFP_KERNEL
    flags.

    More permissive parameters can be beneficial for systems which suffer
    from high memory pressure or low-memory conditions. For example, on my
    KVM system (4 CPUs, no swap, 256MB RAM) I can simulate the failure of
    page allocation with GFP_NOWAIT flags. Using the "stress-ng" tool and
    starting N workers spinning on fork() and exit(), I can trigger the
    trace below:

    [ 179.815161] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
    [ 179.815168] CPU: 0 PID: 12612 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #1003
    [ 179.815170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [ 179.815171] Call Trace:
    [ 179.815178] dump_stack+0x5c/0x7b
    [ 179.815182] warn_alloc+0x108/0x190
    [ 179.815187] __alloc_pages_slowpath+0xdc7/0xdf0
    [ 179.815191] __alloc_pages_nodemask+0x2de/0x330
    [ 179.815194] cache_grow_begin+0x77/0x420
    [ 179.815197] fallback_alloc+0x161/0x200
    [ 179.815200] kmem_cache_alloc+0x1c9/0x570
    [ 179.815202] alloc_vmap_area+0x32c/0x990
    [ 179.815206] __get_vm_area_node+0xb0/0x170
    [ 179.815208] __vmalloc_node_range+0x6d/0x230
    [ 179.815211] ? _do_fork+0xce/0x3d0
    [ 179.815213] copy_process.part.46+0x850/0x1b90
    [ 179.815215] ? _do_fork+0xce/0x3d0
    [ 179.815219] _do_fork+0xce/0x3d0
    [ 179.815226] ? __do_page_fault+0x2bf/0x4e0
    [ 179.815229] do_syscall_64+0x55/0x130
    [ 179.815231] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 179.815234] RIP: 0033:0x7fedec4c738b
    ...
    [ 179.815237] RSP: 002b:00007ffda469d730 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
    [ 179.815239] RAX: ffffffffffffffda RBX: 00007ffda469d730 RCX: 00007fedec4c738b
    [ 179.815240] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
    [ 179.815241] RBP: 00007ffda469d780 R08: 00007fededd6e300 R09: 00007ffda47f50a0
    [ 179.815242] R10: 00007fededd6e5d0 R11: 0000000000000246 R12: 0000000000000000
    [ 179.815243] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
    [ 179.815245] Mem-Info:
    [ 179.815249] active_anon:12686 inactive_anon:14760 isolated_anon:0
    active_file:502 inactive_file:61 isolated_file:70
    unevictable:2 dirty:0 writeback:0 unstable:0
    slab_reclaimable:2380 slab_unreclaimable:7520
    mapped:15069 shmem:14813 pagetables:10833 bounce:0
    free:1922 free_pcp:229 free_cma:0
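
    A minimal sketch of the per-CPU preload described above (simplified;
    the helper name here is illustrative, the per-CPU variable name is
    from the commit; the real code keeps preemption disabled until the
    subsequent spinlock is taken):

    static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);

    static void preload_ne_fit_object(int node)
    {
            preempt_disable();
            if (!__this_cpu_read(ne_fit_preload_node)) {
                    struct vmap_area *pva;

                    /* allocate in non-atomic context with GFP_KERNEL */
                    preempt_enable();
                    pva = kmem_cache_alloc_node(vmap_area_cachep,
                                                GFP_KERNEL, node);
                    preempt_disable();

                    /* someone else may have preloaded meanwhile */
                    if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
                            kmem_cache_free(vmap_area_cachep, pva);
            }
            preempt_enable();
    }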

    Link: http://lkml.kernel.org/r/20190606120411.8298-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Roman Gushchin
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Patch series "Some cleanups for the KVA/vmalloc", v5.

    This patch (of 4):

    Remove unused argument from the __alloc_vmap_area() function.

    Link: http://lkml.kernel.org/r/20190606120411.8298-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Drop the pgtable_t variable from all implementations of pte_fn_t, as
    none of them use it. apply_to_pte_range() should stop computing it as
    well. This should help us save some cycles.
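
    The simplified callback type after this change (the old variant also
    carried a pgtable_t argument that no implementation used):

    typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);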

    Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: Matthew Wilcox
    Cc: Ard Biesheuvel
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: "Kirill A. Shutemov"
    Cc: Dan Williams
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

09 Jul, 2019

1 commit

  • Pull arm64 updates from Catalin Marinas:

    - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}

    - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
    manage the permissions of executable vmalloc regions more strictly

    - Slight performance improvement by keeping softirqs enabled while
    touching the FPSIMD/SVE state (kernel_neon_begin/end)

    - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
    XAFLAG and AXFLAG instructions for floating point comparison flags
    manipulation) and FRINT (rounding floating point numbers to integers)

    - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
    BROKEN due to some bugs (now fixed)

    - Improve parking of stopped CPUs and implement an arm64-specific
    panic_smp_self_stop() to avoid warning on not being able to stop
    secondary CPUs during panic

    - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
    platforms

    - perf: DDR performance monitor support for iMX8QXP

    - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
    cope with a system cache info not exposed via the CPUID registers

    - Avoid warning on hardware cache line size greater than
    ARCH_DMA_MINALIGN if the system is fully coherent

    - arm64 do_page_fault() and hugetlb cleanups

    - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)

    - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
    'arm_boot_flags' introduced in 5.1)

    - CONFIG_RANDOMIZE_BASE now enabled in defconfig

    - Allow the selection of ARM64_MODULE_PLTS, currently only done via
    RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
    over into the vmalloc area

    - Make ZONE_DMA32 configurable

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
    perf: arm_spe: Enable ACPI/Platform automatic module loading
    arm_pmu: acpi: spe: Add initial MADT/SPE probing
    ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
    ACPI/PPTT: Modify node flag detection to find last IDENTICAL
    x86/entry: Simplify _TIF_SYSCALL_EMU handling
    arm64: rename dump_instr as dump_kernel_instr
    arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
    arm64: Implement panic_smp_self_stop()
    arm64: Improve parking of stopped CPUs
    arm64: Expose FRINT capabilities to userspace
    arm64: Expose ARMv8.5 CondM capability to userspace
    arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
    arm64: ARM64_MODULES_PLTS must depend on MODULES
    arm64: bpf: do not allocate executable memory
    arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
    arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
    arm64: module: create module allocations without exec permissions
    arm64: Allow user selection of ARM64_MODULE_PLTS
    acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
    arm64: Allow selecting Pseudo-NMI again
    ...

    Linus Torvalds
     

29 Jun, 2019

1 commit

  • gcc gets confused in pcpu_get_vm_areas() because there are too many
    branches that affect whether 'lva' was initialized before it gets used:

    mm/vmalloc.c: In function 'pcpu_get_vm_areas':
    mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    insert_vmap_area_augment(lva, &va->rb_node,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    &free_vmap_area_root, &free_vmap_area_list);
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/vmalloc.c:916:20: note: 'lva' was declared here
    struct vmap_area *lva;
    ^~~

    Add an initialization to NULL, and check whether it has changed before
    the first use.
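
    A minimal sketch of the fix (simplified):

    struct vmap_area *lva = NULL;

    /* ... the NE_FIT_TYPE branch allocates and fills lva ... */

    if (lva)        /* instead of re-testing type == NE_FIT_TYPE */
            insert_vmap_area_augment(lva, &va->rb_node,
                            &free_vmap_area_root, &free_vmap_area_list);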

    [akpm@linux-foundation.org: tweak comments]
    Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
    Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

25 Jun, 2019

1 commit


03 Jun, 2019

2 commits

  • In a rare case, flush_tlb_kernel_range() could be called with a start
    higher than the end.

    In vm_remove_mappings(), in case page_address() returns 0 for all pages
    (for example they were all in highmem), _vm_unmap_aliases() will be
    called with start = ULONG_MAX, end = 0 and flush = 1.

    If at the same time, the vmalloc purge operation is triggered by something
    else while the current operation is between remove_vm_area() and
    _vm_unmap_aliases(), then the vm mapping just removed will already have
    been purged. In this case the call to vm_unmap_aliases() may not find
    any other mappings to flush and so ends up flushing start = ULONG_MAX,
    end = 0. So
    only set flush = true if we find something in the direct mapping that we
    need to flush, and this way this can't happen.
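
    A minimal sketch of the fix inside vm_remove_mappings() (simplified
    from the commit): remember whether any direct-map page was actually
    found, and pass that as the flush decision:

    int flush_dmap = 0;
    unsigned long start = ULONG_MAX, end = 0;

    for (i = 0; i < area->nr_pages; i++) {
            unsigned long addr =
                    (unsigned long)page_address(area->pages[i]);

            if (addr) {     /* page has a direct mapping */
                    start = min(addr, start);
                    end = max(addr + PAGE_SIZE, end);
                    flush_dmap = 1;
            }
    }

    _vm_unmap_aliases(start, end, flush_dmap);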

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Meelis Roos
    Cc: Nadav Amit
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
    Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     
  • The calculation of the direct map address range to flush was wrong.
    This could cause the RO direct map alias to not get flushed. Today
    this shouldn't be a problem because this flush is only needed on x86
    right now and the spurious fault handler will fix cached RO->RW
    translations. In the future though, it could cause the permissions
    to remain RO in the TLB for the direct map alias, and then the page
    would return from the page allocator to some other component as RO
    and cause a crash.

    So fix the address range calculation so that the flush will include the
    direct map range.

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Meelis Roos
    Cc: Nadav Amit
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
    Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

02 Jun, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

19 May, 2019

3 commits

  • This macro adds some debug code to check that vmap allocations happen
    in ascending order.

    By default this option is set to 0 and is not active. Activating it
    requires recompilation of the kernel: set the macro to 1, then compile
    the kernel.

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • This macro adds some debug code to check that the augmented tree is
    maintained correctly, meaning that every node contains a valid
    subtree_max_size value.

    By default this option is set to 0 and is not active. Activating it
    requires recompilation of the kernel: set the macro to 1, then compile
    the kernel.

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Patch series "improve vmap allocation", v3.

    Objective
    ---------

    Please have a look for the description at:

    https://lkml.org/lkml/2018/10/19/786

    but let me also summarize it a bit here as well.

    The current implementation has O(N) complexity. Requests with different
    permissive parameters can lead to long allocation times. When I say
    "long" I mean milliseconds.

    Description
    -----------

    This approach organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range, i.e. an allocation is done over free-area lookups
    instead of finding a hole between two busy blocks. It allows having a
    lower number of objects representing the free space, and therefore a
    less fragmented memory allocator, because free blocks are always as
    large as possible.

    It uses the augmented tree where all free areas are sorted in ascending
    order of their va->va_start address, paired with a linked list that
    provides O(1) access to prev/next elements.

    Since the tree is augmented, we also maintain the "subtree_max_size" of
    each VA that reflects the maximum available free block in its left or
    right sub-tree. Knowing that, we can easily traverse toward the lowest
    (left-most path) free area.

    Allocation: ~O(log(N)) complexity. It is a sequential allocation method
    and therefore tends to maximize locality. The search is done until the
    first suitable block large enough to encompass the requested parameters
    is found. Bigger areas are split.

    I copy-paste here the description of how an area is split, since I
    described it in https://lkml.org/lkml/2018/10/19/786

    A free block can be split by three different ways. Their names are
    FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they
    correspond to how requested size and alignment fit to a free block.

    FL_FIT_TYPE - in this case a free block is just removed from the free
    list/tree because it fully fits. Compared with the current design there
    is extra work updating the rb-tree.

    LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case we just
    cut a free block. It is as fast as the current design. Most vmalloc
    allocations end up with this case, because the edge is always aligned
    to 1.

    NE_FIT_TYPE - a much less common case. Basically it happens when the
    requested size and alignment fit neither the left nor the right edge,
    i.e. the allocation lands between them. In this case during splitting
    we have to build a remaining left free area and place it back on the
    free list/tree.

    Compared with the current design there are two extra steps. First, we
    have to allocate a new vmap_area structure. Second, we have to insert
    that remaining free block into the address-sorted list/tree.

    In order to optimize the first step there is a cache with free_vmap
    objects. Instead of allocating from slab we just take an object from
    the cache and reuse it.

    The second step is pretty optimized. Since we know a start point in the
    tree, we do not search from the top. Instead, the traversal begins from
    the rb-tree node we split.

    De-allocation: ~O(log(N)) complexity. An area is not inserted straight
    into the tree/list; instead we identify the spot first, checking if it
    can be merged with its neighbors. The list provides O(1) access to
    prev/next, so this check is pretty fast. Summarizing: if merged, large
    coalesced areas are created; if not, the area is just linked, making
    more fragments.

    There is one more thing I should mention here. After modification of a
    VA node, its subtree_max_size is updated if it was/is the biggest area
    in its left or right sub-tree. Apart from that, the change can also be
    propagated back to upper levels to fix the tree. For more details
    please have a look at the __augment_tree_propagate_from() function and
    its description.
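
    A sketch of the "lowest match" descent described above (simplified;
    the upstream function also handles a backtracking corner case):

    static struct vmap_area *
    find_vmap_lowest_match(unsigned long size, unsigned long align,
                           unsigned long vstart)
    {
            struct rb_node *node = free_vmap_area_root.rb_node;

            /* Adjust the search size for alignment overhead. */
            unsigned long length = size + align - 1;

            while (node) {
                    struct vmap_area *va =
                            rb_entry(node, struct vmap_area, rb_node);

                    if (get_subtree_max_size(node->rb_left) >= length &&
                                    vstart < va->va_start) {
                            node = node->rb_left;   /* left subtree fits */
                    } else {
                            if (va_size(va) >= length &&
                                            va->va_end - length >= vstart)
                                    return va;      /* this node fits */

                            node = node->rb_right;  /* higher addresses */
                    }
            }

            return NULL;
    }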

    Tests and stressing
    -------------------

    I use the "test_vmalloc.sh" test driver, available under
    "tools/testing/selftests/vm/" since the 5.1-rc1 kernel. Just trigger
    "sudo ./test_vmalloc.sh" to find out how to deal with it.

    Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
    Regarding the last one, I do not have physical access to a NUMA system,
    therefore I emulated it. The time of stressing is days.

    If you run the test driver in "stress mode", you also need the patch that
    is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it:

    http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

    After massive testing, I have not identified any problems like memory
    leaks, crashes or kernel panics. I find it stable, but more testing
    would be good.

    Performance analysis
    --------------------

    I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz and
    the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel,
    whereas the HiKey960 uses 4.15. Both systems could run 5.1-rc1 as well,
    but those results were not ready by the time I was writing this.

    Currently the driver consists of 8 tests. Three of them correspond to
    the different types of splitting described above (to compare with the
    default). The other 5 do allocations under different conditions.

    a) sudo ./test_vmalloc.sh performance

    When the test driver is run in "performance" mode, it runs all
    available tests pinned to the first online CPU with sequential
    execution order. We do it in order to get stable and repeatable
    results. Take a look at the time difference in
    "long_busy_list_alloc_test". It is not surprising, because the worst
    case is O(N).

    # i5-3320M
    How many cycles all tests took:
    CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

    # Hikey960 8x CPUs
    How many cycles all tests took:
    CPU0=3478683207 cycles vs CPU0=463767978 cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

    b) time sudo ./test_vmalloc.sh test_repeat_count=1

    With this configuration, all tests are run on all available online
    CPUs. Before running, each CPU shuffles its test execution order, which
    gives random allocation behaviour. So it is a rough comparison, but it
    gives the overall picture for sure.

    # i5-3320M
              default          patched
    real      101m22.813s      0m56.805s
    user      0m0.011s         0m0.015s
    sys       0m5.076s         0m0.023s

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

    # Hikey960 8x CPUs
              default          patched
    real      unknown          4m25.214s
    user      unknown          0m0.011s
    sys       unknown          0m0.670s

    I did not manage to complete this test on the default Hikey960 kernel
    version. After 24 hours it was still running, therefore I had to cancel
    it. That is why real/user/sys are "unknown".

    This patch (of 3):

    Currently an allocation of a new vmap area is done over busy-list
    iteration (complexity O(N)) until a suitable hole is found between two
    busy areas. Therefore each new allocation causes the list to grow. Due
    to an over-fragmented list and different permissive parameters an
    allocation can take a long time. For example on embedded devices it is
    milliseconds.

    This patch organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range. It uses an augmented red-black tree that keeps
    blocks sorted by their offsets, paired with a linked list keeping the
    free space in order of increasing addresses.

    Nodes are augmented with the size of the maximum available free block
    in their left or right sub-tree. That allows deciding on and traversing
    toward the block that will fit and will have the lowest start address,
    i.e. sequential allocation.

    Allocation: to allocate a new block, a search is done over the tree
    until a suitable lowest (left-most) block is found that is large enough
    to encompass the requested size, alignment and vstart point. If the
    block is bigger than the requested size, it is split.

    De-allocation: when a busy vmap area is freed, it can either be merged
    or inserted into the tree. The red-black tree allows efficiently
    finding the spot, whereas the linked list provides constant-time access
    to the previous and next blocks, to check whether merging can be done.
    When a de-allocated memory chunk is merged, a large coalesced area is
    created.

    Complexity: ~O(log(N))

    [urezki@gmail.com: v3]
    Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

15 May, 2019

2 commits

  • The vmap_lazy_nr variable has the atomic_t type, which is a 4-byte
    integer on both 32- and 64-bit systems. lazy_max_pages() deals with an
    "unsigned long", which is 8 bytes on a 64-bit system, thus vmap_lazy_nr
    should be 8 bytes on 64-bit as well.
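
    A minimal sketch of the type change:

    /* before: static atomic_t vmap_lazy_nr = ATOMIC_INIT(0); */
    static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);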

    Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()")
    introduced some preempt points, one of which prioritizes an allocation
    over the lazy freeing of vmap areas.

    Prioritizing an allocation over freeing does not work well all the
    time, i.e. it should rather be a compromise.

    1) The number of lazy pages directly influences the busy list length
    and thus operations like allocation, lookup, unmap, remove, etc.

    2) Under heavy stress of the vmalloc subsystem I ran into a situation
    where memory usage kept increasing until hitting the out_of_memory ->
    panic state, due to complete blocking of the logic that frees vmap
    areas in the __purge_vmap_area_lazy() function.

    Establish a threshold past which the freeing is prioritized back over
    allocation, creating a balance between the two.

    Using the vmalloc test driver in "stress mode", i.e. when all available
    test cases are run simultaneously on all online CPUs, applying pressure
    on the vmalloc subsystem, my HiKey 960 board runs out of memory because
    the __purge_vmap_area_lazy() logic simply is not able to free pages in
    time.

    How I run it:

    1) You should build your kernel with CONFIG_TEST_VMALLOC=m
    2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

    During this test the number of "vmap_lazy_nr" pages goes far beyond the
    acceptable lazy_max_pages() threshold, which leads to an enormous busy
    list size and other problems, including allocation time and so on.
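
    A minimal sketch of the threshold in the purge loop (simplified from
    the commit): keep rescheduling only while the lazy backlog stays below
    twice lazy_max_pages(); past that, freeing takes priority:

    resched_threshold = lazy_max_pages() << 1;

    llist_for_each_entry_safe(va, n_va, valist, purge_list) {
            unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;

            __free_vmap_area(va);
            atomic_long_sub(nr, &vmap_lazy_nr);

            if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
                    cond_resched_lock(&vmap_purge_lock);
    }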

    Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

30 Apr, 2019

1 commit

  • Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to
    immediately clear executable TLB entries before freeing pages, and handle
    resetting permissions on the directmap. This flag is useful for any kind
    of memory with elevated permissions, or where there can be related
    permissions changes on the directmap. Today this is RO+X and RO memory.

    Although this enables directly vfreeing non-writable memory now, such
    memory cannot be freed in an interrupt context, because the allocation
    itself is used as a node on the deferred free list. So when RO memory
    needs to be freed in an interrupt, the code doing the vfree needs to
    have its own work queue, as was the case before the deferred vfree list
    was added to vmalloc.

    For architectures with set_direct_map_ implementations this whole operation
    can be done with one TLB flush when centralized like this. For others with
    directmap permissions, currently only arm64, a backup method using
    set_memory functions is used to reset the directmap. When arm64 adds
    set_direct_map_ functions, this backup can be removed.

    When the TLB is flushed to both remove TLB entries for the vmalloc range
    mapping and the direct map permissions, the lazy purge operation could be
    done to try to save a TLB flush later. However today vm_unmap_aliases
    could flush a TLB range that does not include the directmap. So a helper
    is added with extra parameters that can allow both the vmalloc address and
    the direct mapping to be flushed during this operation. The behavior of the
    normal vm_unmap_aliases function is unchanged.
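
    A hypothetical usage sketch (the flag and the
    set_vm_flush_reset_perms() helper are from the commit; the surrounding
    calls are illustrative):

    void *p = module_alloc(size);

    set_vm_flush_reset_perms(p);    /* mark VM_FLUSH_RESET_PERMS */
    set_memory_ro((unsigned long)p, npages);
    set_memory_x((unsigned long)p, npages);

    /* ... run code from p ... */

    vfree(p);   /* TLB entries cleared and direct map permissions
                   reset before the pages return to the allocator */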

    Suggested-by: Dave Hansen
    Suggested-by: Andy Lutomirski
    Suggested-by: Will Deacon
    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190426001143.4983-17-namit@vmware.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

06 Mar, 2019

9 commits

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted at all which makes kernel-doc script
    unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Some kernel-doc comments in mm/vmalloc.c have a leading tab in their
    indentation. This leads to excessive indentation in the generated HTML
    and to inconsistency of its layout ([1] vs [2]).

    Besides, multi-line Note: sections are not handled properly with extra
    indentation.

    [1] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vm_map_ram
    [2] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vfree

    Link: http://lkml.kernel.org/r/1549549644-4903-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • One of the vmalloc stress test cases triggers the kernel BUG():

    [60.562151] ------------[ cut here ]------------
    [60.562154] kernel BUG at mm/vmalloc.c:512!
    [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
    [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390

    It can happen due to a big align request resulting in overflow of the
    calculated address, i.e. it becomes 0 after ALIGN()'s fixup.

    Fix it by checking whether the calculated address is within the
    vstart/vend range.
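
    A minimal sketch of the added check (simplified): after the ALIGN()
    fixup, bail out if the address wrapped or left the requested range:

    if (addr + size > vend || addr < vstart)
            goto overflow;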

    Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Export the __vmalloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE
    is enabled. Some test cases in the vmalloc test suite module require
    and make use of that function. Please note that it is not supposed to
    be used for other purposes.

    We need it only for performance analysis, stressing and stability check
    of vmalloc allocator.

    Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Shuah Khan
    Cc: Oleksiy Avramchenko
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • vmalloc_user*() calls differ from plain vmalloc() only in that they set
    the VM_USERMAP flag for the area. After the whole history of vmalloc.c
    changes, it is now possible simply to pass the VM_USERMAP flag directly
    to the __vmalloc_node_range() call, instead of finding the area (which
    obviously takes time) after the allocation.
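
    A minimal sketch of the simplified vmalloc_user() after this change
    (condensed from the commit):

    void *vmalloc_user(unsigned long size)
    {
            return __vmalloc_node_range(size, SHMLBA,
                            VMALLOC_START, VMALLOC_END,
                            GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
                            VM_USERMAP, NUMA_NO_NODE,
                            __builtin_return_address(0));
    }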

    Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Acked-by: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Joe Perches
    Cc: "Luis R. Rodriguez"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • __vmalloc_area_node() calls vfree() on the error path, which in turn
    calls kmemleak_free(), but the area is not yet accounted for by
    kmemleak_vmalloc().

    Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Joe Perches
    Cc: "Luis R. Rodriguez"
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • When VM_NO_GUARD is not set, area->size includes the adjacent guard
    page, thus for correct size checking get_vm_area_size() should be used,
    not area->size.

    This fixes a possible kernel oops when userspace tries to mmap an area
    1 page bigger than was allocated by the vmalloc_user() call: the size
    check inside remap_vmalloc_range_partial() accounts for the
    non-existent guard page as well, so the check passes, but
    vmalloc_to_page() returns NULL (the guard page does not physically
    exist).
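
    A minimal sketch of the fix in remap_vmalloc_range_partial()
    (simplified): check against the usable size, which excludes the guard
    page, instead of the raw area->size:

    if (kaddr + size > area->addr + get_vm_area_size(area))
            return -EINVAL;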

    The following code pattern example should trigger an oops:

    static int oops_mmap(struct file *file, struct vm_area_struct *vma)
    {
            void *mem;

            mem = vmalloc_user(4096);
            BUG_ON(!mem);
            /* Do not care about mem leak */

            return remap_vmalloc_range(vma, mem, 0);
    }

    And userspace simply mmaps size + PAGE_SIZE:

    mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);

    Possible candidates for oops which do not have any explicit size
    checks:

    *** drivers/media/usb/stkwebcam/stk-webcam.c:
    v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0);

    Or the following one:

    *** drivers/video/fbdev/core/fbmem.c
    static int
    fb_mmap(struct file *file, struct vm_area_struct * vma)
    ...
    res = fb->fb_mmap(info, vma);

    Where fb_mmap callback calls remap_vmalloc_range() directly without any
    explicit checks:

    *** drivers/video/fbdev/vfb.c
    static int vfb_mmap(struct fb_info *info,
                        struct vm_area_struct *vma)
    {
            return remap_vmalloc_range(vma, (void *)info->fix.smem_start,
                                       vma->vm_pgoff);
    }

    Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Acked-by: Michal Hocko
    Cc: Andrey Ryabinin
    Cc: Joe Perches
    Cc: "Luis R. Rodriguez"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • This patch repeats the original one from David S. Miller:

    2dca6999eed5 ("mm, perf_event: Make vmalloc_user() align base kernel
    virtual address to SHMLBA")

    but for the missed vmalloc_32_user() case, which also requires correct
    alignment of the virtual address on the kernel side to avoid D-cache
    aliases. A bit of copy-paste from the original patch to refresh the
    memory of what it is all about:

    When a vmalloc'd area is mmap'd into userspace, some kind of
    co-ordination is necessary for this to work on platforms with cpu
    D-caches which can have aliases.

    Otherwise kernel side writes won't be seen properly in userspace and
    vice versa.

    If the kernel side mapping and the user side one have the same
    alignment, modulo SHMLBA, this can work as long as VM_SHARED is set on
    the VMA, and for all current users this is true. VM_SHARED will force
    SHMLBA alignment of the user side mmap on platforms where D-cache
    aliasing matters.

    David S. Miller

    > What are the user-visible runtime effects of this change?

    In simple words: proper alignment avoids a possible difference in the
    data seen through different virtual mappings: userspace and kernel in
    our case. I.e. userspace reads cache line A, the kernel writes to cache
    line B. Both cache lines correspond to the same physical memory (thus
    aliases).

    So this should fix data corruption for archs with VIVT and VIPT caches,
    e.g. armv6. Personally I've never worked with these archs; I just
    spotted the strange difference in the code: for one case we do the
    alignment, for the other we don't. I have a strong feeling that David
    simply missed the vmalloc_32_user() case.

    >
    > Is a -stable backport needed?

    No, I do not think so. The only user of vmalloc_32_user() is the
    virtual frame buffer device drivers/video/fbdev/vfb.c, which says in
    its description: "The main use of this frame buffer device is testing
    and debugging the frame buffer subsystem. Do NOT enable it for normal
    systems!"

    And it seems to me that this vfb.c does not need 32-bit addressable
    pages (the vmalloc_32_user() case), because it is a virtual device and
    should not care about things like dma32 zones, etc. It is probably
    better to clean up the code and switch vfb.c from vmalloc_32_user() to
    vmalloc_user(), and wipe out vmalloc_32_user() from vmalloc.c
    completely. But I'm not very sure that this is worth doing; it's so
    minor, so we can leave it as is.

    Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Reviewed-by: Andrew Morton
    Cc: Stephen Rothwell
    Cc: Michal Hocko
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • find_vmap_area() can return a NULL pointer and we're going to
    dereference it without checking it first. Use the existing
    find_vm_area() function which does exactly what we want and checks for
    the NULL pointer.
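
    A minimal sketch of the fix (simplified): reuse the checked
    find_vm_area() result instead of dereferencing find_vmap_area()
    unchecked:

    area = find_vm_area(addr);
    if (unlikely(!area)) {
            WARN(1, "Trying to vfree() nonexistent vm area (%p)\n", addr);
            return;
    }

    debug_check_no_locks_freed(area->addr, get_vm_area_size(area));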

    Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
    Fixes: f3c01d2f3ade ("mm: vmalloc: avoid racy handling of debugobjects in vunmap")
    Signed-off-by: Liviu Dudau
    Reviewed-by: Andrew Morton
    Cc: Chintan Pandya
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liviu Dudau