20 Oct, 2020

2 commits

  • Pull fuse updates from Miklos Szeredi:

    - Support directly accessing host page cache from virtiofs. This can
    improve I/O performance for various workloads, as well as reduce the
    memory requirement by eliminating double caching. Thanks to Vivek
    Goyal for doing most of the work on this.

    - Allow automatic submounting inside virtiofs. This allows unique
    st_dev/st_ino values to be assigned inside the guest to files
    residing on different filesystems on the host. Thanks to Max Reitz
    for the patches.

    - Fix an old use-after-free bug found by Pradeep P V K.

    * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
    virtiofs: calculate number of scatter-gather elements accurately
    fuse: connection remove fix
    fuse: implement crossmounts
    fuse: Allow fuse_fill_super_common() for submounts
    fuse: split fuse_mount off of fuse_conn
    fuse: drop fuse_conn parameter where possible
    fuse: store fuse_conn in fuse_req
    fuse: add submount support to <uapi>
    fuse: fix page dereference after free
    virtiofs: add logic to free up a memory range
    virtiofs: maintain a list of busy elements
    virtiofs: serialize truncate/punch_hole and dax fault path
    virtiofs: define dax address space operations
    virtiofs: add DAX mmap support
    virtiofs: implement dax read/write operations
    virtiofs: introduce setupmapping/removemapping commands
    virtiofs: implement FUSE_INIT map_alignment field
    virtiofs: keep a list of free dax memory ranges
    virtiofs: add a mount option to enable dax
    virtiofs: set up virtio_fs dax_device
    ...

    Linus Torvalds
     
  • Pull zonefs updates from Damien Le Moal:
    "Add an 'explicit-open' mount option to automatically issue a
    REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
    is opened for writing for the first time.

    This avoids 'insufficient zone resources' errors for write operations
    on some drives with limited zone resources or on ZNS drives with a
    limited number of active zones. From Johannes"

    * tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
    zonefs: document the explicit-open mount option
    zonefs: open/close zone on file open/close
    zonefs: provide no-lock zonefs_io_error variant
    zonefs: introduce helper for zone management

    Linus Torvalds
     

19 Oct, 2020

38 commits

  • …/kernel/git/shuah/linux-kselftest

    Pull more KUnit updates from Shuah Khan:

    - add KUnit to kernel_init() and remove KUnit from init calls entirely.

    This addresses the concern that KUnit would not work correctly during
    the late init phase.

    - add a linker section where KUnit can put references to its test
    suites.

    This is the first step in transitioning to dispatching all KUnit
    tests from a centralized executor rather than having each as its own
    separate late_initcall.

    - add a centralized executor to dispatch tests rather than relying on
    late_initcall to schedule each test suite separately. Centralized
    execution is for built-in tests only; modules will execute tests when
    loaded.

    - convert bitfield test to use KUnit framework

    - Documentation updates for naming guidelines and how
    kunit_test_suite() works.

    - add test plan to KUnit TAP format

    * tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
    lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE
    lib: kunit: add bitfield test conversion to KUnit
    Documentation: kunit: add a brief blurb about kunit_test_suite
    kunit: test: add test plan to KUnit TAP format
    init: main: add KUnit to kernel init
    kunit: test: create a single centralized executor for all tests
    vmlinux.lds.h: add linker section for KUnit test suites
    Documentation: kunit: Add naming guidelines

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:

    - Debugging for smp_call_function()

    - RT raw/non-raw lock ordering fixes

    - Strict grace periods for KASAN

    - New smp_call_function() torture test

    - Torture-test updates

    - Documentation updates

    - Miscellaneous fixes

    [ This doesn't actually pull the tag - I've dropped the last merge from
    the RCU branch due to questions about the series. - Linus ]

    * tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
    smp: Make symbol 'csd_bug_count' static
    kernel/smp: Provide CSD lock timeout diagnostics
    smp: Add source and destination CPUs to __call_single_data
    rcu: Shrink each possible cpu krcp
    rcu/segcblist: Prevent useless GP start if no CBs to accelerate
    torture: Add gdb support
    rcutorture: Allow pointer leaks to test diagnostic code
    rcutorture: Hoist OOM registry up one level
    refperf: Avoid null pointer dereference when buf fails to allocate
    rcutorture: Properly synchronize with OOM notifier
    rcutorture: Properly set rcu_fwds for OOM handling
    torture: Add kvm.sh --help and update help message
    rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05
    torture: Update initrd documentation
    rcutorture: Replace HTTP links with HTTPS ones
    locktorture: Make function torture_percpu_rwsem_init() static
    torture: document --allcpus argument added to the kvm.sh script
    rcutorture: Output number of elapsed grace periods
    rcutorture: Remove KCSAN stubs
    rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp()
    ...

    Linus Torvalds
     
  • Pull mailbox updates from Jassi Brar:

    - arm: implementation of mhu as a doorbell driver and conversion of
    dt-bindings to json-schema

    - mediatek: fix platform_get_irq error handling

    - bcm: convert tasklets to use new tasklet_setup api

    - core: fix race caused by hrtimer starting inappropriately

    * tag 'mailbox-v5.10' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
    mailbox: avoid timer start from callback
    maiblox: mediatek: Fix handling of platform_get_irq() error
    mailbox: arm_mhu: Add ARM MHU doorbell driver
    mailbox: arm_mhu: Match only if compatible is "arm,mhu"
    dt-bindings: mailbox: add doorbell support to ARM MHU
    dt-bindings: mailbox : arm,mhu: Convert to Json-schema
    mailbox: bcm: convert tasklets to use new tasklet_setup() API

    Linus Torvalds
     
  • Pull coccinelle updates from Julia Lawall.

    * 'for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
    coccinelle: api: add kfree_mismatch script
    coccinelle: iterators: Add for_each_child.cocci script
    scripts: coccicheck: Change default condition for parallelism
    scripts: coccicheck: Add quotes to improve portability
    coccinelle: api: kfree_sensitive: print memset position
    coccinelle: misc: add flexible_array.cocci script
    coccinelle: api: add kvmalloc script
    scripts: coccicheck: Change default value for parallelism
    coccinelle: misc: add excluded_middle.cocci script
    scripts: coccicheck: Improve error feedback when coccicheck fails
    coccinelle: api: update kzfree script to kfree_sensitive
    coccinelle: misc: add uninitialized_var.cocci script
    coccinelle: ifnullfree: add vfree(), kvfree*() functions
    coccinelle: api: add kobj_to_dev.cocci script
    coccinelle: add patch rule for dma_alloc_coherent
    scripts: coccicheck: Add chain mode to list of modes

    Linus Torvalds
     
  • Merge yet more updates from Andrew Morton:
    "Subsystems affected by this patch series: mm (memcg, migration,
    pagemap, gup, madvise, vmalloc), ia64, and misc"

    * emailed patches from Andrew Morton: (31 commits)
    mm: remove duplicate include statement in mmu.c
    mm: remove the filename in the top of file comment in vmalloc.c
    mm: cleanup the gfp_mask handling in __vmalloc_area_node
    mm: remove alloc_vm_area
    x86/xen: open code alloc_vm_area in arch_gnttab_valloc
    xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv
    drm/i915: use vmap in i915_gem_object_map
    drm/i915: stop using kmap in i915_gem_object_map
    drm/i915: use vmap in shmem_pin_map
    zsmalloc: switch from alloc_vm_area to get_vm_area
    mm: allow a NULL fn callback in apply_to_page_range
    mm: add a vmap_pfn function
    mm: add a VM_MAP_PUT_PAGES flag for vmap
    mm: update the documentation for vfree
    mm/madvise: introduce process_madvise() syscall: an external memory hinting API
    pid: move pidfd_get_pid() to pid.c
    mm/madvise: pass mm to do_madvise
    selftests/vm: 10x speedup for hmm-tests
    binfmt_elf: take the mmap lock around find_extend_vma()
    mm/gup_benchmark: take the mmap lock around GUP
    ...

    Linus Torvalds
     
  • Pull UML updates from Richard Weinberger:

    - Improve support for non-glibc systems

    - Vector: Add support for scripting and dynamic tap devices

    - Various fixes for the vector networking driver

    - Various fixes for time travel mode

    * tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
    um: vector: Add dynamic tap interfaces and scripting
    um: Clean up stacktrace dump
    um: Fix incorrect assumptions about max pid length
    um: Remove dead usage of TIF_IA32
    um: Remove redundant NULL check
    um: change sigio_spinlock to a mutex
    um: time-travel: Return the sequence number in ACK messages
    um: time-travel: Fix IRQ handling in time_travel_handle_message()
    um: Allow static linking for non-glibc implementations
    um: Some fixes to build UML with musl
    um: vector: Use GFP_ATOMIC under spin lock
    um: Fix null pointer dereference in vector_user_bpf

    Linus Torvalds
     
  • Pull more ubi and ubifs updates from Richard Weinberger:
    "UBI:
    - Correctly use kthread_should_stop in ubi worker

    UBIFS:
    - Fixes for memory leaks while iterating directory entries
    - Fix for a user triggerable error message
    - Fix for a space accounting bug in authenticated mode"

    * tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
    ubifs: journal: Make sure to not dirty twice for auth nodes
    ubifs: setflags: Don't show error message when vfs_ioc_setflags_prepare() fails
    ubifs: ubifs_jnl_change_xattr: Remove assertion 'nlink > 0' for host inode
    ubi: check kthread_should_stop() after the setting of task state
    ubifs: dent: Fix some potential memory leaks while iterating entries
    ubifs: xattr: Fix some potential memory leaks while iterating entries

    Linus Torvalds
     
  • Pull ubifs updates from Richard Weinberger:

    - Kernel-doc fixes

    - Fixes for memory leaks in authentication option parsing

    * tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
    ubifs: mount_ubifs: Release authentication resource in error handling path
    ubifs: Don't parse authentication mount options in remount process
    ubifs: Fix a memleak after dumping authentication mount options
    ubifs: Fix some kernel-doc warnings in tnc.c
    ubifs: Fix some kernel-doc warnings in replay.c
    ubifs: Fix some kernel-doc warnings in gc.c
    ubifs: Fix 'hash' kernel-doc warning in auth.c

    Linus Torvalds
     
    asm/sections.h is included more than once; remove the duplicate
    include.

    Signed-off-by: Tian Tao
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Link: https://lkml.kernel.org/r/1600088607-17327-1-git-send-email-tiantao6@hisilicon.com
    Signed-off-by: Linus Torvalds

    Tian Tao
     
  • No point in having the filename inside the file.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "two small vmalloc cleanups".

    This patch (of 2):

    __vmalloc_area_node currently has four different gfp_t variables to
    just express this simple logic:

    - use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if
    suitable) for the underlying page allocation
    - use just the reclaim flags from the passed in mask plus __GFP_ZERO
    for allocating the page array

    Simplify this down to just use the pre-existing nested_gfp as-is for
    the page array allocation, and just the passed in gfp_mask for the
    page allocation, after conditionally ORing __GFP_HIGHMEM into it. This
    also makes the allocation warning a little more correct.

    Also initialize two variables at the time of declaration while touching
    this area.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • All users are gone now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Replace the last call to alloc_vm_area with an open coded version using an
    iterator in struct gnttab_vm_area instead of the triple indirection magic
    in alloc_vm_area.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-11-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
    Replacing alloc_vm_area with get_vm_area_caller + apply_to_page_range
    makes it possible to fill in the phys_addr values directly instead of
    doing another loop over all addresses.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-10-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • i915_gem_object_map implements fairly low-level vmap functionality in a
    driver. Split it into two helpers, one for remapping kernel memory which
    can use vmap, and one for I/O memory that uses vmap_pfn.

    The only practical difference is that alloc_vm_area prefaults the
    vmalloc area PTEs, which doesn't seem to be required here for the
    kernel memory case (and could be added to vmap using a flag if
    actually required).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Tvrtko Ursulin
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-9-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • kmap for !PageHighmem is just a convoluted way to say page_address, and
    kunmap is a no-op in that case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Tvrtko Ursulin
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-8-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
    shmem_pin_map somewhat awkwardly reimplements vmap using alloc_vm_area
    and manual pte setup. The only practical difference is that
    alloc_vm_area prefaults the vmalloc area PTEs, which doesn't seem to
    be required here (and could be added to vmap using a flag if actually
    required). Switch to use vmap, and use vfree to free both the vmalloc
    mapping and the page array, as well as dropping the references to
    each page.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Tvrtko Ursulin
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Just manually pre-fault the PTEs using apply_to_page_range.

    Co-developed-by: Minchan Kim
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Besides calling the callback on each page, apply_to_page_range also has
    the effect of pre-faulting all PTEs for the range. To support callers
    that only need the pre-faulting, make the callback optional.

    Based on a patch from Minchan Kim.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Add a proper helper to remap PFNs into kernel virtual space so that
    drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
    for it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Add a flag so that vmap takes ownership of the passed in page array. When
    vfree is called on such an allocation it will put one reference on each
    page, and free the page array itself.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "remove alloc_vm_area", v4.

    This series removes alloc_vm_area, which was left over from the big
    vmalloc interface rework. It is a rather arcane interface, basically
    the equivalent of get_vm_area + actually faulting in all PTEs in the
    allocated area. It was originally added for Xen (which isn't modular
    to start with), and then grew users in zsmalloc and i915 which mostly
    qualify as abuses of the interface, especially for i915, as a random
    driver should not set up PTE bits directly.

    This patch (of 11):

    * Document that you can call vfree() on an address returned from vmap()
    * Remove the note about the minimum size -- the minimum size of a vmalloc
    allocation is one page
    * Add a Context: section
    * Fix capitalisation
    * Reword the prohibition on calling from NMI context to avoid a double
    negative

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Tvrtko Ursulin
    Cc: Chris Wilson
    Cc: Matthew Auld
    Cc: Rodrigo Vivi
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Nitin Gupta
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002122204.1534411-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
    There are use cases where system management software (SMS) wants to
    give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in
    the case of Android, that software is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon(ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall
    process_madvise(2). It uses the pidfd of an external process to give
    the hint. It also supports vectored address ranges, because an
    Android app has thousands of VMAs due to zygote, so it would be a
    waste of CPU and power to issue the syscall one by one for each VMA.
    (Testing a 2000-VMA syscall against a single vectored syscall showed
    a 15% performance improvement; the gain would likely be bigger in
    practice because the test ran in a very cache-friendly environment.)

    Another potential use case for the vector range is to amortize the
    cost of TLB shootdowns for multiple ranges when using MADV_DONTNEED;
    this could benefit users like TCP receive zerocopy and malloc
    implementations. In the future we may find more use cases for other
    advice values, so it makes sense to shape this as a general API while
    introducing the new syscall. With that, existing madvise(2) users
    could switch to process_madvise(2) with their own pid if they want
    batched address-range support.

    Since it can affect another process's address range, only a process
    that is privileged (PTRACE_MODE_ATTACH_FSCREDS) or otherwise entitled
    to ptrace the target (e.g., by having the same UID) can use it
    successfully. The flag argument is reserved for future use if we need
    to extend the API.

    Supporting every hint that madvise has (or will have) in
    process_madvise is rather risky: we cannot be sure all hints make
    sense coming from an external process, and a hint's implementation
    may rely on the caller being in the current context, so it could be
    error-prone. Thus, this patch limits the hints to MADV_[COLD|PAGEOUT].

    If someone wants to add other hints, we can hear the use case and
    review it for each hint. That is safer for maintenance than
    introducing a buggy syscall that is hard to fix later.

    So finally, the API is as follows,

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
    unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or
    directions to the kernel about the address ranges of an external
    process as well as the local process. It provides the advice for the
    address ranges of the process described by iovec and vlen. The goal
    of such advice is to improve system or application performance.

    The pidfd argument selects the process referred to by the PID file
    descriptor specified in pidfd. (See pidfd_open(2) for further
    information.)

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

    struct iovec {
        void  *iov_base;    /* starting address */
        size_t iov_len;     /* number of bytes to be advised */
    };

    The iovec describes address ranges beginning at address (iov_base)
    and with size length of bytes (iov_len).

    The vlen represents the number of elements in iovec.

    The advice is indicated in the advice argument, which is one of the
    following at this moment if the target process specified by pidfd is
    external.

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to an external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    process_madvise supports every advice that madvise(2) has if the
    target process is in the same thread group as the calling process, so
    users can employ process_madvise(2) as an extension of madvise(2)
    with vectored address-range support.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised.
    This return value may be less than the total number of requested
    bytes if an error occurred. The caller should check the return value
    to determine whether partial advice occurred.

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    is forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How is the race (i.e., object validation) handled between an
    external process giving a hint and the target process acting on it?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the target
    process can run between the time the process_madvise caller inspects
    the target address space and the time that process_madvise is
    actually called, process_madvise may operate on memory regions that
    the calling process does not expect. It is the responsibility of the
    process calling process_madvise to close this race condition. For
    example, the calling process can suspend the target process with
    ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an
    opportunity to change its own address space before process_madvise
    is called. Another option is to operate on memory regions that the
    caller knows a priori will be unchanged in the target process. Yet
    another option is to accept the race for certain process_madvise
    calls after reasoning that mistargeting will do no harm. The
    suggested API itself does not provide synchronization. The same
    applies to other APIs, such as move_pages and process_vm_write.

    The race isn't really a problem, though. Why is it so wrong to
    require that callers do their own synchronization in some manner?
    Nobody objects to write(2) merely because it's possible for two
    processes to open the same file and clobber each other's writes;
    instead, we tell people to use flock or something. Think about mmap:
    it never guarantees that newly allocated address space is still
    valid when the user tries to access it, because other threads could
    unmap the memory right before. That is where we need synchronization
    via another API or by design on the userspace side; it shouldn't be
    part of the API itself. If someone needs more fine-grained
    synchronization than the process level, two ideas have been
    suggested: a cookie[2] and an anon-fd[3]. Both could be supported
    via the last reserved argument of the API, but that doesn't seem
    necessary right now since there are already ways to prevent the
    race, so we don't want to add the complexity of a more fine-grained
    model.

    To keep the API extensible, an unsigned long is reserved as the last
    argument so this could be supported in the future if someone really
    needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not
    work for us, because the injected madvise would have to be executed
    by the target process, which means that process would have to be
    runnable; that creates the risk of the above-mentioned race and of
    hinting the wrong VMA. Furthermore, we want to apply the hint in the
    caller's context, not the callee's, because the callee is usually
    confined by cpuset/cgroups or even in a frozen state, so it can't
    act on its own quickly enough, which causes more thrashing/kills. It
    also doesn't work if the target process is being ptraced (e.g., by
    strace, a debugger, or minidump), because a process can have at most
    one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The process_madvise syscall needs the pidfd_get_pid function to translate
    a pidfd to a pid, so this patch moves the function to kernel/pid.c.

    Suggested-by: Alexander Duyck
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Alexander Duyck
    Reviewed-by: Vlastimil Babka
    Acked-by: Christian Brauner
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Brian Geffon
    Cc: Daniel Colascione
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "introduce memory hinting API for external process", v9.

    We now have MADV_PAGEOUT and MADV_COLD as madvise hinting APIs. With
    them, an application can give the kernel hints about which memory
    ranges it prefers to be reclaimed. However, on some platforms (e.g.,
    Android), the information required to make the hinting decision is not
    known to the app. Instead, it is known to a centralized userspace
    daemon (e.g., ActivityManagerService), and that daemon must be able to
    initiate reclaim on its own without any app involvement.

    To address this, this patch introduces a new syscall:
    process_madvise(2). Basically, it's the same as the madvise(2) syscall
    but with some differences:

    1. It needs a pidfd of the target process to provide the hint.

    2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
    moment. Other madvise hints will be opened up when there are explicit
    requests from the community, to prevent unexpected bugs from hints we
    couldn't support.

    3. Only privileged processes can operate on another process's
    address space.

    For more detail of the new API, please see "mm: introduce external memory
    hinting API" description in this patchset.

    This patch (of 3):

    In upcoming patches, do_madvise will be called from external process
    context, so we shouldn't assume "current" is always the hinted
    process's task_struct.

    Furthermore, we must not access mm_struct via task->mm, but obtain it via
    access_mm() once (in the following patch) and only use that pointer [1],
    so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
    safe, so we can use them further down the call stack.

    For now, let's pass current->mm as the argument to do_madvise(); this
    doesn't change existing behavior but prepares for the next patch and
    makes review easy.

    [vbabka@suse.cz: changelog tweak]
    [minchan@kernel.org: use current->mm for io_uring]
    Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
    [akpm@linux-foundation.org: fix it for upstream changes]
    [akpm@linux-foundation.org: whoops]
    [rdunlap@infradead.org: add missing includes]

    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Daniel Colascione
    Cc: Sandeep Patil
    Cc: Sonny Rao
    Cc: Brian Geffon
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: John Dias
    Cc: Joel Fernandes
    Cc: Alexander Duyck
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Kirill Tkhai
    Cc: Oleksandr Natalenko
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch reduces the running time of hmm-tests from about 10+ seconds
    to just under 1.0 second, for an approximately 10x speedup. That brings
    it in line with most of the other tests in selftests/vm, which mostly
    run in < 1 sec.

    This is done with a one-line change that simply reduces the number of
    iterations of several tests, from 256, to 10. Thanks to Ralph Campbell
    for suggesting changing NTIMES as a way to get the speedup.

    Suggested-by: Ralph Campbell
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Cc: SeongJae Park
    Cc: Shuah Khan
    Link: https://lkml.kernel.org/r/20201003011721.44238-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • create_elf_tables() runs after setup_new_exec(), so other tasks can
    already access our new mm and do things like process_madvise() on it. (At
    the time I'm writing this commit, process_madvise() is not in mainline
    yet, but has been in akpm's tree for some time.)

    While I believe that there are currently no APIs that would actually allow
    another process to mess up our VMA tree (process_madvise() is limited to
    MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
    under which no syscalls have been executed yet), this seems like an
    accident waiting to happen.

    Let's make sure that we always take the mmap lock around GUP paths as long
    as another process might be able to see the mm.

    (Yes, this diff looks suspicious because we drop the lock before doing
    anything with `vma`, but that's because we actually don't do anything with
    it apart from the NULL check.)

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Michel Lespinasse
    Cc: "Eric W . Biederman"
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Mauro Carvalho Chehab
    Cc: Sakari Ailus
    Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • To be safe against concurrent changes to the VMA tree, we must take the
    mmap lock around GUP operations (excluding the GUP-fast family of
    operations, which will take the mmap lock by themselves if necessary).

    This code is only for testing, and it's only reachable by root through
    debugfs, so this doesn't really have any impact; however, if we want to
    add lockdep asserts into the GUP path, we need to have clean locking here.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Acked-by: Michel Lespinasse
    Cc: "Eric W . Biederman"
    Cc: Mauro Carvalho Chehab
    Cc: Sakari Ailus
    Link: https://lkml.kernel.org/r/CAG48ez3SG6ngZLtasxJ6LABpOnqCz5-QHqb0B4k44TQ8F9n6+w@mail.gmail.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • There are two locations that have a block of code for munmapping a vma
    range. Change those two locations to use a function and add meaningful
    comments about what happens to the arguments, which was unclear in the
    previous code.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There are three places where the next vma is required, all using the
    same block of code. Replace the block with a function and add comments
    on what happens when NULL is encountered.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There is no need to check whether this process has the right to modify
    the specified process when they are the same. We can also skip the
    security hook call when a process is modifying its own pages. Add a
    helper function to handle these cases.

    Suggested-by: Matthew Wilcox
    Signed-off-by: Hongxiang Lou
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christopher Lameter
    Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • For hotplug, we need to check the node id of the page to calculate the
    correct node to migrate it to. A wrapper for alloc_migration_target()
    exists for this purpose.

    However, Vlastimil points out that all migration source pages come
    from a single node. In this case, we don't need to check the node id
    for each page, and we don't need to re-set the target nodemask for
    each page via the wrapper. Set up the migration_target_control once
    and use it for all pages.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a well-defined standard migration target callback. Use it
    directly.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If the memcg to charge can be determined (using the remote charging
    API), there is no reason to exclude allocations made from an interrupt
    context from the accounting.

    Such allocations will pass even if the resulting memcg size exceeds
    the hard limit, but they will affect the application of memory
    pressure, and an inability to put the workload under the limit will
    eventually trigger the OOM killer.

    To use active_memcg() helper, memcg_kmem_bypass() is moved back to
    memcontrol.c.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The remote memcg charging API uses current->active_memcg to store the
    currently active memory cgroup, which overrides the memory cgroup of
    the current process. It works well for normal contexts, but doesn't
    work for interrupt contexts: indeed, if an interrupt occurs during the
    execution of a section with an active memcg set, all allocations inside
    the interrupt will be charged to that active memcg (given that we'll
    enable accounting for allocations from an interrupt context). But
    because the interrupt may have no relation to the active memcg set
    outside, this is obviously wrong from the accounting perspective.

    To resolve this problem, let's add a global percpu int_active_memcg
    variable, which will be used to store an active memory cgroup which will
    be used from interrupt contexts. set_active_memcg() will transparently
    use current->active_memcg or int_active_memcg depending on the context.

    To keep the read side simple and transparent for the caller, let's
    introduce two new functions:
    - struct mem_cgroup *active_memcg(void),
    - struct mem_cgroup *get_active_memcg(void).

    They return the active memcg if one is set, hiding all implementation
    details of where to get it depending on the current context.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • There are checks for current->mm and current->active_memcg in
    get_obj_cgroup_from_current(), but these checks are redundant:
    memcg_kmem_bypass(), called just above, performs the same checks.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations were left unaccounted, mostly because
    charging the memory cgroup of the current process wasn't an option.
    Performance was likely a reason too.

    The remote charging API allows temporarily overriding the currently
    active memory cgroup, so that all memory allocations are accounted
    towards some specified memory cgroup instead of the memory cgroup of
    the current process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory
    accounting will likely be the first user: a typical example is a bpf
    program parsing an incoming network packet, which allocates an entry
    in a hashmap to store some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    memcg value, which overwrites the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context;
    however, an attempt to call it from an interrupt context, or simply
    nesting two remote charging blocks, leads to incorrect accounting: on
    exit from the inner block, the active memcg is cleared instead of
    being restored.

    memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);

    memalloc_unuse_memcg();

    Error: allocations here are charged to the memcg of the current
    process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);

    This patch is heavily based on the patch by Johannes Weiner, which can be
    found here: https://lkml.org/lkml/2020/5/28/806 .

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin