30 Dec, 2020

2 commits

  • [ Upstream commit eefbfa7fd678805b38a46293e78543f98f353d3e ]

    The rcu_read_lock()/rcu_read_unlock() pair can only guarantee that the memcg
    will not be freed; it cannot guarantee that a css reference can still be
    taken safely with css_get().

    If the whole process of a cgroup offlining completes between reading the
    objcg->memcg pointer and bumping the css reference on another CPU, and there
    are exactly 0 external references to this memory cgroup (so how did we get
    to obj_cgroup_charge() in the first place?), css_get() can change the ref
    counter from 0 back to 1.
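
    For illustration, the safe lookup boils down to roughly the following (an
    illustrative sketch, not the verbatim upstream change): re-read
    objcg->memcg under RCU and loop on css_tryget(), which fails once the
    refcount has dropped to zero, instead of resurrecting the counter with
    css_get():

        static struct mem_cgroup *memcg_from_objcg_sketch(struct obj_cgroup *objcg)
        {
                struct mem_cgroup *memcg;

                rcu_read_lock();
        retry:
                memcg = obj_cgroup_memcg(objcg);
                /*
                 * A failed tryget means the memcg went offline; objcg->memcg
                 * is (or is about to be) reparented, so simply read it again.
                 */
                if (unlikely(!css_tryget(&memcg->css)))
                        goto retry;
                rcu_read_unlock();

                return memcg;
        }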

    Link: https://lkml.kernel.org/r/20201028035013.99711-2-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
     
  • [ Upstream commit 2f7659a314736b32b66273dbf91c19874a052fde ]

    Consider the following memcg hierarchy.

       root
       /  \
      A    B

    If we fail to get a reference on the objcg of memcg A,
    get_obj_cgroup_from_current() can end up returning the wrong objcg for the
    root memcg.
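
    A rough sketch of the fixed lookup (simplified for illustration; assume
    memcg starts at the current task's memcg and objcg is NULL): walk up the
    hierarchy and reset objcg whenever obj_cgroup_tryget() fails, so a stale
    pointer to A's objcg can never be returned once the loop terminates at the
    root memcg:

        for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
                objcg = rcu_dereference(memcg->objcg);
                if (objcg && obj_cgroup_tryget(objcg))
                        break;
                objcg = NULL;   /* don't carry a failed child objcg upwards */
        }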

    Link: https://lkml.kernel.org/r/20201029164429.58703-1-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Eugene Syromiatnikov
    Cc: Suren Baghdasaryan
    Cc: Adrian Reber
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
     

23 Nov, 2020

1 commit

    If we reparent slab objects to the root memcg, then freeing such an object
    must still update the per-memcg vmstats so that they stay correct for the
    root memcg. Right now this at least affects the NR_KERNEL_STACK_KB vmstat
    for !CONFIG_VMAP_STACK kernels whose thread stack size is smaller than
    PAGE_SIZE.

    David said:
    "I assume that without this fix that the root memcg's vmstat would
    always be inflated if we reparented"
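
    A simplified sketch of the idea (the exact upstream helper may differ in
    detail, and RCU locking is omitted): fall back to the node-only counter
    just for untracked objects, and let objects owned by the root memcg, e.g.
    after reparenting, go through the regular per-memcg path:

        struct page *page = virt_to_head_page(p);
        struct mem_cgroup *memcg = mem_cgroup_from_obj(p);
        struct lruvec *lruvec;

        if (!memcg) {
                /* untracked object: update the node counter only */
                __mod_node_page_state(page_pgdat(page), idx, val);
        } else {
                /* includes the root memcg, so reparented objects stay correct */
                lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
                __mod_lruvec_state(lruvec, idx, val);
        }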

    Fixes: ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christopher Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: [5.3+]
    Link: https://lkml.kernel.org/r/20201110031015.15715-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     

03 Nov, 2020

2 commits

  • Richard reported a warning which can be reproduced by running the LTP
    madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

    WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Modules linked in:
    CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
    Workqueue: events drain_local_stock
    RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Call Trace:
    __memcg_kmem_uncharge (mm/memcontrol.c:3022)
    drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
    drain_local_stock (mm/memcontrol.c:2255)
    process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
    worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:292)
    ret_from_fork (arch/x86/entry/entry_64.S:300)

    The problem occurs because in the non-hierarchical mode non-root page
    counters are not linked to root page counters, so the charge is not
    propagated to the root memory cgroup.

    After the removal of the original memory cgroup and reparenting of the
    object cgroup, the root cgroup might be uncharged by draining an objcg
    stock, for example. This leads to an eventual underflow of the charge and
    triggers the warning.

    Fix it by linking all page counters to corresponding root page counters
    in the non-hierarchical mode.
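
    Conceptually, the fix amounts to initializing each counter with the
    corresponding root counter as its parent instead of NULL in the
    non-hierarchical branch (a sketch of the shape of the change, assuming the
    counters are set up in mem_cgroup_css_alloc()):

        /* non-hierarchical mode: link to the root counters instead of NULL */
        page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
        page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
        page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
        page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);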

    Please note, that in the non-hierarchical mode all objcgs are always
    reparented to the root memory cgroup, even if the hierarchy has more
    than 1 level. This patch doesn't change it.

    The patch also doesn't affect how the hierarchical mode is working,
    which is the only sane and truly supported mode now.

    Thanks to Richard for reporting, debugging and providing an alternative
    version of the fix!

    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Reported-by: Richard Palethorpe
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
    Debugged-by: Richard Palethorpe
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    memcg_page_state() returns the count for the specified item in the
    hierarchical memcg. When the item is NR_ANON_THPS the counter is kept in
    units of huge pages, so it should be multiplied by HPAGE_PMD_NR rather than
    treated as single pages.
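
    For illustration, the v1 show path needs something along these lines (a
    sketch, not the exact upstream hunk; it assumes the memcg1_stats[] /
    memcg1_stat_names[] arrays used there): scale the NR_ANON_THPS value by
    HPAGE_PMD_NR before converting pages to bytes:

        unsigned long nr = memcg_page_state(memcg, memcg1_stats[i]);

        if (memcg1_stats[i] == NR_ANON_THPS)
                nr *= HPAGE_PMD_NR;     /* this counter is in huge pages */
        seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
                   (u64)nr * PAGE_SIZE);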

    [akpm@linux-foundation.org: fix printk warning]
    [akpm@linux-foundation.org: use u64 cast, per Michal]

    Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
    Signed-off-by: zhongjiang-ali
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/1603722395-72443-1-git-send-email-zhongjiang-ali@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    zhongjiang-ali
     

19 Oct, 2020

5 commits

    If a memcg to charge can be determined (using the remote charging API),
    there is no reason to exclude allocations made from an interrupt context
    from the accounting.

    Such allocations will still succeed even if they push the memcg size over
    the hard limit, but they do add to the memcg's memory pressure, and an
    inability to bring the workload back under the limit will eventually
    trigger the OOM killer.

    To be able to use the active_memcg() helper, memcg_kmem_bypass() is moved
    back to memcontrol.c.
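
    The bypass check then reduces to something like the following sketch
    (simplified; it relies on the active_memcg() helper introduced elsewhere
    in this series): only bypass accounting when no memcg to charge can be
    determined at all:

        static bool memcg_kmem_bypass(void)
        {
                /* A remote memcg is always a valid charge target. */
                if (unlikely(active_memcg()))
                        return false;

                /* Otherwise the memcg to charge can't be determined. */
                if (in_interrupt() || !current->mm ||
                    (current->flags & PF_KTHREAD))
                        return true;

                return false;
        }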

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    The remote memcg charging API uses current->active_memcg to store the
    currently active memory cgroup, which overrides the memory cgroup of the
    current process. It works well for normal contexts but not for interrupt
    contexts: if an interrupt occurs while a section with an active memcg is
    executing, all allocations inside the interrupt will be charged to that
    active memcg (once we enable accounting for allocations from an interrupt
    context). But because the interrupt may have no relation to the active
    memcg set outside of it, this is obviously wrong from the accounting
    perspective.

    To resolve this problem, let's add a global percpu int_active_memcg
    variable, which will be used to store an active memory cgroup which will
    be used from interrupt contexts. set_active_memcg() will transparently
    use current->active_memcg or int_active_memcg depending on the context.

    To make the read part simple and transparent for the caller, let's
    introduce two new functions:
    - struct mem_cgroup *active_memcg(void),
    - struct mem_cgroup *get_active_memcg(void).

    They return the active memcg if one is set, hiding the implementation
    detail of where to read it from in the current context.
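
    A sketch of what the read side then looks like (close to the description
    above; exact details may differ):

        static __always_inline struct mem_cgroup *active_memcg(void)
        {
                if (in_interrupt())
                        return this_cpu_read(int_active_memcg);
                else
                        return current->active_memcg;
        }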

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    There are checks for current->mm and current->active_memcg in
    get_obj_cgroup_from_current(), but these checks are redundant:
    memcg_kmem_bypass(), called just above, performs the same checks.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations went unaccounted mostly because charging
    the memory cgroup of the current process wasn't an option; performance was
    likely a concern as well.

    The remote charging API allows temporarily overriding the currently active
    memory cgroup, so that all memory allocations are accounted towards some
    specified memory cgroup instead of the memory cgroup of the current
    process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be the first user: a typical example is a bpf program parsing
    an incoming network packet and allocating an entry in a hashmap to store
    some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    memcg value that overrides the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context, however
    an attempt to call it from an interrupt context, or simply nesting two
    remote charging blocks, leads to incorrect accounting: on exit from the
    inner block the active memcg is cleared instead of being restored.

    memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);

    memalloc_unuse_memcg();

    Error: allocations here are charged to the memcg of the current
    process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);

    This patch is heavily based on the patch by Johannes Weiner, which can be
    found here: https://lkml.org/lkml/2020/5/28/806 .
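
    In its simplest form, covering only normal task context as in this patch
    (an interrupt-aware variant was added later, see the 19 Oct entries above),
    the helper is just a swap of current->active_memcg, roughly:

        static inline struct mem_cgroup *
        set_active_memcg(struct mem_cgroup *memcg)
        {
                struct mem_cgroup *old = current->active_memcg;

                current->active_memcg = memcg;
                return old;
        }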

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

14 Oct, 2020

12 commits

  • The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
    NULL before checking is_device_private_entry() so device private pages are
    never handled. Fix this by checking for non_swap_entry() after handling
    device private swap PTEs.

    I assume the memory cgroup accounting would be off somehow when moving
    a process to another memory cgroup. Currently, the device private page
    is charged like a normal anonymous page when allocated and is uncharged
    when the page is freed so I think that path is OK.
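
    The reordered checks look roughly like this (a sketch; the page reference
    counting and MOVE_ANON handling of the real function are omitted):

        swp_entry_t ent = pte_to_swp_entry(ptent);

        /* device private pages are stored as special swap entries */
        if (is_device_private_entry(ent))
                return device_private_entry_to_page(ent);

        /* only now reject the remaining non-swap entries */
        if (non_swap_entry(ent))
                return NULL;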

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Jerome Glisse
    Cc: Balbir Singh
    Cc: Ira Weiny
    Link: https://lkml.kernel.org/r/20201009215952.2726-1-rcampbell@nvidia.com
    xFixes: c733a82874a7 ("mm/memcontrol: support MEMORY_DEVICE_PRIVATE")
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
    Since commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
    counter"), mem_cgroup_unmark_under_oom() was added and the comment of
    mem_cgroup_oom_unlock() was moved here. But this comment makes no sense
    here because mem_cgroup_oom_lock() does not operate on the under_oom field,
    so reword the comment. [Thanks to Michal Hocko for rewording this comment.]

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/20200930095336.21323-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
    Cgroup v1 has a numa_stat interface. This is useful for providing
    visibility into the NUMA locality information within a memcg, since pages
    are allowed to be allocated from any physical node. One of the use cases is
    evaluating application performance by combining this information with the
    application's CPU allocation. Cgroup v2 has no such interface, so this
    patch adds the missing information.

    Suggested-by: Shakeel Butt
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Zefan Li
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Randy Dunlap
    Link: https://lkml.kernel.org/r/20200916100030.71698-2-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     
    The swap page counter is v2 only, while memsw is v1 only. As the v1 and v2
    controllers cannot be active at the same time, there is no point in keeping
    both the swap and the memsw page counters in mem_cgroup. The previous patch
    has made sure that the memsw page counter is updated and accessed only in
    v1 code paths, so it is now safe to alias the v1 memsw page counter to the
    v2 swap page counter. This shaves 14 longs off struct mem_cgroup, a saving
    of 112 bytes on 64-bit archs.

    While at it, also document which page counters are used in v1 and/or v2.
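
    The aliasing boils down to an anonymous union (shown here as a stand-alone
    sketch rather than the full struct mem_cgroup definition):

        /* sketch: the two counters share storage, only one is ever used */
        struct mem_cgroup_counters_sketch {
                struct page_counter memory;             /* both v1 and v2 */
                union {
                        struct page_counter swap;       /* v2 only */
                        struct page_counter memsw;      /* v1 only */
                };
        };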

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Chris Down
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200914024452.19167-4-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
    mem_cgroup_get_max() used to compute the memory+swap maximum from both the
    v1 memsw and the v2 memory+swap page counters and return the larger of the
    two values. This is redundant; it is more efficient to read just the v1 or
    the v2 value, depending on which one is currently in use.

    [longman@redhat.com: v4]
    Link: https://lkml.kernel.org/r/20200914150928.7841-1-longman@redhat.com

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Chris Down
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200914024452.19167-3-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Patch series "mm/memcg: Miscellaneous cleanups and streamlining", v2.

    This patch (of 3):

    Since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") and
    commit 00501b531c47 ("mm: memcontrol: rewrite charge API") in v3.17, the
    enum charge_type was no longer used anywhere. However, the enum itself
    was not removed at that time. Remove the obsolete enum charge_type now.

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Chris Down
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200914024452.19167-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20200914024452.19167-2-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Since commit bbec2e15170a ("mm: rename page_counter's count/limit into
    usage/max"), the arg @reclaim has no priority field anymore.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/20200913094129.44558-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
    mem_cgroup_from_obj() checks the lowest bit of the page->mem_cgroup pointer
    to determine whether the page has an attached obj_cgroup vector instead of
    a regular memcg pointer. If the bit is not set, it simply returns the
    page->mem_cgroup value as a struct mem_cgroup pointer.

    Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for
    all allocations") changed the moment when this bit is set: previously it
    was set when the slab page was allocated; now it can be set well after
    that, when the first accounted object is allocated on the page.

    It opened a race: if page->mem_cgroup is set concurrently after the first
    page_has_obj_cgroups(page) check, a pointer to the obj_cgroups array can
    be returned as a memory cgroup pointer.

    A simple NULL check on the page->mem_cgroup pointer before the
    page_has_obj_cgroups() check fixes the race. Indeed, if the pointer is not
    NULL, it's either a plain mem_cgroup pointer or a pointer to an obj_cgroup
    vector. The pointer can be asynchronously changed from NULL to
    (obj_cgroup_vec | 0x1UL), but can't be changed from a valid memcg pointer
    to an objcg vector or back.

    If the object passed to mem_cgroup_from_obj() is a slab object and
    page->mem_cgroup is NULL, it means that the object is not accounted, so
    the function must return NULL.
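
    A minimal sketch of the race-free read (illustration only; the per-object
    objcg resolution that the real function performs for slab pages is
    omitted): read page->mem_cgroup once, treat NULL as "not accounted", and
    only then interpret the low bit:

        struct page *page = virt_to_head_page(p);
        unsigned long val = (unsigned long)READ_ONCE(page->mem_cgroup);

        if (!val)
                return NULL;            /* not accounted (yet) */

        if (val & 0x1UL)
                /* obj_cgroup vector: resolve the objcg per object (omitted) */
                return NULL;

        return (struct mem_cgroup *)val;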

    I discovered the race by reading the code; so far I haven't seen it in the
    wild.

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Link: https://lkml.kernel.org/r/20200910022435.2773735-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Use the preferred form for passing the size of a structure type. The
    alternative form where the structure type is spelled out hurts readability
    and introduces an opportunity for a bug when the object type is changed
    but the corresponding object identifier to which the sizeof operator is
    applied is not.
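
    For example (struct foo is a hypothetical type, purely for illustration):

        struct foo *ptr;

        /* preferred: the size always follows ptr's type */
        ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);

        /* discouraged: silently wrong if ptr's type is ever changed */
        ptr = kmalloc(sizeof(struct foo), GFP_KERNEL);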

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/773e013ff2f07fe2a0b47153f14dea054c0c04f1.1596214831.git.gustavoars@kernel.org
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • Make use of the flex_array_size() helper to calculate the size of a
    flexible array member within an enclosing structure.

    This helper offers defense-in-depth against potential integer overflows,
    while at the same time makes it explicitly clear that we are dealing with
    a flexible array member.

    Also, remove unnecessary braces.
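
    For example (hypothetical container structure, for illustration only):
    flex_array_size(p, member, count) evaluates to count * sizeof(*p->member),
    saturating instead of wrapping on overflow:

        struct thresholds_sketch {
                unsigned int size;
                struct mem_cgroup_threshold entries[];
        };

        /* copy old->size trailing elements without an open-coded multiply */
        memcpy(new->entries, old->entries,
               flex_array_size(old, entries, old->size));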

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Link: https://lkml.kernel.org/r/ddd60dae2d9aea1ccdd2be66634815c93696125e.1596214831.git.gustavoars@kernel.org
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • The current code does not protect against swapoff of the underlying
    swap device, so this is a bug fix as well as a worthwhile reduction in
    code complexity.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

27 Sep, 2020

1 commit

    We forgot to add the suffix to the workingset_restore string, so fix it.

    Also update the cgroup-v2.rst documentation accordingly.

    Fixes: 170b04b7ae49 ("mm/workingset: prepare the workingset detection infrastructure for anon LRU")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Tejun Heo
    Cc: Zefan Li
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Randy Dunlap
    Link: https://lkml.kernel.org/r/20200916100030.71698-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     

25 Sep, 2020

1 commit


06 Sep, 2020

1 commit

    syzbot has reported a use-after-free in the uncharge_batch path:

    BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
    BUG: KASAN: use-after-free in atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
    BUG: KASAN: use-after-free in atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
    BUG: KASAN: use-after-free in page_counter_cancel mm/page_counter.c:54 [inline]
    BUG: KASAN: use-after-free in page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
    Write of size 8 at addr ffff8880371c0148 by task syz-executor.0/9304

    CPU: 0 PID: 9304 Comm: syz-executor.0 Not tainted 5.8.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1f0/0x31e lib/dump_stack.c:118
    print_address_description+0x66/0x620 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report+0x132/0x1d0 mm/kasan/report.c:530
    check_memory_region_inline mm/kasan/generic.c:183 [inline]
    check_memory_region+0x2b5/0x2f0 mm/kasan/generic.c:192
    instrument_atomic_write include/linux/instrumented.h:71 [inline]
    atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
    atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
    page_counter_cancel mm/page_counter.c:54 [inline]
    page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
    uncharge_batch+0x6c/0x350 mm/memcontrol.c:6764
    uncharge_page+0x115/0x430 mm/memcontrol.c:6796
    uncharge_list mm/memcontrol.c:6835 [inline]
    mem_cgroup_uncharge_list+0x70/0xe0 mm/memcontrol.c:6877
    release_pages+0x13a2/0x1550 mm/swap.c:911
    tlb_batch_pages_flush mm/mmu_gather.c:49 [inline]
    tlb_flush_mmu_free mm/mmu_gather.c:242 [inline]
    tlb_flush_mmu+0x780/0x910 mm/mmu_gather.c:249
    tlb_finish_mmu+0xcb/0x200 mm/mmu_gather.c:328
    exit_mmap+0x296/0x550 mm/mmap.c:3185
    __mmput+0x113/0x370 kernel/fork.c:1076
    exit_mm+0x4cd/0x550 kernel/exit.c:483
    do_exit+0x576/0x1f20 kernel/exit.c:793
    do_group_exit+0x161/0x2d0 kernel/exit.c:903
    get_signal+0x139b/0x1d30 kernel/signal.c:2743
    arch_do_signal+0x33/0x610 arch/x86/kernel/signal.c:811
    exit_to_user_mode_loop kernel/entry/common.c:135 [inline]
    exit_to_user_mode_prepare+0x8d/0x1b0 kernel/entry/common.c:166
    syscall_exit_to_user_mode+0x5e/0x1a0 kernel/entry/common.c:241
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Commit 1a3e1f40962c ("mm: memcontrol: decouple reference counting from
    page accounting") reworked the memcg lifetime to be bound to the struct
    page rather than to charges. It also removed the css_put_many() from
    uncharge_batch(), and that is causing the above splat.

    uncharge_batch() is supposed to uncharge accumulated charges for all pages
    freed from the same memcg. The queuing is done by uncharge_page(), which
    however drops the memcg reference after it adds the charges to the batch.
    If the current page happens to be the last one holding a reference to its
    memcg, the memcg can go away, and the next page to be freed will trigger a
    batched uncharge that needs to access a memcg which is gone already.

    Fix the issue by taking a reference for the memcg in the current batch.

    Fixes: 1a3e1f40962c ("mm: memcontrol: decouple reference counting from page accounting")
    Reported-by: syzbot+b305848212deec86eabe@syzkaller.appspotmail.com
    Reported-by: syzbot+b5ea6fb6f139c8b9482b@syzkaller.appspotmail.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Hugh Dickins
    Link: https://lkml.kernel.org/r/20200820090341.GC5033@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

14 Aug, 2020

1 commit

  • Commit 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the
    parent cgroup") adds memory tracking to the memcg kernel structures
    themselves to make cgroups liable for the memory they are consuming
    through the allocation of child groups (which can be significant).

    This code is a bit awkward as it's spread out through several functions:
    The outermost function does memalloc_use_memcg(parent) to set up
    current->active_memcg, which designates which cgroup to charge, and the
    inner functions pass GFP_ACCOUNT to request charging for specific
    allocations. To make sure this dependency is satisfied at all times -
    to make sure we don't randomly charge whoever is calling the functions -
    the inner functions warn on !current->active_memcg.

    However, this triggers a false warning when the root memcg itself is
    allocated. No parent exists in this case, and so current->active_memcg
    is rightfully NULL. It's a false positive, not indicative of a bug.

    Delete the warnings for now, we can revisit this later.

    Fixes: 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
    Signed-off-by: Johannes Weiner
    Reported-by: Stephen Rothwell
    Acked-by: Roman Gushchin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Aug, 2020

4 commits

  • Drop the repeated word "down".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-6-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • To prepare the workingset detection for anon LRU, this patch splits
    workingset event counters for refault, activate and restore into anon and
    file variants, as well as the refaults counter in struct lruvec.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Memory cgroups are using large chunks of percpu memory to store vmstat
    data. Yet this memory is not accounted at all, so in the case when there
    are many (dying) cgroups, it's not exactly clear where all the memory is.

    Because the size of the memory cgroup internal structures can dramatically
    exceed the size of the object or page that is pinning them in memory, it's
    not a good idea to simply ignore it. It actually breaks the isolation
    between cgroups.

    Let's account the consumed percpu memory to the parent cgroup.

    [guro@fb.com: add WARN_ON_ONCE()s, per Johannes]
    Link: http://lkml.kernel.org/r/20200811170611.GB1507044@carbon.DHCP.thefacebook.com

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200623184515.4132564-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Percpu memory can represent a noticeable chunk of the total memory
    consumption, especially on big machines with many CPUs. Let's track
    percpu memory usage for each memcg and display it in memory.stat.

    A percpu allocation is usually scattered over multiple pages (and nodes),
    and can be significantly smaller than a page. So let's add a byte-sized
    counter on the memcg level: MEMCG_PERCPU_B. The byte-sized vmstat
    infrastructure created for slabs can be reused perfectly for the percpu
    case.

    [guro@fb.com: v3]
    Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Dennis Zhou
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Tobin C. Harding
    Cc: Vlastimil Babka
    Cc: Waiman Long
    Cc: Bixuan Cui
    Cc: Michal Koutný
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

08 Aug, 2020

9 commits

  • When an outside process lowers one of the memory limits of a cgroup (or
    uses the force_empty knob in cgroup1), direct reclaim is performed in the
    context of the write(), in order to directly enforce the new limit and
    have it being met by the time the write() returns.

    Currently, this reclaim activity is accounted as memory pressure in the
    cgroup that the writer(!) belongs to. This is unexpected. It
    specifically causes problems for senpai
    (https://github.com/facebookincubator/senpai), which is an agent that
    routinely adjusts the memory limits and performs associated reclaim work
    in tens or even hundreds of cgroups running on the host. The cgroup that
    senpai is running in itself will report elevated levels of memory
    pressure, even though it itself is under no memory shortage or any sort of
    distress.

    Move the psi annotation from the central cgroup reclaim function to the
    callsites in the allocation context, and thereby no longer count any
    limit-setting reclaim as memory pressure. If the newly set limit pushes
    the workload inside the cgroup into direct reclaim, that of course will
    continue to count as memory pressure.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 8c8c383c04f6 ("mm: memcontrol: try harder to set a new
    memory.high") inadvertently removed a callback to recalculate the
    writeback cache size in light of a newly configured memory.high limit.

    Without letting the writeback cache know about a potentially heavily
    reduced limit, it may permit too many dirty pages, which can cause
    unnecessary reclaim latencies or even avoidable OOM situations.

    This was spotted while reading the code; it isn't known to have caused any
    problems in practice so far.

    Fixes: 8c8c383c04f6 ("mm: memcontrol: try harder to set a new memory.high")
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/20200728135210.379885-1-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Memcg oom killer invocation is synchronized by the global oom_lock, and
    tasks sleep on the lock while somebody is selecting the victim; they can
    also race with the oom_reaper releasing the victim's memory. This can
    result in a pointless oom killer invocation because a waiter might be
    racing with the oom_reaper:

    P1                     oom_reaper                 P2
                           oom_reap_task              mutex_lock(oom_lock)
                                                      out_of_memory
                                                        # no victim because we have one already
                           __oom_reap_task_mm         mutex_unlock(oom_lock)
    mutex_lock(oom_lock)
                           set MMF_OOM_SKIP
    select_bad_process
      # finds a new victim

    The page allocator prevents this race by attempting the allocation again
    after the lock can be acquired (in __alloc_pages_may_oom()), which acts as
    a last minute check. Moreover, the page allocator doesn't block on the
    oom_lock and simply retries the whole reclaim process.

    Memcg oom killer should do the last minute check as well. Call
    mem_cgroup_margin to do that. Trylock on the oom_lock could be done as
    well but this doesn't seem to be necessary at this stage.
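
    The last minute check then amounts to something like this sketch, placed
    right after the oom_lock is taken:

        mutex_lock(&oom_lock);

        /*
         * A bit of memory might have been freed (by the oom_reaper or by a
         * parallel oom killer) while we were waiting for the lock, so
         * recheck the margin before killing anything.
         */
        if (mem_cgroup_margin(memcg) >= (1 << order))
                goto unlock;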

    [mhocko@kernel.org: commit log]

    Suggested-by: Michal Hocko
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Chris Down
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Link: http://lkml.kernel.org/r/1594735034-19190-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • mem_cgroup_protected currently is both used to set effective low and min
    and return a mem_cgroup_protection based on the result. As a user, this
    can be a little unexpected: it appears to be a simple predicate function,
    if not for the big warning in the comment above about the order in which
    it must be executed.

    This change makes it so that we separate the state mutations from the
    actual protection checks, which makes it more obvious where we need to be
    careful mutating internal state, and where we are simply checking and
    don't need to worry about that.

    [mhocko@suse.com - don't check protection on root memcgs]

    Suggested-by: Johannes Weiner
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Yafang Shao
    Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.

    This series contains a fix for an edge case in my earlier protection
    calculation patches, and a patch to make the area overall a little more
    robust, to hopefully help avoid this in future.

    This patch (of 2):

    A cgroup can have both memory protection and a memory limit to isolate it
    from its siblings in both directions - for example, to prevent it from
    being shrunk below 2G under high pressure from outside, but also from
    growing beyond 4G under low pressure.

    Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    implemented proportional scan pressure so that multiple siblings in excess
    of their protection settings don't get reclaimed equally but instead in
    accordance to their unprotected portion.

    During limit reclaim, this proportionality shouldn't apply of course:
    there is no competition, all pressure is from within the cgroup and should
    be applied as such. Reclaim should operate at full efficiency.

    However, mem_cgroup_protected() never expected anybody to look at the
    effective protection values when it indicated that the cgroup is above its
    protection. As a result, a query during limit reclaim may return stale
    protection values that were calculated by a previous reclaim cycle in
    which the cgroup did have siblings.

    When this happens, reclaim is unnecessarily hesitant and potentially slow
    to meet the desired limit. In theory this could lead to premature OOM
    kills, although it's not obvious this has occurred in practice.

    Work around the problem by special-casing reclaim roots in
    mem_cgroup_protection(). These memcgs never participate in the reclaim
    protection because the reclaim is internal.

    We have to ignore effective protection values for reclaim roots because
    mem_cgroup_protected() might be called from racing reclaim contexts with
    different roots. The calculation relies on a root -> leaf tree traversal,
    therefore top-down reclaim protection invariants should hold. The only
    exception is the reclaim root, which should have its effective protection
    set to 0, but that would be problematic for the following setup:

    Let's have global and A's reclaim in parallel:
    |
    A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
    |\
    | C (low = 1G, usage = 2.5G)
    B (low = 1G, usage = 0.5G)

    for A reclaim we have
    B.elow = B.low
    C.elow = C.low

    For the global reclaim
    A.elow = A.low
    B.elow = min(B.usage, B.low) because children_low_usage <= A.elow

    Which means that protected memcgs would get reclaimed.

    In future we would like to make mem_cgroup_protected more robust against
    racing reclaim contexts but that is likely more complex solution than this
    simple workaround.

    [hannes@cmpxchg.org - large part of the changelog]
    [mhocko@suse.com - workaround explanation]
    [chris@chrisdown.name - retitle]

    Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    Signed-off-by: Yafang Shao
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Chris Down
    Acked-by: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Reclaim retries have been set to 5 since the beginning of time in
    commit 66e1707bc346 ("Memory controller: add per cgroup LRU and
    reclaim"). However, we now have a generally agreed-upon standard for
    page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later
    in commit 0a0337e0d1d1 ("mm, oom: rework oom detection").

    In the absence of a compelling reason to declare an OOM earlier in memcg
    context than page allocator context, it seems reasonable to supplant
    MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page
    allocator and memcg internals more similar in semantics when reclaim
    fails to produce results, avoiding premature OOMs or throttling.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: reclaim harder before high throttling", v2.

    This patch (of 2):

    In Facebook production, we've seen cases where cgroups have been put into
    allocator throttling even when they appear to have a lot of slack file
    caches which should be trivially reclaimable.

    Looking more closely, the problem is that we only try a single cgroup
    reclaim walk for each return to usermode before calculating whether or not
    we should throttle. This single attempt doesn't produce enough pressure to
    shrink cgroups whose file caches grow rapidly prior to entering allocator
    throttling.

    As an example, we see that threads in an affected cgroup are stuck in
    allocator throttling:

    # for i in $(cat cgroup.threads); do
    > grep over_high "/proc/$i/stack"
    > done
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150

    ...however, there is no I/O pressure reported by PSI, despite a lot of
    slack file pages:

    # cat memory.pressure
    some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
    full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
    # cat io.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
    full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
    # grep _file memory.stat
    inactive_file 1370939392
    active_file 661635072

    This patch changes the behaviour to retry reclaim either until the current
    task goes below the 10ms grace period, or we are making no reclaim
    progress at all. In the latter case, we enter reclaim throttling as
    before.

    To a user, there's no intuitive reason for the reclaim behaviour to differ
    from hitting memory.high as part of a new allocation, as opposed to
    hitting memory.high because someone lowered its value. As such this also
    brings an added benefit: it unifies the reclaim behaviour between the two.

    There's precedent for this behaviour: we already do reclaim retries when
    writing to memory.{high,max}, in max reclaim, and in the page allocator
    itself.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
    The memory.high limit is implemented in a way such that the kernel
    penalizes all threads which are allocating memory over the limit. Forcing
    all threads into synchronous reclaim and adding some artificial delays
    slows down memory consumption and potentially gives userspace oom
    handlers/resource control agents some time to react.

    It works nicely if memory usage is hitting the limit from below, but it
    works sub-optimally if a user adjusts memory.high to a value way below the
    current memory usage. It basically forces all workload threads (doing any
    memory allocations) into synchronous reclaim and sleep. This makes the
    workload completely unresponsive for a long period of time and can also
    lead to system-wide contention on lru locks. It can happen even if the
    workload is not actually tight on memory and has, for example, a ton of
    cold pagecache.

    In the current implementation writing to memory.high causes an atomic
    update of page counter's high value followed by an attempt to reclaim
    enough memory to fit into the new limit. To fix the problem described
    above, all we need is to change the order of execution: try to push the
    memory usage under the limit first, and only then set the new high limit.

    Reported-by: Domas Mituzas
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Chris Down
    Link: http://lkml.kernel.org/r/20200709194718.189231-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the kernel stack is being accounted per-zone. There is no need
    to do that. In addition due to being per-zone, memcg has to keep a
    separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
    MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
    node_stat_item. In addition localize the kernel stack stats updates to
    account_kernel_stack().

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt