08 Aug, 2020

40 commits

  • While checking a performance change for the will-it-scale scalability mmap
    test [1], we found very high lock contention on the spinlock of the percpu
    counter 'vm_committed_as':

    94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

    This heavy lock contention is not always necessary: 'vm_committed_as' only
    needs to be very precise when the strict OVERCOMMIT_NEVER policy is set,
    which requires a rather small batch number for the percpu counter.

    So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
    policy, and lift it to 64X for the OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS
    policies. Also add a sysctl handler to adjust it when the policy is
    reconfigured.
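
    For illustration, the policy-aware batch computation could look roughly
    like the sketch below (modelled on the description above; the exact
    divisors are an assumption derived from the 64X figure, not a quote of the
    upstream code):

    /* illustrative sketch, not the exact upstream implementation */
    static void mm_compute_batch(int overcommit_policy)
    {
            int nr = num_present_cpus();
            unsigned long ram_pages = totalram_pages();
            s32 batch = max_t(s32, nr * 2, 32);
            u64 memsized_batch;

            if (overcommit_policy == OVERCOMMIT_NEVER)
                    /* strict policy: keep the small, precise batch */
                    memsized_batch = min_t(u64, ram_pages / nr / 256, INT_MAX);
            else
                    /* OVERCOMMIT_ALWAYS / OVERCOMMIT_GUESS: 64X larger */
                    memsized_batch = min_t(u64, ram_pages / nr / 4, INT_MAX);

            vm_committed_as_batch = max_t(s32, memsized_batch, batch);
    }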

    A benchmark with the same testcase in [1] shows a 53% improvement on an
    8C/16T desktop and 2097% (20X) on a 4S/72C/144T server. We tested with the
    0day test platforms (server, desktop and laptop), and 80%+ of the platforms
    show improvements with that test; whether a platform improves depends on
    whether the test's mmap size is bigger than the computed batch number.

    With a 16X lift, only 1/3 of the platforms show improvements, though the
    change should help mmap/unmap usage generally, as Michal Hocko mentioned:

    : I believe that there are non-synthetic workloads which would benefit from
    : a larger batch. E.g. large in memory databases which do large mmaps
    : during startups from multiple threads.

    [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Kees Cook
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Haiyang Zhang
    Cc: kernel test robot
    Cc: "K. Y. Srinivasan"
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • A percpu_counter's accuracy is related to its batch size. For a
    percpu_counter with a big batch, the deviation could be big, so when the
    counter's batch is changed at runtime to a smaller value for better
    accuracy, there may also be a need to flush the accumulated deviation.

    So add a percpu-counter sync function to be run on each CPU.
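
    For reference, such a sync can be as simple as folding the local per-CPU
    delta into the global count under the counter's lock, roughly along these
    lines (a sketch, meant to be run on each CPU, e.g. via on_each_cpu()):

    void percpu_counter_sync(struct percpu_counter *fbc)
    {
            unsigned long flags;
            s64 count;

            raw_spin_lock_irqsave(&fbc->lock, flags);
            /* move this CPU's accumulated delta into fbc->count */
            count = __this_cpu_read(*fbc->counters);
            fbc->count += count;
            __this_cpu_sub(*fbc->counters, count);
            raw_spin_unlock_irqrestore(&fbc->lock, flags);
    }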

    Reported-by: kernel test robot
    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Cc: Dennis Zhou
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Andi Kleen
    Cc: Huang Ying
    Cc: Dave Hansen
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox (Oracle)
    Cc: Mel Gorman
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1594389708-60781-4-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • percpu_counter_sum_positive() will provide more accurate info.

    With percpu_counter_read_positive(), the worst-case deviation could be
    'batch * nr_cpus', which is totalram_pages/256 for now, and will grow when
    the batch gets enlarged.
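
    For context, the extra cost comes from the sum walking every CPU's local
    delta instead of just reading the possibly stale global count; roughly (a
    simplified sketch of the existing lib/percpu_counter.c logic):

    s64 __percpu_counter_sum(struct percpu_counter *fbc)
    {
            s64 ret;
            int cpu;
            unsigned long flags;

            raw_spin_lock_irqsave(&fbc->lock, flags);
            ret = fbc->count;
            for_each_online_cpu(cpu)
                    ret += *per_cpu_ptr(fbc->counters, cpu);
            raw_spin_unlock_irqrestore(&fbc->lock, flags);
            return ret;
    }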

    Its time cost is about 800 nanoseconds on a 2C/4T platform and 2~3
    microseconds on a 2S/36C/72T Skylake server in the normal case; in the
    worst case, where vm_committed_as's spinlock is under severe contention, it
    costs 30~40 microseconds on the same Skylake server. That should be fine
    for its only two users: /proc/meminfo and the Hyper-V balloon driver's
    once-per-second status trace.

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko # for /proc/meminfo
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Kees Cook
    Cc: kernel test robot
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1592725000-73486-3-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-3-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • Patch series "make vm_committed_as_batch aware of vm overcommit policy", v6.

    While checking a performance change for the will-it-scale scalability mmap
    test [1], we found very high lock contention on the spinlock of the percpu
    counter 'vm_committed_as':

    94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

    This heavy lock contention is not always necessary: 'vm_committed_as' only
    needs to be very precise when the strict OVERCOMMIT_NEVER policy is set,
    which requires a rather small batch number for the percpu counter.

    So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
    policy, and enlarge it for the not-so-strict OVERCOMMIT_ALWAYS and
    OVERCOMMIT_GUESS policies.

    A benchmark with the same testcase in [1] shows a 53% improvement on an
    8C/16T desktop and 2097% (20X) on a 4S/72C/144T server. For that case,
    whether a platform shows an improvement depends on whether the test's mmap
    size is bigger than the computed batch number.

    We tested 10+ platforms in 0day (server, desktop and laptop). With a 64X
    lift, 80%+ of the platforms show improvements, and with a 16X lift, 1/3 of
    the platforms show improvements.

    And generally it should help mmap/unmap usage, as Michal Hocko mentioned:

    : I believe that there are non-synthetic workloads which would benefit
    : from a larger batch. E.g. large in memory databases which do large
    : mmaps during startups from multiple threads.

    Note: there are some style complaints from checkpatch for patch 4, as the
    sysctl handler declaration follows the same format as its sibling functions.

    [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

    This patch (of 4):

    Use the existing vm_memory_committed() instead, which is also convenient
    for future changes.

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Kees Cook
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Haiyang Zhang
    Cc: kernel test robot
    Cc: "K. Y. Srinivasan"
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1594389708-60781-1-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-2-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • Look at the pseudo code below. The check "!is_file_hugepages(file)" at 3)
    duplicates the one at 1), so we can use "else if" to avoid it. And the
    assignment "retval = -EINVAL" at 2) is only needed by branch 3), because
    "retval" will be overwritten at 4).

    No functional change, but it reduces the code size and is arguably a bit
    clearer; a sketch of the restructured flow follows the pseudo code below.
    Before:
       text    data     bss     dec     hex filename
      28733    1590       1   30324    7674 mm/mmap.o

    After:
       text    data     bss     dec     hex filename
      28701    1590       1   30292    7654 mm/mmap.o

    ====pseudo code====:
    if (!(flags & MAP_ANONYMOUS)) {
            ...
    1)      if (is_file_hugepages(file))
                    len = ALIGN(len, huge_page_size(hstate_file(file)));
    2)      retval = -EINVAL;
    3)      if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
                    goto out_fput;
    } else if (flags & MAP_HUGETLB) {
            ...
    }
    ...

    4) retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    out_fput:
    ...
    return retval;
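
    For illustration, the restructured flow could look roughly like this (a
    sketch of the intent, not the exact diff):

    if (!(flags & MAP_ANONYMOUS)) {
            ...
            if (is_file_hugepages(file)) {
                    len = ALIGN(len, huge_page_size(hstate_file(file)));
            } else if (unlikely(flags & MAP_HUGETLB)) {
                    retval = -EINVAL;
                    goto out_fput;
            }
    } else if (flags & MAP_HUGETLB) {
            ...
    }
    ...

    retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    out_fput:
    ...
    return retval;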

    Signed-off-by: Zhen Lei
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200705080112.1405-1-thunder.leizhen@huawei.com
    Signed-off-by: Linus Torvalds

    Zhen Lei
     
  • The functions are only used in two source files, so there is no need for
    them to be in the global <linux/mm.h> header. Move them to the new header
    and include it only where needed.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Abdul Haleem
    Cc: Satheesh Rajendran
    Cc: Stephen Rothwell
    Cc: Steven Rostedt (VMware)
    Cc: Mike Rapoport
    Cc: Christophe Leroy
    Cc: Arnd Bergmann
    Cc: Max Filippov
    Cc: Stafford Horne
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200609120533.25867-1-joro@8bytes.org
    Signed-off-by: Linus Torvalds

    Joerg Roedel
     
  • The functionality in lib/ioremap.c deals with pagetables, vmalloc and
    caches, so it naturally belongs to mm/. Moving it there will also allow
    declaring the p?d_alloc_track() functions in a header file inside mm/
    rather than having those declarations in include/linux/mm.h.

    Suggested-by: Andrew Morton
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-8-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Most architectures define pgd_free() as a wrapper for free_page().

    Provide a generic version in asm-generic/pgalloc.h and enable its use for
    most architectures.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-7-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Several architectures define pud_alloc_one() as a wrapper for
    __get_free_page() and pud_free() as a wrapper for free_page().

    Provide a generic implementation in asm-generic/pgalloc.h and use it where
    appropriate.
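
    For reference, the generic helpers could look roughly like this (a sketch
    following the existing asm-generic pgalloc style; the gfp handling is an
    assumption):

    static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
    {
            gfp_t gfp = GFP_PGTABLE_USER;

            if (mm == &init_mm)
                    gfp = GFP_PGTABLE_KERNEL;
            return (pud_t *)get_zeroed_page(gfp);
    }

    static inline void pud_free(struct mm_struct *mm, pud_t *pud)
    {
            free_page((unsigned long)pud);
    }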

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-6-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For most architectures that support >2 levels of page tables,
    pmd_alloc_one() is a wrapper for __get_free_pages(), sometimes with
    __GFP_ZERO and sometimes followed by memset(0) instead.

    More elaborate versions on arm64 and x86 account memory for the user page
    tables and call to pgtable_pmd_page_ctor() as the part of PMD page
    initialization.

    Move the arm64 version to include/asm-generic/pgalloc.h and use the
    generic version on several architectures.

    The pgtable_pmd_page_ctor() is a NOP when ARCH_ENABLE_SPLIT_PMD_PTLOCK is
    not enabled, so there is no functional change for most architectures except
    for the addition of __GFP_ACCOUNT for the allocation of user page tables.

    pmd_free() is a wrapper for free_page() in all cases, so there is no
    functional change there.
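
    Roughly, the generic version looks like the following sketch (simplified;
    the real code lives in include/asm-generic/pgalloc.h):

    static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
    {
            struct page *page;
            gfp_t gfp = GFP_PGTABLE_USER;   /* includes __GFP_ACCOUNT */

            if (mm == &init_mm)
                    gfp = GFP_PGTABLE_KERNEL;
            page = alloc_pages(gfp, 0);
            if (!page)
                    return NULL;
            if (!pgtable_pmd_page_ctor(page)) {
                    __free_pages(page, 0);
                    return NULL;
            }
            return (pmd_t *)page_address(page);
    }

    static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
    {
            pgtable_pmd_page_dtor(virt_to_page(pmd));
            free_page((unsigned long)pmd);
    }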

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Matthew Wilcox
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200627143453.31835-5-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • xtensa clears PTEs during allocation of the page tables and pte_clear()
    sets the PTE to a non-zero value. Splitting ptes_clear() helper out of
    pte_alloc_one() and pte_alloc_one_kernel() allows reuse of base generic
    allocation methods (__pte_alloc_one() and __pte_alloc_one_kernel()) and
    the common GFP mask for page table allocations.

    The pte_free() and pte_free_kernel() implementations on xtensa are
    identical to the generic ones and can be dropped.

    [jcmvbkbc@gmail.com: xtensa: fix closing endif comment]
    Link: http://lkml.kernel.org/r/20200721024751.1257-1-jcmvbkbc@gmail.com

    Signed-off-by: Mike Rapoport
    Signed-off-by: Max Filippov
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-4-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Replace pte_alloc_one(), pte_free() and pte_free_kernel() with the generic
    implementation. The only actual functional change is the addition of
    __GFP_ACCOUNT for the allocation of the user page tables.

    pte_alloc_one_kernel() is kept because its implementation on openrisc
    differs from the generic one.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Stafford Horne
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "mm: cleanup usage of "

    Most architectures have very similar versions of pXd_alloc_one() and
    pXd_free_one() for intermediate levels of page table. These patches add
    generic versions of these functions in <asm-generic/pgalloc.h> and enable
    use of the generic functions where appropriate.

    In addition, functions declared and defined in <asm/pgalloc.h> are used
    mostly by core mm and early mm initialization in arch code, and there is no
    actual reason to have <asm/pgalloc.h> included all over the place. The
    first patch in this series removes unneeded includes of <asm/pgalloc.h>.
    In the end it didn't work out as neatly as I hoped and moving
    pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
    unnecessary changes to arches that have custom page table allocations, so
    I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
    to mm/.

    This patch (of 8):

    In most cases the <asm/pgalloc.h> header is required only for allocations
    of page table memory. Most of the .c files that include that header do not
    use symbols declared in <asm/pgalloc.h> and do not require that header.

    As for the other header files that used to include <asm/pgalloc.h>, it is
    possible to move that include into the .c file that actually uses symbols
    from <asm/pgalloc.h> and drop the include from the header file.

    The process was somewhat automated using

    sed -i -E '/[<"]asm\/pgalloc\.h/d'

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Joerg Roedel
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • This function implicitly assumes that the addr passed in is page aligned.
    A non-page-aligned addr could ultimately cause a kernel bug in
    remap_pte_range(), as the exit condition of its loop may never be
    satisfied. This patch documents the requirement and explicitly adds a
    check for it.
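
    The added guard can be as small as the following sketch at the top of the
    function (the exact form is an assumption):

    if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
            return -EINVAL;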

    Signed-off-by: Alex Zhang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200617233512.177519-1-zhangalex@google.com
    Signed-off-by: Linus Torvalds

    Alex Zhang
     
  • In zap_pte_range(), the check for non_swap_entry() and
    is_device_private_entry() is unnecessary since the latter is sufficient to
    determine if the page is a device private page. Remove the test for
    non_swap_entry() to simplify the code and for clarity.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Acked-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200615175405.4613-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • When a workload runs in cgroups that aren't directly below the root cgroup
    and their parent specifies reclaim protection, the protection may end up
    ineffective.

    The reason is that propagate_protected_usage() is not called all the way up
    the hierarchy, so all the protected usage is incorrectly accumulated in the
    workload's parent. This means that siblings_low_usage is overestimated and
    the effective protection underestimated. Even though it is a transitional
    phenomenon (the uncharge path does propagate correctly and fixes the wrong
    children_low_usage), it can undermine the intended protection unexpectedly.

    We have noticed this problem while seeing a swap out in a descendant of a
    protected memcg (intermediate node) while the parent was conveniently
    under its protection limit and the memory pressure was external to that
    hierarchy. Michal has pinpointed this down to the wrong
    siblings_low_usage which led to the unwanted reclaim.

    The fix is simply to update children_low_usage in the respective ancestors
    on the charging path as well.
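
    Concretely, the charge path gains the same propagation the uncharge path
    already does; a hedged sketch:

    void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
    {
            struct page_counter *c;

            for (c = counter; c; c = c->parent) {
                    long new;

                    new = atomic_long_add_return(nr_pages, &c->usage);
                    /*
                     * The added call: keep children_low_usage in the
                     * ancestors up to date on charge, not only on uncharge.
                     */
                    propagate_protected_usage(c, new);
                    /* watermark tracking etc. unchanged */
            }
    }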

    Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
    Signed-off-by: Michal Koutný
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: [4.18+]
    Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Koutný
     
  • When an outside process lowers one of the memory limits of a cgroup (or
    uses the force_empty knob in cgroup1), direct reclaim is performed in the
    context of the write(), in order to directly enforce the new limit and
    have it met by the time the write() returns.

    Currently, this reclaim activity is accounted as memory pressure in the
    cgroup that the writer(!) belongs to. This is unexpected. It
    specifically causes problems for senpai
    (https://github.com/facebookincubator/senpai), which is an agent that
    routinely adjusts the memory limits and performs associated reclaim work
    in tens or even hundreds of cgroups running on the host. The cgroup that
    senpai is running in itself will report elevated levels of memory
    pressure, even though it itself is under no memory shortage or any sort of
    distress.

    Move the psi annotation from the central cgroup reclaim function to
    callsites in the allocation context, and thereby no longer count any
    limit-setting reclaim as memory pressure. If the newly set limit pushes
    the workload inside the cgroup into direct reclaim, that of course will
    continue to count as memory pressure.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 8c8c383c04f6 ("mm: memcontrol: try harder to set a new
    memory.high") inadvertently removed a callback to recalculate the
    writeback cache size in light of a newly configured memory.high limit.

    Without letting the writeback cache know about a potentially heavily
    reduced limit, it may permit too many dirty pages, which can cause
    unnecessary reclaim latencies or even avoidable OOM situations.

    This was spotted while reading the code; it isn't known to have caused any
    problems in practice so far.

    Fixes: 8c8c383c04f6 ("mm: memcontrol: try harder to set a new memory.high")
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/20200728135210.379885-1-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg oom killer invocation is synchronized by the global oom_lock, and
    tasks sleep on the lock while somebody is selecting the victim, potentially
    racing with the oom_reaper which is releasing the victim's memory. This
    can result in a pointless oom killer invocation because a waiter might be
    racing with the oom_reaper:

    P1                     oom_reaper              P2
                           oom_reap_task           mutex_lock(oom_lock)
                                                   out_of_memory
                                                     # no victim because we
                                                     # have one already
                           __oom_reap_task_mm      mutex_unlock(oom_lock)
    mutex_lock(oom_lock)
                           set MMF_OOM_SKIP
    select_bad_process
      # finds a new victim

    The page allocator prevents this race by trying the allocation again once
    the lock can be acquired (in __alloc_pages_may_oom), which acts as a
    last-minute check. Moreover, the page allocator doesn't block on the
    oom_lock and simply retries the whole reclaim process.

    The memcg oom killer should do the last-minute check as well; call
    mem_cgroup_margin() to do that. A trylock on the oom_lock could be done as
    well, but this doesn't seem to be necessary at this stage.
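
    A hedged sketch of that check in mem_cgroup_out_of_memory(), after the
    oom_lock is taken and before a victim is selected:

    mutex_lock(&oom_lock);
    /*
     * A racing oom_reaper (or another task's uncharge) may have freed
     * enough memory while we were waiting for the lock.
     */
    if (mem_cgroup_margin(memcg) >= (1 << order))
            goto unlock;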

    [mhocko@kernel.org: commit log]

    Suggested-by: Michal Hocko
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Chris Down
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Link: http://lkml.kernel.org/r/1594735034-19190-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • mem_cgroup_protected() is currently used both to set the effective low and
    min and to return a mem_cgroup_protection based on the result. As a user,
    this can be a little unexpected: it appears to be a simple predicate function,
    if not for the big warning in the comment above about the order in which
    it must be executed.

    This change makes it so that we separate the state mutations from the
    actual protection checks, which makes it more obvious where we need to be
    careful mutating internal state, and where we are simply checking and
    don't need to worry about that.
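
    The reworked shape, roughly (treat the exact names and signatures as
    approximate, not a quote of the final code):

    /* mutate the effective protection state once per reclaimed memcg */
    mem_cgroup_calculate_protection(target_memcg, memcg);

    /* ...then purely read-only checks decide how to treat it */
    if (mem_cgroup_below_min(memcg)) {
            /* hard protection: skip unless nothing else is reclaimable */
    } else if (mem_cgroup_below_low(memcg)) {
            /* soft protection: skip unless low reclaim is allowed */
    }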

    [mhocko@suse.com - don't check protection on root memcgs]

    Suggested-by: Johannes Weiner
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Yafang Shao
    Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.

    This series contains a fix for an edge case in my earlier protection
    calculation patches, and a patch to make the area overall a little more
    robust to hopefully help avoid this in future.

    This patch (of 2):

    A cgroup can have both memory protection and a memory limit to isolate it
    from its siblings in both directions - for example, to prevent it from
    being shrunk below 2G under high pressure from outside, but also from
    growing beyond 4G under low pressure.

    Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    implemented proportional scan pressure so that multiple siblings in excess
    of their protection settings don't get reclaimed equally but instead in
    accordance to their unprotected portion.

    During limit reclaim, this proportionality shouldn't apply of course:
    there is no competition, all pressure is from within the cgroup and should
    be applied as such. Reclaim should operate at full efficiency.

    However, mem_cgroup_protected() never expected anybody to look at the
    effective protection values when it indicated that the cgroup is above its
    protection. As a result, a query during limit reclaim may return stale
    protection values that were calculated by a previous reclaim cycle in
    which the cgroup did have siblings.

    When this happens, reclaim is unnecessarily hesitant and potentially slow
    to meet the desired limit. In theory this could lead to premature OOM
    kills, although it's not obvious this has occurred in practice.

    Work around the problem by special-casing reclaim roots in
    mem_cgroup_protection(). These memcgs never participate in the reclaim
    protection because the reclaim is internal.

    We have to ignore effective protection values for reclaim roots because
    mem_cgroup_protected might be called from racing reclaim contexts with
    different roots. The calculation relies on a root -> leaf tree traversal,
    therefore top-down reclaim protection invariants should hold. The only
    exception is the reclaim root, which should have its effective protection
    set to 0, but that would be problematic for the following setup:

    Let's have global and A's reclaim in parallel:
    |
    A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
    |\
    | C (low = 1G, usage = 2.5G)
    B (low = 1G, usage = 0.5G)

    for A reclaim we have
    B.elow = B.low
    C.elow = C.low

    For the global reclaim
    A.elow = A.low
    B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
    C.elow = min(C.usage, C.low)

    With the effective values resetting we have A reclaim
    A.elow = 0
    B.elow = B.low
    C.elow = C.low

    and global reclaim could see the above and then
    B.elow = C.elow = 0 because children_low_usage > A.elow

    Which means that protected memcgs would get reclaimed.

    In the future we would like to make mem_cgroup_protected() more robust
    against racing reclaim contexts, but that is likely a more complex solution
    than this simple workaround.

    [hannes@cmpxchg.org - large part of the changelog]
    [mhocko@suse.com - workaround explanation]
    [chris@chrisdown.name - retitle]

    Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    Signed-off-by: Yafang Shao
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Chris Down
    Acked-by: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Reclaim retries have been set to 5 since the beginning of time in
    commit 66e1707bc346 ("Memory controller: add per cgroup LRU and
    reclaim"). However, we now have a generally agreed-upon standard for
    page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later
    in commit 0a0337e0d1d1 ("mm, oom: rework oom detection").

    In the absence of a compelling reason to declare an OOM earlier in memcg
    context than page allocator context, it seems reasonable to supplant
    MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page
    allocator and memcg internals more similar in semantics when reclaim
    fails to produce results, avoiding premature OOMs or throttling.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: reclaim harder before high throttling", v2.

    This patch (of 2):

    In Facebook production, we've seen cases where cgroups have been put into
    allocator throttling even when they appear to have a lot of slack file
    caches which should be trivially reclaimable.

    Looking more closely, the problem is that we only try a single cgroup
    reclaim walk for each return to usermode before calculating whether or not
    we should throttle. This single attempt doesn't produce enough pressure
    to shrink for cgroups with a rapidly growing amount of file caches prior
    to entering allocator throttling.

    As an example, we see that threads in an affected cgroup are stuck in
    allocator throttling:

    # for i in $(cat cgroup.threads); do
    > grep over_high "/proc/$i/stack"
    > done
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150
    [] mem_cgroup_handle_over_high+0x10b/0x150

    ...however, there is no I/O pressure reported by PSI, despite a lot of
    slack file pages:

    # cat memory.pressure
    some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
    full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
    # cat io.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
    full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
    # grep _file memory.stat
    inactive_file 1370939392
    active_file 661635072

    This patch changes the behaviour to retry reclaim either until the current
    task goes below the 10ms grace period, or we are making no reclaim
    progress at all. In the latter case, we enter reclaim throttling as
    before.
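
    In pseudo code, the reworked mem_cgroup_handle_over_high() loop looks
    roughly like this (helper names and signatures simplified):

    retry_reclaim:
            nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
            penalty_jiffies = calculate_high_delay(memcg, nr_pages);

            if (penalty_jiffies <= HZ / 100)
                    goto out;               /* under the ~10ms grace period */

            if (nr_reclaimed || nr_retries--)
                    goto retry_reclaim;     /* still making forward progress */

            /* no progress at all: throttle as before */
            schedule_timeout_killable(penalty_jiffies);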

    To a user, there's no intuitive reason for the reclaim behaviour to differ
    from hitting memory.high as part of a new allocation, as opposed to
    hitting memory.high because someone lowered its value. As such this also
    brings an added benefit: it unifies the reclaim behaviour between the two.

    There's precedent for this behaviour: we already do reclaim retries when
    writing to memory.{high,max}, in max reclaim, and in the page allocator
    itself.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594640214.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • The memory.high limit is implemented in a way such that the kernel
    penalizes all threads which are allocating memory over the limit. Forcing
    all threads into synchronous reclaim and adding some artificial delays
    allows slowing down the memory consumption and potentially gives some time
    for userspace oom handlers/resource control agents to react.

    It works nicely if the memory usage is hitting the limit from below, but
    it works sub-optimally if a user adjusts memory.high to a value way below
    the current memory usage. It basically forces all workload threads (doing
    any memory allocations) into synchronous reclaim and sleep. This makes the
    workload completely unresponsive for a long period of time and can also
    lead to a system-wide contention on lru locks. It can happen even if the
    workload is not actually tight on memory and has, for example, a ton of
    cold pagecache.

    In the current implementation, writing to memory.high causes an atomic
    update of the page counter's high value followed by an attempt to reclaim
    enough memory to fit into the new limit. To fix the problem described
    above, all we need is to change the order of execution: try to push the
    memory usage under the limit first, and only then set the new high limit.
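
    A hedged sketch of the reordered write path (simplified from the
    description; not the exact upstream code):

    /* push usage toward the new target first ... */
    for (;;) {
            unsigned long nr_pages = page_counter_read(&memcg->memory);

            if (nr_pages <= high || signal_pending(current))
                    break;
            if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                              GFP_KERNEL, true) &&
                !nr_retries--)
                    break;
    }
    /*
     * ... and only then publish the new limit, so allocators don't all
     * pile into synchronous reclaim against a target far below usage.
     */
    page_counter_set_high(&memcg->memory, high);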

    Reported-by: Domas Mituzas
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Chris Down
    Link: http://lkml.kernel.org/r/20200709194718.189231-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently memcg_kmem_enabled() is optimized for the kernel memory
    accounting being off. It was so for a long time, and arguably the reason
    behind was that the kernel memory accounting was initially an opt-in
    feature. However, now it's on by default on both cgroup v1 and cgroup v2,
    and it's on for all cgroups. So let's switch over to
    static_branch_likely() to reflect this fact.

    There is unlikely to be a significant performance difference, as the cost
    of a memory allocation and its accounting significantly exceeds the cost of
    a jump. However, the conversion makes the code more logical.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() and uncharge_slab_page() are not related anymore to
    memcg charging and uncharging. In order to make their names less
    confusing, let's rename them to account_slab_page() and
    unaccount_slab_page() respectively.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() is not using the gfp argument anymore, so remove it.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the kernel stack is being accounted per-zone. There is no need
    to do that. In addition due to being per-zone, memcg has to keep a
    separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
    MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
    node_stat_item. In addition localize the kernel stack stats updates to
    account_kernel_stack().

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Add a drgn-based tool to display slab information for a given memcg. It
    can replace the cgroup v1 memory.kmem.slabinfo interface on cgroup v2, but
    in a more flexible way.

    Currently it supports only the SLUB configuration, but SLAB support can be
    trivially added later.

    Output example:
    $ sudo ./tools/cgroup/memcg_slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
    shmem_inode_cache 92 92 704 46 8 : tunables 0 0 0 : slabdata 2 2 0
    eventpoll_pwq 56 56 72 56 1 : tunables 0 0 0 : slabdata 1 1 0
    eventpoll_epi 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
    kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-64 128 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0
    mm_struct 160 160 1024 32 8 : tunables 0 0 0 : slabdata 5 5 0
    signal_cache 96 96 1024 32 8 : tunables 0 0 0 : slabdata 3 3 0
    sighand_cache 45 45 2112 15 8 : tunables 0 0 0 : slabdata 3 3 0
    files_cache 138 138 704 46 8 : tunables 0 0 0 : slabdata 3 3 0
    task_delay_info 153 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
    task_struct 27 27 3520 9 8 : tunables 0 0 0 : slabdata 3 3 0
    radix_tree_node 56 56 584 28 4 : tunables 0 0 0 : slabdata 2 2 0
    btrfs_inode 140 140 1136 28 8 : tunables 0 0 0 : slabdata 5 5 0
    kmalloc-1024 64 64 1024 32 8 : tunables 0 0 0 : slabdata 2 2 0
    kmalloc-192 84 84 192 42 2 : tunables 0 0 0 : slabdata 2 2 0
    inode_cache 54 54 600 27 4 : tunables 0 0 0 : slabdata 2 2 0
    kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
    kmalloc-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0
    skbuff_head_cache 32 32 256 32 2 : tunables 0 0 0 : slabdata 1 1 0
    sock_inode_cache 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
    cred_jar 378 378 192 42 2 : tunables 0 0 0 : slabdata 9 9 0
    proc_inode_cache 96 96 672 24 4 : tunables 0 0 0 : slabdata 4 4 0
    dentry 336 336 192 42 2 : tunables 0 0 0 : slabdata 8 8 0
    filp 697 864 256 32 2 : tunables 0 0 0 : slabdata 27 27 0
    anon_vma 644 644 88 46 1 : tunables 0 0 0 : slabdata 14 14 0
    pid 1408 1408 64 64 1 : tunables 0 0 0 : slabdata 22 22 0
    vm_area_struct 1200 1200 200 40 2 : tunables 0 0 0 : slabdata 30 30 0

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Acked-by: Tejun Heo
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200623174037.3951353-20-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Add some tests to cover the kernel memory accounting functionality. These
    cover some issues (and changes) we had recently.

    1) A test which allocates a lot of negative dentries, checks memcg slab
    statistics, creates memory pressure by setting memory.max to some low
    value and checks that some number of slabs was reclaimed.

    2) A test which covers side effects of memcg destruction: it creates
    and destroys a large number of sub-cgroups, each containing a
    multi-threaded workload which allocates and releases some kernel
    memory. Then it checks that the charges and memory.stat values do add up on
    the parent level.

    3) A test which reads /proc/kpagecgroup and implicitly checks that it
    doesn't crash the system.

    4) A test which spawns a large number of threads and checks that the
    kernel stacks accounting works as expected.

    5) A test which checks that live charged slab objects are not
    preventing the memory cgroup from being released after it has been deleted
    by a user.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200623174037.3951353-19-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    It allows removing a bunch of code which is required to handle kmem_cache
    clones for accounted allocations. There is no more need to create them,
    accumulate statistics, propagate attributes, etc. It's quite a significant
    simplification.

    Also, because the total number of slab_caches is almost halved (not all
    kmem_caches have a memcg clone), some additional memory savings are
    expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • memcg_accumulate_slabinfo() is never called with a non-root kmem_cache as
    a first argument, so the is_root_cache(s) check is redundant and can be
    removed without any functional change.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-17-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently there are two lists of kmem_caches:
    1) slab_caches, which contains all kmem_caches,
    2) slab_root_caches, which contains only root kmem_caches.

    And there is some preprocessor magic to have a single list if
    CONFIG_MEMCG_KMEM isn't enabled.

    It was required earlier because the number of non-root kmem_caches was
    proportional to the number of memory cgroups and could reach really big
    values. Now, when it cannot exceed the number of root kmem_caches, there
    is really no reason to maintain two lists.

    We never iterate over the slab_root_caches list on any hot paths, so it's
    perfectly fine to iterate over slab_caches and filter out non-root
    kmem_caches.

    It allows removing a lot of config-dependent code and two pointers from
    the kmem_cache structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The memcg_kmem_get_cache() function became really trivial, so let's just
    inline it into the single call point: memcg_slab_pre_alloc_hook().

    It will make the code less bulky and can also help the compiler to
    generate better code.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Because the number of non-root kmem_caches doesn't depend on the number of
    memory cgroups anymore and is generally not very big, there is no more
    need for a dedicated workqueue.

    Also, as there is no more need to pass any arguments to the
    memcg_create_kmem_cache() except the root kmem_cache, it's possible to
    just embed the work structure into the kmem_cache and avoid the dynamic
    allocation of the work structure.

    This will also simplify the synchronization: for each root kmem_cache
    there is only one work. So there will be no more concurrent attempts to
    create a non-root kmem_cache for a root kmem_cache: the second and all
    following attempts to queue the work will fail.

    On the kmem_cache destruction path there is no more need to call the
    expensive flush_workqueue() and wait for all pending works to be finished.
    Instead, cancel_work_sync() can be used to cancel/wait for only one work.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This is a fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can be perfectly destroyed together with their root counterparts.
    This allows to dramatically simplify the management of non-root
    kmem_caches and delete a ton of code.

    This patch performs the following changes:
    1) introduces memcg_params.memcg_cache pointer to represent the
    kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism
    to create memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named <root kmem_cache name>-memcg,
    e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
    or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
    (separate refcounter, reparenting, shrinking, etc)
    6) makes the slab debugfs display the root_mem_cgroup css id and never
    show the :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • To make the memcg_kmem_bypass() function available outside of
    memcontrol.c, let's move it to memcontrol.h. The function is small and
    fits nicely as a static inline.

    It will be used from the slab code.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-12-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Deprecate memory.kmem.slabinfo.

    An empty file will be presented if corresponding config options are
    enabled.

    The interface is implementation dependent, isn't present in cgroup v2, and
    is generally useful only for core mm debugging purposes. In other words,
    it doesn't provide any value for the absolute majority of users.

    A drgn-based replacement can be found in
    tools/cgroup/memcg_slabinfo.py. It supports cgroup v1 and v2, mimics the
    memory.kmem.slabinfo output and also allows getting any additional
    information without the need to recompile the kernel.

    If a drgn-based solution is too slow for a task, a bpf-based tracing tool
    can be used, which can easily keep track of all slab allocations belonging
    to a memory cgroup.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-11-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Switch to per-object accounting of non-root slab objects.

    Charging is performed using the obj_cgroup API in the pre_alloc hook. The
    obj_cgroup is charged with the size of the object plus the size of the
    metadata: as of now, that is the size of an obj_cgroup pointer. If the
    amount of memory has been charged successfully, the actual allocation code
    is executed. Otherwise, -ENOMEM is returned.

    In the post_alloc hook, if the actual allocation succeeded, the
    corresponding vmstats are bumped and the obj_cgroup pointer is saved.
    Otherwise, the charge is canceled.

    On the free path the obj_cgroup pointer is obtained and used to uncharge
    the size of the object being released.

    Memcg and lruvec counters are now representing only memory used by active
    slab objects and do not include the free space. The free space is shared
    and doesn't belong to any specific cgroup.

    Global per-node slab vmstats are still modified from
    (un)charge_slab_page() functions. The idea is to keep all slab pages
    accounted as slab pages on system level.
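
    A hedged sketch of the hook flow (helper names approximate, simplified
    from the description above):

    /* pre_alloc: charge object size plus the obj_cgroup pointer metadata */
    objcg = get_obj_cgroup_from_current();
    if (objcg && obj_cgroup_charge(objcg, flags,
                                   size + sizeof(struct obj_cgroup *)))
            return NULL;                    /* -ENOMEM */

    /* post_alloc (on success): remember which objcg owns this object */
    page_obj_cgroups(page)[obj_to_index(s, page, p)] = objcg;

    /* free path: look the owner up and uncharge the same amount */
    obj_cgroup_uncharge(objcg, size + sizeof(struct obj_cgroup *));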

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Store the obj_cgroup pointer in the corresponding place of
    page->obj_cgroups for each allocated non-root slab object. Make sure that
    each allocated object holds a reference to obj_cgroup.

    The objcg pointer is obtained by dereferencing memcg->objcg in
    memcg_kmem_get_cache() and is passed from the pre_alloc hook to the
    post_alloc hook. Then, in case of successful allocation(s), it gets stored
    in the page->obj_cgroups vector.

    The objcg-obtaining part looks a bit bulky now, but it will be simplified
    by later commits in the series.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin