20 Jan, 2017

1 commit

  • commit b4536f0c829c8586544c94735c343f9b5070bd01 upstream.

    Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations with 32-bit kernels starting with 4.8

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer is clearly premature because there is still a lot of
    page cache in zone Normal which should satisfy this lowmem request.
    Further debugging has shown that reclaim cannot make any forward
    progress because the page cache is hidden in the active list, which
    doesn't get rotated because inactive_list_is_low is not memcg aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's lru sizes, which doesn't make any sense. We can easily end up
    always seeing the resulting active and inactive counts as 0 and
    returning false. This issue is not limited to 32-bit kernels, but in
    practice the effect on systems without CONFIG_HIGHMEM would be much
    harder to notice because we do not invoke the OOM killer for allocation
    requests targeting < ZONE_NORMAL.

    Fix the issue by tracking per-zone LRU page counts in mem_cgroup_per_node
    and subtracting the per-memcg highmem counts when memcg is enabled.
    Introduce the helper lruvec_zone_lru_size, which redirects to either the
    zone counters or mem_cgroup_get_zone_lru_size as appropriate.
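
    A minimal sketch of that split, assuming mem_cgroup_per_node gains a
    per-zone lru_zone_size[][] array kept in sync by
    mem_cgroup_update_lru_size() (simplified from the upstream patch):

    static unsigned long
    mem_cgroup_get_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                 int zone_idx)
    {
            struct mem_cgroup_per_node *mz;

            mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
            return mz->lru_zone_size[zone_idx][lru];
    }

    unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                       int zone_idx)
    {
            if (!mem_cgroup_disabled())
                    return mem_cgroup_get_zone_lru_size(lruvec, lru, zone_idx);

            return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
                                   NR_ZONE_LRU_BASE + lru);
    }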

    We lose the empty-LRU-but-nonzero-lru_size detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Oct, 2016

2 commits

  • The cgroup core and the memory controller need to track socket ownership
    for different purposes, but the tracking sites being entirely different
    is kind of ugly.

    Be a better citizen and rename the memory controller callbacks to match
    the cgroup core callbacks, then move them to the same place.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When selecting an oom victim, we use the same heuristic for both memory
    cgroup and global oom. The only difference is the scope of tasks to
    select the victim from. So we could just export an iterator over all
    memcg tasks and keep all oom related logic in oom_kill.c, but instead we
    duplicate pieces of it in memcontrol.c reusing some initially private
    functions of oom_kill.c in order to not duplicate all of it. That looks
    ugly and error prone, because any modification of select_bad_process
    should also be propagated to mem_cgroup_out_of_memory.

    Let's rework this as follows: keep all oom heuristic related code private
    to oom_kill.c and make oom_kill.c use exported memcg functions when it's
    really necessary (like in case of iterating over memcg tasks).

    Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

29 Jul, 2016

6 commits

  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move the accounting into account_kernel_stack().
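
    A sketch of the resulting accounting, with the global and per-memcg
    counters both kept in kilobytes (helper and counter names are
    approximate, simplified from the upstream fix):

    static void account_kernel_stack(unsigned long *stack, int account)
    {
            /* all pages of a stack live in the same zone and memcg */
            struct page *first_page = virt_to_page(stack);

            /* global counter, now in KiB */
            mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
                                THREAD_SIZE / 1024 * account);

            /* per-memcg counter, also in KiB */
            memcg_kmem_update_page_stat(first_page, MEMCG_KERNEL_STACK_KB,
                                        account * (THREAD_SIZE / 1024));
    }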

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Minchan Kim reported hitting the following warning on a 32-bit system,
    although it can also affect 64-bit systems.

    WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
    mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
    Modules linked in:
    CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

    LRU list contents and counts are updated separately. Counts are updated
    before pages are added to the LRU and updated after pages are removed.
    The warning above is from a check in mem_cgroup_update_lru_size that
    ensures that list sizes of zero are empty.

    The problem is that node-lru needs to account for highmem pages if
    CONFIG_HIGHMEM is set. One impact of the implementation is that the
    sizes are updated in multiple passes when pages from multiple zones are
    isolated. This happens whether HIGHMEM is set or not. When multiple
    zones are isolated, it's possible for a debugging check in memcg to be
    tripped.

    This patch forces all the zone counts to be updated before the memcg
    function is called.
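
    A simplified sketch of that ordering; __update_lru_size stands in for
    the combined zone/node counter update, and the memcg update is done
    once for the whole batch:

    static __always_inline void update_lru_sizes(struct lruvec *lruvec,
                            enum lru_list lru, unsigned long *nr_zone_taken,
                            unsigned long nr_taken)
    {
            int zid;

            /* update every zone's LRU count first ... */
            for (zid = 0; zid < MAX_NR_ZONES; zid++) {
                    if (!nr_zone_taken[zid])
                            continue;
                    __update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
            }

    #ifdef CONFIG_MEMCG
            /* ... then tell memcg about the whole batch exactly once */
            mem_cgroup_update_lru_size(lruvec, lru, -nr_taken);
    #endif
    }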

    Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Tested-by: Minchan Kim
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change, but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to leave the LRU
    counters on a per-zone basis, but summing them during reclaim would be
    a heavier calculation across multiple cache lines, and it would run far
    more frequently than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch, but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but use per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later, but it keeps this patch easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages would have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
    doesn't need to be stabilized when counting an activation; we merely
    need to hold the RCU lock to prevent the memcg from being freed.
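
    A sketch of the activation path after the change (the direct
    page->mem_cgroup read is shown for brevity; mainline wraps it in a
    small helper, and the lruvec lookup name reflects the zone-based API of
    the time):

    void workingset_activation(struct page *page)
    {
            struct mem_cgroup *memcg;
            struct lruvec *lruvec;

            rcu_read_lock();
            memcg = READ_ONCE(page->mem_cgroup);
            if (!mem_cgroup_disabled() && !memcg)
                    goto out;
            lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
            atomic_long_inc(&lruvec->inactive_age);
    out:
            rcu_read_unlock();
    }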

    This cuts down on overhead quite a bit:

    vm-scalability.throughput (%stddev):
      23047a96d7cfcfca (before):           21621405 +- 0%
      063f6715e77a7be5770d6081fe (after):  24069657 +- 2%   (+11.3%)

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Jul, 2016

1 commit

  • - Hand the memcg_kmem_enabled() check out to the caller. This reduces
    the number of function definitions, making the code easier to follow.
    At the same time it doesn't result in code bloat, because all of these
    functions are used only in one or two places.

    - Move the __GFP_ACCOUNT check to the caller as well, so that one
    doesn't have to dive deep into the memcg implementation to see which
    allocations are charged and which are not (see the sketch after this
    list).

    - Refresh comments.
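
    A sketch of the caller-side pattern after this cleanup, as seen from
    the page allocator (simplified):

    /* after allocating 'page' with 'gfp_mask' and 'order': */
    if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
        unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
            __free_pages(page, order);
            page = NULL;
    }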

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Jul, 2016

1 commit

  • The memory controller has quite a bit of state that usually outlives the
    cgroup and pins its CSS until said state disappears. At the same time
    it imposes a 16-bit limit on the CSS ID space to economically store IDs
    in the wild. Consequently, when we use cgroups to contain frequent but
    small and short-lived jobs that leave behind some page cache, we quickly
    run into the 64k limit on outstanding CSSs. Creating a new cgroup then
    fails with -ENOSPC while there are only a few, or even no, user-visible
    cgroups in existence.

    Although pinning CSSs past cgroup removal is common, there are only two
    instances that actually need an ID after a cgroup is deleted: cache
    shadow entries and swapout records.

    Cache shadow entries reference the ID weakly and can deal with the CSS
    having disappeared when it's looked up later. They pose no hurdle.

    Swap-out records do need to pin the css to hierarchically attribute
    swapins after the cgroup has been deleted; though the only pages that
    remain swapped out after offlining are tmpfs/shmem pages. And those
    references are under the user's control, so they are manageable.

    This patch introduces a private 16-bit memcg ID and switches swap and
    cache shadow entries over to using that. This ID can then be recycled
    after offlining when the CSS remains pinned only by objects that don't
    specifically need it.
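
    A sketch of the mechanism, assuming a separate refcount on the private
    ID so swapout records and shadow entries pin the ID rather than a raw
    CSS ID:

    static DEFINE_IDR(mem_cgroup_idr);

    static void mem_cgroup_id_get(struct mem_cgroup *memcg)
    {
            atomic_inc(&memcg->id.ref);
    }

    static void mem_cgroup_id_put(struct mem_cgroup *memcg)
    {
            if (atomic_dec_and_test(&memcg->id.ref)) {
                    idr_remove(&mem_cgroup_idr, memcg->id.id);
                    memcg->id.id = 0;
                    /* the ID is gone; drop the css reference it was pinning */
                    css_put(&memcg->css);
            }
    }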

    This script demonstrates the problem by faulting one cache page in a new
    cgroup and deleting it again:

    set -e
    mkdir -p pages
    for x in `seq 128000`; do
            [ $((x % 1000)) -eq 0 ] && echo $x
            mkdir /cgroup/foo
            echo $$ >/cgroup/foo/cgroup.procs
            echo trex >pages/$x
            echo $$ >/cgroup/cgroup.procs
            rmdir /cgroup/foo
    done

    When run on an unpatched kernel, we eventually run out of possible IDs
    even though there are no visible cgroups:

    [root@ham ~]# ./cssidstress.sh
    [...]
    65000
    mkdir: cannot create directory '/cgroup/foo': No space left on device

    After this patch, the IDs get released upon cgroup destruction and the
    cache and css objects get released once memory reclaim kicks in.

    [hannes@cmpxchg.org: init the IDR]
    Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
    Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
    Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: John Garcia
    Reviewed-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Cc: Nikolay Borisov
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 May, 2016

1 commit

  • The inactive file list should still be large enough to contain readahead
    windows and freshly written file data, but it no longer is the only
    source for detecting multiple accesses to file pages. The workingset
    refault measurement code causes recently evicted file pages that get
    accessed again after a shorter interval to be promoted directly to the
    active list.

    With that mechanism in place, we can afford to (on a larger system)
    dedicate more memory to the active file list, so we can actually cache
    more of the frequently used file pages in memory, and not have them
    pushed out by streaming writes, once-used streaming file reads, etc.

    This can help things like database workloads, where only half the page
    cache can currently be used to cache the database working set. This
    patch automatically increases that fraction on larger systems, using the
    same ratio that has already been used for anonymous memory.
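
    A sketch of the sizing rule that the file LRU now shares with anon
    (names approximate; the effective active:inactive target grows with the
    square root of memory size, e.g. about 3:1 at 1GB and 10:1 at 10GB):

    static bool inactive_file_is_low(unsigned long inactive, unsigned long active)
    {
            unsigned long inactive_ratio;
            unsigned long gb = (inactive + active) >> (30 - PAGE_SHIFT);

            inactive_ratio = gb ? int_sqrt(10 * gb) : 1;

            return inactive * inactive_ratio < active;
    }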

    [hannes@cmpxchg.org: cgroup-awareness]
    Signed-off-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reported-by: Andres Freund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

20 May, 2016

1 commit

  • Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
    reclaim was removed) that lru_size can be updated by -nr_taken once per
    call to isolate_lru_pages(), instead of page by page.

    Update it inside isolate_lru_pages(), or at its two callsites? I chose
    to update it at the callsites, rearranging and grouping the updates by
    nr_taken and nr_scanned together in both.

    With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
    __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
    some more calls in a future commit. Make the code a little smaller and
    simpler by incorporating stat update in lru_size update.

    The exception was move_active_pages_to_lru(), which aggregated the
    pgmoved stat update separately from the individual lru_size updates; but
    I still think this a simplification worth making.

    However, __mod_zone_page_state is not peculiar to memcg: so better to
    use the name update_lru_size, which calls mem_cgroup_update_lru_size
    when CONFIG_MEMCG is enabled.
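
    A sketch of the resulting wrapper, in the style of
    include/linux/mm_inline.h; under CONFIG_MEMCG the memcg helper also
    folds in the zone stat update:

    static __always_inline void __update_lru_size(struct lruvec *lruvec,
                                    enum lru_list lru, int nr_pages)
    {
            __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru,
                                  nr_pages);
    }

    static __always_inline void update_lru_size(struct lruvec *lruvec,
                                    enum lru_list lru, int nr_pages)
    {
    #ifdef CONFIG_MEMCG
            mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
    #else
            __update_lru_size(lruvec, lru, nr_pages);
    #endif
    }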

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Mar, 2016

4 commits

  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.
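
    A sketch of how the shrinker registration looks once it is flagged as
    memcg aware:

    static struct shrinker workingset_shadow_shrinker = {
            .count_objects = count_shadow_nodes,
            .scan_objects = scan_shadow_nodes,
            .seeks = DEFAULT_SEEKS,
            .flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
    };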

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • As kmem accounting is now either enabled for all cgroups or disabled
    system-wide, there's no point in having the memcg_kmem_online() helper -
    instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
    shrink_slab() now does.

    There are only two places left where this helper is used -
    __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
    only be called if memcg_kmem_enabled() returned true. Since the cgroup
    it operates on is online, mem_cgroup_is_root() check will be enough.

    memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
    of memcg_kmem_online(), because it relies on the fact that in
    memcg_offline_kmem() memcg->kmem_state is changed before
    memcg_deactivate_kmem_caches() is called, but there we can just
    open-code the check.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Show how much memory is allocated to kernel stacks.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Show how much memory is used for storing reclaimable and unreclaimable
    in-kernel data structures allocated from slab caches.
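
    Together with the kernel_stack entry above, the cgroup2 memory.stat
    file then carries lines along the following (values are illustrative
    only):

    anon 167772160
    file 3825852416
    kernel_stack 2097152
    slab 302841856
    ...
    slab_reclaimable 271933440
    slab_unreclaimable 30908416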

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

6 commits

  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts where the page is already locked,
    and fix the debug code in the page stat functions to be okay with the
    page lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing a page's memcg association complicates dealing with the page,
    so we want to limit this as much as possible. Page migration e.g. does
    not have to do that. Just like page cache replacement, it can forcibly
    charge a replacement page, and then uncharge the old page when it gets
    freed. Temporarily overcharging the cgroup by a single page is not an
    issue in practice, and charging is so cheap nowadays that this is much
    preferable to the headache of messing with live pages.

    The only place that still changes the page->mem_cgroup binding of live
    pages is when pages move along with a task to another cgroup. But that
    path isolates the page from the LRU, takes the page lock, and the move
    lock (lock_page_memcg()). That means page->mem_cgroup is always stable
    in callers that have the page isolated from the LRU or locked. Lighter
    unlocked paths, like writeback accounting, can use lock_page_memcg().
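
    A sketch of the resulting rule for readers of page->mem_cgroup:

    /* page isolated from the LRU, or page-locked: */
    memcg = page->mem_cgroup;       /* stable, no extra locking needed */

    /* lighter unlocked path, e.g. writeback accounting: */
    lock_page_memcg(page);
    memcg = page->mem_cgroup;       /* stable until unlock_page_memcg() */
    /* ... update per-memcg statistics ... */
    unlock_page_memcg(page);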

    [akpm@linux-foundation.org: fix build]
    [vdavydov@virtuozzo.com: fix lockdep splat]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cache thrash detection (see a528910e12ec "mm: thrash detection-based
    file cache sizing" for details) currently only works on the system
    level, not inside cgroups. Worse, as the refaults are compared to the
    global number of active cache, cgroups might wrongfully get all their
    refaults activated when their pages are hotter than those of others.

    Move the refault machinery from the zone to the lruvec, and then tag
    eviction entries with the memcg ID. This makes the thrash detection
    work correctly inside cgroups.
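
    A sketch of the shadow entry packing after tagging (zone-based at the
    time of this patch): memcg ID, node, zone and the eviction counter are
    packed into a radix tree exceptional entry:

    static void *pack_shadow(int memcgid, struct zone *zone,
                             unsigned long eviction)
    {
            eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
            eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
            eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
            eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

            return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
    }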

    [sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, along with these now
    easy-to-identify locking sites.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Fix up trivial spelling errors, noticed while reading the code.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

04 Feb, 2016

1 commit

  • MEM_CGROUP_STAT_NSTATS is just a delimiter for cgroup1 statistics, not
    an actual array entry. Reuse it for the first cgroup2 stat entry, like
    in the event array.
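
    A sketch of the resulting enum layout, with the cgroup1 delimiter
    doubling as the first cgroup2 slot:

    enum mem_cgroup_stat_index {
            /* ... cgroup1 statistics ... */
            MEM_CGROUP_STAT_NSTATS,
            /* default hierarchy statistics start here, no slot is wasted */
            MEMCG_SOCK = MEM_CGROUP_STAT_NSTATS,
            MEMCG_NR_STAT,
    };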

    Fixes: b2807f07f4f8 ("mm: memcontrol: add "sock" to cgroup2 memory.stat")
    Signed-off-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Jan, 2016

9 commits

  • Provide statistics on how much of a cgroup's memory footprint is made up
    of socket buffers from network connections owned by the group.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_lruvec_online() takes a lruvec, but it only needs the memcg.
    Since get_scan_count(), which is the only user of this function, now has
    a pointer to the memcg, let's pass the memcg directly to
    mem_cgroup_online() instead of picking it out of the lruvec, and rename
    the function accordingly.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset introduces swap accounting to cgroup2.

    This patch (of 7):

    In the legacy hierarchy we charge memsw, which is dubious, because:

    - memsw.limit must be >= memory.limit, so it is impossible to limit
      swap usage to less than memory usage. Taking into account the fact
      that the primary limiting mechanism in the unified hierarchy is
      memory.high while memory.limit is either left unset or set to a very
      large value, moving the memsw.limit knob to the unified hierarchy
      would effectively make it impossible to limit swap usage according to
      the user preference.

    - memsw.usage != memory.usage + swap.usage, because a page occupying
      both a swap entry and a swap cache page is charged only once to the
      memsw counter. As a result, it is possible to effectively eat up to
      memory.limit of memory pages *and* memsw.limit of swap entries, which
      looks unexpected.

    That said, we should provide a different swap limiting mechanism for
    cgroup2.

    This patch adds mem_cgroup->swap counter, which charges the actual number
    of swap entries used by a cgroup. It is only charged in the unified
    hierarchy, while the legacy hierarchy memsw logic is left intact.

    The swap usage can be monitored using new memory.swap.current file and
    limited using memory.swap.max.

    Note, to charge swap resource properly in the unified hierarchy, we have
    to make swap_entry_free uncharge swap only when ->usage reaches zero, not
    just ->count, i.e. when all references to a swap entry, including the one
    taken by swap cache, are gone. This is necessary, because otherwise
    swap-in could result in uncharging swap even if the page is still in swap
    cache and hence still occupies a swap entry. At the same time, this
    shouldn't break memsw counter logic, where a page is never charged twice
    for using both memory and swap, because in case of legacy hierarchy we
    uncharge swap on commit (see mem_cgroup_commit_charge).

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The creation and teardown of struct mem_cgroup is fairly messy and
    that has attracted mistakes and subtle bugs before.

    The main cause for this is that there is no clear model about what
    needs to happen when, and that attracts more chaos. So create one:

    1. mem_cgroup_alloc() should allocate struct mem_cgroup and its
    auxiliary members and initialize work items, locks etc. so that the
    object it returns is fully initialized and in a neutral state.

    2. mem_cgroup_css_alloc() will use mem_cgroup_alloc() to obtain a new
    memcg object and configure it and the system according to the role
    of the new memory-controlled cgroup in the hierarchy.

    3. mem_cgroup_css_online() is no longer needed to synchronize with
    iterators, but it verifies css->id which isn't available earlier.

    4. mem_cgroup_css_offline() implements stuff that needs to happen upon
    the user-visible destruction of a cgroup, which includes stopping
    all user interfacing as well as releasing certain structures when
    continued memory consumption would be unexpected at that point.

    5. mem_cgroup_css_free() prepares the system and the memcg object for
    the object's disappearance, neutralizes its state, and then gives
    it back to mem_cgroup_free().

    6. mem_cgroup_free() releases struct mem_cgroup and auxiliary memory.

    [arnd@arndb.de: fix SLOB build regression]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There are no more external users of struct cg_proto, flatten the
    structure into struct mem_cgroup.

    Since using those struct members doesn't stand out as much anymore,
    add cgroup2 static branches to make it clearer which code is legacy.

    Suggested-by: Vladimir Davydov
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant; having these conditionals is not
    worth the complication and fragility that comes with them.
    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The cgroup2 memory controller will account important in-kernel memory
    consumers by default. Move all necessary components to CONFIG_MEMCG.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • On any given memcg, the kmem accounting feature has three separate
    states: not initialized, structures allocated, and actively accounting
    slab memory. These are represented through a combination of the
    kmem_acct_activated and kmem_acct_active flags, which is confusing.

    Convert to a kmem_state enum with the states NONE, ALLOCATED, and
    ONLINE. Then rename the functions to modify the state accordingly.
    This follows the nomenclature of css object states more closely.
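
    A sketch of the replacement state machine:

    enum memcg_kmem_state {
            KMEM_NONE,
            KMEM_ALLOCATED,
            KMEM_ONLINE,
    };

    struct mem_cgroup {
            /* ... */
            enum memcg_kmem_state kmem_state;
    };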

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Jan, 2016

1 commit

  • As with rmap, with the new refcounting we cannot rely on PageTransHuge()
    to check whether we need to charge the size of a huge page to the
    cgroup. We need to get the information from the caller to know whether
    it was mapped with a PMD or a PTE.

    We uncharge when the last reference on the page is gone. At that point,
    if we see PageTransHuge() it means we need to uncharge the whole huge
    page.

    The tricky part is partial unmap -- when we try to unmap part of a huge
    page. We don't do any special handling of this situation, meaning we
    don't uncharge the part of the huge page unless the last user is gone or
    split_huge_page() is triggered. If cgroup memory pressure happens, the
    partially unmapped page will be split through the shrinker. This should
    be good enough.
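
    A sketch of the resulting charge interface, with the caller passing the
    compound hint (signature as in the upstream patch, body elided):

    int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
                              gfp_t gfp_mask, struct mem_cgroup **memcgp,
                              bool compound)
    {
            unsigned int nr_pages = compound ? HPAGE_PMD_NR : 1;

            /* ... charge nr_pages to the memcg ... */
            return 0;
    }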

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

6 commits

  • According to <linux/jump_label.h>, the direct use of struct static_key
    is deprecated. Update the socket and slab accounting code accordingly.
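
    In practice this means switching patterns like the following, shown
    here for the socket-accounting key as an example:

    /* deprecated: */
    struct static_key memcg_sockets_enabled_key;
    /* ... if (static_key_false(&memcg_sockets_enabled_key)) ... */

    /* updated: */
    DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
    /* ... if (static_branch_unlikely(&memcg_sockets_enabled_key)) ... */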

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reported-by: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the networking stack know when a memcg is under reclaim pressure so
    that it can clamp its transmit windows accordingly.

    Whenever the reclaim efficiency of a cgroup's LRU lists drops low enough
    for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state
    in the socket and tcp memory code that tells it to curb consumption
    growth from sockets associated with said control group.

    Traditionally, vmpressure reports for the entire subtree of a memcg
    under pressure, which drops useful information on the individual groups
    reclaimed. However, it's too late to change the user interface, so add
    a second reporting mode that reports on the level of reclaim instead of
    at the level of pressure, and use that report for sockets.

    vmpressure events are naturally edge triggered, so for hysteresis assert
    socket pressure for a second to allow for subsequent vmpressure events
    to occur before letting the socket code return to normal.

    This will likely need finetuning for a wider variety of workloads, but
    for now stick to the vmpressure presets and keep hysteresis simple.
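
    A sketch of the hysteresis, assuming a socket_pressure deadline stored
    in the memcg; vmpressure pushes the deadline out by one second, and the
    socket code backs off while the deadline lies in the future (the helper
    wrapping the assertion is named here only for illustration):

    /* asserted by the vmpressure code on MEDIUM/HIGH reclaim pressure */
    static void mem_cgroup_assert_socket_pressure(struct mem_cgroup *memcg)
    {
            memcg->socket_pressure = jiffies + HZ;
    }

    /* consulted from the network stack */
    static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
    {
            do {
                    if (time_before(jiffies, memcg->socket_pressure))
                            return true;
            } while ((memcg = parent_mem_cgroup(memcg)));
            return false;
    }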

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Socket memory can be a significant share of overall memory consumed by
    common workloads. In order to provide reasonable resource isolation in
    the unified hierarchy, this type of memory needs to be included in the
    tracking/accounting of a cgroup under active memory resource control.

    Overhead is only incurred when a non-root control group is created AND
    the memory controller is instructed to track and account the memory
    footprint of that group. cgroup.memory=nosocket can be specified on the
    boot commandline to override any runtime configuration and forcibly
    exclude socket memory from active memory resource control.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The unified hierarchy memory controller is going to use this jump label
    as well to control the networking callbacks. Move it to the memory
    controller code and give it a more generic name.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be any separate counters for socket memory consumed by
    protocols other than TCP in the future. Remove the indirection and link
    sockets directly to their owning memory cgroup.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be a tcp control soft limit, so integrating the memcg code
    into the global skmem limiting scheme complicates things unnecessarily.
    Replace this with simple and clear charge and uncharge calls--hidden
    behind a jump label--to account skb memory.

    Note that this is not purely aesthetic: as a result of shoehorning the
    per-memcg code into the same memory accounting functions that handle the
    global level, the old code would compare the per-memcg consumption
    against the smaller of the per-memcg limit and the global limit. This
    allowed the total consumption of multiple sockets to exceed the global
    limit, as long as the individual sockets stayed within bounds. After
    this change, the code will always compare the per-memcg consumption to
    the per-memcg limit, and the global consumption to the global limit, and
    thus close this loophole.

    Without a soft limit, the per-memcg memory pressure state in sockets is
    generally questionable. However, we did it until now, so we continue to
    enter it when the hard limit is hit, and packets are dropped, to let
    other sockets in the cgroup know that they shouldn't grow their transmit
    windows, either. However, keep it simple in the new callback model and
    leave memory pressure lazily when the next packet is accepted (as
    opposed to doing it synchronously when packets are processed). When
    packets are dropped, network performance will already be in the toilet,
    so that should be a reasonable trade-off.

    As described above, consumption is now checked on the per-memcg level
    and the global level separately. Likewise, memory pressure states are
    maintained on both the per-memcg level and the global level, and a
    socket is considered under pressure when either level asserts as much.
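
    A sketch of the resulting call sites, hidden behind the jump label
    (names as in mainline; the exact context in the socket code is
    simplified):

    /* when raising a socket's memory allocation: */
    if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
        !mem_cgroup_charge_skmem(sk->sk_memcg, nr_pages))
            goto suppress_allocation;

    /* ... and when releasing it: */
    if (mem_cgroup_sockets_enabled && sk->sk_memcg)
            mem_cgroup_uncharge_skmem(sk->sk_memcg, nr_pages);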

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner