15 Aug, 2020

1 commit

  • Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
    memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
    concurrently, as reported by KCSAN:

    BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

    read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
    page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
    try_charge+0x131/0xd50 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x58/0x140
    __memcg_kmem_charge+0xcc/0x280
    __alloc_pages_nodemask+0x1e1/0x450
    alloc_pages_current+0xa6/0x120
    pte_alloc_one+0x17/0xd0
    __pte_alloc+0x3a/0x1f0
    copy_p4d_range+0xc36/0x1990
    copy_page_range+0x21d/0x360
    dup_mmap+0x5f5/0x7a0
    dup_mm+0xa2/0x240
    copy_process+0x1b3f/0x3460
    _do_fork+0xaa/0xa20
    __x64_sys_clone+0x13b/0x170
    do_syscall_64+0x91/0xb47
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
    page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
    try_charge+0x131/0xd50 mm/memcontrol.c:2405
    mem_cgroup_try_charge+0x159/0x460
    mem_cgroup_try_charge_delay+0x3d/0xa0
    wp_page_copy+0x14d/0x930
    do_wp_page+0x107/0x7b0
    __handle_mm_fault+0xce6/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

    write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
    page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
    try_charge+0x185/0xbf0 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
    __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
    __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

    read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
    page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
    try_charge+0x185/0xbf0 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
    __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
    __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

    Since the watermark could be compared against or set to a garbage value
    due to a data race, which would change the code logic, fix it by adding
    a pair of READ_ONCE() and WRITE_ONCE() in those places.

    The "failcnt" counter is tolerant of some degree of inaccuracy and is
    only used to report stats, so a data race will not be harmful; mark it
    as an intentional data race using the data_race() macro.
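
    As a rough illustration, the fixed charge path has roughly the shape
    below; this is a trimmed sketch using the field names of
    mm/page_counter.c, not the exact upstream diff:

    bool page_counter_try_charge(struct page_counter *counter,
                                 unsigned long nr_pages,
                                 struct page_counter **fail)
    {
        struct page_counter *c;

        for (c = counter; c; c = c->parent) {
            long new;

            new = atomic_long_add_return(nr_pages, &c->usage);
            if (new > c->max) {
                atomic_long_sub(nr_pages, &c->usage);
                /*
                 * failcnt only feeds stats, so the racy increment is
                 * annotated as an intentional data race.
                 */
                data_race(c->failcnt++);
                *fail = c;
                goto failed;
            }
            /*
             * Pair READ_ONCE()/WRITE_ONCE() so the watermark is never
             * compared against or updated with a torn value.
             */
            if (new > READ_ONCE(c->watermark))
                WRITE_ONCE(c->watermark, new);
        }
        return true;

    failed:
        for (c = counter; c != *fail; c = c->parent)
            page_counter_cancel(c, nr_pages);
        return false;
    }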

    Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
    Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Tetsuo Handa
    Cc: Marco Elver
    Cc: Dmitry Vyukov
    Cc: Johannes Weiner
    Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     

08 Aug, 2020

1 commit

  • When a workload runs in cgroups that aren't directly below the root
    cgroup and their parent specifies reclaim protection, that protection
    may end up ineffective.

    The reason is that propagate_protected_usage() is not called all the
    way up the hierarchy. All the protected usage is incorrectly
    accumulated in the workload's parent. This means that
    siblings_low_usage is overestimated and effective protection
    underestimated. Even though this is a transitional phenomenon (the
    uncharge path does the correct propagation and fixes the wrong
    children_low_usage), it can undermine the intended protection
    unexpectedly.

    We have noticed this problem while seeing a swap-out in a descendant of
    a protected memcg (an intermediate node) while the parent was
    comfortably under its protection limit and the memory pressure was
    external to that hierarchy. Michal pinpointed this to the wrong
    siblings_low_usage, which led to the unwanted reclaim.

    The fix is simply to update children_low_usage in the respective
    ancestors in the charging path as well.
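
    The essence of the change, sketched against page_counter_charge() (the
    try_charge path gets the same treatment); this is my reading of the
    description above, not the exact upstream diff:

    void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
    {
        struct page_counter *c;

        for (c = counter; c; c = c->parent) {
            long new;

            new = atomic_long_add_return(nr_pages, &c->usage);
            /*
             * Propagate for the ancestor currently being charged, so
             * children_low_usage is updated at every level of the
             * hierarchy rather than piling up in one parent.
             */
            propagate_protected_usage(c, new);
            if (new > READ_ONCE(c->watermark))
                WRITE_ONCE(c->watermark, new);
        }
    }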

    Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
    Signed-off-by: Michal Koutný
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: [4.18+]
    Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Koutný
     

03 Apr, 2020

3 commits

  • This can be set concurrently with reads, which may cause the wrong value
    to be propagated.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/e809b4e6b0c1626dac6945970de06409a180ee65.1584034301.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • This can be set concurrently with reads, which may cause the wrong value
    to be propagated.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/448206f44b0fa7be9dad2ca2601d2bcb2c0b7844.1584034301.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm: memcontrol: recursive memory.low protection", v3.

    The current memory.low (and memory.min) semantics require protection to be
    assigned to a cgroup in an uninterrupted chain from the top-level cgroup
    all the way to the leaf.

    In practice, we want to protect entire cgroup subtrees from each other
    (system management software vs. workload), but we would like the VM to
    balance memory optimally *within* each subtree, without having to make
    explicit weight allocations among individual components. The current
    semantics make that impossible.

    They also introduce unmanageable complexity into more advanced resource
    trees. For example:

    host root
    `- system.slice
       `- rpm upgrades
       `- logging
    `- workload.slice
       `- a container
          `- system.slice
          `- workload.slice
             `- job A
                `- component 1
                `- component 2
             `- job B

    From a host-level perspective, we would like to protect the outer
    workload.slice subtree as a whole from rpm upgrades, logging etc. But for
    that to be effective, right now we'd have to propagate it down through the
    container, the inner workload.slice, into the job cgroup and ultimately
    the component cgroups where memory is actually, physically allocated.
    This may cross several tree delegation points and namespace boundaries,
    which makes such a setup nearly impossible.

    CPU and IO on the other hand are already distributed recursively. The
    user would simply configure allowances at the host level, and they would
    apply to the entire subtree without any downward propagation.

    To enable the above-mentioned usecases and bring memory in line with other
    resource controllers, this patch series extends memory.low/min such that
    settings apply recursively to the entire subtree. Users can still assign
    explicit shares in subgroups, but if they don't, any ancestral protection
    will be distributed such that children compete freely amongst each other -
    as if no memory control were enabled inside the subtree - but enjoy
    protection from neighboring trees.

    In the above example, the user would then be able to configure shares of
    CPU, IO and memory at the host level to comprehensively protect and
    isolate the workload.slice as a whole from system.slice activity.

    Patch #1 fixes an existing bug that can give a cgroup tree more protection
    than it should receive as per ancestor configuration.

    Patch #2 simplifies and documents the existing code to make it easier to
    reason about the changes in the next patch.

    Patch #3 finally implements recursive memory protection semantics.

    Because of a risk of regressing legacy setups, the new semantics are
    hidden behind a cgroup2 mount option, 'memory_recursiveprot'.

    More details in patch #3.

    This patch (of 3):

    When memory.low is overcommitted - i.e. the children claim more
    protection than their shared ancestor grants them - the allowance is
    distributed in proportion to how much each sibling uses their own declared
    protection:

    low_usage = min(memory.low, memory.current)
    elow = parent_elow * (low_usage / siblings_low_usage)

    However, siblings_low_usage is not the sum of all low_usages. It sums
    up the usages of *only those cgroups that are within their memory.low*.
    That means that low_usage can be *bigger* than siblings_low_usage, and
    consequently the total protection afforded to the children can be
    bigger than what the ancestor grants the subtree.

    Consider three groups where two are in excess of their protection:

    A/memory.low = 10G
    A/A1/memory.low = 10G, memory.current = 20G
    A/A2/memory.low = 10G, memory.current = 20G
    A/A3/memory.low = 10G, memory.current = 8G
    siblings_low_usage = 8G (only A3 contributes)

    A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
    A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
    A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G

    (the 12.5G are capped to the explicit memory.low setting of 10G)

    With that, the sum of all awarded protection below A is 30G, when A
    only grants 10G for the entire subtree.

    What does this mean in practice? A1 and A2 would still be in excess of
    their 10G allowance and would be reclaimed, whereas A3 would not. As
    they eventually drop below their protection setting, they would be
    counted in siblings_low_usage again and the error would right itself.

    When reclaim was applied in a binary fashion (cgroup is reclaimed when
    it's above its protection, otherwise it's skipped) this would actually
    work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
    aggression always exclude protection"), reclaim pressure is scaled to
    how much a cgroup is above its protection. As a result this
    calculation error unduly skews pressure away from A1 and A2 toward the
    rest of the system.

    But why did we do it like this in the first place?

    The reasoning behind exempting groups in excess from
    siblings_low_usage was to go after them first during reclaim in an
    overcommitted subtree:

    A/memory.low = 2G, memory.current = 4G
    A/A1/memory.low = 3G, memory.current = 2G
    A/A2/memory.low = 1G, memory.current = 2G

    siblings_low_usage = 2G (only A1 contributes)
    A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
    A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G

    While the children combined are overcommitting A and are technically
    both at fault, A2 is actively declaring unprotected memory and we
    would like to reclaim that first.

    However, while this sounds like a noble goal on the face of it, it
    doesn't make much difference in actual memory distribution: Because A
    is overcommitted, reclaim will not stop once A2 gets pushed back to
    within its allowance; we'll have to reclaim A1 either way. The end
    result is still that protection is distributed proportionally, with A1
    getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.

    [ If A weren't overcommitted, it wouldn't make a difference since each
    cgroup would just get the protection it declares:

    A/memory.low = 2G, memory.current = 3G
    A/A1/memory.low = 1G, memory.current = 1G
    A/A2/memory.low = 1G, memory.current = 2G

    With the current calculation:

    siblings_low_usage = 1G (only A1 contributes)
    A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
    A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G

    Including excess groups in siblings_low_usage:

    siblings_low_usage = 2G
    A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
    A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]

    Simplify the calculation and fix the proportional reclaim bug by
    including excess cgroups in siblings_low_usage.
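
    The core of that change, sketched on the memory.low branch of
    propagate_protected_usage() in mm/page_counter.c (illustrative, not the
    exact upstream diff):

    static void propagate_protected_usage(struct page_counter *c,
                                          unsigned long usage)
    {
        unsigned long protected, old_protected;
        long delta;

        if (!c->parent)
            return;

        if (c->low || atomic_long_read(&c->low_usage)) {
            /*
             * Count usage towards siblings_low_usage even when the group
             * exceeds its protection, instead of contributing 0.
             */
            protected = min(usage, c->low);
            old_protected = atomic_long_xchg(&c->low_usage, protected);
            delta = protected - old_protected;
            if (delta)
                atomic_long_add(delta, &c->parent->children_low_usage);
        }
    }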

    After this patch, the effective memory.low distribution from the
    example above would be as follows:

    A/memory.low = 10G
    A/A1/memory.low = 10G, memory.current = 20G
    A/A2/memory.low = 10G, memory.current = 20G
    A/A3/memory.low = 10G, memory.current = 8G
    siblings_low_usage = 28G

    A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
    A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
    A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G

    Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
    Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Tejun Heo
    Acked-by: Roman Gushchin
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Cc: Michal Koutný
    Link: http://lkml.kernel.org/r/20200227195606.46212-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Jun, 2018

3 commits

  • Memory controller implements the memory.low best-effort memory
    protection mechanism, which works perfectly in many cases and allows
    protecting working sets of important workloads from sudden reclaim.

    But its semantics has a significant limitation: it works only as long as
    there is a supply of reclaimable memory. This makes it pretty useless
    against any sort of slow memory leaks or memory usage increases. This
    is especially true for swapless systems. If swap is enabled, memory
    soft protection effectively postpones problems, allowing a leaking
    application to fill the entire swap area, which makes no sense. The only
    effective way to guarantee the memory protection in this case is to
    invoke the OOM killer.

    It's possible to handle this case in userspace by reacting to MEMCG_LOW
    events, but there is still a place for a fail-safe in-kernel mechanism
    to provide stronger guarantees.

    This patch introduces the memory.min interface for cgroup v2 memory
    controller. It works very similarly to memory.low (sharing the same
    hierarchical behavior), except that it's not disabled if there is no
    more reclaimable memory in the system.

    If a cgroup is not populated, its memory.min is ignored, because otherwise
    even the OOM killer wouldn't be able to reclaim the protected memory,
    and the system could stall.
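
    A hedged sketch of how reclaim could consult the new protection level
    inside its memcg iteration loop; the function and enum names follow the
    mem_cgroup_protected() interface as I understand it, and the
    scan_control fields are shown for illustration rather than quoted from
    mm/vmscan.c:

    switch (mem_cgroup_protected(target_memcg, memcg)) {
    case MEMCG_PROT_MIN:
        /* Hard protection: skip this memcg even under heavy pressure. */
        continue;
    case MEMCG_PROT_LOW:
        /* Best-effort protection: skip unless reclaim is otherwise failing. */
        if (!sc->memcg_low_reclaim) {
            sc->memcg_low_skipped = 1;
            continue;
        }
        break;
    case MEMCG_PROT_NONE:
        break;
    }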

    [guro@fb.com: s/low/min/ in docs]
    Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
    Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Randy Dunlap
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch aims to address an issue in current memory.low semantics,
    which makes it hard to use it in a hierarchy, where some leaf memory
    cgroups are more valuable than others.

    For example, there are memcgs A, A/B, A/C, A/D and A/E:

      A        A/memory.low = 2G, A/memory.current = 6G
     //\\
    BC  DE     B/memory.low = 3G  B/memory.current = 2G
               C/memory.low = 1G  C/memory.current = 2G
               D/memory.low = 0   D/memory.current = 2G
               E/memory.low = 10G E/memory.current = 0

    If we apply memory pressure, B, C and D are reclaimed at the same pace
    while A's usage exceeds 2G. This is obviously wrong, as B's usage is
    fully below B's memory.low, and C has 1G of protection as well. Also, A
    is pushed down to a size that is less than A's 2G memory.low, which is
    also wrong.

    A simple bash script (provided below) can be used to reproduce
    the problem. Current results are:
    A: 1430097920
    A/B: 711929856
    A/C: 717426688
    A/D: 741376
    A/E: 0

    To address the issue, a concept of effective memory.low is introduced.
    Effective memory.low is always equal to or less than the original
    memory.low. In the case when there is no memory.low overcommitment (and
    also for top-level cgroups), these two values are equal.

    Otherwise it's a share of the parent's effective memory.low, calculated
    as the cgroup's memory.low usage divided by the sum of the siblings'
    memory.low usages (by memory.low usage I mean the size of actually
    protected memory: memory.current if memory.current < memory.low, 0
    otherwise). It's necessary to track the actual usage, because otherwise
    an empty cgroup with memory.low set (A/E in my example) will affect the
    actual memory distribution, which makes no sense. To avoid traversing
    the cgroup tree twice, the page_counters code is reused.
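
    Expressed as a small illustrative helper (the name and signature are
    invented for this sketch, degenerate cases are simplified, and this is
    not the kernel code):

    static unsigned long effective_low(unsigned long low, unsigned long usage,
                                       unsigned long parent_elow,
                                       unsigned long siblings_low_usage)
    {
        /* memory.low usage: memory.current if it is below memory.low, else 0 */
        unsigned long low_usage = usage < low ? usage : 0;
        unsigned long share;

        /* Nothing protected here or among the siblings: nothing to divide. */
        if (!low_usage || !siblings_low_usage)
            return low;

        /*
         * A proportional share of the parent's effective memory.low,
         * never exceeding this group's own memory.low.
         */
        share = parent_elow * low_usage / siblings_low_usage;
        return share < low ? share : low;
    }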

    Calculating effective memory.low can be done in the reclaim path, as we
    conveniently traverse the cgroup tree from top to bottom and check
    memory.low at each level. So it's a perfect place to calculate the
    effective memory.low and save it for use by children cgroups.

    This also eliminates the need to traverse the cgroup tree from bottom to
    top each time to check whether the parent's guarantee is exceeded.

    Setting/resetting effective memory.low is intentionally racy, but it's
    fine and shouldn't lead to any significant differences in actual memory
    distribution.

    With this patch applied results are matching the expectations:
    A: 2147930112
    A/B: 1428721664
    A/C: 718393344
    A/D: 815104
    A/E: 0

    Test script:
    #!/bin/bash

    CGPATH="/sys/fs/cgroup"

    truncate /file1 --size 2G
    truncate /file2 --size 2G
    truncate /file3 --size 2G
    truncate /file4 --size 50G

    mkdir "${CGPATH}/A"
    echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
    mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"

    echo 2G > "${CGPATH}/A/memory.low"
    echo 3G > "${CGPATH}/A/B/memory.low"
    echo 1G > "${CGPATH}/A/C/memory.low"
    echo 0 > "${CGPATH}/A/D/memory.low"
    echo 10G > "${CGPATH}/A/E/memory.low"

    echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
    echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
    echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
    echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4

    echo "A: " `cat "${CGPATH}/A/memory.current"`
    echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
    echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
    echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
    echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`

    rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
    rmdir "${CGPATH}/A"
    rm /file1 /file2 /file3 /file4

    Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch renames struct page_counter fields:
    count -> usage
    limit -> max

    and the corresponding functions:
    page_counter_limit() -> page_counter_set_max()
    mem_cgroup_get_limit() -> mem_cgroup_get_max()
    mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
    memcg_update_kmem_limit() -> memcg_update_kmem_max()
    memcg_update_tcp_limit() -> memcg_update_tcp_max()

    The idea behind this renaming is to have the direct matching
    between memory cgroup knobs (low, high, max) and page_counters API.

    This is pure renaming; this patch doesn't bring any functional change.

    Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
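
    For a C source file with no license notice (mm/page_counter.c being one
    such file in this history), this amounts to a single comment on the
    first line; header files conventionally use the /* ... */ comment form
    instead:

    // SPDX-License-Identifier: GPL-2.0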

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <1
    sentence).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

06 Nov, 2015

1 commit

  • page_counter_try_charge() currently returns 0 on success and -ENOMEM on
    failure, which is surprising behavior given the function name.

    Make it follow the expected pattern of try_stuff() functions that return a
    boolean true to indicate success, or false for failure.
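
    A sketch of the resulting calling convention (the surrounding caller
    code is illustrative, not a quote from memcontrol.c):

    struct page_counter *fail;

    if (!page_counter_try_charge(&memcg->memory, nr_pages, &fail)) {
        /* false now signals failure; "fail" is the counter that hit its limit */
        return -ENOMEM;
    }
    /* true: the charge succeeded */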

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Dec, 2014

2 commits

  • As charges now pin the css explicitly, there is no more need for kmemcg
    to acquire a proxy reference for outstanding pages during offlining, or
    to maintain state to identify such "dead" groups.

    This was the last user of the uncharge functions' return values, so remove
    them as well.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory is internally accounted in bytes, using spinlock-protected 64-bit
    counters, even though the smallest accounting delta is a page. The
    counter interface is also convoluted and does too many things.

    Introduce a new lockless word-sized page counter API, then change all
    memory accounting over to it. The translation from and to bytes then only
    happens when interfacing with userspace.

    The removed locking overhead is noticeable when scaling beyond the per-cpu
    charge caches - on a 4-socket machine with 144 threads, the following test
    shows the performance differences of 288 memcgs concurrently running a
    page fault benchmark:

    vanilla:

    18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
    1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
    24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
    1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
    50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
    stalled-cycles-frontend
    stalled-cycles-backend
    8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
    1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
    1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )

    132.474343877 seconds time elapsed ( +- 0.21% )

    lockless:

    12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
    832,850 context-switches # 0.068 K/sec ( +- 0.54% )
    15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
    1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
    32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
    stalled-cycles-frontend
    stalled-cycles-backend
    9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
    2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
    1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )

    91.369330729 seconds time elapsed ( +- 0.45% )

    On top of improved scalability, this also gets rid of the icky long long
    types in the very heart of memcg, which is great for 32 bit and also makes
    the code a lot more readable.

    Notable differences between the old and new API:

    - res_counter_charge() and res_counter_charge_nofail() become
    page_counter_try_charge() and page_counter_charge() resp. to match
    the more common kernel naming scheme of try_do()/do()

    - res_counter_uncharge_until() is only ever used to cancel a local
    counter and never to uncharge bigger segments of a hierarchy, so
    it's replaced by the simpler page_counter_cancel()

    - res_counter_set_limit() is replaced by page_counter_limit(), which
    expects its callers to serialize against themselves

    - res_counter_memparse_write_strategy() is replaced by
    page_counter_memparse(), which rounds down to the nearest page size -
    rather than up. This is more reasonable for explicitly requested
    hard upper limits.

    - to keep charging light-weight, page_counter_try_charge() charges
    speculatively, only to roll back if the result exceeds the limit (a
    sketch of this pattern follows the list). Because of this, a failing
    bigger charge can temporarily lock out smaller charges that would
    otherwise succeed. The error is bounded to the difference between the
    smallest and the biggest possible charge size, so for memcg, this means
    that a failing THP charge can send base page charges into reclaim up to
    2MB (4MB) before the limit would have been reached. This should be
    acceptable.
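
    A sketch of that speculative charge-then-check pattern, using the
    count/limit field names this patch introduces and the int return
    convention of the time (illustrative, not the exact upstream code):

    int page_counter_try_charge(struct page_counter *counter,
                                unsigned long nr_pages,
                                struct page_counter **fail)
    {
        struct page_counter *c;

        for (c = counter; c; c = c->parent) {
            long new;

            /* Charge first, check afterwards: one atomic op on the fast path. */
            new = atomic_long_add_return(nr_pages, &c->count);
            if (new > c->limit) {
                /* Roll back the speculative charge at this level... */
                atomic_long_sub(nr_pages, &c->count);
                *fail = c;
                goto failed;
            }
        }
        return 0;

    failed:
        /* ...and cancel the charges taken in the levels below the failing one. */
        for (c = counter; c != *fail; c = c->parent)
            page_counter_cancel(c, nr_pages);
        return -ENOMEM;
    }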

    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner