08 Jun, 2018

3 commits

  • The memory controller implements the memory.low best-effort memory
    protection mechanism, which works well in many cases and allows
    protecting the working sets of important workloads from sudden reclaim.

    But its semantics have a significant limitation: it works only as long
    as there is a supply of reclaimable memory. This makes it pretty much
    useless against any sort of slow memory leak or gradual memory usage
    increase, especially on swapless systems. If swap is enabled, memory
    soft protection merely postpones the problem, allowing a leaking
    application to fill the entire swap area, which makes no sense. The
    only effective way to guarantee memory protection in this case is to
    invoke the OOM killer.

    It's possible to handle this case in userspace by reacting to MEMCG_LOW
    events; but there is still a place for a fail-safe in-kernel mechanism
    to provide stronger guarantees.
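
    As an illustration only, here is a minimal sketch of such a userspace
    watcher, assuming a cgroup at /sys/fs/cgroup/A (cgroup v2 event files
    support poll() and signal changes with POLLPRI; the path and the
    reaction are made up):

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/sys/fs/cgroup/A/memory.events", O_RDONLY);
        unsigned long low, prev_low = 0;
        char buf[256], *p;
        ssize_t len;

        if (fd < 0)
            return 1;

        for (;;) {
            struct pollfd pfd = { .fd = fd, .events = POLLPRI };

            if (poll(&pfd, 1, -1) < 0)
                break;

            /* re-read the whole file after every notification */
            len = pread(fd, buf, sizeof(buf) - 1, 0);
            if (len <= 0)
                break;
            buf[len] = '\0';

            p = strstr(buf, "low ");
            if (p && sscanf(p, "low %lu", &low) == 1 && low > prev_low) {
                fprintf(stderr, "MEMCG_LOW fired, total %lu\n", low);
                prev_low = low;
                /* a real fail-safe would react here, e.g. by
                 * restarting or killing the leaking workload */
            }
        }
        close(fd);
        return 0;
    }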

    This patch introduces the memory.min interface for cgroup v2 memory
    controller. It works very similarly to memory.low (sharing the same
    hierarchical behavior), except that it's not disabled if there is no
    more reclaimable memory in the system.

    If a cgroup is not populated, its memory.min is ignored, because
    otherwise even the OOM killer wouldn't be able to reclaim the protected
    memory, and the system could stall.
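
    For illustration, setting the new knob from a program is just a write
    to the cgroup file; a hedged sketch (the path and value are made up):

    #include <fcntl.h>
    #include <unistd.h>

    /* give the (hypothetical) cgroup A a 512M protection floor */
    int set_min(void)
    {
        int fd = open("/sys/fs/cgroup/A/memory.min", O_WRONLY);

        if (fd < 0)
            return -1;
        if (write(fd, "512M", 4) != 4) {
            close(fd);
            return -1;
        }
        return close(fd);
    }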

    [guro@fb.com: s/low/min/ in docs]
    Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
    Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Randy Dunlap
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch aims to address an issue in the current memory.low
    semantics, which makes it hard to use in a hierarchy where some leaf
    memory cgroups are more valuable than others.

    For example, there are memcgs A, A/B, A/C, A/D and A/E:

      A        A/memory.low = 2G, A/memory.current = 6G
     //\\
    BC  DE     B/memory.low = 3G   B/memory.current = 2G
               C/memory.low = 1G   C/memory.current = 2G
               D/memory.low = 0    D/memory.current = 2G
               E/memory.low = 10G  E/memory.current = 0

    If we apply memory pressure, B, C and D are reclaimed at the same pace
    while A's usage exceeds 2G. This is obviously wrong, as B's usage is
    fully below B's memory.low, and C has 1G of protection as well. Also,
    A is pushed below its own 2G memory.low, which is also wrong.

    A simple bash script (provided below) can be used to reproduce
    the problem. Current results are:
    A: 1430097920
    A/B: 711929856
    A/C: 717426688
    A/D: 741376
    A/E: 0

    To address the issue, the concept of effective memory.low is
    introduced. Effective memory.low is always equal to or less than the
    original memory.low. When there is no memory.low overcommitment (and
    also for top-level cgroups), these two values are equal.

    Otherwise it's a fraction of the parent's effective memory.low,
    calculated as the cgroup's memory.low usage divided by the sum of the
    siblings' memory.low usages (by memory.low usage I mean the size of
    actually protected memory: memory.current if memory.current <
    memory.low, 0 otherwise). It's necessary to track the actual usage,
    because otherwise an empty cgroup with memory.low set (A/E in my
    example) would affect the actual memory distribution, which makes no
    sense. To avoid traversing the cgroup tree twice, the page_counter
    code is reused.

    Effective memory.low can be calculated in the reclaim path, as we
    conveniently traverse the cgroup tree from top to bottom there and
    check memory.low on each level. So it's a perfect place to calculate
    effective memory.low and save it for use by child cgroups.

    This also eliminates the need to traverse the cgroup tree from bottom
    to top each time to check whether the parent's guarantee is exceeded.

    Setting/resetting effective memory.low is intentionally racy, but
    that's fine and shouldn't lead to any significant difference in the
    actual memory distribution.
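
    A sketch of the proportional rule described above, with illustrative
    names and the example's numbers (the kernel's real implementation
    lives in the reclaim path and differs in detail):

    #include <stdio.h>

    /* "memory.low usage": the size of actually protected memory --
     * memory.current if it is below memory.low, 0 otherwise */
    static unsigned long low_usage(unsigned long curr, unsigned long low)
    {
        return curr < low ? curr : 0;
    }

    /* a child's effective low: its proportional share of the parent's
     * effective low, never exceeding its own memory.low */
    static unsigned long effective_low(unsigned long low, unsigned long curr,
                                       unsigned long parent_elow,
                                       unsigned long siblings_low_usage)
    {
        unsigned long share;

        if (!siblings_low_usage)
            return 0;
        share = (unsigned long)((double)parent_elow *
                                low_usage(curr, low) / siblings_low_usage);
        return share < low ? share : low;
    }

    int main(void)
    {
        const unsigned long G = 1UL << 30;
        /* from the example: initially only B has memory.current below
         * memory.low, so only B contributes to the siblings' low usage */
        unsigned long siblings = low_usage(2 * G, 3 * G) +   /* B */
                                 low_usage(2 * G, 1 * G) +   /* C */
                                 low_usage(2 * G, 0)     +   /* D */
                                 low_usage(0, 10 * G);       /* E */

        printf("B elow = %lu\n", effective_low(3 * G, 2 * G, 2 * G, siblings));
        /* as reclaim pushes C below its 1G memory.low, C starts
         * contributing too and the shares get recomputed */
        return 0;
    }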

    With this patch applied, the results match expectations:
    A: 2147930112
    A/B: 1428721664
    A/C: 718393344
    A/D: 815104
    A/E: 0

    Test script:
    #!/bin/bash

    CGPATH="/sys/fs/cgroup"

    # files whose page cache will populate each cgroup
    truncate /file1 --size 2G
    truncate /file2 --size 2G
    truncate /file3 --size 2G
    truncate /file4 --size 50G

    # build the A/{B,C,D,E} hierarchy with the memory controller enabled
    mkdir "${CGPATH}/A"
    echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
    mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"

    # the protection configuration from the example above
    echo 2G > "${CGPATH}/A/memory.low"
    echo 3G > "${CGPATH}/A/B/memory.low"
    echo 1G > "${CGPATH}/A/C/memory.low"
    echo 0 > "${CGPATH}/A/D/memory.low"
    echo 10G > "${CGPATH}/A/E/memory.low"

    # move the shell into each cgroup in turn and fault a file in;
    # /file4, touched from the root cgroup, applies the memory pressure
    echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
    echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
    echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
    echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4

    echo "A: " `cat "${CGPATH}/A/memory.current"`
    echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
    echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
    echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
    echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`

    rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
    rmdir "${CGPATH}/A"
    rm /file1 /file2 /file3 /file4

    Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This patch renames struct page_counter fields:
    count -> usage
    limit -> max

    and the corresponding functions:
    page_counter_limit() -> page_counter_set_max()
    mem_cgroup_get_limit() -> mem_cgroup_get_max()
    mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
    memcg_update_kmem_limit() -> memcg_update_kmem_max()
    memcg_update_tcp_limit() -> memcg_update_tcp_max()

    The idea behind this renaming is to have a direct match between the
    memory cgroup knobs (low, high, max) and the page_counter API.

    This is a pure rename; the patch doesn't bring any functional change.
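
    For instance, an illustrative call site (the surrounding code is
    hypothetical) changes like this:

    /* before */
    page_counter_limit(&memcg->memory, max);

    /* after: the name now mirrors the memory.max knob it backs */
    page_counter_set_max(&memcg->memory, max);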

    Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boilerplate
    text.
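
    In practice the identifier is a single comment at the top of the file;
    for a C source file it looks like:

    /* SPDX-License-Identifier: GPL-2.0 */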

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset
    of the use cases:
    - the file had no licensing information in it,
    - the file was a */uapi/* one with no licensing information in it,
    - the file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX license identifier to apply to a
    file was done in a spreadsheet of side-by-side results from the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537
    files assessed. Kate Stewart did a file-by-file comparison of the
    scanner results in the spreadsheet to determine which SPDX license
    identifier(s) to apply to each file. She confirmed any determination
    that was not immediately clear with lawyers working with the Linux
    Foundation.

    The criteria used to select files for SPDX license identifier tagging
    were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained
    >5 lines of source.
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

06 Nov, 2015

1 commit

  • page_counter_try_charge() currently returns 0 on success and -ENOMEM on
    failure, which is surprising behavior given the function name.

    Make it follow the expected pattern of try_stuff() functions that return a
    boolean true to indicate success, or false for failure.
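
    An illustrative call-site before/after (counter, nr_pages and the
    label are placeholders):

    struct page_counter *fail;

    /* before: 0 meant success, so the error path took the "true" branch */
    if (page_counter_try_charge(counter, nr_pages, &fail))
        goto out;   /* -ENOMEM */

    /* after: boolean, reads like other try_*() helpers */
    if (!page_counter_try_charge(counter, nr_pages, &fail))
        goto out;   /* charge failed */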

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Dec, 2014

2 commits

  • As charges now pin the css explicitly, there is no longer a need for
    kmemcg to acquire a proxy reference for outstanding pages during
    offlining, or to maintain state to identify such "dead" groups.

    This was the last user of the uncharge functions' return values, so
    remove them as well.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory is internally accounted in bytes, using spinlock-protected 64-bit
    counters, even though the smallest accounting delta is a page. The
    counter interface is also convoluted and does too many things.

    Introduce a new lockless word-sized page counter API, then change all
    memory accounting over to it. The translation from and to bytes then only
    happens when interfacing with userspace.

    The removed locking overhead is noticeable when scaling beyond the
    per-cpu charge caches: on a 4-socket machine with 144 threads, the
    following test shows the performance difference of 288 memcgs
    concurrently running a page fault benchmark:

    vanilla:

    18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
    1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
    24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
    1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
    50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
    1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
    1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )

    132.474343877 seconds time elapsed ( +- 0.21% )

    lockless:

    12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
    832,850 context-switches # 0.068 K/sec ( +- 0.54% )
    15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
    1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
    32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
    2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
    1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )

    91.369330729 seconds time elapsed ( +- 0.45% )

    On top of the improved scalability, this also gets rid of the icky
    long long types in the very heart of memcg, which is great for 32-bit
    and also makes the code a lot more readable.

    Notable differences between the old and new API:

    - res_counter_charge() and res_counter_charge_nofail() become
    page_counter_try_charge() and page_counter_charge() respectively, to
    match the more common kernel naming scheme of try_do()/do()

    - res_counter_uncharge_until() is only ever used to cancel a local
    counter and never to uncharge bigger segments of a hierarchy, so
    it's replaced by the simpler page_counter_cancel()

    - res_counter_set_limit() is replaced by page_counter_limit(), which
    expects its callers to serialize against themselves

    - res_counter_memparse_write_strategy() is replaced by
    page_counter_memparse(), which rounds down to the nearest page size
    rather than up. This is more reasonable for explicitly requested
    hard upper limits.

    - to keep charging lightweight, page_counter_try_charge() charges
    speculatively, only to roll back if the result exceeds the limit.
    Because of this, a failing bigger charge can temporarily lock out
    smaller charges that would otherwise succeed. The error is bounded
    by the difference between the smallest and the biggest possible
    charge size, so for memcg this means that a failing THP charge can
    send base page charges into reclaim up to 2MB (4MB) before the limit
    would have been reached. This should be acceptable; see the sketch
    after this list.
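
    A simplified sketch of that speculative scheme (the real page_counter
    walks the whole hierarchy and records the failing counter; this trims
    it to a single level):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct page_counter {
        atomic_long usage;  /* pages currently charged */
        long max;           /* hard limit, in pages */
    };

    /* charge first, check later: the add is one atomic RMW, and an
     * over-limit result is simply rolled back */
    static bool page_counter_try_charge(struct page_counter *c, long nr_pages)
    {
        long new = atomic_fetch_add(&c->usage, nr_pages) + nr_pages;

        if (new > c->max) {
            /* between the add and this rollback, racing smaller
             * charges may observe the counter as over limit */
            atomic_fetch_sub(&c->usage, nr_pages);
            return false;
        }
        return true;
    }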

    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
    [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner