03 Aug, 2017

2 commits

  • A cgroup can consume resources even after being deleted by a user.
    For example, writing back dirty pages should be accounted and
    limited even though the corresponding cgroup might contain no
    processes and has already been deleted by a user.

    In the current implementation, a cgroup can remain in such a "dying"
    state for an undefined amount of time, for instance if a memory cgroup
    contains a page mlocked by a process belonging to another cgroup.

    Although the lifecycle of a dying cgroup is out of the user's control,
    it's important to have some insight into what's going on under the hood.

    In particular, it's handy to have a counter that makes it possible
    to detect css leaks.

    To solve this problem, add a cgroup.stat interface to
    the base cgroup control files with the following metrics:

    nr_descendants          total number of visible descendant cgroups
    nr_dying_descendants    total number of dying descendant cgroups

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
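
The two metrics above lend themselves to a small monitoring helper. The following is a minimal sketch, assuming cgroup.stat holds one "key value" pair per line; the sample contents and the function name parse_flat_keyed are invented for illustration and are not taken from the commit.

```python
def parse_flat_keyed(text):
    """Parse lines of the form "key value" into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split()
        stats[key] = int(value)
    return stats


# Hypothetical file contents; on a real system this text would be read
# from the cgroup.stat file of the cgroup being inspected.
sample = "nr_descendants 12\nnr_dying_descendants 3\n"
stats = parse_flat_keyed(sample)

# Share of dying cgroups among all descendants that still exist in the kernel.
total = stats["nr_descendants"] + stats["nr_dying_descendants"]
dying_share = stats["nr_dying_descendants"] / total
```

A dying share that keeps growing over time could hint at the kind of css leak the commit message mentions.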
     
  • Creating cgroup hierarchies of unreasonable size can affect
    overall system performance. A user might want to limit the
    size of a cgroup hierarchy. This is especially important if a user
    is delegating some cgroup sub-tree.

    To address this issue, introduce the ability to control
    the size of a cgroup hierarchy.

    The cgroup.max.descendants control file sets the maximum
    allowed number of descendant cgroups.
    The cgroup.max.depth file controls the maximum depth of the cgroup
    tree. Both are single-value r/w files whose default value is "max".

    The control files exist on each hierarchy level (including root).
    When a new cgroup is about to be created, we check the total
    descendants and depth limits on each level, and the cgroup is
    created only if none of them would be exceeded.

    Only live cgroups are counted; removed (dying) cgroups are
    ignored.

    Signed-off-by: Roman Gushchin
    Suggested-by: Tejun Heo
    Signed-off-by: Tejun Heo
    Cc: Zefan Li
    Cc: Waiman Long
    Cc: Johannes Weiner
    Cc: kernel-team@fb.com
    Cc: cgroups@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org

    Roman Gushchin
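
The per-level check described above can be modelled in a few lines. This is a simplified sketch of the idea, not the kernel code: the class Cgroup, the exception LimitExceeded and the exact way depth is counted (a new child of a cgroup sits at depth 1 below it) are assumptions made for illustration.

```python
import math


class LimitExceeded(Exception):
    """Raised when a descendant or depth limit would be exceeded."""


class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.nr_descendants = 0          # live descendants only
        self.max_descendants = math.inf  # models the "max" default
        self.max_depth = math.inf        # models the "max" default


def can_create_child(parent):
    """Walk from `parent` up to the root, checking both limits on each level."""
    level = 1  # the new cgroup would be one level below `parent`
    node = parent
    while node is not None:
        if node.nr_descendants >= node.max_descendants:
            return False
        if level > node.max_depth:
            return False
        level += 1
        node = node.parent
    return True


def create_child(parent):
    if not can_create_child(parent):
        raise LimitExceeded("hierarchy limit exceeded")
    child = Cgroup(parent)
    node = parent
    while node is not None:  # every ancestor gains one live descendant
        node.nr_descendants += 1
        node = node.parent
    return child


root = Cgroup()
root.max_descendants = 2
root.max_depth = 1
a = create_child(root)                 # allowed: depth 1, first descendant
nested_allowed = can_create_child(a)   # depth 2 below root is too deep
b = create_child(root)                 # allowed: second descendant
third_allowed = can_create_child(root) # descendant limit reached
```

Because dying cgroups are ignored, a real implementation would also decrement the counters when a cgroup is removed; that step is left out of the sketch.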
     

26 Jul, 2017

1 commit

  • cgroup_enable_threaded() checks that the cgroup doesn't have any tasks
    or children and fails the operation if it does. This test is
    unnecessary: the first part is already checked by
    cgroup_can_be_thread_root() and the second part is not needed at all.
    The second part actually causes a behavioral oddity. Please consider
    the following hierarchy. All cgroups are domains.

          A
         / \
        B   C
             \
              D

    If B is made threaded, C and D become invalid domains. Due to the no
    children restriction, threaded mode can't be enabled on C. For C and
    D, the only thing the user can do is removal.

    There is no reason for this restriction. Remove it.

    Acked-by: Waiman Long
    Signed-off-by: Tejun Heo

    Tejun Heo
     

21 Jul, 2017

1 commit

  • This patch implements cgroup v2 thread support. The goal of the
    thread mode is supporting hierarchical accounting and control at
    thread granularity while staying inside the resource domain model
    which allows coordination across different resource controllers and
    handling of anonymous resource consumptions.

    A cgroup is always created as a domain and can be made threaded by
    writing to the "cgroup.type" file. When a cgroup becomes threaded, it
    becomes a member of a threaded subtree which is anchored at the
    closest ancestor which isn't threaded.

    The threads of the processes which are in a threaded subtree can be
    placed anywhere without being restricted by process granularity or
    no-internal-process constraint. Note that the threads aren't allowed
    to escape to a different threaded subtree. To be used inside a
    threaded subtree, a controller should explicitly support threaded mode
    and be able to handle internal competition in the way which is
    appropriate for the resource.

    The root of a threaded subtree, the nearest ancestor which isn't
    threaded, is called the threaded domain and serves as the resource
    domain for the whole subtree. This is the last cgroup where domain
    controllers are operational and where all the domain-level resource
    consumptions in the subtree are accounted. This allows threaded
    controllers to operate at thread granularity when requested while
    staying inside the scope of system-level resource distribution.

    As the root cgroup is exempt from the no-internal-process constraint,
    it can serve as both a threaded domain and a parent to normal cgroups,
    so, unlike non-root cgroups, the root cgroup can have both domain and
    threaded children.

    Internally, in a threaded subtree, each css_set has its ->dom_cset
    pointing to a matching css_set which belongs to the threaded domain.
    This ensures that thread root level cgroup_subsys_state for all
    threaded controllers are readily accessible for domain-level
    operations.

    This patch enables threaded mode for the pids and perf_events
    controllers. Neither has to worry about domain-level resource
    consumptions and it's enough to simply set the flag.

    For more details on the interface and behavior of the thread mode,
    please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
    by this patch.

    v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
          Spotted by Waiman.
        - Documentation updated as suggested by Waiman.
        - cgroup.type content slightly reformatted.
        - Mark the debug controller threaded.

    v4: - Updated to the general idea of marking specific cgroups
          domain/threaded as suggested by PeterZ.

    v3: - Dropped "join" and always make mixed children join the parent's
          threaded subtree.

    v2: - After discussions with Waiman, support for mixed thread mode is
          added. This should address the issue that Peter pointed out
          where any nesting should be avoided for thread subtrees while
          coexisting with other domain cgroups.
        - Enabling / disabling thread mode now piggy backs on the existing
          control mask update mechanism.
        - Bug fixes and cleanup.

    Signed-off-by: Tejun Heo
    Cc: Waiman Long
    Cc: Peter Zijlstra

    Tejun Heo
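
The rule that a threaded subtree is anchored at the closest ancestor which isn't threaded can be illustrated with a toy tree walk. This is only a sketch of the relationship described above; the class Cgroup and the function threaded_domain are hypothetical names and do not mirror kernel structures.

```python
class Cgroup:
    def __init__(self, name, parent=None, threaded=False):
        self.name = name
        self.parent = parent
        self.threaded = threaded


def threaded_domain(cgrp):
    """Return the nearest ancestor-or-self that is not threaded.

    For a cgroup that is not threaded this is the cgroup itself.
    """
    node = cgrp
    while node.threaded:
        if node.parent is None:
            # Assumption of this model: the root is never threaded.
            raise ValueError("a threaded cgroup must have a domain ancestor")
        node = node.parent
    return node


# root (domain) -> A (domain) -> T1 (threaded) -> T2 (threaded)
root = Cgroup("root")
a = Cgroup("A", parent=root)
t1 = Cgroup("T1", parent=a, threaded=True)
t2 = Cgroup("T2", parent=t1, threaded=True)

domain_of_t2 = threaded_domain(t2)
domain_of_a = threaded_domain(a)
```

In this model A plays the role of the threaded domain: domain-level resource consumption anywhere in T1 or T2 would be accounted there.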
     

17 Jul, 2017

1 commit


15 Jul, 2017

1 commit

  • Each text file under Documentation follows a different
    format. Some don't even have titles!

    Change its representation to follow the adopted standard,
    using ReST markups for it to be parseable by Sphinx:

    - Comment the internal index;
    - Use :Date: and :Author: for authorship;
    - Mark titles;
    - Mark literal blocks;
    - Adjust whitespace;
    - Mark notes;
    - Use table notation for the existing tables.

    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

07 Jul, 2017

2 commits

  • Show count of oom killer invocations in /proc/vmstat and count of
    processes killed in memory cgroup in knob "memory.events" (in
    memory.oom_control for v1 cgroup).

    Also describe the difference between "oom" and "oom_kill" in the memory
    cgroup documentation. Currently oom in a memory cgroup kills tasks iff
    the shortage has happened inside a page fault.

    These counters help in monitoring oom kills - for now the only way is
    grepping for magic words in the kernel log.

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
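
As an alternative to grepping the kernel log, a monitor could compare two snapshots of memory.events. The sketch below assumes the file holds one "key value" pair per line and contains the "oom" and "oom_kill" keys named above; the snapshot contents and function names are invented for illustration.

```python
def parse_events(text):
    """Parse "key value" lines into a dict of integers."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events


def new_oom_kills(before_text, after_text):
    """Number of processes killed between two snapshots of memory.events."""
    before = parse_events(before_text)
    after = parse_events(after_text)
    return after.get("oom_kill", 0) - before.get("oom_kill", 0)


# Hypothetical snapshots taken some time apart.
snapshot_1 = "oom 4\noom_kill 1\n"
snapshot_2 = "oom 6\noom_kill 3\n"
killed = new_oom_kills(snapshot_1, snapshot_2)
```

Using .get() with a default keeps the helper usable on kernels whose memory.events lacks the oom_kill key.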
     
  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

29 Jun, 2017

1 commit

  • Currently, cgroup only supports delegation to !root users and cgroup
    namespaces don't get any special treatment. This limits the
    usefulness of cgroup namespaces as they by themselves can't be safe
    delegation boundaries. A process inside a cgroup can change the
    resource control knobs of the parent in the namespace root and may
    move processes in and out of the namespace if cgroups outside its
    namespace are visible somehow.

    This patch adds a new mount option "nsdelegate" which makes cgroup
    namespaces delegation boundaries. If set, cgroup behaves as if write
    permission based delegation took place at namespace boundaries -
    writes to the resource control knobs from the namespace root are
    denied and migrations crossing the namespace boundary aren't allowed
    from inside the namespace.

    This allows cgroup namespace to function as a delegation boundary by
    itself.

    v2: Silently ignore nsdelegate specified on !init mounts.

    Signed-off-by: Tejun Heo
    Cc: Aravind Anbudurai
    Cc: Serge Hallyn
    Cc: Eric Biederman

    Tejun Heo
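
A tool that relies on namespace delegation may want to confirm that the option is active. The sketch below scans text in the /proc/mounts format (device, mount point, filesystem type, comma-separated options); it assumes that the kernel lists "nsdelegate" among the options of the cgroup2 mount, and the sample lines and the function name are illustrative only.

```python
def cgroup2_has_nsdelegate(mounts_text):
    """Return True if the first cgroup2 mount lists the nsdelegate option."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        fstype = fields[2]
        options = fields[3].split(",")
        if fstype == "cgroup2":
            return "nsdelegate" in options
    return False


# Hypothetical mount tables in /proc/mounts format.
with_option = (
    "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n"
    "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0\n"
)
without_option = "cgroup2 /sys/fs/cgroup cgroup2 rw,relatime 0 0\n"

enabled = cgroup2_has_nsdelegate(with_option)
disabled = cgroup2_has_nsdelegate(without_option)
```

Splitting the option string on commas avoids false matches on options that merely contain "nsdelegate" as a substring.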
     

25 Jun, 2017

1 commit


13 May, 2017

1 commit

  • Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in cache
    workingset transition") introduced three new entries in memory stat
    file:

    - workingset_refault
    - workingset_activate
    - workingset_nodereclaim

    This commit adds a corresponding description to the cgroup v2 docs.

    Link: http://lkml.kernel.org/r/1494530293-31236-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

04 May, 2017

1 commit

  • Cgroups currently don't report how much shmem they use, which can be
    useful data to have, in particular since shmem is included in the
    cache/file item while being reclaimed like anonymous memory.

    Add a counter to track shmem pages during charging and uncharging.

    Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Chris Down
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Mar, 2017

1 commit


03 Feb, 2017

3 commits

  • Merge in to resolve conflicts in Documentation/cgroup-v2.txt. The
    conflicts are from multiple section additions and trivial to resolve.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Along with the write access to the cgroup.procs or tasks file, cgroup
    has required the writer's euid, unless root, to match [s]uid of the
    target process or task. On cgroup v1, this is necessary because
    there's nothing preventing a delegatee from pulling in tasks or
    processes from all over the system.

    If a user has a cgroup subdirectory delegated to it, the user would
    have write access to the cgroup.procs or tasks file. If there are no
    further checks than file write access check, the user would be able to
    pull processes from all over the system into its subhierarchy which is
    clearly not the intended behavior. The matching [s]uid requirement
    partially prevents this problem by allowing a delegatee to pull in the
    processes that belong to it. This isn't a sufficient protection
    however, because a user would still be able to jump processes across
    two disjoint sub-hierarchies that have been delegated to them.

    cgroup v2 resolves the issue by requiring the writer to have access to
    the common ancestor of the cgroup.procs file of the source and target
    cgroups. This confines each delegatee to their own sub-hierarchy
    proper and bases all permission decisions on the cgroup filesystem
    rather than having to pull in explicit uid matching.

    cgroup v2 has still been applying the matching [s]uid requirement just
    for historical reasons. On cgroup2, the requirement doesn't serve any
    purpose while unnecessarily complicating the permission model. Let's
    drop it.

    Signed-off-by: Tejun Heo

    Tejun Heo
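
The common-ancestor rule can be made concrete with plain path arithmetic. The sketch below only computes which cgroup.procs file the writer would need access to; the mount point /sys/fs/cgroup, the directory names and the function name are assumptions made for illustration, and no permission check is performed.

```python
import posixpath


def common_ancestor_procs(src_cgroup, dst_cgroup):
    """Path of the cgroup.procs file in the common ancestor of two cgroups."""
    ancestor = posixpath.commonpath([src_cgroup, dst_cgroup])
    return posixpath.join(ancestor, "cgroup.procs")


# Migration inside one delegated sub-hierarchy: the common ancestor is the
# delegated directory itself, which the delegatee can write to.
inside = common_ancestor_procs(
    "/sys/fs/cgroup/user/alice/job1",
    "/sys/fs/cgroup/user/alice/job2",
)

# Migration across two disjoint delegated sub-hierarchies: the common
# ancestor lies above both, outside what was delegated.
across = common_ancestor_procs(
    "/sys/fs/cgroup/user/alice/job1",
    "/sys/fs/cgroup/user/bob/job1",
)
```

In the second case the required file sits in a directory that was never delegated, which is how the rule confines each delegatee to their own sub-hierarchy without any uid matching.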
     
  • perf_event is a utility controller whose primary role is identifying
    cgroup membership to filter perf events; however, because it also
    tracks some per-css state, it can't be replaced by pure cgroup
    membership test. Mark the controller as implicitly enabled on the
    default hierarchy so that perf events can always be filtered based on
    cgroup v2 path as long as the controller is not mounted on a legacy
    hierarchy.

    "perf record" is updated accordingly so that it searches for both v1
    and v2 hierarchies. A v1 hierarchy is used if perf_event is mounted
    on it; otherwise, it uses the v2 hierarchy.

    v2: Doc updated to reflect more flexible rebinding behavior.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

11 Jan, 2017

2 commits


22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
      cgroup: fix and restructure error handling in copy_cgroup_ns()
      cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
      Add FS_USERNS_FLAG to cgroup fs
      cgroup: Add documentation for cgroup namespaces
      cgroup: mount cgroupns-root when inside non-init cgroupns
      kernfs: define kernfs_node_dentry
      cgroup: cgroup namespace setns support
      cgroup: introduce cgroup namespaces
      sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
      kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

19 Mar, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup changes for v4.6-rc1. No userland visible behavior changes in
    this pull request. I'll send out a separate pull request for the
    addition of cgroup namespace support.

    - The biggest change is the revamping of cgroup core task migration
    and controller handling logic. There are quite a few places where
    controllers and tasks are manipulated. Previously, many of those
    places implemented custom operations for each specific use case
    assuming specific starting conditions. While this worked, it makes
    the code fragile and difficult to follow.

    The bulk of this pull request restructures these operations so that
    most related operations are performed through common helpers which
    implement recursive (subtrees are always processed consistently)
    and idempotent (they make cgroup hierarchy converge to the target
    state rather than performing operations assuming specific starting
    conditions). This makes the code a lot easier to understand,
    verify and extend.

    - Implicit controller support is added. This is primarily for using
    perf_event on the v2 hierarchy so that perf can match cgroup v2
    path without requiring the user to do anything special. The kernel
    portion of perf_event changes is acked but userland changes are
    still pending review.

    - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
    certain environments.

    - There is a regression introduced during v4.4 devel cycle where
    attempts to migrate zombie tasks can mess up internal object
    management. This was fixed earlier this week and included in this
    pull request w/ stable cc'd.

    - Misc non-critical fixes and improvements"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
      cgroup: avoid false positive gcc-6 warning
      cgroup: ignore css_sets associated with dead cgroups during migration
      Documentation: cgroup v2: Trivial heading correction.
      cgroup: implement cgroup_subsys->implicit_on_dfl
      cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
      cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
      cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
      cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
      cgroup: Trivial correction to reflect controller.
      cgroup: remove stale item in cgroup-v1 document INDEX file.
      cgroup: update css iteration in cgroup_update_dfl_csses()
      cgroup: allocate 2x cgrp_cset_links when setting up a new root
      cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
      cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
      cgroup: use cgroup_apply_enable_control() in cgroup creation path
      cgroup: combine cgroup_mutex locking and offline css draining
      cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
      cgroup: introduce cgroup_{save|propagate|restore}_control()
      cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
      cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
      ...

    Linus Torvalds
     

18 Mar, 2016

3 commits

  • Setting the original memory.limit_in_bytes hardlimit is subject to a
    race condition when the desired value is below the current usage. The
    code tries a few times to first reclaim and then see if the usage has
    dropped to where we would like it to be, but there is no locking, and
    the workload is free to continue making new charges up to the old limit.
    Thus, attempting to shrink a workload relies on pure luck and hope that
    the workload happens to cooperate.

    To fix this in the cgroup2 memory.max knob, do it the other way round:
    set the limit first, then try enforcement. And if reclaim is not able
    to succeed, trigger OOM kills in the group. Keep going until the new
    limit is met, we run out of OOM victims and there's only unreclaimable
    memory left, or the task writing to memory.max is killed. This allows
    users to shrink groups reliably, and the behavior is consistent with
    what happens when new charges are attempted in excess of memory.max.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
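
The "set the limit first, then enforce it" loop can be pictured with a toy model. This is a deliberately simplified sketch, not the kernel algorithm: memory is tracked as plain integers, reclaim frees a fixed pool of reclaimable memory, each OOM victim frees a given amount, and all names are hypothetical.

```python
def shrink_to(limit, usage, reclaimable, victims, writer_killed=lambda: False):
    """Toy model of writing a lower limit to memory.max.

    The new limit is assumed to be in force from the start, so the workload
    cannot grow past it while enforcement runs. Returns (final_usage, kills).
    """
    victims = list(victims)  # memory freed by killing each potential victim
    kills = 0
    while usage > limit:
        if writer_killed():
            break  # the task writing to memory.max was killed
        if reclaimable > 0:
            # Try reclaim first.
            freed = min(reclaimable, usage - limit)
            reclaimable -= freed
            usage -= freed
            continue
        if not victims:
            break  # only unreclaimable memory is left
        # Reclaim could not make progress: OOM kill within the group.
        usage = max(0, usage - victims.pop(0))
        kills += 1
    return usage, kills


# Reclaim alone is not enough, so one victim is killed.
result_kill = shrink_to(limit=100, usage=180, reclaimable=50, victims=[40, 40])
# Reclaim alone is enough.
result_reclaim = shrink_to(limit=100, usage=120, reclaimable=50, victims=[40])
# Nothing to reclaim and nothing to kill: the loop gives up.
result_stuck = shrink_to(limit=100, usage=180, reclaimable=0, victims=[])
```

The three exits of the loop correspond to the three stopping conditions in the commit message: the limit is met, no victims remain, or the writer itself is killed.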
     
  • Show how much memory is allocated to kernel stacks.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Show how much memory is used for storing reclaimable and unreclaimable
    in-kernel data structures allocated from slab caches.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Mar, 2016

1 commit


17 Feb, 2016

2 commits


11 Feb, 2016

1 commit

  • Pull cgroup fixes from Tejun Heo:

    - The destruction path of cgroup objects is asynchronous and
    multi-staged, and some of them ended up destroying parents before
    children, leading to failures in the cpu and memory controllers.
    Ensure that parents are always destroyed after children.

    - cpuset mm node migration was performed synchronously while holding
    threadgroup and cgroup mutexes and the recent threadgroup locking
    update resulted in a possible deadlock. The migration is best effort
    and shouldn't have been performed under those locks to begin with.
    Made asynchronous.

    - Minor documentation fix.

    * 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
      Documentation: cgroup: Fix 'cgroup-legacy' -> 'cgroup-v1'
      cgroup: make sure a parent css isn't freed before its children
      cgroup: make sure a parent css isn't offlined before its children
      cpuset: make mm migration asynchronous

    Linus Torvalds
     

04 Feb, 2016

1 commit


29 Jan, 2016

1 commit


21 Jan, 2016

2 commits


12 Jan, 2016

1 commit